How to resolve the algorithm Merge and aggregate datasets step by step in the Haskell programming language
Problem Statement
Merge and aggregate datasets
Merge and aggregate two datasets as provided in .csv files into a new resulting dataset. Use the appropriate methods and data structures depending on the programming language. Use the most common libraries only when built-in functionality is not sufficient.
Either load the data from the .csv files or create the required data structures hard-coded.
patients.csv file contents:
visits.csv file contents:
Create a resulting dataset in-memory or output it to screen or file, whichever is appropriate for the programming language at hand. Merge and group per patient id and last name, get the maximum visit date, and get the sum and average of the scores per patient to get the resulting dataset.
Note that the visit date is purposefully provided as ISO format, so that it could also be processed as text and sorted alphabetically to determine the maximum date.
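The note about ISO dates can be illustrated with a minimal sketch. The function name latestVisit and the sample dates are hypothetical and are not part of the task's data; the point is only that plain String comparison orders YYYY-MM-DD dates correctly.

```haskell
-- ISO 8601 dates (YYYY-MM-DD) sort the same way as text and as dates,
-- so ordinary String comparison is enough to find the latest one.
latestVisit :: [String] -> Maybe String
latestVisit [] = Nothing            -- no visits recorded
latestVisit ds = Just (maximum ds)

main :: IO ()
main = do
  -- Hypothetical sample dates, not taken from the task's data files.
  print (latestVisit ["2020-09-17", "2020-11-19", "2020-10-08"])  -- Just "2020-11-19"
  print (latestVisit [])                                          -- Nothing
```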
This task is aimed in particular at programming languages that are used in data science and data processing, such as F#, Python, R, SPSS, MATLAB etc.
Let's start with the solution:
Step-by-step solution: Merge and aggregate datasets in the Haskell programming language
This Haskell code processes a database of patients and their visits. It reads data from CSV files, parses it into a structured representation, and performs various operations to manipulate and analyze the data. Let's break down the code step by step:
- Data Types and Instances:
  - The code defines two data types, DB and Patient, each with Semigroup and Monoid instances.
  - DB represents the database as a list of Patient records. Its Semigroup instance concatenates the entries of two databases and normalizes the result, so combining databases also merges duplicate patients.
  - Patient represents an individual patient: a patient ID (pid), an optional last name (name), a list of visit dates (visits), and a list of scores (scores).
  - The Semigroup instance for Patient merges two records field by field with <|>: the first available name is kept, and the visit and score lists are concatenated.
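The field-wise merge described above can be sketched on its own. This is a simplified stand-in for the Patient type, assuming both records carry the same ID; the sample values ("p1", "Smith") are hypothetical, and list concatenation is written as ++ rather than <|> for readability.

```haskell
import Control.Applicative ((<|>))

-- Simplified stand-in for the Patient record described above.
data Patient = Patient
  { pid    :: String
  , name   :: Maybe String
  , visits :: [String]
  , scores :: [Float]
  } deriving (Show, Eq)

-- Keep the left ID (both sides are assumed to share it), take the first
-- available name, and concatenate the visit and score lists.
instance Semigroup Patient where
  Patient p1 n1 v1 s1 <> Patient _ n2 v2 s2 =
    Patient p1 (n1 <|> n2) (v1 ++ v2) (s1 ++ s2)

main :: IO ()
main = do
  -- Hypothetical rows: one as it would come from a patients file,
  -- one as it would come from a visits file.
  let fromPatients = Patient "p1" (Just "Smith") [] []
      fromVisits   = Patient "p1" Nothing ["2020-01-02"] [5.5]
  print (fromPatients <> fromVisits)
```

Because the name field uses <|> on Maybe, the order of the operands does not matter for this data: whichever side has a name supplies it.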
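- Data Parsing:
  - readDB takes the text of a CSV file and parses it into a DB: it splits the text with readCSV, converts each record with readPatient, and normalizes the result.
  - readPatient converts a single CSV record into a Patient. PATIENT_ID is required, and a record without it is dropped. LASTNAME, VISIT_DATE, and SCORE are optional; an empty date or an unparsable score is treated as missing.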
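The required-versus-optional handling can be sketched with lookup in the Maybe monad. The helper parseIdAndScore is a hypothetical, cut-down version of readPatient that only handles two columns; the sample records are made up for illustration.

```haskell
import Text.Read (readMaybe)

-- A record is a list of (column name, value) pairs.
type Record = [(String, String)]

-- Hypothetical helper: the ID is required, the score is optional.
parseIdAndScore :: Record -> Maybe (String, Maybe Float)
parseIdAndScore r = do
  i <- lookup "PATIENT_ID" r              -- a missing ID aborts the parse
  let s = lookup "SCORE" r >>= readMaybe  -- missing or unparsable -> Nothing
  return (i, s)

main :: IO ()
main = do
  print (parseIdAndScore [("PATIENT_ID", "p1"), ("SCORE", "6.5")])  -- Just ("p1",Just 6.5)
  print (parseIdAndScore [("PATIENT_ID", "p1"), ("SCORE", "")])     -- Just ("p1",Nothing)
  print (parseIdAndScore [("SCORE", "6.5")])                        -- Nothing
```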
- CSV Parsing:
  - readCSV splits the CSV text into a list of records, where each record is a list of (column name, value) pairs built by zipping the header row with a data row.
  - splitBy is a helper function that splits a string into a list of substrings at a delimiter character.
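A minimal, standalone sketch of the two CSV helpers follows. It mirrors the approach described above but handles an empty input explicitly; the two-column sample text is hypothetical. Note the assumption that fields contain no quoted commas, which this simple splitter does not handle.

```haskell
import Data.List (unfoldr)

-- Split a string at every occurrence of the delimiter.
-- A trailing empty field is dropped ("a,b," gives ["a","b"]).
splitBy :: Char -> String -> [String]
splitBy ch = unfoldr go
  where
    go [] = Nothing
    go s  = Just (drop 1 <$> span (/= ch) s)

-- Pair every field of every data row with its column name.
readCSV :: String -> [[(String, String)]]
readCSV txt = case map (splitBy ',') (lines txt) of
  []            -> []
  header : body -> map (zip header) body

main :: IO ()
main = do
  print (splitBy ',' "a,,c")   -- ["a","","c"]
  -- Hypothetical two-column input, not the task's data.
  mapM_ print (readCSV "ID,NAME\np1,Smith\np2,Jones\n")
```

Since zip stops at the shorter list, a row with fewer fields than the header simply has no entry for the missing columns, which is what lets a later lookup return Nothing.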
- Database Normalization:
  - normalize sorts a list of Patient records by patient ID, groups records with the same ID, and merges each group with mconcat. This ensures that each patient has a single, combined record.
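The sort, group, and merge pattern behind normalize can be shown on plain key-value pairs. The function mergeByKey and the sample data are hypothetical; any Semigroup can serve as the value type, and lists are used here so that merging means concatenation.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Merge all values that share a key.
mergeByKey :: (Ord k, Semigroup v) => [(k, v)] -> [(k, v)]
mergeByKey = concatMap merge . groupBy ((==) `on` fst) . sortOn fst
  where
    -- groupBy never yields an empty group; the first case only
    -- keeps the function total.
    merge []              = []
    merge ((k, v) : rest) = [(k, foldl (<>) v (map snd rest))]

main :: IO ()
main =
  -- Hypothetical scores keyed by patient ID.
  mapM_ print (mergeByKey [("p2", [6.8]), ("p1", [5.5]), ("p2", [7.0 :: Float])])
```

Sorting first matters because groupBy only groups adjacent elements. sortOn is stable, so values with the same key keep their original relative order.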
- Data Tabulation:
  - tabulateDB converts a DB into a table for printing. It takes a DB, a list of header labels (header), and a list of column functions (cols), and returns one formatted line per row, with columns separated by |.
  - pad is a helper function that pads every cell of a column with spaces to a common width.
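The transpose-and-pad idea can be sketched independently of the DB type. The function tabulate below is a hypothetical simplification that takes ready-made rows of strings and assumes every row has the same number of cells; unlike pad in the program, it adds no extra leading or trailing space.

```haskell
import Data.List (intercalate, transpose)

-- Render rows of cells as lines with "|"-separated, equal-width columns.
tabulate :: [[String]] -> [String]
tabulate rows = map (intercalate "|") (transpose paddedColumns)
  where
    columns       = transpose rows                      -- work column by column
    widths        = map (maximum . map length) columns  -- widest cell per column
    paddedColumns = zipWith padColumn widths columns
    -- Left-align each cell and fill it with spaces up to the column width.
    padColumn w   = map (\cell -> take w (cell ++ repeat ' '))

main :: IO ()
main =
  -- Hypothetical table content.
  mapM_ putStrLn (tabulate [["ID", "NAME"], ["p1", "Smith"], ["p22", "Jo"]])
```

Transposing twice is the design choice worth noting: widths are a per-column property, so the rows are turned into columns, padded, and turned back into rows.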
- Main Execution:
  - The main function reads two CSV files, patients.csv and visits.csv, and parses each into a DB using readDB.
  - It then combines the two databases using the <> operator and tabulates the combined database using tabulateDB.
  - The resulting table is printed to the console.
- Helper Functions:
  - mean calculates the mean (average) of a list of numeric values. It is the only helper defined in the program; the functions below come from the standard libraries.
  - fromMaybe returns the value inside a Maybe if it is Just; otherwise it returns the given default value.
  - maybeToList converts a Maybe value to a list: a one-element list for Just and an empty list for Nothing.
  - sequence, used here with a list of functions, applies each function to the same value and returns a list of the results.
  - transpose transposes a list of lists, swapping rows and columns.
  - intercalate inserts a separator between the elements of a list of strings and concatenates the result into a single string.
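The less obvious use of sequence, applying a list of functions to one value, can be seen in a small sketch together with mean. The name summaries and the sample numbers are hypothetical; mean assumes a non-empty list, as in the program.

```haskell
import Data.List (genericLength)

-- Arithmetic mean; assumes a non-empty list.
mean :: [Float] -> Float
mean xs = sum xs / genericLength xs

-- In the function monad, sequence turns a list of functions into
-- one function that returns the list of all their results.
summaries :: [Float] -> [Float]
summaries = sequence [sum, mean, maximum]

main :: IO ()
main = print (summaries [1, 2, 6])   -- [9.0,3.0,6.0]
```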
Source code in the Haskell programming language
import Data.List
import Data.Maybe
import System.IO (readFile)
import Text.Read (readMaybe)
import Control.Applicative ((<|>))

------------------------------------------------------------

-- A database is a list of patient records, one per patient ID.
newtype DB = DB { entries :: [Patient] }
  deriving Show

-- Combining two databases merges records that share a patient ID.
instance Semigroup DB where
  DB a <> DB b = normalize $ a <> b

instance Monoid DB where
  mempty = DB []

-- Sort by ID, group equal IDs, and merge each group into one record.
normalize :: [Patient] -> DB
normalize = DB
            . map mconcat
            . groupBy (\x y -> pid x == pid y)
            . sortOn pid

------------------------------------------------------------

data Patient = Patient { pid :: String
                       , name :: Maybe String
                       , visits :: [String]
                       , scores :: [Float] }
  deriving Show

-- Field-wise merge: keep the left ID, take the first available name,
-- and concatenate the visit and score lists.
instance Semigroup Patient where
  Patient p1 n1 v1 s1 <> Patient p2 n2 v2 s2 =
    Patient (fromJust $ Just p1 <|> Just p2)
            (n1 <|> n2)
            (v1 <|> v2)
            (s1 <|> s2)

instance Monoid Patient where
  mempty = Patient mempty mempty mempty mempty

------------------------------------------------------------

-- Parse the text of a CSV file into a normalized database.
readDB :: String -> DB
readDB = normalize
         . mapMaybe readPatient
         . readCSV

-- PATIENT_ID is required; the other columns are optional.
readPatient :: [(String, String)] -> Maybe Patient
readPatient r = do
  i <- lookup "PATIENT_ID" r
  let n = lookup "LASTNAME" r
  let d = lookup "VISIT_DATE" r >>= readDate
  let s = lookup "SCORE" r >>= readMaybe
  return $ Patient i n (maybeToList d) (maybeToList s)
  where
    readDate [] = Nothing
    readDate d = Just d

-- Each record is a list of (column name, value) pairs.
readCSV :: String -> [[(String, String)]]
readCSV txt = zip header <$> body
  where
    header:body = splitBy ',' <$> lines txt
    splitBy ch = unfoldr go
      where
        go [] = Nothing
        go s  = Just $ drop 1 <$> span (/= ch) s

-- Build one line of text per table row, with "|" between columns.
tabulateDB :: DB -> [String] -> [Patient -> String] -> [String]
tabulateDB (DB ps) header cols = intercalate "|" <$> body
  where
    body = transpose $ zipWith pad width table
    table = transpose $ header : map showPatient ps
    showPatient p = sequence cols p
    width = maximum . map length <$> table
    pad n col = (' ' :) . take (n+1) . (++ repeat ' ') <$> col

main :: IO ()
main = do
  a <- readDB <$> readFile "patients.csv"
  b <- readDB <$> readFile "visits.csv"
  mapM_ putStrLn $ tabulateDB (a <> b) header fields
  where
    header = [ "PATIENT_ID", "LASTNAME", "VISIT_DATE"
             , "SCORES SUM","SCORES AVG"]
    fields = [ pid
             , fromMaybe [] . name
             -- ISO dates compare correctly as text, so maximum gives the latest visit
             , \p -> case visits p of {[] -> []; l -> maximum l}
             , \p -> case scores p of {[] -> []; s -> show (sum s)}
             , \p -> case scores p of {[] -> []; s -> show (mean s)} ]

    mean lst = sum lst / genericLength lst