How to resolve the algorithm Merge and aggregate datasets step by step in the Haskell programming language

Published on 7 June 2024 03:52 AM

Problem Statement

Merge and aggregate datasets

Merge and aggregate two datasets as provided in .csv files into a new resulting dataset. Use the appropriate methods and data structures depending on the programming language. Use the most common libraries only when built-in functionality is not sufficient.

Either load the data from the .csv files or create the required data structures hard-coded.

patients.csv file contents:

visits.csv file contents:

Create a resulting dataset in-memory or output it to screen or file, whichever is appropriate for the programming language at hand. Merge and group per patient id and last name, get the maximum visit date, and get the sum and average of the scores per patient to get the resulting dataset.
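
As a rough illustration of the sum and average step, here is a minimal sketch. The name scoreStats and the sample scores are made up for this example and do not appear in the solution further down; the sketch assumes that a patient without scores should yield no statistics at all.

```haskell
import Data.List (genericLength)

-- Sum and average of one patient's scores.
-- Returns Nothing for an empty list, because the average is undefined there.
-- (scoreStats is a hypothetical helper, not part of the article's solution.)
scoreStats :: [Double] -> Maybe (Double, Double)
scoreStats [] = Nothing
scoreStats xs = Just (total, total / genericLength xs)
  where
    total = sum xs

main :: IO ()
main = do
  print (scoreStats [6.0, 7.0, 8.0])  -- made-up scores
  print (scoreStats [])
```

Returning a Maybe is one possible design choice; the solution below instead prints an empty cell when a patient has no scores.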

Note that the visit date is purposefully provided as ISO format, so that it could also be processed as text and sorted alphabetically to determine the maximum date.
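
The note about ISO dates can be sketched in a few lines: because YYYY-MM-DD strings order the same way as text and as dates, the largest string is the latest date. The name latestVisit and the dates are illustrative assumptions, not data from visits.csv.

```haskell
-- ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings,
-- so `maximum` on the text values picks the most recent visit.
-- (latestVisit is a hypothetical helper used only for this sketch.)
latestVisit :: [String] -> Maybe String
latestVisit [] = Nothing            -- patient without any recorded visit
latestVisit ds = Just (maximum ds)

main :: IO ()
main = do
  -- Illustrative dates only.
  print (latestVisit ["2020-09-17", "2020-11-19", "2020-10-08"])
  print (latestVisit [])
```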

This task is aimed in particular at programming languages that are used in data science and data processing, such as F#, Python, R, SPSS, MATLAB etc.

Let's start with the solution:

Step-by-step solution: Merge and aggregate datasets in the Haskell programming language

This Haskell code processes a database of patients and their visits. It reads data from CSV files, parses it into a structured representation, and performs various operations to manipulate and analyze the data. Let's break down the code step by step:

  1. Data Types and Instances:

    • The code defines several data types and instances for working with the patient database.
    • DB is a newtype wrapping a list of Patient records. Its Semigroup instance concatenates the entries of two databases and normalizes the result; its Monoid instance provides the empty database.
    • Patient represents an individual patient: a patient ID (pid), an optional last name (name, a Maybe String), a list of visit dates (visits), and a list of scores (scores). Its Semigroup instance keeps the first ID and the first known name, and concatenates the visits and the scores of both records.
  2. Data Parsing:

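    • readDB takes the contents of a CSV file as a String (the file itself is read in main) and parses it into a normalized DB using readCSV and readPatient.
    • readPatient parses a single CSV record into a Patient record. It looks up the fields by column name and converts them to the appropriate data types. A record without a PATIENT_ID yields Nothing and is skipped; missing names, dates, and scores are simply left empty.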
  3. CSV Parsing:

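    • readCSV splits the CSV text into a list of records, where each record is a list of (column name, value) pairs built by zipping the header line with the fields of a data line.
    • splitBy is a helper function that splits a string into a list of substrings based on a delimiter character. It does no quoting or escaping, so it only suits simple CSV data.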
  4. Database Normalization:

    • normalize normalizes a list of Patient records by grouping them by patient ID and merging duplicate entries. This ensures that each patient has a single, combined record.
  5. Data Tabulation:

    • tabulateDB converts the DB record into a tabular format for printing. It takes a DB record, a list of header labels (header), and a list of column functions (cols), and generates a formatted table.
    • pad is a helper function that pads every cell of a column with spaces to a common width, leaving one space on each side of the longest entry.
  6. Main Execution:

    • The main function reads two CSV files, patients.csv and visits.csv, into DB records using readDB.
    • It then combines the two databases using the <> operator and tabulates the combined database using tabulateDB.
    • The resulting table is printed to the console.
  7. Helper Functions:

    • mean is the only helper defined in the program itself; it calculates the mean (average) of a list of numeric values. The remaining helpers come from the standard libraries (Prelude, Data.Maybe, Data.List).
    • fromMaybe returns the value inside a Maybe if it is Just, otherwise it returns the default value.
    • maybeToList converts a Maybe value to a list, with an empty list as the default for Nothing.
    • sequence, used here on a list of functions, applies every function in the list to the same value and returns the list of results.
    • transpose transposes a list of lists, swapping rows and columns.
    • intercalate inserts a separator string between the elements of a list of strings and concatenates the result into a single string.
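
The merge idea behind the Semigroup instance and normalize (steps 1 and 4) can be shown on a smaller scale. The sketch below uses a simplified, hypothetical Row type and a mergeRows function rather than the article's Patient and normalize, and the sample rows are made up.

```haskell
import Control.Applicative ((<|>))
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Simplified stand-in for the article's Patient type (hypothetical names).
data Row = Row
  { rowId     :: String
  , rowName   :: Maybe String
  , rowScores :: [Double]
  } deriving (Show, Eq)

-- Merging two rows with the same id: keep the first known name
-- and concatenate the scores.
instance Semigroup Row where
  Row i n1 s1 <> Row _ n2 s2 = Row i (n1 <|> n2) (s1 ++ s2)

-- Sort by id, group rows with equal ids, and collapse each group.
-- groupBy never yields an empty group, so foldr1 is safe here.
mergeRows :: [Row] -> [Row]
mergeRows = map (foldr1 (<>)) . groupBy ((==) `on` rowId) . sortOn rowId

main :: IO ()
main = mapM_ print (mergeRows sample)
  where
    -- Made-up rows, not the task's CSV data.
    sample =
      [ Row "2" Nothing        [7.0]
      , Row "1" (Just "Smith") []
      , Row "2" (Just "Jones") []
      , Row "1" Nothing        [5.5, 6.0]
      ]
```

After merging, each id appears exactly once: the name comes from whichever row supplied one, and the scores from all rows of that id are combined. Because sortOn is stable, rows with the same id keep their original relative order.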

Source code in the Haskell programming language

import Data.List
import Data.Maybe
import System.IO (readFile)
import Text.Read (readMaybe)
import Control.Applicative ((<|>))

------------------------------------------------------------

newtype DB = DB { entries :: [Patient] }
  deriving Show

instance Semigroup DB where
  DB a <> DB b = normalize $ a <> b

instance Monoid DB where
  mempty = DB []

normalize :: [Patient] -> DB
normalize = DB
            . map mconcat 
            . groupBy (\x y -> pid x == pid y)
            . sortOn pid
 
------------------------------------------------------------

data Patient = Patient { pid :: String
                       , name :: Maybe String
                       , visits :: [String]
                       , scores :: [Float] }
  deriving Show

instance Semigroup Patient where
  Patient p1 n1 v1 s1 <> Patient p2 n2 v2 s2 =
    Patient (fromJust $ Just p1 <|> Just p2)
            (n1 <|> n2)
            (v1 <|> v2)
            (s1 <|> s2)

instance Monoid Patient where
  mempty = Patient mempty mempty mempty mempty
    
------------------------------------------------------------

readDB :: String  -> DB
readDB = normalize
         . mapMaybe readPatient
         . readCSV

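-- Build a Patient from one CSV record; Nothing if PATIENT_ID is missing.
readPatient :: [(String, String)] -> Maybe Patient
readPatient r = do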
  i <- lookup "PATIENT_ID" r
  let n = lookup "LASTNAME" r
  let d = lookup "VISIT_DATE" r >>= readDate
  let s = lookup "SCORE" r >>= readMaybe
  return $ Patient i n (maybeToList d) (maybeToList s)
  where
    readDate [] = Nothing
    readDate d = Just d

-- One list of (column name, value) pairs per data line.
readCSV :: String -> [[(String, String)]]
readCSV txt = zip header <$> body
  where
    header:body = splitBy ',' <$> lines txt
    splitBy ch = unfoldr go
      where
        go [] = Nothing
        go s  = Just $ drop 1 <$> span (/= ch) s


tabulateDB (DB ps) header cols = intercalate "|" <$> body
  where
    body = transpose $ zipWith pad width table
    table = transpose $ header : map showPatient ps
    showPatient p = sequence cols p
    width = maximum . map length <$> table
    pad n col = (' ' :) . take (n+1) . (++ repeat ' ') <$> col

main = do
  a <- readDB <$> readFile "patients.csv"
  b <- readDB <$> readFile "visits.csv"
  mapM_ putStrLn $ tabulateDB (a <> b) header fields
  where
    header = [ "PATIENT_ID", "LASTNAME", "VISIT_DATE"
             , "SCORES SUM","SCORES AVG"]
    fields = [ pid
             , fromMaybe [] . name
             , \p -> case visits p of {[] -> []; l -> maximum l} -- ISO dates order as text
             , \p -> case scores p of {[] -> []; s -> show (sum s)}
             , \p -> case scores p of {[] -> []; s -> show (mean s)} ]

    mean lst = sum lst / genericLength lst
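
Two techniques used by tabulateDB can be tried in isolation: applying a list of column functions to one value with sequence, and aligning columns by transposing the table. The following is a minimal sketch; applyAll, padColumn, and render are hypothetical names, the cell values are made up, and the layout is simpler than the one produced by the program above.

```haskell
import Data.List (intercalate, transpose)

-- In the function monad, `sequence fs x` equals `map ($ x) fs`:
-- every function in the list is applied to the same argument.
applyAll :: [a -> b] -> a -> [b]
applyAll = sequence

-- Pad every cell of a column to the width of its longest cell.
padColumn :: [String] -> [String]
padColumn col = map fit col
  where
    width = maximum (map length col)
    fit s = take width (s ++ repeat ' ')

-- Render rows as text lines with '|' between aligned columns.
-- Assumes every row has the same number of cells.
render :: [[String]] -> [String]
render = map (intercalate "|") . transpose . map padColumn . transpose

main :: IO ()
main = do
  print (applyAll [length, sum, maximum] [3, 1, 2 :: Int])
  mapM_ putStrLn (render [["ID", "NAME"], ["1", "Smith"]])  -- made-up cells
```

Transposing first turns rows into columns, so the width can be computed once per column; transposing back restores the rows before they are joined.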


  
