How to resolve the algorithm FASTA format step by step in the Haskell programming language

Published on 7 June 2024 03:52 AM

Problem Statement

In bioinformatics, long character strings are often encoded in a format called FASTA.
A FASTA file can contain several strings, each identified by a name marked by a > (greater than) character at the beginning of the line.

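Write a program that reads a FASTA file and prints each sequence name followed by its sequence, with multi-line sequences joined into a single string.

Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.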
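
The memory requirement above can be illustrated with a small sketch. It is not one of the article's two solutions: renderFasta is a hypothetical name, and the sketch assumes lazy I/O (for example via interact) so that output is produced while input is still being read. It also assumes Unix line endings.

```haskell
-- | Turn FASTA text into "name: sequence" lines.
-- The result is built incrementally, so with lazy input only a small
-- window of the file needs to be resident at any time.
renderFasta :: String -> String
renderFasta = go False . lines
  where
    -- The Bool records whether output is pending a closing newline.
    go :: Bool -> [String] -> String
    go open []                  = if open then "\n" else ""
    go open (('>' : name) : ls) =
      (if open then "\n" else "") ++ name ++ ": " ++ go True ls
    go open (l : ls)            = l ++ go (open || not (null l)) ls
```

A program could wire this up as main = interact renderFasta and feed the file on standard input. For the made-up input ">a\nAB\nCD\n>b\nE\n" the function returns "a: ABCD\nb: E\n".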

Let's start with the solution:

Step-by-step solution to the FASTA format task in the Haskell programming language

  1. First code:
    • The function parseFasta takes a file name as an argument and reads the file with readFile, which returns the contents lazily.
    • The readFasta function takes the list of lines and returns a list of tuples, where each tuple contains a sequence name and a sequence code.
    • The groupBy function groups consecutive lines that do not start with a > character, so each header line ends up in a group of its own and the sequence lines that follow it form the next group. map concat then joins every group into a single string.
    • The pair function walks the resulting list two elements at a time and combines each name with its sequence into a tuple, dropping the leading > from the name.
    • The mapM_ function applies a lambda to each tuple, which prints the sequence name and code to the standard output with putStrLn.
  2. Second code:
    • The function parseFasta takes a file name as an argument and reads the file using the readFile function.
    • The readP_to_S function runs the readFasta parser over the file contents and returns a list of possible parses; the code takes the result of the last one.
    • The readFasta parser is defined using the ReadP library, which provides a convenient way to define parsers for structured data. It parses zero or more pairs and then requires the end of the input.
    • The pair parser combines the name and code parsers into a single parser that returns a tuple.
    • The name parser matches a > character followed by a sequence of alphanumeric characters and underscores, terminated by a newline.
    • The code parser matches a sequence of lines, each of which contains a sequence of alphabetic characters, and concatenates them.
    • The newline parser matches a newline character.
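
The grouping step of the first version can be tried in isolation. The following is a minimal sketch; isSeqLine, groupLines and sampleLines are hypothetical names introduced for this illustration, and the sample data is made up.

```haskell
import Data.List (groupBy)

-- True for sequence lines, False for header lines starting with '>'.
isSeqLine :: String -> Bool
isSeqLine line = take 1 line /= ">"

-- Group consecutive sequence lines; a header never joins a group,
-- so each header ends up as a singleton.
groupLines :: [String] -> [[String]]
groupLines = groupBy (\x y -> isSeqLine x && isSeqLine y)

-- Made-up sample input for this illustration.
sampleLines :: [String]
sampleLines = [">seq_1", "ACGT", ">seq_2", "GGCC", "TTAA"]

-- groupLines sampleLines
--   == [[">seq_1"], ["ACGT"], [">seq_2"], ["GGCC", "TTAA"]]
```

Applying map concat to that result gives [">seq_1", "ACGT", ">seq_2", "GGCCTTAA"], which is the alternating name/sequence list that the pair function expects.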

Source code in the Haskell programming language

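import Data.List ( groupBy )

-- First version: group lines with groupBy, then pair names with sequences.
parseFasta :: FilePath -> IO ()
parseFasta fileName = do
  file <- readFile fileName
  let pairedFasta = readFasta $ lines file
  mapM_ (\(name, code) -> putStrLn $ name ++ ": " ++ code) pairedFasta

readFasta :: [String] -> [(String, String)]
readFasta = pair . map concat . groupBy (\x y -> notName x && notName y)
          . filter (not . null)   -- skip blank lines
 where
  -- True for sequence lines; take 1 avoids the partial function head.
  notName :: String -> Bool
  notName = (/= ">") . take 1

  pair :: [String] -> [(String, String)]
  pair []           = []
  pair (x : y : xs)
    | notName y     = (drop 1 x, y) : pair xs
  pair (x : xs)     = (drop 1 x, "") : pair xs  -- header without sequence lines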


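-- Second version (a separate module): the same task with ReadP parser combinators.
import Text.ParserCombinators.ReadP
import Control.Applicative ( (<|>) )
import Data.Char ( isAlpha, isAlphaNum )

parseFasta :: FilePath -> IO ()
parseFasta fileName = do
  file <- readFile fileName
  -- readP_to_S returns an empty list when the input does not parse,
  -- so check for that before taking the last result.
  case readP_to_S readFasta file of
    []      -> putStrLn "Parse error: input is not in the expected FASTA format"
    results -> mapM_ (\(name, code) -> putStrLn $ name ++ ": " ++ code)
                     (fst (last results))


readFasta :: ReadP [(String, String)]
readFasta = many pair <* eof
 where
  pair    = (,) <$> name <*> code
  name    = char '>' *> many (satisfy isAlphaNum <|> char '_') <* newline
  code    = concat <$> many (many (satisfy isAlpha) <* newline)  -- every line must end in '\n'
  newline = char '\n'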
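
To experiment with the ReadP grammar without touching the file system, the parser can be run on an in-memory string. The sketch below repeats the grammar from the second version under the hypothetical name fastaP and wraps it in a hypothetical helper, parseFastaString, which reports failure as Nothing. The sample inputs are made up.

```haskell
import Control.Applicative ((<|>))
import Data.Char (isAlpha, isAlphaNum)
import Text.ParserCombinators.ReadP

-- Same grammar as the ReadP version in this article.
fastaP :: ReadP [(String, String)]
fastaP = many record <* eof
  where
    record  = (,) <$> name <*> code
    name    = char '>' *> many (satisfy isAlphaNum <|> char '_') <* newline
    code    = concat <$> many (many (satisfy isAlpha) <* newline)
    newline = char '\n'

-- Hypothetical helper: Nothing when no parse consumes the whole input.
parseFastaString :: String -> Maybe [(String, String)]
parseFastaString input =
  case readP_to_S fastaP input of
    []      -> Nothing
    results -> Just (fst (last results))

-- parseFastaString ">seq_1\nACGT\n>seq_2\nGG\nCC\n"
--   == Just [("seq_1", "ACGT"), ("seq_2", "GGCC")]
-- parseFastaString ">seq_1\nACGT" == Nothing   (no trailing newline)
```

The second example shows a limitation of this grammar: every line, including the last, has to end with a newline character, and names or sequences containing other characters (spaces, digits in the sequence, and so on) are rejected.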


  
