How to resolve the algorithm FASTA format step by step in the Haskell programming language
Published on 7 June 2024 03:52 AM
Problem Statement
In bioinformatics, long character strings are often encoded in a format called FASTA.
A FASTA file can contain several strings, each identified by a name marked by a > (greater than) character at the beginning of the line.
Write a program that reads a FASTA file and prints each sequence name together with its sequence.
Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.
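The memory requirement above can be illustrated with a minimal sketch that reads one line at a time from a handle and writes output as it goes, so only the current line is held in memory. The function name streamFasta is hypothetical and not part of the original solutions, and the sketch assumes the input starts with a > header line.

```haskell
import Control.Monad ( unless )
import System.IO

-- Hypothetical helper (not from the original article): copy a FASTA stream
-- from one handle to another as "name: sequence" lines, one input line at a time.
streamFasta :: Handle -> Handle -> IO ()
streamFasta input output = go True
  where
    -- atStart is True until the first header has been written, so no
    -- newline is emitted before the first record.
    go atStart = do
      done <- hIsEOF input
      if done
        then unless atStart (hPutStrLn output "")   -- terminate the last record
        else do
          line <- hGetLine input
          case line of
            '>' : name -> do
              unless atStart (hPutStrLn output "")  -- terminate the previous record
              hPutStr output (name ++ ": ")
              go False
            _ -> do
              hPutStr output line                   -- sequence data: append to the current record
              go atStart
```

A possible call is withFile fileName ReadMode (\h -> streamFasta h stdout), which prints the records of a file without building a list of them first.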
Let's start with the solution:
Step-by-step solution for the FASTA format task in Haskell
- First code:
  - The function parseFasta takes a file name as an argument and reads the file with the readFile function.
  - The readFasta function turns the lines of the file into a list of tuples, each holding a sequence name and its sequence code.
  - The groupBy function groups consecutive lines that do not start with a > character, so each header line stands alone and the lines of one sequence end up together; map concat then joins each group into a single string.
  - The pair function combines each header with the sequence that follows it into a tuple, dropping the leading > from the name.
  - The mapM_ function runs a putStrLn action for each tuple, printing the sequence name and code to standard output.
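To make the grouping step concrete, here is a small sketch of how groupBy behaves on a handful of lines. The names groupLines and sample, and the sample data itself, are made up for illustration; notName is written with pattern matching so that it also accepts empty lines.

```haskell
import Data.List ( groupBy )

-- True for any line that is not a '>' header (including empty lines).
notName :: String -> Bool
notName ('>' : _) = False
notName _         = True

-- Group consecutive non-header lines; every header ends up in a group of its own.
groupLines :: [String] -> [[String]]
groupLines = groupBy (\x y -> notName x && notName y)

-- Made-up sample input, for illustration only.
sample :: [String]
sample = [">seq1", "ACGT", "TTGA", ">seq2", "GGCC"]

-- groupLines sample
--   == [[">seq1"], ["ACGT", "TTGA"], [">seq2"], ["GGCC"]]
-- map concat (groupLines sample)
--   == [">seq1", "ACGTTTGA", ">seq2", "GGCC"]
```

After map concat the list alternates between headers and joined sequences, which is the shape that the pair function expects.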
- Second code:
  - The function parseFasta takes a file name as an argument and reads the file with the readFile function.
  - The readP_to_S function runs the readFasta parser on the file contents.
  - The readFasta parser is defined with the ReadP library, which provides combinators for building parsers for structured data.
  - The pair parser combines the name and code parsers into a single parser that returns a tuple.
  - The name parser matches a > character followed by a sequence of alphanumeric characters and underscores, up to the end of the line.
  - The code parser matches a sequence of lines, each consisting of alphabetic characters, and concatenates them.
  - The newline parser matches a newline character.
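As a small illustration of how a ReadP parser is run, the sketch below applies a header parser of the same shape as the name parser to a single line. The names header and parseAll are hypothetical helpers introduced here; parseAll keeps only parses that consumed the entire input, which avoids calling last on an empty result list.

```haskell
import Text.ParserCombinators.ReadP
import Control.Applicative ( (<|>) )
import Data.Char ( isAlphaNum )

-- Header parser with the same shape as the name parser described above.
header :: ReadP String
header = char '>' *> many (satisfy isAlphaNum <|> char '_') <* char '\n'

-- Hypothetical helper: return the first parse that consumed the whole input.
parseAll :: ReadP a -> String -> Maybe a
parseAll p input =
  case [ x | (x, "") <- readP_to_S p input ] of
    (x : _) -> Just x
    []      -> Nothing

-- parseAll header ">seq_1\n" == Just "seq_1"
-- parseAll header "seq_1\n"  == Nothing   (no leading '>')
-- parseAll header ">seq 1\n" == Nothing   (a space is not accepted in a name)
```

The last case shows a limitation of this name parser: header lines containing spaces or other punctuation are rejected.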
Source code in the Haskell programming language
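-- First version: group the lines with Data.List.groupBy
import Data.List ( groupBy )

parseFasta :: FilePath -> IO ()
parseFasta fileName = do
  file <- readFile fileName
  let pairedFasta = readFasta $ lines file
  mapM_ (\(name, code) -> putStrLn $ name ++ ": " ++ code) pairedFasta

readFasta :: [String] -> [(String, String)]
readFasta = pair . map concat . groupBy (\x y -> notName x && notName y)
  where
    -- True for every line that is not a '>' header; pattern matching
    -- avoids calling head on an empty line.
    notName :: String -> Bool
    notName ('>' : _) = False
    notName _         = True

    -- Pair each header with the sequence that follows it; a trailing
    -- header without a sequence is dropped instead of causing a crash.
    pair :: [String] -> [(String, String)]
    pair (x : y : xs) = (drop 1 x, y) : pair xs
    pair _            = []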
-- Second version (a separate program): parse the whole file with ReadP
import Text.ParserCombinators.ReadP
import Control.Applicative ( (<|>) )
import Data.Char ( isAlpha, isAlphaNum )

parseFasta :: FilePath -> IO ()
parseFasta fileName = do
  file <- readFile fileName
  -- readP_to_S returns every possible parse. Because readFasta ends with eof,
  -- an empty list means the input did not parse (for example, no final newline).
  case readP_to_S readFasta file of
    []      -> putStrLn "no parse"
    results -> mapM_ (\(name, code) -> putStrLn $ name ++ ": " ++ code) (fst (last results))

readFasta :: ReadP [(String, String)]
readFasta = many pair <* eof
  where
    pair    = (,) <$> name <*> code
    name    = char '>' *> many (satisfy isAlphaNum <|> char '_') <* newline
    code    = concat <$> many (many (satisfy isAlpha) <* newline)
    newline = char '\n'