How to resolve the algorithm Bioinformatics/Global alignment step by step in the Nim programming language
How to resolve the algorithm Bioinformatics/Global alignment step by step in the Nim programming language
Table of Contents
Problem Statement
Global alignment is designed to search for highly similar regions in two or more DNA sequences, where the sequences appear in the same order and orientation, fitting the sequences in as pieces in a puzzle. Current DNA sequencers find the sequence for multiple small segments of DNA which have mostly randomly formed by splitting a much larger DNA molecule into shorter segments. When re-assembling such segments of DNA sequences into a larger sequence to form, for example, the DNA coding for the relevant gene, the overlaps between multiple shorter sequences are commonly used to decide how the longer sequence is to be assembled. For example, "AAGATGGA", GGAGCGCATC", and "ATCGCAATAAGGA" can be assembled into the sequence "AAGATGGAGCGCATCGCAATAAGGA" by noting that "GGA" is at the tail of the first string and head of the second string and "ATC" likewise is at the tail of the second and head of the third string. When looking for the best global alignment in the output strings produced by DNA sequences, there are typically a large number of such overlaps among a large number of sequences. In such a case, the ordering that results in the shortest common superstring is generrally preferred. Finding such a supersequence is an NP-hard problem, and many algorithms have been proposed to shorten calculations, especially when many very long sequences are matched. The shortest common superstring as used in bioinfomatics here differs from the string task Shortest_common_supersequence. In that task, a supersequence may have other characters interposed as long as the characters of each subsequence appear in order, so that (abcbdab, abdcaba) -> abdcabdab. In this task, (abcbdab, abdcaba) -> abcbdabdcaba.
Let's start with the solution:
Step by Step solution about How to resolve the algorithm Bioinformatics/Global alignment step by step in the Nim programming language
Source code in the nim programming language
import algorithm, sequtils, strformat, strutils, tables
const ACGT = ['A', 'C', 'G', 'T'] # Four DNA bases.
iterator permutations(slist: seq[string]): seq[string] =
var slist = sorted(slist)
yield slist
while slist.nextPermutation():
yield slist
proc printCounts(dnaSeq: string) =
## Given a DNA sequence, report the sequence, length and base counts.
let counts = dnaSeq.toCountTable()
echo &"\nNucleotide counts for {dnaSeq}:\n"
for base in ACGT:
echo &"{($base):>10} {counts[base]:11}"
var others = 0
for base in counts.keys:
if base notin ACGT: inc others, counts[base]
echo &" Other {others:11}"
echo &" ————————————————————"
echo &" Total length {dnaSeq.len: 7}"
func headTailOverlap(s1, s2: string): int =
## Return the position in "s1" of the start of overlap
## of tail of string "s1" with head of string "s2".
var start = 0
while true:
start = s1.find(s2[0], start)
if start < 0: return 0
if s2.startsWith(s1[start..^1]): return s1.len - start
inc start
proc deduplicate(slist: seq[string]): seq[string] =
## Remove duplicates and strings contained within a larger string from a list of strings.
let slist = sequtils.deduplicate(slist)
for i, s1 in slist:
block check:
for j, s2 in slist:
if j != i and s1 in s2:
break check
# "s1" is not contained in another string.
result.add s1
func shortestCommonSuperstring(slist: seq[string]): string =
## Return shortest common superstring of a list of strings.
let slist = slist.deduplicate()
result = slist.join()
for perm in slist.permutations():
var sup = perm[0]
for i in 0..<slist.high:
let overlapPos = headTailOverlap(perm[i], perm[i+1])
sup &= perm[i+1][overlapPos..^1]
if sup.len < result.len: result = sup
const TestSequences = [
@["TA", "AAG", "TA", "GAA", "TA"],
@["CATTAGGG", "ATTAG", "GGG", "TA"],
@["AAGAUGGA", "GGAGCGCAUC", "AUCGCAAUAAGGA"],
@["ATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTAT",
"GGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGT",
"CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
"TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
"AACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT",
"GCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTC",
"CGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCT",
"TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGC",
"GATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATT",
"TTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
"CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
"TCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGA"]]
for test in TestSequences:
let scs = test.shortestCommonSuperstring
scs.printCounts()
You may also check:How to resolve the algorithm Conditional structures step by step in the V programming language
You may also check:How to resolve the algorithm Rot-13 step by step in the AutoHotkey programming language
You may also check:How to resolve the algorithm Generate lower case ASCII alphabet step by step in the Swift programming language
You may also check:How to resolve the algorithm Largest proper divisor of n step by step in the Delphi programming language
You may also check:How to resolve the algorithm Hello world/Text step by step in the MUMPS programming language