How to resolve the algorithm Bioinformatics/Global alignment step by step in the Nim programming language

Published on 12 May 2024 09:40 PM

How to resolve the algorithm Bioinformatics/Global alignment step by step in the Nim programming language

Table of Contents

Problem Statement

Global alignment is designed to search for highly similar regions in two or more DNA sequences, where the sequences appear in the same order and orientation, fitting the sequences in as pieces in a puzzle. Current DNA sequencers find the sequence for multiple small segments of DNA which have mostly randomly formed by splitting a much larger DNA molecule into shorter segments. When re-assembling such segments of DNA sequences into a larger sequence to form, for example, the DNA coding for the relevant gene, the overlaps between multiple shorter sequences are commonly used to decide how the longer sequence is to be assembled. For example, "AAGATGGA", GGAGCGCATC", and "ATCGCAATAAGGA" can be assembled into the sequence "AAGATGGAGCGCATCGCAATAAGGA" by noting that "GGA" is at the tail of the first string and head of the second string and "ATC" likewise is at the tail of the second and head of the third string. When looking for the best global alignment in the output strings produced by DNA sequences, there are typically a large number of such overlaps among a large number of sequences. In such a case, the ordering that results in the shortest common superstring is generrally preferred. Finding such a supersequence is an NP-hard problem, and many algorithms have been proposed to shorten calculations, especially when many very long sequences are matched. The shortest common superstring as used in bioinfomatics here differs from the string task Shortest_common_supersequence. In that task, a supersequence may have other characters interposed as long as the characters of each subsequence appear in order, so that (abcbdab, abdcaba) -> abdcabdab. In this task, (abcbdab, abdcaba) -> abcbdabdcaba.

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Bioinformatics/Global alignment step by step in the Nim programming language

Source code in the nim programming language

import algorithm, sequtils, strformat, strutils, tables

const ACGT = ['A', 'C', 'G', 'T']   # Four DNA bases.

iterator permutations(slist: seq[string]): seq[string] =
  var slist = sorted(slist)
  yield slist
  while slist.nextPermutation():
    yield slist


proc printCounts(dnaSeq: string) =
  ## Given a DNA sequence, report the sequence, length and base counts.
  let counts = dnaSeq.toCountTable()
  echo &"\nNucleotide counts for {dnaSeq}:\n"
  for base in ACGT:
    echo &"{($base):>10} {counts[base]:11}"
  var others = 0
  for base in counts.keys:
    if base notin ACGT: inc others, counts[base]
  echo &"     Other {others:11}"
  echo &"  ————————————————————"
  echo &"  Total length {dnaSeq.len: 7}"


func headTailOverlap(s1, s2: string): int =
  ## Return the position in "s1" of the start of overlap
  ## of tail of string "s1" with head of string "s2".
  var start = 0
  while true:
    start = s1.find(s2[0], start)
    if start < 0: return 0
    if s2.startsWith(s1[start..^1]): return s1.len - start
    inc start


proc deduplicate(slist: seq[string]): seq[string] =
  ## Remove duplicates and strings contained within a larger string from a list of strings.
  let slist = sequtils.deduplicate(slist)
  for i, s1 in slist:
    block check:
      for j, s2 in slist:
        if j != i and s1 in s2:
          break check
      # "s1" is not contained in another string.
      result.add s1


func shortestCommonSuperstring(slist: seq[string]): string =
  ## Return shortest common superstring of a list of strings.

  let slist = slist.deduplicate()
  result = slist.join()
  for perm in slist.permutations():
    var sup = perm[0]
    for i in 0..<slist.high:
      let overlapPos = headTailOverlap(perm[i], perm[i+1])
      sup &= perm[i+1][overlapPos..^1]
    if sup.len < result.len: result = sup


const TestSequences = [
  @["TA", "AAG", "TA", "GAA", "TA"],
  @["CATTAGGG", "ATTAG", "GGG", "TA"],
  @["AAGAUGGA", "GGAGCGCAUC", "AUCGCAAUAAGGA"],
  @["ATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTAT",
    "GGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGT",
    "CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
    "TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
    "AACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT",
    "GCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTC",
    "CGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCT",
    "TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
    "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGC",
    "GATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATT",
    "TTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
    "CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
    "TCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGA"]]

for test in TestSequences:
  let scs = test.shortestCommonSuperstring
  scs.printCounts()


  

You may also check:How to resolve the algorithm Conditional structures step by step in the V programming language
You may also check:How to resolve the algorithm Rot-13 step by step in the AutoHotkey programming language
You may also check:How to resolve the algorithm Generate lower case ASCII alphabet step by step in the Swift programming language
You may also check:How to resolve the algorithm Largest proper divisor of n step by step in the Delphi programming language
You may also check:How to resolve the algorithm Hello world/Text step by step in the MUMPS programming language