How to resolve the algorithm Bioinformatics/Global alignment step by step in the jq programming language

Published on 12 May 2024 09:40 PM
#Jq

How to resolve the algorithm Bioinformatics/Global alignment step by step in the jq programming language

Table of Contents

Problem Statement

Global alignment is designed to search for highly similar regions in two or more DNA sequences, where the sequences appear in the same order and orientation, fitting the sequences in as pieces in a puzzle. Current DNA sequencers find the sequence for multiple small segments of DNA which have mostly randomly formed by splitting a much larger DNA molecule into shorter segments. When re-assembling such segments of DNA sequences into a larger sequence to form, for example, the DNA coding for the relevant gene, the overlaps between multiple shorter sequences are commonly used to decide how the longer sequence is to be assembled. For example, "AAGATGGA", GGAGCGCATC", and "ATCGCAATAAGGA" can be assembled into the sequence "AAGATGGAGCGCATCGCAATAAGGA" by noting that "GGA" is at the tail of the first string and head of the second string and "ATC" likewise is at the tail of the second and head of the third string. When looking for the best global alignment in the output strings produced by DNA sequences, there are typically a large number of such overlaps among a large number of sequences. In such a case, the ordering that results in the shortest common superstring is generrally preferred. Finding such a supersequence is an NP-hard problem, and many algorithms have been proposed to shorten calculations, especially when many very long sequences are matched. The shortest common superstring as used in bioinfomatics here differs from the string task Shortest_common_supersequence. In that task, a supersequence may have other characters interposed as long as the characters of each subsequence appear in order, so that (abcbdab, abdcaba) -> abdcabdab. In this task, (abcbdab, abdcaba) -> abcbdabdcaba.

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Bioinformatics/Global alignment step by step in the jq programming language

Source code in the jq programming language

### Generic helper functions

# bag-of-words
def bow(stream): 
  reduce stream as $word ({}; .[($word|tostring)] += 1);

def permutations:
  if length == 0 then []
  else
    range(0;length) as $i
    | [.[$i]] + (del(.[$i])|permutations)
  end ;

# Give a synoptic view of the input string,
# highlighting the occurrence of ACGTU letters
def synopsis:
  ["A", "C", "G", "T", "U"] as $standard
  | . as $seq
  | bow(explode | map([.]|implode)[]) as $bases
  | ("Nucleotide counts for \($seq):\n"),
    (($standard + ($bases|keys - $standard))[] | "\(.): \($bases[.]//0)"),
    "__",
    "Σ: \($seq|length)" ;

# If the strings, $s1 and $s2, overlap by at least $minimumoverlap characters,
# return { i1: <index in $s1 where overlap starts>,  overlap: <overlapping string>},
# otherwise, return null
def overlap_info($s1; $s2; $minimumoverlap):
  first( range(0; $s1|length + 1 - $minimumoverlap) as $i1
         | $s1[$i1:] as $overlap
         | select($s2 | startswith($overlap))
	 | {$i1, $overlap} ) // null ;

# Input: an array of strings
# Remove duplicates and strings contained within a larger string
def deduplicate:
  unique
  | . as $arr
  | reduce range(0;length) as $i ([];
      $arr[$i] as $s1
      | if any( $arr[] | select(. != $s1); index($s1))
        then .
	else . + [$s1]
	end);

# Given an array of deduplicated strings, attempt to find a superstring
# composed of these strings in the same order;
# return it if found, else null.
def relevant($min):
  . as $in
  | length as $length
  | {s: .[0], i:0}
  | until (.s == null or .i >= $length - 1;
       .i as $i
       # Since the strings have been deduplicated we can use $in[$i]:
       | overlap_info($in[$i]; $in[$i+1]; $min) as $overlap
       | if $overlap then .s += $in[$i+1][$overlap.overlap|length:]
         else .s = null
	 end
       | .i += 1 )
   | .s ;
     
# Input: an array of strings
# Return shortest common superstring
def shortest_common_superstring:
  deduplicate as $ss
  | reduce ($ss | permutations) as $perm ({shortestsuper: ($ss | add) };
      ($perm | relevant(1)) as $candidate
      | if $candidate and ($candidate|length) < (.shortestsuper|length)
        then .shortestsuper = $candidate
        else . end)
  | .shortestsuper;

def examples:
 [
  ["TA", "AAG", "TA", "GAA", "TA"],

  ["CATTAGGG", "ATTAG", "GGG", "TA"],

  ["AAGAUGGA", "GGAGCGCAUC", "AUCGCAAUAAGGA"],

  ["ATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTAT",
   "GGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGT",
   "CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
   "TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
   "AACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT",
   "GCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTC",
   "CGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCT",
   "TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
   "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGC",
   "GATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATT",
   "TTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
   "CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
   "TCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGA"]
  ];

def tasks:
  def t: shortest_common_superstring | synopsis;

  examples
  | . as $examples
  | range(0;length) as $i
  |  "Task \($i+1):", ($examples[$i]|t), "";

tasks

  

You may also check:How to resolve the algorithm Middle three digits step by step in the RPL programming language
You may also check:How to resolve the algorithm Mouse position step by step in the MiniScript programming language
You may also check:How to resolve the algorithm Loops/Infinite step by step in the RPL programming language
You may also check:How to resolve the algorithm Write float arrays to a text file step by step in the zkl programming language
You may also check:How to resolve the algorithm Special variables step by step in the 6502 Assembly programming language