How to resolve the algorithm Bioinformatics/Global alignment step by step in the Java programming language

Problem Statement
Step by Step Solution
Sourcecode

Problem Statement

Global alignment is designed to search for highly similar regions in two or more DNA sequences, where the sequences appear in the same order and orientation, fitting the sequences in as pieces in a puzzle. Current DNA sequencers find the sequence for multiple small segments of DNA which have mostly randomly formed by splitting a much larger DNA molecule into shorter segments. When re-assembling such segments of DNA sequences into a larger sequence to form, for example, the DNA coding for the relevant gene, the overlaps between multiple shorter sequences are commonly used to decide how the longer sequence is to be assembled. For example, "AAGATGGA", GGAGCGCATC", and "ATCGCAATAAGGA" can be assembled into the sequence "AAGATGGAGCGCATCGCAATAAGGA" by noting that "GGA" is at the tail of the first string and head of the second string and "ATC" likewise is at the tail of the second and head of the third string. When looking for the best global alignment in the output strings produced by DNA sequences, there are typically a large number of such overlaps among a large number of sequences. In such a case, the ordering that results in the shortest common superstring is generrally preferred. Finding such a supersequence is an NP-hard problem, and many algorithms have been proposed to shorten calculations, especially when many very long sequences are matched. The shortest common superstring as used in bioinfomatics here differs from the string task Shortest_common_supersequence. In that task, a supersequence may have other characters interposed as long as the characters of each subsequence appear in order, so that (abcbdab, abdcaba) -> abdcabdab. In this task, (abcbdab, abdcaba) -> abcbdabdcaba.

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Bioinformatics/Global alignment step by step in the Java programming language

BioinformaticsGlobalAlignment.java is a Java program that finds the shortest common superstrings of a list of strings. A superstring of a set of strings is a string that contains all of the strings in the set as substrings. The shortest common superstring of a set of strings is the shortest string that is a superstring of all of the strings in the set.

The program first deduplicates the list of strings, removes any duplicate strings or strings that are substrings of other strings. It then finds all permutations of the deduplicated list. For each permutation, it concatenates the strings in the permutation, removing any common suffixes and prefixes. The resulting string is a candidate for the shortest common superstring. The program keeps track of the shortest candidate and returns a set of all of the shortest candidates.

Finally, the program prints a report for each shortest common superstring. The report includes the nucleotide counts for the superstring and the total number of nucleotides in the superstring.

Here is a breakdown of the program:

The main method reads a list of test sequences from the command line and passes the list to the shortestCommonSuperstrings method. The shortestCommonSuperstrings method returns a set of all of the shortest common superstrings of the list of strings. The main method then prints a report for each shortest common superstring.
The shortestCommonSuperstrings method takes a list of strings as input and returns a set of all of the shortest common superstrings of the list of strings. The method first deduplicates the list of strings, removes any duplicate strings or strings that are substrings of other strings. It then finds all permutations of the deduplicated list. For each permutation, it concatenates the strings in the permutation, removing any common suffixes and prefixes. The resulting string is a candidate for the shortest common superstring. The program keeps track of the shortest candidate and returns a set of all of the shortest candidates.
The deduplicate method takes a list of strings as input and returns a list of deduplicated strings. The method first creates a set of all of the unique strings in the list. It then iterates over the unique strings and removes any string that is a substring of another string in the set. The resulting set of strings is returned.
The concatenate method takes two strings as input and returns a concatenated string. The method removes the longest suffix of the first string that matches a prefix of the second string before concatenating the two strings. The resulting string is returned.
The permutations method takes a list of strings as input and returns a list of all of the permutations of the list of strings. The method uses a recursive algorithm to generate all of the permutations.
The printReport method takes a string as input and prints a report for the string. The report includes the nucleotide counts for the string and the total number of nucleotides in the string.

Source code in the java programming language

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public final class BioinformaticsGlobalAlignment {

	public static void main(String[] aArgs) {
		List<List<String>> testSequences = Arrays.asList(
			Arrays.asList( "TA", "AAG", "TA", "GAA", "TA" ),
			Arrays.asList( "CATTAGGG", "ATTAG", "GGG", "TA" ),
			Arrays.asList( "AAGAUGGA", "GGAGCGCAUC", "AUCGCAAUAAGGA" ),
			Arrays.asList( "ATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTAT",
				"GGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGT",
				"CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
				"TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
				"AACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT",
				"GCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTC",
				"CGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCT",
				"TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
				"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGC",
				"GATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATT",
				"TTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
				"CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
				"TCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGA" )
		);
		
		for ( List<String> test : testSequences ) {
			for ( String superstring : shortestCommonSuperstrings(test) ) {
				printReport(superstring);
			}
		}
	}
	
	// Return a set containing all of the shortest common superstrings of the given list of strings. 
	private static Set<String> shortestCommonSuperstrings(List<String> aList) {
		List<String> deduplicated = deduplicate(aList);
		Set<String> shortest = new HashSet<String>();
		shortest.add(String.join("", deduplicated));
		int shortestLength = aList.stream().mapToInt( s -> s.length() ).sum();
			  
		for ( List<String> permutation : permutations(deduplicated) ) {
			String candidate = permutation.stream().reduce("", (a, b) -> concatenate(a, b) );
	        if ( candidate.length() < shortestLength ) {
	            shortest.clear();
	            shortest.add(candidate);
	            shortestLength = candidate.length();
	        } else if ( candidate.length() == shortestLength ) {
	            shortest.add(candidate);
	        }
		}
		return shortest;		
	}

	// Remove duplicate strings and strings which are substrings of other strings in the given list.
	private static List<String> deduplicate(List<String> aList) {
		List<String> unique = aList.stream().distinct().collect(Collectors.toList());
		List<String> result = new ArrayList<String>(unique);
		List<String> markedForRemoval = new ArrayList<String>();
		for ( String testWord : result ) {
			for ( String word : unique ) {
				if ( ! word.equals(testWord) && word.contains(testWord) ) {
					markedForRemoval.add(testWord);
				}
			}
		}
		result.removeAll(markedForRemoval);
		return result;
	}	
	
	// Return aBefore concatenated with aAfter, removing the longest suffix of aBefore that matches a prefix of aAfter.
	private static String concatenate(String aBefore, String aAfter) {	
	    for ( int i = 0; i < aBefore.length(); i++ ) {
	        if ( aAfter.startsWith(aBefore.substring(i, aBefore.length())) ) {
	            return aBefore.substring(0, i) + aAfter;
	        }
	    }
	    return aBefore + aAfter;
	}
	
	// Return all permutations of the given list of strings.
	private static List<List<String>> permutations(List<String> aList) {
		int[] indexes = new int[aList.size()];
	    List<List<String>> result = new ArrayList<List<String>>();
	    result.add( new ArrayList<String>(aList) );
	    int i = 0;
		while ( i < aList.size() ) {
		    if ( indexes[i] < i ) {
		    	final int j = ( i % 2 == 0 ) ? 0 : indexes[i];
		    	String temp = aList.get(j);
		    	aList.set(j, aList.get(i));
		    	aList.set(i, temp);
		        result.add( new ArrayList<String>(aList) );
		        indexes[i]++;
		        i = 0;
		    } else {
		        indexes[i] = 0;
		        i += 1;
		    }
		}	    
		return result;
	}
	
	// Print a report of the given string to the standard output device.
	private static void printReport(String aText) {
		char[] nucleotides = new char[] {'A', 'C', 'G', 'T' };
		Map<Character, Integer> bases = new HashMap<Character, Integer>();
		for ( char base : nucleotides ) {
			bases.put(base, 0);
		}
		
		for ( char ch : aText.toCharArray() ) {
			bases.merge(ch, 1, Integer::sum);
		}
		final int total = bases.values().stream().reduce(0, Integer::sum);
		
		System.out.print("Nucleotide counts for: " + ( ( aText.length() > 50 ) ? System.lineSeparator() : "") );
		System.out.println(aText);
		System.out.print(String.format("%s%d%s%d%s%d%s%d",
			"Bases: A: ", bases.get('A'), ", C: ", bases.get('C'), ", G: ", bases.get('G'), ", T: ", bases.get('T')));
		System.out.println(", total: " + total + System.lineSeparator());
	}

}

You may also check:How to resolve the algorithm Count the coins step by step in the Raku programming language
You may also check:How to resolve the algorithm Fibonacci word step by step in the Fōrmulæ programming language
You may also check:How to resolve the algorithm Nonoblock step by step in the Elixir programming language
You may also check:How to resolve the algorithm Empty string step by step in the DWScript programming language
You may also check:How to resolve the algorithm N-smooth numbers step by step in the Go programming language

How to resolve the algorithm Bioinformatics/Global alignment step by step in the Java programming language

Table of Contents

Problem Statement

Step by Step solution about How to resolve the algorithm Bioinformatics/Global alignment step by step in the Java programming language

Source code in the java programming language