How to resolve the algorithm Sorensen–Dice coefficient step by step in the Java programming language

Problem Statement
Step by Step Solution
Sourcecode

Problem Statement

The Sørensen–Dice coefficient, also known as the Sørensen–Dice index (or sdi, or sometimes by one of the individual names: sorensen or dice,) is a statistic used to gauge the similarity of two poulation samples. The original use was in botany, indexing similarity between populations of flora and fauna in different areas, but it has uses in other fields as well. It can be used as a text similarity function somewhat similar to the Levenshtein edit distance function, though it's strength lies in a different area. Levenshtein is good for finding misspellings, but relies on the tested word / phrase being pretty similar to the desired one, and can be very slow for long words / phrases. Sørensen–Dice is more useful for 'fuzzy' matching partial, and poorly spelled words / phrases, possibly in improper order. There are several different methods to tokenize objects for Sørensen–Dice comparisons. The most typical tokenizing scheme for text is to break the words up into bi-grams: groups of two consecutive letters. For instance, the word 'differ' would be tokenized to the group: Different implementations may do slightly different transforms. For our purposes, fold case so that all characters are the same case, split words, and ignore white-space, but keep punctuation. The phrase "Don't Panic!" will be tokenized to the bi-grams: Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups. For items(objects, populations) A and B: The Sørensen–Dice coefficient is a "percent similarity" between the two populations between 0.0 and 1.0. SDI can by used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is for fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but often gets remarkably close with some pretty poor inputs.

How you get the task names is peripheral to the task. You can web-scrape them or download them to a file, whatever. If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations it is acceptable to use that with a pointer to where it may be obtained.

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Sorensen–Dice coefficient step by step in the Java programming language

This Java program takes a list of tasks and a list of tests, and for each test, it calculates the Sorensen-Dice coefficient between the test and each task. The Sorensen-Dice coefficient is a measure of similarity between two strings that is calculated by dividing the number of bigrams (sequences of two characters) that are common to both strings by the total number of bigrams in both strings.

The program first reads the list of tasks and tests from files. Then, for each test, it creates a bigram map for the test and for each task. A bigram map is a map from bigrams to the number of times that bigram occurs in the corresponding string.

The program then calculates the Sorensen-Dice coefficient between the bigram map for the test and the bigram map for each task. The Sorensen-Dice coefficient is calculated using the following formula:

Sorensen-Dice coefficient = 2 * (|A ∩ B|) / (|A| + |B|)

where |A| and |B| are the number of bigrams in strings A and B, respectively, and |A ∩ B| is the number of bigrams that are common to both A and B.

After calculating the Sorensen-Dice coefficient for each task, the program prints the top five tasks with the highest Sorensen-Dice coefficients for each test.

Here is a detailed breakdown of the code:

The main method reads the list of tasks and tests from files and then iterates over the tests.
For each test, the main method creates a bigram map for the test and for each task.
The main method then calculates the Sorensen-Dice coefficient between the bigram map for the test and the bigram map for each task.
The main method prints the top five tasks with the highest Sorensen-Dice coefficients for each test.
The sorensenDice method takes two bigram maps as input and returns the Sorensen-Dice coefficient between the two maps.
The createBigrams method takes a string as input and returns a bigram map for the string.
The size method takes a bigram map as input and returns the number of bigrams in the map.

Source code in the java programming language

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class SorensenDiceCoefficient {

	public static void main(String[] args) throws IOException {
		List<String> tasks = Files.readAllLines(Path.of("Rosetta Code Tasks.dat"), StandardCharsets.UTF_8);
		
		List<String> tests = List.of(
			"Primordial primes", "Sunkist-Giuliani formula", "Sieve of Euripides", "Chowder numbers" );	
		
		record TaskValue(String task, double value) {}
			
		for ( String test : tests ) {
			List<TaskValue> taskValues = new ArrayList<TaskValue>();	
			Map<String, Integer> bigramsTest = createBigrams(test);
			for ( String task : tasks ) {
				double value = sorensenDice(bigramsTest, createBigrams(task));
				taskValues.add( new TaskValue(task, value) );
			}
			
			Collections.sort(taskValues, (one, two) -> Double.compare(two.value, one.value));
						
			System.out.println(test + ":");
			for ( int i = 0; i < 5; i++ ) {
				System.out.println(String.format("%s%.4f%s%s",
					"    ", taskValues.get(i).value, " ", taskValues.get(i).task));
			}
			System.out.println();
		}
	}
	
	private static double sorensenDice(Map<String, Integer> bigramsOne, Map<String, Integer> bigramsTwo) {
		int intersectionSize = 0;
		for ( Map.Entry<String, Integer> entry : bigramsOne.entrySet() ) {
			if ( bigramsTwo.keySet().contains(entry.getKey()) ) {
				intersectionSize += Math.min(entry.getValue(), bigramsTwo.get(entry.getKey()));
			}
		}	    
	    return 2.0 * intersectionSize / ( size(bigramsOne) + size(bigramsTwo) );
	}
	
	private static Map<String, Integer> createBigrams(String text) {
		Map<String, Integer> result = new HashMap<String, Integer>();
		for ( String word : text.toLowerCase().split(" ") ) {
			if ( word.length() == 1 ) {
				result.merge(word, 1, Integer::sum);
			} else {
				for ( int i = 0; i < word.length() - 1; i++ ) {
					result.merge(word.substring(i, i + 2), 1, Integer::sum);
				}
			}
		}
		return result;
	}
	
	private static int size(Map<String, Integer> map) {
		return map.values().stream().mapToInt(Integer::intValue).sum();
	}

}

You may also check:How to resolve the algorithm Box the compass step by step in the Visual Basic .NET programming language
You may also check:How to resolve the algorithm Zero to the zero power step by step in the Openscad programming language
You may also check:How to resolve the algorithm Identity matrix step by step in the NetRexx programming language
You may also check:How to resolve the algorithm Rate counter step by step in the Liberty BASIC programming language
You may also check:How to resolve the algorithm Stair-climbing puzzle step by step in the Raku programming language

How to resolve the algorithm Sorensen–Dice coefficient step by step in the Java programming language

Table of Contents

Problem Statement

Step by Step solution about How to resolve the algorithm Sorensen–Dice coefficient step by step in the Java programming language

Source code in the java programming language