How to resolve the algorithm Jaro similarity step by step in the Java programming language
Problem Statement
The Jaro distance is a measure of edit distance between two strings; its inverse, called the Jaro similarity, is a measure of two strings' similarity: the higher the value, the more similar the strings are. The score is normalized such that 0 equates to no similarities and 1 is an exact match.
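The Jaro similarity $d_j$ of two given strings $s_1$ and $s_2$ is

$$d_j = \begin{cases} 0 & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise} \end{cases}$$

Where:
- $m$ is the number of matching characters;
- $t$ is the number of transpositions.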
Two characters from $s_1$ and $s_2$ respectively are considered matching only if they are the same and not farther apart than $\left\lfloor \frac{\max(|s_1|,|s_2|)}{2} \right\rfloor - 1$ characters. Each character of $s_1$ is compared with all its matching characters in $s_2$. Each difference in position is half a transposition; that is, the number of transpositions is half the number of characters which are common to the two strings but occupy different positions in each one.
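The matching window described above can be sketched in a few lines of Java. The class and method names (`MatchWindow`, `matchDistance`, `inWindow`) are hypothetical and chosen only for this illustration; the arithmetic follows the formula in the text, using integer division for the floor.

```java
// Hypothetical helper for illustration; names are not taken from the listing in this article.
final class MatchWindow {
    // floor(max(|s1|, |s2|) / 2) - 1; integer division performs the floor.
    static int matchDistance(int len1, int len2) {
        return Math.max(len1, len2) / 2 - 1;
    }

    // True if index i in the first string and index j in the second string
    // are close enough for equal characters to count as a match.
    static boolean inWindow(int i, int j, int len1, int len2) {
        return Math.abs(i - j) <= matchDistance(len1, len2);
    }

    public static void main(String[] args) {
        System.out.println(matchDistance(6, 5));   // DWAYNE vs DUANE: prints 2
        System.out.println(inWindow(4, 3, 6, 5));  // N at index 4 and index 3: prints true
        System.out.println(inWindow(2, 7, 5, 8));  // X in DIXON and DICKSONX: prints false
    }
}
```

Note that for two one-character strings the formula yields -1, so an implementation may want to clamp the window at 0; that choice is left open in this sketch.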
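Given the strings $s_1$ = DWAYNE and $s_2$ = DUANE we find:
- $m = 4$ matching characters (D, A, N, E)
- $|s_1| = 6$
- $|s_2| = 5$
- $t = 0$ transpositions

We find a Jaro score of:

$$d_j = \frac{1}{3}\left(\frac{4}{6} + \frac{4}{5} + \frac{4 - 0}{4}\right) \approx 0.822$$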
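To make the transposition count concrete, here is a minimal sketch that assumes the matched characters of each string have already been extracted, in the order they occur in that string. The class `Transpositions` and the method `outOfOrder` are hypothetical names used only for this example.

```java
// Hypothetical helper for illustration only.
final class Transpositions {
    // Counts positions at which the two matched-character sequences differ.
    // Half of this count is the number of transpositions.
    static int outOfOrder(String matched1, String matched2) {
        if (matched1.length() != matched2.length()) {
            throw new IllegalArgumentException("both strings must contribute the same number of matches");
        }
        int count = 0;
        for (int i = 0; i < matched1.length(); i++) {
            if (matched1.charAt(i) != matched2.charAt(i)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // DWAYNE vs DUANE: the matched characters are D, A, N, E in both strings.
        System.out.println(outOfOrder("DANE", "DANE") / 2.0);     // prints 0.0
        // MARTHA vs MARHTA: every character matches, but T and H are swapped.
        System.out.println(outOfOrder("MARTHA", "MARHTA") / 2.0); // prints 1.0
    }
}
```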
Implement the Jaro algorithm and show the similarity scores for each of the following pairs:
- ("MARTHA", "MARHTA")
- ("DIXON", "DICKSONX")
- ("JELLYFISH", "SMELLYFISH")
Let's start with the solution:
Step-by-step solution
The code below implements the Jaro similarity in Java.
The Jaro similarity measures how alike two strings are. Unlike the Levenshtein distance, which counts insertions, deletions, and substitutions, it is built from the number of matching characters and the number of transpositions among them.
The Jaro similarity is calculated as follows:
- Find the number of matching characters between the two strings.
- Find the number of transpositions between the two strings.
- Combine the counts:
jaro = ((matches / s_len) + (matches / t_len) + ((matches - transpositions/2) / matches)) / 3
where:
- s_len is the length of the first string
- t_len is the length of the second string
- matches is the number of matching characters
- transpositions is the number of matched characters that appear in a different order in the two strings (half of this count is the number of transpositions)
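As a quick check of the formula, the following sketch evaluates it directly from the four counts. The class `JaroFormula`, the method `fromCounts`, and the camelCase parameter names are assumptions made for this example; the zero-match guard avoids dividing by zero.

```java
import java.util.Locale;

// Hypothetical helper for illustration; it only evaluates the final formula.
final class JaroFormula {
    // transpositions is the raw count of out-of-order matched characters,
    // so it is halved inside the formula.
    static double fromCounts(int matches, int transpositions, int sLen, int tLen) {
        if (matches == 0) return 0.0; // no common characters: similarity is 0
        double m = matches;
        return (m / sLen + m / tLen + (m - transpositions / 2.0) / m) / 3.0;
    }

    public static void main(String[] args) {
        // MARTHA vs MARHTA: 6 matches, T and H out of order.
        System.out.println(String.format(Locale.ROOT, "%.4f", fromCounts(6, 2, 6, 6))); // prints 0.9444
        // DWAYNE vs DUANE: 4 matches, all in order.
        System.out.println(String.format(Locale.ROOT, "%.4f", fromCounts(4, 0, 6, 5))); // prints 0.8222
    }
}
```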
The Jaro similarity is a value between 0 and 1. A value of 0 means the two strings have no matching characters, while a value of 1 means they are identical.
The following table shows the Jaro similarity for some pairs of strings:
Strings | Jaro similarity |
---|---|
MARTHA, MARHTA | 0.9444 |
DIXON, DICKSONX | 0.7667 |
JELLYFISH, SMELLYFISH | 0.8963 |
The Jaro similarity can be used to find similar strings in a database, for example when matching names that may contain typos. It is best suited to short strings rather than whole documents.
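One way such a lookup might work is sketched below: score a query against every candidate and keep the best. The class `ClosestMatch`, the method `closest`, and the candidate list are made up for illustration. The compact `jaro` method follows the definition given in this article, with one assumption added: the matching window is clamped at 0 so that single-character strings can match.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example class; a linear scan, not an optimized index.
final class ClosestMatch {
    static double jaro(String s, String t) {
        int sLen = s.length(), tLen = t.length();
        if (sLen == 0 && tLen == 0) return 1.0;
        // Assumption: clamp the window at 0 for very short strings.
        int window = Math.max(0, Math.max(sLen, tLen) / 2 - 1);
        boolean[] sHit = new boolean[sLen];
        boolean[] tHit = new boolean[tLen];
        int matches = 0;
        for (int i = 0; i < sLen; i++) {
            int end = Math.min(i + window + 1, tLen);
            for (int j = Math.max(0, i - window); j < end; j++) {
                if (!tHit[j] && s.charAt(i) == t.charAt(j)) {
                    sHit[i] = true;
                    tHit[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        for (int i = 0, k = 0; i < sLen; i++) {
            if (!sHit[i]) continue;
            while (!tHit[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / sLen + m / tLen + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    // Returns the candidate with the highest Jaro similarity to the query,
    // or null if the candidate list is empty.
    static String closest(String query, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String c : candidates) {
            double score = jaro(query, c);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("MARTHA", "DWAYNE", "DICKSONX");
        System.out.println(closest("MARHTA", names)); // prints MARTHA
        System.out.println(closest("DUANE", names));  // prints DWAYNE
    }
}
```

A linear scan like this is fine for small lists; for large databases one would typically narrow the candidates first, which is outside the scope of this sketch.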
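Source code in the Java programming language
public class JaroDistance {
    // Returns the Jaro similarity of s and t, a value between 0 and 1.
    public static double jaro(String s, String t) {
        int s_len = s.length();
        int t_len = t.length();

        // Two empty strings are identical.
        if (s_len == 0 && t_len == 0) return 1;

        // Equal characters match only if their indices differ by at most this much.
        // Clamped at 0 so that single-character strings can still match.
        int match_distance = Integer.max(0, Integer.max(s_len, t_len) / 2 - 1);

        // Flags marking which characters of each string are already matched.
        boolean[] s_matches = new boolean[s_len];
        boolean[] t_matches = new boolean[t_len];

        int matches = 0;
        int transpositions = 0;

        // Pass 1: pair each character of s with the first unmatched equal
        // character of t inside the matching window.
        for (int i = 0; i < s_len; i++) {
            int start = Integer.max(0, i - match_distance);
            int end = Integer.min(i + match_distance + 1, t_len);

            for (int j = start; j < end; j++) {
                if (t_matches[j]) continue;
                if (s.charAt(i) != t.charAt(j)) continue;
                s_matches[i] = true;
                t_matches[j] = true;
                matches++;
                break;
            }
        }

        // No common characters: similarity is 0 (also avoids dividing by zero).
        if (matches == 0) return 0;

        // Pass 2: walk the matched characters of both strings in order and
        // count the positions where they differ.
        int k = 0;
        for (int i = 0; i < s_len; i++) {
            if (!s_matches[i]) continue;
            while (!t_matches[k]) k++;
            if (s.charAt(i) != t.charAt(k)) transpositions++;
            k++;
        }

        // Each out-of-order character is half a transposition.
        return (((double) matches / s_len) +
                ((double) matches / t_len) +
                (((double) matches - transpositions / 2.0) / matches)) / 3.0;
    }

    public static void main(String[] args) {
        System.out.println(jaro("MARTHA", "MARHTA"));        // about 0.9444
        System.out.println(jaro("DIXON", "DICKSONX"));       // about 0.7667
        System.out.println(jaro("JELLYFISH", "SMELLYFISH")); // about 0.8963
    }
}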