How to resolve the algorithm Jaro similarity step by step in the C programming language
How to resolve the algorithm Jaro similarity step by step in the C programming language
Table of Contents
Problem Statement
The Jaro distance is a measure of edit distance between two strings; its inverse, called the Jaro similarity, is a measure of two strings' similarity: the higher the value, the more similar the strings are. The score is normalized such that 0 equates to no similarities and 1 is an exact match.
The Jaro similarity
d
j
{\displaystyle d_{j}}
of two given strings
s
1
{\displaystyle s_{1}}
and
s
2
{\displaystyle s_{2}}
is Where:
Two characters from
s
1
{\displaystyle s_{1}}
and
s
2
{\displaystyle s_{2}}
respectively, are considered matching only if they are the same and not farther apart than
⌊
max (
|
s
1
|
,
|
s
2
|
)
2
⌋
− 1
{\displaystyle \left\lfloor {\frac {\max(|s_{1}|,|s_{2}|)}{2}}\right\rfloor -1}
characters. Each character of
s
1
{\displaystyle s_{1}}
is compared with all its matching characters in
s
2
{\displaystyle s_{2}}
. Each difference in position is half a transposition; that is, the number of transpositions is half the number of characters which are common to the two strings but occupy different positions in each one.
Given the strings
s
1
{\displaystyle s_{1}}
DWAYNE and
s
2
{\displaystyle s_{2}}
DUANE we find:
We find a Jaro score of:
Implement the Jaro algorithm and show the similarity scores for each of the following pairs:
Let's start with the solution:
Step by Step solution about How to resolve the algorithm Jaro similarity step by step in the C programming language
The provided C code implements the Jaro distance algorithm to calculate the similarity between two strings. The Jaro distance is a measure of how similar two strings are, with a higher value indicating a greater similarity. The code includes a main function that demonstrates the use of the jaro function by comparing several pairs of strings and printing the resulting distance.
Here's a breakdown of the code:
-
Header Files: The code includes several header files:
<stdlib.h>: Provides functions for memory management and type conversions.<string.h>: Provides functions for string manipulation.<ctype.h>: Provides functions for character-based operations.<stdio.h>: Provides functions for input and output.
-
Macros: The code defines two macros:
TRUE: A constant representing a logical true value (1).FALSE: A constant representing a logical false value (0).
-
maxandminMacros: These macros define the maximum and minimum functions respectively, which are used to determine the boundaries for matching characters. -
jaroFunction: This is the core function that implements the Jaro distance algorithm. It takes two constant character arrays (str1andstr2) as input and returns a double representing the Jaro distance between the two strings. Here's a summary of what the function does:- It calculates the lengths of the input strings (
str1_lenandstr2_len). - It handles special cases where both strings are empty (returns 1) or only one is empty (returns 0).
- It calculates the maximum allowed distance between two characters to be considered a match (
match_distance). - It initializes two arrays,
str1_matchesandstr2_matches, which are used to track matches between characters in the two strings. - It iterates through the characters of
str1and, for each character, searches for a match within a specified distance (match_distance) instr2. If a match is found, it sets the corresponding elements instr1_matchesandstr2_matchestoTRUE. - If no matches are found, it returns 0.
- It counts the number of matches (
matches) and the number of transpositions (transpositions). Transpositions occur when two characters in a match are in different positions in the two strings. - It frees the allocated memory for
str1_matchesandstr2_matches. - It calculates and returns the Jaro distance using the formula:
(matches / str1_len + matches / str2_len + (matches - transpositions) / matches) / 3.0.
- It calculates the lengths of the input strings (
-
mainFunction: Themainfunction demonstrates the use of thejarofunction by comparing several pairs of strings and printing the resulting distance:- It prints the Jaro distance between "MARTHA" and "MARHTA," which is 0.944444.
- It prints the Jaro distance between "DIXON" and "DICKSONX," which is 0.822222.
- It prints the Jaro distance between "JELLYFISH" and "SMELLYFISH," which is 0.766667.
Source code in the c programming language
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <stdio.h>
#define TRUE 1
#define FALSE 0
#define max(a, b) ((a) > (b) ? (a) : (b))
#define min(a, b) ((a) < (b) ? (a) : (b))
double jaro(const char *str1, const char *str2) {
// length of the strings
int str1_len = strlen(str1);
int str2_len = strlen(str2);
// if both strings are empty return 1
// if only one of the strings is empty return 0
if (str1_len == 0) return str2_len == 0 ? 1.0 : 0.0;
// max distance between two chars to be considered matching
// floor() is ommitted due to integer division rules
int match_distance = (int) max(str1_len, str2_len)/2 - 1;
// arrays of bools that signify if that char in the matching string has a match
int *str1_matches = calloc(str1_len, sizeof(int));
int *str2_matches = calloc(str2_len, sizeof(int));
// number of matches and transpositions
double matches = 0.0;
double transpositions = 0.0;
// find the matches
for (int i = 0; i < str1_len; i++) {
// start and end take into account the match distance
int start = max(0, i - match_distance);
int end = min(i + match_distance + 1, str2_len);
for (int k = start; k < end; k++) {
// if str2 already has a match continue
if (str2_matches[k]) continue;
// if str1 and str2 are not
if (str1[i] != str2[k]) continue;
// otherwise assume there is a match
str1_matches[i] = TRUE;
str2_matches[k] = TRUE;
matches++;
break;
}
}
// if there are no matches return 0
if (matches == 0) {
free(str1_matches);
free(str2_matches);
return 0.0;
}
// count transpositions
int k = 0;
for (int i = 0; i < str1_len; i++) {
// if there are no matches in str1 continue
if (!str1_matches[i]) continue;
// while there is no match in str2 increment k
while (!str2_matches[k]) k++;
// increment transpositions
if (str1[i] != str2[k]) transpositions++;
k++;
}
// divide the number of transpositions by two as per the algorithm specs
// this division is valid because the counted transpositions include both
// instances of the transposed characters.
transpositions /= 2.0;
// free the allocated memory
free(str1_matches);
free(str2_matches);
// return the Jaro distance
return ((matches / str1_len) +
(matches / str2_len) +
((matches - transpositions) / matches)) / 3.0;
}
int main() {
printf("%f\n", jaro("MARTHA", "MARHTA"));
printf("%f\n", jaro("DIXON", "DICKSONX"));
printf("%f\n", jaro("JELLYFISH", "SMELLYFISH"));
}
You may also check:How to resolve the algorithm Sorting Algorithms/Circle Sort step by step in the C programming language
You may also check:How to resolve the algorithm Bitwise IO step by step in the C programming language
You may also check:How to resolve the algorithm Numerical integration/Gauss-Legendre Quadrature step by step in the C programming language
You may also check:How to resolve the algorithm Order by pair comparisons step by step in the C programming language
You may also check:How to resolve the algorithm Burrows–Wheeler transform step by step in the C programming language