How to resolve the algorithm Jaro similarity step by step in the C programming language

Published on 7 June 2024 03:52 AM
#C

How to resolve the algorithm Jaro similarity step by step in the C programming language

Table of Contents

Problem Statement

The Jaro distance is a measure of edit distance between two strings; its inverse, called the Jaro similarity, is a measure of two strings' similarity: the higher the value, the more similar the strings are. The score is normalized such that   0   equates to no similarities and   1   is an exact match.

The Jaro similarity

d

j

{\displaystyle d_{j}}

of two given strings

s

1

{\displaystyle s_{1}}

and

s

2

{\displaystyle s_{2}}

is Where:

Two characters from

s

1

{\displaystyle s_{1}}

and

s

2

{\displaystyle s_{2}}

respectively, are considered matching only if they are the same and not farther apart than

max (

|

s

1

|

,

|

s

2

|

)

2

− 1

{\displaystyle \left\lfloor {\frac {\max(|s_{1}|,|s_{2}|)}{2}}\right\rfloor -1}

characters. Each character of

s

1

{\displaystyle s_{1}}

is compared with all its matching characters in

s

2

{\displaystyle s_{2}}

. Each difference in position is half a transposition; that is, the number of transpositions is half the number of characters which are common to the two strings but occupy different positions in each one.

Given the strings

s

1

{\displaystyle s_{1}}

DWAYNE   and

s

2

{\displaystyle s_{2}}

DUANE   we find:

We find a Jaro score of:

Implement the Jaro algorithm and show the similarity scores for each of the following pairs:

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Jaro similarity step by step in the C programming language

The provided C code implements the Jaro distance algorithm to calculate the similarity between two strings. The Jaro distance is a measure of how similar two strings are, with a higher value indicating a greater similarity. The code includes a main function that demonstrates the use of the jaro function by comparing several pairs of strings and printing the resulting distance.

Here's a breakdown of the code:

  1. Header Files: The code includes several header files:

    • <stdlib.h>: Provides functions for memory management and type conversions.
    • <string.h>: Provides functions for string manipulation.
    • <ctype.h>: Provides functions for character-based operations.
    • <stdio.h>: Provides functions for input and output.
  2. Macros: The code defines two macros:

    • TRUE: A constant representing a logical true value (1).
    • FALSE: A constant representing a logical false value (0).
  3. max and min Macros: These macros define the maximum and minimum functions respectively, which are used to determine the boundaries for matching characters.

  4. jaro Function: This is the core function that implements the Jaro distance algorithm. It takes two constant character arrays (str1 and str2) as input and returns a double representing the Jaro distance between the two strings. Here's a summary of what the function does:

    • It calculates the lengths of the input strings (str1_len and str2_len).
    • It handles special cases where both strings are empty (returns 1) or only one is empty (returns 0).
    • It calculates the maximum allowed distance between two characters to be considered a match (match_distance).
    • It initializes two arrays, str1_matches and str2_matches, which are used to track matches between characters in the two strings.
    • It iterates through the characters of str1 and, for each character, searches for a match within a specified distance (match_distance) in str2. If a match is found, it sets the corresponding elements in str1_matches and str2_matches to TRUE.
    • If no matches are found, it returns 0.
    • It counts the number of matches (matches) and the number of transpositions (transpositions). Transpositions occur when two characters in a match are in different positions in the two strings.
    • It frees the allocated memory for str1_matches and str2_matches.
    • It calculates and returns the Jaro distance using the formula: (matches / str1_len + matches / str2_len + (matches - transpositions) / matches) / 3.0.
  5. main Function: The main function demonstrates the use of the jaro function by comparing several pairs of strings and printing the resulting distance:

    • It prints the Jaro distance between "MARTHA" and "MARHTA," which is 0.944444.
    • It prints the Jaro distance between "DIXON" and "DICKSONX," which is 0.822222.
    • It prints the Jaro distance between "JELLYFISH" and "SMELLYFISH," which is 0.766667.

Source code in the c programming language

#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <stdio.h>

#define TRUE    1
#define FALSE   0

#define max(a, b) ((a) > (b) ? (a) : (b))
#define min(a, b) ((a) < (b) ? (a) : (b))

double jaro(const char *str1, const char *str2) {
    // length of the strings
    int str1_len = strlen(str1);
    int str2_len = strlen(str2);

    // if both strings are empty return 1
    // if only one of the strings is empty return 0
    if (str1_len == 0) return str2_len == 0 ? 1.0 : 0.0;

    // max distance between two chars to be considered matching
    // floor() is ommitted due to integer division rules
    int match_distance = (int) max(str1_len, str2_len)/2 - 1;

    // arrays of bools that signify if that char in the matching string has a match
    int *str1_matches = calloc(str1_len, sizeof(int));
    int *str2_matches = calloc(str2_len, sizeof(int));

    // number of matches and transpositions
    double matches = 0.0;
    double transpositions = 0.0;

    // find the matches
    for (int i = 0; i < str1_len; i++) {
        // start and end take into account the match distance
        int start = max(0, i - match_distance);
        int end = min(i + match_distance + 1, str2_len);

        for (int k = start; k < end; k++) {
            // if str2 already has a match continue
            if (str2_matches[k]) continue;
            // if str1 and str2 are not
            if (str1[i] != str2[k]) continue;
            // otherwise assume there is a match
            str1_matches[i] = TRUE;
            str2_matches[k] = TRUE;
            matches++;
            break;
        }
    }

    // if there are no matches return 0
    if (matches == 0) {
        free(str1_matches);
        free(str2_matches);
        return 0.0;
    }

    // count transpositions
    int k = 0;
    for (int i = 0; i < str1_len; i++) {
        // if there are no matches in str1 continue
        if (!str1_matches[i]) continue;
        // while there is no match in str2 increment k
        while (!str2_matches[k]) k++;
        // increment transpositions
        if (str1[i] != str2[k]) transpositions++;
        k++;
    }

    // divide the number of transpositions by two as per the algorithm specs
    // this division is valid because the counted transpositions include both
    // instances of the transposed characters.
    transpositions /= 2.0;

    // free the allocated memory
    free(str1_matches);
    free(str2_matches);

    // return the Jaro distance
    return ((matches / str1_len) +
        (matches / str2_len) +
        ((matches - transpositions) / matches)) / 3.0;
}

int main() {
    printf("%f\n", jaro("MARTHA",    "MARHTA"));
    printf("%f\n", jaro("DIXON",     "DICKSONX"));
    printf("%f\n", jaro("JELLYFISH", "SMELLYFISH"));
}


  

You may also check:How to resolve the algorithm Sorting Algorithms/Circle Sort step by step in the C programming language
You may also check:How to resolve the algorithm Bitwise IO step by step in the C programming language
You may also check:How to resolve the algorithm Numerical integration/Gauss-Legendre Quadrature step by step in the C programming language
You may also check:How to resolve the algorithm Order by pair comparisons step by step in the C programming language
You may also check:How to resolve the algorithm Burrows–Wheeler transform step by step in the C programming language