How to resolve the algorithm Jaro-Winkler distance step by step in the Julia programming language

Published on 22 June 2024 08:30 PM

How to resolve the algorithm Jaro-Winkler distance step by step in the Julia programming language

Table of Contents

Problem Statement

The Jaro-Winkler distance is a metric for measuring the edit distance between words. It is similar to the more basic Levenstein distance but the Jaro distance also accounts for transpositions between letters in the words. With the Winkler modification to the Jaro metric, the Jaro-Winkler distance also adds an increase in similarity for words which start with the same letters (prefix). The Jaro-Winkler distance is a modification of the Jaro similarity metric, which measures the similarity between two strings. The Jaro similarity is 1.0 when strings are identical and 0 when strings have no letters in common. Distance measures such as the Jaro distance or Jaro-Winkler distance, on the other hand, are 0 when strings are identical and 1 when they have no letters in common. The Jaro similarity between two strings s1 and s2, simj, is defined as Where:

The Winkler modification to Jaro is to check for identical prefixes of the strings. If we define the number of initial (prefix) characters in common as: l = the length of a common prefix between strings, up to 4 characters and, additionally, select a multiplier (Winkler suggested 0.1) for the relative importance of the prefix for the word similarity: p   =   0.1 The Jaro-Winkler similarity can then be defined as simw = simj + lp(1 - simj) Where: Winkler suggested this be 0.1.

The Jaro-Winkler distance between strings, which is 0.0 for identical strings, is then defined as dw = 1 - simw String metrics such as Jaro-Winkler distance are useful in applications such as spelling checkers, because letter transpositions are common typing errors and humans tend to misspell the middle portions of words more often than their beginnings. This may help a spelling checker program to generate better alternatives for misspelled word replacement. Using a dictionary of your choice and the following list of 9 commonly misspelled words: "accomodate", "definately", "goverment​", "occured", "publically", "recieve​", "seperate", "untill", "wich​"

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Jaro-Winkler distance step by step in the Julia programming language

The provided Julia code defines functions for computing the Jaro-Winkler distance between two strings and finding close words in a dictionary based on that distance. Here's a detailed explanation of the code:

  1. Data Initialization:

    • The code assumes you have a text file named "linuxwords.txt" that contains a list of words. It reads the contents of this file and splits each line into words using the read and split functions. The resulting array of strings is stored in the words constant.
  2. Jaro-Winkler Distance Function:

    • The jarowinklerdistance function takes two strings, s1 and s2, as input and computes the Jaro-Winkler distance between them. Here's how it works:
      • It checks if the length of s1 is less than the length of s2. If so, it swaps s1 and s2 to ensure s1 is the longer string.
      • It calculates the lengths of s1 and s2 and stores them in len1 and len2, respectively.
      • It sets a maximum allowed distance (delta) to half the length of s2 minus one.
      • It initializes an array flag of Bool values with a length equal to len2, all initially set to false. This array will be used to mark potential transpositions.
      • It initializes an empty array ch1_match to store characters from s1 that match characters in s2.
      • It iterates through the characters of s1 using a loop that includes the index i and the character ch1.
      • For each character ch1, it iterates through the characters of s2 using a nested loop that includes the index j and the character ch2.
      • Within the nested loop, it checks if ch1 equals ch2, j is within the range [i - delta, i + delta], and flag[j] is false. If all these conditions are met, it means a potential match has been found.
      • If a match is found, it sets flag[j] to true, adds ch1 to the ch1_match array, and breaks out of the nested loop.
      • After iterating through all characters in s1, it counts the number of matches using matches = length(ch1_match).
      • If matches is zero, it means no matches were found, so it returns 1.0 (indicating a maximum distance).
      • It calculates the number of transpositions by iterating through the characters of s2 and checking if the corresponding flag[j] is true and whether ch2 does not match the character in ch1_match at the same index. It increments the transpositions count accordingly.
      • Finally, it computes the Jaro distance as jaro = (matches / len1 + matches / len2 + (matches - transpositions/2) / matches) / 3.0.
      • It also calculates the common prefix length between s1 and s2 using commonprefix.
      • The function returns 1 - (jaro + commonprefix * 0.1 * (1 - jaro)), which is the Jaro-Winkler distance between s1 and s2.
  3. Close Words Function:

    • The closewords function takes three arguments: s (the input string), maxdistance (the maximum allowed Jaro-Winkler distance), and maxtoreturn (the maximum number of words to return). Here's how it works:
      • It initializes an empty array arr that will store tuples of words and their corresponding Jaro-Winkler distances.
      • It iterates through the words in the words array and calculates the Jaro-Winkler distance between each word and the input string s using the jarowinklerdistance function.
      • If the distance is less than or equal to maxdistance, it adds a tuple (w, jw) to the arr array, where w is the word and jw is the distance.
      • It sorts the arr array in ascending order of Jaro-Winkler distances using the sort! function.
      • It returns the first maxtoreturn elements of the sorted arr array.
  4. Usage:

    • The code defines an array s containing words with common misspellings, such as "accomodate," "definately," "goverment," etc.
    • It iterates through the words in s and prints the closest words in the dictionary with a Jaro-Winkler distance less than 0.15.
    • For each input word, it prints a list of up to five close words along with their corresponding Jaro-Winkler distances.

Source code in the julia programming language

# download("http://users.cs.duke.edu/~ola/ap/linuxwords", "linuxwords.txt")
const words = read("linuxwords.txt", String) |> split .|> strip

function jarowinklerdistance(s1, s2)
    if length(s1) < length(s2)
        s1, s2 = s2, s1
    end
    len1, len2 = length(s1), length(s2)
    len2 == 0 && return 0.0
    delta = max(0, len2 ÷ 2 - 1)
    flag = zeros(Bool, len2)  # flags for possible transpositions, begin as false
    ch1_match = eltype(s1)[]
    for (i, ch1) in enumerate(s1)
        for (j, ch2) in enumerate(s2)
            if (j <= i + delta) && (j >= i - delta) && (ch1 == ch2) && !flag[j]
                flag[j] = true
                push!(ch1_match, ch1)
                break
            end
        end
    end
    matches = length(ch1_match)
    matches == 0 && return 1.0
    transpositions, i = 0, 0
    for (j, ch2) in enumerate(s2)
        if flag[j]
            i += 1
            transpositions += (ch2 != ch1_match[i])
        end
    end
    jaro = (matches / len1 + matches / len2 + (matches - transpositions/2) / matches) / 3.0
    commonprefix = count(i -> s1[i] == s2[i], 1:min(len2, 4))
    return 1 - (jaro + commonprefix * 0.1 * (1 - jaro))
end

function closewords(s, maxdistance, maxtoreturn)
    jw = 0.0
    arr = [(w, jw) for w in words if (jw = jarowinklerdistance(s, w)) <= maxdistance]
    sort!(arr, lt=(x, y) -> x[2] < y[2])
    return length(arr) <= maxtoreturn ? arr : arr[1:maxtoreturn]
end

for s in ["accomodate", "definately", "goverment", "occured", "publically",
    "recieve", "seperate", "untill", "wich"]
    println("\nClose dictionary words ( distance < 0.15 using Jaro-Winkler distance) to '$s' are:")
    println("    Word      |  Distance")
    for (w, jw) in closewords(s, 0.15, 5)
        println(rpad(w, 14), "| ", Float16(jw))
    end
end


  

You may also check:How to resolve the algorithm Show ASCII table step by step in the JavaScript programming language
You may also check:How to resolve the algorithm Perfect numbers step by step in the Common Lisp programming language
You may also check:How to resolve the algorithm Monty Hall problem step by step in the X++ programming language
You may also check:How to resolve the algorithm Operator precedence step by step in the COBOL programming language
You may also check:How to resolve the algorithm Literals/String step by step in the Visual Basic programming language