How to resolve the algorithm Jaro-Winkler distance step by step in the jq programming language

Published on 12 May 2024 09:40 PM
#Jq

How to resolve the algorithm Jaro-Winkler distance step by step in the jq programming language

Table of Contents

Problem Statement

The Jaro-Winkler distance is a metric for measuring the edit distance between words. It is similar to the more basic Levenstein distance but the Jaro distance also accounts for transpositions between letters in the words. With the Winkler modification to the Jaro metric, the Jaro-Winkler distance also adds an increase in similarity for words which start with the same letters (prefix). The Jaro-Winkler distance is a modification of the Jaro similarity metric, which measures the similarity between two strings. The Jaro similarity is 1.0 when strings are identical and 0 when strings have no letters in common. Distance measures such as the Jaro distance or Jaro-Winkler distance, on the other hand, are 0 when strings are identical and 1 when they have no letters in common. The Jaro similarity between two strings s1 and s2, simj, is defined as Where:

The Winkler modification to Jaro is to check for identical prefixes of the strings. If we define the number of initial (prefix) characters in common as: l = the length of a common prefix between strings, up to 4 characters and, additionally, select a multiplier (Winkler suggested 0.1) for the relative importance of the prefix for the word similarity: p   =   0.1 The Jaro-Winkler similarity can then be defined as simw = simj + lp(1 - simj) Where: Winkler suggested this be 0.1.

The Jaro-Winkler distance between strings, which is 0.0 for identical strings, is then defined as dw = 1 - simw String metrics such as Jaro-Winkler distance are useful in applications such as spelling checkers, because letter transpositions are common typing errors and humans tend to misspell the middle portions of words more often than their beginnings. This may help a spelling checker program to generate better alternatives for misspelled word replacement. Using a dictionary of your choice and the following list of 9 commonly misspelled words: "accomodate", "definately", "goverment​", "occured", "publically", "recieve​", "seperate", "untill", "wich​"

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Jaro-Winkler distance step by step in the jq programming language

Source code in the jq programming language

# See [[Jaro_similarity#jq]] for the implementation of jaro/2

def length_of_common_prefix($s1; $s2):
  if ($s1|length) > ($s2|length) then length_of_common_prefix($s2; $s1)
  else ($s1|explode) as $x1
  | ($s2|explode) as $x2
  | first( range(0;$x1|length) | select( $x1[.] != $x2[.] )) // ($x1|length)
  end;

# Output: the Jaro-WInkler distance using 0.1 as the common-prefix multiplier
def jaro_winkler($s1; $s2):
  if $s1 == $s2 then 0
  else jaro($s1; $s2) as $j
  | length_of_common_prefix($s1[:4]; $s2[:4]) as $l
  | 1 - ($j + 0.1 * $l * (1 - $j))
  end ;

# Input: an array of words
# Output: [[match, distance] ...]
def candidates($word; $threshold):
  map(jaro_winkler($word; . ) as $x | select($x <= $threshold) | [., $x] );

def lpad($len): tostring | ($len - length) as $l | (" " * $l)[:$l] + .;

def task:
  [inputs] # the dictionary
  | ("accomodate", "definately", "goverment​", "occured", "publically", "recieve​", "seperate", "untill", "wich​") as $word
  | candidates($word; 0.15) | sort_by(.[-1]) | .[:5]
  | "Matches for \($word|lpad(10)): Distance",
    (.[] | "\(.[0] | lpad(21)) : \(.[-1] * 1000 | round / 1000)") ;

task

  

You may also check:How to resolve the algorithm Mutual recursion step by step in the Oforth programming language
You may also check:How to resolve the algorithm Maze solving step by step in the Python programming language
You may also check:How to resolve the algorithm Animation step by step in the REBOL programming language
You may also check:How to resolve the algorithm Sieve of Eratosthenes step by step in the BASIC programming language
You may also check:How to resolve the algorithm LZW compression step by step in the Ring programming language