How to resolve the algorithm Jaro similarity step by step in the Haskell programming language

Published on 7 June 2024 03:52 AM

How to resolve the algorithm Jaro similarity step by step in the Haskell programming language

Table of Contents

Problem Statement

The Jaro distance is a measure of edit distance between two strings; its inverse, called the Jaro similarity, is a measure of two strings' similarity: the higher the value, the more similar the strings are. The score is normalized such that   0   equates to no similarities and   1   is an exact match.

The Jaro similarity

d

j

{\displaystyle d_{j}}

of two given strings

s

1

{\displaystyle s_{1}}

and

s

2

{\displaystyle s_{2}}

is Where:

Two characters from

s

1

{\displaystyle s_{1}}

and

s

2

{\displaystyle s_{2}}

respectively, are considered matching only if they are the same and not farther apart than

max (

|

s

1

|

,

|

s

2

|

)

2

− 1

{\displaystyle \left\lfloor {\frac {\max(|s_{1}|,|s_{2}|)}{2}}\right\rfloor -1}

characters. Each character of

s

1

{\displaystyle s_{1}}

is compared with all its matching characters in

s

2

{\displaystyle s_{2}}

. Each difference in position is half a transposition; that is, the number of transpositions is half the number of characters which are common to the two strings but occupy different positions in each one.

Given the strings

s

1

{\displaystyle s_{1}}

DWAYNE   and

s

2

{\displaystyle s_{2}}

DUANE   we find:

We find a Jaro score of:

Implement the Jaro algorithm and show the similarity scores for each of the following pairs:

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Jaro similarity step by step in the Haskell programming language

This Haskell code implements the Jaro distance measure, which is a method for calculating the similarity between two strings. The code takes two strings as input and returns a floating-point number between 0 and 1, with 1 indicating a perfect match.

Here is a breakdown of the code:

Modules:

  • The code imports several modules from the Haskell standard library:
    • Data.List: Provides functions for working with lists, such as elemIndex, intercalate, and sortOn.
    • Data.Maybe: Provides functions for working with optional values, such as mapMaybe.
    • Text.Printf: Provides functions for formatted printing, such as printf.

Jaro Distance Function:

The core of the code is the jaro function, which computes the Jaro distance between two strings x and y. Here is a step-by-step explanation of how it works:

  1. It first checks if the number of matches between the two strings is 0 (m == 0). If so, it returns 0, indicating no similarity.

  2. Otherwise, it calculates the Jaro distance using the following formula:

    (1 / 3) * ((m / s1) + (m / s2) + ((m - t) / m))
    

    where:

    • m is the number of matches between the two strings.
    • s1 and s2 are the lengths of the two strings.
    • t is the number of transpositions (character swaps) between the two strings.

Helper Functions:

The jaro function uses several helper functions:

  • matches function: This function takes two strings as input and returns a list of pairs [(Int, a)], where Int represents the position of the character a in the first string. It uses the sortOn function to sort the list by the first element (position) in ascending order.
  • transpositions function: This function takes a list of pairs [(Int, a)] as input and returns the number of transpositions between them. It uses the zip function to pair adjacent elements and then uses the filter and uncurry functions to filter out pairs where the second element of the first pair is greater than the second element of the second pair.
  • f function: This is a helper function that converts a value to a Float.

Testing the Code:

The main function is the entry point of the program. It creates a list of string pairs and uses the mapM_ putStrLn function to print the Jaro distance for each pair to the console. It uses the intercalate function to join the strings and the printf function to format the Jaro distance with three decimal places.

Example Output:

When you run the program, it will print the following output:

DWAYNE -> DUANE -> 0.833
MARTHA -> MARHTA -> 0.967
DIXON -> DICKSONX -> 0.767
JELLYFISH -> SMELLYFISH -> 0.433

Source code in the haskell programming language

import Data.List (elemIndex, intercalate, sortOn)
import Data.Maybe (mapMaybe)
import Text.Printf (printf)

---------------------- JARO DISTANCE ---------------------

jaro :: Ord a => [a] -> [a] -> Float
jaro x y
  | 0 == m = 0
  | otherwise =
    (1 / 3)
      * ( (m / s1) + (m / s2) + ((m - t) / m))
  where
    f = fromIntegral . length
    [m, t] =
      [f, fromIntegral . transpositions]
        <*> [matches x y]
    [s1, s2] = [f] <*> [x, y]

matches :: Eq a => [a] -> [a] -> [(Int, a)]
matches s1 s2 =
  let [(l1, xs), (l2, ys)] =
        sortOn
          fst
          ((length >>= (,)) <$> [s1, s2])
      r = quot l2 2 - 1
   in mapMaybe
        ( \(c, n) ->
            -- Initial chars out of range ?

            let offset = max 0 (n - (r + 1))
             in -- Any offset for this char within range.
                elemIndex c (drop offset (take (n + r) ys))
                  >>= (\i -> Just (offset + i, c))
        )
        (zip xs [1 ..])

transpositions :: Ord a => [(Int, a)] -> Int
transpositions =
  length
    . filter (uncurry (>))
    . (zip <*> tail)

--------------------------- TEST -------------------------
main :: IO ()
main =
  mapM_ putStrLn $
    fmap
      ( \(s1, s2) ->
          intercalate
            " -> "
            [s1, s2, printf "%.3f\n" $ jaro s1 s2]
      )
      [ ("DWAYNE", "DUANE"),
        ("MARTHA", "MARHTA"),
        ("DIXON", "DICKSONX"),
        ("JELLYFISH", "SMELLYFISH")
      ]


  

You may also check:How to resolve the algorithm Greyscale bars/Display step by step in the Raku programming language
You may also check:How to resolve the algorithm Greatest element of a list step by step in the Rust programming language
You may also check:How to resolve the algorithm 100 doors step by step in the Miranda programming language
You may also check:How to resolve the algorithm Sockets step by step in the Neko programming language
You may also check:How to resolve the algorithm Array length step by step in the Wren programming language