How to resolve the algorithm Jaro similarity step by step in the Haskell programming language
How to resolve the algorithm Jaro similarity step by step in the Haskell programming language
Table of Contents
Problem Statement
The Jaro distance is a measure of edit distance between two strings; its inverse, called the Jaro similarity, is a measure of two strings' similarity: the higher the value, the more similar the strings are. The score is normalized such that 0 equates to no similarities and 1 is an exact match.
The Jaro similarity
d
j
{\displaystyle d_{j}}
of two given strings
s
1
{\displaystyle s_{1}}
and
s
2
{\displaystyle s_{2}}
is Where:
Two characters from
s
1
{\displaystyle s_{1}}
and
s
2
{\displaystyle s_{2}}
respectively, are considered matching only if they are the same and not farther apart than
⌊
max (
|
s
1
|
,
|
s
2
|
)
2
⌋
− 1
{\displaystyle \left\lfloor {\frac {\max(|s_{1}|,|s_{2}|)}{2}}\right\rfloor -1}
characters. Each character of
s
1
{\displaystyle s_{1}}
is compared with all its matching characters in
s
2
{\displaystyle s_{2}}
. Each difference in position is half a transposition; that is, the number of transpositions is half the number of characters which are common to the two strings but occupy different positions in each one.
Given the strings
s
1
{\displaystyle s_{1}}
DWAYNE and
s
2
{\displaystyle s_{2}}
DUANE we find:
We find a Jaro score of:
Implement the Jaro algorithm and show the similarity scores for each of the following pairs:
Let's start with the solution:
Step by Step solution about How to resolve the algorithm Jaro similarity step by step in the Haskell programming language
This Haskell code implements the Jaro distance measure, which is a method for calculating the similarity between two strings. The code takes two strings as input and returns a floating-point number between 0 and 1, with 1 indicating a perfect match.
Here is a breakdown of the code:
Modules:
- The code imports several modules from the Haskell standard library:
Data.List
: Provides functions for working with lists, such aselemIndex
,intercalate
, andsortOn
.Data.Maybe
: Provides functions for working with optional values, such asmapMaybe
.Text.Printf
: Provides functions for formatted printing, such asprintf
.
Jaro Distance Function:
The core of the code is the jaro
function, which computes the Jaro distance between two strings x
and y
. Here is a step-by-step explanation of how it works:
-
It first checks if the number of matches between the two strings is 0 (
m == 0
). If so, it returns 0, indicating no similarity. -
Otherwise, it calculates the Jaro distance using the following formula:
(1 / 3) * ((m / s1) + (m / s2) + ((m - t) / m))
where:
m
is the number of matches between the two strings.s1
ands2
are the lengths of the two strings.t
is the number of transpositions (character swaps) between the two strings.
Helper Functions:
The jaro
function uses several helper functions:
matches
function: This function takes two strings as input and returns a list of pairs[(Int, a)]
, whereInt
represents the position of the charactera
in the first string. It uses thesortOn
function to sort the list by the first element (position) in ascending order.transpositions
function: This function takes a list of pairs[(Int, a)]
as input and returns the number of transpositions between them. It uses thezip
function to pair adjacent elements and then uses thefilter
anduncurry
functions to filter out pairs where the second element of the first pair is greater than the second element of the second pair.f
function: This is a helper function that converts a value to aFloat
.
Testing the Code:
The main
function is the entry point of the program. It creates a list of string pairs and uses the mapM_ putStrLn
function to print the Jaro distance for each pair to the console. It uses the intercalate
function to join the strings and the printf
function to format the Jaro distance with three decimal places.
Example Output:
When you run the program, it will print the following output:
DWAYNE -> DUANE -> 0.833
MARTHA -> MARHTA -> 0.967
DIXON -> DICKSONX -> 0.767
JELLYFISH -> SMELLYFISH -> 0.433
Source code in the haskell programming language
import Data.List (elemIndex, intercalate, sortOn)
import Data.Maybe (mapMaybe)
import Text.Printf (printf)
---------------------- JARO DISTANCE ---------------------
jaro :: Ord a => [a] -> [a] -> Float
jaro x y
| 0 == m = 0
| otherwise =
(1 / 3)
* ( (m / s1) + (m / s2) + ((m - t) / m))
where
f = fromIntegral . length
[m, t] =
[f, fromIntegral . transpositions]
<*> [matches x y]
[s1, s2] = [f] <*> [x, y]
matches :: Eq a => [a] -> [a] -> [(Int, a)]
matches s1 s2 =
let [(l1, xs), (l2, ys)] =
sortOn
fst
((length >>= (,)) <$> [s1, s2])
r = quot l2 2 - 1
in mapMaybe
( \(c, n) ->
-- Initial chars out of range ?
let offset = max 0 (n - (r + 1))
in -- Any offset for this char within range.
elemIndex c (drop offset (take (n + r) ys))
>>= (\i -> Just (offset + i, c))
)
(zip xs [1 ..])
transpositions :: Ord a => [(Int, a)] -> Int
transpositions =
length
. filter (uncurry (>))
. (zip <*> tail)
--------------------------- TEST -------------------------
main :: IO ()
main =
mapM_ putStrLn $
fmap
( \(s1, s2) ->
intercalate
" -> "
[s1, s2, printf "%.3f\n" $ jaro s1 s2]
)
[ ("DWAYNE", "DUANE"),
("MARTHA", "MARHTA"),
("DIXON", "DICKSONX"),
("JELLYFISH", "SMELLYFISH")
]
You may also check:How to resolve the algorithm Greyscale bars/Display step by step in the Raku programming language
You may also check:How to resolve the algorithm Greatest element of a list step by step in the Rust programming language
You may also check:How to resolve the algorithm 100 doors step by step in the Miranda programming language
You may also check:How to resolve the algorithm Sockets step by step in the Neko programming language
You may also check:How to resolve the algorithm Array length step by step in the Wren programming language