How to resolve the algorithm Ukkonen’s suffix tree construction step by step in the Julia programming language
How to resolve the algorithm Ukkonen’s suffix tree construction step by step in the Julia programming language
Table of Contents
Problem Statement
Suffix Trees are very useful in numerous string processing and computational biology problems. The task is to create a function which implements Ukkonen’s algorithm to create a useful Suffix Tree as described: Using Arithmetic-geometric mean/Calculate Pi generate the first 1000, 10000, and 100000 decimal places of pi. Using your implementation with an alphabet of 0 through 9 (plus $ say to make the tree explicit) find the longest repeated string in each list. Time your results and demonstrate that your implementation is linear (i.e. that 10000 takes approx. 10 times as long as 1000). You may vary the size of the lists of decimal places of pi to give reasonable answers.
Let's start with the solution:
Step by Step solution about How to resolve the algorithm Ukkonen’s suffix tree construction step by step in the Julia programming language
Overview
The code implements the Ukkonen suffix tree data structure, a space-efficient representation of a string that allows for efficient searching and substring analysis. It provides functions to construct a suffix tree from a given string, traverse the tree to find the longest repeated substring, and print the results.
Data Structures
SuffixTree: Represents the suffix tree. It contains:
nodes
: A vector ofNode
objects representing the nodes of the tree.text
: The original input string.root
: Root node of the tree.position
: Current position in the text during construction.currentnode
: Current node being processed.needsuffixlink
,remainder
,activenode
,activelength
,activeedge
: Internal variables for suffix tree construction.
Node: Represents a node in the suffix tree. It contains:
children
: A dictionary mapping characters to child node indices.start
,ending
: Start and ending positions of the suffix in the text.suffixlink
: Index of the suffix link node.suffixindex
: Index of the longest repeated substring starting at this node (-1 if not a leaf).
Functions
Construction
new_node(st) creates a new node in the suffix tree.
SuffixTree(str) constructs a suffix tree from the given string. It uses the Ukkonen algorithm to incrementally build the tree character by character.
Traversal and Analysis
setsuffixindexbyDFS(st) assigns suffix indices to leaf nodes, indicating the start position of the longest repeated substring starting at that node. It uses depth-first search to traverse the tree.
dotraversal(st) finds the longest repeated substring in the tree by traversing it and identifying the maximum height of any non-leaf node.
getLongestRepeatedSubstring(st) calls dotraversal
and returns the longest repeated substring, along with its start positions in the input string.
testsuffixtree() tests the suffix tree implementation on various input strings and prints the longest repeated substrings.
Example Usage
The code demonstrates the usage of the suffix tree by creating suffix trees for the given example strings and finding the longest repeated substrings. It also tests the performance of the tree on a large string derived from the digits of π.
Time Complexity
The worst-case time complexity for constructing a suffix tree is O(|s|^2), where |s| is the length of the input string. For typical texts, it is typically much faster, close to O(|s|).
Source code in the julia programming language
const oo = typemax(Int)
"""The suffix-tree's node."""
mutable struct Node
children::Dict{Char, Int}
start::Int
ending::Int
suffixlink::Int
suffixindex::Int
end
Node() = Node(Dict(), 0, oo, 0, -1)
Node(start, ending) = Node(Dict(), start, ending, 0, -1)
""" Ukkonen Suffix-Tree """
mutable struct SuffixTree
nodes::Vector{Node}
text::Vector{Char}
root::Int
position::Int
currentnode::Int
needsuffixlink::Int
remainder::Int
activenode::Int
activelength::Int
activeedge::Int
end
edgelength(st, n::Node) = min(n.ending, st.position + 1) - n.start
function newnode(st, start, ending)
st.currentnode += 1
st.nodes[st.currentnode] = Node(start, ending)
return st.currentnode
end
function SuffixTree(str::String)
nodes = [Node() for _ in 1:length(str) * 2]
st = SuffixTree(nodes, [c for c in str], 1, 0, 0, 0, 0, 1, 1, 1)
st.root = newnode(st, 0, 0)
st.activenode = st.root
for i in 1:length(st.text)
extendsuffixtree(st, i)
end
setsuffixindexbyDFS(st, st.nodes[st.root], 0)
return st
end
function addsuffixlink(st, nodenum::Int)
if st.needsuffixlink > 0
st.nodes[st.needsuffixlink].suffixlink = nodenum
end
st.needsuffixlink = nodenum
end
activeedge(st) = st.text[st.activeedge]
function walkdown!(st, currnode::Int)
len = edgelength(st, st.nodes[currnode])
st.activelength < len && return false
st.activeedge += len
st.activelength -= len
st.activenode = currnode
return true
end
function extendsuffixtree(st, pos)
st.position = pos
st.needsuffixlink = 0
st.remainder += 1
while st.remainder > 0
st.activelength == 0 && (st.activeedge = st.position)
if !haskey(st.nodes[st.activenode].children, activeedge(st))
nodenum = newnode(st, st.position, oo)
st.nodes[st.activenode].children[activeedge(st)] = nodenum
addsuffixlink(st, st.activenode)
else
next = st.nodes[st.activenode].children[activeedge(st)]
walkdown!(st, next) && continue
if st.text[st.nodes[next].start + st.activelength] == st.text[pos]
addsuffixlink(st, st.activenode)
st.activelength += 1
break
end
splt = newnode(st, st.nodes[next].start, st.nodes[next].start + st.activelength)
st.nodes[st.activenode].children[activeedge(st)] = splt
nodenum = newnode(st, st.position, oo)
st.nodes[splt].children[st.text[pos]] = nodenum
st.nodes[next].start += st.activelength
st.nodes[splt].children[st.text[st.nodes[next].start]] = next
addsuffixlink(st, splt)
end
st.remainder -= 1
if st.activenode == st.root && st.activelength > 0
st.activelength -= 1
st.activeedge = st.position - st.remainder + 1
elseif st.activenode != st.root
st.activenode = st.nodes[st.activenode].suffixlink
end
end
end
function setsuffixindexbyDFS(st, node, labelheight, verbose=false)
verbose && node.start > 0 && print(st.text[node.start:min(node.ending, length(st.text))])
isleaf = true
for child in map(v -> st.nodes[v], collect(values(node.children)))
verbose && isleaf && node.start > 0 && println(" [", node.suffixindex, "]")
isleaf = false
setsuffixindexbyDFS(st, child, labelheight + edgelength(st, child))
end
if isleaf
idx = length(st.text) - labelheight
node.suffixindex = idx
verbose && println(" [$idx]")
end
end
function dotraversal(st)
maxheight, substringstartindices = 0, [0]
function traversal(node::Node, labelheight)
if node.suffixindex == -1
for child in map(v -> st.nodes[v], collect(values(node.children)))
traversal(child, labelheight + edgelength(st, child))
end
elseif maxheight < labelheight - edgelength(st, node)
maxheight = labelheight - edgelength(st, node)
substringstartindices = [node.suffixindex + 1]
elseif maxheight == labelheight - edgelength(st, node)
push!(substringstartindices, node.suffixindex + 1)
end
end
traversal(st.nodes[st.root], 0)
return maxheight, substringstartindices
end
function getlongestrepeatedsubstring(st::SuffixTree, label="", printresult=true)
len, starts = dotraversal(st)
substring = len == 0 ? "" :
join(unique(map(x -> String(st.text[x:x+len-1]), starts)), " (or) ")
if printresult
print(" ", label == "" ? String(st.text) : label, ": ")
println(len == 0 ? "No repeated substring." : substring)
end
return substring
end
function testsuffixtree()
tests = [
"CAAAABAAAABD\$",
"GEEKSFORGEEKS\$",
"AAAAAAAAAA\$",
"ABCDEFG\$",
"ABABABA\$",
"ATCGATCGA\$",
"banana\$",
"abcpqrabpqpq\$",
"pqrpqpqabab\$",
]
println("Longest Repeated Substring in:\n")
for test in tests
st = SuffixTree(test)
getlongestrepeatedsubstring(st)
end
println()
sπ = ""
setprecision(4000000) do
sπ = string(BigFloat(π))[3:end]
end
for number in [1000, 10000, 100000, 1000000]
text = sπ[1:number] * "\$"
@time begin
st = SuffixTree(text)
getlongestrepeatedsubstring(st, "first $number d.p. of π")
end
end
end
testsuffixtree()
You may also check:How to resolve the algorithm Elementary cellular automaton step by step in the Wren programming language
You may also check:How to resolve the algorithm Logical operations step by step in the REBOL programming language
You may also check:How to resolve the algorithm Integer sequence step by step in the Lambdatalk programming language
You may also check:How to resolve the algorithm Order disjoint list items step by step in the Tcl programming language
You may also check:How to resolve the algorithm Fivenum step by step in the 11l programming language