How to resolve the algorithm Ukkonen’s suffix tree construction step by step in the Julia programming language

Published on 22 June 2024 08:30 PM

How to resolve the algorithm Ukkonen’s suffix tree construction step by step in the Julia programming language

Table of Contents

Problem Statement

Suffix Trees are very useful in numerous string processing and computational biology problems. The task is to create a function which implements Ukkonen’s algorithm to create a useful Suffix Tree as described: Using Arithmetic-geometric mean/Calculate Pi generate the first 1000, 10000, and 100000 decimal places of pi. Using your implementation with an alphabet of 0 through 9 (plus $ say to make the tree explicit) find the longest repeated string in each list. Time your results and demonstrate that your implementation is linear (i.e. that 10000 takes approx. 10 times as long as 1000). You may vary the size of the lists of decimal places of pi to give reasonable answers.

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Ukkonen’s suffix tree construction step by step in the Julia programming language

Overview

The code implements the Ukkonen suffix tree data structure, a space-efficient representation of a string that allows for efficient searching and substring analysis. It provides functions to construct a suffix tree from a given string, traverse the tree to find the longest repeated substring, and print the results.

Data Structures

SuffixTree: Represents the suffix tree. It contains:

  • nodes: A vector of Node objects representing the nodes of the tree.
  • text: The original input string.
  • root: Root node of the tree.
  • position: Current position in the text during construction.
  • currentnode: Current node being processed.
  • needsuffixlink, remainder, activenode, activelength, activeedge: Internal variables for suffix tree construction.

Node: Represents a node in the suffix tree. It contains:

  • children: A dictionary mapping characters to child node indices.
  • start, ending: Start and ending positions of the suffix in the text.
  • suffixlink: Index of the suffix link node.
  • suffixindex: Index of the longest repeated substring starting at this node (-1 if not a leaf).

Functions

Construction

new_node(st) creates a new node in the suffix tree.

SuffixTree(str) constructs a suffix tree from the given string. It uses the Ukkonen algorithm to incrementally build the tree character by character.

Traversal and Analysis

setsuffixindexbyDFS(st) assigns suffix indices to leaf nodes, indicating the start position of the longest repeated substring starting at that node. It uses depth-first search to traverse the tree.

dotraversal(st) finds the longest repeated substring in the tree by traversing it and identifying the maximum height of any non-leaf node.

getLongestRepeatedSubstring(st) calls dotraversal and returns the longest repeated substring, along with its start positions in the input string.

testsuffixtree() tests the suffix tree implementation on various input strings and prints the longest repeated substrings.

Example Usage

The code demonstrates the usage of the suffix tree by creating suffix trees for the given example strings and finding the longest repeated substrings. It also tests the performance of the tree on a large string derived from the digits of π.

Time Complexity

The worst-case time complexity for constructing a suffix tree is O(|s|^2), where |s| is the length of the input string. For typical texts, it is typically much faster, close to O(|s|).

Source code in the julia programming language

const oo = typemax(Int)

"""The suffix-tree's node."""
mutable struct Node
    children::Dict{Char, Int}
    start::Int
    ending::Int
    suffixlink::Int
    suffixindex::Int
end

Node() = Node(Dict(), 0, oo, 0, -1)
Node(start, ending) = Node(Dict(), start, ending, 0, -1)

""" Ukkonen Suffix-Tree """
mutable struct SuffixTree
    nodes::Vector{Node}
    text::Vector{Char}
    root::Int
    position::Int
    currentnode::Int
    needsuffixlink::Int
    remainder::Int
    activenode::Int
    activelength::Int
    activeedge::Int
end

edgelength(st, n::Node) = min(n.ending, st.position + 1) - n.start

function newnode(st, start, ending)
    st.currentnode += 1
    st.nodes[st.currentnode] = Node(start, ending)
    return st.currentnode
end

function SuffixTree(str::String)
    nodes = [Node() for _ in 1:length(str) * 2]
    st = SuffixTree(nodes, [c for c in str], 1, 0, 0, 0, 0, 1, 1, 1)
    st.root = newnode(st, 0, 0)
    st.activenode = st.root
    for i in 1:length(st.text)
        extendsuffixtree(st, i)
    end
    setsuffixindexbyDFS(st, st.nodes[st.root], 0)
    return st
end

function addsuffixlink(st, nodenum::Int)
    if st.needsuffixlink > 0
        st.nodes[st.needsuffixlink].suffixlink = nodenum
    end
    st.needsuffixlink = nodenum
end

activeedge(st) = st.text[st.activeedge]

function walkdown!(st, currnode::Int)
    len = edgelength(st, st.nodes[currnode])
    st.activelength < len && return false
    st.activeedge += len
    st.activelength -= len
    st.activenode = currnode
    return true
end

function extendsuffixtree(st, pos)
    st.position = pos
    st.needsuffixlink = 0
    st.remainder += 1
    while st.remainder > 0
        st.activelength == 0 && (st.activeedge = st.position)
        if !haskey(st.nodes[st.activenode].children, activeedge(st))
            nodenum = newnode(st, st.position, oo)
            st.nodes[st.activenode].children[activeedge(st)] = nodenum
            addsuffixlink(st, st.activenode)
        else
            next = st.nodes[st.activenode].children[activeedge(st)]
            walkdown!(st, next) && continue
            if st.text[st.nodes[next].start + st.activelength] == st.text[pos]
                addsuffixlink(st, st.activenode)
                st.activelength += 1
                break
            end
            splt = newnode(st, st.nodes[next].start, st.nodes[next].start + st.activelength)
            st.nodes[st.activenode].children[activeedge(st)] = splt
            nodenum = newnode(st, st.position, oo)
            st.nodes[splt].children[st.text[pos]] = nodenum
            st.nodes[next].start += st.activelength
            st.nodes[splt].children[st.text[st.nodes[next].start]] = next
            addsuffixlink(st, splt)
        end
        st.remainder -= 1
        if st.activenode == st.root && st.activelength > 0
            st.activelength -= 1
            st.activeedge = st.position - st.remainder + 1
        elseif st.activenode != st.root
            st.activenode = st.nodes[st.activenode].suffixlink
        end
    end
end

function setsuffixindexbyDFS(st, node, labelheight, verbose=false)
    verbose && node.start > 0 && print(st.text[node.start:min(node.ending, length(st.text))])
    isleaf = true
    for child in map(v -> st.nodes[v], collect(values(node.children)))
        verbose && isleaf && node.start > 0 && println(" [", node.suffixindex, "]")
        isleaf = false
        setsuffixindexbyDFS(st, child, labelheight + edgelength(st, child))
    end
    if isleaf
        idx = length(st.text) - labelheight
        node.suffixindex = idx
        verbose && println(" [$idx]")
    end
end

function dotraversal(st)
    maxheight, substringstartindices = 0, [0]
    function traversal(node::Node, labelheight)
        if node.suffixindex == -1
            for child in map(v -> st.nodes[v], collect(values(node.children)))
                traversal(child, labelheight + edgelength(st, child))
            end
        elseif maxheight < labelheight - edgelength(st, node)
            maxheight = labelheight - edgelength(st, node)
            substringstartindices = [node.suffixindex + 1]
        elseif maxheight == labelheight - edgelength(st, node)
            push!(substringstartindices, node.suffixindex + 1)
        end
    end
    traversal(st.nodes[st.root], 0)
    return maxheight, substringstartindices
end

function getlongestrepeatedsubstring(st::SuffixTree, label="", printresult=true)
    len, starts = dotraversal(st)
    substring = len == 0 ? "" :
        join(unique(map(x -> String(st.text[x:x+len-1]), starts)), " (or) ")
    if printresult
        print("  ", label == "" ? String(st.text) : label, ": ")
        println(len == 0 ? "No repeated substring." : substring)
    end
    return substring
end

function testsuffixtree()
    tests = [
        "CAAAABAAAABD\$",
        "GEEKSFORGEEKS\$",
        "AAAAAAAAAA\$",
        "ABCDEFG\$",
        "ABABABA\$",
        "ATCGATCGA\$",
        "banana\$",
        "abcpqrabpqpq\$",
        "pqrpqpqabab\$",
    ]
    println("Longest Repeated Substring in:\n")
    for test in tests
        st = SuffixTree(test)
        getlongestrepeatedsubstring(st)
    end
    println()
    sπ = ""
    setprecision(4000000) do
        sπ = string(BigFloat(π))[3:end]
    end
    for number in [1000, 10000, 100000, 1000000]
        text = sπ[1:number] * "\$"
        @time begin
            st = SuffixTree(text)
            getlongestrepeatedsubstring(st, "first $number d.p. of π")
        end
    end
end

testsuffixtree()


  

You may also check:How to resolve the algorithm Elementary cellular automaton step by step in the Wren programming language
You may also check:How to resolve the algorithm Logical operations step by step in the REBOL programming language
You may also check:How to resolve the algorithm Integer sequence step by step in the Lambdatalk programming language
You may also check:How to resolve the algorithm Order disjoint list items step by step in the Tcl programming language
You may also check:How to resolve the algorithm Fivenum step by step in the 11l programming language