How do Suffix Trees work?

I'm going through the data structures chapter in The Algorithm Design Manual and came across Suffix Trees.
The example states:
Input:
XYZXYZ$
YZXYZ$
ZXYZ$
XYZ$
YZ$
Z$
$
Output: a suffix tree over these suffixes (shown as an image in the original question)
I'm not able to understand how that tree gets generated from the given input string. Suffix trees are used to find a given Substring in a given String, but how does the given tree help towards that? I do understand another given example of a trie shown below, but if the below trie gets compacted to a suffix tree, then what would it look like?

The standard efficient algorithms for constructing a suffix tree are definitely nontrivial. The main algorithm for doing so is called Ukkonen's algorithm and is a modification of the naive algorithm with two extra optimizations. You are probably best off reading this earlier question for details on how to build it.
You can construct suffix trees by using the standard insertion algorithms on radix tries to insert each suffix into the tree, but doing so will take O(n²) time, which can be expensive for large strings.
As for doing fast substring searching, remember that a suffix tree is a compressed trie of all the suffixes of the original string (plus some special end-of-string marker). If a string S is a substring of the initial string T, and you have a trie of all the suffixes of T, then you can just do a search to see whether S is a prefix of any of the strings in that trie. If so, then S must be a substring of T, since all of its characters occur in sequence somewhere in T. The suffix tree substring search algorithm is precisely this search applied to the compressed trie, where you follow the appropriate edges at each step.
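The idea can be sketched with a plain, uncompressed suffix trie (a hedged Python illustration, not the compressed suffix tree or Ukkonen's algorithm; the dict-based trie and the function names are my own):

```python
def build_suffix_trie(t):
    """Insert every suffix of t (plus a '$' terminator) into a nested-dict trie.

    This is the naive O(n^2) construction mentioned above, kept simple
    on purpose to illustrate the search idea.
    """
    root = {}
    t = t + "$"
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})
    return root

def contains_substring(root, s):
    """Walk one edge per character of s; success means s prefixes some suffix,
    i.e. s is a substring of the original string."""
    node = root
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("XYZXYZ")
print(contains_substring(trie, "ZXY"))  # True: "ZXY" occurs inside "XYZXYZ"
print(contains_substring(trie, "XZ"))   # False: "XZ" never occurs
```

A real suffix tree performs exactly this walk, but over compressed edges labeled with whole fragments instead of single characters.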
Hope this helps!

I'm not able to understand how that tree gets generated from the given input string.
You essentially create a patricia trie with all the suffixes you've listed. When inserting into a patricia trie, you search the root for a child starting with the first character of the input string; if one exists, you continue down the tree, and if it doesn't, you create a new node off the root. The root will have as many children as there are unique characters in the input string ($, a, e, h, i, n, r, s, t, w). You then continue that process for each character in the input string.
Suffix trees are used to find a given Substring in a given String, but how does the given tree help towards that?
If you are looking for the substring "hen", start searching from the root for a child that starts with "h". Then compare the input against the string in child "h" character by character until you either reach the end of that string or hit a mismatch. If you match all of child "h" (i.e. the input "hen" matches the "he" stored in child "h"), then move on to the children of "h" looking for "n"; if no child begins with "n", the substring doesn't exist.
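That walk over edge labels can be sketched roughly like this (a hypothetical Python sketch over a tiny hand-built compressed trie; the node layout and the find helper are my own illustration, not production code):

```python
# A tiny compressed trie: edge labels are whole fragments ("he", then "n$"),
# matching the "hen" walkthrough above. Only two root branches are shown.
tree = {
    "he": {"ir$": {}, "n$": {}, "re$": {}},
    "w":  {"as$": {}, "hen$": {}},
}

def find(node, s):
    """Match s against edge labels, descending when an edge is exhausted."""
    if not s:
        return True
    for label, child in node.items():
        # length of the shared prefix between this edge label and s
        common = 0
        while common < min(len(label), len(s)) and label[common] == s[common]:
            common += 1
        if common == 0:
            continue                      # wrong branch, try the next sibling
        if common == len(s):
            return True                   # pattern exhausted inside this edge
        if common == len(label):
            return find(child, s[common:])  # edge exhausted, descend
        return False                      # mismatch in the middle of an edge
    return False

print(find(tree, "hen"))  # True: matches "he", then the "n$" edge
print(find(tree, "hex"))  # False: fails after "he"
```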
Compact Suffix Trie code:
└── (black)
    ├── (white) as
    ├── (white) e
    │   ├── (white) eir
    │   ├── (white) en
    │   └── (white) ere
    ├── (white) he
    │   ├── (white) heir
    │   ├── (white) hen
    │   └── (white) here
    ├── (white) ir
    ├── (white) n
    ├── (white) r
    │   └── (white) re
    ├── (white) s
    ├── (white) the
    │   ├── (white) their
    │   └── (white) there
    └── (black) w
        ├── (white) was
        └── (white) when
Suffix Tree code:
String = the$their$there$was$when$
End of word character = $
└── (0)
    ├── (22) $
    ├── (25) as$
    ├── (9) e
    │   ├── (10) ir$
    │   ├── (32) n$
    │   └── (17) re$
    ├── (7) he
    │   ├── (2) $
    │   ├── (8) ir$
    │   ├── (31) n$
    │   └── (16) re$
    ├── (11) ir$
    ├── (33) n$
    ├── (18) r
    │   ├── (12) $
    │   └── (19) e$
    ├── (26) s$
    ├── (5) the
    │   ├── (1) $
    │   ├── (6) ir$
    │   └── (15) re$
    └── (29) w
        ├── (24) as$
        └── (30) hen$

A suffix tree basically just compacts runs of letters together when there are no choices to be made. For example, if you look at the right side of the trie in your question, after you've seen a w, there are really only two choices: was and when. In the trie, the as in was and the hen in when each still have one node per letter. In a suffix tree, you'd put those together into two nodes holding as and hen, so the right side of your trie would turn into a single w node with two children, one holding as and one holding hen.
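That compaction step (merging runs of single-child nodes into one edge label) can be sketched like this (a hedged Python illustration; the dict-based trie representation is an assumption of this sketch):

```python
def compact(node):
    """Merge every chain of single-child nodes into one multi-character edge."""
    out = {}
    for ch, child in node.items():
        label, cur = ch, child
        while len(cur) == 1:          # exactly one way forward: keep merging
            nxt_ch, nxt = next(iter(cur.items()))
            label += nxt_ch
            cur = nxt
        out[label] = compact(cur)     # branch point (or leaf): recurse
    return out

# The right side of the trie from the question: w -> a -> s and w -> h -> e -> n.
trie = {"w": {"a": {"s": {}}, "h": {"e": {"n": {}}}}}
print(compact(trie))  # {'w': {'as': {}, 'hen': {}}}
```

The w node keeps its two children, but the per-letter chains below it collapse into the single edges as and hen, exactly as described above.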


How to use a vector to cache results within a Haskell function?

I have a computationally expensive vector I want to index into inside a function, but since the table is never used anywhere else, I don't want to pass the vector around, but access the precomputed values like a memoized function.
The idea is:
cachedFunction :: Int -> Int
cachedFunction ix = table ! ix
  where table = <vector creation>
One aspect I've noticed is that all memoization examples I've seen deal with recursion, where even if a table is used to memoize, values in the table depend on other values in the table. This is not in my case, where computed values are found using a trial-and-error approach but each element is independent from another.
How do I achieve the cached table in the function?
You had it almost right. The problem is that your example is basically scoped like this:
                    ┌────────────────────────────────┐
cachedFunction ix = │ table ! ix                     │
                    │where table = <vector creation> │
                    └────────────────────────────────┘
i.e. table is not shared between different ix. This is regardless of the fact that it happens to not depend on ix (which is obvious in this example, but not in general). Therefore it would not be useful to keep it in memory, and Haskell doesn't do it.
But you can change that by pulling the ix argument into the result with its associated where-block:
cachedFunction = \ix -> table ! ix
  where table = <vector creation>
i.e.
                 ┌────────────────────────────────┐
cachedFunction = │ \ix -> table ! ix              │
                 │where table = <vector creation> │
                 └────────────────────────────────┘
or shorter,
cachedFunction = (<vector creation> !)
In this form, cachedFunction is a constant applicative form, i.e. despite having a function type it is treated by the compiler as a constant value. It's not a value you could ever evaluate to normal form, but it will keep the same table around (which can't depend on ix; it doesn't have it in scope) when using it for evaluating the lambda function inside.
According to this answer, GHC will never recompute values declared at the top level of a module. So by moving your table up to the top level of your module, it will be evaluated lazily (once) the first time it's ever needed, and then it will never be recomputed. We can see this behavior directly with Debug.Trace (the example uses a simple integer rather than a vector, for simplicity):
import Debug.Trace

cachedFunction :: Int -> Int
cachedFunction ix = table + ix

table = traceShow "Computed" 0

main :: IO ()
main = do
  print 0
  print $ cachedFunction 1
  print $ cachedFunction 2
Outputs:
0
"Computed"
1
2
We see that table is not computed until cachedFunction is called, and it's only computed once, even though we call cachedFunction twice.
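For comparison, here is a loose Python analogue of this compute-at-most-once behavior (hedged: Python is strict, so the sketch uses an explicit functools.cache instead of laziness, and all names are my own):

```python
from functools import cache

@cache
def table():
    """Built at most once; the print plays the role of traceShow."""
    print("Computed")
    return [i * i for i in range(10)]   # stand-in for the expensive vector

def cached_function(ix):
    return table()[ix]

print(cached_function(1))  # first call: prints "Computed", then 1
print(cached_function(2))  # table is reused: prints only 4
```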

Will the native buffer owned by BytesMut keep growing?

Suppose I have a BytesMut object and keep writing data into it, then split frames off of it according to the frame segmentation format.
As I understand it, this memory is constantly being reallocated. So the question is: I can make its capacity smaller by constantly splitting, but at what point is the contiguous memory actually freed?
If I keep splitting data off the front, won't the memory usage get larger and larger? Or maybe my understanding is wrong and different BytesMut objects have different native buffers after a split; but how is that done?
#[test]
fn test_bytesmut_growth() {
    use bytes::{BufMut, BytesMut};
    let mut bm = BytesMut::with_capacity(16);
    for _ in 0..10000 {
        bm.put(&b"1234567890"[..]);
        let front = bm.split();
        drop(front);
    }
    // println!("current cap={}, len={}", bm.capacity(), bm.len());
}
If you split a BytesMut object into two, the two objects will still share the underlying reference-counted buffer. Here's an attempt at a visualization, containing a few implementation details. Before splitting:
Underlying buffer, ref count 1
┌────────────────────────────────┐
│0123456789ABCDEFGHIJ            │
└▲───────────────────────────────┘
 │
 │  first
 │  ┌────────────┐
 ├──┤ptr         │
 │  │len: 20     │
 │  │cap: 32     │
 └──┤data        │
    └────────────┘
After calling let second = first.split_off(10), we will get
Underlying buffer, ref count 2
┌────────────────────────────────┐
│0123456789ABCDEFGHIJ            │
└▲─────────▲─────────────────────┘
 │         │
 │         └──────────┐
 │  first             │  second
 │  ┌────────────┐    │  ┌────────────┐
 ├──┤ptr         │    └──┤ptr         │
 │  │len: 10     │       │len: 10     │
 │  │cap: 10     │       │cap: 22     │
 ├──┤data        │    ┌──┤data        │
 │  └────────────┘    │  └────────────┘
 │                    │
 └────────────────────┘
Once we drop first, we will have
Underlying buffer, ref count 1
┌────────────────────────────────┐
│0123456789ABCDEFGHIJ            │
└▲─────────▲─────────────────────┘
 │         │
 │         │
 │         │       second
 │         │       ┌────────────┐
 │         └───────┤ptr         │
 │                 │len: 10     │
 │                 │cap: 22     │
 └─────────────────┤data        │
                   └────────────┘
If you now call second.reserve(10), or call an operation that implicitly reserves, like writing more than fits in the current capacity, the BytesMut implementation can detect that second actually owns its underlying buffer, since the reference count is one. The implementation then may be able to reuse spare capacity in the buffer by moving the existing buffer contents to the beginning, so after second.reserve(20), the result could look like this:
Underlying buffer, ref count 1
┌────────────────────────────────┐
│ABCDEFGHIJ                      │
└▲───────────────────────────────┘
 │
 │  second
 │  ┌────────────┐
 ├──┤ptr         │
 │  │len: 10     │
 │  │cap: 32     │
 └──┤data        │
    └────────────┘
However, the conditions for this optimization to be applied are not guaranteed. The documentation for reserve() states (emphasis mine)
Before allocating new buffer space, the function will attempt to reclaim space in the existing buffer. If the current handle references a view into a larger original buffer, and all other handles referencing part of the same original buffer have been dropped, then the current view can be copied/shifted to the front of the buffer and the handle can take ownership of the full buffer, provided that the full buffer is large enough to fit the requested additional capacity.
This optimization will only happen if shifting the data from the current view to the front of the buffer is not too expensive in terms of the (amortized) time required. The precise condition is subject to change; as of now, the length of the data being shifted needs to be at least as large as the distance that it’s shifted by. If the current view is empty and the original buffer is large enough to fit the requested additional capacity, then reallocations will never happen.
In summary, this optimization is only guaranteed if the reference count is one and the view is empty. This is the case in your example, so your code is guaranteed to reuse the buffer.
According to the documentation, BytesMut::split "Removes the bytes from the current view, returning them in a new BytesMut handle. Afterwards, self will be empty, but will retain any additional capacity that it had before the operation."
This is done by creating a new BytesMut (which is then owned by front) that contains exactly the bytes of bm, after which bm is modified so that it contains only the remaining empty capacity. This way, BytesMut::split doesn't allocate any new memory.
You then drop the BytesMut (owned by front), making it so that there is no view into the memory owned by the backing Vec, from the start of the Vec until the start of bm. When you then put, the implementation first checks if there is enough space before the view of bm, but still inside the backing Vec, and tries to store the data there.
Because the amount of memory you put is the same as the memory 'freed' by dropping front, the implementation is able to store that data before the view of bm, and no new memory is allocated.
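As a loose cross-language analogy (this is not the bytes crate; Python's memoryview merely illustrates the idea of several views sharing one underlying buffer without copying):

```python
# One allocation, two views over disjoint halves of it. Like BytesMut views,
# slicing a memoryview does not copy any bytes.
buf = bytearray(b"0123456789ABCDEFGHIJ")
first = memoryview(buf)[:10]
second = memoryview(buf)[10:]

# Both views reference the same underlying object; nothing was copied.
print(first.obj is buf and second.obj is buf)  # True

# Writing through one view is visible in the shared buffer.
first[0] = ord(b"x")
print(bytes(buf)[:3])  # b'x12'
```

The part the analogy cannot show is BytesMut's reference counting: when the last other view is dropped, reserve() can reclaim the spare space in front of the remaining view, as the diagrams above illustrate.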

Are variables used in nested functions considered global?

This may be a dumb question, so I apologise if so. This is for Julia, but I guess the question is not language specific.
There is advice in Julia that global variables should not be used in functions, but there is a case where I am not sure if a variable is global or local. I have a variable defined in a function, but is global for a nested function. For example, in the following,
a=2;
f(x)=a*x;
variable a is considered global. However, if we were to wrap this all in another function, would a still be considered global for f? For example,
function g(a)
    f(x) = a*x
end
We don't use a as an input for f, so it's global in that sense, but it's still only defined in the scope of g, so it's local in that sense. I am not sure. Thank you.
You can check directly that what @DNF commented is indeed the case (i.e. that the variable a is captured in a closure).
Here is the code:
julia> function g(a)
           f(x) = a*x
       end
g (generic function with 1 method)
julia> v = g(2)
(::var"#f#1"{Int64}) (generic function with 1 method)
julia> dump(v)
f (function of type var"#f#1"{Int64})
  a: Int64 2
In this example your function g returns a function. I bind the variable v to the returned function so that I can inspect it.
If you dump the value bound to the v variable you can see that the a variable is stored in the closure.
A variable stored in a closure should not be a problem for the performance of your code. This is a typical pattern, used e.g. when optimizing some function conditional on some parameter (captured in a closure).
As you can see in this code:
julia> @code_warntype v(10)
MethodInstance for (::var"#f#1"{Int64})(::Int64)
from (::var"#f#1")(x) in Main at REPL[1]:2
Arguments
#self#::var"#f#1"{Int64}
x::Int64
Body::Int64
1 ─ %1 = Core.getfield(#self#, :a)::Int64
│ %2 = (%1 * x)::Int64
└── return %2
everything is type stable so such code is fast.
There are some situations, though, in which boxing happens (they should be rare; they occur when your function is so complex that the compiler is not able to prove that boxing is not needed; most of the time it happens if you assign a value to a variable captured in a closure):
julia> function foo()
           x::Int = 1
           return bar() = (x = 1; x)
       end
foo (generic function with 1 method)
julia> dump(foo())
bar (function of type var"#bar#6")
  x: Core.Box
    contents: Int64 1
julia> @code_warntype foo()()
MethodInstance for (::var"#bar#1")()
from (::var"#bar#1")() in Main at REPL[1]:3
Arguments
#self#::var"#bar#1"
Locals
x::Union{}
Body::Int64
1 ─ %1 = Core.getfield(#self#, :x)::Core.Box
│ %2 = Base.convert(Main.Int, 1)::Core.Const(1)
│ %3 = Core.typeassert(%2, Main.Int)::Core.Const(1)
│ Core.setfield!(%1, :contents, %3)
│ %5 = Core.getfield(#self#, :x)::Core.Box
│ %6 = Core.isdefined(%5, :contents)::Bool
└── goto #3 if not %6
2 ─ goto #4
3 ─ Core.NewvarNode(:(x))
└── x
4 ┄ %11 = Core.getfield(%5, :contents)::Any
│ %12 = Core.typeassert(%11, Main.Int)::Int64
└── return %12
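The same capture can be observed in other languages. Here is a hedged Python analogue (CPython stores captured variables in closure cells, which play the role of the var"#f#1"{Int64} field inspected above):

```python
def g(a):
    def f(x):
        return a * x   # a is neither global nor a parameter of f: it is captured
    return f

v = g(2)
print(v(10))                           # 20
print(v.__closure__[0].cell_contents)  # the captured a: 2
```

As in Julia, a is not a global here; it lives in the closure object returned by g, so each call to g produces an independent f with its own captured value.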

Where clause in Haskell

I'm confused about how the where clause in Haskell works in a certain situation.
My biggest question is: is it possible to declare a variable that does something in the where clause, and then use that declared variable from another variable declared in the same where clause?
For example:
someFunc :: somefunc
...
  | (guard expression)
  | (guard expression)
  where a = 1 + 3
        b = a + 2 -- using the 'a' variable, which was also declared in the where clause
Is this possible? When I do this, Haskell does not report any error, but I was having doubts about whether it's correct.
Yes. The variables in the where clause can see other variables in the same where clause.
If in doubt, you can test it out with a simpler structure to see whether it gives the correct value:
testing = b
  where
    a = 1000
    b = a + 234

main = print testing
Does it print out 1234 as expected?
Yes, it is even possible to use the variable being defined inside its own defining expression; that would work as well. In essence, the variables are just references to "expressions". So for your case you construct something that looks like:
    ┏━━━━━━━┓
b──>┃  (+)  ┃
    ┣━━━┳━━━┫   ┏━━━┓
    ┃ o ┃ o─╂──>┃ 2 ┃
    ┗━┿━┻━━━┛   ┗━━━┛
      │
      v
    ┏━━━━━━━┓
a──>┃  (+)  ┃
    ┣━━━┳━━━┫   ┏━━━┓
    ┃ o ┃ o─╂──>┃ 3 ┃
    ┗━┿━┻━━━┛   ┗━━━┛
      │
      │         ┏━━━┓
      ╰────────>┃ 1 ┃
                ┗━━━┛
This expression tree thus contains functions which point to other expression trees. By default Haskell will not evaluate these expressions eagerly: they are evaluated lazily, only when their value is actually needed. Furthermore, if you are interested in the value of b, you will first calculate the value of a, so the 1+3 expression will only be evaluated once. The same holds in the opposite direction: if you first evaluate a, then evaluating b will benefit from the fact that a was already calculated. You can even define two variables in terms of each other, like:
foo :: Int
foo = a
  where a = 1 + b
        b = 1 + a
but this will get stuck in an infinite loop since you will create an expression that looks like 1 + (1 + (1 + (...))).
We can even define a variable in terms of itself. For example, the definition below generates an infinite list of ones:
ones :: [Int]
ones = lst
  where lst = 1 : lst
This will be represented as:
      ┏━━━━━━━┓
lst──>┃  (:)  ┃<─╮
      ┣━━━┳━━━┫  │
      ┃ o ┃ o─╂──╯
      ┗━┿━┻━━━┛
        │
        v
      ┏━━━┓
      ┃ 1 ┃
      ┗━━━┛
calcProfit revenue cost = if revenue - cost > 0
                          then revenue - cost
                          else 0
In this example we are repeating the (revenue - cost) computation. This is a cheap operation, but if it were an expensive operation we would be wasting resources. To prevent this, we use a where clause:
calcProfit revenue cost = if profit > 0
                          then profit
                          else 0
  where profit = revenue - cost
With the where clause we reverse the normal order used to write variables. In most programming languages, variables are declared before they're used. In Haskell, because of referential transparency, variable order isn't an issue.
As you can see, we declare the variable profit in the where clause and then use it.

How to force grouping of monadic verbs?

I came up with an incorrect J verb in my head, which would find the proportion of redundant letters in a string. I started with just a bunch of verbs with no precedence defined, and tried grouping inwards:
c=. 'cool' NB. The test data string, 1/4 is redundant.
box =. 5!:2 NB. The verb to show the structure of another verb in a box.
p=.%#~.%# NB. First attempt. Meant to read "inverse of (tally of unique divided by tally)".
box < 'p'
┌─┬─┬────────┐
│%│#│┌──┬─┬─┐│
│ │ ││~.│%│#││
│ │ │└──┴─┴─┘│
└─┴─┴────────┘
p2=.%(#~.%#) NB. The first tally is meant to be in there with the nub sieve, so paren everything after the inverse monad.
box < 'p2'
┌─┬────────────┐
│%│┌─┬────────┐│
│ ││#│┌──┬─┬─┐││
│ ││ ││~.│%│#│││
│ ││ │└──┴─┴─┘││
│ │└─┴────────┘│
└─┴────────────┘
p3=. %((#~.)%#) NB. The first tally is still not grouped with the nub sieve, so paren the two together directly.
box < 'p3'
┌─┬────────────┐
│%│┌──────┬─┬─┐│
│ ││┌─┬──┐│%│#││
│ │││#│~.││ │ ││
│ ││└─┴──┘│ │ ││
│ │└──────┴─┴─┘│
└─┴────────────┘
p3 c NB. Looks about right, so test it!
|length error: p3
| p3 c
(#~.)c NB. Unexpected error, but I guessed as to what part had the problem.
|length error
| (#~.)c
My question is, why did my approach to grouping fail with this length error, and how should I have grouped it to get the desired effect?
(I assume it is something to do with turning it into a hook instead of grouping, or it just not realising it needs to use the monad forms, but I don't know how to verify or get around it if so.)
Fork and compose.
(# ~.) is a hook. This is probably what you're not expecting. (# ~.) 'cool' is applying ~. to 'cool' to give you 'col'. But as it's a monadic hook, it is then attempting 'cool' # 'col', which isn't what you're intending and which gives a length error.
To get 0.25 as the ratio of redundant characters in a string, don't use the reciprocal (%). You just subtract from 1 the ratio of unique characters. This is pretty straightforward with a fork:
(1 - #&~. % #) 'cool'
0.25
p9 =. 1 - #&~. % #
box < 'p9'
┌─┬─┬──────────────┐
│1│-│┌────────┬─┬─┐│
│ │ ││┌─┬─┬──┐│%│#││
│ │ │││#│&│~.││ │ ││
│ │ ││└─┴─┴──┘│ │ ││
│ │ │└────────┴─┴─┘│
└─┴─┴──────────────┘
Compose (&) ensures that you tally (#) the nub (~.) together, so that the fork grabs it as a single verb. The fork is a series of three verbs that applies the first and third verb, and then applies the middle verb to the results. So #&~. % # is the fork, where #&~. is applied to the string, resulting in 3. # is applied, resulting in 4. And then % is applied to those results, as 3 % 4, giving you 0.75. That's our ratio of unique characters.
1 - is just to get us 0.25 instead of 0.75. % 0.75 is the same as 1 % 0.75, which gives you 1.33333.
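The arithmetic can be cross-checked outside J (a Python sketch, purely as an illustration of the ratio being computed):

```python
# Ratio of redundant characters = 1 - (unique characters / total characters).
s = "cool"
ratio = 1 - len(set(s)) / len(s)  # 1 - 3/4
print(ratio)  # 0.25
```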
