padding in generalized suffix tree and implementation resource - string

The Wikipedia page says to use unique terminator strings $0, $1, …, $n-1 for a tree with n strings s1, ..., sn.
My question is: how do I deal with the case where string i+1 itself ends with a literal $i? For example, my first string s1 might be example$0. What is the clever way of handling this?
Also, the suffix tree implementations I have found are mostly for a single string, not for the generalized version. Given an implementation for a single string, how can one easily extend it?
Thank you!

1st question: if you're using Unicode, you may use PUA code points (http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Private_use_characters) that are not assigned in your environment. Starting at U+E000 would do. If you're using an 8-bit encoding, replace the '$' with a byte value that you know does not occur in your strings; \003 (end of text) sounds appropriate.
2nd question: run the same construction for each additional string, but start from the current tree instead of an empty one. The unique terminators guarantee that you'll never find yourself trying to split a leaf node.
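As a rough illustration of the first point, here is a minimal Python sketch that appends a distinct private-use code point to each input string. The function name add_terminators and the choice to reject inputs that already contain private-use characters are assumptions made for this example, not part of any particular suffix tree library.

```python
PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area


def add_terminators(strings):
    """Append a distinct private-use terminator to each string (hypothetical helper)."""
    if len(strings) > PUA_END - PUA_START + 1:
        raise ValueError("more strings than available private-use code points")
    terminated = []
    for i, s in enumerate(strings):
        # Refuse inputs that could collide with one of the terminators.
        if any(PUA_START <= ord(ch) <= PUA_END for ch in s):
            raise ValueError(f"string {i} already contains a private-use character")
        terminated.append(s + chr(PUA_START + i))
    return terminated
```

With this scheme a literal "$0" inside an input such as example$0 is ordinary text, because the terminator is a code point that the inputs are checked not to contain.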

Related

Get value of nth char in string in rust

How do I get the value of a character at position n in a string?
For example, if I had the string "Hello, world!", how would I get the value of the first character?
It's as simple as s.chars().nth(n), which returns an Option<char>.
However, beware that, as the docs say:
It’s important to remember that char represents a Unicode Scalar Value, and might not match your idea of what a ‘character’ is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust’s standard library, check crates.io instead.
See How to iterate over Unicode grapheme clusters in Rust?.
For the first character specifically, you can use s.chars().next().
If your string is ASCII-only, you can use as_bytes(): s.as_bytes()[n]. But I would not recommend that, as this is not future-proof (though this is faster, O(1) vs O(n)).

Longest common substring via suffix array: do we really need unique sentinels?

I am reading about LCP arrays and their use, in conjunction with suffix arrays, in solving the "Longest common substring" problem. This video states that the sentinels used to separate individual strings must be unique, and not be contained in any of the strings themselves.
Unless I am mistaken, the reason for this is so when we construct the LCP array (by comparing how many characters adjacent suffixes have in common) we don't count the sentinel value in the case where two sentinels happen to be at the same index in both the suffixes we are comparing.
This means we can write code like this:
    for each character c in the shortest suffix
        if suffix_1[c] == suffix_2[c]
            increment count of common characters
However, in order to facilitate this, we need to jump through some hoops to ensure we use unique sentinels, which I asked about here.
However, would a simpler (to implement) solution not be to simply count the number of characters in common, stopping when we reach the (single, unique) sentinel character, like this:
    set sentinel = '#'
    for each character c in the shortest suffix
        if suffix_1[c] == suffix_2[c]
            if suffix_1[c] != sentinel
                increment count of common characters
            else
                return
Or, am I missing something fundamental here?
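For concreteness, the single-sentinel idea above could be sketched in Python roughly as follows. The function name lcp_with_sentinel is made up for this example, and the sketch also stops at the first mismatch, which is what a common-prefix count needs.

```python
def lcp_with_sentinel(suffix_1, suffix_2, sentinel="#"):
    """Count leading characters two suffixes share, never counting the sentinel."""
    count = 0
    for a, b in zip(suffix_1, suffix_2):
        # Stop at the first mismatch, or as soon as both suffixes reach the sentinel.
        if a != b or a == sentinel:
            break
        count += 1
    return count
```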
Actually I just devised an algorithm that doesn't use sentinels at all: https://github.com/BurntSushi/suffix/issues/14
When concatenating the strings, also record the boundary indexes (e.g. for 3 strings of lengths 4, 2, and 5, the boundaries 4, 6, and 11 are recorded, so we know that concatenated_string[5] belongs to the second original string because 4 <= 5 < 6).
Then, to identify which original string every suffix belongs to, just do a binary search.
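A minimal Python sketch of this boundary lookup, using the standard bisect module, might look like the following. The helper names concat_with_boundaries and owner are invented for this example.

```python
from bisect import bisect_right
from itertools import accumulate


def concat_with_boundaries(strings):
    """Concatenate the strings and record the exclusive end index of each one."""
    text = "".join(strings)
    boundaries = list(accumulate(len(s) for s in strings))
    return text, boundaries


def owner(boundaries, position):
    """Return the 0-based index of the original string containing text[position]."""
    # bisect_right counts how many boundaries are <= position, which is
    # exactly the number of original strings that end before this position.
    return bisect_right(boundaries, position)
```

For strings of lengths 4, 2, and 5 the recorded boundaries are [4, 6, 11], and owner(boundaries, 5) returns 1, i.e. the second original string.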
The short version is "this is mostly an artifact of how suffix array construction algorithms work and has nothing to do with LCP calculations, so provided your suffix array building algorithm doesn't need those sentinels, you can safely skip them."
The longer answer:
At a high level, the basic algorithm described in the video goes like this:
1. Construct a generalized suffix array for the strings T1 and T2.
2. Construct an LCP array for that resulting suffix array.
3. Iterate across the LCP array, looking for adjacent pairs of suffixes that come from different strings.
4. Find the largest LCP between any two such strings; call it k.
5. Extract the first k characters from either of the two suffixes.
So, where do sentinels appear in here? They mostly come up in steps (1) and (2). The video alludes to using a linear-time suffix array construction algorithm (SACA). Most fast SACAs for generating suffix arrays for two or more strings assume, as part of their operation, that there are distinct endmarkers at the ends of those strings, and often the internal correctness of the algorithm relies on this. So in that sense, the endmarkers might need to get added in purely to use a fast SACA, completely independent of any later use you might have.
(Why do SACAs need this? Some of the fastest SACAs, such as the SA-IS algorithm, assume the last character of the string is unique, lexicographically precedes everything, and doesn't appear anywhere else. In order to use that algorithm with multiple strings, you need some sort of internal delimiter to mark where one string ends and another starts. That character needs to act as a strong "and we're now done with the first string" character, which is why it needs to lexicographically precede all the other characters.)
Assuming you're using a SACA as a black box this way, from this point forward, those sentinels are completely unnecessary. They aren't used to tell which suffix comes from which string (this should be provided by the SACA), and they can't be a part of the overlap between adjacent strings.
So in that sense, you can think of these sentinels as an implementation detail needed to use a fast SACA, which you'd need to do in order to get the fast runtime.
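To make that concrete, here is a small Python sketch of the steps listed in this answer in which the fast SACA is swapped for a naive sort of the suffixes. Because each suffix is built from its own string, it can never run past the end of that string, so no sentinel is involved. This is far slower than a linear-time construction and is only meant to illustrate the point; the function names are made up for the example.

```python
def common_prefix_length(a, b):
    """Length of the longest common prefix of two strings."""
    k = 0
    for x, y in zip(a, b):
        if x != y:
            break
        k += 1
    return k


def longest_common_substring(t1, t2):
    """Naive longest common substring via sorted suffixes (no sentinels)."""
    # Tag every suffix with the string it came from (0 for t1, 1 for t2).
    suffixes = [(t1[i:], 0) for i in range(len(t1))]
    suffixes += [(t2[i:], 1) for i in range(len(t2))]
    suffixes.sort()  # naive stand-in for a real suffix array construction
    best = ""
    # Only adjacent suffixes that come from different strings are candidates.
    for (a, owner_a), (b, owner_b) in zip(suffixes, suffixes[1:]):
        if owner_a == owner_b:
            continue
        k = common_prefix_length(a, b)
        if k > len(best):
            best = a[:k]
    return best
```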

How does punycode distinguish similar IRIs?

I've been looking into internationalised resource identifiers and there's one thing bugging me.
My understanding is that, for each label in a domain name (xyzzy.plugh.com has three labels, xyzzy, plugh and com), the following process is performed to translate it into ASCII representation so that it can be processed okay by all legacy software:
If it consists solely of ASCII characters, it's copied as is.
Otherwise:
First we output xn-- followed by all the ASCII characters (skipping non-ASCII).
Then, if the final character isn't -, we output - to separate the ASCII from non-ASCII.
Finally, we encode each of the non-ASCII characters using punycode so that they appear to be ASCII.
My question then is: how do we distinguish between the following two Unicode URIs?
http://aa☃.net/
http://☃aa.net/
It seems to me that both of these will encode to:
http://xn--aa-nfh.net/
simply because the sequencing information has been lost for the label as a whole.
Or am I missing something in the specification?
According to one punycode encoder, they are encoded differently:
aa☃.net -> xn--aa-gsx.net
☃aa.net -> xn--aa-esx.net
The relevant RFC 3492 details why this is the case. First, it provides clues in the introduction:
Uniqueness: There is at most one basic string that represents a given extended string.
Reversibility: Any extended string mapped to a basic string can be recovered from that basic string.
That means there must be a distinct one-to-one mapping for every basic/extended string pair.
Understanding how it differentiates the two possibilities requires an understanding of how the decoder (the thing that turns the basic string back into an extended one, in all its Unicode glory) works.
The decoder begins with just the basic code points of the label, aa, and a pointer to the first a, then applies a series of deltas, such as gsx or esx.
The delta actually encodes two things. The first is the number of non-insertions to be done and the second is the actual insertion.
So, gsx (the delta in aa☃.net) would encode two non-insertions (to skip the aa) followed by an insertion of ☃. The esx delta (for ☃aa.net) would encode zero non-insertions followed by an insertion of ☃.
That is how position is encoded into the basic strings.
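The two encodings can be reproduced with the punycode codec in Python's standard library. This sketch encodes a single label and adds the xn-- prefix by hand; it skips the IDNA preparation steps a real resolver would apply, and the helper name to_ace_label is just a name chosen for this example.

```python
def to_ace_label(label):
    """Punycode-encode one label and add the ACE prefix (no IDNA preparation)."""
    return "xn--" + label.encode("punycode").decode("ascii")


def from_ace_label(ace_label):
    """Reverse of to_ace_label: strip the prefix and decode the punycode."""
    return ace_label[len("xn--"):].encode("ascii").decode("punycode")


for label in ("aa\u2603", "\u2603aa"):
    ace = to_ace_label(label)
    # Round-tripping recovers the original label, position of the snowman included.
    print(label, "->", ace, "->", from_ace_label(ace))
```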

What is the difference between binary safe strings and binary unsafe strings?

I was reading the Redis manifesto [1] and it seems Redis accepts only binary-safe strings as keys, but I don't know the difference between the two. Can anyone explain with an example?
[1] http://oldblog.antirez.com/post/redis-manifesto.html
According to Redis documentation, simple Redis strings have syntax "+redis_response\r\n" whereas bulk Redis strings have syntax "$str_len\r\nbinary_safe_string\r\n".
In other words, a binary-safe string in Redis can contain any data, from something as simple as "foo" to arbitrary binary data of up to 512 MB, such as a JPEG image. A binary-safe string has its length encoded alongside it and is not terminated by any particular character, unlike a NUL-terminated string in C, which ends with '\0'.
HTH,
Swanand
I'm not familiar with the system in question, but the term "binary safe string" might be used either to describe certain string-storage types or to describe particular string instances. In a binary-safe string type, a string of length N may be used to encapsulate any sequence of N values in the range either 0-255 or 0-65535 (for 8- or 16-bit types, respectively). A binary-safe string instance might be one whose representation may be subdivided into uniformly-sized pieces, with each piece representing one character, as distinct from a string instance in which different characters require different amounts of storage space.
Some string types (which are not binary safe) will use variable-length representations for certain characters, and will behave oddly if asked to act upon e.g. a string which contains the code for "first half of a multi-part character" followed by something other than a "second half of multi-part character". Further, some code which works with strings will assume that the Nth character will be stored in either the Nth byte or the Nth pair of bytes, and will malfunction if given a string in which, e.g. the 8th character is stored in the 12th and 13th pairs of bytes.
Looking only briefly at the link provided, I would guess that it's saying that Redis does not expect to work only with strings that use different numbers of bytes to hold different characters, though I'm not quite clear whether it's assuming that a string type will be able to handle any possible sequence of bytes, or whether it's assuming that any string instance which it's given may be safely regarded as a sequence of bytes. I think the fundamental concepts of interest, though, are (1) some string types use variable-length encodings and others do not; (2) even in types that use variable-length encodings, a useful subset of string instances will consist only of fixed-length characters.
Binary-safe means that a string can contain any byte value, while a binary-unsafe string cannot. In C, for example, '\0' marks the end of a string, so the characters before the '\0' and the characters after it are treated as two different strings.
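The contrast between a length-prefixed string and a terminator-based one can be sketched in Python as follows. The framing follows the "$str_len\r\nbinary_safe_string\r\n" syntax quoted above; the function names are invented for this example and this is not a Redis client.

```python
def encode_bulk_string(payload: bytes) -> bytes:
    """Frame a payload as $<length>\\r\\n<payload>\\r\\n."""
    return b"$" + str(len(payload)).encode("ascii") + b"\r\n" + payload + b"\r\n"


def decode_bulk_string(message: bytes) -> bytes:
    """Recover the payload using the length prefix, not a terminator."""
    header, _, rest = message.partition(b"\r\n")
    if not header.startswith(b"$"):
        raise ValueError("not a bulk string")
    length = int(header[1:])
    return rest[:length]


def read_c_string(buffer: bytes) -> bytes:
    """Mimic a C-style read that stops at the first NUL byte."""
    end = buffer.find(b"\0")
    return buffer if end == -1 else buffer[:end]


payload = b"foo\0bar\r\nbaz"
# The length-prefixed framing round-trips every byte, including NUL and CRLF.
print(decode_bulk_string(encode_bulk_string(payload)) == payload)
# The terminator-based read silently drops everything after the first NUL.
print(read_c_string(payload))
```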

caesar cipher check in ocaml

I want to implement a check function that, given two strings s1 and s2, checks whether s2 is the Caesar cipher of s1. The interface needs to look like string -> string -> bool.
The problem is that I am not allowed to use any string functions other than String.length, so how can I solve it? I am not permitted to use any lists, arrays, or iteration, only recursion and pattern matching.
Please help me. Also, can you tell me how I can write a substring function in OCaml, other than the module function, with the above restrictions?
My guess is that you are probably allowed to use s.[i] to get the ith character of string s. This is the same as String.get, but the instructor may not think of it in those terms. Without some way of getting the individual characters of the string, I believe that this is impossible. You should probably double check with your instructor to be sure, but I would be surprised if he had meant for you to be unable to separate a string into characters (which is something that you cannot do with pattern matching alone in OCaml).
Once you can get individual characters, the way to do it should be pretty clear (you do not need substring to traverse each string recursively).
If you still want to write substring, creating it would be complex since you don't have access to String.create or other similar functions. But you can write your own version of String.create using recursion, one-character string literals (like "x"), the ability to set a character in a string to another (like s.[0] <- c), and string concatenation (s1 ^ s2). Again, of course, all of this assumes that you are allowed to use those operators.

Resources