I need to find the longest repeated substring in a string. Let's say I have the string "bannana".
Wikipedia says the following:
In computer science, the longest repeated substring problem is the
problem of finding the longest substring of a string that occurs at
least twice. In the figure with the string "ATCGATCGA$", the longest
repeated substring is "ATCGA"
So I assume that for the string "bannana" there are two equally long repeated substrings (if not, please correct me): "an" and "na".
Wikipedia also says that suffix trees are used for this purpose. To be more specific, here is a quotation on how to do it (this seems more understandable to me than the definition on Wikipedia):
build a Suffix tree, then find the highest node with at least 2
descendants.
I've found several implementations of suffix trees. The following code is taken from here:
use strict;
use warnings;
use Data::Dumper;

sub classify {
    my ($f, $h) = (shift, {});
    for (@_) { push @{$h->{$f->($_)}}, $_ }
    return $h;
}

sub suffixes {
    my $str = shift;
    map { substr $str, $_ } 0 .. length($str) - 1;
}

sub suffix_tree {
    return +{} if @_ == 0;
    return +{ $_[0] => +{} } if @_ == 1;
    my $h = {};
    my $classif = classify sub { substr shift, 0, 1 }, @_;
    for my $key (sort keys %$classif) {
        my $subtree = suffix_tree(
            grep "$_", map { substr $_, 1 } @{$classif->{$key}}
        );
        my @subkeys = keys %$subtree;
        if (@subkeys == 1) {
            my $subkey = shift @subkeys;
            $h->{"$key$subkey"} = $subtree->{$subkey};
        } else { $h->{$key} = $subtree }
    }
    return $h;
}
print +Dumper suffix_tree suffixes 'bannana$';
For the string "bannana" it returns the following tree:
$VAR1 = {
    '$' => {},
    'n' => {
        'a' => {
            'na$' => {},
            '$' => {}
        },
        'nana$' => {}
    },
    'a' => {
        '$' => {},
        'n' => {
            'a$' => {},
            'nana$' => {}
        }
    },
    'bannana$' => {}
};
Another implementation is available online here; for the string "bannana" it returns the following tree:
7: a
5: ana
2: annana
1: bannana
6: na
4: nana
3: nnana
|(1:bannana)|leaf
tree:|
| |(4:nana)|leaf
|(2:an)|
| |(7:a)|leaf
|
| |(4:nana)|leaf
|(3:n)|
| |(5:ana)|leaf
3 branching nodes
Questions:
1. How can I get the "an" and "na" strings from those graphs?
2. As you can see, the trees are different. Are they equivalent or not? If yes, why do they differ; if not, which algorithm is correct?
3. If the Perl implementation is wrong, is there any working implementation in Perl/Python?
4. I've read about Ukkonen's algorithm, which is also mentioned on the page with the 2nd example (I could not tell whether the online version uses it). Do any of the mentioned examples use this algorithm? If not, is the algorithm they use slower, or does it have any drawbacks compared to Ukkonen's?
1. How can I get the "an" and "na" strings from those graphs?
build a Suffix tree, then find the highest node with at least 2 descendants.
The string of a node is the concatenation of the edge labels on the path from the root to that node. The highest node is the node whose string is longest.
See the tree in my answer to the second question. (3:n) has 2 descendants, and the path to it is (2:a)->(3:n); the concatenation is "an". Likewise, for (5:a) you get "na".
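To sanity-check the extraction, a brute-force scan (no suffix tree, just `str.find`) confirms that "an" and "na" are the longest repeated substrings of "bannana". A quick Python sketch, my own and not from either linked implementation:

```python
def longest_repeated(s):
    """All longest substrings of s that occur at least twice (occurrences may overlap)."""
    best = ['']
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            sub = s[i:j]
            # A later occurrence means sub is repeated
            if s.find(sub, i + 1) != -1:
                if len(sub) > len(best[0]):
                    best = [sub]
                elif len(sub) == len(best[0]) and sub not in best:
                    best.append(sub)
    return best

print(longest_repeated('bannana'))  # ['an', 'na']
```

This is O(n^3) and only useful as a cross-check against the tree-based answers.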
2. As you can see, the trees are different. Are they equivalent or not? If yes, why do they differ; if not, which algorithm is correct?
These trees are different. Rebuilding the second tree for the string "bannana$" (as in the first tree):
8: $
7: a$
5: ana$
2: annana$
1: bannana$
6: na$
4: nana$
3: nnana$
|(1:bannana$)|leaf
tree:|
| | |(4:nana$)|leaf
| |(3:n)|
| | |(7:a$)|leaf
|(2:a)|
| |(8:$)|leaf
|
| |(4:nana$)|leaf
|(3:n)|
| | |(6:na$)|leaf
| |(5:a)|
| | |(8:$)|leaf
|
|(8:$)|leaf
5 branching nodes
3. If the Perl implementation is wrong, is there any working implementation in Perl/Python?
I don't know Perl, but the tree is built correctly.
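For completeness, here is a straightforward Python port of the Perl code above (same recursive construction of a nested-dict tree), plus a walk that returns the path string of the deepest branching node. This is my own sketch, not taken from either linked implementation:

```python
def suffixes(s):
    """All suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]

def suffix_tree(strs):
    """Compressed trie of strs as nested dicts, mirroring the Perl version."""
    if not strs:
        return {}
    if len(strs) == 1:
        return {strs[0]: {}}
    groups = {}
    for s in strs:
        groups.setdefault(s[0], []).append(s[1:])
    tree = {}
    for head, tails in sorted(groups.items()):
        subtree = suffix_tree([t for t in tails if t])
        if len(subtree) == 1:  # single child: merge the edge labels
            (label, child), = subtree.items()
            tree[head + label] = child
        else:
            tree[head] = subtree
    return tree

def longest_repeated_substring(tree, path=''):
    """Path string of the deepest internal (branching) node."""
    best = path
    for label, child in tree.items():
        if child:  # internal node: path + label occurs at least twice
            cand = longest_repeated_substring(child, path + label)
            if len(cand) > len(best):
                best = cand
    return best

tree = suffix_tree(suffixes('bannana$'))
print(longest_repeated_substring(tree))  # 'an' ('na' is equally long)
```

Like the Perl original, this builds the tree in quadratic time; it answers the question but is no substitute for Ukkonen's algorithm on large inputs.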
4. I've read about Ukkonen's algorithm, which is also mentioned on the page with the 2nd example (I could not tell whether the online version uses it). Do any of the mentioned examples use this algorithm? If not, is the algorithm they use slower, or does it have any drawbacks compared to Ukkonen's?
I said earlier that I don't know Perl, but there is a line in the first algorithm that means it runs in at least O(n^2) time (n is the length of the string):
map { substr $str, $_ } 0 .. length($str) - 1;
Ukkonen's algorithm runs in linear time, O(n).
The first algorithm is also recursive, which may increase memory usage.
In the documentation of the compareTo function, I read:
Returns zero if this object is equal to the specified other object, a
negative number if it's less than other, or a positive number if it's
greater than other.
What do "less than" and "greater than" mean in the context of strings? Is, for example, "Hello World" less than the single character "a"?
val epicString = "Hello World"
println(epicString.compareTo("a")) //-25
Why -25 and not -10 or -1 (for example)?
Other examples:
val epicString = "Hello World"
println(epicString.compareTo("HelloWorld")) //-55
Is "Hello World" less than "HelloWorld"? Why? And why does it return -55 and not -1, -2, -3, etc.?
val epicString = "HelloWorld"
println(epicString.compareTo("Hello World")) //55
Is "HelloWorld" greater than "Hello World"? Why? And why does it return 55 and not 1, 2, 3, etc.?
I believe you're asking about the implementation of the compareTo method for java.lang.String. Here is the source code for Java 11:
public int compareTo(String anotherString) {
    byte v1[] = value;
    byte v2[] = anotherString.value;
    if (coder() == anotherString.coder()) {
        return isLatin1() ? StringLatin1.compareTo(v1, v2)
                          : StringUTF16.compareTo(v1, v2);
    }
    return isLatin1() ? StringLatin1.compareToUTF16(v1, v2)
                      : StringUTF16.compareToLatin1(v1, v2);
}
So we have delegation to either StringLatin1 or StringUTF16 here, so we should look further.
Fortunately, StringLatin1 and StringUTF16 have similar implementations when it comes to the compare functionality.
Here is the implementation for StringLatin1, for example:
public static int compareTo(byte[] value, byte[] other) {
    int len1 = value.length;
    int len2 = other.length;
    return compareTo(value, other, len1, len2);
}

public static int compareTo(byte[] value, byte[] other, int len1, int len2) {
    int lim = Math.min(len1, len2);
    for (int k = 0; k < lim; k++) {
        if (value[k] != other[k]) {
            return getChar(value, k) - getChar(other, k);
        }
    }
    return len1 - len2;
}
As you can see, it iterates over the characters of the shorter string, and if the characters at the same index of the two strings differ, it returns the difference between them. If during the iteration it doesn't find any difference (one string is a prefix of the other), it falls back to comparing the lengths of the two strings.
In your case, there is a difference in the first iteration already...
So it's the same as "H".compareTo("a") --> -25.
The code of "H" is 72
The code of "a" is 97
So, 72 - 97 = -25
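That logic can be sketched in a few lines of Python to experiment with; this is a hypothetical mirror of the Java implementation above, not the real code:

```python
def compare_to(a: str, b: str) -> int:
    """Mimic java.lang.String.compareTo for simple (Latin-1) strings:
    the first differing character decides; otherwise the length difference."""
    for ca, cb in zip(a, b):
        if ca != cb:
            return ord(ca) - ord(cb)
    return len(a) - len(b)

print(compare_to("Hello World", "a"))           # -25 (72 - 97)
print(compare_to("Hello World", "HelloWorld"))  # -55 (ord(' ') - ord('W'))
print(compare_to("ab", "abc"))                  # -1  (common prefix, shorter first)
```

Running the question's examples through it reproduces the puzzling numbers: they are just character-code differences.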
Short answer: The exact value doesn't have any meaning; only its sign does.
As the specification for compareTo() says, it returns a negative number if the receiver is smaller than the other object, a positive number if the receiver is larger, or 0 if the two are considered equal (for the purposes of this ordering).
The specification doesn't distinguish between different negative numbers, nor between different positive numbers, so neither should you. Some classes always return -1, 0, and 1, while others return different numbers; that's just an implementation detail, and implementations vary.
Let's look at a very simple hypothetical example:
class Length(val metres: Int) : Comparable<Length> {
    override fun compareTo(other: Length)
        = metres - other.metres
}
This class has a single numerical property, so we can use that property to compare them. One common way to do the comparison is simply to subtract the two lengths: that gives a number which is positive if the receiver is larger, negative if it's smaller, and zero if they're the same length, which is just what we need.
In this case, the value of compareTo() would happen to be the signed difference between the two lengths.
However, that method has a subtle bug: the subtraction could overflow, and give the wrong results if the difference is bigger than Int.MAX_VALUE. (Obviously, to hit that you'd need to be working with astronomical distances, both positive and negative — but that's not implausible. Rocket scientists write programs too!)
To fix it, you might change it to something like:
class Length(val metres: Int) : Comparable<Length> {
    override fun compareTo(other: Length) = when {
        metres > other.metres -> 1
        metres < other.metres -> -1
        else -> 0
    }
}
That fixes the bug; it works for all possible lengths.
But notice that the actual return value has changed in most cases: now it only ever returns -1, 0, or 1, and no longer gives an indication of the actual difference in lengths.
If this was your class, then it would be safe to make this change because it still matches the specification. Anyone who just looked at the sign of the result would see no change (apart from the bug fix). Anyone using the exact value would find that their programs were now broken — but that's their own fault, because they shouldn't have been relying on that, because it was undocumented behaviour.
Exactly the same applies to the String class and its implementation. While it might be interesting to poke around inside it and look at how it's written, the code you write should never rely on that sort of detail. (It could change in a future version. Or someone could apply your code to another object which didn't behave the same way. Or you might want to expand your project to be cross-platform, and discover the hard way that the JavaScript implementation didn't behave exactly the same as the Java one.)
In the long run, life is much simpler if you don't assume anything more than the specification promises!
I have a list of strings
List("cbda","xyz","jlki","badce")
I want to sort the characters of each string so that odd-length strings are sorted in descending order and even-length strings in ascending order:
List("abcd","zyx","ijkl","edcba")
I have implemented this by iterating over each element separately, finding its length, and sorting it accordingly, storing the results in a separate list. Is there a more efficient or shorter way to do this in Scala (like the list comprehensions we have in Python)?
You can do it with sortWith and map:
list.map(s => {if(s.length % 2 == 0) s.sortWith(_ < _) else s.sortWith(_ > _)})
I'm not sure what you're referring to in Python, so details could help if the examples below don't match your expectations.
The first one makes you go through the list twice:
List("cbda","xyz","jlki","badce").map(_.sorted).map {
    case even if even.length % 2 == 0 => even
    case odd => odd.reverse
}
Or one that goes through the odd-length elements twice:
List("cbda","xyz","jlki","badce").map {
    case even if even.length % 2 == 0 => even.sorted
    case odd => odd.sorted.reverse
}
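Since the question mentions Python list comprehensions: for comparison only, the same transformation is a one-liner in Python (not Scala), using `sorted` with `reverse` keyed on the length's parity:

```python
words = ["cbda", "xyz", "jlki", "badce"]

# Even-length strings sorted ascending, odd-length descending
result = ["".join(sorted(w, reverse=len(w) % 2 == 1)) for w in words]
print(result)  # ['abcd', 'zyx', 'ijkl', 'edcba']
```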
"When you've found the treasure, stop digging!"
I want to use more functional programming in Groovy, and thought rewriting the following method would be good training. It's harder than it looks, because Groovy doesn't appear to build short-circuiting into its more functional features.
Here's an imperative function to do the job:
fullyQualifiedNames = ['a/b/c/d/e', 'f/g/h/i/j', 'f/g/h/d/e']

String shortestUniqueName(String nameToShorten) {
    def currentLevel = 1
    String shortName = ''
    def separator = '/'
    while (fullyQualifiedNames.findAll { fqName ->
        shortName = nameToShorten.tokenize(separator)[-currentLevel..-1].join(separator)
        fqName.endsWith(shortName)
    }.size() > 1) {
        ++currentLevel
    }
    return shortName
}
println shortestUniqueName('a/b/c/d/e')
Result: c/d/e
It scans a list of fully-qualified filenames and returns the shortest unique form. There are potentially hundreds of fully-qualified names.
As soon as the method finds a short name with only one match, that short name is the right answer, and the iteration can stop. There's no need to scan the rest of the name or do any more expensive list searches.
But turning to a more functional flow in Groovy, neither return nor break can drop you out of the iteration:
return simply returns from the present iteration, not from the whole .each so it doesn't short-circuit.
break isn't allowed outside of a loop, and .each {} and .eachWithIndex {} are not considered loop constructs.
I can't use .find() instead of .findAll() because my program logic requires that I scan all elements of the list, not just stop at the first.
There are plenty of reasons not to use try..catch blocks, but the best I've read is from here:
Exceptions are basically non-local goto statements with all the
consequences of the latter. Using exceptions for flow control
violates the principle of least astonishment, make programs hard to read
(remember that programs are written for programmers first).
Some of the usual ways around this problem are detailed here including a solution based on a new flavour of .each. This is the closest to a solution I've found so far, but I need to use .eachWithIndex() for my use case (in progress.)
Here's my own poor attempt at a short-circuiting functional solution:
fullyQualifiedNames = ['a/b/c/d/e', 'f/g/h/i/j', 'f/g/h/d/e']

def shortestUniqueName(String nameToShorten) {
    def found = ''
    def final separator = '/'
    def nameComponents = nameToShorten.tokenize(separator).reverse()
    nameComponents.eachWithIndex { String _, int i ->
        if (!found) {
            def candidate = nameComponents[0..i].reverse().join(separator)
            def matches = fullyQualifiedNames.findAll { String fqName ->
                fqName.endsWith candidate
            }
            if (matches.size() == 1) {
                found = candidate
            }
        }
    }
    return found
}
println shortestUniqueName('a/b/c/d/e')
Result: c/d/e
Please shoot me down if there is a more idiomatic way to short-circuit in Groovy that I haven't thought of. Thank you!
There's probably a cleaner looking (and easier to read) solution, but you can do this sort of thing:
String shortestUniqueName(String nameToShorten) {
    // Split the name to shorten, and make a list of all sequential combinations of elements
    nameToShorten.split('/').reverse().inject([]) { agg, l ->
        if (agg) agg + [agg[-1] + l] else agg << [l]
    }
    // Starting with the smallest element
    .find { elements ->
        fullyQualifiedNames.findAll { name ->
            name.endsWith(elements.reverse().join('/'))
        }.size() == 1
    }
    ?.reverse()
    ?.join('/')
    ?: ''
}
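For comparison, the same short-circuit can be expressed in Python with a lazy generator and `next`, which stops at the first candidate matching exactly one full name. This is my own sketch mirroring the endsWith logic above, not Groovy:

```python
fully_qualified_names = ['a/b/c/d/e', 'f/g/h/i/j', 'f/g/h/d/e']

def shortest_unique_name(name, sep='/'):
    parts = name.split(sep)
    # Candidates from shortest suffix to the full name: 'e', 'd/e', 'c/d/e', ...
    candidates = (sep.join(parts[-k:]) for k in range(1, len(parts) + 1))
    # next() consumes the generator lazily and stops at the first unique match
    return next((c for c in candidates
                 if sum(fq.endswith(c) for fq in fully_qualified_names) == 1),
                '')

print(shortest_unique_name('a/b/c/d/e'))  # c/d/e
```

Because generators are lazy, no candidate beyond the first unique one is ever tested, which is exactly the short-circuit the question is after.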
I'm looking for some pointers for writing a function (let's call it replaceGlobal) that takes an input string and a mapping of substrings to replacement values, and applies these mappings such that as many characters as possible from the input string are replaced. For example:
replaceGlobal("abcde", {
'a' -> 'w',
'abc' -> 'x',
'ab' -> 'y',
'cde' -> 'z'
})
would return "yz" by applying 'ab' -> 'y' and 'cde' -> 'z'.
The function will only apply one round of substitutions, so it can't replace a value and then use part of the replacement value as part of another substitution.
A greedy approach produces non-optimal results (shown here in Javascript):
"abcde".replace(/(abc|cde|ab|a)/g, function(x) {
    return {
        'a': 'w',
        'abc': 'x',
        'ab': 'y',
        'cde': 'z'
    }[x];
});
returns 'xde'
Any thoughts on a good starting point here?
I think the problem boils down to finding the lowest cost path in a weighted DAG constructed with the input string as a spine and other edges provided by the substitutions:
/------x------------\
/-----y------\ \
/---w--\ \ \ /-------z------\
0 -----> a ----> b -----> c -----> d ----> e ----> $
where edges along the spine have a cost of 1 but the other edges have cost zero.
But that may be overcomplicating things.
Seems to me that dynamic programming is the way to go. This is due to the restriction:
The function will only apply one round of substitutions, so it can't
replace a value and then use part of the replacement value as part of
another substitution.
Specifically, say you have some random string abcdefg as input. Now you apply some rule to substitute some middle part, say de -> x. Now you have abcxfg, where the only (smaller subproblems) strings you are now allowed to manipulate are abc and fg. And for repetitive substrings, you can then leverage memoization.
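That restriction is exactly what makes a left-to-right dynamic program work: the best way to process a prefix doesn't depend on how later characters are handled. A sketch in Python (my own, counting unreplaced characters as the cost to minimize):

```python
def replace_global(s, rules):
    """dp[i] = (unreplaced_char_count, output) for the best way to process s[:i]."""
    INF = float('inf')
    dp = [(INF, '')] * (len(s) + 1)
    dp[0] = (0, '')
    for i in range(len(s)):
        cost, out = dp[i]
        if cost == INF:
            continue
        # Option 1: keep s[i] as-is (costs one unreplaced character)
        if cost + 1 < dp[i + 1][0]:
            dp[i + 1] = (cost + 1, out + s[i])
        # Option 2: apply any rule matching at position i (costs nothing)
        for match, sub in rules.items():
            if s.startswith(match, i) and cost < dp[i + len(match)][0]:
                dp[i + len(match)] = (cost, out + sub)
    return dp[len(s)][1]

print(replace_global("abcde", {'a': 'w', 'abc': 'x', 'ab': 'y', 'cde': 'z'}))  # yz
```

With the rules in a dict this is O(len(s) * number_of_rules) plus the substring comparisons; a trie over the rules would remove that inner scan.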
Based on @Matt Timmermans' comments and the original DAG idea, here's what I came up with in JavaScript as a first attempt (I'm more interested in the algorithm itself than any specific language implementation):
const replaceGlobal = (str, dict) => {
  let open = [];                       // set of substitutions being actively explored
  let best = { value: [], weight: 0 }; // optimal path info

  // For each character in the input string, left to right
  for (let c of str) {
    // Add new nodes to `open` for all substitutions that start with `c`
    for (let entry of dict)
      if (entry.match[0] === c)
        open.push({
          value: best.value.concat(entry.sub),
          rest: entry.match,
          weight: best.weight
        });

    // Add current character onto best path
    best.value.push(c);
    ++best.weight;

    // For each `open` path, try to match against the current character
    let new_open = [];
    for (let o of open) {
      if (o.rest[0] === c) {
        if (o.rest.length > 1) { // still more to match
          new_open.push({
            rest: o.rest.slice(1),
            value: o.value,
            weight: o.weight
          });
        } else { // full match found
          if (o.weight < best.weight)
            best = o;
        }
      }
    }
    open = new_open;
  }
  return best.value.join('');
};
Which would be used:
replaceGlobal('abcde', [
  { match: 'a', sub: 'w' },
  { match: 'abc', sub: 'x' },
  { match: 'ab', sub: 'y' },
  { match: 'cde', sub: 'z' }
]) === 'yz'
It passes some simple unit tests, but I may be overlooking something silly and it still seems more complicated than needed.
You could also make dict a trie of characters to make looking up the matches easier (and do the same with open). Even with the trie, I believe this approach would still be O(str.length * dict.length) though.
I'd like to benchmark certain operations in Rust, but I seem to be having some trouble:
fn main() {
    let needle = (0..100).map(|_| "b").collect::<String>();
    let haystack = (0..100_000).map(|_| "a").collect::<String>();
    println!("Data ready.");
    for _ in 0..1_000_000 {
        if haystack.contains(&needle) {
            // Stuff...
        }
    }
}
The above takes a very long time to complete while the same operation in Ruby finishes in around 4.5 seconds:
needle = 'b' * 100
haystack = 'a' * 100_000
puts 'Data ready.'
1_000_000.times do
  haystack.include? needle
end
I can't help but think that I'm doing something fundamentally wrong.
What would be the proper way to do this in Rust?
rustc 1.0.0 (a59de37e9 2015-05-13) (built 2015-05-14)
ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
A fix for this issue was merged today. That means it should be part of the next nightly and is expected to be released in Rust 1.3. The fix revived the Two-way substring search implementation that Rust used to have, adapting it to the new Pattern API in the standard library.
The Two-way algorithm is a good match for Rust's libcore since it is a linear time substring search algorithm that uses O(1) space and needs no dynamic allocation.
The particular implementation contains a simple addition that will reject this particular query in the question extremely quickly (and no, it was not written because of this question, it was part of the old code too).
During setup the searcher computes a kind of fingerprint for the needle: For each byte in the needle, take its low 6 bits, which is a number 0-63, then set the corresponding bit in the u64 variable byteset.
let byteset = needle.iter().fold(0, |a, &b| (1 << ((b & 0x3f) as usize)) | a);
Since the needle only contains 'b's, the value of byteset will have only the 34th bit set (98 & 63 == 34).
Now we can test any byte whether it can possibly be part of the needle or not. If its corresponding bit isn't set in byteset, the needle cannot match. Each byte we test in the haystack in this case will be 'a' (97 & 63 == 33), and it can't match. So the algorithm will read a single byte, reject it, and then skip the needle's length.
fn byteset_contains(&self, byte: u8) -> bool {
    (self.byteset >> ((byte & 0x3f) as usize)) & 1 != 0
}

// Quickly skip by large portions unrelated to our substring
if !self.byteset_contains(haystack[self.position + needle.len() - 1]) {
    self.position += needle.len();
    continue 'search;
}
From libcore/str/pattern.rs in rust-lang/rust
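To see concretely why this query is rejected so quickly, the fingerprint trick is easy to reproduce in a few lines of Python (illustrative only; the real implementation is the Rust code above):

```python
def byteset(needle: bytes) -> int:
    # Set one bit per "low 6 bits" value occurring in the needle
    bs = 0
    for b in needle:
        bs |= 1 << (b & 0x3F)
    return bs

def byteset_contains(bs: int, byte: int) -> bool:
    # A byte whose bit is unset can't appear anywhere in the needle
    return (bs >> (byte & 0x3F)) & 1 != 0

bs = byteset(b"b" * 100)
print(bs == 1 << 34)                   # True: only bit 34 is set (98 & 63 == 34)
print(byteset_contains(bs, ord("a")))  # False: 'a' (97 & 63 == 33) can't be in the needle
```

Every probe into the all-'a' haystack fails this test, so the searcher advances by the needle's full length each time instead of comparing 100 bytes.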