I'm looking for some pointers for writing a function (let's call it replaceGlobal) that takes an input string and a mapping of substrings to replacement values, and applies these mappings such that as many characters as possible from the input string are replaced. For example:
replaceGlobal("abcde", {
'a' -> 'w',
'abc' -> 'x',
'ab' -> 'y',
'cde' -> 'z'
})
would return "yz" by applying 'ab' -> 'y' and 'cde' -> 'z'.
The function will only apply one round of substitutions, so it can't replace a value and then use part of the replacement value as part of another substitution.
A greedy approach produces non-optimal results (shown here in JavaScript):
"abcde".replace(/(abc|cde|ab|a)/g, function(x) {
    return {
        'a': 'w',
        'abc': 'x',
        'ab': 'y',
        'cde': 'z'
    }[x];
});
returns 'xde'
Any thoughts on a good starting point here?
I think the problem boils down to finding the lowest cost path in a weighted DAG constructed with the input string as a spine and other edges provided by the substitutions:
 /-----------x------------\
 /-------y-------\        \
 /---w---\       \        \ /------z-------\
0 -----> a ----> b -----> c -----> d ----> e ----> $
where edges along the spine have a cost of 1 but the other edges have cost zero.
But that may be overcomplicating things.
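For what it's worth, the DAG formulation can be computed with a single left-to-right relaxation over positions 0..n, since the graph is acyclic and all edges point forward. This is a language-neutral sketch in Python; the name `replace_via_dag` and the rules-as-dict shape are my own choices, not from any library:

```python
def replace_via_dag(s, rules):
    n = len(s)
    INF = float('inf')
    dist = [INF] * (n + 1)   # dist[i] = min cost of producing output for s[:i]
    prev = [None] * (n + 1)  # (source position, text emitted on that edge)
    dist[0] = 0
    for i in range(n):
        if dist[i] == INF:
            continue
        # spine edge: keep s[i] as-is, cost 1
        if dist[i] + 1 < dist[i + 1]:
            dist[i + 1] = dist[i] + 1
            prev[i + 1] = (i, s[i])
        # substitution edges: cost 0
        for pattern, replacement in rules.items():
            j = i + len(pattern)
            if s.startswith(pattern, i) and dist[i] < dist[j]:
                dist[j] = dist[i]
                prev[j] = (i, replacement)
    # walk the predecessor chain back from the end node
    pieces, i = [], n
    while i > 0:
        i, text = prev[i]
        pieces.append(text)
    return ''.join(reversed(pieces))
```

With the example rules this picks the zero-cost path through 'ab' -> 'y' and 'cde' -> 'z'.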
Seems to me that dynamic programming is the way to go. This is due to the restriction:
The function will only apply one round of substitutions, so it can't
replace a value and then use part of the replacement value as part of
another substitution.
Specifically, say you have some random string abcdefg as input. Now you apply some rule to substitute some middle part, say de -> x. Now you have abcxfg, where the only (smaller subproblems) strings you are now allowed to manipulate are abc and fg. And for repetitive substrings, you can then leverage memoization.
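The recursion over suffixes described above can be sketched with memoization keyed on the start position (Python used purely as pseudocode here; `replace_via_dp` and the cost convention, counting surviving input characters, are my own):

```python
from functools import lru_cache

def replace_via_dp(s, rules):
    # best(i) returns (cost, pieces) for the suffix s[i:], where cost is
    # the number of input characters that survive unreplaced
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(s):
            return (0, ())
        # option 1: keep s[i] unchanged, paying cost 1
        cost, pieces = best(i + 1)
        result = (cost + 1, (s[i],) + pieces)
        # option 2: apply any rule whose pattern matches at i, paying cost 0
        for pattern, replacement in rules.items():
            if s.startswith(pattern, i):
                cost, pieces = best(i + len(pattern))
                if cost < result[0]:
                    result = (cost, (replacement,) + pieces)
        return result
    return ''.join(best(0)[1])
```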
Based on @Matt Timmermans' comments and the original DAG idea, here's what I came up with in JavaScript as a first attempt (I'm more interested in the algorithm itself than any specific language implementation):
const replaceGlobal = (str, dict) => {
  let open = [];                       // set of substitutions being actively explored
  let best = { value: [], weight: 0 }; // optimal path info

  // For each character in the input string, left to right
  for (let c of str) {
    // Add new nodes to `open` for all substitutions that start with `c`
    for (let entry of dict)
      if (entry.match[0] === c)
        open.push({
          value: best.value.concat(entry.sub),
          rest: entry.match,
          weight: best.weight
        });

    // Add current character onto best path
    best.value.push(c);
    ++best.weight;

    // For each `open` path, try to match against the current character
    let new_open = [];
    for (let o of open) {
      if (o.rest[0] === c) {
        if (o.rest.length > 1) { // still more to match
          new_open.push({
            rest: o.rest.slice(1),
            value: o.value,
            weight: o.weight
          });
        } else { // full match found
          if (o.weight < best.weight)
            best = o;
        }
      }
    }
    open = new_open;
  }
  return best.value.join('');
};
Which would be used:
replaceGlobal('abcde', [
  { match: 'a', sub: 'w' },
  { match: 'abc', sub: 'x' },
  { match: 'ab', sub: 'y' },
  { match: 'cde', sub: 'z' }
]) === 'yz'
It passes some simple unit tests, but I may be overlooking something silly and it still seems more complicated than needed.
You could also make dict a trie of characters to make looking up the matches easier (and do the same with open). Even with the trie, I believe this approach would still be O(str.length * dict.length) though.
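A minimal sketch of that trie idea (Python for brevity; the node layout, a plain dict per node with a `'$'` key marking a complete pattern, is just one possible choice of my own):

```python
def build_trie(rules):
    # each node is a dict mapping next character -> child node;
    # a '$' key marks the end of a pattern and stores its replacement
    root = {}
    for pattern, replacement in rules.items():
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node['$'] = replacement
    return root

def matches_at(trie, s, i):
    # yield (length, replacement) for every pattern matching s at position i
    node = trie
    j = i
    while True:
        if '$' in node:
            yield (j - i, node['$'])
        if j == len(s) or s[j] not in node:
            return
        node = node[s[j]]
        j += 1
```

With `matches_at`, the per-character scan over `dict` becomes a single walk down the trie.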
Say I have a hash map m: HashMap<K, V>, a key k: K and a value v: V, and would like to do the following:
If m does not contain a value at index k, insert v at index k.
If m contains a value w at index k, apply a function fn combine(x: V, y: V) -> Option<V> to v and w, and:
If the result is None, remove the entry at index k from m.
If the result is Some(u), replace the value at index k by u.
Is there a way to do this "in-place", without calling functions that access, modify or remove the value at k multiple times?
I would also like to avoid copying data, so ideally one shouldn't need to clone v to feed the clones into insert and combine separately.
I could rewrite combine to use (mutable) references (or inline it), but the wish of not copying data still remains.
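For reference, the intended semantics in executable pseudocode (Python; the name `upsert_combine` is mine). The question is really about achieving this with a single map lookup and no clones:

```python
def upsert_combine(m, k, v, combine):
    # If k is absent, insert v; otherwise combine v with the existing
    # value, and either store the result or drop the entry on None.
    if k not in m:
        m[k] = v
        return
    u = combine(v, m[k])
    if u is None:
        del m[k]
    else:
        m[k] = u
```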
Digging deeper into the Entry documentation, I noticed that the variants of the Entry enum offer functions to modify, remove or insert entries in-place.
After taking std::collections::hash_map::Entry into scope, one could do the following:
match m.entry(k) {
    Entry::Occupied(mut oe) => {
        let w = oe.get_mut();
        match combine(v, w) {
            Some(u) => { *w = u; },
            None => { oe.remove_entry(); },
        }
    },
    Entry::Vacant(ve) => { ve.insert(v); },
}
(Here is a PoC in the Rust playground.)
This, however, requires combine to take a (mutable) reference as its second argument (which is fine in my case).
I managed to do it in one access, one write and one key-deletion in total in the worst case. The last key-deletion should not be necessary, but I'm not certain it can be done. I gave it my best so far. I hope this helps!
Okay, so I think we want to use the Entry API.
The full method list for Entry is here.
I think we'd do it in the following order:
If m contains a value w at index k: modify or remove it (the two sub-steps from the question).
Otherwise: insert v at index k.
This can be done by using .and_modify and then .or_insert. Something like this:
let mut map = HashMap::new();
// ... populate the map ...

// Our important bit:
let mut delete_entry = false;
map.entry(k)
    .and_modify(|w| { // If the entry exists, we modify it
        match combine(v, w) {
            Some(y) => *w = y,
            None => delete_entry = true,
        }
    })
    .or_insert(v); // If it doesn't, we insert v
if delete_entry {
    map.remove(&k);
}
I don't think there's a way to do all three things without that last map.remove access, so this is my best attempt for now.
"When you've found the treasure, stop digging!"
I want to use more functional programming in Groovy, and thought rewriting the following method would be good training. It's harder than it looks, because Groovy doesn't appear to build short-circuiting into its more functional features.
Here's an imperative function to do the job:
fullyQualifiedNames = ['a/b/c/d/e', 'f/g/h/i/j', 'f/g/h/d/e']
String shortestUniqueName(String nameToShorten) {
    def currentLevel = 1
    String shortName = ''
    def separator = '/'
    while (fullyQualifiedNames.findAll { fqName ->
        shortName = nameToShorten.tokenize(separator)[-currentLevel..-1].join(separator)
        fqName.endsWith(shortName)
    }.size() > 1) {
        ++currentLevel
    }
    return shortName
}
println shortestUniqueName('a/b/c/d/e')
Result: c/d/e
It scans a list of fully-qualified filenames and returns the shortest unique form. There are potentially hundreds of fully-qualified names.
As soon as the method finds a short name with only one match, that short name is the right answer, and the iteration can stop. There's no need to scan the rest of the name or do any more expensive list searches.
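In a language with ordinary loops the early exit is trivial, which is what makes the functional-Groovy version awkward. For comparison, a plain Python sketch (function and parameter names are mine):

```python
def shortest_unique_name(name, fully_qualified_names, sep='/'):
    parts = name.split(sep)
    # grow the suffix one component at a time; return as soon as it is unique
    for level in range(1, len(parts) + 1):
        short = sep.join(parts[-level:])
        matches = [fq for fq in fully_qualified_names if fq.endswith(short)]
        if len(matches) == 1:
            return short  # short-circuit: no longer suffixes are examined
    return name
```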
But turning to a more functional flow in Groovy, neither return nor break can drop you out of the iteration:
return simply returns from the present iteration, not from the whole .each so it doesn't short-circuit.
break isn't allowed outside of a loop, and .each {} and .eachWithIndex {} are not considered loop constructs.
I can't use .find() instead of .findAll() because my program logic requires that I scan all elements of the list, not just stop at the first.
There are plenty of reasons not to use try..catch blocks, but the best I've read is from here:
Exceptions are basically non-local goto statements with all the
consequences of the latter. Using exceptions for flow control
violates the principle of least astonishment, make programs hard to read
(remember that programs are written for programmers first).
Some of the usual ways around this problem are detailed here, including a solution based on a new flavour of .each. This is the closest to a solution I've found so far, but I need to use .eachWithIndex() for my use case (work in progress).
Here's my own poor attempt at a short-circuiting functional solution:
fullyQualifiedNames = ['a/b/c/d/e', 'f/g/h/i/j', 'f/g/h/d/e']
def shortestUniqueName(String nameToShorten) {
    def found = ''
    def final separator = '/'
    def nameComponents = nameToShorten.tokenize(separator).reverse()
    nameComponents.eachWithIndex { String _, int i ->
        if (!found) {
            def candidate = nameComponents[0..i].reverse().join(separator)
            def matches = fullyQualifiedNames.findAll { String fqName ->
                fqName.endsWith candidate
            }
            if (matches.size() == 1) {
                found = candidate
            }
        }
    }
    return found
}
println shortestUniqueName('a/b/c/d/e')
Result: c/d/e
Please shoot me down if there is a more idiomatic way to short-circuit in Groovy that I haven't thought of. Thank you!
There's probably a cleaner looking (and easier to read) solution, but you can do this sort of thing:
String shortestUniqueName(String nameToShorten) {
    // Split the name to shorten, and make a list of all sequential combinations of elements
    nameToShorten.split('/').reverse().inject([]) { agg, l ->
        if(agg) agg + [agg[-1] + l] else agg << [l]
    }
    // Starting with the smallest element
    .find { elements ->
        fullyQualifiedNames.findAll { name ->
            name.endsWith(elements.reverse().join('/'))
        }.size() == 1
    }
    ?.reverse()
    ?.join('/')
    ?: ''
}
I need to find the longest repeated substring in a string. Let's say I have the string "bannana".
Wikipedia says following:
In computer science, the longest repeated substring problem is the
problem of finding the longest substring of a string that occurs at
least twice. In the figure with the string "ATCGATCGA$", the longest
repeated substring is "ATCGA"
So I assume that for the string "bannana" there are two equally long repeated substrings (if not, please correct me): "an" and "na".
Wikipedia also says that suffix trees are used for this purpose. To be more specific, here is a quotation of how to do it (this seems more understandable to me than the definition on Wikipedia):
build a Suffix tree, then find the highest node with at least 2
descendants.
I've found several implementations of suffix trees. The following code is taken from here:
use strict;
use warnings;
use Data::Dumper;

sub classify {
    my ($f, $h) = (shift, {});
    for (@_) { push @{$h->{$f->($_)}}, $_ }
    return $h;
}

sub suffixes {
    my $str = shift;
    map { substr $str, $_ } 0 .. length($str) - 1;
}

sub suffix_tree {
    return +{} if @_ == 0;
    return +{ $_[0] => +{} } if @_ == 1;
    my $h = {};
    my $classif = classify sub { substr shift, 0, 1 }, @_;
    for my $key (sort keys %$classif) {
        my $subtree = suffix_tree(
            grep "$_", map { substr $_, 1 } @{$classif->{$key}}
        );
        my @subkeys = keys %$subtree;
        if (@subkeys == 1) {
            my $subkey = shift @subkeys;
            $h->{"$key$subkey"} = $subtree->{$subkey};
        } else { $h->{$key} = $subtree }
    }
    return $h;
}

print +Dumper suffix_tree suffixes 'bannana$';
print +Dumper suffix_tree suffixes 'bannana$';
For the string "bannana" it returns the following tree:
$VAR1 = {
    '$' => {},
    'n' => {
        'a' => {
            'na$' => {},
            '$' => {}
        },
        'nana$' => {}
    },
    'a' => {
        '$' => {},
        'n' => {
            'a$' => {},
            'nana$' => {}
        }
    },
    'bannana$' => {}
};
Another implementation is available online here; for the string "bannana" it returns the following tree:
7: a
5: ana
2: annana
1: bannana
6: na
4: nana
3: nnana
|(1:bannana)|leaf
tree:|
| |(4:nana)|leaf
|(2:an)|
| |(7:a)|leaf
|
| |(4:nana)|leaf
|(3:n)|
| |(5:ana)|leaf
3 branching nodes
Questions:
How can I get from those graphs "an" and "na" strings?
As you can see, the trees are different. Are they equivalent or not? If yes, why are they different; if not, which algorithm is correct?
If perl implementation is wrong is there any working implementation for perl/python?
I've read about Ukkonen's algorithm, which is also mentioned on the page with the 2nd example (I could not tell whether the online version uses it). Do any of the mentioned examples use this algorithm? If not, is the algorithm they use slower, or does it have any drawbacks compared to Ukkonen's?
1. How can I get from those graphs "an" and "na" strings?
build a Suffix tree, then find the highest node with at least 2 descendants.
The string for a node is the concatenation of the edge labels on the path from the root to that node; the "highest" node is the one whose string is longest.
See the tree in my answer to the second question. (3:n) has 2 descendants, and the path to it is (2:a)->(3:n); the concatenation is "an". Likewise for (5:a) you get "na".
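Using the nested-hash tree shape produced by the Perl code, the rule "deepest node with at least 2 descendants" can be sketched as a recursive walk (Python; note that edge labels containing '$' only lead to leaves, and leaves never have two children, so the terminator never ends up in the answer):

```python
def longest_repeated(tree, prefix=''):
    # the string for a node is the concatenation of edge labels from the
    # root; an internal node with >= 2 children spells a repeated substring
    best = prefix if len(tree) >= 2 else ''
    for label, child in tree.items():
        candidate = longest_repeated(child, prefix + label)
        if len(candidate) > len(best):
            best = candidate
    return best
```

On the "bannana$" tree above this finds a longest repeated substring of length 2 ("an" or "na", depending on traversal order).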
2. As you can see, the trees are different. Are they equivalent or not? If yes, why are they different; if not, which algorithm is correct?
These trees are different. Let's rebuild the second tree for the string "bannana$" (with the terminator, as in the first tree):
8: $
7: a$
5: ana$
2: annana$
1: bannana$
6: na$
4: nana$
3: nnana$
|(1:bannana$)|leaf
tree:|
| | |(4:nana$)|leaf
| |(3:n)|
| | |(7:a$)|leaf
|(2:a)|
| |(8:$)|leaf
|
| |(4:nana$)|leaf
|(3:n)|
| | |(6:na$)|leaf
| |(5:a)|
| | |(8:$)|leaf
|
|(8:$)|leaf
5 branching nodes
3. If perl implementation is wrong is there any working implementation for perl/python?
I don't know Perl, but the tree is built correctly.
4. I've read about Ukkonen's algorithm, which is also mentioned on the page with the 2nd example. Do any of the mentioned examples use this algorithm? If not, is the algorithm they use slower, or does it have any drawbacks compared to Ukkonen's?
I said earlier that I don't know Perl, but this line in the first algorithm means that it runs in at least O(n^2) time (where n is the length of the string):
map { substr $str, $_ } 0 .. length($str) - 1;
Ukkonen's algorithm runs in linear time, O(n).
The first algorithm is also recursive, which may increase memory usage.
I have only been using the advance function with two arguments. Can somebody help me use it with three arguments, which is declared as:
func advance<T : ForwardIndexType>(start: T, n: T.Distance, end: T) -> T
That function increments the start index by n positions, but not
beyond the end index.
Example: You want to truncate strings to a given maximal length:
func truncate(string : String, length : Int) -> String {
    let index = advance(string.startIndex, length, string.endIndex)
    return string.substringToIndex(index)
}
println(truncate("fooBar", 3)) // foo
println(truncate("fo", 3)) // fo
In the first call, the start index is incremented by 3 positions,
in the second example only by two. With
let index = advance(string.startIndex, length)
the second call would crash with a runtime exception, because
a string index must not be advanced beyond the end index.
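The clamping behaviour is easy to state in any language. A Python sketch of the same idea (here a string index is just an integer, so "advance but not beyond the end" collapses to a min):

```python
def truncate(s, length):
    # advance the start index by at most `length` positions,
    # but never beyond the end index
    index = min(length, len(s))
    return s[:index]
```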
I'm trying to generate a formatted string based on a list:
[{"Max", 18}, {"Peter", 25}]
To a string:
"(Name: Max, Age: 18), (Name: Peter, Age: 25)"
The first step is to make a function that can convert your {Name, Age} tuple to a list:
format_person({Name, Age}) ->
    lists:flatten(io_lib:format("(Name: ~s, Age: ~b)", [Name, Age])).
The next part is simply to apply this function to each element in the list, and then join it together.
format_people(People) ->
    string:join(lists:map(fun format_person/1, People), ", ").
The reason for the flatten is that io_lib returns an iolist and not a flat list.
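The map-then-join structure is the same in any language; a quick Python sketch of what format_person and format_people do (function names mirror the Erlang ones):

```python
def format_person(person):
    # render one {Name, Age} pair as "(Name: X, Age: N)"
    name, age = person
    return "(Name: {}, Age: {})".format(name, age)

def format_people(people):
    # apply format_person to each element, then join with ", "
    return ", ".join(format_person(p) for p in people)
```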
If performance is important, you can use this solution:
format([]) -> [];
format(List) ->
    [[_|F]|R] = [ [", ","(Name: ",Name,", Age: ",integer_to_list(Age)|")"]
                  || {Name, Age} <- List ],
    [F|R].
But remember that it returns an io_list(), so if you want to see the result, use lists:flatten/1. This is a way to write very efficient string manipulation in Erlang, but use it only if performance is far more important than readability and maintainability.
A simple but slow way:
string:join([lists:flatten(io_lib:format("(~s: ~p)", [Key, Value])) || {Key,Value} <- [{"Max", 18}, {"Peter", 25}]], ", ").
Is it JSON? If so, use an existing module, e.g. from mochiweb.