How can I slice a string like Python does in Perl 6?

In Python, I can slice a string like this:
solo = "A quick brown fox jumps over the lazy dog"
solo[3:5]
I know substr and comb are enough; I want to know if this syntax is possible all the same. Should I use a role to do this?

How to slice a string
Strings (represented as class Str) are seen as single values and not positional data structures in Perl 6, and thus can't be indexed/sliced with the built-in [ ] array indexing operator.
As you already suggested, comb or substr is the way to go:
my $solo = "A quick brown fox jumps over the lazy dog";
dd $solo.comb[3..4].join; # "ui"
dd $solo.comb[3..^5].join; # "ui"
dd $solo.substr(3, 2); # "ui"
dd $solo.substr(3..4); # "ui"
dd $solo.substr(3..^5); # "ui"
If you want to modify the slice, use substr-rw:
$solo.substr-rw(2..6) = "slow";
dd $solo; # "A slow brown fox jumps over the lazy dog"
How to make operator [ ] work directly on strings
If you really wanted to, you could compose a role into your string that adds method AT-POS, which would make the [ ] operator work on it:
my $solo = "A quick brown fox jumps over the lazy dog" but role {
    method AT-POS ($i) { self.substr($i, 1) }
}
dd $solo[3..5]; # ("u", "i", "c")
However, this returns slices as a list, because that's what the [ ] operator does by default. To make it return a concatenated string, you'd have to re-implement [ ] for type Str:
multi postcircumfix:<[ ]>(Str $string, Int:D $index) {
    $string.substr($index, 1)
}
multi postcircumfix:<[ ]>(Str $string, Range:D $slice) {
    $string.substr($slice)
}
multi postcircumfix:<[ ]>(Str $string, Iterable:D \slice) {
    slice.map({ $string.substr($_, 1) }).join
}
my $solo = "A quick brown fox jumps over the lazy dog";
dd $solo[3]; # "u"
dd $solo[3..4]; # "ui"
dd $solo[3, 5]; # "uc"
You could extend the [ ] multi-candidates added here to handle all the other magic that the operator provides for lists, like:
from-the-end indices, e.g. $solo[*-1]
truncating slices, e.g. $solo[3..*]
adverbs, e.g. $solo[3..5]:kv
assignment, e.g. $solo[2..6] = "slow"
But it would take a lot of effort to get all of that right.
Also, keep in mind that overriding built-in operators to do things they weren't designed to do will confuse other Perl 6 programmers who have to review or work with your code in the future.

Related

Looking for efficient string replacement algorithm

I'm trying to create a string replacer that accepts multiple replacements.
The idea is that it would scan the string to find substrings and replace those substrings with another substring.
For example, I should be able to ask it to replace every "foo" with "bar". Doing that is trivial.
The issue starts when I try to add multiple replacements to this function, because if I ask it to replace "foo" with "bar" and "bar" with "biz", running those replacements in sequence would result in "foo" turning into "biz", and this behavior is unintended.
I tried splitting the string into words and running each replacement function on each word. However, that's not bulletproof either, because it still results in unintended behavior, since you can ask it to replace substrings that are not whole words. Also, I find that very inefficient.
I'm thinking of some way of running each replacer once over the whole string and sort of storing those changes and merging them. However, I think I'm overengineering.
Searching the web gives me trivial results on how to use string.replace with regular expressions; it doesn't solve my problem.
Is this a problem already solved? Is there an algorithm that can be used here to do this string manipulation efficiently?
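To make the pitfall concrete, here is a minimal Python demonstration (my own illustration) of why naive sequential replacement fails:

>>> "foo bar".replace("foo", "bar").replace("bar", "biz")
'biz biz'

The first pass turns "foo bar" into "bar bar", and the second pass can no longer tell the original "bar" apart from the one it just produced; the intended result was "bar biz".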
If you modify your string while searching for all occurrences of the substrings to be replaced, you'll end up modifying incorrect states of the string. An easy way out could be to get a list of all indexes to update first, then iterate over the indexes and make the replacements. That way, the indexes for "bar" will already have been computed, and won't be affected even if you replace another substring with "bar" later.
Adding a rough Python implementation to give you an idea:
import re

string = "foo bar biz"
replacements = [("foo", "bar"), ("bar", "biz")]
replacement_indexes = []
offset = 0

# collect the match positions for every pattern before touching the string
for item in replacements:
    replacement_indexes.append([m.start() for m in re.finditer(item[0], string)])

temp = list(string)
for i in range(len(replacement_indexes)):
    old, new, indexes = replacements[i][0], replacements[i][1], replacement_indexes[i]
    for index in indexes:
        temp[offset+index:offset+index+len(old)] = list(new)
        offset += len(new) - len(old)

print(''.join(temp)) # "bar biz biz"
Here's the approach I would take.
I start with my text and the set of replacements:
string text = "alpha foo beta bar delta";
Dictionary<string, string> replacements = new()
{
    { "foo", "bar" },
    { "bar", "biz" },
};
Now I create an array of parts that are either "open" or not. Open parts can have their text replaced.
var parts = new List<(string text, bool open)>
{
    (text: text, open: true)
};
Now I run through each replacement and build a new parts list. If a part is open I can do the replacements; if it's closed it's added in untouched. This last bit is what prevents double-mapping of replacements.
Here's the main logic:
foreach (var replacement in replacements)
{
    var parts2 = new List<(string text, bool open)>();
    foreach (var part in parts)
    {
        if (part.open)
        {
            bool skip = true;
            foreach (var split in part.text.Split(new[] { replacement.Key }, StringSplitOptions.None))
            {
                if (skip)
                {
                    skip = false;
                }
                else
                {
                    parts2.Add((text: replacement.Value, open: false));
                }
                parts2.Add((text: split, open: true));
            }
        }
        else
        {
            parts2.Add(part);
        }
    }
    parts = parts2;
}
That produces the following list of parts:
(text: "alpha ", open: true)
(text: "bar", open: false)
(text: " beta ", open: true)
(text: "biz", open: false)
(text: " delta", open: true)
Now it just needs to be joined back up again:
string result = String.Concat(parts.Select(p => p.text));
That gives:
alpha bar beta biz delta
As requested.
Let's suppose your given string were
str = "Mary had fourteen  little lambs"
and the desired replacements were given by the following hash (aka hashmap):
h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }
indicating that we want to replace "Mary" (wherever it appears in the string, if at all) with "Butch", and so on. We therefore want to return the following string:
"Butch had fourteen wee hippos"
Notice that we do not want 'fourteen' to be replaced with 'threeteen' and we want the extra spaces between 'fourteen' and 'wee' to be preserved.
First collect the keys of the hash h into an array (or list):
keys = h.keys
#=> ["Mary", "four", "little", "lambs"]
Most languages have a method or function sub or gsub that works something like the following:
str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "Butch had fourteen  wee hippos"
The regular expression /\w+/ (r'\w+' in Python, for example) matches one or more word characters, as many as possible (i.e., a greedy match). Word characters are letters, digits and the underscore ('_'). It therefore will sequentially match 'Mary', 'had', 'fourteen', 'little' and 'lambs'.
Each matched word is passed to the "block" do |word| ... end and is held by the variable word. The block calculation then computes and returns the string that is to replace the value of word in a duplicate of the original string. Different languages use different structures and formats to do this, of course.
The first word passed to the block by gsub is 'Mary'. The following calculation is then performed:
if keys.include?("Mary") # true
# so replace "Mary" with:
h[word] #=> "Butch
else # not executed
# not executed
end
Next, gsub passes the word 'had' to the block and assigns that string to the variable word. The following calculation is then performed:
if keys.include?("had") # false
# not executed
else
# so replace "had" with:
"had"
# that is, leave "had" unchanged
end
Similar calculations are made for each word matched by the regular expression.
We see that punctuation and other non-word characters are not a problem:
str = "Mary, had fourteen little lambs!"
str.gsub(/\w+/) do |word|
if keys.include?(word)
h[word]
else
word
end
end
#=> "Butch, had fourteen wee hippos!"
We can see that gsub does not perform replacements sequentially:
h = { "foo"=>"bar", "bar"=>"baz" }
keys = h.keys
#=> ["foo", "bar"]
"foo bar".gsub(/\w+/) do |word|
if keys.include?(word)
h[word]
else
word
end
end
#=> "bar baz"
Note that a linear search of keys is required to evaluate
keys.include?("Mary")
This could be relatively time-consuming if keys has many elements.
In most languages this can be sped up by making keys a set (an unordered collection of unique elements). Determining whether a set contains a given element is quite fast, comparable to determining if a hash has a given key.
An alternative formulation is to write
str.gsub(/\b(?:Mary|four|little|lambs)\b/) { |word| h[word] }
#=> "Butch had fourteen wee hippos"
where the regular expression is constructed programmatically from h.keys. This regular expression reads, "match one of the four words indicated, preceded and followed by a word boundary (\b). The trailing word boundary prevents 'four' from matching 'fourteen'. Since gsub is now only considering the replacement of those four words the block can be simplified to { |word| h[word] }.
Again, this preserves punctuation and extra spaces.
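For example, in Python the alternation could be assembled from the keys like this (a sketch of the same single-pass idea; re.escape guards against regex metacharacters in the keys):

import re

h = {"Mary": "Butch", "four": "three", "little": "wee", "lambs": "hippos"}
s = "Mary had fourteen  little lambs"

# builds \b(?:Mary|four|little|lambs)\b from the keys
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, h)) + r')\b')
print(pattern.sub(lambda m: h[m.group()], s))
#=> Butch had fourteen  wee hippos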
If for some reason we wanted to be able to replace parts of words (e.g., to replace 'fourteen' with 'threeteen'), simply remove the word boundaries from the regular expression:
str.gsub(/Mary|four|little|lambs/) { |word| h[word] }
#=> "Butch had threeteen wee hippos"
Naturally, different languages provide variations of this approach. In Ruby, for example, one could write:
g = Hash.new { |h,k| k }.merge(h)
This creates a hash g that has the same key-value pairs as h but has the additional property that if g does not have a key k, g[k] (the value of key k) returns k. That allows us to write simply:
str.gsub(/\w+/, g)
#=> "Butch had fourteen wee hippos"
See the second version of String#gsub.
A different approach (which I will show is problematic) is to construct an array (or list) of words from the string, replace those words as appropriate and then rejoin the resulting words to form a string. For example,
words = str.split
#=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Butch", "had", "fourteen", "wee", "hippos"]
arr.join(' ')
#=> "Butch had fourteen wee hippos"
This produces similar results except the extra spaces have been removed.
Now suppose the string contained punctuation:
str = "Mary, had fourteen little lambs!"
words = str.split
#=> ["Mary,", "had", "fourteen", "little", "lambs!"]
arr = words.map do |word|
if keys.include?(word)
h[word]
else
word
end
end
#=> ["Mary,", "had", "fourteen", "wee", "lambs!"]
arr.join(' ')
#=> "Mary, had fourteen wee lambs!"
We could deal with punctuation by writing
words = str.scan(/\w+/)
#=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Butch", "had", "fourteen", "wee", "hippos"]
Here str.scan returns an array of all matches of the regular expression /\w+/ (one or more word characters). The obvious problem is that all punctuation is lost when the words are rejoined with arr.join(' ').
You can achieve this in a simple way by using regular expressions:
import re
replaces = {'foo' : 'bar', 'alfa' : 'beta', 'bar': 'biz'}
original_string = 'foo bar, alfa foo. bar other.'
expected_string = 'bar biz, beta bar. biz other.'
replaced = re.compile(r'\w+').sub(lambda m: replaces[m.group()] if m.group() in replaces else m.group(), original_string)
assert replaced == expected_string
I haven't checked the performance, but I believe it is probably faster than using "nested for loops".

Lua string manipulation pattern matching alternative "|"

Is there a way I can do a string pattern that will match "ab|cd" so it matches for either "ab" or "cd" in the input string. I know you use something like "[ab]" as a pattern and it will match for either "a" or "b", but that only works for one letter stuff.
Note that my actual problem is a lot more complicated, but essentially I just need to know if there is an OR thing in Lua's string manipulation. I would actually want to put other patterns on each sides of the OR thing, and etc. But if it works with something like "hello|world" and matches "hello, world!" with both "hello" and "world" then it's great!
Using logical operators with Lua patterns can solve most problems. For instance, for a regular expression like (hello|world)%d+, you can use
string.match(str, "hello%d+") or string.match(str, "world%d+")
The short-circuiting of the or operator makes sure the string is matched against hello%d+ first; if that fails, it is matched against world%d+.
Unfortunately Lua patterns are not regular expressions and are less powerful. In particular they don't support alternation (that vertical bar | operator of Java or Perl regular expressions), which is what you want to do.
A simple workaround could be the following:
local function MatchAny( str, pattern_list )
    for _, pattern in ipairs( pattern_list ) do
        local w = string.match( str, pattern )
        if w then return w end
    end
end
s = "hello dolly!"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
s = "cruel world!"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
s = "hello world!"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
s = "got 1000 bucks"
print( MatchAny( s, { "hello", "world", "%d+" } ) )
Output:
hello
world
hello
1000
The function MatchAny will match its first argument (a string) against a list of Lua patterns and return the result of the first successful match.
Just to expand on peterm's suggestion, lpeg also provides a re module that exposes a similar interface to lua's standard string library while still preserving the extra power and flexibility offered by lpeg.
I would say try out the re module first since its syntax is a bit less esoteric compared to lpeg. Here's an example usage that can match your hello world example:
dump = require 'pl.pretty'.dump
re = require 're'

local subj = "hello, world! padding world1 !hello hello hellonomatch nohello"

pat = re.compile [[
    toks <- tok (%W+ tok)*
    tok <- {'hello' / 'world'} !%w / %w+
]]

res = { re.match(subj, pat) }
dump(res)
which would output:
{
    "hello",
    "world",
    "hello",
    "hello"
}
If you're interested in capturing the position of the matches just modify the grammar slightly for positional capture:
tok <- {}('hello' / 'world') !%w / %w+

Groovy String Comparison

I need to know if two strings "match" where "matching" basically means that there is significant overlap between the two strings. For example, if string1 is "foo" and string2 is "foobar", this should be a match. If string2 was "barfoo", that should also be a match with string1. However, if string2 was "fobar", this should not be a match. I'm struggling to find a clever way to do this. Do I need to split the strings into lists of characters first or is there a way to do this kind of comparison already in Groovy? Thanks!
Using Apache commons StringUtils:
@Grab( 'org.apache.commons:commons-lang3:3.1' )
import static org.apache.commons.lang3.StringUtils.getLevenshteinDistance
int a = getLevenshteinDistance( 'The quick fox jumped', 'The fox jumped' )
int b = getLevenshteinDistance( 'The fox jumped', 'The fox' )
// Assert a is more similar than b
assert a < b
Levenshtein distance tells you the number of characters that have to change for one string to become another.
So to get from 'The quick fox jumped' to 'The fox jumped', you need to delete 6 chars (so it has a score of 6).
And to get from 'The fox jumped' to 'The fox', you need to delete 7 chars.
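For intuition about what getLevenshteinDistance computes, here is a minimal dynamic-programming sketch in Python (my own illustration, not the Commons implementation):

def levenshtein(a, b):
    # prev[j] holds the edit distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

assert levenshtein('The quick fox jumped', 'The fox jumped') == 6
assert levenshtein('The fox jumped', 'The fox') == 7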
As per your examples, plain old String.contains may suffice:
assert 'foobar'.contains('foo')
assert 'barfoo'.contains('foo')
assert !'fobar'.contains('foo')

replace strings inside a substring

I have this string:
"The quick brown f0x jumps 0ver the lazy d0g, the quick brown f0x jumps 0ver the lazy d0g.".
I need a function that will replace all zeros between "brown" and "lazy" with "o". So the output will look like this:
"The quick brown fox jumps over the lazy d0g, the quick brown fox jumps over the lazy d0g.".
So it will look all over the string and most importantly will leave all other zeros intact.
function(text, leftBorder, rightBorder, searchString, replaceString) : string;
Is there any good algorithm?
If you have Python, here's an example using just string manipulation, e.g. split(), indexing, etc. Your programming language should have these features as well.
>>> s = "The quick brown f0x jumps 0ver the lazy d0g, the quick brown f0x jumps 0ver the lazy d0g."
>>> words = s.split("lazy")
>>> for n, word in enumerate(words):
...     if "brown" in word:
...         w = word.split("brown")
...         w[-1] = w[-1].replace("0", "o")
...         word = 'brown'.join(w)
...         words[n] = word
...
>>> 'lazy'.join(words)
'The quick brown fox jumps over the lazy d0g, the quick brown fox jumps over the lazy d0g.'
The steps:
Split the string on "lazy" into an array A.
Go through each element of A looking for "brown".
If found, split that element on "brown" into array B; the part you are going to change is the last element.
Replace the zeros in it with whatever method your programming language provides.
Join array B back together using "brown".
Update the element in array A.
Lastly, join the whole string back together using "lazy".
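Generalizing those steps into the function signature from the question, a Python sketch might look like this (it assumes each left border precedes its right border and that borders don't nest):

def replace_between(text, left_border, right_border, search_string, replace_string):
    # Split on the right border; in each piece, only the text after the
    # last left border lies between the two borders.
    segments = text.split(right_border)
    for n, segment in enumerate(segments):
        if left_border in segment:
            head, _, tail = segment.rpartition(left_border)
            segments[n] = head + left_border + tail.replace(search_string, replace_string)
    return right_border.join(segments)

s = "The quick brown f0x jumps 0ver the lazy d0g, the quick brown f0x jumps 0ver the lazy d0g."
print(replace_between(s, "brown", "lazy", "0", "o"))
# The quick brown fox jumps over the lazy d0g, the quick brown fox jumps over the lazy d0g.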

How can I split multiple joined words?

I have an array of 1000 or so entries, with examples below:
wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector
I would like to be able to split these into their respective words, as:
wicked weather
liquid weather
drive our trucks
go compact
slim projector
I was hoping a regular expression my do the trick. But, since there is no boundary to stop on, nor is there any sort of capitalization that I could possibly key on, I am thinking, that some sort of reference to a dictionary might be necessary?
I suppose it could be done by hand, but why - when it can be done with code! =) But this has stumped me. Any ideas?
The Viterbi algorithm is much faster. It computes the same scores as the recursive search in Dmitry's answer above, but in O(n) time. (Dmitry's search takes exponential time; Viterbi does it by dynamic programming.)
import re
from collections import Counter

def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower())
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
Testing it:
>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'
To be practical you'll likely want a couple refinements:
Add logs of probabilities, don't multiply probabilities. This avoids floating-point underflow.
Your inputs will in general use words not in your corpus. These substrings must be assigned a nonzero probability as words, or you end up with no solution or a bad solution. (That's just as true for the above exponential search algorithm.) This probability has to be siphoned off the corpus words' probabilities and distributed plausibly among all other word candidates: the general topic is known as smoothing in statistical language models. (You can get away with some pretty rough hacks, though.) This is where the O(n) Viterbi algorithm blows away the search algorithm, because considering non-corpus words blows up the branching factor.
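To make the first two refinements concrete, here is a log-space sketch of the same function (it reuses dictionary, total and max_word_length from the code above; the 0.5 pseudo-count for unseen substrings is a crude smoothing hack, not a tuned value):

import math

def word_logprob(word):
    # unseen substrings get a small pseudo-count instead of zero probability
    return math.log(dictionary.get(word, 0.5) / total)

def viterbi_segment_log(text):
    # sums of log-probabilities replace products, avoiding floating-point underflow
    scores, lasts = [0.0], [0]
    for i in range(1, len(text) + 1):
        score_k, k = max((scores[j] + word_logprob(text[j:i]), j)
                         for j in range(max(0, i - max_word_length), i))
        scores.append(score_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, scores[-1]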
Can a human do it?
farsidebag
far sidebag
farside bag
far side bag
Not only do you have to use a dictionary, you might have to use a statistical approach to figure out what's most likely (or, god forbid, an actual HMM for your human language of choice...)
For how to do statistics that might be helpful, I turn you to Dr. Peter Norvig, who addresses a different, but related problem of spell-checking in 21 lines of code:
http://norvig.com/spell-correct.html
(he does cheat a bit by folding every for loop into a single line.. but still).
Update: This got stuck in my head, so I had to birth it today. This code does a similar split to the one described by Robert Gamble, but then it orders the results based on word frequency in the provided dictionary file (which is expected to be some text representative of your domain or English in general; I used big.txt from Norvig, linked above, and catted a dictionary onto it to cover missing words).
A combination of two words will most of the time beat a combination of 3 words, unless the frequency difference is enormous.
I posted this code with some minor changes on my blog
http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/
and also wrote a little about the underflow bug in this code.. I was tempted to just quietly fix it, but figured this may help some folks who haven't seen the log trick before:
http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/
Output on your words, plus a few of my own -- notice what happens with "orcore":
perl splitwords.pl big.txt words
answerveal: 2 possibilities
- answer veal
- answer ve al
wickedweather: 4 possibilities
- wicked weather
- wicked we at her
- wick ed weather
- wick ed we at her
liquidweather: 6 possibilities
- liquid weather
- liquid we at her
- li quid weather
- li quid we at her
- li qu id weather
- li qu id we at her
driveourtrucks: 1 possibilities
- drive our trucks
gocompact: 1 possibilities
- go compact
slimprojector: 2 possibilities
- slim projector
- slim project or
orcore: 3 possibilities
- or core
- or co re
- orc ore
Code:
#!/usr/bin/env perl

use strict;
use warnings;

sub find_matches($);
sub find_matches_rec($\@\@);
sub find_word_seq_score(@);
sub get_word_stats($);
sub print_results($@);
sub Usage();

our(%DICT, $TOTAL);
{
    my( $dict_file, $word_file ) = @ARGV;
    ($dict_file && $word_file) or die(Usage);

    {
        my $DICT;
        ($DICT, $TOTAL) = get_word_stats($dict_file);
        %DICT = %$DICT;
    }

    {
        open( my $WORDS, '<', $word_file ) or die "unable to open $word_file\n";
        foreach my $word (<$WORDS>) {
            chomp $word;
            my $arr = find_matches($word);

            local $_;
            # Schwartzian Transform
            my @sorted_arr =
                map  { $_->[0] }
                sort { $b->[1] <=> $a->[1] }
                map  { [ $_, find_word_seq_score(@$_) ] }
                @$arr;

            print_results( $word, @sorted_arr );
        }
        close $WORDS;
    }
}

sub find_matches($){
    my( $string ) = @_;

    my @found_parses;
    my @words;
    find_matches_rec( $string, @words, @found_parses );

    return @found_parses if wantarray;
    return \@found_parses;
}

sub find_matches_rec($\@\@){
    my( $string, $words_sofar, $found_parses ) = @_;
    my $length = length $string;

    unless( $length ){
        push @$found_parses, $words_sofar;
        return @$found_parses if wantarray;
        return $found_parses;
    }

    foreach my $i ( 2..$length ){
        my $prefix = substr($string, 0, $i);
        my $suffix = substr($string, $i, $length-$i);

        if( exists $DICT{$prefix} ){
            my @words = ( @$words_sofar, $prefix );
            find_matches_rec( $suffix, @words, @$found_parses );
        }
    }

    return @$found_parses if wantarray;
    return $found_parses;
}

## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
sub find_word_seq_score(@){
    my( @words ) = @_;
    local $_;

    my $score = 1;
    foreach ( @words ){
        $score = $score * $DICT{$_} / $TOTAL;
    }

    return $score;
}

sub get_word_stats($){
    my ($filename) = @_;

    open(my $DICT, '<', $filename) or die "unable to open $filename\n";

    local $/ = undef;
    local $_;
    my %dict;
    my $total = 0;

    while ( <$DICT> ){
        foreach ( split(/\b/, $_) ) {
            $dict{$_} += 1;
            $total++;
        }
    }

    close $DICT;
    return (\%dict, $total);
}

sub print_results($@){
    #( 'word', [qw'test one'], [qw'test two'], ... )
    my ($word, @combos) = @_;
    local $_;
    my $possible = scalar @combos;

    print "$word: $possible possibilities\n";
    foreach (@combos) {
        print ' - ', join(' ', @$_), "\n";
    }
    print "\n";
}

sub Usage(){
    return "$0 /path/to/dictionary /path/to/your_words";
}
pip install wordninja
>>> import wordninja
>>> wordninja.split('bettergood')
['better', 'good']
The best tool for the job here is recursion, not regular expressions. The basic idea is to start from the beginning of the string looking for a word, then take the remainder of the string and look for another word, and so on until the end of the string is reached. A recursive solution is natural since backtracking needs to happen when a given remainder of the string cannot be broken into a set of words. The solution below uses a dictionary to determine what is a word and prints out solutions as it finds them (some strings can be broken out into multiple possible sets of words, for example wickedweather could be parsed as "wicked we at her"). If you just want one set of words you will need to determine the rules for selecting the best set, perhaps by selecting the solution with fewest number of words or by setting a minimum word length.
#!/usr/bin/perl

use strict;

my $WORD_FILE = '/usr/share/dict/words'; # Change as needed
my %words; # Hash of words in dictionary

# Open dictionary, load words into hash
open(WORDS, $WORD_FILE) or die "Failed to open dictionary: $!\n";
while (<WORDS>) {
    chomp;
    $words{lc($_)} = 1;
}
close(WORDS);

# Read one line at a time from stdin, break into words
while (<>) {
    chomp;
    my @words;
    find_words(lc($_));
}

sub find_words {
    # Print every way $string can be parsed into whole words
    my $string = shift;
    my @words = @_;
    my $length = length $string;

    foreach my $i ( 1 .. $length ) {
        my $word = substr $string, 0, $i;
        my $remainder = substr $string, $i, $length - $i;
        # Some dictionaries contain each letter as a word
        next if ($i == 1 && ($word ne "a" && $word ne "i"));

        if (defined($words{$word})) {
            push @words, $word;
            if ($remainder eq "") {
                print join(' ', @words), "\n";
                return;
            } else {
                find_words($remainder, @words);
            }
            pop @words;
        }
    }

    return;
}
I think you're right in thinking that it's not really a job for a regular expression. I would approach this using the dictionary idea - look for the longest prefix that is a word in the dictionary. When you find that, chop it off and do the same with the remainder of the string.
The above method is subject to ambiguity, for example "drivereallyfast" would first find "driver" and then have trouble with "eallyfast". So you would also have to do some backtracking if you ran into this situation. Or, since you don't have that many strings to split, just do by hand the ones that fail the automated split.
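A small Python sketch of that longest-prefix-first idea with backtracking (the tiny words set is just for illustration; in practice you'd load /usr/share/dict/words or similar):

def split_greedy(s, words):
    # Try the longest dictionary prefix first; backtrack if the rest can't be split.
    if not s:
        return []
    for i in range(len(s), 0, -1):
        prefix = s[:i]
        if prefix in words:
            rest = split_greedy(s[i:], words)
            if rest is not None:
                return [prefix] + rest
    return None  # no parse found

words = {"drive", "driver", "really", "fast"}
print(split_greedy("drivereallyfast", words))  # ['drive', 'really', 'fast'] after backing off 'driver'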
This is related to a problem known as identifier splitting or identifier name tokenization. In the OP's case, the inputs seem to be concatenations of ordinary words; in identifier splitting, the inputs are class names, function names or other identifiers from source code, and the problem is harder. I realize this is an old question and the OP has either solved their problem or moved on, but in case someone else comes across this question while looking for identifier splitters (like I was, not long ago), I would like to offer Spiral ("SPlitters for IdentifieRs: A Library"). It is written in Python but comes with a command-line utility that can read a file of identifiers (one per line) and split each one.
Splitting identifiers is deceptively difficult. Programmers commonly use abbreviations, acronyms and word fragments when naming things, and they don't always use consistent conventions. Even when identifiers do follow some convention such as camel case, ambiguities can arise.
Spiral implements numerous identifier splitting algorithms, including a novel algorithm called Ronin. It uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin can split identifiers that do not use camel case or other naming conventions, including cases such as splitting J2SEProjectTypeProfiler into [J2SE, Project, Type, Profiler], which requires the reader to recognize J2SE as a unit. Here are some more examples of what Ronin can split:
# spiral mStartCData nonnegativedecimaltype getUtf8Octets GPSmodule savefileas nbrOfbugs
mStartCData: ['m', 'Start', 'C', 'Data']
nonnegativedecimaltype: ['nonnegative', 'decimal', 'type']
getUtf8Octets: ['get', 'Utf8', 'Octets']
GPSmodule: ['GPS', 'module']
savefileas: ['save', 'file', 'as']
nbrOfbugs: ['nbr', 'Of', 'bugs']
Using the examples from the OP's question:
# spiral wickedweather liquidweather driveourtrucks gocompact slimprojector
wickedweather: ['wicked', 'weather']
liquidweather: ['liquid', 'weather']
driveourtrucks: ['driveourtrucks']
gocompact: ['go', 'compact']
slimprojector: ['slim', 'projector']
As you can see, it is not perfect. It's worth noting that Ronin has a number of parameters and adjusting them makes it possible to split driveourtrucks too, but at the cost of worsening performance on program identifiers.
More information can be found in the GitHub repo for Spiral.
A simple solution with Python: install the wordsegment package: pip install wordsegment.
$ echo thisisatest | python -m wordsegment
this is a test
Well, the problem itself is not solvable with just a regular expression. A solution (probably not the best) would be to get a dictionary and do a regular expression match for each word in the dictionary against each word in the list, adding the space whenever successful. Certainly this would not be terribly quick, but it would be easy to program and faster than doing it by hand.
A dictionary based solution would be required. This might be simplified somewhat if you have a limited dictionary of words that can occur, otherwise words that form the prefix of other words are going to be a problem.
There is a Python package released by Santhosh Thottingal called mlmorph which can be used for morphological analysis.
https://pypi.org/project/mlmorph/
Examples:
from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")
Gives
[('കേരളം<np><genitive>', 179)]
He also wrote a blog on the topic https://thottingal.in/blog/2017/11/26/towards-a-malayalam-morphology-analyser/
This will work if the words are camelCase. JavaScript!!!
function spinalCase(str) {
    let lowercase = str.trim()
    let regEx = /\W+|(?=[A-Z])|_/g
    let result = lowercase.split(regEx).join("-").toLowerCase()
    return result;
}
spinalCase("AllThe-small Things");
One of the solutions could be recursion (the same can be converted into dynamic programming):
static List<String> wordBreak(
        String input,
        Set<String> dictionary
) {
    List<List<String>> result = new ArrayList<>();
    List<String> r = new ArrayList<>();
    helper(input, dictionary, result, "", 0, new Stack<>());
    for (List<String> strings : result) {
        String s = String.join(" ", strings);
        r.add(s);
    }
    return r;
}

static void helper(
        final String input,
        final Set<String> dictionary,
        final List<List<String>> result,
        String state,
        int index,
        Stack<String> stack
) {
    if (index == input.length()) {
        // add the last word
        stack.push(state);
        for (String s : stack) {
            if (!dictionary.contains(s)) {
                return;
            }
        }
        result.add((List<String>) stack.clone());
        return;
    }
    if (dictionary.contains(state)) {
        // bifurcate
        stack.push(state);
        helper(input, dictionary, result, "" + input.charAt(index),
                index + 1, stack);
        String pop = stack.pop();
        String s = stack.pop();
        helper(input, dictionary, result, s + pop.charAt(0),
                index + 1, stack);
    }
    else {
        helper(input, dictionary, result, state + input.charAt(index),
                index + 1, stack);
    }
    return;
}
Another possible solution would be to use a trie data structure.
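A minimal sketch of that trie idea in Python (my own illustration, not a translation of the Java above); the trie lets each scan stop as soon as no dictionary word starts with the current prefix:

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker
    return root

def word_break(s, root):
    results = []
    def walk(i, parts):
        if i == len(s):
            results.append(' '.join(parts))
            return
        node = root
        for j in range(i, len(s)):
            node = node.get(s[j])
            if node is None:
                break  # no dictionary word continues with this prefix
            if '$' in node:
                walk(j + 1, parts + [s[i:j + 1]])
    walk(0, [])
    return results

print(word_break("gocompact", build_trie({"go", "compact", "gocom"})))  # ['go compact']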
pip install wordninja

import wordninja
n = wordninja.split('bettergood')
m = wordninja.split("coffeeshop")
print(n, m)

Output:
['better', 'good'] ['coffee', 'shop']

Alternatively, without wordninja, you can collect the known words that occur as substrings:

list = ['hello', 'coffee', 'shop', 'better', 'good']
mat = 'coffeeshop'
expected = []
for i in list:
    if i in mat:
        expected.append(i)
print(expected)

Output:
['coffee', 'shop']
So I spent like 2 days on this answer, since I need it for my own NLP work. My answer is derived from Darius Bacon's answer, which itself was derived from the Viterbi algorithm. I also abstracted it to take each word in a message, attempt to split it, and then reassemble the message. I expanded Darius's code to make it debuggable. I also swapped out the need for "big.txt", and use the wordfreq library instead.

Some comments stress the need to use a non-zero word frequency for non-existent words. I found that using any frequency higher than zero would cause "itseasyformetosplitlongruntogetherblocks" to undersplit into "itseasyformetosplitlongruntogether blocks". The algorithm in general tends to either oversplit or undersplit various test messages depending on how you combine word frequencies and how you handle missing word frequencies. I played around with many tweaks until it behaved well. My solution uses a 0.0 frequency for missing words.

It also adds a reward for word length (otherwise it tends to split words into characters). I tried many length rewards, and the one that seems to work best for my test cases is word_frequency * (e ** word_length). There were also comments warning against multiplying word frequencies together. I tried adding them, using the harmonic mean, and using 1-freq instead of the 0.00001 form. They all tended to oversplit the test cases. Simply multiplying word frequencies together worked best.

I left my debugging print statements in there, to make it easier for others to continue tweaking.

Finally, there's a special case where if your whole message is a word that doesn't exist, like "Slagle's", then the function splits the word into individual letters. In my case, I don't want that, so I have a special return statement at the end to return the original message in those cases.
import numpy as np
from wordfreq import get_frequency_dict

word_prob = get_frequency_dict(lang='en', wordlist='large')
max_word_len = max(map(len, word_prob))  # 34

def viterbi_segment(text, debug=False):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        new_probs = []
        for j in range(max(0, i - max_word_len), i):
            substring = text[j:i]
            length_reward = np.exp(len(substring))
            freq = word_prob.get(substring, 0) * length_reward
            compounded_prob = probs[j] * freq
            new_probs.append((compounded_prob, j))
            if debug:
                print(f'[{j}:{i}] = "{text[lasts[j]:j]} & {substring}" = ({probs[j]:.8f} & {freq:.8f}) = {compounded_prob:.8f}')
        # max of a tuple compares the first elements, so this picks the max compounded probability
        prob_k, k = max(new_probs)
        probs.append(prob_k)
        lasts.append(k)
        if debug:
            print(f'i = {i}, prob_k = {prob_k:.8f}, k = {k}, ({text[k:i]})\n')

    # when text is a word that doesn't exist, the algorithm breaks it into individual letters.
    # in that case, return the original word instead
    if len(set(lasts)) == len(text):
        return text

    words = []
    k = len(text)
    while 0 < k:
        word = text[lasts[k]:k]
        words.append(word)
        k = lasts[k]
    words.reverse()
    return ' '.join(words)

def split_message(message):
    new_message = ' '.join(viterbi_segment(wordmash, debug=False) for wordmash in message.split())
    return new_message

messages = [
    'tosplit',
    'split',
    'driveourtrucks',
    "Slagle's",
    "Slagle's wickedweather liquidweather driveourtrucks gocompact slimprojector",
    'itseasyformetosplitlongruntogetherblocks',
]
for message in messages:
    print(f'{message}')
    new_message = split_message(message)
    print(f'{new_message}\n')
tosplit
to split
split
split
driveourtrucks
drive our trucks
Slagle's
Slagle's
Slagle's wickedweather liquidweather driveourtrucks gocompact slimprojector
Slagle's wicked weather liquid weather drive our trucks go compact slim projector
itseasyformetosplitlongruntogetherblocks
its easy for me to split long run together blocks
I may get downmodded for this, but have the secretary do it.
You'll spend more time on a dictionary solution than it would take to manually process. Further, you won't possibly have 100% confidence in the solution, so you'll still have to give it manual attention anyway.
