I have a chromosome sequence and have to find subsequences in it and the distances between them.
For example:
string:
AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT
Substring:
ACGT
I have to find the distance between all occurrences of ACGT.
I normally do not recommend answering posts where it is obvious the OP just wants other people to do their work. However, there is already one answer the use of which will be problematic if input strings are largish, so here is something that uses Perl builtins.
The special variable @- stores the start offsets of the last successful match, so $-[0] is the position where the whole match begins.
use strict;
use warnings;
use Data::Dumper;
my $string = 'AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT';
my @pos;
while ( $string =~ /ACGT/g ) {
    push @pos, $-[0];
}
my @dist;
for my $i (1 .. $#pos) {
    push @dist, $pos[$i] - $pos[$i - 1];
}
print Dumper(\@pos, \@dist);
This method uses less memory than splitting the original string (which may be a problem if the original string is large enough). Its memory footprint can be further reduced, but I focused on clarity by showing the accumulation of match positions and the calculation of deltas separately.
One open question is whether you want the index of the first match from the beginning of the string. Strictly speaking, "distances between matches" excludes that.
use strict;
use warnings;
use Data::Dumper;
my $string = 'AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT';
my @dist;
my $last;
while ($string =~ /ACGT/g) {
    no warnings 'uninitialized';
    push @dist, $-[0] - $last;
    $last = $-[0];
}
# Do we want the distance of the first
# match from the beginning of the string?
shift @dist;
print Dumper \@dist;
Of course, it is possible to use index for this as well, but it looks considerably uglier.
You may split your input string by "ACGT" and remove the first and the last elements of the returned array to get all fragments between "ACGT". Then calculate the lengths of these fragments:
my $input = "AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT";
my @fragments = split("ACGT", $input, -1);
@fragments = @fragments[1..$#fragments - 1];
my @dist_arr = map {length} @fragments;
Demo: https://ideone.com/AqEwGu
I have a .csv file, in which the numbers are formatted according to da_DK locale (i.e. a comma is used instead of a period as decimal point separator, among other things), so it looks something like this:
"5000","0,00","5,25", ....
I'd like to use a command line application to convert all the numbers in the file in one go, so the output is "C" (or POSIX) locale (i.e. dot/period is used as decimal separator):
"5000","0.00","5.25", ....
... keeping the decimal places as they are (i.e. "0,00" should be converted to "0.00", not "0" or "0.") and leaving all other data/formatting unchanged.
I am aware that there is numfmt, which should allow something like:
$ LC_ALL=en_DK.utf8 numfmt --from=iec --grouping 22123,11
22.123,11
... however, numfmt can only convert between units, not locales (once LC_ALL is specified, also the input number has to conform to it, just like the output).
I'd ultimately like something that is CSV-agnostic - that is, can parse through a text file, find all substrings that match a format of a number in the given input locale (i.e. the program would deduce from a string like "5000","0,00","5,25","hello".... three locale-specific numeric substrings 5000, 0,00 and 5,25), convert and replace these substrings, and leave everything else as is; but as an alternative, I'd also like to know about a CSV-aware approach (i.e., all fields are parsed row by row, and then content of each field is checked if it matches a locale-specific numeric string).
Ok, I did find a way to do this in Perl, and it's not exactly trivial; an example (csv-agnostic) script which converts a test string is pasted below. Ultimately it prints:
Orig string: "AO900-020","Hello","World","5000","0,00","5,25","stk","","1","0,00","Test 2","42.234,12","","","0,00","","","","5,25"
Conv string: "AO900-020","Hello","World","5000","0.00","5.25","stk","","1","0.00","Test 2","42234.12","","","0.00","","","","5.25"
... which is basically what I wanted to achieve; but there may be edge cases here which would be undesirable. Maybe better to use something like this with a tool like csvfix or csvtool, or use a Perl csv library directly in code.
Still, here is the code:
#!/usr/bin/env perl
use warnings;
use strict;
use locale;
use POSIX qw(setlocale locale_h LC_ALL);
use utf8;
use Number::Format qw(:subs); # sudo perl -MCPAN -e 'install Number::Format'
use Data::Dumper;
use Scalar::Util::Numeric qw(isint); # sudo perl -MCPAN -e 'install Scalar::Util::Numeric'
my $old_locale;
# query and save the old locale
$old_locale = setlocale(LC_ALL);
# list of (installed) locales: bash$ locale -a
setlocale(LC_ALL, "POSIX");
# localeconv() returns "a reference to a hash of locale-dependent info"
# dereference here:
#%posixlocalesettings = %{localeconv()};
#print Dumper(\%posixlocalesettings);
# or without dereference:
my $posixlocalesettings = localeconv();
# the $posixlocalesettings has only 'decimal_point' => '.';
# force also thousands_sep to '', else it will be comma later on, and grouping will be made regardless
$posixlocalesettings->{'thousands_sep'} = '';
print Dumper($posixlocalesettings);
#~ my $posixNumFormatter = new Number::Format %args;
# thankfully, Number::Format seems to accept as argument same kind of hash that localeconv() returns:
my $posixNumFormatter = new Number::Format(%{$posixlocalesettings});
print Dumper($posixNumFormatter);
setlocale(LC_ALL, "en_DK.utf8");
my $dklocalesettings = localeconv();
print Dumper($dklocalesettings);
# Get some of locale's numeric formatting parameters
my ($thousands_sep, $decimal_point, $grouping) =
    # @{localeconv()}{'thousands_sep', 'decimal_point', 'grouping'};
    @{$dklocalesettings}{'thousands_sep', 'decimal_point', 'grouping'};
# grouping and mon_grouping are packed lists
# of small integers (characters) telling the
# grouping (thousand_seps and mon_thousand_seps
# being the group dividers) of numbers and
# monetary quantities. The integers’ meanings:
# 255 means no more grouping, 0 means repeat
# the previous grouping, 1-254 means use that
# as the current grouping. Grouping goes from
# right to left (low to high digits). In the
# below we cheat slightly by never using anything
# else than the first grouping (whatever that is).
my @grouping = unpack("C*", $grouping);
print "en_DK.utf8: thousands_sep $thousands_sep; decimal_point $decimal_point; grouping " .join(", ", @grouping). "\n";
my $inputCSVString = '"AO900-020","Hello","World","5000","0,00","5,25","stk","","1","0,00","Test 2","42.234,12","","","0,00","","","","5,25"';
# Character set modifiers
# /d, /u , /a , and /l , available starting in 5.14, are called the character set modifiers;
# /l sets the character set to that of whatever Locale is in effect at the time of the execution of the pattern match.
while ($inputCSVString =~ m/[[:digit:]]+/gl) { # doesn't take locale in account
print "A Found '$&'. Next attempt at character " . (pos($inputCSVString)+1) . "\n";
}
print "----------\n";
#~ while ($inputCSVString =~ m/(\d{$grouping[0]}($|$thousands_sep))+/gl) {
#~ while ($inputCSVString =~ m/(\d)(\d{$grouping[0]}($|$thousands_sep))+/gl) {
# match a string that starts with digit, and contains only digits, thousands separators and decimal points
# note - it will NOT match negative numbers
while ($inputCSVString =~ m/\d[\d$thousands_sep$decimal_point]+/gl) {
my $numstrmatch = $&;
my $unnumstr = unformat_number($numstrmatch); # should unformat according to current locale ()
my $posixnumstr = $posixNumFormatter->format_number($unnumstr);
print "B Found '$numstrmatch' (unf: '$unnumstr', form: '$posixnumstr'). Next attempt at character " . (pos($inputCSVString)+1) . "\n";
}
sub convertNumStr{
    my $numstrmatch = $_[0];
    my $unnumstr = unformat_number($numstrmatch);
    # if an integer, return as is so it doesn't change trailing zeroes, if the number is a label
    if ( (isint $unnumstr) && ( $numstrmatch !~ m/\Q$decimal_point\E/) ) { return $numstrmatch; }
    #~ print "--- $unnumstr\n";
    # find the length of the string after the decimal point - the precision
    my $precision_strlen = length( substr( $numstrmatch, index($numstrmatch, $decimal_point)+1 ) );
    # must manually spec precision and trailing zeroes here:
    my $posixnumstr = $posixNumFormatter->format_number($unnumstr, $precision_strlen, 1);
    return $posixnumstr;
}
# e modifier to evaluate perl Code
(my $replaceString = $inputCSVString) =~ s/(\d[\d$thousands_sep$decimal_point]+)/"".convertNumStr($1).""/gle;
print "Orig string: " . $inputCSVString . "\n";
print "Conv string: " . $replaceString . "\n";
updated: this will convert numbers.numbers to numbersnumbers and numbers,numbers to numbers.numbers for any text:
sed -e 's/\([0-9]\+\)\.\([0-9]\+\)/\1\2/g' -e 's/\([0-9]\+\),\([0-9]\+\)/\1.\2/g'
Orig string: "AO900-020","Hello","World","5000","0,00","5,25","stk","","1","0,00","Test 2","42.234,12","","","0,00","","","","5,25"
Conv string: "AO900-020","Hello","World","5000","0.00","5.25","stk","","1","0.00","Test 2","42234.12","","","0.00","","","","5.25"
(same example i/o as OP's perl answer)
note: this would be very bad if you have any unquoted fields in your csv.
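For the CSV-aware route, here is a minimal Python sketch rather than a finished tool. It assumes the input locale always uses . for grouping and , for decimals, that a field counts as a number only when the whole field matches, and that re-quoting every field on output is acceptable. The names DK_NUMBER, dk_to_posix and convert_csv are made up for this example.

```python
import csv
import io
import re

# Assumed da_DK-style number: digits with "." groups of three and an
# optional "," plus decimals, or plain digits with "," plus decimals.
# Illustrative pattern only; negative numbers are not handled.
DK_NUMBER = re.compile(r'\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+,\d+')

def dk_to_posix(field):
    """Convert a field only if it is entirely a da_DK-formatted number."""
    if DK_NUMBER.fullmatch(field):
        # Drop grouping dots first, then turn the decimal comma into a dot.
        return field.replace('.', '').replace(',', '.')
    return field

def convert_csv(text):
    """Parse CSV text, convert numeric fields, and quote every output field."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator='\n')
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([dk_to_posix(field) for field in row])
    return out.getvalue()

line = '"AO900-020","Hello","5000","0,00","5,25","42.234,12",""'
print(convert_csv(line), end='')
```

Because fields are parsed before matching, a label such as AO900-020 is left alone. Ambiguous fields like 1.000 are read as one thousand under the stated assumption, which may or may not be what you want.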
There is a large string s, that contains item codes which are comma delimited.
e.g.:
$s="90320,328923,SKJS32767,DSIKUDIU,829EUE,AUSIUD0Q897,AJIUE98,
387493420DA,93RE,AKDJ93,SADI983,90439,JADKJ84";
In my application these strings are passed to a function, which returns the price of these items, i.e. the output of the function is corresponding price for the item code input.
However, due to certain limitations, the maximum length of $s should not exceed 16. If the length of $s exceeds 16, then an exception is thrown. Thus, these strings should be partitioned into an array, such that, the length of each element of array is less than or equal to 16.
e.g: After partitioning $s, the array is:
$Arr[0]='90320,328923', # Note: the first 16 characters are 90320,328923,SKJ.
However, SKJ is dropped because it is an incomplete (partial) item code.
$Arr[1]='SKJS32767',
$Arr[2]='DSIKUDIU,829EUE',
$Arr[3]='AUSIUD0Q897',
$Arr[4]='AJIUE98',
$Arr[5]='387493420DA,93RE'
For a given $s, the function should return an array, following the constraints noted above.
My approach has been to use the substr function, and extract a string up to a 16 offset, from an updated position index. Can it be done in a better way?
This is very simple using a global /g regular expression match.
This program demonstrates. The regex pattern looks for as many characters as possible up to a maximum of sixteen that must be followed by a comma or the end of the string.
However, my first thought was the same as RobEarl's comment - why not just put one field from the string into each element of the array? Is there really a need to pack more than one into an element just because it is possible?
use strict;
use warnings;
use 5.010;
my $s = '90320,328923,SKJS32767,DSIKUDIU,829EUE,AUSIUD0Q897,AJIUE98,387493420DA,93RE,AKDJ93,SADI983,90439,JADKJ84';
my @partitions;
while ( $s =~ /\G(.{0,16})(?:,|\z)/g ) {
    push @partitions, $1;
}
say for @partitions;
output
90320,328923
SKJS32767
DSIKUDIU,829EUE
AUSIUD0Q897
AJIUE98
387493420DA,93RE
AKDJ93,SADI983
90439,JADKJ84
You need to look at the length of the current string plus the current article number to determine if it is too long.
Split the long string into single articles. Append each article to the last element of the new list of strings if the result stays below 17 characters; otherwise push the article number onto the list as a fresh string.
use Data::Dump qw(dd);   # provides dd() used for the final dump
my $s="90320,328923,SKJS32767,DSIKUDIU,829EUE,AUSIUD0Q897,AJIUE98,387493420DA,93RE,AKDJ93,SADI983,90439,JADKJ84";
my @items = split /,/, $s;
my @strings = ( shift @items );
# defined() keeps the loop going even if an item code is "0"
while ( defined( my $item = shift @items ) ) {
    if ( length($strings[-1]) + length($item) > 15) { # 15 because of the comma
        push @strings, $item;
    } else {
        $strings[-1] .= ',' . $item;
    }
}
dd \@strings;
__END__
[
"90320,328923",
"SKJS32767",
"DSIKUDIU,829EUE",
"AUSIUD0Q897",
"AJIUE98",
"387493420DA,93RE",
"AKDJ93,SADI983",
"90439,JADKJ84",
]
How can I check whether a given string contains a certain substring, using Perl?
More specifically, I want to see whether s1.domain.example is present in the given string variable.
To find out if a string contains substring you can use the index function:
if (index($str, $substr) != -1) {
print "$str contains $substr\n";
}
It will return the position of the first occurrence of $substr in $str, or -1 if the substring is not found.
Another possibility is to use regular expressions which is what Perl is famous for:
if ($mystring =~ /s1\.domain\.example/) {
print qq("$mystring" contains "s1.domain.example"\n);
}
The backslashes are needed because a . can match any character. You can get around this by using the \Q and \E operators.
my $substring = "s1.domain.example";
if ($mystring =~ /\Q$substring\E/) {
print qq("$mystring" contains "$substring"\n);
}
Or, you can do as eugene y stated and use the index function.
Just a word of warning: Index returns a -1 when it can't find a match instead of an undef or 0.
Thus, this is an error:
my $substring = "s1.domain.example";
if (not index($mystring, $substring)) {
    print qq("$mystring" doesn't contain "$substring"\n);
}
This will be wrong if s1.domain.example is at the beginning of your string. I've personally been burned on this more than once.
Case Insensitive Substring Example
This is an extension of Eugene's answer, which converts the strings to lower case before checking for the substring:
if (index(lc($str), lc($substr)) != -1) {
print "$str contains $substr\n";
}
I have an array of 1000 or so entries, with examples below:
wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector
I would like to be able to split these into their respective words, as:
wicked weather
liquid weather
drive our trucks
go compact
slim projector
I was hoping a regular expression might do the trick. But since there is no boundary to stop on, nor any sort of capitalization that I could key on, I am thinking that some sort of reference to a dictionary might be necessary?
I suppose it could be done by hand, but why - when it can be done with code! =) But this has stumped me. Any ideas?
The Viterbi algorithm is much faster. It computes the same scores as the recursive search in Dmitry's answer above, but in O(n) time. (Dmitry's search takes exponential time; Viterbi does it by dynamic programming.)
import re
from collections import Counter
def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]
def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower())
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
Testing it:
>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'
To be practical you'll likely want a couple refinements:
Add logs of probabilities, don't multiply probabilities. This avoids floating-point underflow.
Your inputs will in general use words not in your corpus. These substrings must be assigned a nonzero probability as words, or you end up with no solution or a bad solution. (That's just as true for the above exponential search algorithm.) This probability has to be siphoned off the corpus words' probabilities and distributed plausibly among all other word candidates: the general topic is known as smoothing in statistical language models. (You can get away with some pretty rough hacks, though.) This is where the O(n) Viterbi algorithm blows away the search algorithm, because considering non-corpus words blows up the branching factor.
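Both refinements can be sketched in a few lines of Python. This is only an illustration: the counts and the corpus size are invented, and the unknown-word penalty (a probability that shrinks tenfold per character) is one of the rough hacks alluded to, not a proper smoothing scheme.

```python
import math

# Invented counts for illustration; a real run would count a corpus.
COUNTS = {'wicked': 20, 'weather': 50, 'we': 3000, 'at': 5000,
          'her': 4000, 'wick': 2, 'ed': 5}
TOTAL = 1_000_000          # assumed corpus size, not the sum of COUNTS
MAX_WORD_LENGTH = max(map(len, COUNTS))

def log_word_prob(word):
    """Log-probability of a word, with a crude penalty for unseen strings."""
    count = COUNTS.get(word)
    if count is not None:
        return math.log(count / TOTAL)
    # Assumed hack: unseen strings get probability 10 / (TOTAL * 10**len),
    # so longer unknown strings are penalised more heavily.
    return math.log(10.0 / TOTAL) - len(word) * math.log(10.0)

def viterbi_segment_log(text):
    """Viterbi segmentation that adds log-probabilities instead of multiplying."""
    best, lasts = [0.0], [0]
    for i in range(1, len(text) + 1):
        score, k = max((best[j] + log_word_prob(text[j:i]), j)
                       for j in range(max(0, i - MAX_WORD_LENGTH), i))
        best.append(score)
        lasts.append(k)
    words = []
    i = len(text)
    while i > 0:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, best[-1]

print(viterbi_segment_log('wickedweather')[0])
print(viterbi_segment_log('wickedxyzweather')[0])
```

With these made-up numbers the unknown chunk xyz survives as a single token instead of making the whole input unsplittable, and the scores stay in a comfortable floating-point range however long the input gets.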
Can a human do it?
farsidebag
far sidebag
farside bag
far side bag
Not only do you have to use a dictionary, you might have to use a statistical approach to figure out what's most likely (or, god forbid, an actual HMM for your human language of choice...)
For how to do statistics that might be helpful, I refer you to Dr. Peter Norvig, who addresses a different but related problem, spell-checking, in 21 lines of code:
http://norvig.com/spell-correct.html
(he does cheat a bit by folding every for loop into a single line.. but still).
Update: This got stuck in my head, so I had to birth it today. This code does a similar split to the one described by Robert Gamble, but then it orders the results based on word frequency in the provided dictionary file (which is now expected to be some text representative of your domain or English in general). I used big.txt from Norvig, linked above, and catted a dictionary to it, to cover missing words.
A combination of two words will most of the time beat a combination of 3 words, unless the frequency difference is enormous.
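A quick calculation shows why. The counts below are hypothetical; the point is only that each extra word multiplies the score by another small probability, so a three-word split needs much more frequent words to win.

```python
# Hypothetical counts in an assumed corpus of one million tokens.
TOTAL = 1_000_000
COUNTS = {'slim': 100, 'projector': 10, 'project': 500, 'or': 5000}

def joint_score(words):
    """Product of unigram probabilities, treating words as independent."""
    score = 1.0
    for word in words:
        score *= COUNTS[word] / TOTAL
    return score

two = joint_score(['slim', 'projector'])
three = joint_score(['slim', 'project', 'or'])
# "project" and "or" are far more frequent than "projector" here,
# yet the two-word split still scores higher.
print(two > three)
```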
I posted this code with some minor changes on my blog
http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/
and also wrote a little about the underflow bug in this code.. I was tempted to just quietly fix it, but figured this may help some folks who haven't seen the log trick before:
http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/
Output on your words, plus a few of my own -- notice what happens with "orcore":
perl splitwords.pl big.txt words
answerveal: 2 possibilities
- answer veal
- answer ve al
wickedweather: 4 possibilities
- wicked weather
- wicked we at her
- wick ed weather
- wick ed we at her
liquidweather: 6 possibilities
- liquid weather
- liquid we at her
- li quid weather
- li quid we at her
- li qu id weather
- li qu id we at her
driveourtrucks: 1 possibilities
- drive our trucks
gocompact: 1 possibilities
- go compact
slimprojector: 2 possibilities
- slim projector
- slim project or
orcore: 3 possibilities
- or core
- or co re
- orc ore
Code:
#!/usr/bin/env perl
use strict;
use warnings;
sub find_matches($);
sub find_matches_rec($\@\@);
sub find_word_seq_score(@);
sub get_word_stats($);
sub print_results($@);
sub Usage();
our(%DICT,$TOTAL);
{
    my( $dict_file, $word_file ) = @ARGV;
    ($dict_file && $word_file) or die(Usage);
    {
        my $DICT;
        ($DICT, $TOTAL) = get_word_stats($dict_file);
        %DICT = %$DICT;
    }
    {
        open( my $WORDS, '<', $word_file ) or die "unable to open $word_file\n";
        foreach my $word (<$WORDS>) {
            chomp $word;
            my $arr = find_matches($word);
            local $_;
            # Schwartzian Transform
            my @sorted_arr =
                map { $_->[0] }
                sort { $b->[1] <=> $a->[1] }
                map {
                    [ $_, find_word_seq_score(@$_) ]
                }
                @$arr;
            print_results( $word, @sorted_arr );
        }
        close $WORDS;
    }
}
sub find_matches($){
    my( $string ) = @_;
    my @found_parses;
    my @words;
    find_matches_rec( $string, @words, @found_parses );
    return @found_parses if wantarray;
    return \@found_parses;
}
sub find_matches_rec($\@\@){
    my( $string, $words_sofar, $found_parses ) = @_;
    my $length = length $string;
    unless( $length ){
        push @$found_parses, $words_sofar;
        return @$found_parses if wantarray;
        return $found_parses;
    }
    foreach my $i ( 2..$length ){
        my $prefix = substr($string, 0, $i);
        my $suffix = substr($string, $i, $length-$i);
        if( exists $DICT{$prefix} ){
            my @words = ( @$words_sofar, $prefix );
            find_matches_rec( $suffix, @words, @$found_parses );
        }
    }
    return @$found_parses if wantarray;
    return $found_parses;
}
## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
sub find_word_seq_score(@){
    my( @words ) = @_;
    local $_;
    my $score = 1;
    foreach ( @words ){
        $score = $score * $DICT{$_} / $TOTAL;
    }
    return $score;
}
sub get_word_stats($){
    my ($filename) = @_;
    open(my $DICT, '<', $filename) or die "unable to open $filename\n";
    local $/= undef;
    local $_;
    my %dict;
    my $total = 0;
    while ( <$DICT> ){
        foreach ( split(/\b/, $_) ) {
            $dict{$_} += 1;
            $total++;
        }
    }
    close $DICT;
    return (\%dict, $total);
}
sub print_results($@){
    #( 'word', [qw'test one'], [qw'test two'], ... )
    my ($word, @combos) = @_;
    local $_;
    my $possible = scalar @combos;
    print "$word: $possible possibilities\n";
    foreach (@combos) {
        print ' - ', join(' ', @$_), "\n";
    }
    print "\n";
}
sub Usage(){
    return "$0 /path/to/dictionary /path/to/your_words";
}
pip install wordninja
>>> import wordninja
>>> wordninja.split('bettergood')
['better', 'good']
The best tool for the job here is recursion, not regular expressions. The basic idea is to start from the beginning of the string looking for a word, then take the remainder of the string and look for another word, and so on until the end of the string is reached. A recursive solution is natural since backtracking needs to happen when a given remainder of the string cannot be broken into a set of words. The solution below uses a dictionary to determine what is a word and prints out solutions as it finds them (some strings can be broken out into multiple possible sets of words, for example wickedweather could be parsed as "wicked we at her"). If you just want one set of words you will need to determine the rules for selecting the best set, perhaps by selecting the solution with fewest number of words or by setting a minimum word length.
#!/usr/bin/perl
use strict;
my $WORD_FILE = '/usr/share/dict/words'; #Change as needed
my %words; # Hash of words in dictionary
# Open dictionary, load words into hash
open(WORDS, $WORD_FILE) or die "Failed to open dictionary: $!\n";
while (<WORDS>) {
    chomp;
    $words{lc($_)} = 1;
}
close(WORDS);
# Read one line at a time from stdin, break into words
while (<>) {
    chomp;
    my @words;
    find_words(lc($_));
}
sub find_words {
    # Print every way $string can be parsed into whole words
    my $string = shift;
    my @words = @_;
    my $length = length $string;
    foreach my $i ( 1 .. $length ) {
        my $word = substr $string, 0, $i;
        my $remainder = substr $string, $i, $length - $i;
        # Some dictionaries contain each letter as a word
        next if ($i == 1 && ($word ne "a" && $word ne "i"));
        if (defined($words{$word})) {
            push @words, $word;
            if ($remainder eq "") {
                print join(' ', @words), "\n";
                return;
            } else {
                find_words($remainder, @words);
            }
            pop @words;
        }
    }
    return;
}
I think you're right in thinking that it's not really a job for a regular expression. I would approach this using the dictionary idea - look for the longest prefix that is a word in the dictionary. When you find that, chop it off and do the same with the remainder of the string.
The above method is subject to ambiguity, for example "drivereallyfast" would first find "driver" and then have trouble with "eallyfast". So you would also have to do some backtracking if you ran into this situation. Or, since you don't have that many strings to split, just do by hand the ones that fail the automated split.
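The longest-prefix idea with backtracking can be sketched in Python as follows. The function name and the tiny word set are made up for the example, and there is no memoization, so this is meant for short strings like the ones in the question.

```python
def split_longest_first(text, dictionary):
    """Return one segmentation of text, trying the longest prefix first.

    Returns None when no sequence of dictionary words covers the text.
    """
    if not text:
        return []
    for end in range(len(text), 0, -1):
        prefix = text[:end]
        if prefix in dictionary:
            rest = split_longest_first(text[end:], dictionary)
            if rest is not None:      # the remainder could be split too
                return [prefix] + rest
            # otherwise backtrack and try a shorter prefix
    return None

# Tiny made-up dictionary for illustration.
WORDS = {'drive', 'driver', 'really', 'fast'}
print(split_longest_first('drivereallyfast', WORDS))
```

Here the first attempt takes "driver", gets stuck on "eallyfast", and falls back to "drive". Strings that come back as None are the ones left to split by hand.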
This is related to a problem known as identifier splitting or identifier name tokenization. In the OP's case, the inputs seem to be concatenations of ordinary words; in identifier splitting, the inputs are class names, function names or other identifiers from source code, and the problem is harder. I realize this is an old question and the OP has either solved their problem or moved on, but in case someone else comes across this question while looking for identifier splitters (like I was, not long ago), I would like to offer Spiral ("SPlitters for IdentifieRs: A Library"). It is written in Python but comes with a command-line utility that can read a file of identifiers (one per line) and split each one.
Splitting identifiers is deceptively difficult. Programmers commonly use abbreviations, acronyms and word fragments when naming things, and they don't always use consistent conventions. Even when identifiers do follow some convention such as camel case, ambiguities can arise.
Spiral implements numerous identifier splitting algorithms, including a novel algorithm called Ronin. It uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin can split identifiers that do not use camel case or other naming conventions, including cases such as splitting J2SEProjectTypeProfiler into [J2SE, Project, Type, Profiler], which requires the reader to recognize J2SE as a unit. Here are some more examples of what Ronin can split:
# spiral mStartCData nonnegativedecimaltype getUtf8Octets GPSmodule savefileas nbrOfbugs
mStartCData: ['m', 'Start', 'C', 'Data']
nonnegativedecimaltype: ['nonnegative', 'decimal', 'type']
getUtf8Octets: ['get', 'Utf8', 'Octets']
GPSmodule: ['GPS', 'module']
savefileas: ['save', 'file', 'as']
nbrOfbugs: ['nbr', 'Of', 'bugs']
Using the examples from the OP's question:
# spiral wickedweather liquidweather driveourtrucks gocompact slimprojector
wickedweather: ['wicked', 'weather']
liquidweather: ['liquid', 'weather']
driveourtrucks: ['driveourtrucks']
gocompact: ['go', 'compact']
slimprojector: ['slim', 'projector']
As you can see, it is not perfect. It's worth noting that Ronin has a number of parameters and adjusting them makes it possible to split driveourtrucks too, but at the cost of worsening performance on program identifiers.
More information can be found in the GitHub repo for Spiral.
A simple solution with Python: install the wordsegment package: pip install wordsegment.
$ echo thisisatest | python -m wordsegment
this is a test
Well, the problem itself is not solvable with just a regular expression. A solution (probably not the best) would be to get a dictionary and do a regular expression match for each word in the dictionary against each word in the list, adding the space whenever successful. Certainly this would not be terribly quick, but it would be easy to program and faster than doing it by hand.
A dictionary based solution would be required. This might be simplified somewhat if you have a limited dictionary of words that can occur, otherwise words that form the prefix of other words are going to be a problem.
There is a Python package called mlmorph, released by Santhosh Thottingal, which can be used for morphological analysis.
https://pypi.org/project/mlmorph/
Examples:
from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")
Gives
[('കേരളം<np><genitive>', 179)]
He also wrote a blog post on the topic: https://thottingal.in/blog/2017/11/26/towards-a-malayalam-morphology-analyser/
This will work if the strings are camelCase (JavaScript):
function spinalCase(str) {
let lowercase = str.trim()
let regEx = /\W+|(?=[A-Z])|_/g
let result = lowercase.split(regEx).join("-").toLowerCase()
return result;
}
spinalCase("AllThe-small Things");
One of the solutions could be with recursion (the same can be converted into dynamic programming):
static List<String> wordBreak(
String input,
Set<String> dictionary
) {
List<List<String>> result = new ArrayList<>();
List<String> r = new ArrayList<>();
helper(input, dictionary, result, "", 0, new Stack<>());
for (List<String> strings : result) {
String s = String.join(" ", strings);
r.add(s);
}
return r;
}
static void helper(
final String input,
final Set<String> dictionary,
final List<List<String>> result,
String state,
int index,
Stack<String> stack
) {
if (index == input.length()) {
// add the last word
stack.push(state);
for (String s : stack) {
if (!dictionary.contains(s)) {
return;
}
}
result.add((List<String>) stack.clone());
return;
}
if (dictionary.contains(state)) {
// bifurcate
stack.push(state);
helper(input, dictionary, result, "" + input.charAt(index),
index + 1, stack);
String pop = stack.pop();
String s = stack.pop();
helper(input, dictionary, result, s + pop.charAt(0),
index + 1, stack);
}
else {
helper(input, dictionary, result, state + input.charAt(index),
index + 1, stack);
}
return;
}
The other possible solution would be to use a trie data structure.
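A trie-based variant could look like the following Python sketch. It is an illustration only: the nested-dict trie, the '$' end marker and the function names are assumptions of this example, and the input is assumed not to contain the character '$'.

```python
def build_trie(words):
    """Build a nested-dict trie; the key '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def word_break(text, trie):
    """Return every way to split text into words stored in the trie."""
    results = []

    def walk(start, parts):
        if start == len(text):
            results.append(' '.join(parts))
            return
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:          # no dictionary word has this prefix
                return
            if '$' in node:           # text[start:end + 1] is a word
                walk(end + 1, parts + [text[start:end + 1]])

    walk(0, [])
    return results

trie = build_trie(['wicked', 'weather', 'we', 'at', 'her', 'wick', 'ed'])
print(word_break('wickedweather', trie))
```

The advantage over a plain set lookup is that the inner loop stops as soon as no dictionary word starts with the characters seen so far, instead of testing every possible prefix.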
output :-
['better', 'good'] ['coffee', 'shop']
['coffee', 'shop']
pip install wordninja
import wordninja
n=wordninja.split('bettergood')
m=wordninja.split("coffeeshop")
print(n,m)
list=['hello','coffee','shop','better','good']
mat='coffeeshop'
expected=[]
for i in list:
    if i in mat:
        expected.append(i)
print(expected)
So I spent like 2 days on this answer, since I need it for my own NLP work. My answer is derived from Darius Bacon's answer, which itself was derived from the Viterbi algorithm. I also abstracted it to take each word in a message, attempt to split it, and then reassemble the message. I expanded Darius's code to make it debuggable. I also swapped out the need for "big.txt", and use the wordfreq library instead. Some comments stress the need to use a non-zero word frequency for non-existent words. I found that using any frequency higher than zero would cause "itseasyformetosplitlongruntogetherblocks" to undersplit into "itseasyformetosplitlongruntogether blocks". The algorithm in general tends to either oversplit or undersplit various test messages depending on how you combine word frequencies and how you handle missing word frequencies. I played around with many tweaks until it behaved well. My solution uses a 0.0 frequency for missing words. It also adds a reward for word length (otherwise it tends to split words into characters). I tried many length rewards, and the one that seems to work best for my test cases is word_frequency * (e ** word_length). There were also comments warning against multiplying word frequencies together. I tried adding them, using the harmonic mean, and using 1-freq instead of the 0.00001 form. They all tended to oversplit the test cases. Simply multiplying word frequencies together worked best. I left my debugging print statements in there, to make it easier for others to continue tweaking. Finally, there's a special case where if your whole message is a word that doesn't exist, like "Slagle's", then the function splits the word into individual letters. In my case, I don't want that, so I have a special return statement at the end to return the original message in those cases.
import numpy as np
from wordfreq import get_frequency_dict
word_prob = get_frequency_dict(lang='en', wordlist='large')
max_word_len = max(map(len, word_prob)) # 34
def viterbi_segment(text, debug=False):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        new_probs = []
        for j in range(max(0, i - max_word_len), i):
            substring = text[j:i]
            length_reward = np.exp(len(substring))
            freq = word_prob.get(substring, 0) * length_reward
            compounded_prob = probs[j] * freq
            new_probs.append((compounded_prob, j))
            if debug:
                print(f'[{j}:{i}] = "{text[lasts[j]:j]} & {substring}" = ({probs[j]:.8f} & {freq:.8f}) = {compounded_prob:.8f}')
        prob_k, k = max(new_probs)  # max over tuples compares the first elements first, i.e. the compounded probabilities
        probs.append(prob_k)
        lasts.append(k)
        if debug:
            print(f'i = {i}, prob_k = {prob_k:.8f}, k = {k}, ({text[k:i]})\n')
    # when text is a word that doesn't exist, the algorithm breaks it into individual letters.
    # in that case, return the original word instead
    if len(set(lasts)) == len(text):
        return text
    words = []
    k = len(text)
    while 0 < k:
        word = text[lasts[k]:k]
        words.append(word)
        k = lasts[k]
    words.reverse()
    return ' '.join(words)
def split_message(message):
    new_message = ' '.join(viterbi_segment(wordmash, debug=False) for wordmash in message.split())
    return new_message
messages = [
    'tosplit',
    'split',
    'driveourtrucks',
    "Slagle's",
    "Slagle's wickedweather liquidweather driveourtrucks gocompact slimprojector",
    'itseasyformetosplitlongruntogetherblocks',
]
for message in messages:
    print(f'{message}')
    new_message = split_message(message)
    print(f'{new_message}\n')
tosplit
to split
split
split
driveourtrucks
drive our trucks
Slagle's
Slagle's
Slagle's wickedweather liquidweather driveourtrucks gocompact slimprojector
Slagle's wicked weather liquid weather drive our trucks go compact slim projector
itseasyformetosplitlongruntogetherblocks
its easy for me to split long run together blocks
I may get downmodded for this, but have the secretary do it.
You'll spend more time on a dictionary solution than it would take to manually process. Further, you won't possibly have 100% confidence in the solution, so you'll still have to give it manual attention anyway.