Any perl standard library to check if a string contains a given substring - string

Given a query, I would like to check if this contains a given substring (can contain more than one word) . But I don't want exhaustive search, because this substring can only start a fresh word.
Any perl standard libraries for this, so that I get something efficient and don't have to reinvent the wheel?
Thanks,

Maybe you'll find builtin index() suited for the job.
It's a very fast substring search function ( implements the Boyer-Moore algorithm ).
Just check its documentation with perldoc -f index.

I would make a hash with the key being the first word of the 9000 substrings and the value an array with all substrings with that first word. If many strings contain the same first word, you could use the first two words.
Then for each query, for each word, I would see if that word is in the hash, and then need to match only those strings in the hash's array, starting at that point in the string using the index function.
Assuming that matching is sparse, this would be pretty efficient. One hash lookup per word and minimal searching for potential matches.
As I write this it reminds me of an Aho-Corasick search. (See Algorithm::AhoCorasick in CPAN.) I've never used the module, but the algorithm spends a lot of time building a finite state machine out of the search keys so finding a match is super efficient. I don't know if the CPAN implementation handles word boundaries issues.

You can use this approach:
# init
my $re = join"|", map quotemeta, sort #substrings;
$re = qr/\b(?:$re)/;
# usage
while (<>) {
found($1) if /($re)/;
}
where found is action what you want to do if substring found.

The builtin index function is the fastest general purpose way to check if a string contains a substring.
my $find = 'abc';
my $str = '123 abc xyz';
if (index($str, $find) != -1) {
# process matching $str here
}
If index still is not fast enough, and you know where in the string your substring might be, you can narrow down on it using substr and then use eq for the actual comparison:
my $find = 'abc';
my $str = '123 abc xyz';
if (substr($str, 4, 3) eq $find) {
# process matching $str here
}
You are not going to get faster than that in Perl without dropping down to C.

This sounds like the perfect job for regular expressions:
if($string =~ m/your substring/) {
say "substring found";
} else {
say "nothing found";
}

Related

I have a string, need that string to be compared with list of strings in TCL

Need to compare string1 with string2 in TCL
set string1 {laptop Keyboard mouse MONITOR PRINTER}
set string2 {mouse}
Well, you can use:
if {$string2 in $string1} {
puts "present in the list"
}
Or you can use lsearch if you want to know where (it returns the index that it finds the element at, or -1 if it isn't there). This is most useful when you want to know where in the list the value is. It also has options to do binary searching (if you know the list is sorted) which is far faster than a linear search.
set idx [lsearch -exact $string1 $string2]
if {$idx >= 0} {
puts "present in the list at index $idx"
}
But if you are doing a lot of searching, it can be best to create a hash table using an array or a dictionary. Those are extremely fast but require some setup. Whether the setup costs are worth it depends on your application.
set words {}
foreach word $string1 {dict set words $word 1}
if {[dict exists $words $string2]} {
puts "word is present"
}
Note that if you're dealing with ordinary user input, you probably want a sanitization step or two. Tcl lists aren't exactly sentences, and the differences can really catch you out once you move to production. The two main tools for that are split and regexp -all -inline.
set words [split $sentence]
set words [regexp -all -inline {\S+} $sentence]
Understanding how to do the cleanup requires understanding your input data more completely than I do.
There is string first
if {[string first $string2 $string1] != -1} {
puts "string1 contains string2"
}
or
if {[string match *$string2* $string1]} {
puts "string1 contains string2"
}

Distance between matched substrings

I have a chromosome sequence and have to find subsequences in it and the distances between them.
For example:
string:
AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT
Substring:
ACGT
I have to find the distance between all occurrences of ACGT.
I normally do not recommend answering posts where it is obvious the OP just wants other people to do their work. However, there is already one answer the use of which will be problematic if input strings are largish, so here is something that uses Perl builtins.
The special variable #- stores the positions of matches after a pattern matches.
use strict;
use warnings;
use Data::Dumper;
my $string = 'AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT';
my #pos;
while ( $string =~ /ACGT/g ) {
push #pos, $-[0];
}
my #dist;
for my $i (1 .. $#pos) {
push #dist, $pos[$i] - $pos[$i - 1];
}
print Dumper(\#pos, \#dist);
This method uses less memory than splitting the original string (which may be a problem if the original string is large enough). Its memory footprint can be further reduced, but I focused on clarity by showing the accumulation of match positions and the calculation of deltas separately.
One open question is whether you want the index of the first match from the beginning of the string. Strictly speaking, "distances between matches" excludes that.
use strict;
use warnings;
use Data::Dumper;
my $string = 'AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT';
my #dist;
my $last;
while ($string =~ /ACGT/g) {
no warnings 'uninitialized';
push #dist, $-[0] - $last;
$last = $-[0];
}
# Do we want the distance of the first
# match from the beginning of the string?
shift #dist;
print Dumper \#dist;
Of course, it is possible to use index for this as well, but it looks considerably uglier.
You may split your input string by "ACGT" and remove the first and the last elements of the returned array to get all fragments between "ACGT". Then calculate lengths of this fragments:
my $input = "AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT";
my #fragments = split("ACGT", $input, -1);
#fragments = #fragments[1..$#fragments - 1];
my #dist_arr = map {length} #fragments;
Demo: https://ideone.com/AqEwGu

Perl Morgan and a String?

I am trying to solve this problem on hackerrank:
So the problem is:
Jack and Daniel are friends. Both of them like letters, especially upper-case ones.
They are cutting upper-case letters from newspapers, and each one of them has their collection of letters stored in separate stacks.
One beautiful day, Morgan visited Jack and Daniel. He saw their collections. Morgan wondered what is the lexicographically minimal string, made of that two collections. He can take a letter from a collection when it is on the top of the stack.
Also, Morgan wants to use all the letters in the boys' collections.
This is my attempt in Perl:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $n=<>);
while($n>0){
chomp(my $string1=<>);
chomp(my $string2=<>);
lexi($string1,$string2);
$n--;
}
sub lexi{
my($str1,$str2)=#_;
my #str1=split(//,$str1);
my #str2=split(//,$str2);
my $final_string="";
while(#str2 && #str1){
my $st2=$str2[0];
my $st1=$str1[0];
if($st1 le $st2){
$final_string.=$st1;
shift #str1;
}
else{
$final_string.=$st2;
shift #str2;
}
}
if(#str1){
$final_string=$final_string.join('',#str1);
}
else{
$final_string=$final_string.join('',#str2);
}
print $final_string,"\n";
}
Sample Input:
2
JACK
DANIEL
ABACABA
ABACABA
The first line contains the number of test cases, T.
Every next two lines have such format: the first line contains string A, and the second line contains string B.
Sample Output:
DAJACKNIEL
AABABACABACABA
But for Sample test-case it is giving right results while it is giving wrong results for other test-cases. One case for which it gives an incorrect result is
1
AABAC
AACAB
It outputs AAAABACCAB instead of AAAABACABC.
I don't know what is wrong with the algorithm and why it is failing with other test cases?
Update:
As per #squeamishossifrage comments If I add
($str1,$str2)=sort{$a cmp $b}($str1,$str2);
The results become same irrespective of user-inputs but still the test-case fails.
The problem is in your handling of the equal characters. Take the following example:
ACBA
BCAB
When faced with two identical characters (C in my example), you naïvely chose the one from the first string, but that's not always correct. You need to look ahead to break ties. You may even need to look many characters ahead. In this case, next character after C of the second string is lower than the next character of the first string, so you should take the C from the second string first.
By leaving the strings as strings, a simple string comparison will compare as many characters as needed to determine which character to consume.
sub lexi {
my ($str1, $str2) = #_;
utf8::downgrade($str1); # Makes sure length() will be fast
utf8::downgrade($str2); # since we only have ASCII letters.
my $final_string = "";
while (length($str2) && length($str1)) {
$final_string .= substr($str1 le $str2 ? $str1 : $str2, 0, 1, '');
}
$final_string .= $str1;
$final_string .= $str2;
print $final_string, "\n";
}
Too little rep to comment thus the answer:
What you need to do is to look ahead if the two characters match. You currently do a simple le match and in the case of
ZABB
ZAAA
You'll get ZABBZAA since the first match Z will be le Z. So what you need to do (a naive solution which most likely won't be very effective) is to keep looking as long as the strings/chars match so:
Z eq Z
ZA eq ZA
ZAB gt ZAA
and at that point will you know that the second string is the one you want to pop from for the first character.
Edit
You updated with sorting the strings, but like I wrote you still need to look ahead. The sorting will solve the two above strings but will fail with these two:
ZABAZA
ZAAAZB
ZAAAZBZABAZA
Because here the correct answer is ZAAAZABAZAZB and you can't find that will simply comparing character per character

Lua - How to find a substring with 1 or 2 characters discrepancy

Say I have a string
local a = "Hello universe"
I find the substring "universe" by
a:find("universe")
Now, suppose the string is
local a = "un#verse"
The string to be searched is universe; but the substring differs by a single character.
So obviously Lua ignores it.
How do I make the function find the string even if there is a discrepancy by a single character?
If you know where the character would be, use . instead of that character: a:find("un.verse")
However, it looks like you're looking for a fuzzy string search. It is out of a scope for a Lua string library. You may want to start with this article: http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html
As for Lua fuzzy search implementations — I haven't used any, but googing "lua fuzzy search" gives a few results. Some are based on this paper: http://web.archive.org/web/20070518080535/http://www.heise.de/ct/english/97/04/386/
Try https://github.com/ajsher/luafuzzy.
It sounds like you want something along the lines of TRE:
TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.
Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.
A Lua binding for it is available as part of lrexlib.
If you are really looking for a single character difference and do not care about performance, here is a simple approach that should work:
local a = "Hello un#verse"
local myfind = function(s,p)
local withdot = function(n)
return p:sub(1,n-1) .. '.' .. p:sub(n+1)
end
local a,b
for i=1,#s do
a,b = s:find(withdot(i))
if a then return a,b end
end
end
print(myfind(a,"universe"))
A simple roll your own approach (based on the assumption that the pattern keeps the same length):
function hammingdistance(a,b)
local ta={a:byte(1,-1)}
local tb={b:byte(1,-1)}
local res = 0
for k=1,#a do
if ta[k]~=tb[k] then
res=res+1
end
end
print(a,b,res) -- debugging/demonstration print
return res
end
function fuz(s,pat)
local best_match=10000
local best_location
for k=1,#s-#pat+1 do
local cur_diff=hammingdistance(s:sub(k,k+#pat-1),pat)
if cur_diff < best_match then
best_location = k
best_match = cur_diff
end
end
local start,ending = math.max(1,best_location),math.min(best_location+#pat-1,#s)
return start,ending,s:sub(start,ending)
end
s=[[Hello, Universe! UnIvErSe]]
print(fuz(s,'universe'))
Disclaimer: not recommended, just for fun:
If you want a better syntax (and you don't mind messing with standard type's metatables) you could use this:
getmetatable('').__sub=hammingdistance
a='Hello'
b='hello'
print(a-b)
But note that a-b does not equal b-a this way.

What is the difference between string.find and string.match in Lua?

I am trying to understand what the difference is between string.find and string.match in Lua. To me it seems that both find a pattern in a string. But what is the difference? And how do I use each? Say, if I had the string "Disk Space: 3000 kB" and I wanted to extract the '3000' out of it.
EDIT: Ok, I think I overcomplicated things and now I'm lost. Basically, I need to translate this, from Perl to Lua:
my $mem;
my $memfree;
open(FILE, 'proc/meminfo');
while (<FILE>)
{
if (m/MemTotal/)
{
$mem = $_;
$mem =~ s/.*:(.*)/$1/;
}
elseif (m/MemFree/)
{
$memfree = $_;
$memfree =~ s/.*:(.*)/$1/;
}
}
close(FILE);
So far I've written this:
for Line in io.lines("/proc/meminfo") do
if Line:find("MemTotal") then
Mem = Line
Mem = string.gsub(Mem, ".*", ".*", 1)
end
end
But it is obviously wrong. What am I not getting? I understand why it is wrong, and what it is actually doing and why when I do
print(Mem)
it returns
.*
but I don't understand what is the proper way to do it. Regular expressions confuse me!
In your case, you want string.match:
local space = tonumber(("Disk Space 3000 kB"):match("Disk Space ([%.,%d]+) kB"))
string.find is slightly different, in that before returning any captures, it returns the start and end index of the substring found. When no captures are present, string.match will return the entire string matched, while string.find simply won't return anything past the second return value. string.find also lets you search the string without being aware of Lua patterns, by using the 'plain' parameter.
Use string.match when you want the matched captures, and string.find when you want the substring's position, or when you want both the position and captures.

Resources