What is the difference between string.find and string.match in Lua?

I am trying to understand what the difference is between string.find and string.match in Lua. To me it seems that both find a pattern in a string. But what is the difference? And how do I use each? Say, if I had the string "Disk Space: 3000 kB" and I wanted to extract the '3000' out of it.
EDIT: Ok, I think I overcomplicated things and now I'm lost. Basically, I need to translate this, from Perl to Lua:
my $mem;
my $memfree;
open(FILE, '/proc/meminfo');
while (<FILE>)
{
    if (m/MemTotal/)
    {
        $mem = $_;
        $mem =~ s/.*:(.*)/$1/;
    }
    elsif (m/MemFree/)
    {
        $memfree = $_;
        $memfree =~ s/.*:(.*)/$1/;
    }
}
close(FILE);
So far I've written this:
for Line in io.lines("/proc/meminfo") do
if Line:find("MemTotal") then
Mem = Line
Mem = string.gsub(Mem, ".*", ".*", 1)
end
end
But it is obviously wrong. What am I not getting? I understand why it is wrong, and what it is actually doing and why when I do
print(Mem)
it returns
.*
but I don't understand what is the proper way to do it. Regular expressions confuse me!

In your case, you want string.match:
local space = tonumber(("Disk Space: 3000 kB"):match("Disk Space: ([%.,%d]+) kB"))
string.find is slightly different: before returning any captures, it returns the start and end indices of the matched substring. When the pattern has no captures, string.match returns the entire matched substring, while string.find returns nothing beyond those two indices. string.find also lets you search for a literal substring, with no pattern matching at all, by passing true as its fourth ('plain') argument.
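To make the difference in return values concrete, here is a small sketch using the string from the question. The patterns are illustrative choices, not the only way to write them.

```lua
local s = "Disk Space: 3000 kB"

-- Pattern with a capture: find returns start index, end index, then the capture;
-- match returns only the capture.
print(s:find("(%d+) kB"))   --> 13  19  3000
print(s:match("(%d+) kB"))  --> 3000

-- Pattern without captures: find returns just the two indices;
-- match returns the whole matched substring.
print(s:find("%d+"))        --> 13  16
print(s:match("%d+"))       --> 3000

-- Plain search: a true fourth argument turns pattern matching off,
-- so "." is a literal dot rather than "any character".
print(("a.b"):find(".", 1, true))  --> 2  2
print(("a.b"):find("."))           --> 1  1
```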
Use string.match when you want the matched captures, and string.find when you want the substring's position, or when you want both the position and captures.
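For the Perl snippet in the question, one possible translation looks like the sketch below. The helper name parse_meminfo and the sample lines are made up for illustration. The function takes a table of lines so it can be tried without /proc/meminfo; with a real file you would loop over io.lines("/proc/meminfo") instead.

```lua
-- Hypothetical helper: keeps the text after the last colon on the
-- MemTotal and MemFree lines, like the Perl s/.*:(.*)/$1/ substitution.
local function parse_meminfo(lines)
  local mem, memfree
  for _, line in ipairs(lines) do
    -- plain find (fourth argument true): a literal substring test
    if line:find("MemTotal", 1, true) then
      mem = line:match(".*:(.*)")
    elseif line:find("MemFree", 1, true) then
      memfree = line:match(".*:(.*)")
    end
  end
  return mem, memfree
end

-- Sample data (invented values) in the usual /proc/meminfo layout.
local sample = {
  "MemTotal:        8048576 kB",
  "MemFree:         1234567 kB",
}

local mem, memfree = parse_meminfo(sample)
print(mem, memfree)

-- If a number is wanted rather than the raw text, capture only the digits.
local total_kb = tonumber(sample[1]:match("(%d+) kB"))
print(total_kb)  --> 8048576
```

Like the Perl original, the capture keeps the leading spaces and the trailing "kB"; the digits-only capture at the end is usually closer to what is wanted.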

Related

Match whole, exact text line with Lua

I'm looking for a little help on some Lua. I need some code to match this exact line:
efs.test efs.test.gpg
Here's what I have so far, which matches "efs.test":
if string.match(a.message, "%a+%a+%a+.%%a+%a+%a+%a+") then
print(a.message)
else
print ("Does not match")
end
I've also tried this, which matches:
if string.match(a.message, "efs.test") then
print(a.message)
else
print ("Does not match")
end
But when I try to add the extra text my compiler errors with "Number expected, got string" when running this code:
if string.match(a.message, "efs.test", "efs") then
print(a.message)
else
print ("Does not match")
end
Any pointers would be great!
Thanks.
if string.match(a.message, "%a+%a+%a+.%%a+%a+%a+%a+") then
Firstly, this is a wrong use of quantifiers. From PiL 20.2:
+ 1 or more repetitions
* 0 or more repetitions
- also 0 or more repetitions
? optional (0 or 1 occurrence)
In other words, you try to match another %a+ after the first greedy %a+ has already consumed the whole word. Note also that an unescaped . matches any character and %% matches a literal percent sign; a literal dot is written %.
To match efs.test efs.test.gpg, which I take to be two file names, assume for now that a file name contains only %w, i.e. alphanumeric characters (A-Za-z0-9). This correctly matches efs.test:
string.match(message, "%w+%.%w+")
Going one step further, match efs.test as filename and the following filename:
string.match(message, "%w+%.%w+ %w+%.%w+%.gpg")
While this would match both filenames, you would need to check if matched filenames are the same. We can go one step further yet:
local file, gpgfile = string.match(message, "(%w+%.%w+) (%1%.gpg)")
This pattern will return any <filename> <filename>.gpg where the filenames are equal.
With the use of capture-groups, we capture the filename: it will be returned as the first variable and further represented as %1. Then after the space char, we try to match for %1 (captured filename) followed by .gpg. Since it's also enclosed in brackets, it will become the second captured group and returned as the second variable. Done!
PS: You may want to grab ".gpg" by case-insensitive [Gg][Pp][Gg] pattern.
PPS: File names may contain spaces, dashes, UTF-8 characters etc. E.g. ext4 only forbids \0 and / characters.
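As a quick check of the back-reference pattern above, here is a sketch that wraps it in a function. The name same_file_pair is hypothetical, and the ^ and $ anchors are my addition so that the whole line has to match, as the question title asks.

```lua
-- Hypothetical helper: returns both file names when the line is exactly
-- "<name> <name>.gpg" with the same <name> twice, otherwise nil.
local function same_file_pair(message)
  -- %1 refers back to whatever the first capture matched.
  return string.match(message, "^(%w+%.%w+) (%1%.gpg)$")
end

print(same_file_pair("efs.test efs.test.gpg"))    --> efs.test  efs.test.gpg
print(same_file_pair("efs.test other.test.gpg"))  --> nil
```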
string.match's optional third argument is the index in the given string at which to start searching. If you are looking for exactly efs.test efs.test.gpg, in that order and with that spacing, why not just use:
string.match(a.message, "efs%.test efs%.test%.gpg")
If you want to match the entire line containing that substring:
string.match(a.message, ".*efs%.test efs%.test%.gpg.*")
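A short sketch of what the third argument does, and of why passing a string there fails as in the question. The variable msg stands in for a.message.

```lua
local msg = "efs.test efs.test.gpg"

-- The third argument is a start index, so the occurrence at index 1 is skipped.
print(string.find(msg, "efs%.test", 2))   --> 10  17
print(string.match(msg, "efs%.test", 2))  --> efs.test

-- A non-numeric string in that position is rejected; pcall catches the error.
local ok, err = pcall(string.match, msg, "efs.test", "efs")
print(ok, err)

-- An unescaped dot matches any character; an escaped dot matches only a dot.
print(string.match("efsXtest", "efs.test"))   --> efsXtest
print(string.match("efsXtest", "efs%.test"))  --> nil
```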
If you are trying to match that exact line, it's way easier to just use:
if "efs.test efs.test.gpg" == a.message then
print(a.message)
else
print("string does not match!")
end
Of course this wouldn't find any other strings than this.
Another interpretation I see for your question is that you want to know if it has efs.test in the string, which you should be able to accomplish by doing:
if string.match(a.message, "%w+%.%w+") == "efs.test" then
...
end
Also, look into regular expressions: Lua patterns are not full regular expressions, but they share many of the same ideas.

Distance between matched substrings

I have a chromosome sequence and have to find subsequences in it and the distances between them.
For example:
string:
AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT
Substring:
ACGT
I have to find the distance between all occurrences of ACGT.
I normally do not recommend answering posts where it is obvious the OP just wants other people to do their work. However, there is already one answer the use of which will be problematic if input strings are largish, so here is something that uses Perl builtins.
The special variable @- stores the start offsets of the most recent successful match; $-[0] is the offset where the whole match begins.
use strict;
use warnings;
use Data::Dumper;
my $string = 'AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT';
my @pos;
while ( $string =~ /ACGT/g ) {
    push @pos, $-[0];
}
my @dist;
for my $i (1 .. $#pos) {
    push @dist, $pos[$i] - $pos[$i - 1];
}
print Dumper(\@pos, \@dist);
This method uses less memory than splitting the original string (which may be a problem if the original string is large enough). Its memory footprint can be further reduced, but I focused on clarity by showing the accumulation of match positions and the calculation of deltas separately.
One open question is whether you want the index of the first match from the beginning of the string. Strictly speaking, "distances between matches" excludes that.
use strict;
use warnings;
use Data::Dumper;
my $string = 'AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT';
my @dist;
my $last;
while ($string =~ /ACGT/g) {
    no warnings 'uninitialized';
    push @dist, $-[0] - $last;
    $last = $-[0];
}
# Do we want the distance of the first
# match from the beginning of the string?
shift @dist;
print Dumper \@dist;
Of course, it is possible to use index for this as well, but it looks considerably uglier.
You may split your input string by "ACGT" and remove the first and the last elements of the returned array to get all fragments between "ACGT". Then calculate the lengths of these fragments:
my $input = "AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT";
my @fragments = split("ACGT", $input, -1);
@fragments = @fragments[1..$#fragments - 1];
my @dist_arr = map {length} @fragments;
Demo: https://ideone.com/AqEwGu

Lua - How to find a substring with 1 or 2 characters discrepancy

Say I have a string
local a = "Hello universe"
I find the substring "universe" by
a:find("universe")
Now, suppose the string is
local a = "un#verse"
The string to be searched is universe; but the substring differs by a single character.
So obviously Lua ignores it.
How do I make the function find the string even if there is a discrepancy by a single character?
If you know where the character would be, use . instead of that character: a:find("un.verse")
However, it looks like you're looking for a fuzzy string search. That is out of scope for the Lua string library. You may want to start with this article: http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html
As for Lua fuzzy search implementations, I haven't used any, but googling "lua fuzzy search" gives a few results. Some are based on this paper: http://web.archive.org/web/20070518080535/http://www.heise.de/ct/english/97/04/386/
Try https://github.com/ajsher/luafuzzy.
It sounds like you want something along the lines of TRE:
TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.
Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.
A Lua binding for it is available as part of lrexlib.
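To illustrate the edit-distance measure described above, here is a plain-Lua sketch of the textbook dynamic-programming Levenshtein distance. It is not TRE's implementation or lrexlib's API, and the function name levenshtein is my own; it only shows how insertions, deletions, and substitutions each add a cost of 1.

```lua
-- Textbook Levenshtein distance between two byte strings,
-- keeping only the previous and current rows of the table.
local function levenshtein(a, b)
  local la, lb = #a, #b
  local prev = {}
  for j = 0, lb do prev[j] = j end      -- distance from "" to b:sub(1, j)
  for i = 1, la do
    local cur = { [0] = i }             -- distance from a:sub(1, i) to ""
    for j = 1, lb do
      local cost = (a:byte(i) == b:byte(j)) and 0 or 1
      cur[j] = math.min(
        prev[j] + 1,                    -- deletion
        cur[j - 1] + 1,                 -- insertion
        prev[j - 1] + cost              -- substitution (or match)
      )
    end
    prev = cur
  end
  return prev[lb]
end

print(levenshtein("universe", "un#verse"))  --> 1
print(levenshtein("kitten", "sitting"))     --> 3
```

It compares bytes, so multi-byte UTF-8 characters count as several edits.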
If you are really looking for a single character difference and do not care about performance, here is a simple approach that should work:
local a = "Hello un#verse"
local myfind = function(s,p)
local withdot = function(n)
return p:sub(1,n-1) .. '.' .. p:sub(n+1)
end
local a,b
for i=1,#p do
a,b = s:find(withdot(i))
if a then return a,b end
end
end
print(myfind(a,"universe"))
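One caveat with the approach above is that the search string is used as a Lua pattern, so characters such as + or . in it would be treated as magic. A possible variant, sketched here with the hypothetical names escape and find_one_off, escapes every punctuation character before inserting the wildcard.

```lua
-- Prefix every punctuation character with % so it is matched literally.
local function escape(s)
  return (s:gsub("%p", "%%%0"))
end

-- Try the search string with each position in turn replaced by "any character".
local function find_one_off(s, p)
  for i = 1, #p do
    local pat = escape(p:sub(1, i - 1)) .. "." .. escape(p:sub(i + 1))
    local a, b = s:find(pat)
    if a then return a, b end
  end
end

print(find_one_off("Hello un#verse", "universe"))  --> 7  14
print(find_one_off("cost: 1+2=3", "1+3=3"))        --> 7  11
```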
A simple roll-your-own approach (based on the assumption that the pattern keeps the same length):
function hammingdistance(a,b)
local ta={a:byte(1,-1)}
local tb={b:byte(1,-1)}
local res = 0
for k=1,#a do
if ta[k]~=tb[k] then
res=res+1
end
end
print(a,b,res) -- debugging/demonstration print
return res
end
function fuz(s,pat)
local best_match=10000
local best_location
for k=1,#s-#pat+1 do
local cur_diff=hammingdistance(s:sub(k,k+#pat-1),pat)
if cur_diff < best_match then
best_location = k
best_match = cur_diff
end
end
local start,ending = math.max(1,best_location),math.min(best_location+#pat-1,#s)
return start,ending,s:sub(start,ending)
end
s=[[Hello, Universe! UnIvErSe]]
print(fuz(s,'universe'))
Disclaimer: not recommended, just for fun:
If you want a better syntax (and you don't mind messing with standard type's metatables) you could use this:
getmetatable('').__sub=hammingdistance
a='Hello'
b='hello'
print(a-b)
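But note that a-b need not equal b-a this way: hammingdistance loops over the length of its first argument, so the result depends on operand order when the strings differ in length.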

Any perl standard library to check if a string contains a given substring

Given a query, I would like to check if this contains a given substring (can contain more than one word) . But I don't want exhaustive search, because this substring can only start a fresh word.
Any perl standard libraries for this, so that I get something efficient and don't have to reinvent the wheel?
Thanks,
Maybe you'll find builtin index() suited for the job.
It's a very fast substring search function ( implements the Boyer-Moore algorithm ).
Just check its documentation with perldoc -f index.
I would make a hash with the key being the first word of the 9000 substrings and the value an array with all substrings with that first word. If many strings contain the same first word, you could use the first two words.
Then for each query, for each word, I would see if that word is in the hash, and then need to match only those strings in the hash's array, starting at that point in the string using the index function.
Assuming that matching is sparse, this would be pretty efficient. One hash lookup per word and minimal searching for potential matches.
As I write this it reminds me of an Aho-Corasick search. (See Algorithm::AhoCorasick in CPAN.) I've never used the module, but the algorithm spends a lot of time building a finite state machine out of the search keys so finding a match is super efficient. I don't know if the CPAN implementation handles word boundaries issues.
You can use this approach:
# init
my $re = join "|", map quotemeta, sort @substrings;
$re = qr/\b(?:$re)/;
# usage
while (<>) {
found($1) if /($re)/;
}
where found is whatever action you want to take when a substring is found.
The builtin index function is the fastest general purpose way to check if a string contains a substring.
my $find = 'abc';
my $str = '123 abc xyz';
if (index($str, $find) != -1) {
# process matching $str here
}
If index still is not fast enough, and you know where in the string your substring might be, you can narrow down on it using substr and then use eq for the actual comparison:
my $find = 'abc';
my $str = '123 abc xyz';
if (substr($str, 4, 3) eq $find) {
# process matching $str here
}
You are not going to get faster than that in Perl without dropping down to C.
This sounds like the perfect job for regular expressions:
use feature 'say';  # say needs Perl 5.10 or later
if ($string =~ m/your substring/) {
say "substring found";
} else {
say "nothing found";
}

Check whether a string contains a substring

How can I check whether a given string contains a certain substring, using Perl?
More specifically, I want to see whether s1.domain.example is present in the given string variable.
To find out if a string contains substring you can use the index function:
if (index($str, $substr) != -1) {
print "$str contains $substr\n";
}
It will return the position of the first occurrence of $substr in $str, or -1 if the substring is not found.
Another possibility is to use regular expressions which is what Perl is famous for:
if ($mystring =~ /s1\.domain\.example/) {
print qq("$mystring" contains "s1.domain.example"\n);
}
The backslashes are needed because an unescaped . matches any character. You can get around this by using the \Q and \E escapes, which quote the metacharacters between them.
my $substring = "s1.domain.example";
if ($mystring =~ /\Q$substring\E/) {
print qq("$mystring" contains "$substring"\n);
}
Or, you can do as eugene y stated and use the index function.
Just a word of warning: index returns -1 when it can't find a match, not undef or 0.
Thus, this is an error:
my $substring = "s1.domain.example";
if (not index($mystring, $substring)) {
    print qq("$mystring" doesn't contain "$substring"\n);
}
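This will be wrong if s1.domain.example is at the beginning of your string: index returns 0, which is false, so the code reports a missing substring. (When the substring really is missing, index returns -1, which is true, so nothing is printed at all.) I've personally been burned on this more than once.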
Case Insensitive Substring Example
This is an extension of Eugene's answer, which converts the strings to lower case before checking for the substring:
if (index(lc($str), lc($substr)) != -1) {
print "$str contains $substr\n";
}
