Lua - How to find a substring with 1 or 2 characters discrepancy - string

Say I have a string
local a = "Hello universe"
I find the substring "universe" by
a:find("universe")
Now, suppose the string is
local a = "un#verse"
The string to be searched is universe; but the substring differs by a single character.
So obviously Lua ignores it.
How do I make the function find the string even if there is a discrepancy by a single character?

If you know where the character would be, use . instead of that character: a:find("un.verse")
However, it looks like you're looking for a fuzzy string search. It is out of a scope for a Lua string library. You may want to start with this article: http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html
As for Lua fuzzy search implementations — I haven't used any, but googing "lua fuzzy search" gives a few results. Some are based on this paper: http://web.archive.org/web/20070518080535/http://www.heise.de/ct/english/97/04/386/
Try https://github.com/ajsher/luafuzzy.

It sounds like you want something along the lines of TRE:
TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.
Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.
A Lua binding for it is available as part of lrexlib.

If you are really looking for a single character difference and do not care about performance, here is a simple approach that should work:
local a = "Hello un#verse"
local myfind = function(s,p)
local withdot = function(n)
return p:sub(1,n-1) .. '.' .. p:sub(n+1)
end
local a,b
for i=1,#s do
a,b = s:find(withdot(i))
if a then return a,b end
end
end
print(myfind(a,"universe"))

A simple roll your own approach (based on the assumption that the pattern keeps the same length):
function hammingdistance(a,b)
local ta={a:byte(1,-1)}
local tb={b:byte(1,-1)}
local res = 0
for k=1,#a do
if ta[k]~=tb[k] then
res=res+1
end
end
print(a,b,res) -- debugging/demonstration print
return res
end
function fuz(s,pat)
local best_match=10000
local best_location
for k=1,#s-#pat+1 do
local cur_diff=hammingdistance(s:sub(k,k+#pat-1),pat)
if cur_diff < best_match then
best_location = k
best_match = cur_diff
end
end
local start,ending = math.max(1,best_location),math.min(best_location+#pat-1,#s)
return start,ending,s:sub(start,ending)
end
s=[[Hello, Universe! UnIvErSe]]
print(fuz(s,'universe'))
Disclaimer: not recommended, just for fun:
If you want a better syntax (and you don't mind messing with standard type's metatables) you could use this:
getmetatable('').__sub=hammingdistance
a='Hello'
b='hello'
print(a-b)
But note that a-b does not equal b-a this way.

Related

Efficient way to insert characters between other characters in a string

What is an efficient way in MATLAB to replace/insert one symbol (in series of symbols) with several others that correspond to the one that is being replaced?
For example, consider having a string Eq: Eq = 'A*exp(-((x-xc)/w)^2)'. Is there a way to replace * with .*, / with ./,\ with .\, and ^ with .^ without writing four separate strrep() lines?
Regular expressions will do the job nicely. Regular expressions simply find patterns in text. You specify what kind of pattern you are looking for by a regular expression, and the output gives you the locations of where the pattern occurred.
For our particular case, not only do we want to find where patterns occur, we also want to replace those patterns with something else. Specifically, use the function regexprep from MATLAB to replace matches in a string with something else. What you want to do is replace all *, /, \ and ^ symbols by adding a . in front of each.
How regexprep works is that the first input is the string you're looking at, the second input is a pattern that you're trying to find. In our case, we want to find any of *, /, \ and ^. To specify this pattern, you put those desired symbols in [] brackets. Regular expressions reserve \ as a special symbol to delineate characters that can be parsed as a regular expression but actually aren't. As such, you need to use \\ for the \ character and \^ for the ^ character. The third input is what you want to replace each match with. In our case, we simply want to reuse each matched character, but we add a . at the beginning of the match. This is done by doing \.$0 in the regular expression syntax. $0 means to grab the first token produced by a match... which is essentially the matched symbol from the pattern. . is also a reserved keyword using regular expressions, so we must prepend this symbol with a \ character.
Without further ado:
>> Eq = 'A*exp(-((x-xc)/w)^2)';
>> out = regexprep(Eq, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2)
The pattern we are looking for is [*/\\\^], which means that we want to find any of *, /, \ - denoted as \\ in regex, and \^ - denoted as ^ in regex. We want to find any of these symbols and replace them with the same symbol by adding a . character in front - \.$0.
As a more complicated example, let's make sure that we include all of the symbols you're looking for in a sample equation:
>> A = 'A*exp(-((x-xc)/w)^2) \ b^2';
>> out = regexprep(A, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2) .\ b.^2
I'd go with regexp as in rayryeng's answer. But here's another approach, just to provide an alternative.
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
[~, jj] = sort([1:numel(Eq) ii-.5]); %// will be used to properly order the result
result = [Eq repmat('.',1,numel(ii))]; %// insert dots at the end
result = result(jj); %// properly order the result
And a variant:
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
jj = sort([1:numel(Eq) ii-.5]); %// dot locations are marked with fractional part
result = Eq(ceil(jj)); %// repeat characters where the dots will be placed
result(mod(jj,1)>0) = '.'; %// place dots at indices with fractional part
The vectorize function already does almost all of what you want except that it does not convert mldivide (\) to ldivide (.\).
By "efficient," do you mean fewer lines of code or faster? Regular expressions are almost always slower than other approaches and less readable. I don't think they're necessary or a good choice in this case. If you only need to convert your string once, then speed is less of a concern than readability (strrep will still be faster). If you need to do it many times, this simple code that you alluded to is 4–5 times faster than regexrep for short strings like your example (and much faster for longer strings):
out = strrep(Eq,'*','.*');
out = strrep(out,'/','./');
out = strrep(out,'\','.\');
out = strrep(out,'^','.^');
If you want one line, use:
out = strrep(strrep(strrep(strrep(Eq,'*','.*'),'/','./'),'\','.\'),'^','.^');
which will also be slightly faster still. Or create your own version of vectorize and call that.
Where regular expressions shine is in more complex cases, e.g., if your string is already partially vectorized: Eq = 'A.*exp(-((x-xc)/w)^2)'. Even still, the vectorize function just uses strrep and then calls strfind to "remove any possible '..*', '../', etc." and replace them with the proper element-wise operators because it's faster (symbolic math strings can get very large, for example).

Lua plain searching with string.gsub?

With Lua's string.find function, there is an optional fourth argument you can pass to enable plain searching. From the Lua wiki:
The pattern argument also allows more complex searches. See the
PatternsTutorial for more information. We can turn off the pattern
matching feature by using the optional fourth argument plain. plain
takes a boolean value and must be preceeded by index. E.g.,
= string.find("Hello Lua user", "%su") -- find a space character followed by "u"
10 11
= string.find("Hello Lua user", "%su", 1, true) -- turn on plain searches, now not found
nil
Basically, I was wondering how I can accomplish the same plain searching using Lua's string.gsub function.
I expected there to be something in the standard library for this, but there isn't. The solution, then, is to escape the special characters in the pattern so they don't perform their usual functions.
Here's the general idea:
obtain the pattern string
replace any special characters with % followed by it (for example, % becomes %%, [ becomes %[
use this as your search pattern for replacing the text
Here is a simple library function for text replacement:
function string.replace(text, old, new)
local b,e = text:find(old,1,true)
if b==nil then
return text
else
return text:sub(1,b-1) .. new .. text:sub(e+1)
end
end
This function can be called as newtext = text:replace(old,new).
Note that this only replaces the first occurrence of old in text.
Use this function to escape all magic characters (and only those) in your search string.
function escape_magic(s)
return (s:gsub('[%^%$%(%)%%%.%[%]%*%+%-%?]','%%%1'))
end

String matching without using builtin functions

I want to search for a query (a string) in a subject (another string).
The query may appear in whole or in parts, but will not be rearranged. For instance, if the query is 'da', and the subject is 'dura', it is still a match.
I am not allowed to use string functions like strfind or find.
The constraints make this actually quite straightforward with a single loop. Imagine you have two indices initially pointing at the first character of both strings, now compare them - if they don't match, increment the subject index and try again. If they do, increment both. If you've reached the end of the query at that point, you've found it. The actual implementation should be simple enough, and I don't want to do all the work for you ;)
If this is homework, I suggest you look at the explanation which precedes the code and then try for yourself, before looking at the actual code.
The code below looks for all occurrences of chars of the query string within the subject string (variables m; and related ii, jj). It then tests all possible orders of those occurrences (variable test). An order is "acceptable" if it contains all desired chars (cond1) in increasing positions (cond2). The result (variable result) is affirmative if there is at least one acceptable order.
subject = 'this is a test string';
query = 'ten';
m = bsxfun(#eq, subject.', query);
%'// m: test if each char of query equals each char of subject
[ii jj] = find(m);
jj = jj.'; %'// ii: which char of query is found within subject...
ii = ii.'; %'// jj: ... and at which position
test = nchoosek(1:numel(jj),numel(query)).'; %'// test all possible orders
cond1 = all(jj(test) == repmat((1:numel(query)).',1,size(test,2)));
%'// cond1: for each order, are all chars of query found in subject?
cond2 = all(diff(ii(test))>0);
%// cond2: for each order, are the found chars in increasing positions?
result = any(cond1 & cond2); %// final result: 1 or 0
The code could be improved by using a better approach as regards to test, i.e. not testing all possible orders given by nchoosek.
Matlab allows you to view the source of built-in functions, so you could always try reading the code to see how the Matlab developers did it (although it will probably be very complex). (thanks Luis for the correction)
Finding a string in another string is a basic computer science problem. You can read up on it in any number of resources, such as Wikipedia.
Your requirement of non-rearranging partial matches recalls the bioinformatics problem of mapping splice variants to a genomic sequence.
You may solve your problem by using a sequence alignment algorithm such as Smith-Waterman, modified to work with all English characters and not just DNA bases.
Is this question actually from bioinformatics? If so, you should tag it as such.

Any perl standard library to check if a string contains a given substring

Given a query, I would like to check if this contains a given substring (can contain more than one word) . But I don't want exhaustive search, because this substring can only start a fresh word.
Any perl standard libraries for this, so that I get something efficient and don't have to reinvent the wheel?
Thanks,
Maybe you'll find builtin index() suited for the job.
It's a very fast substring search function ( implements the Boyer-Moore algorithm ).
Just check its documentation with perldoc -f index.
I would make a hash with the key being the first word of the 9000 substrings and the value an array with all substrings with that first word. If many strings contain the same first word, you could use the first two words.
Then for each query, for each word, I would see if that word is in the hash, and then need to match only those strings in the hash's array, starting at that point in the string using the index function.
Assuming that matching is sparse, this would be pretty efficient. One hash lookup per word and minimal searching for potential matches.
As I write this it reminds me of an Aho-Corasick search. (See Algorithm::AhoCorasick in CPAN.) I've never used the module, but the algorithm spends a lot of time building a finite state machine out of the search keys so finding a match is super efficient. I don't know if the CPAN implementation handles word boundaries issues.
You can use this approach:
# init
my $re = join"|", map quotemeta, sort #substrings;
$re = qr/\b(?:$re)/;
# usage
while (<>) {
found($1) if /($re)/;
}
where found is action what you want to do if substring found.
The builtin index function is the fastest general purpose way to check if a string contains a substring.
my $find = 'abc';
my $str = '123 abc xyz';
if (index($str, $find) != -1) {
# process matching $str here
}
If index still is not fast enough, and you know where in the string your substring might be, you can narrow down on it using substr and then use eq for the actual comparison:
my $find = 'abc';
my $str = '123 abc xyz';
if (substr($str, 4, 3) eq $find) {
# process matching $str here
}
You are not going to get faster than that in Perl without dropping down to C.
This sounds like the perfect job for regular expressions:
if($string =~ m/your substring/) {
say "substring found";
} else {
say "nothing found";
}

Modifying a character in a string in Lua

Is there any way to replace a character at position N in a string in Lua.
This is what I've come up with so far:
function replace_char(pos, str, r)
return str:sub(pos, pos - 1) .. r .. str:sub(pos + 1, str:len())
end
str = replace_char(2, "aaaaaa", "X")
print(str)
I can't use gsub either as that would replace every capture, not just the capture at position N.
Strings in Lua are immutable. That means, that any solution that replaces text in a string must end up constructing a new string with the desired content. For the specific case of replacing a single character with some other content, you will need to split the original string into a prefix part and a postfix part, and concatenate them back together around the new content.
This variation on your code:
function replace_char(pos, str, r)
return str:sub(1, pos-1) .. r .. str:sub(pos+1)
end
is the most direct translation to straightforward Lua. It is probably fast enough for most purposes. I've fixed the bug that the prefix should be the first pos-1 chars, and taken advantage of the fact that if the last argument to string.sub is missing it is assumed to be -1 which is equivalent to the end of the string.
But do note that it creates a number of temporary strings that will hang around in the string store until garbage collection eats them. The temporaries for the prefix and postfix can't be avoided in any solution. But this also has to create a temporary for the first .. operator to be consumed by the second.
It is possible that one of two alternate approaches could be faster. The first is the solution offered by Paŭlo Ebermann, but with one small tweak:
function replace_char2(pos, str, r)
return ("%s%s%s"):format(str:sub(1,pos-1), r, str:sub(pos+1))
end
This uses string.format to do the assembly of the result in the hopes that it can guess the final buffer size without needing extra temporary objects.
But do beware that string.format is likely to have issues with any \0 characters in any string that it passes through its %s format. Specifically, since it is implemented in terms of standard C's sprintf() function, it would be reasonable to expect it to terminate the substituted string at the first occurrence of \0. (Noted by user Delusional Logic in a comment.)
A third alternative that comes to mind is this:
function replace_char3(pos, str, r)
return table.concat{str:sub(1,pos-1), r, str:sub(pos+1)}
end
table.concat efficiently concatenates a list of strings into a final result. It has an optional second argument which is text to insert between the strings, which defaults to "" which suits our purpose here.
My guess is that unless your strings are huge and you do this substitution frequently, you won't see any practical performance differences between these methods. However, I've been surprised before, so profile your application to verify there is a bottleneck, and benchmark potential solutions carefully.
You should use pos inside your function instead of literal 1 and 3, but apart from this it looks good. Since Lua strings are immutable you can't really do much better than this.
Maybe
"%s%s%s":format(str:sub(1,pos-1), r, str:sub(pos+1, str:len())
is more efficient than the .. operator, but I doubt it - if it turns out to be a bottleneck, measure it (and then decide to implement this replacement function in C).
With luajit, you can use the FFI library to cast the string to a list of unsigned charts:
local ffi = require 'ffi'
txt = 'test'
ptr = ffi.cast('uint8_t*', txt)
ptr[1] = string.byte('o')

Resources