I want to determine an unknown pattern in a string such as
s=112468112468112468112468112468.
In this string we can clearly see that 112468 is the repeating pattern. I searched on Google quite a bit for algorithms to help me, but I could only find ones that locate a given pattern in a string, such as the Boyer-Moore algorithm.
What I do now to find such an unknown repeating pattern is:
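/* length = number of characters in s */
for (i = 0; i + 3 < length; i++)
{
    for (j = i + 1; j + 3 < length; j++)
    {
        /* compare a window of 4 characters starting at i and at j */
        if (s[i]==s[j] && s[i+1]==s[j+1] && s[i+2]==s[j+2] && s[i+3]==s[j+3])
        {
            patternlength = j - i;
            /* copy s[i..j-1] into pattern[0..patternlength-1] */
            for (k = 0; k < patternlength; k++)
            {
                pattern[k] = s[i + k];
            }
        }
    }
}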
Although this works for the given string by using a comparison window of 4 characters, it may well fail for other strings. Does anybody know a better solution?
Thanks
This is not pattern matching, this is pattern recognition, which is fundamentally different and potentially much harder.
However, the simple kind of pattern exhibited by this string could have been found using (Python code):
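def find_repeated_pattern(s):
    # Python 3: try every candidate length up to and including half the string
    for i in range(1, len(s) // 2 + 1):
        if s == s[:i] * (len(s) // i):
            return s[:i]
    return None  # no repeating unit found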
This is a naive implementation because of all its string copying, but it can be made to work in O(n²) time and constant space.
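The copy-free variant mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the name smallest_repeating_unit is made up here, the code targets Python 3, and a string counts as periodic only if it consists of whole repetitions of the unit. Apart from the slice that is returned, it compares characters in place instead of building new strings.

```python
def smallest_repeating_unit(s):
    """Return the shortest unit whose repetition gives s, or None."""
    n = len(s)
    for i in range(1, n // 2 + 1):
        if n % i != 0:
            continue  # the unit must divide the string evenly
        # s has period i exactly when every character equals the one i places earlier
        if all(s[k] == s[k - i] for k in range(i, n)):
            return s[:i]
    return None

print(smallest_repeating_unit("112468112468112468112468112468"))
```

Each candidate length costs at most n comparisons and there are at most n/2 candidates, which gives the quadratic bound.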
I recently wrote a method to count the vowels in a given string and was able to solve it fairly simply, but my solution was compared to the best practices and this was the top one:
public class Vowels {
    public static int getCount(String str) {
        return str.replaceAll("(?i)[^aeiou]", "").length();
    }
}
...which is much more elegant than what I wrote, and I am trying to understand it. I don't get what exactly the "(?i)[^aeiou]" part is doing. I get that it is deleting all the characters that aren't vowels, but I don't understand what the operators are doing or why they work inside quotes. Shouldn't the program just see it as a string?
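This is a regex. It is passed as an ordinary string, and replaceAll interprets that string as a regular expression. The (?i) flag makes it ignore case, so although the set only lists [aeiou], it also matches the capital vowels [AEIOU]. The ^ at the start of the character class negates it, so the class matches every character that is not a vowel. replaceAll then replaces each of those characters with the empty string "", leaving only the vowels (irrespective of their case).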
(?i) - starts case-insensitive mode
(?-i) - turns off case-insensitive mode
[^...] - NOT ONE of these characters.
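To see the pieces at work outside Java, here is a small Python sketch of the same idea. The function name count_vowels is just an illustrative choice; Python's re module accepts the same (?i) flag and negated character class.

```python
import re

def count_vowels(text):
    # (?i) switches on case-insensitive matching for the whole pattern;
    # [^aeiou] matches any single character that is not a vowel.
    only_vowels = re.sub(r"(?i)[^aeiou]", "", text)
    return len(only_vowels)

print(count_vowels("Hello World"))
```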
The plot
There is a string with a rather complicated format, such that no readable regex can parse it. The aim is to get a specific substring, for example, together with its original position. The substring is reached after some parsing steps such as trimming, removing something from the beginning and looking up the n-th element. This example is only meant to demonstrate the complexity; otherwise the problem is pretty general.
For demonstration, see this rudimentary example. The exact format isn't really important; the point is just to arrive at a pretty complicated parsing model. Obviously there can be more rules, and you could write a simpler model as well.
FirstBlock{Index1, Index2} SecondBlock ThirdBlock
{ FirstBlock {Index1,Index2} SecondBlock}
{FirstBlock SecondBlock ThirdBlock FourthBlock}
I've tried to make it as random as possible. The parsing model is like:
// requires: using System.Linq; (for First())
string text = "{ FirstBlock {Index1,Index2} SecondBlock}";
text = text.Trim();
if (text.First() == '{')
{
    // strip the outer braces
    text = text.Substring(1, text.Length - 2);
}
text = text.Trim();
string firstBlock = text.Split(new char[] { ' ', '{' })[0];
text = text.Remove(0, firstBlock.Length).Trim();
string indices = "";
if (text.First() == '{')
{
    // element 0 is the empty string before '{'; element 1 is the brace content
    indices = text.Split(new char[] { '{', '}' })[1];
    // also remove the two braces that surround the indices
    text = text.Remove(0, indices.Length + 2).Trim();
}
string[] blocks = text.Split(' ');
The easy way
There is a way that is pretty easy and straightforward to implement, but it sometimes gives the wrong result. You parse the string, get the substring, and then search for it again, for example with string.IndexOf(), to get the position. But if there are two matches, you are given the first one even though it may not be the one you wanted.
My notion
The way I have in mind is quite elegant but still not perfect: index the characters of the string at the beginning, then parse it, and eventually you end up with the proper characters and their positions as well. My problem is that you then can't really use the functions the library provides, and I don't know a way around that. Using the snippet above:
List<Tuple<int, char>> indexedText = text
    .Select((ch, index) => new Tuple<int, char>(index, ch))
    .ToList();
With this structure you can still process the string, though without library methods, and you get the position indices at the end. For example, trim:
indexedText = indexedText
    .SkipWhile(indexedChar => char.IsWhiteSpace(indexedChar.Item2))
    .ToList();
The actual question
An answer can either be a new solution or a way to use library methods with indexed strings. The aim is to get the indices back after parsing a string. There may be a very simple way that has just escaped me, but I haven't found a proper solution yet. The solution I don't want is to simplify the parsing system; as I said, it is just for demonstration.
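One possible direction, sketched here in Python rather than C# and purely as an illustration: instead of indexing every character, keep the original string untouched and pass around a small "span" object that records where the current piece starts and ends in it. The Span class and its strip and slice methods are hypothetical names invented for this sketch, not part of any library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    source: str  # the original, never-modified string
    start: int   # inclusive offset into source
    end: int     # exclusive offset into source

    @property
    def text(self):
        return self.source[self.start:self.end]

    def strip(self):
        # Move both ends inward past whitespace, keeping original offsets.
        s, e = self.start, self.end
        while s < e and self.source[s].isspace():
            s += 1
        while e > s and self.source[e - 1].isspace():
            e -= 1
        return Span(self.source, s, e)

    def slice(self, rel_start, rel_end):
        # Offsets are relative to this span, like a substring call.
        return Span(self.source, self.start + rel_start, self.start + rel_end)

source = "  { FirstBlock {Index1,Index2} SecondBlock}"
outer = Span(source, 0, len(source)).strip()
inner = outer.slice(1, len(outer.text) - 1).strip()  # drop the outer braces
print(inner.text, inner.start, inner.end)
```

Because every operation returns a new span over the same source, the start of any piece reached by parsing is already its position in the original string, with no second search needed.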
I'd like to take a String, e.g. "1234", and convert it to an Integer which represents the sum of all its digits.
I thought that treating the String as a List of characters and doing a reduce / inject would be the simplest mechanism. However, in all my attempts I have not managed to get the syntax right.
I attempted something along these lines without success.
int sum = myString.inject (0, { Integer accu, Character value ->
return accu + Character.getNumericValue(value)
})
Can you help me find a simple syntax to solve this problem? (I can easily solve it in a verbose, Java-like way with loops etc.)
Try:
"1234".collect { it.toInteger() }.sum()
Solution by @dmahapatro
"1234".toList()*.toInteger().sum()
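For comparison only, here is the same pair of ideas in Python: mapping each character to a number and summing, and an explicit fold that mirrors what inject does with an accumulator. The function names are illustrative.

```python
from functools import reduce

def digit_sum(text):
    # Map each character to its integer value, then sum.
    return sum(int(ch) for ch in text)

def digit_sum_fold(text):
    # Explicit fold: start the accumulator at 0 and add one digit at a time.
    return reduce(lambda acc, ch: acc + int(ch), text, 0)

print(digit_sum("1234"), digit_sum_fold("1234"))
```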
I want to search for a query (a string) in a subject (another string).
The query may appear in whole or in parts, but will not be rearranged. For instance, if the query is 'da', and the subject is 'dura', it is still a match.
I am not allowed to use string functions like strfind or find.
The constraints make this actually quite straightforward with a single loop. Imagine you have two indices initially pointing at the first character of both strings, now compare them - if they don't match, increment the subject index and try again. If they do, increment both. If you've reached the end of the query at that point, you've found it. The actual implementation should be simple enough, and I don't want to do all the work for you ;)
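For readers who want to check their own attempt afterwards, the two-index idea described above can be sketched as follows. This is written in Python rather than MATLAB, and the function name matches_in_order is made up for the example.

```python
def matches_in_order(query, subject):
    """True if the characters of query appear in subject in the same order."""
    qi = 0  # index into query
    si = 0  # index into subject
    while qi < len(query) and si < len(subject):
        if query[qi] == subject[si]:
            qi += 1  # matched: advance in the query as well
        si += 1      # always advance in the subject
    return qi == len(query)

print(matches_in_order("da", "dura"))
```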
If this is homework, I suggest you look at the explanation which precedes the code and then try for yourself, before looking at the actual code.
The code below looks for all occurrences of chars of the query string within the subject string (variables m; and related ii, jj). It then tests all possible orders of those occurrences (variable test). An order is "acceptable" if it contains all desired chars (cond1) in increasing positions (cond2). The result (variable result) is affirmative if there is at least one acceptable order.
subject = 'this is a test string';
query = 'ten';
m = bsxfun(@eq, subject.', query);
% m(r,c): does char r of subject equal char c of query?
[ii, jj] = find(m);
jj = jj.'; % jj: which char of query is found within subject...
ii = ii.'; % ii: ... and at which position of subject
test = nchoosek(1:numel(jj),numel(query)).'; % test all possible orders
cond1 = all(jj(test) == repmat((1:numel(query)).',1,size(test,2)));
% cond1: for each order, are all chars of query found in subject?
cond2 = all(diff(ii(test))>0);
% cond2: for each order, are the found chars in increasing positions?
result = any(cond1 & cond2); % final result: 1 or 0
The code could be improved by building test in a better way, i.e. by not testing all possible orders given by nchoosek.
Matlab allows you to view the source of built-in functions, so you could always try reading the code to see how the Matlab developers did it (although it will probably be very complex). (thanks Luis for the correction)
Finding a string in another string is a basic computer science problem. You can read up on it in any number of resources, such as Wikipedia.
Your requirement of non-rearranging partial matches recalls the bioinformatics problem of mapping splice variants to a genomic sequence.
You may solve your problem by using a sequence alignment algorithm such as Smith-Waterman, modified to work with all English characters and not just DNA bases.
Is this question actually from bioinformatics? If so, you should tag it as such.
Say I have a string
local a = "Hello universe"
I find the substring "universe" by
a:find("universe")
Now, suppose the string is
local a = "un#verse"
The string to be searched is universe; but the substring differs by a single character.
So obviously Lua ignores it.
How do I make the function find the string even if there is a discrepancy by a single character?
If you know where the character would be, use . instead of that character: a:find("un.verse")
However, it looks like you're looking for a fuzzy string search. That is out of scope for the Lua string library. You may want to start with this article: http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html
As for Lua fuzzy search implementations, I haven't used any, but googling "lua fuzzy search" gives a few results. Some are based on this paper: http://web.archive.org/web/20070518080535/http://www.heise.de/ct/english/97/04/386/
Try https://github.com/ajsher/luafuzzy.
It sounds like you want something along the lines of TRE:
TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.
Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.
A Lua binding for it is available as part of lrexlib.
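To make the edit-distance measure concrete, here is a minimal Python sketch of the textbook dynamic-programming computation of the Levenshtein distance between two whole strings. It is not TRE's implementation or API (TRE searches for approximate matches inside a text); the function name levenshtein is just an illustrative choice.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]

print(levenshtein("universe", "un#verse"))
```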
If you are really looking for a single character difference and do not care about performance, here is a simple approach that should work:
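local a = "Hello un#verse"
local myfind = function(s,p)
  -- build a copy of p whose n-th character is replaced by the wildcard '.'
  local withdot = function(n)
    return p:sub(1,n-1) .. '.' .. p:sub(n+1)
  end
  local a,b
  for i=1,#p do  -- one attempt per character position of the pattern
    a,b = s:find(withdot(i))
    if a then return a,b end
  end
end
print(myfind(a,"universe"))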
A simple roll your own approach (based on the assumption that the pattern keeps the same length):
function hammingdistance(a,b)
  local ta={a:byte(1,-1)}
  local tb={b:byte(1,-1)}
  local res = 0
  for k=1,#a do
    if ta[k]~=tb[k] then
      res=res+1
    end
  end
  print(a,b,res) -- debugging/demonstration print
  return res
end
function fuz(s,pat)
  local best_match=10000
  local best_location
  for k=1,#s-#pat+1 do
    local cur_diff=hammingdistance(s:sub(k,k+#pat-1),pat)
    if cur_diff < best_match then
      best_location = k
      best_match = cur_diff
    end
  end
  local start,ending = math.max(1,best_location),math.min(best_location+#pat-1,#s)
  return start,ending,s:sub(start,ending)
end
s=[[Hello, Universe! UnIvErSe]]
print(fuz(s,'universe'))
Disclaimer: not recommended, just for fun:
If you want a nicer syntax (and you don't mind messing with the metatable of a standard type) you could use this:
getmetatable('').__sub=hammingdistance
a='Hello'
b='hello'
print(a-b)
But note that a-b does not equal b-a this way.