string matching in matlab - string

I have two short (S with the size of 1x10) and very long (L with the size of 1x1000) strings and I am going to find the locations in L which are matched with S.
In this specific matching, I am just interested to match some specific strings in S (the black strings). Is there any function or method in matlab that can match some specific strings (for example string numbers of 1, 5, 9 in S)?

If I understand your question correctly, you want to find substrings in L that contain the same letters (characters) as S in certain positions (let's say given by array idx). Regular expressions are ideal here, so I suggest using regexp.
In regular expressions, a dot (.) matches any character, and curly braces ({}) optionally specify the number of desired occurrences. For example, to match a string of length 6, where the second character is 'a' and the fifth is 'b', our regular expression could be any of the following syntaxes:
.a..b.
.a.{2}b.
.{1}a.{2}b.{1}
any of these is correct. So let's construct a regular expression pattern first:
in = num2cell(diff([0; idx(:); numel(S) + 1]) - 1); %// Intervals
ch = num2cell(S(idx(:))); %// Matched characters
C = [in(:)'; ch(:)', {''}];
pat = sprintf('.{%d}%c', C{:}); %// Pattern for regexp
Now all is left is to feed regexp with L and the desired pattern:
loc = regexp(L, pat)
and voila!
Example
Let's assume that:
S = 'wbzder'
L = 'gabcdexybhdef'
idx = [2 4 5]
First we build a pattern:
in = num2cell(diff([0; idx(:); numel(S) + 1]) - 1);
ch = num2cell(S(idx(:)));
C = [in(:)'; ch(:)', {''}];
pat = sprintf('.{%d}%c', C{:});
The pattern we get is:
pat =
.{1}b.{1}d.{0}e.{1}
Obviously we can add code that beautifies this pattern into .b.de., but this is really an unnecessary optimization (regexp can handle the former just as well).
After we do:
loc = regexp(L, pat)
we get the following result:
loc =
2 8
Seems correct.

Related

How to better code, when looking for substrings?

I want to extract the currency (along with the $ sign) from a list, and create two different currency lists which I have done. But is there a better way to code this?
The list is as below:
['\n\n\t\t\t\t\t$59.90\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$55.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$38.50\n\t\t\t\t\n\n\n\t\t\t\t\t\t$49.90\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$62.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$68.80\n\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.80\n\t\t\t\t\n\n\n\t\t\t\t\t\t$60.50\n\t\t\t\t\t\n\n']
Python code:
pp_list = []
up_list = []
for u in usual_price_list:
rep = u.replace("\n","")
rep = rep.replace("\t","")
s = rep.rsplit("$",1)
pp_list.append(s[0])
up_list.append("$"+s[1])
For this kind of problem, I tend to use a lot the re module, as it is more readable, more maintainble and does not depend on which character surround what you are looking for :
import re
pp_list = []
up_list = []
for u in usual_price_list:
prices = re.findall(r"\$\d{2}\.\d{2}", u)
length_prices = len(prices)
if length_prices > 0:
pp_list.append(prices[0])
if length_prices > 1:
up_list.append(prices[1])
Regular Expresion Breakdown
$ is the end of string character, so we need to escape it
\d matches any digit, so \d{2} matches exactly 2 digits
. matches any character, so we need to escape it
If you want it you can modify the number of digits for the cents with \d{1,2} for matches one or two digits, or \d* to match 0 digit or more
As already pointed for doing that task re module is useful - I would use re.split following way:
import re
data = ['\n\n\t\t\t\t\t$59.90\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$55.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$68.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$38.50\n\t\t\t\t\n\n\n\t\t\t\t\t\t$49.90\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.00\n\t\t\t\t\n\n\n\t\t\t\t\t\t$62.00\n\t\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$68.80\n\t\t\t\t\n\n',
'\n\n\t\t\t\t\t$49.80\n\t\t\t\t\n\n\n\t\t\t\t\t\t$60.50\n\t\t\t\t\t\n\n']
prices = [re.split(r'[\n\t]+',i) for i in data]
prices0 = [i[1] for i in prices]
prices1 = [i[2] for i in prices]
print(prices0)
print(prices1)
Output:
['$59.90', '$55.00', '$38.50', '$49.00', '$68.80', '$49.80']
['$68.00', '$68.00', '$49.90', '$62.00', '', '$60.50']
Note that this will work assuming that there are solely \n and \t excluding prices and there is at least one \n or \t before first price and at least one \n or \t between prices.
[\n\t]+ denotes any string made from \n or \t with length 1 or greater, that is \n, \t, \n\n, \t\t, \n\t, \t\n and so on

How to convert string to a table?

I have this string text:
text = "hotkey=F4,value=,autoSend=false, hotkey=Shift+F9,value=,autoSend=false, hotkey=F5,value=,autoSend=false"
and I would like to convert it to a table like this one:
local table = {
{hotkey='F4', value=nil, autoSend=false};
{hotkey='Shift+F9', value=nil, autoSend=false};
{hotkey='F5', value=nil, autoSend=false}
}
This solution is limited in scope and will not cover all complexities in the input string. A simple pattern matching could generate tables you are looking for, but use this code to build a better/robust regex for the diversity of your strings
s = "hotkey=F4,value=,autoSend=false, hotkey=Shift+F9,value=,autoSend=false, hotkey=F5,value=,autoSend=false"
local words = {}
for w in s:gmatch("(hotkey=%g-,value=%g-,autoSend=%w*)") do
-- Split string in more managebale parts
-- i-g w = 'hotkey=F4,value=,autoSend=false, hotkey=Shift+F9'
-- Extract indivisual k,v pairs and insert into table as desired
local _hotkey = string.match(w,"hotkey=(%g-),")
local _value = string.match(w,"value=(%g-),")
local _autoSend = string.match(w,"autoSend=(%w+)")
table.insert(words,{hotkey=_hotkey, value=_value, autoSend=_autoSend})
end
for _, w in ipairs(words) do
for k, v in pairs(w) do
print(k .. ':' .. v)
end
end
Regex Explanation
(): Capture string
%g: printable characters except for spaces
%w: alphanumeric characters
* : 0 or more repetitions
- : 0 or more lazy repetitions

Find and append characters of a String by matching with a List in Python 2.7

There is a list contains with character sequences such below:
seq_list = ['C','CA','CAF','CMMVF','E','CMM','CMMF','CMMFF',...]
and a string can be defined as below:
a_str = 'CAFCMMVFCMMECMMFFCCAF'
The problem is to match the longest character sequence of seq_list in a_str from left to right iteratively, and then a character('|') should be appended if it's found.
For example,
a_str begins with 'C' but the actual character sequence is 'CAF' because 'CAF' has the longer sequence than 'C',
so that it should be achieved such below:
a_str = 'CAF|CMMVFCMMECMMFFCCAF' #actual sequence match
'C|AFCMMVFCMMECMMFFCCAF' #false sequence match
Then, remaining a_str_r should be like this a_str_r = 'CMMVFCMMECMMFFCCAF' after a character '|' has been appended. So that the iterative process has to start over again by matching the longest sequence from the list until the end of the string, and the final result should be like this:
a_str = 'CAF|CMMVF|CMM|E|CMMFF|C|CAF|'
This was one of the attempts for this problem, and still couldn't get right!
a_str_r = []
for each in seq_list:
for i in a_str:
if each in i:
a_str_r.append(i+'|')
return a_str_r
You want to search for leftmost longest match. That is a standout for a regular expression search.
import re
seq_list = ['C','CA','CAF','CMMVF','E','CMM','CMMF','CMMFF']
# Sort to put longer match strings before shorter ones
sseq_list = sorted(seq_list, key=lambda a: len(a), reverse=True)
# Turn list into a regular expression string
sseq_re = '|'.join(sseq_list)
# Compile regular expression string
rx = rx = re.compile(sseq_re)
# Put pipe characters between the matches
print '|'.join(rx.findall('CAFCMMVFCMMECMMFFCCAF'))

Matlab. Find the indices of a cell array of strings with characters all contained in a given string (without repetition)

I have one string and a cell array of strings.
str = 'actaz';
dic = {'aaccttzz', 'ac', 'zt', 'ctu', 'bdu', 'zac', 'zaz', 'aac'};
I want to obtain:
idx = [2, 3, 6, 8];
I have written a very long code that:
finds the elements with length not greater than length(str);
removes the elements with characters not included in str;
finally, for each remaining element, checks the characters one by one
Essentially, it's an almost brute force code and runs very slowly. I wonder if there is a simple way to do it fast.
NB: I have just edited the question to make clear that characters can be repeated n times if they appear n times in str. Thanks Shai for pointing it out.
You can sort the strings and then match them using regular expression. For your example the pattern will be ^a{0,2}c{0,1}t{0,1}z{0,1}$:
u = unique(str);
t = ['^' sprintf('%c{0,%d}', [u; histc(str,u)]) '$'];
s = cellfun(#sort, dic, 'uni', 0);
idx = find(~cellfun('isempty', regexp(s, t)));
I came up with this :
>> g=#(x,y) sum(x==y) <= sum(str==y);
>> h=#(t)sum(arrayfun(#(x)g(t,x),t))==length(t);
>> f=cellfun(#(x)h(x),dic);
>> find(f)
ans =
2 3 6
g & h: check if number of count of each letter in search string <= number of count in str.
f : finally use g and h for each element in dic

How to calculate word co-occurence

I have a string of characters of length 50 say representing a sequence abbcda.... for alphabets taken from the set A={a,b,c,d}.
I want to calculate how many times b is followed by another b (n-grams) where n=2.
Similarly, how many times a particular character is repeated thrice n=3 consecutively, say in the input string abbbcbbb etc so here the number of times b occurs in a sequence of 3 letters is 2.
To find the number of non-overlapping 2-grams you can use
numel(regexp(str, 'b{2}'))
and for 3-grams
numel(regexp(str, 'b{3}'))
to count overlapping 2-grams use positive lookahead
numel(regexp(str, '(b)(?=b{1})'))
and for overlapping n-grams
numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']))
EDIT
In order to find number of occurrences of an arbitrary sequence use the first element in first parenthesis and the rest after equality sign, to find ba use
numel(regexp(str, '(b)(?=a)'))
to find bda use
numel(regexp(str, '(b)(?=da)'))
Building on the proposal by Magla:
str = 'abcdabbcdaabbbabbbb'; % for example
index_single = ismember(str, 'b');
index_digram = index_single(1:end-1)&index_single(2:end);
index_trigram = index_single(1:end-2)&index_single(2:end-1)&index_single(3:end);
You may try this piece of code that uses ismember (doc).
%generate string (50 char, 'a' to 'd')
str = char(floor(97 + (101-97).*rand(1,50)))
%digram case
index_digram = ismember(str, 'aa');
%trigram case
index_trigram = ismember(str, 'aaa');
EDIT
Probabilities can be computed with
proba = sum(index_digram)/length(index_digram);
this will find all n-grams and count them:
numberOfGrams = 5;
s = char(floor(rand(1,1000)*4)+double('a'));
ngrams = cell(1);
for n = 2:numberOfGrams
strLength = size(s,2)-n+1;
indices = repmat((1:strLength)',1,n)+repmat(1:n,strLength,1)-1;
grams = s(indices);
gramNumbers = (double(grams)-double('a'))*((ones(1,n)*n).^(0:n-1))';
[uniqueGrams, gramInd] = unique(gramNumbers);
count=hist(gramNumbers,uniqueGrams);
ngrams(n) = {struct('gram',grams(gramInd,:),'count',count)};
end
edit:
the result will be:
ngrams{n}.gram %a list of all n letter sequences in the string
ngrams{n}.count(x) %the number of times the sequence ngrams{n}.gram(x) appears

Resources