Find the location of multiple strings in a cell array of strings - string

I have 2 question regarding searching for strings in MATLAB
If I have to find a string in a cell array of strings I can do the following to get the location of 'PO' in the cell array
find(strcmpi({'PO','FOO','PO1','FOO1','PO1','PO'},'PO'))
% 1 6
But, I really want to search for multiple strings ({'PO1', 'PO'}) at the same time (not using a for loop). What is the best way to do this?
Is there any function like histc() which can tell me how many times the string has occurred. Again for one string, I could do:
length(strfind({'PO','FOO','PO1','FOO1','PO1','PO'},'PO'))
But this obviously doesn't work for multiple strings at a time.

If you want to find multiple strings, then just use the second output of ismember instead to tell you which string it is. If you really need case-insensitive matching, I've added the upper call to force all inputs to be upper-case. You can omit this if you think it's already uppercase.
data = {'PO','FOO','PO1','FOO1','PO1','PO', 'PO'};
[tf, inds] = ismember(upper(data), {'PO1', 'PO'});
% 2 0 1 0 1 2 2
You can then use the second output to determine which string was found where:
% PO1 Occurrences
find(inds == 1)
% 3 5
% PO Occurrences
find(inds == 2)
% 1 6 7
If you want the equivalent of histc, you can use accumarray to do that. We can pass it all of the values of inds that are non-zero (i.e. the ones that you were actually searching for).
accumarray(inds(tf).', ones(sum(tf), 1))
% 2 3
If instead you want to get the histogram of all strings (not just the ones you're searching for) you could do the following:
[strings, ~, inds] = unique(data, 'stable');
occurrences = accumarray(inds, ones(size(inds)));
% 'PO' [3]
% 'FOO' [1]
% 'PO1' [2]
% 'FOO1' [1]

Related

Making one string the anagram of other

I have a problem where two strings of same length are given, and I have to tell how many letters I have to change in the first string to make it an anagram of the second.
Here is what I did:
count = 0
Mutable_str = ''.join(sorted("hhpddlnnsjfoyxpci"))
Ref_str = ''.join(sorted("ioigvjqzfbpllssuj"))
i = 0
while i < len(Mutable_str):
if Mutable_str[i] != Ref_str[i]:
count += 1
i += 1
print(count)
My algorithm in this case returned 16 as result. But the correct answer is 10. Can someone tell me what is wrong in my code?
Thank you very much!
You need to use str.count
So you need to add up the differences between the number of occurrences of each character in the different strings. This can be done with str.count(c) where c is each distinct character in the second string (got with set()). We then need to use max() on the difference with 0 so that if the difference is negative this doesn't effect the total differences.
So as you can see, it boils down to one neat little one-liner:
def changes(s1, s2):
return sum(max(0, s2.count(c) - s1.count(c)) for c in set(s2))
and some tests:
>>> changes("hhpddlnnsjfoyxpci", "ioigvjqzfbpllssuj")
10
>>> changes("abc", "bcd")
1
>>> changes("jimmy", "bobby")
4

Pattern Matching BASIC programming Language and Universe Database

I need to identify following patterns in string.
- "2N':'2N':'2N"
- "2N'-'2N'-'2N"
- "2N'/'2N'/'2N"
- "2N'/'2N'-'2N"
AND SO ON.....
basically i want this pattern if written in Simple language
2 NUMBERS [: / -] 2 NUMBERS [: / -] 2 NUMBERS
So is there anyway by which i could write one pattern which will cover all the possible scenarios ? or else i have to write total 9 patterns and had to match all 9 patterns to string.... and it is not the scenario in my code , i have to match 4, 2 number digits separated by [: / -] to string for which i have towrite total 27 patterns. So for understanding purpose i have taken 3 ,2 digit scenario...
Please help me...Thank you
Maybe you could try something like (Pick R83 style)
OK = X MATCH "2N1X2N1X2N" AND X[3,1]=X[6,1] AND INDEX(":/-",X[3,1],1) > 0
Where variable X is some input string like: 12-34-56
Should set variable OK to 1 if validation passes, else 0 for any invalid format.
This seems to get all your required validation into a single statement. I have assumed that the non-numeric characters have to be the same. If this is not true, the check could be changed to something like:
OK = X MATCH "2N1X2N1X2N" AND INDEX(":/-",X[3,1],1) > 0 AND INDEX(":/-",X[6,1],1) > 0
Ok, I guess the requirement of surrounding characters was not obvious to me. Still, it does not make it much harder. You just need to 'parse' the string looking for the first (I assume) such pattern (if any) in the input string. This can be done in a couple of lines of code. Here is a (rather untested ) R83 style test program:
PROMPT ":"
LOOP
LOOP
CRT 'Enter test string':
INPUT S
WHILE S # "" AND LEN(S) < 8 DO
CRT "Invalid input! Hit RETURN to exit, or enter a string with >= 8 chars!"
REPEAT
UNTIL S = "" DO
*
* Look for 1st occurrence of pattern in string..
CARDNUM = ""
FOR I = 1 TO LEN(S)-7 WHILE CARDNUM = ""
IF S[I,8] MATCH "2N1X2N1X2N" THEN
IF INDEX(":/-",S[I+2,1],1) > 0 AND INDEX(":/-",S[I+5,1],1) > 0 THEN
CARDNUM = S[I,8] ;* Found it!
END ELSE I = I + 8
END
NEXT I
*
CRT CARDNUM
REPEAT
There is only 7 or 8 lines here that actually look for the card number pattern in the source/test string.
Not quite perfect but how about 2N1X2N1X2N this gets you 2 number followed by 1 of any character followed by 2 numbers etc.
This might help:
BIG.STRING ="HELLO TILDE ~ CARD 12:34:56 IS IN THIS STRING"
TEMP.STRING = BIG.STRING
CONVERT "~:/-" TO "*~~~" IN TEMP.STRING
IF TEMP.STRING MATCHES '0X2N"~"2N"~"2N0X' THEN
FIRST.TILDE.POSN = INDEX(TEMP.STRING,"~",1)
CARD.STRING = BIG.STRING[FIRST.TILDE.POSN-2,8]
PRINT CARD.STRING
END

Does Python have a string contains how many substring method, allowing for overlap?

I want to count a long string contain how many substring, how to do it in python?
"12212"
contains 2x "12"
how to get the count number?
It must allow for overlaping substrings; for instance "1111" contains 3 "11" substrings.
"12121" contains 2 "121" substrings.
"1111".count("11")
will return 2. It does not count any overlaps.
Strings have a method count
You can do
s = '12212'
s.count('12') # this equals 2
Edited for the changing question, the answer below was posted as a comment by tobias_k
To count with overlap,
count_all = lambda string, sub: sum(string[i:i+len(sub)] == sub for i in range(len(string) - len(sub) + 1))
This can be called with,
count_all('1111', '11') # this returns 3

Recognize relevant string information by checking the first characters

I have a table with 2 columns. In column 1, I have a string information, in column 2, I have a logical index
%% Tables and their use
T={'A2P3';'A2P3';'A2P3';'A2P3 with (extra1)';'A2P3 with (extra1) and (extra 2)';'A2P3 with (extra1)';'B2P3';'B2P3';'B2P3';'B2P3 with (extra 1)';'A2P3'};
a={1 1 0 1 1 0 1 1 0 1 1 }
T(:,2)=num2cell(1);
T(3,2)=num2cell(0);
T(6,2)=num2cell(0);
T(9,2)=num2cell(0);
T=table(T(:,1),T(:,2));
class(T.Var1);
class(T.Var2);
T.Var1=categorical(T.Var1)
T.Var2=cell2mat(T.Var2)
class(T.Var1);
class(T.Var2);
if T.Var1=='A2P3' & T.Var2==1
disp 'go on'
else
disp 'change something'
end
UPDATES:
I will update this section as soon as I know how to copy my workspace into a code format
** still don't know how to do that but here it goes
*** why working with tables is a double edged sword (but still cool): I have to be very aware of the class inside the table to refer to it in an if else construct, here I had to convert two columns to categorical and to double from cell to make it work...
Here is what my data looks like:
I want to have this:
if T.Var1=='A2P3*************************' & T.Var2==1
disp 'go on'
else
disp 'change something'
end
I manage to tell matlab to do as i wish, but the whole point of this post is: how do i tell matlab to ignore what comes after A2P3 in the string, where the string length is variable? because otherwise it would be very tiring to look up every single piece of string information left on A2P3 (and on B2P3 etc) just to say thay.
How do I do that?
Assuming you are working with T (cell array) as listed in your code, you may use this code to detect the successful matches -
%%// Slightly different than yours
T={'A2P3';'NotA2P3';'A2P3';'A2P3 with (extra1)';'A2P3 with (extra1) and (extra 2)';'A2P3 with (extra1)';'B2P3';'B2P3';'NotA2P3';'B2P3 with (extra 1)';'A2P3'};
a={1 1 0 1 1 0 1 1 0 1 1 }
T(:,2)=num2cell(1);
T(3,2)=num2cell(0);
T(6,2)=num2cell(0);
T(9,2)=num2cell(0);
%%// Get the comparison results
col1_comps = ismember(char(T(:,1)),'A2P3') | ismember(char(T(:,1)),'B2P3');
comparisons = ismember(col1_comps(:,1:4),[1 1 1 1],'rows').*cell2mat(T(:,2))
One quick solution would be to make a function that takes 2 strings and checks whether the first one starts with the second one.
Later Edit:
The function will look like this:
for i = 0, i < second string's length, i = i + 1
if the first string's character at index i doesn't equal the second string's character at index i
return false
after the for, return true
This assuming the second character's lenght is always smaller the first's. Otherwise, return the function with the arguments swapped.

Optimizing count of occurrence of a string

I have to count how often a certain string is contained in a cell-array. The problem is the code is way to slow it takes almost 1 second in order to do this.
uniqueWordsSize = 6; % just a sample number
wordsCounter = zeros(uniqueWordsSize, 1);
uniqueWords = unique(words); % words is a cell-array
for i = 1:uniqueWordsSize
wordsCounter(i) = sum(strcmp(uniqueWords(i), words));
end
What I'm currently doing is to compare every word in uniqueWords with the cell-array words and use sum in order to calculate the sum of the array which gets returned by strcmp.
I hope someone can help me to optimize that.... 1 second for 6 words is just too much.
EDIT: ismember is even slower.
You can drop the loop completely by using the third output of unique together with hist:
words = {'a','b','c','a','a','c'}
[uniqueWords,~,wordOccurrenceIdx]=unique(words)
nUniqueWords = length(uniqueWords);
counts = hist(wordOccurrenceIdx,1:nUniqueWords)
uniqueWords =
'a' 'b' 'c'
wordOccurrenceIdx =
1 2 3 1 1 3
counts =
3 1 2
tricky way without using explicit fors..
clc
close all
clear all
Paragraph=lower(fileread('Temp1.txt'));
AlphabetFlag=Paragraph>=97 & Paragraph<=122; % finding alphabets
DelimFlag=find(AlphabetFlag==0); % considering non-alphabets delimiters
WordLength=[DelimFlag(1), diff(DelimFlag)];
Paragraph(DelimFlag)=[]; % setting delimiters to white space
Words=mat2cell(Paragraph, 1, WordLength-1); % cut the paragraph into words
[SortWords, Ia, Ic]=unique(Words); %finding unique words and their subscript
Bincounts = histc(Ic,1:size(Ia, 1));%finding their occurence
[SortBincounts, IndBincounts]=sort(Bincounts, 'descend');% finding their frequency
FreqWords=SortWords(IndBincounts); % sorting words according to their frequency
FreqWords(1)=[];SortBincounts(1)=[]; % dealing with remaining white space
Freq=SortBincounts/sum(SortBincounts)*100; % frequency percentage
%% plot
NMostCommon=20;
disp(Freq(1:NMostCommon))
pie([Freq(1:NMostCommon); 100-sum(Freq(1:NMostCommon))], [FreqWords(1:NMostCommon), {'other words'}]);

Resources