contains with separate result for each of multiple patterns - string

Matlab's documentation for the function TF = contains(str,pattern) states:
If pattern is an array containing multiple patterns, then contains returns 1 if it finds any element of pattern in str.
I want a result for each pattern individually however.
That is:
I have string A='a very long string' and two patterns B='very' and C='long'. I want to check if B is contained in A and if C is contained in A. I could do it like this:
result = false(2,1);
result(1) = contains(A,B);
result(2) = contains(A,C);
but for many patterns this takes quite a while. What is the fast way to do this?

I don't know or have access to that function; it must be "new", so I don't know its particular idiosyncrasies.
How I would do that is:
result = ~cellfun('isempty', regexp(A, {B C}));
EDIT
Judging from the documentation, you can do the exact same thing with contains:
result = contains(A, {B C});
except that seems to return contains(A,B) || contains(A,C) rather than the array [contains(A,B) contains(A,C)]. So I don't know, I can't test it here. But if all else fails, you can use the regex solution above.

The new text processing functions introduced in R2016b are fastest when they operate on the string type. If you convert A to a string you may see much better performance.
function profFunc
n = 1E6;
A = 'a very long string';
B = 'very';
C = 'long';
tic;
for i = 1:n
result(1) = contains(A,B);
result(2) = contains(A,C);
end
toc;
tic;
for i = 1:n
x = regexp(A, {B,C});
end
toc;
A = string(A);
tic;
for i = 1:n
result(1) = contains(A,B);
result(2) = contains(A,C);
end
toc;
end
>> profFunc
Elapsed time is 7.035145 seconds.
Elapsed time is 9.494433 seconds.
Elapsed time is 0.930393 seconds.
Questions: Where do B and C come from? Do you have a lot of hard coded variables? Can you loop? Looping would probably be the fastest. Otherwise something like
cellfun(@(x)contains(A,x),{B C})
is an option.
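As a minimal sketch of collecting one result per pattern, the snippet below puts the patterns in a cell array and tests each one separately. The variable name patterns and the third pattern 'short' are assumptions added for illustration. It uses strfind so that it also runs on releases without contains.

```matlab
A = 'a very long string';
patterns = {'very', 'long', 'short'};   % hypothetical pattern list

% strfind returns the start indices of every match, or [] when there is none,
% so "not empty" means "pattern found". One logical value per pattern.
result = cellfun(@(p) ~isempty(strfind(A, p)), patterns);

% On R2016b or later the same per-pattern result can be obtained with contains:
% result = cellfun(@(p) contains(A, p), patterns);

disp(result)
```

Whether this beats a plain loop depends on the number of patterns and the MATLAB release, so it is worth timing both on real data.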

Related

Can you group data with similar written column headings on xlswrite, matlab?

Very new to matlab and still learning the basics. I'm trying to write a script which calculates the distance between two peaks in a waveform. That part I have managed to do, and I have used xlswrite to put the values I have obtained onto an excel file.
For each file, I have between about 50-250 columns, with just two rows: the second row has the numerical value, and the first has the column headings, copied from original excel files I extracted the data from.
Some of the columns have similar, but not identical, headings, e.g. 'green227RightEyereading3' and 'green227RightEyereading4' etc. Is there a way I can group columns with similar headings, e.g. those which have the same number/colour in the heading (i.e. green227) and either 'right eye' or 'left eye', and calculate an average of their numerical values? Link to file here: https://www.dropbox.com/s/ezpyjr3raol31ts/SampleBatchForTesting.xls?dl=0
[Excel_file,PathName] = uigetfile('*.xls', 'Pick a File','C:\Users\User\Documents\Optometry\Year 3\Dissertation\A-scan3');
[~,name,ext] = fileparts(Excel_file);
sheet = 2;
FullXLSfile = [PathName, Excel_file];
[number_data,txt_data,raw_data] = xlsread(FullXLSfile,sheet);
HowManyWide = size(txt_data);
NumberOfTitles = HowManyWide(1,2);
xlRangeA = txt_data;
Chickens = {'Test'};
for f = 1:xlRangeA; %%defined as top line of cells on sheet;
Text = xlRangeA{f};
HyphenLocations = find(Text == '-');
R = HyphenLocations(1,1) -1;
Chick = Text(1:R);
Chick = cellstr(Chick);
B = length(Chick);
TF = strncmp(Chickens,Chick,B);
if any(TF == 1); %do nothing
else
Chickens = {Chickens;Chick};
end
end
Here also is a link to the file that is created when I run my entire script. The values below the headings are the calculated thicknesses of the tissue I'm analysing. https://www.dropbox.com/s/4p6iu9kk75ecyzl/Choroid_Thickness.xls?dl=0
Thanks very much
If the differing characters are located at the very end (or the very beginning) of the heading, you can use the built-in strncmp function and compare only part of the string. See more here. But please provide some code and a part of your excel file. It would help.
Also, if I am not mistaken, you are saving all the data into excel and then re-call it again in order to sort it. Maybe you should consider saving only the final result in excel, it will save you some time, especially if you want to run your script many times.
EDIT:
Here is the code I came up with. It is not the best possible solution for sure, but it works with the file you uploaded. I have omitted the unnecessary lines and variables. The code works only if the numbers of each reading have the same amount of digits. They can be 4 digits as long as every entry has 4 digits. Since in each file you have waves of the same color, the only thing that you care about is whether the reading was recorded with the left or the right eye (correct?). Based on that and the code you wrote, the comparison concerns the part of the string that contains the words "Right" or "Left", i.e. the characters between the hyphens.
[Excel_file,PathName] = uigetfile('*.xls', 'Pick a File',...
'C:\Users\User\Documents\Optometry\Year 3\Dissertation\A-scan3');
sheet = 1;
FullXLSfile = [PathName,Excel_file];
[number_data,txt_data,raw_data] = xlsread(FullXLSfile,sheet);
%% data manipulation
NumberOfTitles = length(txt_data);
TextToCompare = txt_data{1};
r1 = 1; % counter for Readings1 vector
r2 = 1; % counter for Readings2 vector
for ff = 1:NumberOfTitles % in your code xlRangeA is a cell vector not a number!
Text = txt_data{ff};
HyphenLocations = find(Text == '-');
Text = Text(HyphenLocations(1,1):HyphenLocations(1,2)); % take only the part that contains the "eye" information
TextToCompare = TextToCompare(HyphenLocations(1,1):HyphenLocations(1,2)); % same here
if (Text == TextToCompare)
Readings1(r1) = number_data(ff); % store the numerical value in a vector
r1 = r1 + 1; % increase the counter of this vector
else
Readings2(r2) = number_data(ff); % same here
r2 = r2 + 1;
end
TextToCompare = txt_data{1}; % TextToCompare re-initialized for the next comparison
end
mean_readings1 = mean(Readings1); % Find the mean of the grouped values
mean_readings2 = mean(Readings2);
I am positive that this can be done in a more efficient and delicate way. I don't know exactly what kind of calculations you want to do so I only included the mean values as an example. Inside the if statement you can also store the txt_data if you need it. Below I have also included a second way which I find more delicate. Just substitute the %%data manipulation part with the part below if you want to test it:
%% more delicate way
Text_Vector = char(txt_data);
TextToCompare2 = txt_data{1};
HyphenLocations2 = find(TextToCompare2 == '-');
TextToCompare2 = TextToCompare2(HyphenLocations2(1,1):HyphenLocations2(1,2));
Text_Vector = Text_Vector(:,HyphenLocations2(1,1):HyphenLocations2(1,2));
Text_Vector = cellstr(Text_Vector);
dummy = strcmpi(Text_Vector,TextToCompare2);
Readings1 = number_data(dummy);
Readings2 = number_data(~dummy);
I hope this helps.
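If there are more than two groups, one possible sketch is to derive a grouping key from each heading and average the values per key. The headings and values below are made up for illustration, and the assumption that the part before the last hyphen identifies the group may not match your real files.

```matlab
% Hypothetical headings and readings, standing in for txt_data and number_data
headings = {'green227-RightEye-reading3', 'green227-RightEye-reading4', ...
            'green227-LeftEye-reading1',  'red101-LeftEye-reading2'};
values   = [10 12 7 5];

% Assumed convention: everything before the last hyphen is the grouping key
keys = regexprep(headings, '-[^-]*$', '');

% unique returns the distinct keys (sorted) and, for every heading,
% the index of the key it belongs to
[groupNames, ~, groupIdx] = unique(keys);

% Average all values that share the same key
groupMeans = accumarray(groupIdx(:), values(:), [], @mean);

for g = 1:numel(groupNames)
    fprintf('%s: %g\n', groupNames{g}, groupMeans(g));
end
```

The regular expression is the only part tied to the heading format, so a different naming scheme would only need a different pattern.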

Getting the largest and smallest word at a string

When I run this code the output is (" "," "), but it should be ("I","love"), and there are no errors. What should I do to fix it?
sen="I love dogs"
function Longest_word(sen)
x=" "
maxw=" "
minw=" "
minl=1
maxl=length(sen)
p=0
for i=1:length(sen)
if(sen[i]!=" ")
x=[x[1]...,sen[i]...]
else
p=length(x)
if p<min1
minl=p
minw=x
end
if p>maxl
maxl=p
maxw=x
end
x=" "
end
end
return minw,maxw
end
As @David mentioned, another and maybe better solution uses the split function:
function longest_word(sentence)
sp=split(sentence)
len=map(length,sp)
return (sp[indmin(len)],sp[indmax(len)])
end
The idea of your code is good, but there are a few mistakes.
You can see what's going wrong by debugging a bit. The easiest way to do this is with @show, which prints out the value of variables. When code doesn't work like you expect, this is the first thing to do -- just ask it what it's doing by printing everything out!
E.g. if you put
if(sen[i]!=" ")
x=[x[1]...,sen[i]...]
@show x
and run the function with
Longest_word("I love dogs")
you will see that it is not doing what you want it to do, which (I believe) is add the ith letter to the string x.
Note that the ith letter accessed like sen[i] is a character not a string.
You can try converting it to a string with
string(sen[i])
but this gives a Unicode string, not an ASCII string, in recent versions of Julia.
In fact, it would be better not to iterate over the string using
for i in 1:length(sen)
but iterate over the characters in the string (which will also work if the string is Unicode):
for c in sen
Then you can initialise the string x as
x = UTF8String("")
and update it with
x = string(x, c)
Try out some of these possibilities and see if they help.
Also, you have maxl and minl defined wrong initially -- they should be the other way round. Also, the names of the variables are not very helpful for understanding what should happen. And the strings should be initialised to empty strings, "", not a string with a space, " ".
@daycaster is correct that there seems to be a min1 that should be minl.
However, in fact there is an easier way to solve the problem, using the split function, which divides a string into words.
Let us know if you still have a problem.
Here is a working version following your idea:
function longest_word(sentence)
x = UTF8String("")
maxw = ""
minw = ""
maxl = 0 # counterintuitive! start the "wrong" way round
minl = length(sentence)
for i in 1:length(sentence) # or: for c in sentence
if sentence[i] != ' ' # or: if c != ' '
x = string(x, sentence[i]) # or: x = string(x, c)
else
p = length(x)
if p < minl
minl = p
minw = x
end
if p > maxl
maxl = p
maxw = x
end
x = ""
end
end
return minw, maxw
end
Note that this function does not work if the longest word is at the end of the string. How could you modify it for this case?

Finding location of specified substring in a specified string (MATLAB)

I have a simple question that I need help with. My code, I believe, is almost complete, but I'm having trouble with a specific line of code.
I have an assignment question (2 parts) that asks me to find whether a protein (string), has the specified motif (substring) at that particular location (location). This is the first part, and the function and code looks like this:
function output = Motif_Match(motif,protein,location)
%This code will print a '1' if the motif occurs in the protein starting
%at the given location, else it will print a '0'
for k = 1:location %Iterates through specified location
if protein(1, [k, k+1]) == motif; % if the location matches the protein and motif
output = 1;
else
output = 0;
end
end
This part I was able to get correctly, and example of this is as follows:
p = 'MGNAAAAKKGN'
m = 'GN'
Motif_Match(m,p,2)
ans =
1
The second part of the question, which I am stuck on, is to take the motif and protein and return a vector containing the locations at which the motif occurs in the protein. To do this, I am using calls to my previous code and I am not supposed to use any functions that make this easy such as strfind, find, hist, strcmp etc.
My code for this, so far, is:
function output = Motif_Find(motif,protein)
[r,c] = size(protein)
output = zeros(r,c)
for k = 1:c-1
if Motif_Match(motif,protein,k) == 1;
output(k) = protein(k)
else
output = [];
end
end
I believe something is wrong at line 6 of this code. I want the output to give me the locations, and I think this line is incorrect, but I can't think of anything else. An example of what should happen is as follows:
p = 'MGNAAAAKKGN';
m = 'GN';
Motif_Find(m,p)
ans =
2 10
So my question is, how can I get my code to give me the locations? I've been stuck on this for quite a while and can't seem to get anywhere with this. Any help will be greatly appreciated!
Thank you all!
You are very close.
output(k) = protein(k)
should be
output(k) = k
This is because we want just the location k of the match. Using protein(k) gives us the character at position k in the protein string.
Also, the very last thing I would do is return only the nonzero elements. The easiest way is to call find with the vector/matrix as its only argument,
so after your loop just do this:
output = find(output); %returns only non zero elements
edit
I just noticed another problem: output = []; sets output to an empty array, which isn't what you want. I think you meant output(k) = 0;, and this is why you weren't getting the result you expected. But since you already initialised the whole array to zeros, you don't need that branch at all. All together, the code should look like this. I also replaced your size with length, since your proteins are linear sequences, not 2D matrices.
function output = Motif_Find(motif,protein)
protein_len = length(protein)
motif_len = length(motif)
output = zeros(1,protein_len)
%notice here I changed this to motif_length. think of it this way, if the
%length is 4, we don't need to search the last 3,2,or 1 protein groups
for k = 1:protein_len-motif_len + 1
if Motif_Match(motif,protein,k) == 1;
output(k) = k;
%we don't really need these lines, since the array already has zeros
%else
% output(k) = 0;
end
end
%returns only nonzero elements
output = find(output);
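For reference, here is a minimal self-contained sketch of both parts that avoids strfind, find and strcmp. The anonymous function motifMatch is a hypothetical stand-in for Motif_Match, not the asker's original function. It checks only the given location and works for a motif of any length.

```matlab
% motifMatch is true when motif occurs in protein starting at index loc.
% The first condition guards against reading past the end of the protein.
motifMatch = @(motif, protein, loc) ...
    loc + numel(motif) - 1 <= numel(protein) && ...
    all(protein(loc : loc + numel(motif) - 1) == motif);

p = 'MGNAAAAKKGN';
m = 'GN';

locations = [];
for k = 1 : numel(p) - numel(m) + 1
    if motifMatch(m, p, k)
        locations(end + 1) = k;   % append the matching start index
    end
end

disp(locations)   % prints 2 and 10 for this example
```

Appending to locations inside the loop removes the need for a final find call, at the cost of growing the array one element at a time.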

Matlab, order cells of strings according to the first one

I have 2 cell of strings and I would like to order them according to the first one.
A = {'a';'b';'c'}
B = {'b';'a';'c'}
idx = [2,1,3] % TO FIND
B=B(idx);
I would like to find a way to find idx...
Use the second output of ismember. ismember tells you whether or not values in the first set are anywhere in the second set. The second output tells you where these values are located if we find anything. As such:
A = {'a';'b';'c'}
B = {'b';'a';'c'}
[~,idx] = ismember(A, B);
Note that there is a minor typo when you declared your cell arrays. You have a colon in between b and c for A and a and c for B. I placed a semi-colon there for both for correctness.
Therefore, we get:
idx =
2
1
3
Benchmarking
We have three very good algorithms here, so let's benchmark them. I generate 10000 random lower-case letters and wrap them in a 1 x 10000 cell array, where each cell holds a single character. I construct A this way, and B is a random permutation of the elements in A. This is the code that I wrote to do this for us:
letters = char(97 + (0:25));
rng(123); %// Set seed for reproducibility
ind = randi(26, [10000, 1]);
lettersMat = letters(ind); %// 1 x 10000 char, because letters is a row vector
A = mat2cell(lettersMat, 1, ones(10000,1));
B = A(randperm(10000));
Now... here comes the testing code:
clear all;
close all;
letters = char(97 + (0:25));
rng(123); %// Set seed for reproducibility
ind = randi(26, [10000, 1]);
lettersMat = letters(ind);
A = mat2cell(lettersMat, 1, ones(10000,1));
B = A(randperm(10000));
tic;
[~,idx] = ismember(A,B);
t = toc;
fprintf('ismember: %f\n', t);
clear idx; %// Make sure test is unbiased
tic;
[~,idx] = max(bsxfun(@eq,char(A),char(B)'));
t = toc;
fprintf('bsxfun: %f\n', t);
clear idx; %// Make sure test is unbiased
tic;
[~, indA] = sort(A);
[~, indB] = sort(B);
idx(indA) = indB; %// A(indA) and B(indB) are the same sorted list, so B(idx) equals A
t = toc;
fprintf('sort: %f\n', t);
This is what I get for timing:
ismember: 0.058947
bsxfun: 0.110809
sort: 0.006054
Luis Mendo's approach is the fastest, followed by ismember, and then finally bsxfun. For code compactness, ismember is preferred but for performance, sort is better. Personally, I think bsxfun should win because it's such a nice function to use ;).
This seems to be significantly faster than using ismember (although admittedly less clear than @rayryeng's answer). With thanks to @Divakar for his correction on this answer.
[~, indA] = sort(A);
[~, indB] = sort(B);
idx(indA) = indB; % A(indA) and B(indB) are the same sorted list, so B(idx) equals A
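To sanity-check a sort-based index on input where A is not already sorted, it can be compared against ismember. The sketch below uses made-up inputs and relies on the relation idx(indA) = indB, which follows from A(indA) and B(indB) being the same sorted list.

```matlab
A = {'c'; 'a'; 'b'};          % hypothetical input, deliberately not sorted
B = {'a'; 'b'; 'c'};

[~, indA] = sort(A);
[~, indB] = sort(B);

% The k-th smallest string sits at A(indA(k)) and at B(indB(k)),
% so the element A(indA(k)) must be fetched from position indB(k) of B.
idxSort = zeros(numel(A), 1);
idxSort(indA) = indB;

[~, idxMember] = ismember(A, B);

disp(isequal(idxSort, idxMember))   % both methods agree
disp(isequal(B(idxSort), A))        % reordering B reproduces A
```

With the question's own example A is already sorted, so indA is the identity and several different index expressions happen to give the same answer. An unsorted A is the more telling test.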
I had to jump in, as it seems runtime performance could be a criterion here :)
Assuming that you are dealing with scalar strings (one character in each cell), here's my take, which works even when A and B have elements that are not common to both. It uses the very powerful bsxfun, so I am hoping it is runtime-efficient -
[v,idx] = max(bsxfun(@eq,char(A),char(B)'));
idx = v.*idx
Example -
A =
'a' 'b' 'c' 'd'
B =
'b' 'a' 'c' 'e'
idx =
2 1 3 0
For the specific case where every element of A also appears in B and vice versa, it becomes a one-liner -
[~,idx] = max(bsxfun(@eq,char(A),char(B)'))
Example -
A =
'a' 'b' 'c'
B =
'b' 'a' 'c'
idx =
2 1 3

MATLAB generate combination from a string

I have a string like this: "FBECGHD", and I need to generate all of its possible permutations in MATLAB. Is there a specific MATLAB function that does this, or should I write a custom function?
Use the perms function. A string in matlab is a list of characters, so it will permute them:
A = 'FBECGHD';
perms(A)
You can also store the output (e.g. P = perms(A)), and, if A is an N-character string, P is a N!-by-N array, where each row corresponds to a permutation.
If you are interested in unique permutations, you can use:
unique(perms(A), 'rows')
to remove duplicates (otherwise something like 'ABB' would give 6 results, instead of the 3 that you might expect).
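As a small illustration of the 'ABB' case, the sketch below counts the rows before and after removing duplicates. The variable names P and U are chosen here for illustration.

```matlab
A = 'ABB';
P = perms(A);              % 6-by-3 char array, one permutation per row
U = unique(P, 'rows');     % identical rows collapse into one, sorted

fprintf('%d permutations, %d unique\n', size(P, 1), size(U, 1));
disp(U)
```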
As Richante answered, P = perms(A) is very handy for this. You may also notice that P is of type char and it's not convenient to subset/select individual permutation. Below worked for me:
str = 'FBECGHD';
A = perms(str);
B = cellstr(reshape(A,7,[])');
C = unique(B);
Note that unique(A, 'rows') removes duplicate rows only; it does not remove repeated values within a single row:
>> A=[11, 11];
>> unique(A, 'rows')
ans =
11 11
unique(A) without 'rows', by contrast, deduplicates individual elements:
>> unique(A)
ans =
11
I am not a matlab pro by any means and I didn't investigate this exhaustively but at least in some cases it appears that reshape is not what you want. Notice that below gives 999 and 191 as permutations of 199 which isn't true. The reshape function as written appears to operate "column-wise" on A:
>> str = '199';
A = perms(str);
B = cellstr(reshape(A,3,[])');
C = unique(B);
>> C
C =
'191'
'199'
'911'
'919'
'999'
Below does not produce 999 or 191:
B = {};
index = 1;
while true
try
substring = A(index,:);
B{index}=substring;
index = index + 1;
catch
break
end
end
C = unique(B)
C =
'199' '919' '991'
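A shorter route to the same result, as a sketch: since each row of perms(str) is already one permutation, cellstr can convert the rows directly and neither reshape nor the loop is needed.

```matlab
str = '199';
A = perms(str);             % 6-by-3 char array, one permutation per row
C = unique(cellstr(A));     % one cell per row, duplicates removed

disp(C)
```

Here C is a column cell array, so individual permutations can be picked out with C{k}.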

Resources