Finding location of specified substring in a specified string (MATLAB) - string

I have a simple question that I need help on. My code,I believe, is almost complete but im having trouble with the a specific line of code.
I have an assignment question (2 parts) that asks me to find whether a protein (string), has the specified motif (substring) at that particular location (location). This is the first part, and the function and code looks like this:
function output = Motif_Match(motif,protein,location)
%This code wil print a '1' if the motif occurs in the protein starting
at the given location, else it wil print a '0'
for k = 1:location %Iterates through specified location
if protein(1, [k, k+1]) == motif; % if the location matches the protein and motif
output = 1;
else
output = 0;
end
end
This part I was able to get correctly, and example of this is as follows:
p = 'MGNAAAAKKGN'
m = 'GN'
Motif_Match(m,p,2)
ans =
1
The second part of the question, which I am stuck on, is to take the motif and protein and return a vector containing the locations at which the motif occurs in the protein. To do this, I am using calls to my previous code and I am not supposed to use any functions that make this easy such as strfind, find, hist, strcmp etc.
My code for this, so far, is:
function output = Motif_Find(motif,protein)
[r,c] = size(protein)
output = zeros(r,c)
for k = 1:c-1
if Motif_Match(motif,protein,k) == 1;
output(k) = protein(k)
else
output = [];
end
end
I belive something is wrong at line 6 of this code. My thinking on this is that I want the output to give me the locations to me and that this code on this line is incorrect, but I can't seem to think of anything else. An example of what should happen is as follows:
p = 'MGNAAAAKKGN';
m = 'GN';
Motif_Find(m,p)
ans =
2 10
So my question is, how can I get my code to give me the locations? I've been stuck on this for quite a while and can't seem to get anywhere with this. Any help will be greatly appreciated!
Thank you all!

you are very close.
output(k) = protein(k)
should be
output(k) = k
This is because we want just the location K of the match. Using protien(k) will gives us the character at position K in the protein string.
Also the very last thing I would do is only return the nonzero elements. The easiest way is to just use the find command with no arguments besides the vector/matrix
so after your loop just do this
output = find(output); %returns only non zero elements
edit
I just noticed another problem output = []; means set output to an empty array. this isn't what you want i think what you meant was output(k) = 0; this is why you weren't getting the result you expected. But REALLY since you already made the whole array zeros, you don't need that at all. all together, the code should look like this. I also replaced your size with length since your proteins are linear sequences, not 2d matricies
function output = Motif_Find(motif,protein)
protein_len = length(protein)
motif_len = length(motif)
output = zeros(1,protein_len)
%notice here I changed this to motif_length. think of it this way, if the
%length is 4, we don't need to search the last 3,2,or 1 protein groups
for k = 1:protein_len-motif_len + 1
if Motif_Match(motif,protein,k) == 1;
output(k) = k;
%we don't really need these lines, since the array already has zeros
%else
% output(k) = 0;
end
end
%returns only nonzero elements
output = find(output);

Related

How to find positions of the last occurrence of a pattern in a string, and use these to extract a substring from another string

I need some help with a specific problem, which I cannot seem to find on this website.
I have a result which looks something like this:
result = "ooooooooooooooooooooooMMMMMMooooooooooooooooooMMMMMMooooooooooMMMMMMMMoo"
This is a transmembrane prediction. So for this string, I have another string of the same length, but is an amino acid code, for example:
amino_acid_code = "MSDENKSTPIVKASDITDKLKEDILTISKDALDKNTWHVIVGKNFGSYVTHEKGHFVYFYIGPLAFLVFKTA"
I want to do some research on the last "M" region. This can vary in length, as well as the "o" that comes after. So in this case I need to extract "PLAFLVFK" from the last string, which corresponds to the last "M" region.
I have something like this already, but I cannot figure out how to obtain the start position, and I also believe a simpler (or computationally better) solution is possible.
end = result.rfind('M')
start = ?
region_I_need = amino_acid_code[start:end]
Thanks in advance
To also find the start position, use rfind again after slicing off the characters after the end of the result string:
result = "ooooooooooooooooooooooMMMMMMooooooooooooooooooMMMMMMooooooooooMMMMMMMMoo"
amino_acid_code = "MSDENKSTPIVKASDITDKLKEDILTISKDALDKNTWHVIVGKNFGSYVTHEKGHFVYFYIGPLAFLVFKTA"
# add 1 to the indices to get the correct positions
end = result.rfind('M') + 1
start = result[:end].rfind('o') + 1
region_I_need = amino_acid_code[start:end]
print(start, end)
print(amino_acid_code[start:end])
>>> 62 70
>>> PLAFLVFK

Error: "can't multiply sequence by non-int of type 'float'"

It says the error is when h (altitude) is between 11000 and 25000, so I only posted the initial stuff outside all my if loops and the specific loop where the problem is happening. Here is my code:
import math;
T = 0.0;
P = 0.0;
hString = ("What is the altitude in meters?");
h = int(hString);
e = math.exp(0.000157*h);
elif 11000 < h < 25000:
T = -56.46;
P = (22.65)*[(1.73)-e];
When you use mathematical operations you need to be careful with brackets.
P = (22.65)*((1.73)-e); #will be right way of using
[ ] using will create a list which you, do not need in this program.
Here is a link which will help you learn much more about type conversions and proper use of brackets while doing mathematics on it.
Also in your code you have not used
hString =input ("What is the altitude in meters?");
h = int(hString);
input will allow you to take value from user and then int(your_input) will help you convert to integer
The square brackets in the last line ([(1.73)-e]) create a list. In this case, it's a list with one element, namely (1.73)-e. I imagine you intended those to be parens. Make that change and it will work.
The final line becomes:
P = (22.65)*((1.73)-e);

Can you group data with similar written column headings on xlswrite, matlab?

Very new to matlab and still learning the basics. I'm trying to write a script which calculates the distance between two peaks in a waveform. That part I have managed to do, and I have used xlswrite to put the values I have obtained onto an excel file.
For each file, I have between about 50-250 columns, with just two rows: the second row has the numerical value, and the first has the column headings, copied from original excel files I extracted the data from.
Some of the columns have similar, but not identical, headings, e.g. 'green227RightEyereading3' and 'green227RightEyereading4' etc. Is there a way I can group columns with similar headings, e.g. which have the same number/colour in the heading (I.e.green227) and either 'right eye' or 'left eye', and calculate an average of their numerical values? Link to file here: >https://www.dropbox.com/s/ezpyjr3raol31ts/SampleBatchForTesting.xls?dl=0>
>[Excel_file,PathName] = uigetfile('*.xls', 'Pick a File','C:\Users\User\Documents\Optometry\Year 3\Dissertation\A-scan3');
>[~,name,ext] = fileparts(Excel_file);
>sheet = 2;
>FullXLSfile = [PathName, Excel_file];
>[number_data,txt_data,raw_data] = xlsread(FullXLSfile,sheet);
>HowManyWide = size(txt_data);
>NumberOfTitles = HowManyWide(1,2);
>xlRangeA = txt_data;
>Chickens = {'Test'};
>for f = 1:xlRangeA; %%defined as top line of cells on sheet;
>Text = xlRangeA{f};
>HyphenLocations = find(Text == '-');
>R = HyphenLocations(1,1) -1;
>Chick = Text(1:R);
>Chick = cellstr(Chick);
>B = length(Chick);
>TF = strncmp(Chickens,Chick,B);
>if any(TF == 1); %do nothing
>else
>Chickens = {Chickens;Chick};
>end
>end
Here also is a link to the file that is created when I run my entire script. The values below the headings are the calculated thickesses of the tissue I'm analysing. https://www.dropbox.com/s/4p6iu9kk75ecyzl/Choroid_Thickness.xls?dl=0
Thanks very much
If the different characters are located at the very end (or the very beginning) of the heading, you can go with strncmp buit-in function and compare only part of the string. See more here. But please, provide some code and a part of your excel file. It would help.
Also, if I am not mistaken, you are saving all the data into excel and then re-call it again in order to sort it. Maybe you should consider saving only the final result in excel, it will save you some time, especially if you want to run your script many times.
EDIT:
Here is the code I came up with. It is not the best possible solution for sure, but it works with the file you uploaded. I have omitted the unnecessary lines and variables. The code works only if the numbers of each reading have the same amount of digits. They can be 4 digits as long as every entry has 4 digits. Since in each file you have waves of the same color, the only thing that you care about is whether the reading was recorded with the left or the right eye (correct?). Based on that and the code you wrote, the comparison concerns the part of the string that contains the words "Right" or "Left", i.e. the characters between the hyphens.
[Excel_file,PathName] = uigetfile('*.xls', 'Pick a File',...
'C:\Users\User\Documents\Optometry\Year 3\Dissertation\A-scan3');
sheet = 1;
FullXLSfile = [PathName,Excel_file];
[number_data,txt_data,raw_data] = xlsread(FullXLSfile,sheet);
%% data manipulation
NumberOfTitles = length(txt_data);
TextToCompare = txt_data{1};
r1 = 1; % counter for Readings1 vector
r2 = 1; % counter for Readings2 vector
for ff = 1:NumberOfTitles % in your code xlRangeA is a cell vector not a number!
Text = txt_data{ff};
HyphenLocations = find(Text == '-');
Text = Text(HyphenLocations(1,1):HyphenLocations(1,2)); % take only the part that contains the "eye" information
TextToCompare = TextToCompare(HyphenLocations(1,1):HyphenLocations(1,2)); % same here
if (Text == TextToCompare)
Readings1(r1) = number_data(ff); % store the numerical value in a vector
r1 = r1 + 1; % increase the counter of this vector
else
Readings2(r2) = number_data(ff); % same here
r2 = r2 + 1;
end
TextToCompare = txt_data{1}; % TextToCompare re-initialized for the next comparison
end
mean_readings1 = mean(Readings1); % Find the mean of the grouped values
mean_readings2 = mean(Readings2);
I am positive that this can be done in a more efficient and delicate way. I don't know exactly what kind of calculations you want to do so I only included the mean values as an example. Inside the if statement you can also store the txt_data if you need it. Below I have also included a second way which I find more delicate. Just substitute the %%data manipulation part with the part below if you want to test it:
%% more delicate way
Text_Vector = char(txt_data);
TextToCompare2 = txt_data{1};
HyphenLocations2 = find(TextToCompare2 == '-');
TextToCompare2 = TextToCompare2(HyphenLocations2(1,1):HyphenLocations2(1,2));
Text_Vector = Text_Vector(:,HyphenLocations2(1,1):HyphenLocations2(1,2));
Text_Vector = cellstr(Text_Vector);
dummy = strcmpi(Text_Vector,TextToCompare2);
Readings1 = number_data(dummy);
Readings2 = number_data(~dummy);
I hope this helps.

How to tangle/scramble/rearrange a string in MATLAB?

For an example exam question, I've been asked to "tangle" a string as shown:
tangledWord('today')='otady'
tangledWord('12345678')='21436587'
I understand this is an extremely simple problem but it's got me stumped.
I can make it produce a tangled word when the length is even, but I'm having trouble when it's odd, here's my function:
function tangledWord(s)
n=length(s);
a=s(1:2:n);
b=s(2:2:n);
s(1:2:n)=b;
s(2:2:n)=a;
disp(s);
end
For odd word length, you need to reduce n by 1 to leave the last char untouched. Use mod to detect odd word length.
If you want to scramble every char randomly, you can try:
string = '1234567';
shuffled = string(randperm(numel(string)))
shuffled = 5741326
If you want to change the first two chars:
tangled = [string(2) string(1) string(3:end)]
tangled = 2134567
If you want to change every two chars:
n = ( numel(string)-mod(numel(string),2));
tangled2 = [flipud(reshape(string(1:n),[],n/2))(:); string(n+1:end)]'
tangled2 = 2143657
function tangledWord(s)
n=length(s);
if mod(n,2) == 0
a=s(1:2:n);
b=s(2:2:n);
s(1:2:n)=b;
s(2:2:n)=a;
disp(s)
elseif mod(n,2) ~= 0
a=s(1:2:end-1);
b=s(2:2:end-1);
s(1:2:end-1)=b;
s(2:2:end-1)=a;
disp(s)
end
end

MATLAB generate combination from a string

I've a string like this "FBECGHD" and i need to use MATLAB and generate all the required possible permutations? In there a specific MATLAB function that does this task or should I define a custom MATLAB function that perform this task?
Use the perms function. A string in matlab is a list of characters, so it will permute them:
A = 'FBECGHD';
perms(A)
You can also store the output (e.g. P = perms(A)), and, if A is an N-character string, P is a N!-by-N array, where each row corresponds to a permutation.
If you are interested in unique permutations, you can use:
unique(perms(A), 'rows')
to remove duplicates (otherwise something like 'ABB' would give 6 results, instead of the 3 that you might expect).
As Richante answered, P = perms(A) is very handy for this. You may also notice that P is of type char and it's not convenient to subset/select individual permutation. Below worked for me:
str = 'FBECGHD';
A = perms(str);
B = cellstr(reshape(A,7,[])');
C = unique(B);
It also appears that unique(A, 'rows') is not removing duplicate values:
>> A=[11, 11];
>> unique(A, 'rows')
ans =
11 11
However, unique(A) would:
>> unique(A)
ans =
11
I am not a matlab pro by any means and I didn't investigate this exhaustively but at least in some cases it appears that reshape is not what you want. Notice that below gives 999 and 191 as permutations of 199 which isn't true. The reshape function as written appears to operate "column-wise" on A:
>> str = '199';
A = perms(str);
B = cellstr(reshape(A,3,[])');
C = unique(B);
>> C
C =
'191'
'199'
'911'
'919'
'999'
Below does not produce 999 or 191:
B = {};
index = 1;
while true
try
substring = A(index,:);
B{index}=substring;
index = index + 1;
catch
break
end
end
C = unique(B)
C =
'199' '919' '991'

Resources