MATLAB: Find locations of multiple characters inside string

How can I find the locations of certain characters within a string? This is my attempt:
Example = "Hello, this is Tom. I wonder, should I go run?";
SearchedCharacters = {'.','!',',','?'};
%Plan one
Locations = strfind(Example, SearchedCharacters);
%Plan two
Locations = cellfun(@(s)find(~cellfun('isempty',strfind(C,s))),SearchedCharacters,'uni',0);
Both of my plans give errors.
Finally, once I have the locations of the characters within the string, I would like to determine the second-to-last character of interest in the string. In this case it would be "," (just after the word "wonder"), at location 29.
Help will be appreciated.
Thanks.

You can use ismember and find.
Find the second last location:
Example = 'Hello, this is Tom. I wonder, should I go run?' ;
SearchedCharacters = '.!,?' ;
idx = ismember (Example, SearchedCharacters);
Loc = find (idx, 2, 'last');
if numel (Loc) < 2
error ('the requested character cannot be found')
end
SecondLast = Loc (1);
Find all locations:
Locations = find (idx);
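The same membership-test idea can be sketched in Python, the other language used in this thread. This is only an illustration, not part of the answer above: the helper name find_char_locations is hypothetical, and positions are reported 1-based so that they line up with the MATLAB results.

```python
def find_char_locations(text, chars):
    """Return the 1-based positions of every character in `text` that is in `chars`."""
    wanted = set(chars)
    return [i for i, ch in enumerate(text, start=1) if ch in wanted]


example = "Hello, this is Tom. I wonder, should I go run?"
locations = find_char_locations(example, ".!,?")

# Mirror the MATLAB check: we need at least two hits to have a "second last" one.
if len(locations) < 2:
    raise ValueError("fewer than two of the requested characters were found")

second_last = locations[-2]
print(locations, second_last)  # [6, 19, 29, 46] 29
```

Using a set for the wanted characters keeps each membership test cheap, which plays the same role as ismember in the MATLAB version.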

Related

How to find positions of the last occurrence of a pattern in a string, and use these to extract a substring from another string

I need some help with a specific problem, which I cannot seem to find on this website.
I have a result which looks something like this:
result = "ooooooooooooooooooooooMMMMMMooooooooooooooooooMMMMMMooooooooooMMMMMMMMoo"
This is a transmembrane prediction. So for this string, I have another string of the same length, but is an amino acid code, for example:
amino_acid_code = "MSDENKSTPIVKASDITDKLKEDILTISKDALDKNTWHVIVGKNFGSYVTHEKGHFVYFYIGPLAFLVFKTA"
I want to do some research on the last "M" region. This can vary in length, as well as the "o" that comes after. So in this case I need to extract "PLAFLVFK" from the last string, which corresponds to the last "M" region.
I have something like this already, but I cannot figure out how to obtain the start position, and I also believe a simpler (or computationally better) solution is possible.
end = result.rfind('M')
start = ?
region_I_need = amino_acid_code[start:end]
Thanks in advance
To also find the start position, call rfind again on the result string after slicing off everything past the end of the last "M" region:
result = "ooooooooooooooooooooooMMMMMMooooooooooooooooooMMMMMMooooooooooMMMMMMMMoo"
amino_acid_code = "MSDENKSTPIVKASDITDKLKEDILTISKDALDKNTWHVIVGKNFGSYVTHEKGHFVYFYIGPLAFLVFKTA"
# add 1 to the indices to get the correct positions
end = result.rfind('M') + 1
start = result[:end].rfind('o') + 1
region_I_need = amino_acid_code[start:end]
print(start, end)  # prints: 62 70
print(amino_acid_code[start:end])  # prints: PLAFLVFK
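The rfind approach above assumes the character just before the last region is always "o". A minimal sketch of a variant that instead walks left while it still sees the marker is shown below; the function name last_region and the short demo strings are hypothetical and chosen only so the result is easy to check by hand.

```python
def last_region(prediction, sequence, marker="M"):
    """Return (start, end, substring) for the last run of `marker` in `prediction`.

    `start` and `end` are 0-based slice bounds, so sequence[start:end] is the
    part of `sequence` aligned with that run. Returns None if there is no marker.
    """
    end = prediction.rfind(marker) + 1
    if end == 0:  # rfind returned -1: no marker at all
        return None
    start = end - 1
    # Walk left for as long as the previous character is still the marker.
    while start > 0 and prediction[start - 1] == marker:
        start -= 1
    return start, end, sequence[start:end]


print(last_region("ooMMoooMMMo", "ABCDEFGHIJK"))  # (7, 10, 'HIJ')
```

Because it does not look for "o" specifically, this version should also cope with predictions that use other non-marker symbols between regions.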

Is there a way to get the substring between two words in a string in Python?

My question is more or less similar to:
Is there a way to substring a string in Python?
but it's more specifically oriented.
How can I get the part of a string that is located between two known words in the initial string?
Example:
myString = "this is the initial string"
Substring = "initial"
knowing that "the" and "string" are the two known words in the string that can be used to get the substring.
Thank you!
You can start with simple string manipulation here. str.index is your best friend there, as it will tell you the position of a substring within a string; and you can also start searching somewhere later in the string:
>>> myString = "this is the initial string"
>>> myString.index('the')
8
>>> myString.index('string', 8)
20
Looking at the slice [8:20], we already get close to what we want:
>>> myString[8:20]
'the initial '
Of course, since we found the beginning position of 'the', we need to account for its length. And finally, we might want to strip whitespace:
>>> myString[8 + 3:20]
' initial '
>>> myString[8 + 3:20].strip()
'initial'
Combined, you would do this:
startIndex = myString.index('the')
substring = myString[startIndex + 3 : myString.index('string', startIndex)].strip()
If you want to look for matches multiple times, then you just need to repeat doing this while looking only at the rest of the string. Since str.index will only ever find the first match, you can use this to scan the string very efficiently:
searchString = 'this is the initial string but I added the relevant string pair a few more times into the search string.'
startWord = 'the'
endWord = 'string'
results = []
index = 0
while True:
    try:
        startIndex = searchString.index(startWord, index)
        endIndex = searchString.index(endWord, startIndex)
        results.append(searchString[startIndex + len(startWord):endIndex].strip())
        # move the index to the end
        index = endIndex + len(endWord)
    except ValueError:
        # str.index raises a ValueError if there is no match; in that
        # case we know that we’re done looking at the string, so we can
        # break out of the loop
        break
print(results)
# ['initial', 'relevant', 'search']
You can also try something like this:
mystring = "this is the initial string"
mystring = mystring.strip().split(" ")
for i in range(1, len(mystring) - 1):
    if mystring[i - 1] == "the" and mystring[i + 1] == "string":
        print(mystring[i])
I suggest using a combination of the split and join methods.
This should help if you are looking for more than one word in the substring.
Turn the string (held here in the variable string) into a list of words:
words = string.split()
Get the indices of your opening and closing markers, then build the substring:
start = words.index('the')
stop = words.index('string')
substring = ' '.join(words[start + 1:stop])
You may want to check that both markers are present before proceeding, since list.index raises a ValueError when an item is missing.
If your problem gets more complex, e.g. multiple occurrences of the marker pair, I suggest using a regular expression:
import re
substrings = re.findall(r'the (.+?) string', string)
re.findall returns the matched substrings as separate items of a list.
The spaces in the pattern keep the whitespace around the words out of the captured text; you can modify the pattern to suit your needs.
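The regular-expression suggestion above can be wrapped in a small helper. This is a sketch under a few assumptions: the name between_all is hypothetical, the marker words are escaped with re.escape so they are matched literally, and no word-boundary check is made, so a marker such as "the" would also match inside a longer word like "other".

```python
import re


def between_all(text, start_word, end_word):
    """Return every non-overlapping piece of `text` found between the two marker words."""
    pattern = re.escape(start_word) + r"\s+(.+?)\s+" + re.escape(end_word)
    return re.findall(pattern, text)


print(between_all("this is the initial string", "the", "string"))
print(between_all("from the very first string", "the", "string"))
```

The non-greedy group (.+?) stops at the first closing marker, which is what lets a single call return each pair's contents separately.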

How to find two different strings in the same line in MATLAB

I have a cell array obtained from textscan and I want to find the indices of the lines containing a particular string:
fid = fopen('data.txt');
E = textscan(fid, '%s', 'Delimiter', '\n');
and I wanted to know the line numbers (indices) of those lines which contain a specific text, e.g. I wanted to find the rows that have the keyword "2016":
rows = find(contains(E{1},"2016"));
But I want to find the indices of those lines which have both keywords "2016" and "Mathew Perry" (only the lines that contain both).
I tried this code, but it does not work:
rows = find(contains(E{1},"2016") && contains(E{1},"Mathew Perry"));
the error I get is:
Operands to the || and && operators must be convertible to logical scalar values.
To find a single string:
idx = strfind(E{1}, '2016');
idx = find(not(cellfun('isempty', idx)));
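Use strfind together with find, as above. You can combine the results for the two keywords with the element-wise operators & and | (rather than the scalar short-circuit operators && and ||, which cause the error you are seeing). If that does not work, get the indices separately for each keyword and take the intersection of the indices.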
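The "every keyword must appear on the line" logic can be sketched in Python as follows. The helper name lines_with_all and the sample lines are hypothetical, made up only for illustration, and indices are 1-based to match MATLAB row numbers.

```python
def lines_with_all(lines, keywords):
    """Return the 1-based indices of the lines that contain every keyword."""
    return [n for n, line in enumerate(lines, start=1)
            if all(keyword in line for keyword in keywords)]


# Hypothetical sample data standing in for the contents of data.txt.
sample = [
    "2016 Mathew Perry",
    "2015 Mathew Perry",
    "2016 Someone Else",
    "Mathew Perry, 2016",
]
rows = lines_with_all(sample, ["2016", "Mathew Perry"])
print(rows)  # [1, 4]
```

The all(...) call does the job of the element-wise AND: each line is tested against every keyword, and only lines passing all tests are kept.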

Finding location of specified substring in a specified string (MATLAB)

I have a simple question that I need help with. My code, I believe, is almost complete, but I'm having trouble with one specific line of code.
I have an assignment question (2 parts) that asks me to find whether a protein (string), has the specified motif (substring) at that particular location (location). This is the first part, and the function and code looks like this:
function output = Motif_Match(motif,protein,location)
%This code will print a '1' if the motif occurs in the protein starting
%at the given location, else it will print a '0'
for k = 1:location %Iterates through specified location
if protein(1, [k, k+1]) == motif; % if the location matches the protein and motif
output = 1;
else
output = 0;
end
end
This part I was able to get working correctly; an example is as follows:
p = 'MGNAAAAKKGN'
m = 'GN'
Motif_Match(m,p,2)
ans =
1
The second part of the question, which I am stuck on, is to take the motif and protein and return a vector containing the locations at which the motif occurs in the protein. To do this, I am using calls to my previous code and I am not supposed to use any functions that make this easy such as strfind, find, hist, strcmp etc.
My code for this, so far, is:
function output = Motif_Find(motif,protein)
[r,c] = size(protein)
output = zeros(r,c)
for k = 1:c-1
if Motif_Match(motif,protein,k) == 1;
output(k) = protein(k)
else
output = [];
end
end
I believe something is wrong at line 6 of this code. My thinking is that I want the output to give me the locations, and that the code on this line is incorrect, but I can't seem to think of anything else. An example of what should happen is as follows:
p = 'MGNAAAAKKGN';
m = 'GN';
Motif_Find(m,p)
ans =
2 10
So my question is, how can I get my code to give me the locations? I've been stuck on this for quite a while and can't seem to get anywhere with this. Any help will be greatly appreciated!
Thank you all!
You are very close.
output(k) = protein(k)
should be
output(k) = k
This is because we want just the location k of the match. Using protein(k) gives us the character at position k in the protein string.
Also, the very last thing I would do is return only the nonzero elements. The easiest way is to call find with no arguments besides the vector/matrix,
so after your loop just do this:
output = find(output); % returns the indices of the nonzero elements, which here equal the stored locations
Edit:
I just noticed another problem: output = []; sets output to an empty array. This isn't what you want; I think what you meant was output(k) = 0;. This is why you weren't getting the result you expected. But since you already initialized the whole array to zeros, you don't need that branch at all. All together, the code should look like this. I also replaced your size with length, since your proteins are linear sequences, not 2-D matrices.
function output = Motif_Find(motif,protein)
    protein_len = length(protein);
    motif_len = length(motif);
    output = zeros(1,protein_len);
    % Notice that the loop bound now uses motif_len. Think of it this way: if the
    % motif length is 4, we don't need to test the last 3, 2, or 1 positions
    for k = 1:protein_len-motif_len + 1
        if Motif_Match(motif,protein,k) == 1
            output(k) = k;
        % we don't really need these lines, since the array already has zeros
        %else
        %    output(k) = 0;
        end
    end
    % returns only the nonzero elements
    output = find(output);
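For comparison, the same two-function structure can be sketched in Python without using any built-in search. This is an illustration rather than the assignment's solution: the names motif_match and motif_find are hypothetical, locations are 1-based to match the MATLAB example, and motif_match here compares the whole motif character by character, so it is not limited to two-letter motifs.

```python
def motif_match(motif, protein, location):
    """Return 1 if `motif` occurs in `protein` starting at 1-based `location`, else 0."""
    start = location - 1  # convert to a 0-based index
    if start < 0 or start + len(motif) > len(protein):
        return 0
    for offset in range(len(motif)):
        if protein[start + offset] != motif[offset]:
            return 0
    return 1


def motif_find(motif, protein):
    """Return the list of 1-based locations at which `motif` occurs in `protein`."""
    locations = []
    # The last possible start is len(protein) - len(motif) + 1 (1-based).
    for k in range(1, len(protein) - len(motif) + 2):
        if motif_match(motif, protein, k) == 1:
            locations.append(k)
    return locations


print(motif_find("GN", "MGNAAAAKKGN"))  # [2, 10]
```

Appending matches to a list avoids the preallocate-then-filter step, which is why no equivalent of the final find call is needed here.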

Remove characters from a cell array of strings (Matlab)

I have a cell array of strings. I need to extract say 1-to-n characters for each item. Strings are always longer than n characters. Please see:
data = { 'msft05/01/2010' ;
'ap01/01/2013' }
% For each string, last 10 characters are removed and put it in the next column
answer = { 'msft' '05/01/2010' ;
'ap' '01/01/2013' }
Is there a vectorized solution possible? I have tried using cellfun but wasn't successful. Thanks.
data = { 'msft05/01/2010' ;
'ap01/01/2013' };
for i = 1:length(data)
s = data{i};
data{i} = {s(1:end-10) s(end-9:end)};
end
Sorry, I didn't notice that you need a vectorized solution. Perhaps I can only suggest a one-liner:
data = cellfun(@(s) {s(1:end-10) s(end-9:end)}, data, 'UniformOutput', false);
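The same split can be sketched in Python with a list comprehension. The variable n (the number of trailing characters to move into the second column) is an assumption introduced for illustration; as in the question, every string is assumed to be longer than n characters.

```python
data = ["msft05/01/2010", "ap01/01/2013"]
n = 10  # number of trailing characters that form the second column

# Each item becomes a (prefix, last-n-characters) pair.
answer = [(s[:-n], s[-n:]) for s in data]
print(answer)  # [('msft', '05/01/2010'), ('ap', '01/01/2013')]
```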
