how to find two different strings in the same line in matlab - string

I have a cell obtained from text scan and I want to find the index of lines containing particular string,
fid = fopen('data.txt');
E = textscan(fid, '%s', 'Delimiter', '\n');
and I wanted to know the line numbers (index) of those lines which have a specific text, e.g. I wanted to find the rows that have the keyword "2016":
rows = find(contains(E{1},"2016" );
but I want to find the index of those lines which have two keywords "2016" and "Mathew Perry" (only those lines which have both the keywords).
I tried using this code but does not work
rows = find(contains(E{1},"2016" && contains(E{1},"Mathew Perry");
the error I get is:
Operands to the || and && operators must be convertible to logical scalar values.

To find a single string:
idx = strfind(E{1}, '2016');
idx = find(not(cellfun('isempty', idx)));
Use strfind instead of find. YOu may try the above with and/or. If it works fine, then no problem; if not, get the indices separately for each word and get the intersection of the indices.

Related

How to split a string after a dot UNLESS the characters after the dot are numbers

I need to take only the letters and numbers at the beginning of a string, but some numbers are decimals. The strings are not all formatted the same. Here are a few examples of some of the data and what I would need returned:
HB61 .M16 1973 I need HB61 returned
HB97.52 .R6163 1982 I need HB97.52 returned
HB98.V38 1994 I need HB98 returned
HB 119.G74 A3 2007 I need HB119 returned
I'm very new to coding so I'm hoping there's some simple solution that I just don't know?
I was going to just split it at the first dot and then get rid of the spaces, but this wouldn't allow me to keep the decimals such as HB97.52 which I need. I currently have code written just to test one string at a time. The code is as follows:
data = input("Data: ")
components = data.split(".")
str(components)
print(components[0].replace(" ", ""))
This works as expected except for the strings with decimals. for HB97.52 .R6163 1982 I would like HB97.52 returned but it only returns HB97.
The following regular expression extracts the letters at the beginning of a string, followed by optional spaces, followed by a [possibly floating point] number:
s = ['HB61 .M16 1973', 'HB97.52 .R6163 1982',
'HB98.V38 1994', 'HB 119.G74 A3 2007']
import re
pattern = r"^[a-z]+\s*\d+(?:\.\d+)?"
[re.findall(pattern, part, flags=re.I)[0] for part in s]
#['HB61', 'HB97.52', 'HB98', 'HB 119']
If you do not want the spaces in the output, this slightly different pattern extracts the letter part and the number part separately, and then they are joined:
pattern = r"(^[a-z]+)\s*(\d+(?:\.\d+)?)"
list(map("".join, [re.findall(pattern, part, flags=re.I)[0] for part in s]))
#['HB61', 'HB97.52', 'HB98', 'HB119']
For something like HB61.45.78.R5000 what do you want? If you want HB61.45.78 then use this first snippet:
data = data.replace(' ', '')
data = data.split('.')
wanted = data[0]
for i in range(1,len(data)):
if data[i][0].isalpha():
break
else:
wanted += '.' + data[i]
Otherwise, if you want only HB61.45 then use
data = data.replace(' ', '')
data = data.split('.')
wanted = data[0]
if not data[1][0].isalpha():
wanted += '.' + data[1]

Octave - return the position of the first occurrence of a string in a cell array

Is there a function in Octave that returns the position of the first occurrence of a string in a cell array?
I found findstr but this returns a vector, which I do not want. I want what index does but it only works for strings.
If there is no such function, are there any tips on how to go about it?
As findstr is being deprecated, a combination of find and strcmpi may prove useful. strcmpi compares strings by ignoring the case of the letters which may be useful for your purposes. If this is not what you want, use the function without the trailing i, so strcmp. The input into strcmpi or strcmp are the string to search for str and for your case the additional input parameter is a cell array A of strings to search in. The output of strcmpi or strcmp will give you a vector of logical values where each location k tells you whether the string k in the cell array A matched with str. You would then use find to find all locations of where the string matched, but you can further restrain it by specifying the maximum number of locations n as well as where to constrain your search - specifically if you want to look at the first or last n locations where the string matched.
If the desired string is in str and your cell array is stored in A, simply do:
index = find(strcmpi(str, A)), 1, 'first');
To reiterate, find will find all locations where the string matched, while the second and third parameters tell you to only return the first index of the result. Specifically, this will return the first occurrence of the desired searched string, or the empty array if it can't be found.
Example Run
octave:8> A = {'hello', 'hello', 'how', 'how', 'are', 'you'};
octave:9> str = 'hello';
octave:10> index = find(strcmpi(str, A), 1, 'first')
index = 1
octave:11> str = 'goodbye';
octave:12> index = find(strcmpi(str, A), 1, 'first')
index = [](1x0)

sort string according to first characters matlab

I have an cell array composed by several strings
names = {'2name_19surn', '3name_2surn', '1name_2surn', '10name_1surn'}
and I would like to sort them according to the prefixnumber.
I tried
[~,index] = sortrows(names.');
sorted_names = names(index);
but I get
sorted_names = {'10name_1surn', '1name_2surn', '2name_19surn', '3name_2surn'}
instead of the desired
sorted_names = {'1name_2surn', '2name_19surn', '3name_2surn','10name_1surn'}
any suggestion?
Simple approach using regular expressions:
r = regexp(names,'^\d+','match'); %// get prefixes
[~, ind] = sort(cellfun(#(c) str2num(c{1}), r)); %// convert to numbers and sort
sorted_names = names(ind); %// use index to build result
As long as speed is not a concern you can loop through all strings and save the first digets in an array. Subsequently sort the array as usual...
names = {'2name_2', '3name', '1name', '10name'}
number_in_string = zeros(1,length(names));
% Read numbers from the strings
for ii = 1:length(names)
number_in_string(ii) = sscanf(names{ii}, '%i');
end
% Sort names using number_in_string
[sorted, idx] = sort(number_in_string)
sorted_names = names(idx)
Take the file sort_nat from here
Then
names = {'2name', '3name', '1name', '10name'}
sort_nat(names)
returns
sorted_names = {'1name', '2name', '3name','10name'}
You can deal with arbitrary patterns using a regular expression:
names = {'2name', '3name', '1name', '10name'}
match = regexpi(names,'(?<number>\d+)\D+','names'); % created with regex editor on rubular.com
match = cell2mat(match); % cell array to struct array
clear numbersStr
[numbersStr{1:length(match)}] = match.number; % cell array with number strings
numbers = str2double(numbersStr); % vector of numbers
[B,I] = sort(numbers); % sorted vector of numbers (B) and the indices (I)
clear namesSorted
[namesSorted{1:length(names)}] = names{I} % cell array with sorted name strings

Matlab fints doesn't like a string value I pass as an argument

I have a program that takes the columns of a fints-object, multiplies them together pairwise in all combinations and output the result in a new fints object. I have the code for the data, but I also want the series labels to carry through so that the product of column a and b has label a*b.
function tsB = MulTS(tsA)
anames = fieldnames(tsA,1)';
A = fts2mat(tsA);
[i,j] = meshgrid(1:size(A,2),1:size(A,2));
B = Mul(A(:,i(:)),A(:,j(:)));
q = [anames(:,i(:)); anames(:,j(:))];
bnames = strcat(q(1,:),'*', q(2,:));
tsB=fints(tsA.dates, B, bnames);
end
I get warnings when I run it.
tsA= fints([1 2 3]', [[1 1 1]' [2 2 2]'],{'a','b'}');
MulTS(tsA)
??? Error using ==> fints.fints at 188
Illegal name(s) detected. Please check the name(s).
Error in ==> MulTS at 10
tsB=fints(tsA.dates, B, bnames);"
It seems Matlab doesn't like the format of bnames. I've tried googling stuff like "convert cell array to string matlab" and trying things like b = {bnames}. What am I doing wrong?
Your datanames (bnames in MulTS) seems to contain a "*" character, which is illegal according to fints documentation:
datanames
Cell array of data series names. Overrides the default data series names. Default data series names are series1, series2, and so on.
Note: Not all strings are accepted as datanames parameters. Supported data series names cannot start with a number and must contain only these characters:
Lowercase Latin alphabet, a to z
Uppercase Latin alphabet, A to Z
Underscore, _
Try replacing the "*" with "_" or something else.

How do I read a delimited file with strings/numbers with Octave?

I am trying to read a text file containing digits and strings using Octave. The file format is something like this:
A B C
a 10 100
b 20 200
c 30 300
d 40 400
e 50 500
but the delimiter can be space, tab, comma or semicolon. The textread function works fine if the delimiter is space/tab:
[A,B,C] = textread ('test.dat','%s %d %d','headerlines',1)
However it does not work if delimiter is comma/semicolon. I tried to use dklmread:
dlmread ('test.dat',';',1,0)
but it does not work because the first column is a string.
Basically, with textread I can't specify the delimiter and with dlmread I can't specify the format of the first column. Not with the versions of these functions in Octave, at least. Has anybody ever had this problem before?
textread allows you to specify the delimiter-- it honors the property arguments of strread. The following code worked for me:
[A,B,C] = textread( 'test.dat', '%s %d %d' ,'delimiter' , ',' ,1 )
I couldn't find an easy way to do this in Octave currently. You could use fopen() to loop through the file and manually extract the data. I wrote a function that would do this on arbitrary data:
function varargout = coltextread(fname, delim)
% Initialize the variable output argument
varargout = cell(nargout, 1);
% Initialize elements of the cell array to nested cell arrays
% This syntax is due to {:} producing a comma-separated
[varargout{:}] = deal(cell());
fid = fopen(fname, 'r');
while true
% Get the current line
ln = fgetl(fid);
% Stop if EOF
if ln == -1
break;
endif
% Split the line string into components and parse numbers
elems = strsplit(ln, delim);
nums = str2double(elems);
nans = isnan(nums);
% Special case of all strings (header line)
if all(nans)
continue;
endif
% Find the indices of the NaNs
% (i.e. the indices of the strings in the original data)
idxnans = find(nans);
% Assign each corresponding element in the current line
% into the corresponding cell array of varargout
for i = 1:nargout
% Detect if the current index is a string or a num
if any(ismember(idxnans, i))
varargout{i}{end+1} = elems{i};
else
varargout{i}{end+1} = nums(i);
endif
endfor
endwhile
endfunction
It accepts two arguments: the file name, and the delimiter. The function is governed by the number of return variables that are specified, so, for example, [A B C] = coltextread('data.txt', ';'); will try to parse three different data elements from each row in the file, while A = coltextread('data.txt', ';'); will only parse the first elements. If no return variable is given, then the function won't return anything.
The function ignores rows that have all-strings (e.g. the 'A B C' header). Just remove the if all(nans)... section if you want everything.
By default, the 'columns' are returned as cell arrays, although the numbers within those arrays are actually converted numbers, not strings. If you know that a cell array contains only numbers, then you can easily convert it to a column vector with: cell2mat(A)'.

Resources