cell array of strings to matrix - string

A = {'a','b','c','b','a',...}
A is a <1X400> cell array and I want to create a matrix from A such that if the cell is a, the matrix shows 1, if it is b, it shows as 2 in the matrix and 3 for c.
Thank you.

Specific Case
For a simple specific case as listed in the question, you can use char to convert all the cell elements to characters and then subtract 96 from it, which is ascii equivalent of 'a'-1 -
A_numeric = char(A)-96
Sample run -
>> A
A =
'a' 'b' 'c' 'b' 'a'
>> A_numeric = char(A)-96
A_numeric =
1
2
3
2
1
Generic Case
For a generic substitution case, you need to do a bit more work, like so -
%// Inputs
A = {'correct','boss','cat','boss','correct','cat'}
newcellval = {'correct','cat','boss'}
newnumval = [8,2,5]
[unqcell,~,idx] = unique(A,'stable')
[~,newcell_idx,unqcell_idx] = intersect(newcellval,unqcell,'stable')
A_numeric = newnumval(changem(idx,newcell_idx,unqcell_idx))
Sample input-output -
>> A,newcellval,newnumval
A =
'correct' 'boss' 'cat' 'boss' 'correct' 'cat'
newcellval =
'correct' 'cat' 'boss'
newnumval =
8 2 5
>> A_numeric
A_numeric =
8 5 2 5 8 2

That's easy:
result = cell2mat(A)-'a'+1
For a generic association of letters to numbers 1,2,3...:
letters2numbers = 'abc'; %// 'a'->1, 'b'->2 etc.
[~, result] = ismember(cell2mat(A), letters2numbers)
For a generic association of strings to numbers 1,2,3...:
strings2numbers = {'hi', 'hello', 'hey', 'good morning', 'howdy'};
A = {'hello', 'hi', 'hello', 'howdy', 'bye'};
[~, result] = ismember(A, strings2numbers)
In this example,
result =
2 1 2 5 0

Use a for loop that iterates over A and converts each character to a number:
outMat = zeros(1, length(A)); %// pre-allocate the output
for loop = 1:length(A)
    outMat(loop) = char(A(loop)) - 96; %// 'a' -> 1, 'b' -> 2, ...
end
I hope it works.

Related

Count number of continuous matching elements in two different numbers in Python

Suppose we have two numbers a and b, and we need to calculate the continuous matching digits between the two numbers.
some examples are shown below:
a = 123456 b = 456 ==> I need count as : 3 digits matching
a = 556789 b = 55678 ==> I need count as : 5 digits matching
I don't want the unique matching digits but the continuous ones, and I need the count. Displaying the matching digits would also be helpful. Also, can we do this for two different lists of numbers?
I am very new to Python and trying out a few things. Thanks
Given two numbers a and b:
a = 123456
b = 456
First you need to convert them to strings:
a_str = str(a)
b_str = str(b)
Then you need to check if there is a continuous match of b_str in a_str:
if b_str in a_str:
    ...
Finally you can check the length of b_str:
len(b_str)
This is the complete function:
def count_matching_elements(a, b):
    a_str, b_str = str(a), str(b)
    if b_str in a_str:
        return len(b_str)
    else:
        return -1  # no matches
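As a quick check, here is a minimal sketch of the function above run on the two examples from the question, plus one extra input (123456 and 4567, my own assumption, not from the question) where only part of b matches:

```python
def count_matching_elements(a, b):
    """Return len(str(b)) if all digits of b occur contiguously in a, else -1."""
    a_str, b_str = str(a), str(b)
    return len(b_str) if b_str in a_str else -1


print(count_matching_elements(123456, 456))    # 3
print(count_matching_elements(556789, 55678))  # 5
print(count_matching_elements(123456, 4567))   # -1, although "456" is shared
```

So this approach only reports a match when the whole of b appears inside a; a partial overlap is reported as no match.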
What you want here is known as the longest common substring. You can find it like this (the code comes from the question "Find common substring between two strings"; the only difference is that you want len(answer) rather than the substring itself):
def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        match = ""
        for j in range(len2):
            if (i + j < len1 and string1[i + j] == string2[j]):
                match += string2[j]
            else:
                if (len(match) > len(answer)): answer = match
                match = ""
        # a match that runs to the end of string2 never reaches the else branch
        if (len(match) > len(answer)): answer = match
    return len(answer)
Note that a and b would have to be strings
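One caveat: the function above compares string1[i + j] with string2[j], so it can miss a common run that sits earlier in string1 than in string2 (for example when the two arguments are swapped). Below is a hedged sketch of the textbook dynamic-programming version, which does not depend on argument order. The names longest_common_substring and matching_digits are my own, not from the linked question. It also returns the matching digits and shows one way to handle two lists of numbers.

```python
def longest_common_substring(s1, s2):
    """Return the longest run of characters that occurs in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(s2) + 1)  # common-suffix lengths for the previous row
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                # extend the common run ending at s1[i-2] / s2[j-2]
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return s1[best_end - best_len:best_end]


def matching_digits(a, b):
    """Return (count, digits) for the longest continuous match of two numbers."""
    match = longest_common_substring(str(a), str(b))
    return len(match), match


print(matching_digits(123456, 456))    # (3, '456')
print(matching_digits(556789, 55678))  # (5, '55678')

# Two lists of numbers, compared pairwise (hypothetical sample data)
list_a = [123456, 556789]
list_b = [456, 55678]
print([matching_digits(a, b) for a, b in zip(list_a, list_b)])
```

The table costs O(len(s1) * len(s2)) time, which is negligible for numbers of ordinary length.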

Given two strings, how do I find number of reoccurences of one in another?

For example, s1='abc', s2='kokoabckokabckoab'.
Output should be 3. (number of times s1 appears in s2).
Not allowed to use for or strfind. Can only use reshape,repmat,size.
I thought of reshaping s2, so it would contain all of the possible strings of 3s:
s2 =
kok
oko
koa
oab
.... etc
But I'm having troubles from here..
Assuming you have reshaped s2 into the matrix shown in your post, you can replicate s1 and stack the copies so that there are as many rows as in the reshaped s2 matrix, then compare the two matrices element by element. A row consisting of all 1s indicates a match, so you simply look for the rows whose sum equals the length of s1. Referring back to my post on dividing up a string into overlapping substrings, we can decompose your string into the form you posted in your question like so:
%// Define s1 and s2 here
s1 = 'abc';
len = length(s1);
s2 = 'kokoabckokabckoab';
%// Hankel starts here
c = (1 : len).';
r = (len : length(s2)).';
nr = length(r);
nc = length(c);
x = [ c; r((2:nr)') ]; %-- build vector of user data
cidx = (1:nc)';
ridx = 0:(nr-1);
H = cidx(:,ones(nr,1)) + ridx(ones(nc,1),:); % Hankel subscripts
ind = x(H); % actual data
%// End Hankel script
%// Now get our data
subseqs = s2(ind.');
%// Case where string length is 1
if len == 1
subseqs = subseqs.';
end
subseqs contains the matrix of overlapping characters that you have alluded to in your post. You've noticed a small bug where if the length of the string is 1, then the algorithm won't work. You need to make sure that the reshaped substring matrix consists of a single column vector. If we ran the above code without checking the length of s1, we would get a row vector, and so simply transpose the result if this is the case.
Now, simply replicate s1 for as many times as we have rows in subseqs so that all of these strings get stacked into a 2D matrix. After, do an equality operator.
eqs = subseqs == repmat(s1, size(subseqs,1), 1);
Now, sum along each row (that is, across the columns) and see which sums are equal to the length of your string. This will produce a single column vector where 1 indicates that we have found a match, and zero otherwise:
sum(eqs, 2) == len
ans =
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
Finally, to add up how many times the substring matched, you just have to add up all elements in this vector:
out = sum(sum(eqs, 2) == len)
out =
2
As such, we have two instances where abc is found in your string.
Here is another one,
s1='abc';
s2='bkcokbacaabcsoabckokabckoabc';
[a,b] = ismember(s2,s1);
b = [0 0 b 0 0];
a1=circshift(b,[0 -1]);
a2=circshift(b,[0 -2]);
sum((b==1)&(a1==2)&(a2==3))
It gives 3 for your input and 4 for my example, and it seems to work well, provided that using ismember is acceptable.
Just for the fun of it: this can be done with nlfilter from the Image Processing Toolbox (I just discovered this function today and am eager to apply it!):
ds1 = double(s1);
ds2 = double(s2);
result = sum(nlfilter(ds2, [1 numel(ds1)], @(x) all(x==ds1)));

Matlab, order cells of strings according to the first one

I have two cell arrays of strings and I would like to reorder the second one so that it matches the first.
A = {'a';'b';'c'}
B = {'b';'a';'c'}
idx = [2,1,3] % TO FIND
B=B(idx);
I would like to find a way to find idx...
Use the second output of ismember. ismember tells you whether or not values in the first set are anywhere in the second set. The second output tells you where these values are located if we find anything. As such:
A = {'a';'b';'c'}
B = {'b';'a';'c'}
[~,idx] = ismember(A, B);
Note that there is a minor typo when you declared your cell arrays. You have a colon in between b and c for A and a and c for B. I placed a semi-colon there for both for correctness.
Therefore, we get:
idx =
2
1
3
Benchmarking
We have three very good algorithms here. As such, let's see how they perform with a benchmarking test. I generate a random character array of 10000 lower case letters and wrap it in a 10000-element cell array, where each cell holds a single character. I construct A this way, and B is a random permutation of the elements in A. This is the code that I wrote to do this for us:
letters = char(97 + (0:25));
rng(123); %// Set seed for reproducibility
ind = randi(26, [10000, 1]);
lettersMat = letters(ind);
A = mat2cell(lettersMat, 1, ones(10000,1)); %// lettersMat is 1 x 10000, so split along columns
B = A(randperm(10000));
Now... here comes the testing code:
clear all;
close all;
letters = char(97 + (0:25));
rng(123); %// Set seed for reproducibility
ind = randi(26, [10000, 1]);
lettersMat = letters(ind);
A = mat2cell(lettersMat, 1, ones(10000,1));
B = A(randperm(10000));
tic;
[~,idx] = ismember(A,B);
t = toc;
fprintf('ismember: %f\n', t);
clear idx; %// Make sure test is unbiased
tic;
[~,idx] = max(bsxfun(@eq,char(A),char(B)'));
t = toc;
fprintf('bsxfun: %f\n', t);
clear idx; %// Make sure test is unbiased
tic;
[~, indA] = sort(A);
[~, indB] = sort(B);
idx(indA) = indB; %// A(indA) and B(indB) are both sorted, so B(idx) equals A
t = toc;
fprintf('sort: %f\n', t);
This is what I get for timing:
ismember: 0.058947
bsxfun: 0.110809
sort: 0.006054
Luis Mendo's approach is the fastest, followed by ismember, and then finally bsxfun. For code compactness, ismember is preferred but for performance, sort is better. Personally, I think bsxfun should win because it's such a nice function to use ;).
This seems to be significantly faster than using ismember (although admittedly less clear than @rayryeng's answer). With thanks to @Divakar for his correction on this answer.
[~, indA] = sort(A);
[~, indB] = sort(B);
idx(indA) = indB; %// A(indA) and B(indB) are both sorted, so B(idx) equals A
I had to jump in, as it seems runtime performance could be a criterion here :)
Assuming that you are dealing with scalar strings (one character in each cell), here's my take. It works even when A and B have elements that are not common to both, and it uses the very powerful bsxfun, so I am really hoping it is runtime-efficient -
[v,idx] = max(bsxfun(@eq,char(A),char(B)'));
idx = v.*idx
Example -
A =
'a' 'b' 'c' 'd'
B =
'b' 'a' 'c' 'e'
idx =
2 1 3 0
For the specific case where every element is common to both A and B, it becomes a one-liner -
[~,idx] = max(bsxfun(@eq,char(A),char(B)'))
Example -
A =
'a' 'b' 'c'
B =
'b' 'a' 'c'
idx =
2 1 3

produce a table/array with string column and numbers

I have a cell structure of strings such as.
my_cell = 'apple.csv' 'banana.csv' 'orange.csv'
from reading in datasets
I have a vector of data.
my_number = [1 2 3]
I want to output a table/array that has the names in the first column and some numbers in the second.
my_output=['apple' 1; 'banana' 2; 'orange' 3]
I think this will only work if your vector is the same length as your cell array:
my_output=cell(length(my_cell),2)
for i=1:length(my_cell)
my_output(i,:)=[my_cell(i),my_number(i)];
end
In order to mix strings and numbers in the same array, your "table" needs to be a cell array:
my_cell = {'apple.csv','banana.csv','orange.csv'}; % data
my_number = [1 2 3]; % data
my_output = cell(length(my_cell),2); % initialize output cell array
[my_output{:,1}] = deal(my_cell{:}); % assign first column of cell array
my_number_cell = num2cell(my_number); % convert vector to cell
[my_output{:,2}] = deal(my_number_cell{:}); % assign second column of cell array
gives
>> disp(my_output)
'apple.csv' [1]
'banana.csv' [2]
'orange.csv' [3]
Other than a cell array, you can also use dataset(), which after some initial overhead is lighter than a cell array and also lets you access its fields with . (dot) syntax, i.e. 'struct' syntax:
% Example input
my_cell = repmat({'apple.csv'; 'banana.csv'; 'orange.csv'} ,1000,1);
my_number = repmat([1; 2; 3],1000,1);
% a is a cell array, b is a dataset
a = [my_cell(:), num2cell(my_number(:))]
b = dataset({my_cell(:), 'name'},{my_number(:),'number'})
Displayed variables:
a =
'apple.csv' [1]
'banana.csv' [2]
'orange.csv' [3]
b =
name number
'apple.csv' 1
'banana.csv' 2
'orange.csv' 3
Alternative ways to index a dataset():
b(:,1)
ans =
name
'apple.csv'
'banana.csv'
'orange.csv'
b.name
ans =
'apple.csv'
'banana.csv'
'orange.csv'
b(:,'number')
ans =
number
1
2
3

Equivalent of adding strings to a loop, for strings (matlab)?

How would I be able to do the equivalent of this with strings:
a = [1 2 3; 4 5 6];
c = [];
for i=1:5
b = a(1,:)+i;
c = [c;b];
end
c =
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
Basically looking to combine several strings into a Matrix.
You're growing a variable in a loop, which is a kind of sin in Matlab :) So I'm going to show you some better ways of doing array concatenation.
There's cell strings:
>> C = {
'In a cell string, it'
'doesn''t matter'
'if the strings'
'are not of equal length'};
>> C{2}
ans =
doesn't matter
Which you could use in a loop like so:
% NOTE: always pre-allocate everything before a loop
C = cell(5,1);
for ii = 1:5
% assign some random characters
C{ii} = char( '0'+round(rand(1+round(rand*10),1)*('z'-'0')) );
end
There's ordinary arrays, which have as a drawback that you have to know the size of all your strings beforehand:
a = [...
'testy' % works
'droop'
];
b = [...
'testing' % ERROR: CAT arguments dimensions
'if this works too' % are not consistent.
];
For these cases, use char:
>> b = char(...
'testing',...
'if this works too'...
);
b =
'testing '
'if this works too'
Note how char pads the first string with spaces to fit the length of the second string. Now again: don't use this in a loop, unless you've pre-allocated the array, or if there really is no other way to go.
Type help strfun on the Matlab command prompt to get an overview of all string-related functions available in Matlab.
You mean storing a string on each matrix position? You can't do that, since matrices are defined over basic types. You can have a CHAR on each position:
>> a = 'bla';
>> b = [a; a]
b <2x3 char> =
bla
bla
>> b(2,3) = 'e'
b =
bla
ble
If you want to store matrices, use a cell array (MATLAB reference, Blog of Loren Shure), which are kind of similar but using "{}" instead of "()":
>> c = {a; a}
c =
'bla'
'bla'
>> c{2}
ans =
bla

Resources