How to find common elements in string cells? - string

I want to find the common elements in multiple (>=2) cell arrays of strings.
A related question is here, and the answer proposes to use the function intersect(), however it works for only 2 inputs.
In my case, I have more than two cells, and I want to obtain a single common subset. Here is an example of what I want to achieve:
c1 = {'a','b','c','d'}
c2 = {'b','c','d'}
c3 = {'c','d'}
c_common = my_fun({c1,c2,c3});
In the end, I want c_common={'c','d'}, since only these two strings occur in all the inputs.
How can I do this with MATLAB?
Thanks in advance,
P.S. I also need the indices from each input, but I can probably do that myself using the output c_common, so not necessary in the answer. But if anyone wants to tackle that too, my actual output will be like this:
[c_common, indices] = my_fun({c1,c2,c3});
where indices = {[3,4], [2,3], [1,2]} for this case.
Thanks,

Listed in this post is a vectorized approach to give us the common strings and indices using unique and accumarray. This would work even when the strings are not sorted within each cell array to give us indices corresponding to their positions within it, but they have to be unique. Please have a look at the sample input, output section* to see such a case run. Here's the implementation -
C = {c1,c2,c3}; % Add more cell arrays here
% Get unique strings and ID each of the strings based on their uniqueness
[unqC,~,unqID] = unique([C{:}]);
% Get count of each ID and the IDs that have counts equal to the number of
% cells arrays in C indicate that they are present in all cell arrays and
% thus are the ones to be finally selected
match_ID = find(accumarray(unqID(:),1)==numel(C));
common_str = unqC(match_ID)
% ------------ Additional work to get indices ----------------
N_str = numel(common_str);
% Store matches as a logical array to be used at later stages
matches = ismember(unqID,match_ID);
% Use ismember to find all those indices in unqID and subtract group
% lengths from them to give us the indices within each cell array
clens = [0 cumsum(cellfun('length',C(1:end-1)))];
match_index = reshape(find(matches),N_str,[]);
% Sort match_index along each column based on the respective unqID elements
[m,n] = size(match_index);
[~,sidx] = sort(reshape(unqID(matches),N_str,[]),1);
sorted_match_index = match_index(bsxfun(@plus,sidx,(0:n-1)*m));
% Subtract cumulative group lens to give us indices corres. to each cell array
common_idx = bsxfun(@minus,sorted_match_index,clens).'
Please note that at the step that calculates match_ID, accumarray(unqID(:),1) could be replaced by histc(unqID,1:max(unqID)). histcounts would be another alternative there.
*Sample input, output -
c1 =
'a' 'b' 'c' 'd'
c2 =
'b' 'c' 'a' 'd'
c3 =
'c' 'd' 'a'
common_str =
'a' 'c' 'd'
common_idx =
1 3 4
3 2 4
3 1 2

As noted in the comments to this question, there is a file in File Exchange called "MINTERSECT -- Multiple set intersection." at http://www.mathworks.com/matlabcentral/fileexchange/6144-mintersect-multiple-set-intersection that contains simple code to generalize intersect to multiple sets. In a nutshell, the code takes the output of intersect on the first pair of cells and then performs intersect on this output with the next cell. This process continues until all cells have been compared. Note that the author points out that the code is not particularly efficient, but it may be sufficient for your use case.
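The pairwise idea described above can be sketched in a few lines. This is only an illustration of the approach, not the File Exchange code itself; the variable names common and indices are made up here, and the inputs are assumed to be cell arrays of strings with no repeated strings inside any one cell.

```matlab
% Sample inputs from the question
c1 = {'a','b','c','d'};
c2 = {'b','c','d'};
c3 = {'c','d'};
C  = {c1, c2, c3};            % add more cell arrays here

% Fold intersect over the inputs, two at a time
common = C{1};
for k = 2:numel(C)
    common = intersect(common, C{k});
end

% Position of each common string inside every input
% (second output of ismember gives the location in C{k})
indices = cell(size(C));
for k = 1:numel(C)
    [~, indices{k}] = ismember(common, C{k});
end
```

For the sample inputs this leaves the two shared strings in common and one index vector per input in indices.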

Related

Count number of occurrences of a string and relabel

I have a n x 1 cell that contains something like this:
chair
chair
chair
chair
table
table
table
table
bike
bike
bike
bike
pen
pen
pen
pen
chair
chair
chair
chair
table
table
etc.
I would like to rename these elements so they will reflect the number of occurrences up to that point. The output should look like this:
chair_1
chair_2
chair_3
chair_4
table_1
table_2
table_3
table_4
bike_1
bike_2
bike_3
bike_4
pen_1
pen_2
pen_3
pen_4
chair_5
chair_6
chair_7
chair_8
table_5
table_6
etc.
Please note that the underscore (_) is necessary. Could anyone help? Thank you.
Interesting problem! This is the procedure that I would try:
1. Use unique - the third output parameter in particular - to assign each string in your cell array to a unique ID.
2. Initialize an empty array, then create a for loop that goes through each unique string - given by the first output of unique - and creates a numerical sequence from 1 up to as many times as we have encountered this string. Place this numerical sequence in the corresponding positions where we have found each string.
3. Use strcat to attach each element in the array created in Step #2 to each cell array element in your problem.
Step #1
Assuming that your cell array is defined as a bunch of strings stored in A, we would call unique this way:
[names, ~, ids] = unique(A, 'stable');
The 'stable' is important as the IDs that get assigned to each unique string are done without re-ordering the elements in alphabetical order, which is important to get the job done. names will store the unique names found in your array A while ids would contain unique IDs for each string that is encountered. For your example, this is what names and ids would be:
names =
'chair'
'table'
'bike'
'pen'
ids =
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
1
1
1
1
2
2
names is actually not needed in this algorithm. However, I have shown it here so you can see how unique works. Also, ids is very useful because it assigns a unique ID for each string that is encountered. As such, chair gets assigned the ID 1, followed by table getting assigned the ID of 2, etc. These IDs will be important because we will use these IDs to find the exact locations of where each unique string is located so that we can assign those linear numerical ranges that you desire. These locations will get stored in an array computed in the next step.
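To see what the 'stable' flag changes, here is a small comparison on a made-up four-element input (the variable names with Sorted/Stable suffixes are only for this illustration):

```matlab
A = {'chair'; 'table'; 'bike'; 'chair'};

% Default: unique strings come back in sorted order
[namesSorted, ~, idsSorted] = unique(A);

% 'stable': unique strings keep their order of first appearance
[namesStable, ~, idsStable] = unique(A, 'stable');
```

With the default call, bike sorts first and gets ID 1; with 'stable', chair keeps ID 1 because it appears first in A.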
Step #2
Let's pre-allocate this array for efficiency. Let's call it loc. Then, your code would look something like this:
loc = zeros(numel(A), 1);
for idx = 1 : numel(names)
id = find(ids == idx);
loc(id) = 1 : numel(id);
end
As such, for each unique name we find, we look for every location in the ids array that matches this particular name found. find will help us find those locations in ids that match a particular name. Once we find these locations, we simply assign an increasing linear sequence from 1 up to as many names as we have found to these locations in loc. The output of loc in your example would be:
loc =
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
5
6
7
8
5
6
Notice that this corresponds with the numerical sequence (the right most part of each string) of your desired output.
Step #3
Now all we have to do is piece loc together with each string in our cell array. We would thus do it like so:
out = strcat(A, '_', num2str(loc));
What this does is that it takes each element in A, concatenates a _ character and then attaches the corresponding numbers to the end of each element in A. Because we want to output strings, you need to convert the numbers stored in loc into strings. To do this, you must use num2str to convert each number in loc into their corresponding string equivalents. Once you find these, you would concatenate each number in loc with each element in A (with the _ character of course). The output is stored in out, and we thus get:
out =
'chair_1'
'chair_2'
'chair_3'
'chair_4'
'table_1'
'table_2'
'table_3'
'table_4'
'bike_1'
'bike_2'
'bike_3'
'bike_4'
'pen_1'
'pen_2'
'pen_3'
'pen_4'
'chair_5'
'chair_6'
'chair_7'
'chair_8'
'table_5'
'table_6'
For your copying and pasting pleasure, this is the full code. Be advised that I've nulled out the first output of unique as we don't need it for your desired output:
[~, ~, ids] = unique(A, 'stable');
loc = zeros(numel(A), 1);
for idx = 1 : max(ids) % names is no longer returned, so count the unique IDs instead
id = find(ids == idx);
loc(id) = 1 : numel(id);
end
out = strcat(A, '_', num2str(loc));
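One thing to watch: num2str applied to a column vector returns a right-aligned character matrix, so once the counts have different numbers of digits (say 9 and 10) the shorter ones may pick up leading spaces. A possible variant, sketched here on a small made-up input, formats each count separately with sprintf instead:

```matlab
% Small sample input (hypothetical data, column cell array of strings)
A = {'chair'; 'chair'; 'table'; 'bike'; 'chair'; 'table'};

[~, ~, ids] = unique(A, 'stable');
loc = zeros(numel(A), 1);
for idx = 1 : max(ids)
    pos = find(ids == idx);       % where this string occurs
    loc(pos) = 1 : numel(pos);    % running count for this string
end

% Format every count on its own so no padding spaces are introduced
suffix = arrayfun(@(k) sprintf('%d', k), loc, 'UniformOutput', false);
out = strcat(A, '_', suffix);
```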
If you want an alternative to unique, you can work with a hash table, which in MATLAB means using the containers.Map object. You can then store the occurrences of each individual label and create the new labels on the go, as in the code below.
data={'table','table','chair','bike','bike','bike'};
map=containers.Map(data,zeros(numel(data),1)); % labels=keys, counts=values (zeroed)
new_data=data; % initialize cell array that will hold the outputs
for ii=1:numel(data)
map(data{ii}) = map(data{ii})+1; % increment counts of current labels
new_data{ii} = sprintf('%s_%d',data{ii},map(data{ii})); % format outputs
end
This is similar to rayryeng's answer but replaces the for loop by bsxfun. After the strings have been reduced to unique labels (line 1 of code below), bsxfun is applied to create a matrix of pairwise comparisons between all (possibly repeated) labels. Keeping only the lower "half" of that matrix and summing along rows gives how many times each label has previously appeared (line 2). Finally, this is appended to each original string (line 3).
Let your cell array of strings be denoted as c.
[~, ~, labels] = unique(c); %// transform each string into a unique label
s = sum(tril(bsxfun(@eq, labels, labels.')), 2); %'// accumulated occurrence number
result = strcat(c, '_', num2str(s)); %// build result
Alternatively, the second line could be replaced by the more memory-efficient
n = numel(labels);
M = cumsum(full(sparse(1:n, labels, 1)));
s = M((1:n).' + (labels-1)*n);
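As a quick sanity check, both ways of computing s can be run side by side on a tiny made-up input (s1 and s2 are names chosen for this illustration; c is assumed to be a column cell array of strings):

```matlab
c = {'a'; 'a'; 'b'; 'a'; 'b'};
[~, ~, labels] = unique(c);          % labels = [1;1;2;1;2]

% Variant 1: pairwise comparison, lower triangle, row sums
s1 = sum(tril(bsxfun(@eq, labels, labels.')), 2);

% Variant 2: cumulative sum of a one-hot matrix, then pick one entry per row
n  = numel(labels);
M  = cumsum(full(sparse(1:n, labels, 1)));
s2 = M((1:n).' + (labels-1)*n);
```

Both variants give the running occurrence count of each element, here 1, 2, 1, 3, 2.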
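I'll give you some pseudocode; try it yourself, and post the code if it doesn't work.
Initialize a counter to 1
Iterate over the cell
  If this is not the first element and the string is the same as the previous one
    increment the counter
  else
    reset the counter to 1
  end
  sprintf the string value + counter into a new array
end
Note that this restarts the count whenever the string changes, so it numbers consecutive runs only; to continue at chair_5 when a string reappears you need one counter per string.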
Hope this helps!

What is the most efficient format for storing strings from a for loop?

I have a script that runs through a series of strings and using regex pulls out certain strings (approx 4 output strings per input string).
e.g. HelloStackOverflowWorld
-> Hello; Stack; Overflow; World;
The final output would ideally be a table where I can filter based upon the strings in the columns. Using the case above, column 1 row 1 would have 'Hello', column 2 row 1 would have 'Stack' and so on.
The problem is, the size of the output will change depending on the input so I am unsure of what output format to use.
At the moment I used something similar to this:
if strfind(missing{ii},'hello')
miss.exch = [miss.exch;'hello'];
temp.exc = regexp(missing{ii},'(?<=\d[Q|T])(\w*?)(?=[q])','match');
miss.exc = [miss.exc;temp.exc];
temp.TQ= regexp(missing{ii},'(Qc|Tc)','match');
if strcmp(temp.TQ{1,1}, 'Tc')
miss.TQ = [miss.TQ;'variableA'];
elseif temp.TQ{1,1} == 'Qc'
miss.TQ = [miss.TQ;'variableB'];
end
else if .........
end
Which obviously results in a 1x1 struct consisting of a number of fields each with many cells. This makes filtering on strings an issue!
How can I define and add data into a 'table of strings' that I can then filter?
I think you are just looking for a cell array. Here is a simple example of what they can do:
C = {'Abc','Bcd';'Cde',[]}
strcmp(C,'Cde')
Results in:
ans =
0 0
1 0
Make sure to check doc cell to see how you can access them.
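To make the filtering part concrete, here is a minimal sketch assuming the extracted strings are stored one input per row, with shorter rows padded by empty strings. The contents of rows and the names keep and filtered are invented for this example.

```matlab
% One row per input string, one column per extracted piece;
% rows with fewer pieces are padded with '' (an assumption of this sketch)
rows = {'Hello', 'Stack', 'Overflow', 'World';
        'Hello', 'Code',  'Review',   '';
        'Good',  'Stack', 'Trace',    ''};

% Logical mask: rows whose second column equals 'Stack'
keep = strcmp(rows(:, 2), 'Stack');

% Keep only the matching rows
filtered = rows(keep, :);
```

New rows can be appended with rows(end+1, :) = {...}, as long as each new row has the same number of columns.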

Counting the occurrence of substrings in matlab

I have a cell, something like this P= {Face1 Face6 Scene6 Both9 Face9 Scene11 Both12 Face15}. I would like to count how many Face values, Scene values, Both values in P. I don't care about the numeric values after the string (i.e., Face1 and Face23 would be counted as two). I've tried the following (for the Face) but I got the error "If any of the input arguments are cell arrays, the first must be a cell array of strings and the second must be a character array".
strToSearch='Face';
numel(strfind(P,strToSearch));
Does anyone have any suggestion? Thank you!
Use regexp to find strings that start (^) with the desired text (such as 'Face'). The result will be a cell array, where each cell contains 1 if there is a match, or [] otherwise. So determine if each cell is nonempty (~cellfun('isempty', ...): will give a logical 1 for nonempty cells, and 0 for empty cells), and sum the results (sum):
>> P = {'Face1' 'Face6' 'Scene6' 'Both9' 'Face9' 'Scene11' 'Both12' 'Face15'};
>> sum(~cellfun('isempty', regexp(P, '^Face')))
ans =
4
>> sum(~cellfun('isempty', regexp(P, '^Scene')))
ans =
2
Your example should work with some small tweaks, provided all of P contains strings, but may give the error you get if there are any non-string values in the cell array.
P= {'Face1' 'Face6' 'Scene6' 'Both9' 'Face9' 'Scene11' 'Both12' 'Face15'};
strToSearch='Face';
n = strfind(P,strToSearch);
numel([n{:}])
(returns 4)
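If you want the counts for every prefix at once rather than one search per prefix, one possible sketch is to strip the trailing digits and count the remaining labels. This assumes every entry has the form of a name followed by digits; prefixes, names and counts are names made up for this example.

```matlab
P = {'Face1' 'Face6' 'Scene6' 'Both9' 'Face9' 'Scene11' 'Both12' 'Face15'};

% Remove the numeric suffix from every entry
prefixes = regexprep(P, '\d+$', '');

% Map each prefix to an ID and count how often each ID occurs
[names, ~, idx] = unique(prefixes);
counts = accumarray(idx(:), 1);
```

names then holds the distinct prefixes in sorted order, and counts(k) is the number of entries that start with names{k}.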

Union of cell array of cells

I'm looking for the way to do the union of two cell arrays of cell arrays of strings. For example:
A = {{'one' 'two'};{'three' 'four'};{'five' 'six'}};
B = {{'five' 'six'};{'seven' 'eight'};{'nine' 'ten'}};
And I'd like to get something like:
C = {{'one' 'two'};{'three' 'four'};{'five' 'six'};{'seven' 'eight'};{'nine' 'ten'}};
But when I use C = union(A, B) MATLAB returns an error saying:
Input A of class cell and input B of class cell must be cell arrays of strings, unless one is a string.
Does anyone know how to do something like this in a hopefully simple way? I'd greatly appreciate it.
ALTERNATIVE: A way to have a cell array of separated strings in any other way than a cell array of cell array of strings would be also useful, but as far as I know, it's not possible.
Thank you!
C=[A;B]
allWords=unique([A{:};B{:}])
F=cell2mat(cellfun(@(x)(ismember(allWords,x{1})+2*ismember(allWords,x{2}))',C,'uni',false))
[~,uniqueindices,~]=unique(F,'rows')
C(sort(uniqueindices))
What my code does: it builds up a list of all words, allWords, and then uses this list to build a matrix that records, for each row, which words it contains. 1=match for the first word, 2=match for the second word. Finally, unique can be applied to this numeric matrix to get the indices.
With my update, the limit of 2 words per cell is hardcoded. To get rid of this limitation it would be necessary to replace the anonymous function (@(x)(ismember(allWords,x{1})+2*ismember(allWords,x{2}))) with a more generic implementation, probably using cellfun again.
union doesn't seem to be compatible with cell arrays of cells. So, we need to look for some workaround.
One approach would be to concatenate the data from A and B vertically. Then, along each column, assign each string a unique ID. Those IDs can then be combined into a double array, which opens up the possibility of using unique with the 'rows' option to get us the desired output. This is precisely what is done here.
%// Slightly complicated input for safest verification of results
A = {{'three' 'four'};
{'five' 'six'};
{'five' 'seven'};
{'one' 'two'}};
B = {{'seven' 'eight'};
{'five' 'six'};
{'nine' 'ten'};
{'three' 'six'};};
t1 = [A ; B] %// concatenate all cells from A and B vertically
t2 = vertcat(t1{:}) %// Get all the cells of strings from A and B
t22 = mat2cell(t2,size(t2,1),ones(1,size(t2,2)));
[~,~,row_ind] = cellfun(@(x) unique(x,'stable'),t22,'uni',0)
mat1 = horzcat(row_ind{:})
[~,ind] = unique(mat1,'rows','stable')
out1 = t2(ind,:) %// output as a cell array of strings, used for verification too
out = mat2cell(out1, ones(1,size(out1,1)),size(out1,2)) %//desired output
Output -
out1 =
'three' 'four'
'five' 'six'
'five' 'seven'
'one' 'two'
'seven' 'eight'
'nine' 'ten'
'three' 'six'
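Another possible workaround, sketched here under the assumption that some delimiter character (the '|' below) never occurs inside the strings, is to join each inner cell into a single key string and run unique on the keys. The names AB and keys are made up for this example; inner cells of any length are handled the same way.

```matlab
A = {{'one' 'two'};{'three' 'four'};{'five' 'six'}};
B = {{'five' 'six'};{'seven' 'eight'};{'nine' 'ten'}};

AB = [A; B];                                   % stack both inputs

% One key per inner cell, e.g. 'five|six' (assumes '|' is not used in the data)
keys = cellfun(@(x) strjoin(x, '|'), AB, 'UniformOutput', false);

% First occurrence of every key, in original order
[~, ind] = unique(keys, 'stable');
C = AB(ind);
```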

Finding but not removing duplicates in Matlab

I have a big array in Matlab like this:
A =
{1x5 cell}
{1x7 cell}
{1x27 cell}
{1x11 cell}
...
where the cells look like this:
C{1}
ans =
'apple' 'banana' 'kiwi' 'orange'
I want to find which cells in A contain duplicated information, like:
C{27}
ans =
'turtle' 'kiwi' 'fox' 'badger'
I.e. here I want to see if C(1) and C(27) has a duplicate word 'kiwi'.
So I can manually look at them and decide where I should remove the duplicate where I see fit.
Sorry, I'm not going to provide a coded solution, more the process I'd use so that you can start coding. If you then have any specific problems, feel free to post a question.
I would use nchoosek to generate an array of all the pairwise combinations of the cells in C, so
nCells = length(C);
nPerms = nchoosek(1:nCells,2);
You can then loop over all the pairs, using intersect to see if there are common strings.
result{i} = intersect(C{nPerms(i,1)},C{nPerms(i,2)});
This will give you a cell array listing the common strings for each pair, and with the nPerms array you'll have the two cells that share them. Note that the result has to be stored in a cell (result{i} rather than result(i)), because the number of common strings differs from pair to pair. intersect itself does not need the two cell arrays to have the same number of elements.
If you also want every element of C to have the same length, for example to lay the contents out as a rectangular table for manual inspection, you can create a temporary cell array padded out with blank cells. Do not pass the padded version to intersect, which expects cell arrays of strings.
This will calculate the longest cell in the array C by calculating the number of elements (@numel) in each cell, followed by calculating the maximum.
cSize = cellfun(@numel,C);
maxSize = max(cSize);
We can then define a function to pad out blank cells
fcn = @(x) [x cell(1,maxSize - numel(x))];
paddedC = cellfun(fcn,C,'UniformOutput',false);
This should give you a cell array with the same number of elements in each cell.
No doubt someone will turn up with a one line cellfun solution but I hope that this is enough to get you started.
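For reference, the pairwise loop described above could look roughly like this on a small made-up C. The names pairs, hasDup and dupPairs are invented for this sketch, and each element of C is assumed to be a cell array of strings.

```matlab
% Small sample data (hypothetical)
C = {{'apple','banana','kiwi','orange'}, ...
     {'dog','cat'}, ...
     {'turtle','kiwi','fox','badger'}};

% All pairs of cells: each row holds two indices into C
pairs = nchoosek(1:numel(C), 2);

% Common strings for every pair (empty cell if there are none)
result = cell(size(pairs, 1), 1);
for i = 1:size(pairs, 1)
    result{i} = intersect(C{pairs(i,1)}, C{pairs(i,2)});
end

% Keep only the pairs that actually share something
hasDup   = ~cellfun('isempty', result);
dupPairs = pairs(hasDup, :);
```

Each row of dupPairs names two cells to inspect by hand, and the matching entry of result(hasDup) lists the strings they share.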
