Removing 'NaN' strings and [] cells from cell array in Matlab

Removing 'NaN' strings and [] cells from cell array in Matlab - string

I have a cell array, given as
raw = {100 3.2 38 1;
100 3.7 38 1;
100 'NaN' 'NaN' 1;
100 3.8 38 [];
'NaN' 'NaN' 'NaN' 'NaN';
'NaN' 'NaN' 'NaN' [];
100 3.8 38 1};
How can I remove the rows which have at least one 'NaN' string and empty cell []? Thus, in this case, I want to remove 3rd, 4th, 5th and 6th row from the above-mentioned cell array. Thanks in advance!

In your cellarray the values NaN are defined as string and not as the "special" value NaN
In this case, you can use the functions isempty and isfloat to identify which elements of the cellarray are either empty or of type float:
% Remove rows with empty cells
idx=any(cell2mat(cellfun(#isempty,raw,'UniformOutput',false)),2)
raw(idx,:)=[]
% Remove rows with 'NaN'
idx=all(cell2mat(cellfun(#isfloat,raw,'UniformOutput',false)),2)
raw(~idx,:)=[]
In the first step you look for the empty cells using the function isempty, since the input is a cellarray you have to use cellfun to apply the functino to all the elements of the cell array.
isempty returns a cellarray of 0 and 1 where 1 identifies an empty cell, so, after having converted it into an array (with the functino cell2mat) you can identify the indices of the roww with an empty cell using the function any.
IN the second step, with a similar approach, you can identify the rows containing floating values with the function `isfloat.
The same approach can be used in case the NaN in your cellarray are defined as "values" and not as strings:
idx=any(cell2mat(cellfun(#isempty,raw,'UniformOutput',false)),2)
raw(idx,:)=[]
idx=any(cell2mat(cellfun(#isnan,raw,'UniformOutput',false)),2)
raw(idx,:)=[]

To find which row has 'NaN's run:
idxNan = any(cellfun(#(x) isequal(x,'NaN'),raw),2);
Similarly, to find which rows have empty cells run:
idxEmpty = any(cellfun(#(x) isempty(x),raw),2);
Then you can ommit rows you don't want using 'or'
raw(idxNan | idxEmpty,:) = [];
replace | with & if that what you meant

Related

Textjoin values of column B if duplicates are present in column A

I want to consolidate the data of column B into a single cell ONLY IF the index (ie., Column A) is duplicated.
For example:
Currently, I'm doing manually for each duplicated index by using the following formula:
=TEXTJOIN(", ",TRUE,B4:B6)
Is there a better way to do this all at once?
Any help is appreciated.

There may easier way but you can try this formula-
=BYROW(A2:A17,LAMBDA(p,IF(INDEX(MAP(A2:A17,LAMBDA(x,SUM(--(A2:INDEX(A2:A17,ROW(x)-1)=x)))),ROW(p)-1,1)=1,TEXTJOIN(", ",1,FILTER(B2:B17,A2:A17=p)),"")))

Using REDUCE might be possible for a more succinct solution, though try this for now:
=BYROW(A2:A17,LAMBDA(ζ,LET(α,A2:A17,IF((COUNTIF(α,ζ)>1)*(COUNTIF(INDEX(α,1):ζ,ζ)=1),TEXTJOIN(", ",,FILTER(B2:B17,α=ζ)),""))))

For the sake of alternatives about how to solve it:
Using XMATCH/UNIQUE
=LET(A, A2:A17, ux, UNIQUE(A),idx, FILTER(XMATCH(ux, A), COUNTIF(A, ux)>1),
MAP(SEQUENCE(ROWS(A)), LAMBDA(s, IF(ISNA(XMATCH(s, idx)), "", TEXTJOIN(",",,
FILTER(B2:B17, A=INDEX(A,s)))))))
or using SMALL/INDEX to identify the first element of the repetition:
=LET(A, A2:A17, n, ROWS(A), s, SEQUENCE(n),
MAP(A, s, LAMBDA(aa,ss, LET(f, FILTER(B2:B17, A=aa), IF((ROWS(f)>1)
* (INDEX(s, SMALL(IF(A=aa, s, n+1),1))=ss), TEXTJOIN(",",, f), "")))))
Here is the output:
Explanation
XMATCH and UNIQUE
The main idea here is to identify the first unique elements of column A via ux, and find their corresponding index position in A via XMATCH(ux, A). It is an array of the same size as ux. Then COUNTIF(A, ux)>1) returns an array of the same size as XMATCH output indicating where we have a repetition.
Here is the intermediate result:
XMATCH(ux, A) COUNTIF(A, ux)>1)
1 FALSE
2 FALSE
3 TRUE
6 FALSE
7 TRUE
9 TRUE
11 FALSE
12 TRUE
15 FALSE
16 FALSE
so FILTER takes only the rows form the first column where the second column is TRUE, i.e the index position (idx) where the repetition starts. For our sample it will be: {3;7;9;12}.
Now we iterate over the sequence of index positions (s) via MAP . If s is found in idx via XMATCH (also XLOOKUP(s, idx, TRUE, FALSE) can be used for the same purpose) then we join the values of column B filtered by column A equal to INDEX(A,s).
SMALL and INDEX
This is a more flexible approach because in the case we want to do the concatenation in another position of the repetition you just need to specify the order and the formula doesn't change.
We iterate via MAP through elements of column A and index position (s). The name f has the filtered values from column B where column A is equal to a given value of the iteration aa. We need to identify only filtered rows with repetition, so the first condition ROWS(f) > 1 ensures it.
The second condition identifies only the first element of the repetition:
INDEX(s, SMALL(IF(A=aa, s, n+1),1))=ss
The second argument of SMALL indicates we want the first smallest value, but it could be the second, third, etc.
Where A is equal to aa, IF assigns the corresponding value of the sequence (remember IF works as an array formula), if not then it assigns a value that will never be the smallest one, for example, n+1, where n represents the number of rows of column B. SMALL returns the smallest index position. If the current index position ss is not the smallest one, the conditions FALSE.
Finally, we do a TEXTJOIN only when both conditions are met (we multiply them to ensure an AND condition).

How to find out Max<List<List>> in Excel?

I'd like to find out largest sum of numbers separated by empty row. In this example I am looking to get number 6 (3+3)
1
1
2
2
3
3
Brute forcing this I would =MAX(SUM(A1:A2),SUM(A4:A5),SUM(A7:A8)) which does the job but obviously not practical. How can I express above more elegantly without hardcoding anything?
Thinking out loud, I would like to
Ingest all numbers, split by empty row into some kind of List<List>
Iterate over this list, sum numbers in child list and pick a winner
How can this be done in Excel?

There are multiple ways of doing it, this is just one of them. In cell C1 you can put the following formula:
=LET(set, A1:A9, matrix, 1*TEXTSPLIT(SUBSTITUTE(TEXTJOIN(",",
FALSE, set),",,",";"),",",";", TRUE),m, COLUMNS(matrix), ones, SEQUENCE(m,1,,0),
MAX(MMULT(matrix, ones))
)
and here is the output:
Note: The third input argument of TEXTSPLIT set to TRUE ensures you can have more than one empty row in the middle of the range. The second input argument of TEXTJOIN set to FALSE is required to ensure to generate of more than one comma (,), which is our condition to replace by the row delimiter (;) so we can split by row and columns. MMULT requires numbers and TEXTSPLIT converts the information into texts. we need to coerce the result into a number by multiplying it by 1.
The formula follows the approach you suggested, you can test the intermediate step. Instead of having as output MAX result the variable you want to verify, for example:
=LET(set, A1:A9, matrix, 1*TEXTSPLIT(SUBSTITUTE(TEXTJOIN(",",
FALSE, set),",,",";"),",",";", TRUE),m, COLUMNS(matrix), ones, SEQUENCE(m,1,,0),
TMP, MAX(MMULT(matrix, ones)), matrix
)
will produce the following output:
1 1
2 2
3 3
An alternative to MULT is to use BYROW array function (less verbose):
=LET(set, A1:A8, matrix, 1*TEXTSPLIT(SUBSTITUTE(TEXTJOIN(",",
FALSE, set),",,",";"),",",";", TRUE),MAX(BYROW(matrix, LAMBDA(m, SUM(m))))
)

Can't drop na with pandas read excel file in Python

I try to remove all NaN rows from a dataframe which I get by pd.read_excel("test.xlsx", sheet_name = "Sheet1"), I have tried with df = df.dropna(how='all') and df.dropna(how='all', inplace=True), both cannot remove the last empty rows which I printed as follows: df.tail(1).
a b c
3463 NaN NaN
I noticed the value in column c is not null but empty. Someone could help to deal with this issue? Thank you.

Maybe you want replace empty values to missing before:
df = df.replace(r'^\s+$', np.nan, regex=True).dropna(how='all')
Regex ^\s+$ means:
^ is start of string
\s+ is one or more whitespaces
$ means end of string

Here NaN is also value and empty will also be treated as a part of row.
In case of NaN, you must drop or replace with something:
dropna()
If you use this function then whenever python finds NaN in a row, it will return True and will remove whole row, doesn't matter if any value is there or not besides NaN.
fillna() to fill some values instead of NaN
In your case :
df['C'].fillna(values="Any value")
Note: It is important to specify columns in which you want to fill values otherwise it will update whole dataframe respective to NaN
Now if there is empty row then try this :
df[df['C']==" "]="Anyvalue"
I have not tried this but my assumption on above is:
Lets break down:
a. df['C']==""
This will return boolean values
b. df[df['C']==""]="Anyvalue"
wherever python finds True, value "Anyvalue" will get applied.

Counting the occurence of substrings in matlab

I have a cell, something like this P= {Face1 Face6 Scene6 Both9 Face9 Scene11 Both12 Face15}. I would like to count how many Face values, Scene values, Both values in P. I don't care about the numeric values after the string (i.e., Face1 and Face23 would be counted as two). I've tried the following (for the Face) but I got the error "If any of the input arguments are cell arrays, the first must be a cell array of strings and the second must be a character array".
strToSearch='Face';
numel(strfind(P,strToSearch));
Does anyone have any suggestion? Thank you!

Use regexp to find strings that start (^) with the desired text (such as 'Face'). The result will be a cell array, where each cell contains 1 if there is a match, or [] otherwise. So determine if each cell is nonempty (~cellfun('isempty', ...): will give a logical 1 for nonempty cells, and 0 for empty cells), and sum the results (sum):
>> P = {'Face1' 'Face6' 'Scene6' 'Both9' 'Face9' 'Scene11' 'Both12' 'Face15'};
>> sum(~cellfun('isempty', regexp(P, '^Face')))
ans =
4
>> sum(~cellfun('isempty', regexp(P, '^Scene')))
ans =
2

Your example should work with some small tweaks, provided all of P contains strings, but may give the error you get if there are any non-string values in the cell array.
P= {'Face1' 'Face6' 'Scene6' 'Both9' 'Face9' 'Scene11' 'Both12' 'Face15'};
strToSearch='Face';
n = strfind(P,strToSearch);
numel([n{:}])
(returns 4)

Union of cell array of cells

I'm looking for the way to do the union of two cell arrays of cell arrays of strings. For example:
A = {{'one' 'two'};{'three' 'four'};{'five' 'six'}};
B = {{'five' 'six'};{'seven' 'eight'};{'nine' 'ten'}};
And I'd like to get something like:
C = {{'one' 'two'};{'three' 'four'};{'five' 'six'};{'seven' 'eight'};{'nine' 'ten'}};
But when I use C = union(A, B) MATLAB returns an error saying:
Input A of class cell and input B of class cell must be cell arrays of strings, unless one is a string.
Does anyone know how to do something like this in a hopefully simple way? I'd greatly appreciate it.
ALTERNATIVE: A way to have a cell array of separated strings in any other way than a cell array of cell array of strings would be also useful, but as far as I know, it's not possible.
Thank you!

C=[A;B]
allWords=unique([A{:};B{:}])
F=cell2mat(cellfun(#(x)(ismember(allWords,x{1})+2*ismember(allWords,x{2}))',C,'uni',false))
[~,uniqueindices,~]=unique(F,'rows')
C(sort(uniqueindices))
What my code does: it builds up a list of all words allwords, then this list is used to build up a matrix which contains the correlation between the rows and which word they contain. 1=Match for first wird, 2=Match for second word. Finally, on this numeric matrix unique can be applied to get the indices.
Including my update, now the 2 words per cell is hardcoded. To get rid of this limitation it would be neseccary to replace the anonymous function (#(x)(ismember(allWords,x{1})+2*ismember(allWords,x{2}))) with a more generic implementation. Probably using cellfun again.

Union doesn't seem like compatible for cell arrays of cells. So, we need to look for some workaround.
One approach would be to get the data from A and B concatenated vertically. Then, along each column assign each cell of strings an unique ID. Those IDs can then be combined into a double array that opens up the possibility of of using unique with 'rows' option to get us the desired output. This is precisely achieved here.
%// Slightly complicated input for safest verification of results
A = {{'three' 'four'};
{'five' 'six'};
{'five' 'seven'};
{'one' 'two'}};
B = {{'seven' 'eight'};
{'five' 'six'};
{'nine' 'ten'};
{'three' 'six'};};
t1 = [A ; B] %// concatenate all cells from A and B vertically
t2 = vertcat(t1{:}) %// Get all the cells of strings from A and B
t22 = mat2cell(t2,size(t2,1),ones(1,size(t2,2)));
[~,~,row_ind] = cellfun(#(x) unique(x,'stable'),t22,'uni',0)
mat1 = horzcat(row_ind{:})
[~,ind] = unique(mat1,'rows','stable')
out1 = t2(ind,:) %// output as a cell array of strings, used for verification too
out = mat2cell(out1, ones(1,size(out1,1)),size(out1,2)) %//desired output
Output -
out1 =
'three' 'four'
'five' 'six'
'five' 'seven'
'one' 'two'
'seven' 'eight'
'nine' 'ten'
'three' 'six'

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Removing 'NaN' strings and [] cells from cell array in Matlab - string

Related

Textjoin values of column B if duplicates are present in column A

How to find out Max<List<List>> in Excel?

Can't drop na with pandas read excel file in Python

Counting the occurence of substrings in matlab

Union of cell array of cells

Categories

Resources