Can't drop na with pandas read excel file in Python - python-3.x

I am trying to remove all NaN rows from a dataframe that I read with pd.read_excel("test.xlsx", sheet_name="Sheet1"). I have tried both df = df.dropna(how='all') and df.dropna(how='all', inplace=True), but neither removes the last empty row, which I printed with df.tail(1):
        a    b    c
3463  NaN  NaN
I noticed the value in column c is not null but empty. Could someone help me deal with this issue? Thank you.

Maybe you want to replace empty (whitespace-only) values with missing values first:
df = df.replace(r'^\s+$', np.nan, regex=True).dropna(how='all')
The regex ^\s+$ means:
^ is the start of the string
\s+ is one or more whitespace characters
$ is the end of the string
If the cells are truly empty strings ('') rather than whitespace, use r'^\s*$' instead, since \s+ requires at least one character.
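A minimal sketch of that approach on a made-up DataFrame (the column names follow the question, the values are illustrative):
import numpy as np
import pandas as pd

# Illustrative data: the last row mixes NaN with an empty string,
# so dropna(how='all') alone keeps it.
df = pd.DataFrame({"a": [1.0, np.nan], "b": [2.0, np.nan], "c": ["x", ""]})

# Turn empty or whitespace-only strings into NaN, then drop all-NaN rows.
df = df.replace(r'^\s*$', np.nan, regex=True).dropna(how='all')
print(df)  # only the first row remains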

Here NaN is itself a value, and an empty cell is also treated as part of the row.
In the case of NaN, you must either drop it or replace it with something:
dropna()
With the default arguments, whenever pandas finds a NaN in a row it removes the whole row, regardless of what other values that row contains; with how='all' it only drops rows where every value is NaN.
fillna() to fill in a value instead of NaN
In your case:
df['c'] = df['c'].fillna(value="Any value")
Note: it is important to specify the columns in which you want to fill values, otherwise the whole dataframe will be updated wherever there is a NaN.
Now, if the cell is an empty string, try this:
df[df['c'] == ""] = "Anyvalue"
I have not tried this, but my reasoning is as follows. Let's break it down:
a. df['c'] == ""
This returns a boolean Series, True wherever column c holds an empty string.
b. df[df['c'] == ""] = "Anyvalue"
Wherever the mask is True, the value "Anyvalue" is assigned.
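A short sketch of both steps on made-up data (the fill values are placeholders). Using .loc with the column label keeps the assignment limited to column c, whereas df[mask] = ... would overwrite every column in the matching rows:
import numpy as np
import pandas as pd

# Illustrative frame: one NaN and one empty string in column 'c'
df = pd.DataFrame({"a": [1, 2, 3], "c": ["x", np.nan, ""]})

# Replace NaN in a single column only
df['c'] = df['c'].fillna(value="Any value")

# Replace empty strings, targeting just column 'c'
df.loc[df['c'] == "", 'c'] = "Anyvalue"
print(df)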

Related

How to replace text in column by the value contained in the columns named in this text

In pyspark, I'm trying to replace, in a column, multiple text values by the values present in the columns whose names appear in the calc column (a formula).
To be clear, here is an example:
Input:
|param_1|param_2|calc
|-------|-------|--------
|Cell 1 |Cell 2 |param_1-param_2
|Cell 3 |Cell 4 |param_2/param_1
Output needed:
|param_1|param_2|calc
|-------|-------|--------
|Cell 1 |Cell 2 |Cell 1-Cell 2
|Cell 3 |Cell 4 |Cell 4/Cell 3
In the calc column, the value is a formula. It can be as simple as the ones above, or something like "2*(param_8-param_4)/param_2-(param_3/param_7)".
What I'm looking for is a way to substitute every param_x with the value of the column of the same name.
I've tried a lot of things, but nothing works; most of the time, when I use replace or regexp_replace with a column as the replacement value, I get a "Column is not iterable" error.
Moreover, the columns param_1, param_2, ..., param_x are generated dynamically, and the calc values may reference some of these columns but not necessarily all of them.
Could you help me with a dynamic solution?
Thank you so much.
Best regards
Update: Turned out I misunderstood the requirement. This would work:
for exp in ["regexp_replace(calc, '" + col + "', " + col + ")" for col in df.schema.names]:
    df = df.withColumn("calc", F.expr(exp))
Yet another update: to handle null values, add coalesce:
for exp in ["coalesce(regexp_replace(calc, '" + col + "', " + col + "), calc)" for col in df.schema.names]:
    df = df.withColumn("calc", F.expr(exp))
Input/Output: (screenshot omitted)
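A minimal, self-contained sketch of the coalesce version, assuming an active SparkSession named spark (the sample rows are illustrative):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data: calc references the other columns by name
df = spark.createDataFrame(
    [("1", "2", "param_1 - param_2"), ("3", "4", "2*param_1 + param_2")],
    ["param_1", "param_2", "calc"],
)

# For each column, rewrite occurrences of its name inside calc with its value;
# coalesce keeps the original calc if the column value is null
for exp in ["coalesce(regexp_replace(calc, '" + col + "', " + col + "), calc)"
            for col in df.schema.names]:
    df = df.withColumn("calc", F.expr(exp))

df.show()  # calc becomes "1 - 2" and "2*3 + 4" respectively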
------- Keeping the below section for a while just for reference -------
You can't do that directly, as you won't be able to use a column's value directly unless you collect it into a Python object (which is generally not recommended).
The following would work for that:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([["1", "2", "param_1 - param_2"], ["3", "4", "2*param_1 + param_2"]]).toDF("param_1", "param_2", "calc")
df.show()
df = df.withColumn("row_num", F.row_number().over(Window.orderBy(F.lit("dummy"))))
as_dict = {row.asDict()["row_num"]: row.asDict()["calc"] for row in df.select("row_num", "calc").collect()}
expression = f"""CASE {' '.join([f"WHEN row_num = '{k}' THEN ({v})" for k, v in as_dict.items()])} ELSE NULL END"""
df.withColumn("Result", F.expr(expression)).show()
Input/Output: (screenshot omitted)

Removing 'NaN' strings and [] cells from cell array in Matlab

I have a cell array, given as
raw = {100 3.2 38 1;
100 3.7 38 1;
100 'NaN' 'NaN' 1;
100 3.8 38 [];
'NaN' 'NaN' 'NaN' 'NaN';
'NaN' 'NaN' 'NaN' [];
100 3.8 38 1};
How can I remove the rows that contain at least one 'NaN' string or an empty cell []? In this case, I want to remove the 3rd, 4th, 5th and 6th rows from the above cell array. Thanks in advance!
In your cell array the NaN values are defined as strings, not as the "special" value NaN.
In this case, you can use the functions isempty and isfloat to identify which elements of the cell array are empty or of type float:
% Remove rows with empty cells
idx = any(cell2mat(cellfun(@isempty, raw, 'UniformOutput', false)), 2)
raw(idx,:) = []
% Remove rows with 'NaN'
idx = all(cell2mat(cellfun(@isfloat, raw, 'UniformOutput', false)), 2)
raw(~idx,:) = []
In the first step you look for the empty cells using the function isempty; since the input is a cell array, you have to use cellfun to apply the function to all the elements of the cell array.
isempty returns a cell array of 0s and 1s where 1 identifies an empty cell, so, after converting it into an array (with the function cell2mat), you can identify the indices of the rows with an empty cell using the function any.
In the second step, with a similar approach, you can identify the rows containing only floating-point values with the function isfloat.
The same approach can be used if the NaN in your cell array are defined as "values" rather than as strings:
idx = any(cell2mat(cellfun(@isempty, raw, 'UniformOutput', false)), 2)
raw(idx,:) = []
idx = any(cell2mat(cellfun(@isnan, raw, 'UniformOutput', false)), 2)
raw(idx,:) = []
To find which rows have 'NaN' strings, run:
idxNan = any(cellfun(@(x) isequal(x,'NaN'), raw), 2);
Similarly, to find which rows have empty cells, run:
idxEmpty = any(cellfun(@(x) isempty(x), raw), 2);
Then you can omit the rows you don't want using 'or':
raw(idxNan | idxEmpty,:) = [];
Replace | with & if that is what you meant.

writetable replace NaN with blanks in Matlab

Given a MATLAB table that contains many NaN values, how can I write this table to an Excel or csv file with the NaN replaced by blanks?
I use the following function:
T = table(NaN(5,2),'VariableNames',{'A','C'})
writetable(T, filename)
I do not want to replace them with zeros. I want the output file to:
have blanks for NaN, and
include the variable names.
You just need xlswrite for that; it replaces NaN with blanks itself. Use table2cell, or the combination of table2array and num2cell, to convert your table to a cell array first. Use the VariableNames property of the table to retrieve the variable names and prepend them to the cell array.
data= [T.Properties.VariableNames; table2cell(T)];
%or data= [T.Properties.VariableNames; num2cell(table2array(T))];
xlswrite('output',data);
Sample run for:
T = table([1;2;3],[NaN; 410; 6],[31; NaN; 27],'VariableNames',{'One' 'Two' 'Three'})
T =
  3×3 table
    One    Two    Three
    ___    ___    _____
     1     NaN     31
     2     410     NaN
     3       6      27
yields: (screenshot of the resulting spreadsheet omitted)
Although the above solution is simpler in my opinion, if you really want to use writetable then:
tmp = table2cell(T); %Converting the table to a cell array
tmp(isnan(T.Variables)) = {[]}; %Replacing the NaN entries with []
T = array2table(tmp,'VariableNames',T.Properties.VariableNames); %Converting back to table
writetable(T,'output.csv'); %Writing to a csv file
I honestly think the most straightforward way to output the data in the format you describe is to use xlswrite as Sardar did in his answer. However, if you really want to use writetable, the only option I can think of is to encapsulate every value in the table in a cell array and replace the NaN entries with empty cells. Starting with this sample table T with random data and NaN values:
T = table(rand(5,1), [nan; rand(3,1); nan], 'VariableNames', {'A', 'C'});
T =
            A                      C
    _________________    _________________
    0.337719409821377                  NaN
    0.900053846417662    0.389738836961253
    0.369246781120215    0.241691285913833
    0.111202755293787    0.403912145588115
    0.780252068321138                  NaN
Here's a general way to do the conversion:
for name = T.Properties.VariableNames   % Loop over variable names
    temp = num2cell(T.(name{1}));        % Convert numeric array to cell array
    temp(cellfun(@isnan, temp)) = {[]};  % Set cells with NaN to empty
    T.(name{1}) = temp;                  % Place back into table
end
And here's what the table T ends up looking like:
T =
             A                      C
    ___________________    ___________________
    [0.337719409821377]    []
    [0.900053846417662]    [0.389738836961253]
    [0.369246781120215]    [0.241691285913833]
    [0.111202755293787]    [0.403912145588115]
    [0.780252068321138]    []
And now you can output it to a file with writetable:
writetable(T, 'sample.csv');

Counter python 3

I read in a [data set](https://outcomestat.baltimorecity.gov/Transportation/100EBaltimoreST/k7ux-mv7u/about) with pandas.read_csv() with no modifying args.
In the stolenVehicleFlag column there are 0, 1, and NaN values.
The NaNs return False when compared to np.nan or np.NaN.
The column is typed numpy.float64, so I tried casting the np.nan values to that type from the plain float they normally are, but the comparison still returns False.
I also tried using a Counter to roll them up, but each NaN returns its own count of 1.
Any ideas on how this is happening and how to deal with it?
I'm not sure what you are expecting to do, but maybe this could help if you want to get rid of the NaN values. Considering "df" to be your dataframe, use:
df.dropna()
This will take care of the NaN values.
You can find more information here: pandas.DataFrame.dropna
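The behaviour described in the question comes from the fact that NaN never compares equal to anything, including itself, so hash-based containers such as Counter can end up with one entry per NaN. A small sketch (the column name follows the question; the data is made up):
import numpy as np
import pandas as pd
from collections import Counter

print(np.nan == np.nan)  # False: NaN never compares equal, even to itself

# Illustrative stand-in for the stolenVehicleFlag column
s = pd.Series([0.0, 1.0, np.nan, np.nan], name="stolenVehicleFlag")

print(Counter(s))                    # each NaN may show up as its own key
print(s.isna().sum())                # 2 -- NaN-aware count
print(s.value_counts(dropna=False))  # counts that include NaN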

Comparing strings in same series (row) but different columns

I ran into this problem when comparing strings between two columns. What I want to do is: for each row, check whether the string in column A is contained in column B and, if so, write a new string 'Yes' in column C.
Column A contains NaN values (blank cells in the csv I imported).
I have tried:
df['C']=df['B'].str.contains(df.loc['A'])
df.loc[df['A'].isin(df['B']), 'C']='Yes'
Neither worked, as I couldn't find the right way to compare the strings.
This uses a list comprehension, so it may not be the fastest solution, but it works and is concise.
df['C'] = pd.Series(['Yes' if a in b else 'No' for a,b in zip(df['A'],df['B'])])
EDIT: If you want to keep the existing values in C instead of overwriting them with 'No', you can do it like this:
df['C'] = pd.Series(['Yes' if a in b else c for a,b,c in zip(df['A'],df['B'], df['C'])])
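Since column A contains NaN (per the question), the expression a in b would raise a TypeError on those rows; a hedged variation that treats non-string values as a non-match could look like this (made-up data):
import numpy as np
import pandas as pd

# Illustrative data; the last row has NaN in A, mirroring the blank csv cells
df = pd.DataFrame({"A": ["ab", "xyz", np.nan], "B": ["abc", "ab", "abc"]})

df["C"] = ["Yes" if isinstance(a, str) and a in b else "No"
           for a, b in zip(df["A"], df["B"])]
print(df)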
df = pd.DataFrame([['ab', 'abc'],
                   ['abc', 'ab']], columns=list('AB'))
df['C'] = np.where(df.apply(lambda x: x.A in x.B, axis=1), 'Yes', 'No')
df
Try regex: https://docs.python.org/2/library/re.html if you have already written the code to identify every cell or value you have to work with.
