Pandas apply with eval not giving NaN as result when NaN is in the column it's calculating on - python-3.x

I have to support the ability for users to run any formula against a frame to produce a new column.
I may have a frame that looks like
  dim01  dim02  msr01
0     A     25    1.0
1     B     26    5.3
2     C     53    NaN
I interpret the user's input to allow them to run a formula using supported functions / standard operators / other columns.
So a formula might look like SQRT([msr01]*100+7)
I convert the user input to Python syntax so this would evaluate to something like
formula_str = '(math.sqrt((row.msr01*100)+7))'
I then apply it to my pandas dataframe like this
data_frame['msr002'] = data_frame.apply(lambda row: eval(formula_str), axis=1)
This was working well until I hit data with a NaN in a column used in the calculation. When this happens, I get a frame like this in return:
  dim01  dim02  msr01   msr02
0     A     25    1.0  10.344
1     B     26    5.3  23.173
2     C     53    NaN   7.342
So it appears that the eval is not evaluating the NaN correctly.
I am using a lexer/parser to ensure that the user-supplied formula isn't dangerous, and to convert everyday user syntax into Python functions that work against the pandas columns.
Any advice on how to fix this?
Perhaps I should include something in the lambda that checks whether any required column is NaN and just hardcodes the result to NaN in that case? But that doesn't seem like the best solution to me (a sketch of that guard follows below).
I did see this question, which is similar, but didn't think it answered my exact need.
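For reference, a minimal sketch of that guard; required_cols is hypothetical here, standing in for the list of columns the lexer/parser found in the formula:

import math
import numpy as np
import pandas as pd

data_frame = pd.DataFrame({'msr01': [1.0, 5.3, np.nan]})
formula_str = '(math.sqrt((row.msr01*100)+7))'
required_cols = ['msr01']  # hypothetical: emitted by the lexer/parser

def eval_formula(row):
    # Short-circuit to NaN when any required input is missing.
    if any(pd.isna(row[c]) for c in required_cols):
        return float('nan')
    return eval(formula_str)  # formula_str sees `row` via this scope

data_frame['msr002'] = data_frame.apply(eval_formula, axis=1)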

So you can try with vectorized operations:
df.msr01.mul(100).add(7)**0.5
Out[716]:
0    10.34408
1    23.17326
2         NaN
Name: msr01, dtype: float64
Also, your original code already returns NaN here:
df.apply(lambda row: eval(formula_str), axis=1)
Out[714]:
0    10.34408
1    23.17326
2         NaN
dtype: float64
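For completeness, the reason the original approach works is that NaN propagates through both arithmetic and math.sqrt (per IEEE 754), which a tiny sketch confirms:

import math

print(float('nan') * 100 + 7)   # nan: arithmetic propagates NaN
print(math.sqrt(float('nan')))  # nan: sqrt propagates NaN too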

Related

pandas df are being read as dict

I'm having some trouble with pandas. I opened a .xlsx file, but when I try to filter any information, it shows me the error
AttributeError: 'dict' object has no attribute 'head' (or iloc, or loc, or anything else from DataFrame/pandas)
So I did some research and realized that my table turned into a dictionary (why?).
I'm trying to convert this mess into a proper dictionary, so I can convert it into a proper df, because right now it shows characteristics of both. I need a df, nothing else.
Here is the code:
import pandas as pd
df = pd.read_excel('report.xlsx', sheet_name = ["May"])
print(df)
Result: it shows the table plus "[60 rows x 24 columns]"
But when I try to filter or iterate, it throws every possible dict attribute error.
Some things I tried: .from_dict, xls.parse, df.to_dict.
When I try to convert df to dict properly, it shows
ValueError: If using all scalar values, you must pass an index
I tried the question Constructing pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index" (https://stackoverflow.com/questions/17839973/constructing-pandas-dataframe-from-values-in-variables-gives-valueerror-if-usi), but it didn't work. For some reason, one of the errors said I should provide 2-d parameters; that's why I tried to create a new dict and do a sort of 'append', but that didn't work either...
Then I tried all sorts of things to set an index, but it doesn't let me rename columns because it says .iloc is not an attribute of dict.
I'm new to Python, but I've never seen pd.read_excel open a DataFrame as a dict. What should I do?
Thanks!
If it's a dict of DataFrames, try:
>>> dict_df = {"a":pd.DataFrame([{1:2,3:4},{1:4,4:6}]), "b":pd.DataFrame([{7:9},{1:4}])}
>>> dict_df
{'a': 1 3 4
0 2 4.0 NaN
1 4 NaN 6.0, 'b': 7 1
0 9.0 NaN
1 NaN 4.0}
>>> pd.concat(dict_df.values(),keys=dict_df.keys(), axis=1)
a b
1 3 4 7 1
0 2 4.0 NaN 9.0 NaN
1 4 NaN 6.0 NaN 4.0
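As an aside on the root cause: passing sheet_name as a list is exactly what makes read_excel return a dict keyed by sheet name, even when the list names a single sheet. A sketch of the two ways to get a plain DataFrame back, assuming the file name from the question:

import pandas as pd

dfs = pd.read_excel('report.xlsx', sheet_name=['May'])  # list -> dict of frames
df = dfs['May']                                         # unwrap the one sheet

df = pd.read_excel('report.xlsx', sheet_name='May')     # or pass a string to get a DataFrame directly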

What is the difference between interpolation and imputation?

I just learned that you can handle missing data/NaN with imputation and interpolation. What I found is that interpolation is a type of estimation, a method of constructing new data points within the range of a discrete set of known data points, while imputation is replacing the missing data with the mean of the column. But are there any differences beyond that? When is it best practice to use each of them?
Interpolation
Interpolation (linear) is basically a straight line between two given points where data points between these two are missing:
[Figure: linear interpolation; the two red points are known, and the missing blue point lies on the straight line between them. Source: Wikipedia]
Okay, nice explanation, but show me with data.
First of all, the formula for linearly interpolating a missing value y at position x between two known points (x0, y0) and (x1, y1) is:
y = y0 + (y1 - y0) / (x1 - x0) * (x - x0)
Let's say we have the three data points from the graph above:
df = pd.DataFrame({'Value':[0, np.NaN, 3]})
   Value
0    0.0
1    NaN
2    3.0
As we can see, row 1 (the blue point) is missing.
So, following the formula above (with x0 = 0, y0 = 0, x1 = 2, y1 = 3, x = 1):
0 + (3 - 0) / (2 - 0) * (1 - 0) = 1.5
If we interpolate these using the pandas method Series.interpolate:
df['Value'].interpolate()
0    0.0
1    1.5
2    3.0
Name: Value, dtype: float64
For a bigger dataset it would look as follows:
df = pd.DataFrame({'Value':[1, np.NaN, 4, np.NaN, np.NaN,7]})
   Value
0    1.0
1    NaN
2    4.0
3    NaN
4    NaN
5    7.0
df['Value'].interpolate()
0    1.0
1    2.5
2    4.0
3    5.0
4    6.0
5    7.0
Name: Value, dtype: float64
Imputation
When we impute the data with the (arithmetic) mean, we use the following formula:
sum(all known points) / n
So for our second dataframe we get:
(1 + 4 + 7) / 3 = 4
So if we impute our dataframe with Series.fillna and Series.mean:
df['Value'].fillna(df['Value'].mean())
0    1.0
1    4.0
2    4.0
3    4.0
4    4.0
5    7.0
Name: Value, dtype: float64
I will answer the second part of your question i.e. when to use what.
We use both techniques depending upon the use case.
Imputation:
If you are given a dataset of patients with a disease (say pneumonia) and there is a feature called body temperature, then null values for this feature can be replaced by the average value across patients: that is imputation.
Interpolation:
If you are given a dataset of a company's share price, you know that the market is closed every Saturday and Sunday, so those are missing values. These can be filled with values that lie on the line between Friday's and Monday's prices: that is interpolation (a concrete sketch follows below).
So, you can choose the technique depending upon the use case.
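To make the share-price case concrete, here is a small sketch with made-up dates and prices (both hypothetical):

import numpy as np
import pandas as pd

# Friday through Monday; the weekend prices are missing.
idx = pd.date_range('2023-01-06', '2023-01-09', freq='D')
prices = pd.Series([100.0, np.nan, np.nan, 106.0], index=idx)

print(prices.interpolate(method='time'))
# Fri 100.0, Sat 102.0, Sun 104.0, Mon 106.0: evenly spaced points
# on the line, not two copies of the Friday/Monday average.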

Python dataframe with value 'NA' not fetching

I am trying to read an Excel file with the below data:
But when I tried to debug the dataframe, it shows only:
Could you explain why the NA is not showing in the dataframe?
Also, is there any way to fetch NA?
Python version: 3.7
In pd.read_excel there's an argument for this called na_values.
Quoted from the documentation:
Additional strings to recognize as NA/NaN.
Furthermore, you have to overwrite the default NaN values, which also include the empty cell '', with the parameter keep_default_na=False.
Again quoting from the documentation:
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.
So the following should help your problem:
df = pd.read_excel('Filename.xlsx', na_values='NA', keep_default_na=False)
Output
     Item     Status
0    Soap        NaN
1  butter
2    Rice        NaN
3     pen  Available
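The flip side, in case the goal is instead to keep the literal string 'NA' rather than convert it to NaN: switch off the default NaN list entirely (a sketch, assuming the same file name):

import pandas as pd

# With keep_default_na=False and no na_values given, nothing is parsed
# as NaN: 'NA' cells survive as the string 'NA', empty cells become ''.
df = pd.read_excel('Filename.xlsx', keep_default_na=False)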

Fill the Null values of the first row of dataframe with 100 [duplicate]

This question already has answers here:
pandas fillna not working
(5 answers)
Closed 3 years ago.
I have a dataframe which looks like this:
51183  53423  51989  52483  51342
  100    NaN    NaN  83.33    NaN
  NaN    NaN     50     25   12.5
Here, '51183', '53423', ... are column names. I want to fill the null values present in the first row with 100.
I tried doing this:
df[:1].fillna(100)
It just changes the null values in the first row to 100, but it doesn't update them in the dataframe.
I want the result to look like this:
51183  53423  51989  52483  51342
  100    100    100  83.33    100
  NaN    NaN     50     25   12.5
If you could help me achieve that , I'll greatly appreciate it.
To update the row, try this:
df[:1] = df[:1].fillna(100)
Your attempt was almost right.
df[:1] gets the initial row, but as a copy of that row.
Then .fillna(100) changes all NaN values to 100, but in this copy,
not in the original table.
An attempt to add inplace=True:
df[:1].fillna(100, inplace=True)
does the job, but issues also a SettingWithCopyWarning warning.
A method to do the job without this warning is e.g. to use .iloc and then .fillna:
df.iloc[0].fillna(100, inplace=True)
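On recent pandas versions (with copy-on-write enabled), that inplace fill on df.iloc[0] may no longer propagate back to the frame either, so a variant that assigns the filled row back explicitly is a safer bet. A sketch, rebuilding the frame from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame([[100, np.nan, np.nan, 83.33, np.nan],
                   [np.nan, np.nan, 50, 25, 12.5]],
                  columns=['51183', '53423', '51989', '52483', '51342'])

# Fill the first row's NaNs and assign the row back in one step.
df.iloc[0] = df.iloc[0].fillna(100)
print(df)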

Getting an error when calculating standard deviation using Pandas

I am trying to calculate the standard deviation of multiple columns using two variables in the groupby. However, my code throws an error and I am having a hard time figuring it out.
I am using https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/ as a guide.
Below is a sample dataframe:
Book  Home   Num  Pointspread    odds
   A     P  -135         -2.5  -110.0
   B     P   NaN           -3  -101.0
   B     P   NaN           -3  -110.0
   C     P   NaN           -3  -120.0
   B     P   NaN           -3  -100.0
and this is the code I wrote:
home_std_dev = home_analysis_data.groupby('Book','Home').agg({'Num': 'std',
                                                              'Pointspread': 'std',
                                                              'odds': 'std'})
The code above gives me an error
ValueError: No axis named Home for object type <class 'type'>
I don't know what this error means and how to solve the issue. I am expecting to see a table with the standard deviation of the columns grouped by the two variables. Any help will be appreciated.
Since I'm quite new to python, please let me know if there is a better way to approach this issue. Thank you!
Use a list in groupby, ['Book','Home'], for grouping by multiple columns:
home_std_dev = home_analysis_data.groupby(['Book','Home']).agg({'Num': 'std',
                                                                'Pointspread': 'std',
                                                                'odds': 'std'})
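A self-contained sketch of that fix against the sample frame from the question; note that single-row groups (and the mostly NaN Num column) come back with NaN standard deviations:

import numpy as np
import pandas as pd

home_analysis_data = pd.DataFrame({
    'Book':        ['A', 'B', 'B', 'C', 'B'],
    'Home':        ['P', 'P', 'P', 'P', 'P'],
    'Num':         [-135, np.nan, np.nan, np.nan, np.nan],
    'Pointspread': [-2.5, -3, -3, -3, -3],
    'odds':        [-110.0, -101.0, -110.0, -120.0, -100.0],
})

home_std_dev = home_analysis_data.groupby(['Book', 'Home']).agg(
    {'Num': 'std', 'Pointspread': 'std', 'odds': 'std'})
print(home_std_dev)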
