What is the difference between interpolation and imputation? - python-3.x

I just learned that you can handle missing data/NaN with imputation and interpolation. What I found so far is that interpolation is a type of estimation, a method of constructing new data points within the range of a discrete set of known data points, while imputation is replacing the missing data with the mean of the column. But are there any differences beyond that? When is it best practice to use each of them?

Interpolation
Interpolation (linear) is basically a straight line between two given points where data points between these two are missing:
(Figure omitted: the two red points are known, the blue point between them is missing. Source: Wikipedia.)
Okay, nice explanation, but show me with data.
First of all, the formula for linear interpolation at a position x between two known points (x0, y0) and (x1, y1) is:
y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
Let's say we have the three data points from the graph above:
df = pd.DataFrame({'Value':[0, np.NaN, 3]})
Value
0 0.0
1 NaN
2 3.0
As we can see, row 1 (the blue point) is missing.
So, following the formula from above, the missing value at x = 1 is:
0 + (1 - 0) * (3 - 0) / (2 - 0) = 1.5
If we interpolate these using the pandas method Series.interpolate:
df['Value'].interpolate()
0 0.0
1 1.5
2 3.0
Name: Value, dtype: float64
For a bigger dataset it would look as follows:
df = pd.DataFrame({'Value': [1, np.NaN, 4, np.NaN, np.NaN, 7]})
Value
0 1.0
1 NaN
2 4.0
3 NaN
4 NaN
5 7.0
df['Value'].interpolate()
0 1.0
1 2.5
2 4.0
3 5.0
4 6.0
5 7.0
Name: Value, dtype: float64
Imputation
When we impute the data with the (arithmetic) mean, we use the following formula:
sum(all points) / n
So for our second dataframe we get:
(1 + 4 + 7) / 3 = 4
So if we impute our dataframe with Series.fillna and Series.mean:
df['Value'].fillna(df['Value'].mean())
0 1.0
1 4.0
2 4.0
3 4.0
4 4.0
5 7.0
Name: Value, dtype: float64

I will answer the second part of your question, i.e. when to use what.
We use both techniques depending upon the use case.
Imputation:
Suppose you are given a dataset of patients with a disease (say pneumonia), and there is a feature called body temperature. If there are null values for this feature, you can replace them with the average value, i.e. imputation.
Interpolation:
Suppose you are given a dataset of the share price of a company. You know that the market is closed every Saturday and Sunday, so those are missing values. These values can be filled with the average of the Friday value and the Monday value, i.e. interpolation.
So, you can choose the technique depending upon the use case.
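A minimal sketch of both choices (the column names and values below are made up for illustration, not taken from the question):
import numpy as np
import pandas as pd

# Imputation: fill missing body temperatures with the column mean.
patients = pd.DataFrame({'body_temp': [38.2, np.nan, 39.1, 38.5]})
patients['body_temp'] = patients['body_temp'].fillna(patients['body_temp'].mean())

# Interpolation: fill the weekend share prices from the surrounding weekdays.
days = pd.date_range('2021-01-01', periods=5, freq='D')  # Fri, Sat, Sun, Mon, Tue
prices = pd.Series([100.0, np.nan, np.nan, 106.0, 107.0], index=days)
prices = prices.interpolate()  # Sat/Sun filled on the line between Fri and Mon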

Related

Pandas apply with eval not giving NAN as result when NAN in column its calculating on

I have to support the ability for a user to run any formula against a frame to produce a new column.
I may have a frame that looks like
dim01 dim02 msr01
0 A 25 1.0
1 B 26 5.3
2 C 53 NaN
I interpret user code to allow them to run a formula using supported functions/ standard operators / other columns
So a formula might look like SQRT([msr01]*100+7)
I convert the user input to Python syntax so this would evaluate to something like
formula_str = '(math.sqrt((row.msr01*100)+7))'
I then apply it to my pandas dataframe like this
data_frame['msr002'] = data_frame.apply(lambda row: eval(formula_str), axis=1)
This was working well until I hit data with a NaN in a column used in the calculation. I noticed that when this happens, I get a frame like this in return:
dim01 dim02 msr01 msr02
0 A 25 1.0 10.344
1 B 26 5.3 23.173
2 C 53 NaN 7.342
So it appears that the eval is not evaluating the NaN correctly.
I am using a lexer/parser to ensure that the user-sent formula isn't dangerous and to convert it from everyday user syntax to Python functions that work against the pandas columns.
Any advice on how to fix this?
Perhaps I should include something in the lambda that checks whether any required column is NaN and just hardcodes the result to NaN in that case? But that doesn't seem like the best solution to me.
I did see this question, which is similar, but I didn't think it answered my exact need.
So you can try with:
df.msr01.mul(100).add(7)**0.5
Out[716]:
0 10.34408
1 23.17326
2 NaN
Name: msr01, dtype: float64
Also, with your original code:
df.apply(lambda row: eval(formula_str), axis=1)
Out[714]:
0 10.34408
1 23.17326
2 NaN
dtype: float64
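As an additional sketch (an assumption, not part of the answer above): if the lexer/parser emits np.sqrt instead of math.sqrt, the whole user formula can be evaluated as one vectorised expression, and the NaN row stays NaN without a per-row lambda.
import numpy as np
import pandas as pd

df = pd.DataFrame({'dim01': ['A', 'B', 'C'],
                   'dim02': [25, 26, 53],
                   'msr01': [1.0, 5.3, np.nan]})

# Vectorised equivalent of SQRT([msr01]*100+7); NaN propagates automatically.
formula_str = 'np.sqrt((df.msr01 * 100) + 7)'
df['msr02'] = eval(formula_str)
print(df)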

Pandas converts integer numbers to real numbers when reading from Excel

I recently started exploring Python for analyzing Excel data.
I have an Excel file with two worksheets, each one with one matrix (with m = 1000 rows and n = 999 columns). The elements of both matrices are related to each other: one matrix contains displacement values and the other contains the force values corresponding to each displacement. The displacements and corresponding forces are obtained from m = 1000 numerical simulations and n = 999 increments.
Is it possible to identify the force values that correspond only to displacement values that are integer numbers? Or, as an alternative, is it possible to replace all the decimal numbers in the matrix of displacements with 0? I tried to read the Excel file into a pandas DataFrame, however all values from the matrix of displacements seem to be presented as "real numbers" (e.g. the numbers "1", "2", "3", etc. from Excel are presented with a floating point as "1.", "2.", "3." in Python).
Thank you for your attention.
Let's make an example on a smaller scale (3 × 3).
I prepared an Excel file with 2 sheets and read them:
displ = pd.read_excel('Input_2.xlsx', 'Displ')
forces = pd.read_excel('Input_2.xlsx', 'Forces')
Both DataFrames contain:
displ:
C1 C2 C3
0 10.0 12.1 11.3
1 12.5 13.0 13.5
2 12.6 13.6 13.8
forces:
C1 C2 C3
0 120.1 130.2 140.3
1 150.4 160.5 170.6
2 180.7 190.8 200.9
To identify elements of displ containing integer numbers (actually, still float numbers, but with fractional parts == 0.0), you can run:
displ.mod(1.0) == 0.0
and you will get:
C1 C2 C3
0 True False False
1 False True False
2 False False False
And to get the corresponding force values, with NaN for the other positions, you can run:
forces.where(displ.mod(1.0) == 0.0)
getting:
C1 C2 C3
0 120.1 NaN NaN
1 NaN 160.5 NaN
2 NaN NaN NaN
Another option is to get a list of indices in displ where the corresponding element has a zero fractional part. Actually it is a Numpy function, so it operates on the underlying Numpy array and returns integer (zero-based) indices:
ind = np.nonzero((displ.mod(1.0) == 0.0).values)
The result is:
(array([0, 1], dtype=int64), array([0, 1], dtype=int64))
so it is a 2-tuple of indices: row indices and column indices.
You can also retrieve a list of the indicated elements from forces, again from the underlying Numpy array, by running:
forces.values[ind]
The result is:
array([120.1, 160.5])
To replace "integer" elements of displ with zeroes, you
can run:
displ.mask(displ.mod(1.0) == 0.0, 0, inplace=True)
Now displ contains:
C1 C2 C3
0 0.0 12.1 11.3
1 12.5 0.0 13.5
2 12.6 13.6 13.8
Note that the "wanted" elements are still float zeroes,
but this is a feature of Pandas that each column has one
type, fitting all elements in this column (in this case just float).
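Putting the pieces together as a self-contained sketch (the two DataFrames are constructed inline here instead of being read from Input_2.xlsx):
import numpy as np
import pandas as pd

displ = pd.DataFrame({'C1': [10.0, 12.5, 12.6],
                      'C2': [12.1, 13.0, 13.6],
                      'C3': [11.3, 13.5, 13.8]})
forces = pd.DataFrame({'C1': [120.1, 150.4, 180.7],
                       'C2': [130.2, 160.5, 190.8],
                       'C3': [140.3, 170.6, 200.9]})

is_int = displ.mod(1.0) == 0.0                    # True where the fractional part is 0.0
print(forces.where(is_int))                       # forces at "integer" displacements, NaN elsewhere
print(forces.values[np.nonzero(is_int.values)])   # array([120.1, 160.5])
print(displ.mask(is_int, 0))                      # "integer" displacements replaced with 0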

Fixing Pandas NaN when making a new column?

I have two pandas DataFrames:
id volume
1 100
2 200
3 300
and
id 2020-07-01 2020-07-02 ...
1 12 14
2 5 1
3 7 8
I am trying to make a new column in the first table based on the values in the second table.
df['Total_Change'] = df2.iloc[:, 0] - df2.iloc[:, -1]
df['Change_MoM'] = df2.iloc[:, -2] - df2.iloc[:, -1]
This works, but the values are all shifted down in the table by one, so the first value is NaN and the last value is lost. My result is:
id volume Total_Change Change_MoM
1 100 NaN NaN
2 200 -2 -2
3 300 4 4
Why is this happening? I've already double-checked that the df2.iloc statements are grabbing the correct values, but I don't understand why my first table is shifting the values down a row. I've also tried shifting the table up one, but that left a NaN at the bottom.
The two tables are the same size. To be clear, I want to know how to prevent the NaN from occurring in the first place, not to replace it with some other value.
Both DataFrames have different indexes, and pandas aligns on the index (not on position) when assigning a new column, which produces NaN for labels that don't match. A quick fix is to add reset_index():
df=df.reset_index(drop=True)
df2=df2.reset_index(drop=True)
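A minimal sketch of the alignment effect (the indexes below are assumptions, since the question doesn't show them): the subtraction result carries df2's index, and assigning it to df matches labels, not positions.
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'volume': [100, 200, 300]})            # index 0, 1, 2
df2 = pd.DataFrame({'a': [12, 5, 7], 'b': [14, 1, 8]}, index=[1, 2, 3])    # index 1, 2, 3

df['Change'] = df2['a'] - df2['b']   # label alignment: row 0 becomes NaN, label 3 is dropped
print(df)

df['Change'] = (df2['a'] - df2['b']).reset_index(drop=True)  # align by position instead
print(df)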

How to change the location of the column name in pandas?

Exercise 1: What are the minimum and maximum prices of each variety of wine? Create a Dataframe whose index is the variety category from the min and max thereof.
answ = pd.DataFrame()
answ['price_max_variety'] = reviews.groupby('variety').price.max()
answ['price_min_variety'] = reviews.groupby('variety').price.min()
answ.head()
Output:
             price_max_variety  price_min_variety
variety                                            # this is my problem: what is this row?
Abouriou                  75.0               15.0
Agiorgitiko               66.0               10.0
Aglianico                180.0                6.0
Aidani                    27.0               27.0
Airen                     10.0                8.0
I would like it to look right, but I have no idea how to fix this display, and because the relevant terms are so generic I can't find relevant information.
The blank grey row that you've pointed out is there to make room for the name of the DataFrame's index, which is variety. That index came from the default behaviour of df.groupby(): the grouped-by column(s) end up in the index of the resulting DataFrame.
To override this, try df.groupby('variety', as_index=False). Or, if you have a DataFrame with an index that you want to move into a column, run df.reset_index().
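A minimal sketch of both options (using a tiny made-up stand-in for the reviews data):
import pandas as pd

reviews = pd.DataFrame({'variety': ['Abouriou', 'Abouriou', 'Aidani'],
                        'price': [15.0, 75.0, 27.0]})

# Option 1: keep 'variety' as an ordinary column instead of the index.
answ = reviews.groupby('variety', as_index=False).agg(
    price_max_variety=('price', 'max'),
    price_min_variety=('price', 'min'))

# Option 2: group as before, then move the index back into a column.
answ2 = reviews.groupby('variety').price.agg(['max', 'min']).reset_index()
print(answ)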

Ignore #N/As in Excel LINEST function with multiple independent variables (known_x's)

I am trying to find the equation of a plane of best fit to a set of x,y,z data using the LINEST function. Some of the z data is missing, meaning that there are #N/As in the z column. For example:
A B C
(x) (y) (z)
1 1 1 5.1
2 2 1 5.4
3 3 1 5.7
4 1 2 #N/A
5 2 2 5.2
6 3 2 5.5
7 1 3 4.7
8 2 3 5
9 3 3 5.3
I would like to do =LINEST(C1:C9,A1:B9), but the #N/A causes this to return a value error.
I found a solution for a single independent variable (one column of known_x's, i.e. fitting a line to x,y data), but I have not been able to extend it for two independent variables (two known_x's columns, i.e. fitting a plane to x,y,z data). The solution I found is here: http://www.excelforum.com/excel-general/647448-linest-question.html, and the formula (slightly modified for my application) is:
=LINEST(
N(OFFSET(C1:C9,SMALL(IF(ISNUMBER(C1:C9),ROW(C1:C9)-ROW(C1)),
ROW(INDIRECT("1:"&COUNT(C1:C9)))),0,1)),
N(OFFSET(A1:A9,SMALL(IF(ISNUMBER(C1:C9),ROW(C1:C9)-ROW(C1)),
ROW(INDIRECT("1:"&COUNT(C1:C9)))),0,1)),
)
which is equivalent to =LINEST(C1:C9,A1:A9), ignoring the row containing the #N/A.
The formula from the posted link could probably be adapted but it is unwieldy. Least squares with missing data can be viewed as a regression with weight 1 for numeric values and weight 0 for non-numeric values. Based on this observation you could try this (with Ctrl+Shift+Enter in a 1x3 range):
=LINEST(IF(ISNUMBER(C1:C9),C1:C9,),IF(ISNUMBER(C1:C9),CHOOSE({1,2,3},1,A1:A9,B1:B9),),)
This gives the equation of the plane as z = 0.3x - 0.2y + 5 (LINEST reports the coefficients in reverse order, so the 1x3 output range shows -0.2, 0.3, 5), which can be checked against the results of using LINEST(C1:C8,A1:B8) with the error row removed.
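For a cross-check outside Excel (a sketch only; the worksheet data is re-typed here and np.nan stands in for #N/A), the same idea of giving missing rows zero weight amounts to dropping them and fitting an ordinary least-squares plane:
import numpy as np

x = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3], dtype=float)            # column A
y = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=float)            # column B
z = np.array([5.1, 5.4, 5.7, np.nan, 5.2, 5.5, 4.7, 5.0, 5.3])    # column C

mask = ~np.isnan(z)                                   # weight 1 for numeric, 0 for missing
A = np.column_stack([x[mask], y[mask], np.ones(mask.sum())])
coef, *_ = np.linalg.lstsq(A, z[mask], rcond=None)
print(coef)   # approximately [ 0.3, -0.2,  5.0]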
