How do I fill these `NaN` values properly? - python-3.x

Here's my original dataframe with NaN values, which I'm trying to fill:
https://prnt.sc/i40j33
If I use df.interpolate(axis=1) to fill up the NaN values, only some of the rows fill up properly with a number.
For example:
https://prnt.sc/i40mgq
As you can see in the screenshot, the cell at column 1981, row 3, which had a NaN value, has been filled properly with a number. I want to fill the rest of the NaN values like that as well. Any idea how I do that?

Using DataFrame.interpolate()
In your case it fails because there is no column to the left of the first one, so the interpolate method has nothing to interpolate from: missing_value = (left_value + right_value) / 2
So you could, for example, insert a column to the left with all 0's (if you would like to impute the missing values in the first column with half of the next value), as such:
df.insert(loc=0, column='allZeroes', value=0)
After this, you could interpolate as you are doing and then remove the helper column, e.g.:
General missing value imputation
Either use df.fillna('DEFAULT-VALUE') as Alex mentioned in the comments to the question (see the pandas fillna docs),
or do something like:
df.loc[df.my_col.isnull(), 'my_col'] = 'DEFAULT-VALUE'  # .loc avoids chained assignment
I'd recommend using fillna, since it supports methods such as forward fill (ffill), which imputes the missing values with the previous value, and other similar methods.

It seems like you might want to interpolate on axis=0, column-wise:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.arange(35, dtype=float).reshape(5, 7),
...                   columns=[1951, 1961, 1971, 1981, 1991, 2001, 2011],
...                   index=range(0, 5))
>>> df.iloc[1:3, 0] = np.nan
>>> df.iloc[3, 3] = np.nan
>>> df.interpolate(axis=0)
1951 1961 1971 1981 1991 2001 2011
0 0.0 1.0 2.0 3.0 4.0 5.0 6.0
1 7.0 8.0 9.0 10.0 11.0 12.0 13.0
2 14.0 15.0 16.0 17.0 18.0 19.0 20.0
3 21.0 22.0 23.0 24.0 25.0 26.0 27.0
4 28.0 29.0 30.0 31.0 32.0 33.0 34.0
Currently you're interpolating row-wise, and NaNs at the start of a row have no value on their left, which makes linear interpolation impossible for them.
Update: pandas added more options for this in v0.23.0.

Related

How to select values from multiple columns based on a condition

I have a dataframe which has information about people with balance in their different accounts. It looks something like below.
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['John', 'Jacob', 'Mary', 'Sue', 'Harry', 'Clara'],
                   'accnt_1': [2, np.nan, 13, np.nan, np.nan, np.nan],
                   'accnt_2': [32, np.nan, 12, 21, 32, np.nan],
                   'accnt_3': [11, 21, np.nan, np.nan, 2, np.nan]})
df
I want to get the balance for each person: if accnt_1 is not empty, that is the person's balance; if accnt_1 is empty and accnt_2 is not, the number in accnt_2 is the balance; and if both accnt_1 and accnt_2 are empty, whatever is in accnt_3 is the balance.
In the end the output should look like
out_df = pd.DataFrame({'name': ['John', 'Jacob', 'Mary', 'Sue', 'Harry', 'Clara'],
                       'balance': [2, 21, 13, 21, 32, np.nan]})
out_df
I will always know the priority of the columns. I could write a simple function and apply it to this dataframe, but is there a better and faster way to do this using pandas/numpy?
If balance means the first non-missing value after name, you can convert name to the index, back fill missing values along the rows, and select the first column by position:
df = df.set_index('name').bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
print (df)
name balance
0 John 2.0
1 Jacob 21.0
2 Mary 13.0
3 Sue 21.0
4 Harry 32.0
5 Clara NaN
If you need to specify the column names in order, pass a list:
cols = ['accnt_1','accnt_2','accnt_3']
df = df.set_index('name')[cols].bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
Or, if you need to select only the accnt columns, use DataFrame.filter:
df = df.set_index('name').filter(like='accnt').bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
You can simply chain fillna calls onto each other to achieve your desired result. The chaining reads in plain English roughly as: take the values in accnt_1 and fill the missing ones with the values from accnt_2; if any NaNs remain after that, fill them with the values from accnt_3.
>>> df["balance"] = df["accnt_1"].fillna(df["accnt_2"]).fillna(df["accnt_3"])
>>> df[["name", "balance"]]
name balance
0 John 2.0
1 Jacob 21.0
2 Mary 13.0
3 Sue 21.0
4 Harry 32.0
5 Clara NaN
df['balance'] = df.name.map(df.set_index('name').stack().groupby('name').first())
name accnt_1 accnt_2 accnt_3 balance
0 John 2.0 32.0 11.0 2.0
1 Jacob NaN NaN 21.0 21.0
2 Mary 13.0 12.0 NaN 13.0
3 Sue NaN 21.0 NaN 21.0
4 Harry NaN 32.0 2.0 32.0
5 Clara NaN NaN NaN NaN
How it works
# setting name as the index means stack produces a MultiIndex keyed by name,
# which the groupby below can use
df.set_index('name').stack().groupby('name').first()
name
John accnt_1 2.0
accnt_2 32.0
accnt_3 11.0
Jacob accnt_3 21.0
Mary accnt_1 13.0
accnt_2 12.0
Sue accnt_2 21.0
Harry accnt_2 32.0
accnt_3 2.0
dtype: float64
# chaining .first() gets you the first non-NaN value for each name, because stack drops NaNs by default
df.set_index('name').stack().groupby('name').first()
# .map() allows you to map the output above back onto the original dataframe
df.name.map(df.set_index('name').stack().groupby('name').first())
0 2.0
1 21.0
2 13.0
3 21.0
4 32.0
5 NaN

Removing outliers based on column variables or multi-index in a dataframe

This is another IQR outlier question. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc.), and not the whole row, just the cell. I would like to do this for each of the 'red', 'yellow', 'green' columns.
Is there way to do this without breaking the dataframe into a whole bunch of sub dataframes with all of the conditions broken out separately? I'm not sure if this would be easier if 'Season' and 'Treatment' were handled as columns or indices. I'm fine with either way.
I've tried a few things with .iloc and .loc but I can't seem to make it work.
If you need to replace the outliers with missing values, use GroupBy.transform with quantile, compare against the lower and upper bounds with DataFrame.lt and DataFrame.gt, chain the masks with | (bitwise OR), and set the missing values with DataFrame.mask (NaN is the default replacement, so it is not specified):
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)
c = df.columns.difference(['Season','Treatment'])
mask = df[c].lt(df1) | df[c].gt(df2)
df[c] = df[c].mask(mask)
print (df)
Season Treatment red yellow green
0 Spring Placebo NaN NaN 67.0
1 Spring Placebo 67.0 91.0 3.0
2 Spring Placebo 71.0 56.0 29.0
3 Spring Placebo 48.0 32.0 24.0
4 Spring Placebo 74.0 9.0 51.0
.. ... ... ... ... ...
95 Fall Drug 90.0 35.0 55.0
96 Fall Drug 40.0 55.0 90.0
97 Fall Drug NaN 54.0 NaN
98 Fall Drug 28.0 50.0 74.0
99 Fall Drug NaN 73.0 11.0
[100 rows x 5 columns]

How to plot multiple charts using matplotlib from unstacked dataframe with Pandas

This is a sample of the dataset I have, produced with the following piece of code:
ComplaintCity = nyc_df.groupby(['City','Complaint Type']).size().sort_values().unstack()
top5CitiesByComplaints = ComplaintCity[top5Complaints].rename_axis(None, axis=1)
top5CitiesByComplaints
Blocked Driveway Illegal Parking Noise - Street/Sidewalk Noise - Commercial Derelict Vehicle
City
ARVERNE 35.0 58.0 29.0 2.0 27.0
ASTORIA 2734.0 1281.0 500.0 1554.0 363.0
BAYSIDE 377.0 514.0 15.0 40.0 198.0
BELLEROSE 95.0 106.0 13.0 37.0 89.0
BREEZY POINT 3.0 15.0 1.0 4.0 3.0
BRONX 12754.0 7859.0 8890.0 2433.0 1952.0
BROOKLYN 28147.0 27461.0 13354.0 11458.0 5179.0
CAMBRIA HEIGHTS 147.0 76.0 25.0 12.0 115.0
CENTRAL PARK NaN 2.0 95.0 NaN NaN
COLLEGE POINT 435.0 352.0 33.0 35.0 184.0
CORONA 2761.0 660.0 238.0 248.0
I want to plot this as a horizontal bar chart for each complaint, displaying the cities with the highest complaint counts, something similar to the image below. I am not sure how to go about it.
You can create a list of axis instances with subplots and plot the columns one by one:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(3, 2, figsize=(10, 6))
for c, ax in zip(df.columns, axes.ravel()):
    df[c].sort_values().plot.barh(ax=ax)
fig.tight_layout()
Then you would get something like this:

Replace missing values based on another column

I am trying to replace the missing values in a dataframe based on filtering by another column, "Country".
>>> data.head()
Country Advanced skiers, freeriders Snow parks
0 Greece NaN NaN
1 Switzerland 5.0 5.0
2 USA NaN NaN
3 Norway NaN NaN
4 Norway 3.0 4.0
Obviously this is just a small snippet of the data, but I am looking to replace all the NaN values with the average value for each feature.
I have tried grouping the data by country and then calculating the mean of each column. When I print the resulting array, it shows the expected values. However, when I pass it into the .fillna() method, the data appears unchanged.
I've tried #DSM's solution from this similar post, but I am not sure how to apply it to multiple columns.
listOfRatings = ['Advanced skiers, freeriders', 'Snow parks']
print (data.groupby('Country')[listOfRatings].mean().fillna(0))
-> displays the expected results
data[listOfRatings] = data[listOfRatings].fillna(data.groupby('Country')[listOfRatings].mean().fillna(0))
-> appears to do nothing to the dataframe
Assuming this is the complete dataset, this is what I would expect the results to be.
Country Advanced skiers, freeriders Snow parks
0 Greece 0.0 0.0
1 Switzerland 5.0 5.0
2 USA 0.0 0.0
3 Norway 3.0 4.0
4 Norway 3.0 4.0
Can anyone explain what I am doing wrong, and how to fix the code?
You can use transform, which returns a new DataFrame the same size as the original, filled with the aggregated values. (Your fillna appeared to do nothing because groupby().mean() is indexed by Country, so it does not align with the original row index; transform preserves the original index, so the alignment works.)
print (data.groupby('Country')[listOfRatings].transform('mean').fillna(0))
Advanced skiers, freeriders Snow parks
0 0.0 0.0
1 5.0 5.0
2 0.0 0.0
3 3.0 4.0
4 3.0 4.0
# dynamically generate all column names except Country
listOfRatings = data.columns.difference(['Country'])
df1 = data.groupby('Country')[listOfRatings].transform('mean').fillna(0)
data[listOfRatings] = data[listOfRatings].fillna(df1)
print (data)
Country Advanced skiers, freeriders Snow parks
0 Greece 0.0 0.0
1 Switzerland 5.0 5.0
2 USA 0.0 0.0
3 Norway 3.0 4.0
4 Norway 3.0 4.0

Pandas DataFrame Apply Efficiency

I have a dataframe to which I want to add a column with a status indicating whether there is a matching value in another dataframe. I have the following code, which works:
df1['NewColumn'] = df1['ComparisonColumn'].apply(lambda x: 'Match' if any(df2.ComparisonColumn == x) else ('' if x is None else 'Missing'))
I know the line is ugly, but I get the impression that it's inefficient. Can you suggest a better way to make this comparison?
You can use np.where, isin, and isnull:
Create some dummy data:
np.random.seed(123)
df = pd.DataFrame({'ComparisonColumn':np.random.randint(10,20,20)})
df.iloc[4] = np.nan  # create missing data
df2 = pd.DataFrame({'ComparisonColumn':np.random.randint(15,30,20)})
Do matching with np.where:
df['NewColumn'] = np.where(df.ComparisonColumn.isin(df2.ComparisonColumn), 'Matched',
                           np.where(df.ComparisonColumn.isnull(), 'Missing', ''))
Output:
ComparisonColumn NewColumn
0 12.0
1 12.0
2 16.0 Matched
3 11.0
4 NaN Missing
5 19.0 Matched
6 16.0 Matched
7 11.0
8 10.0
9 11.0
10 19.0 Matched
11 10.0
12 10.0
13 19.0 Matched
14 13.0
15 14.0
16 10.0
17 10.0
18 14.0
19 11.0
