Replace missing values based on another column - python-3.x

I am trying to replace the missing values in a dataframe based on the values of another column, "Country".
>>> data.head()
       Country  Advanced skiers, freeriders  Snow parks
0       Greece                          NaN         NaN
1  Switzerland                          5.0         5.0
2          USA                          NaN         NaN
3       Norway                          NaN         NaN
4       Norway                          3.0         4.0
Obviously this is just a small snippet of the data, but I am looking to replace all the NaN values with the average value for each feature.
I have tried grouping the data by the country and then calculating the mean of each column. When I print out the resulting frame, it shows the expected values. However, when I pass it to the .fillna() method, the data appears unchanged.
I've tried @DSM's solution from this similar post, but I am not sure how to apply it to multiple columns.
listOfRatings = ['Advanced skiers, freeriders', 'Snow parks']
print (data.groupby('Country')[listOfRatings].mean().fillna(0))
-> displays the expected results
data[listOfRatings] = data[listOfRatings].fillna(data.groupby('Country')[listOfRatings].mean().fillna(0))
-> appears to do nothing to the dataframe
Assuming this is the complete dataset, this is what I would expect the results to be.
       Country  Advanced skiers, freeriders  Snow parks
0       Greece                          0.0         0.0
1  Switzerland                          5.0         5.0
2          USA                          0.0         0.0
3       Norway                          3.0         4.0
4       Norway                          3.0         4.0
Can anyone explain what I am doing wrong, and how to fix the code?

Your fillna call appears to do nothing because the groupby().mean() result is indexed by Country, which does not align with the original 0..4 index, so no values match. Use transform instead, which returns a new DataFrame with the same shape and index as the original, filled with the aggregated values:
print (data.groupby('Country')[listOfRatings].transform('mean').fillna(0))
   Advanced skiers, freeriders  Snow parks
0                          0.0         0.0
1                          5.0         5.0
2                          0.0         0.0
3                          3.0         4.0
4                          3.0         4.0
# dynamically generate all column names except Country
listOfRatings = data.columns.difference(['Country'])
df1 = data.groupby('Country')[listOfRatings].transform('mean').fillna(0)
data[listOfRatings] = data[listOfRatings].fillna(df1)
print (data)
       Country  Advanced skiers, freeriders  Snow parks
0       Greece                          0.0         0.0
1  Switzerland                          5.0         5.0
2          USA                          0.0         0.0
3       Norway                          3.0         4.0
4       Norway                          3.0         4.0
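For reference, a minimal end-to-end sketch of my own (assuming the five rows above are the complete dataset) that reproduces the fill:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Country': ['Greece', 'Switzerland', 'USA', 'Norway', 'Norway'],
    'Advanced skiers, freeriders': [np.nan, 5.0, np.nan, np.nan, 3.0],
    'Snow parks': [np.nan, 5.0, np.nan, np.nan, 4.0],
})

listOfRatings = data.columns.difference(['Country'])
# per-country means broadcast back onto the original index; all-NaN groups become 0
group_means = data.groupby('Country')[listOfRatings].transform('mean').fillna(0)
data[listOfRatings] = data[listOfRatings].fillna(group_means)
print(data)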

Related

how to get quartiles and classify a value according to this quartile range

I have this df:
d = pd.DataFrame({'Name': ['Andres', 'Lars', 'Paul', 'Mike'],
                  'target': ['A', 'A', 'B', 'C'],
                  'number': [10, 12.3, 11, 6]})
And I want to classify each number into a quartile. I am doing this:
(d.groupby(['Name','target','number'])['number']
   .quantile([0.25, 0.5, 0.75, 1]).unstack()
   .reset_index()
   .rename(columns={0.25: "1Q", 0.5: "2Q", 0.75: "3Q", 1: "4Q"})
)
But as you can see, the 4 quartiles all come out equal, because the code above calculates them per row: with only one number per row, every quartile is that same number.
If I run instead:
d['number'].quantile([0.25,0.5,0.75,1])
Then I have the 4 quartiles I am looking for:
0.25     9.000
0.50    10.500
0.75    11.325
1.00    12.300
What I need as output (showing only the first 2 rows):
     Name target  number   1Q    2Q      3Q     4Q  Rank
0  Andres      A    10.0  9.0  10.5  11.325  12.30     1
1    Lars      A    12.3  9.0  10.5  11.325  12.30     4
As you can see, all the quartiles now hold the values computed over the whole number column. Besides that, there is a column named Rank that classifies each number according to its quartile, e.g. in the first row 10 falls within the 1st quartile.
Here's one way that builds on the quantiles you've created, by turning them into a DataFrame and joining it to d. It also assigns the "Rank" column using the rank method:
out = (d.join(d['number'].quantile([0.25, 0.5, 0.75, 1])
               .set_axis([f'{i}Q' for i in range(1, 5)], axis=0)
               .to_frame().T
               .pipe(lambda x: x.loc[x.index.repeat(len(d))])
               .reset_index(drop=True))
        .assign(Rank=d['number'].rank(method='dense')))
Output:
     Name target  number   1Q    2Q      3Q    4Q  Rank
0  Andres      A    10.0  9.0  10.5  11.325  12.3   2.0
1    Lars      A    12.3  9.0  10.5  11.325  12.3   4.0
2    Paul      B    11.0  9.0  10.5  11.325  12.3   3.0
3    Mike      C     6.0  9.0  10.5  11.325  12.3   1.0
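For the Rank column specifically, a small alternative sketch of my own uses pd.qcut, which cuts on the same overall quantiles and therefore reproduces the same labels as the dense rank above:
import pandas as pd

d = pd.DataFrame({'Name': ['Andres', 'Lars', 'Paul', 'Mike'],
                  'target': ['A', 'A', 'B', 'C'],
                  'number': [10, 12.3, 11, 6]})

q = d['number'].quantile([0.25, 0.5, 0.75, 1])
out = d.assign(**{f'{i}Q': q.iloc[i - 1] for i in range(1, 5)},
               Rank=pd.qcut(d['number'], q=4, labels=[1, 2, 3, 4]))
print(out)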

how to select values from multiple columns based on a condition

I have a dataframe which has information about people with balance in their different accounts. It looks something like below.
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['John', 'Jacob', 'Mary', 'Sue', 'Harry', 'Clara'],
                   'accnt_1': [2, np.nan, 13, np.nan, np.nan, np.nan],
                   'accnt_2': [32, np.nan, 12, 21, 32, np.nan],
                   'accnt_3': [11, 21, np.nan, np.nan, 2, np.nan]})
df
I want to get a balance for each person: if accnt_1 is not empty, that is the person's balance. If accnt_1 is empty and accnt_2 is not, the number in accnt_2 is the balance. If both accnt_1 and accnt_2 are empty, whatever is in accnt_3 is the balance.
In the end the output should look like
out_df = pd.DataFrame({'name': ['John', 'Jacob', 'Mary', 'Sue', 'Harry', 'Clara'],
                       'balance': [2, 21, 13, 21, 32, np.nan]})
out_df
I will always know the priority of the columns. I could write a simple function and apply it to this dataframe, but is there a better and faster way to do this with pandas/numpy?
If balance means the first non-missing value after name, you can convert name to the index, back-fill missing values along the rows, and select the first column by position:
df = df.set_index('name').bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
print (df)
    name  balance
0   John      2.0
1  Jacob     21.0
2   Mary     13.0
3    Sue     21.0
4  Harry     32.0
5  Clara      NaN
If you need to specify the column names in order as a list:
cols = ['accnt_1','accnt_2','accnt_3']
df = df.set_index('name')[cols].bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
Or, if you need to filter only the accnt columns, use DataFrame.filter:
df = df.set_index('name').filter(like='accnt').bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
You can simply chain fillna methods onto each other to achieve your desired result. The chaining can be read in plain English roughly as: "take the values in accnt_1 and fill the missing values in accnt_1 with values from accnt_2. Then, if there are still NaNs remaining after this, fill those missing values with the values from accnt_3".
>>> df["balance"] = df["accnt_1"].fillna(df["accnt_2"]).fillna(df["accnt_3"])
>>> df[["name", "balance"]]
    name  balance
0   John      2.0
1  Jacob     21.0
2   Mary     13.0
3    Sue     21.0
4  Harry     32.0
5  Clara      NaN
df['balance']=df.name.map(df.set_index('name').stack().groupby('name').first())
    name  accnt_1  accnt_2  accnt_3  balance
0   John      2.0     32.0     11.0      2.0
1  Jacob      NaN      NaN     21.0     21.0
2   Mary     13.0     12.0      NaN     13.0
3    Sue      NaN     21.0      NaN     21.0
4  Harry      NaN     32.0      2.0     32.0
5  Clara      NaN      NaN      NaN      NaN
How it works
# setting name as the index keeps it available as an index level after you stack
df.set_index('name').stack().groupby('name').first()
name
John   accnt_1     2.0
       accnt_2    32.0
       accnt_3    11.0
Jacob  accnt_3    21.0
Mary   accnt_1    13.0
       accnt_2    12.0
Sue    accnt_2    21.0
Harry  accnt_2    32.0
       accnt_3     2.0
dtype: float64
# chaining .first() gets you the first non-NaN value per name, because NaNs are dropped when you stack
df.set_index('name').stack().groupby('name').first()
# .map() maps the output above back onto the original dataframe by name
df.name.map(df.set_index('name').stack().groupby('name').first())
0 2.0
1 21.0
2 13.0
3 21.0
4 32.0
5 NaN
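For completeness, a NumPy-flavored sketch of my own (not from the answers above) that picks the first non-missing account per row without stacking:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jacob', 'Mary', 'Sue', 'Harry', 'Clara'],
                   'accnt_1': [2, np.nan, 13, np.nan, np.nan, np.nan],
                   'accnt_2': [32, np.nan, 12, 21, 32, np.nan],
                   'accnt_3': [11, 21, np.nan, np.nan, 2, np.nan]})

vals = df[['accnt_1', 'accnt_2', 'accnt_3']].to_numpy()
first = (~np.isnan(vals)).argmax(axis=1)        # column index of the first non-NaN per row
balance = vals[np.arange(len(vals)), first]     # all-NaN rows stay NaN (argmax -> 0, value is NaN)
out_df = pd.DataFrame({'name': df['name'], 'balance': balance})
print(out_df)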

Pandas pivot with multiple items per column, how to avoid aggregating them?

Follow up to this question, in particular this comment.
Consider following dataframe:
df = pd.DataFrame({
    'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana', 'Erika', 'Erika'],
    'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike', 'House', 'Car'],
    'Value': [300.0, 10.0, 12.0, 450.0, 15.0, 2.0, 600.0, 11.0],
})
Which looks like this:
  Person Belonging  Value
0   Adam     House  300.0
1   Adam       Car   10.0
2  Cesar       Car   12.0
3  Diana     House  450.0
4  Diana       Car   15.0
5  Diana      Bike    2.0
6  Erika     House  600.0
7  Erika       Car   11.0
Using a pivot_table() is a nice way to reshape this data: it allows querying by Person and seeing all of their belongings in a single row, making it really easy to answer queries such as "How do I find the Value of a Person's Car, if they have a House valued at more than 400.0?"
A pivot_table() can be easily built for this data set with:
df_pivot = df.pivot_table(
    values='Value',
    index='Person',
    columns='Belonging',
)
Which will look like:
Belonging  Bike   Car  House
Person
Adam        NaN  10.0  300.0
Cesar       NaN  12.0    NaN
Diana       2.0  15.0  450.0
Erika       NaN  11.0  600.0
But this gets limited when a Person has more than one of the same type of Belonging, for example two Cars, two Houses or two Bikes.
Consider the updated data:
df = pd.DataFrame({
    'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana', 'Erika', 'Erika', 'Diana', 'Adam'],
    'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike', 'House', 'Car', 'Car', 'House'],
    'Value': [300.0, 10.0, 12.0, 450.0, 15.0, 2.0, 600.0, 11.0, 21.0, 180.0],
})
Which looks like:
  Person Belonging  Value
0   Adam     House  300.0
1   Adam       Car   10.0
2  Cesar       Car   12.0
3  Diana     House  450.0
4  Diana       Car   15.0
5  Diana      Bike    2.0
6  Erika     House  600.0
7  Erika       Car   11.0
8  Diana       Car   21.0
9   Adam     House  180.0
Now that same pivot_table() will return the average of Diana's two cars, or Adam's two houses:
Belonging  Bike   Car  House
Person
Adam        NaN  10.0  240.0
Cesar       NaN  12.0    NaN
Diana       2.0  18.0  450.0
Erika       NaN  11.0  600.0
So we can pass pivot_table() an aggfunc='sum' or aggfunc=np.sum to get the sum rather than the average, which will give us 480.0 and 36.0 and is probably a better representation of the total value a Person owns in Belongings of a certain type. But we're missing details.
We can use aggfunc=list which will preserve them:
df_pivot = df.pivot_table(
    values='Value',
    index='Person',
    columns='Belonging',
    aggfunc=list,
)
Belonging   Bike           Car           House
Person
Adam         NaN        [10.0]  [300.0, 180.0]
Cesar        NaN        [12.0]             NaN
Diana      [2.0]  [15.0, 21.0]         [450.0]
Erika        NaN        [11.0]         [600.0]
This keeps the detail about multiple Belongings per Person, but on the other hand is quite inconvenient in that it is using Python lists rather than native Pandas types and columns, so it makes some queries such as the total Values in Houses difficult to answer.
With aggfunc=np.sum, we could simply use df_pivot['House'].sum() to get the total of 1530.0. With lists, even questions such as the one above (the Cars of Persons with a House worth more than 400.0) become harder to answer.
What's a better way to reshape this data that will:
Allow easy querying a Person's Belongings in a single row, like the pivot_table() does;
Preserve the details of Persons who have multiple Belongings of a certain type;
Use native Pandas columns and data types that make it possible to use Pandas methods for querying and summarizing the data.
I thought of updating the Belonging descriptions to include a counter, such as "House 1", "Car 2", etc. Perhaps sorting so that the most valuable one comes first (to help answer questions such as "has a house worth more than 400.0" looking at "House 1" only.)
Or perhaps using a pd.MultiIndex to still be able to access all "House" columns together.
But unsure how to actually reshape the data in such a way.
Or are there better suggestions on how to reshape it (other than adding a count per belonging) that would preserve the features described above? How would you reshape it and how would you answer all these queries I mentioned above?
Perhaps something like this. Given your pivot table in the following dataframe:
pv = df_pivot = df.pivot_table(
    values='Value',
    index='Person',
    columns='Belonging',
    aggfunc=list,
)
Then apply pd.Series to all columns. For proper naming of the columns, calculate the maximum list length in each column and then use set_axis for renaming:
new_pv = pd.DataFrame(index=pv.index)
for col in pv:
    n = int(pv[col].str.len().max())
    new_pv = pd.concat([new_pv,
                        pv[col].apply(pd.Series)
                               .set_axis([f'{col}_{i}' for i in range(n)], axis=1)],
                       axis=1)
#         Bike_0  Car_0  Car_1  House_0  House_1
# Person
# Adam       NaN   10.0    NaN    300.0    180.0
# Cesar      NaN   12.0    NaN      NaN      NaN
# Diana      2.0   15.0   21.0    450.0      NaN
# Erika      NaN   11.0    NaN    600.0      NaN
Counting houses per person:
new_pv.filter(like='House').count(1)
# Person
# Adam 2
# Cesar 0
# Diana 1
# Erika 1
# dtype: int64
Sum of all houses' values:
new_pv.filter(like='House').sum().sum()
# 1530.0
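A possible follow-up query on new_pv (my own sketch, not part of the answer): the Car values of Persons whose most valuable House is worth more than 400.0:
new_pv[new_pv.filter(like='House').max(axis=1) > 400.0].filter(like='Car')
#         Car_0  Car_1
# Person
# Diana    15.0   21.0
# Erika    11.0    NaN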
Using groupby, you could achieve something like this.
df_new = df.groupby(['Person', 'Belonging']).agg(['sum', 'count', 'min', 'max'])
which would give:
                  Value
                    sum count    min    max
Person Belonging
Adam   Car         10.0     1   10.0   10.0
       House      480.0     2  180.0  300.0
Cesar  Car         12.0     1   12.0   12.0
Diana  Bike         2.0     1    2.0    2.0
       Car         36.0     2   15.0   21.0
       House      450.0     1  450.0  450.0
Erika  Car         11.0     1   11.0   11.0
       House      600.0     1  600.0  600.0
You could also define your own functions in the .agg method to provide more suitable descriptions.
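For instance, a small sketch of my own using named aggregation with a custom lambda (the result column names total/n/spread are just illustrative):
df.groupby(['Person', 'Belonging'])['Value'].agg(
    total='sum',
    n='count',
    spread=lambda s: s.max() - s.min(),   # custom summary: range of values within the group
)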
Edit
Alternatively, you could try
df['Belonging'] = df["Belonging"] + "_" + df.groupby(['Person','Belonging']).cumcount().add(1).astype(str)
  Person Belonging  Value
0   Adam   House_1  300.0
1   Adam     Car_1   10.0
2  Cesar     Car_1   12.0
3  Diana   House_1  450.0
4  Diana     Car_1   15.0
5  Diana    Bike_1    2.0
6  Erika   House_1  600.0
7  Erika     Car_1   11.0
8  Diana     Car_2   21.0
9   Adam   House_2  180.0
Then you can just use pivot:
df.pivot(index='Person', columns='Belonging')
           Value
Belonging Bike_1 Car_1 Car_2 House_1 House_2
Person
Adam         NaN  10.0   NaN   300.0   180.0
Cesar        NaN  12.0   NaN     NaN     NaN
Diana        2.0  15.0  21.0   450.0     NaN
Erika        NaN  11.0   NaN   600.0     NaN
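As a usage sketch of my own on this suffixed-column shape, the "Car values for Persons with a House worth more than 400.0" query could look like:
piv = df.pivot(index='Person', columns='Belonging')['Value']
piv.loc[piv['House_1'] > 400.0].filter(like='Car')
# Belonging  Car_1  Car_2
# Person
# Diana       15.0   21.0
# Erika       11.0    NaN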
I ended up working out a solution to this one, inspired by the excellent answers by @SpghttCd and @Josmoor98, but with a couple of differences:
Using a MultiIndex, so I have a really easy way to get all Houses or all Cars.
Sorting values, so looking at the first House or Car can be used to tell who has one worth more than X.
Code for the pivot table:
df_pivot = (df
    .assign(BelongingNo=df
        .sort_values(by='Value', ascending=False)
        .groupby(['Person', 'Belonging'])
        .cumcount() + 1
    )
    .pivot_table(
        values='Value',
        index='Person',
        columns=['Belonging', 'BelongingNo'],
    )
)
Resulting DataFrame:
Belonging   Bike   Car         House
BelongingNo    1     1     2       1      2
Person
Adam         NaN  10.0   NaN   300.0  180.0
Cesar        NaN  12.0   NaN     NaN    NaN
Diana        2.0  21.0  15.0   450.0    NaN
Erika        NaN  11.0   NaN   600.0    NaN
Queries are pretty straightforward.
For example, finding the Value of Person's Cars, if they have a House valued more than 400.0:
df_pivot.loc[
    df_pivot[('House', 1)] > 400.0,
    'Car'
]
Result:
BelongingNo     1     2
Person
Diana        21.0  15.0
Erika        11.0   NaN
The average Car price for them:
df_pivot.loc[
    df_pivot[('House', 1)] > 400.0,
    'Car'
].stack().mean()
Result: 15.6666
Here, using stack() is a powerful way to flatten the second level of the MultiIndex, after having used the top-level to select a Belonging column.
Same is useful to get the total Value of all Houses:
df_pivot['House'].sum()
Results in the expected 1530.0.
Finally, looking at all Belongings of a single Person:
df_pivot.loc['Adam'].dropna()
Returns the expected two Houses and the one Car, with their respective Values.
I tried doing this with the lists in the dataframe, so that they get converted to ndarrays.
pd_df_pivot = df_pivot.copy(deep=True)
for row in range(0, df_pivot.shape[0]):
    for col in range(0, df_pivot.shape[1]):
        if type(df_pivot.iloc[row, col]) is list:
            pd_df_pivot.iloc[row, col] = np.array(df_pivot.iloc[row, col])
        else:
            pd_df_pivot.iloc[row, col] = df_pivot.iloc[row, col]

Perform arithmetic operation mainly subtraction and division over a pandas series on null values

Simply put, I want subtraction/division with a null value to return the non-null value (the digit), e.g. 3 / np.nan = 3 or 2 - np.nan = 2.
Using np.nansum and np.nanprod I have handled addition and multiplication, but I don't know how to do the same for subtraction and division.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df
Out[6]:
   a    b  c=a-b  d=a/b
0  1  1.0    0.0    1.0
1  2  2.0    0.0    1.0
2  3  NaN    3.0    3.0
3  4  NaN    4.0    4.0
The above is what I am actually looking for.
#Use fill value of 0 for subtraction operation
df['c']=df.a.sub(df.b,fill_value=0)
#Use fill value of 1 for division operation
df['d']=df.a.div(df.b,fill_value=1)
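A quick self-contained check of my own that the fill_value approach reproduces the desired output from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, np.nan, np.nan]})
df['c'] = df.a.sub(df.b, fill_value=0)   # 3 - NaN is treated as 3 - 0 = 3
df['d'] = df.a.div(df.b, fill_value=1)   # 3 / NaN is treated as 3 / 1 = 3
print(df)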
IIUC, using sub with fill_value:
df.a.sub(df.b,fill_value=0)
Out[251]:
0 0.0
1 0.0
2 3.0
3 4.0
dtype: float64

How do I fill these `NaN` values properly?

Here's my original dataframe with NaN values which I'm trying to fill:
https://prnt.sc/i40j33
If I use df.interpolate(axis=1) to fill up the NaN values, only some of the rows fill up properly with a number.
For example:
https://prnt.sc/i40mgq
As you can see in the screenshot, the cell at column 1981 and row 3, which had a NaN value, has been filled properly with a number. I want to fill the rest of the NaNs like that as well. Any idea how I do that?
Using DataFrame.interpolate()
In your case it is failing because there are no columns to the left, and therefore the interpolate method doesn't know what to interpolate it to: missing_value = (left_value + right_value)/2
So you could, for example, insert a column to the left with all 0's (if you would like to impute your missing values on the first column with half of the next value), as such:
df.insert(loc=0, column='allZeroes', value=0)
After this, you could interpolate as you are doing and then remove the column.
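A small sketch of my own of that full sequence (the allZeroes column name is just the placeholder used above):
df.insert(loc=0, column='allZeroes', value=0)            # temporary anchor column of zeroes
df = df.interpolate(axis=1).drop(columns='allZeroes')    # leading NaNs become half of the next value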
General missing value imputation
Either use df.fillna('DEFAULT-VALUE') as Alex mentioned in the comments to the question (see the DataFrame.fillna docs),
or do something like:
df.loc[df.my_col.isnull(), 'my_col'] = 'DEFAULT-VALUE'
I'd recommend using fillna, as you can use methods such as forward fill (ffill) -- impute the missing values with the previous value -- and other similar methods.
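For instance, a short sketch of my own forward-filling along each row:
df.ffill(axis=1)   # each NaN takes the last non-missing value to its left in the same row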
It seems like you might want to interpolate on axis=0, column-wise:
>>> df = pd.DataFrame(np.arange(35, dtype=float).reshape(5, 7),
...                   columns=[1951, 1961, 1971, 1981, 1991, 2001, 2001],
...                   index=range(0, 5))
>>> df.iloc[1:3, 0] = np.nan
>>> df.iloc[3, 3] = np.nan
>>> df.interpolate(axis=0)
   1951  1961  1971  1981  1991  2001  2001
0   0.0   1.0   2.0   3.0   4.0   5.0   6.0
1   7.0   8.0   9.0  10.0  11.0  12.0  13.0
2  14.0  15.0  16.0  17.0  18.0  19.0  20.0
3  21.0  22.0  23.0  24.0  25.0  26.0  27.0
4  28.0  29.0  30.0  31.0  32.0  33.0  34.0
Currently you're interpolating row-wise. NaNs that "begin" a Series aren't padded by a value on either side, making interpolation impossible for them.
Update: pandas is adding some more optionality for this in v 0.23.0.
