Groupby one column and forward fill values in multiple columns based on a condition using Pandas

Given a dataframe as follows:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 xd dt 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh pd 2020 5
Say there are typos in the city and district columns for rows where year is 2020, so I want to group by id and forward-fill those columns with the previous values.
How could I do that in Pandas? Thanks a lot.
The desired output will look like this:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 bj cy 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh hp 2020 5

The following code works, but I'm not sure it's the best solution.
If you have other approaches, feel free to share. Thanks.
import numpy as np

# Blank out the suspect values for 2020 (year is assumed to be numeric here),
# then forward-fill within each id group:
df.loc[df['year'].eq(2020), ['city', 'district']] = np.nan
df[['city', 'district']] = df.groupby('id')[['city', 'district']].ffill()
Out:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 bj cy 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh hp 2020 5
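A variant that avoids writing NaNs into df first is to hide the 2020 values with DataFrame.mask and forward-fill within each id group; a minimal sketch, again assuming year is stored as an integer:
cols = ['city', 'district']
df[cols] = df[cols].mask(df['year'].eq(2020)).groupby(df['id']).ffill()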

Related

How to find again the index after pivoting dataframe?

I created a dataframe from a csv file containing data on the number of deaths by year (running from 1946 to 2021) and month (within year):
dataD = pd.read_csv('MY_FILE.csv', sep=',')
The first rows (out of 902) of the output are:
dataD
Year Month Deaths
0 2021 2 55500
1 2021 1 65400
2 2020 12 62800
3 2020 11 64700
4 2020 10 56900
As expected, the dataframe contains an index numbered 0,1,2, ... and so on.
Now, I pivot this dataframe in order to have only one row per year, with months as columns, using the following code:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths')
The first rows of the result are now:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
My question is:
What do I have to change in the pivoting code above in order to get back the 0, 1, 2, ... index when I output the pivoted file? I think I need to specify index=*** for the pivot instruction to run, but afterwards I would like to recover an index "as usual" (so to speak), exactly like in my first dataframe dataD.
Any possibility?
You can reset_index() after pivoting:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths').reset_index()
This would give you the following:
Month Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
Note that the "Month" here might look like the index name but is actually df.columns.name. You can unset it if preferred:
df.columns.name = None
Which then gives you:
Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
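Both steps can also be chained onto the pivot in one go with rename_axis; a minimal sketch of the same idea:
dataDW = (dataD.pivot(index='Year', columns='Month', values='Deaths')
               .rename_axis(columns=None)
               .reset_index())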

Update multiple columns from another dataframe based on one common column in Pandas

Given the following two dataframes:
df1:
id city district year price
0 1 bjs cyq 2018 12
1 2 bjs cyq 2019 6
2 3 sh hp 2018 4
3 4 shs hpq 2019 3
df2:
id city district year
0 1 bj cy 2018
1 2 bj cy 2019
2 4 sh hp 2019
Let's say some values of city and district in df1 have errors, so I need to update the city and district values in df1 with those from df2, matched on id. My expected result is like this:
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 sh hp 2018 4
3 4 sh hp 2019 3
How could I do that in Pandas? Thanks.
Update:
Solution 1:
cities = df2.set_index('id')['city']
district = df2.set_index('id')['district']
df1['city'] = df1['id'].map(cities)
df1['district'] = df1['id'].map(district)
Solution 2:
df1[["city","district"]] = pd.merge(df1,df2,on=["id"],how="left")[["city_y","district_y"]]
print(df1)
Out:
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 NaN NaN 2018 4
3 4 sh hp 2019 3
Note that city and district for id 3 are NaN, but I want to keep the original values from df1.
Try combine_first:
df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
Output:
id city district price year
0 1 bj cy 12.0 2018.0
1 2 bj cy 6.0 2019.0
2 3 sh hp 4.0 2018.0
3 4 sh hp 3.0 2019.0
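If you want to preserve the original dtypes (combine_first casts price and year to float during alignment, as shown above), DataFrame.update is another option; it overwrites df1 in place only where df2 has values. A sketch:
df1 = df1.set_index('id')
df1.update(df2.set_index('id')[['city', 'district']])
df1 = df1.reset_index()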
Try this
df1[["city","district"]] = pd.merge(df1,df2,on=["id"],how="left")[["city_y","district_y"]]
IIUC, we can use .map.
Edit: the input has changed.
import numpy as np

target_cols = ['city', 'district']
# Blank out the rows that have a match in df2, then fill from df2 by id:
df1.loc[df1['id'].isin(df2['id']), target_cols] = np.nan
cities = df2.set_index('id')['city']
district = df2.set_index('id')['district']
df1['city'] = df1['city'].fillna(df1['id'].map(cities))
# map the district lookup here (the original snippet mapped cities twice by mistake)
df1['district'] = df1['district'].fillna(df1['id'].map(district))
print(df1)
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 sh hp 2018 4
3 4 sh hp 2019 3

Read excel and reformat the multi-index headers in Pandas

Given an excel file with the layout shown below (the original screenshot is omitted). Reading it with pd.read_clipboard, I get:
year 2018 Unnamed: 2 2019 Unnamed: 4
0 city quantity price quantity price
1 bj 10 2 4 7
2 sh 6 8 3 4
Just wondering if it's possible to convert to the following format with Pandas:
year city quantity price
0 2018 bj 10 2
1 2019 bj 4 7
2 2018 sh 6 8
3 2019 sh 3 4
I think the best approach here is to read the excel file into a DataFrame with a MultiIndex in the columns and the first column as the index:
df = pd.read_excel(file, header=[0,1], index_col=[0])
print (df)
year 2018 2019
city quantity price quantity price
bj 10 2 4 7
sh 6 8 3 4
print (df.columns)
MultiIndex([('2018', 'quantity'),
            ('2018',    'price'),
            ('2019', 'quantity'),
            ('2019',    'price')],
           names=['year', 'city'])
Then reshape with DataFrame.stack, change the order of the levels with DataFrame.swaplevel, set the index and column names with DataFrame.rename_axis, and finally convert the index to columns; if necessary, convert year to integers:
df1 = (df.stack(0)
         .swaplevel(0,1)
         .rename_axis(index=['year','city'], columns=None)
         .reset_index()
         .assign(year=lambda x: x['year'].astype(int)))
print (df1)
year city price quantity
0 2018 bj 2 10
1 2019 bj 7 4
2 2018 sh 8 6
3 2019 sh 4 3
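Note that stack orders the remaining column level alphabetically, which is why price appears before quantity above. If the original quantity/price order matters, a final column selection restores it (a small follow-up sketch):
df1 = df1[['year', 'city', 'quantity', 'price']]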

How to randomly generate unobserved data in Python 3

I have a dataframe which contains the observed data:
import pandas as pd
d = {'humanID': [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
     'dogID': [1, 2, 1, 5, 4, 6, 7, 20, 9, 7],
     'month': [1, 1, 2, 3, 1, 2, 3, 1, 2, 2]}
df = pd.DataFrame(data=d)
The df looks like this:
humanID dogID month
0 1 1 1
1 1 2 1
2 2 1 2
3 2 5 3
4 2 4 1
5 2 6 2
6 2 7 3
7 2 20 1
8 2 9 2
9 2 7 2
We have two humans and twenty dogs in total, and the df above contains the observed data. For example:
The first row means: human1 adopted dog1 in January.
The second row means: human1 adopted dog2 in January.
The third row means: human2 adopted dog1 in February.
========================================================================
My goal is to randomly generate two unobserved samples for each (human, month) pair, i.e. samples that do not appear in the original observed data.
For example, human1 did not adopt dogs [3, 4, 5, 6, 7, ..., 20] in January, and I want to randomly create two unobserved samples in triple form:
humanID dogID month
1 20 1
1 10 1
However, the following sample is not allowed, since it appears in the original df:
humanID dogID month
1 2 1
For human1, there is no activity in February, so we don't need to sample unobserved data there.
For human2, there is activity in January, February and March, so for each of those months we want to randomly create unobserved data. For example, in January human2 adopted dog1, dog4 and dog20. The two random unobserved samples could be:
humanID dogID month
2 2 1
2 6 1
The same process applies to February and March.
I want to put all of the unobserved samples in one dataframe, such as the following:
humanID dogID month
0 1 20 1
1 1 10 1
2 2 2 1
3 2 6 1
4 2 13 2
5 2 16 2
6 2 1 3
7 2 20 3
Is there any fast way to do this?
PS: this is a coding-interview question from a start-up company.
Using groupby and random.sample:
import random

dogs = list(range(1, 21))
dfs = []
n_sample = 2
for (h_id, month), d in df.groupby(['humanID', 'month']):
    # dogs this human was NOT observed with in this month
    unseen = list(set(dogs) - set(d['dogID']))
    # random.sample picks without replacement, so the two dogs are distinct
    sample = pd.DataFrame([(h_id, dogID, month) for dogID in random.sample(unseen, k=n_sample)])
    dfs.append(sample)
new_df = pd.concat(dfs).reset_index(drop=True)
new_df.columns = ['humanID', 'dogID', 'month']
print(new_df)
humanID dogID month
0 1 11 1
1 1 5 1
2 2 19 1
3 2 18 1
4 2 15 2
5 2 14 2
6 2 16 3
7 2 18 3
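For larger data, a vectorized alternative (a sketch, assuming pandas >= 1.2 for how='cross' and >= 1.1 for GroupBy.sample) builds the full grid of candidate triples, drops the observed ones with an indicator merge, and samples two rows per (humanID, month) group:
pairs = df[['humanID', 'month']].drop_duplicates()
grid = pairs.merge(pd.DataFrame({'dogID': range(1, 21)}), how='cross')
# anti-join: keep only the triples that never occur in the observed df
merged = grid.merge(df, on=['humanID', 'dogID', 'month'], how='left', indicator=True)
unobserved = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
sampled = (unobserved.groupby(['humanID', 'month'])
                     .sample(n=2, random_state=0)
                     .reset_index(drop=True))
print(sampled[['humanID', 'dogID', 'month']])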
If I understand you correctly, you can use np.random.permutation() on the dogID column to generate a random permutation of the column:
import numpy as np

df_new = df.copy()
df_new['dogID'] = np.random.permutation(df.dogID)
print(df_new.sort_values('month'))
humanID dogID month
0 1 1 1
1 1 20 1
4 2 9 1
7 2 1 1
2 2 4 2
5 2 5 2
8 2 2 2
9 2 7 2
3 2 7 3
6 2 6 3
Or, to create a random sample of values within the range of dogID:
df_new = df.copy()
# +1 so the top dogID value is included in the range
a = np.random.permutation(range(df_new.dogID.min(), df_new.dogID.max() + 1))
df_new['dogID'] = np.random.choice(a, df_new.shape[0])
print(df_new.sort_values('month'))
humanID dogID month
0 1 18 1
1 1 16 1
4 2 1 1
7 2 8 1
2 2 4 2
5 2 2 2
8 2 16 2
9 2 14 2
3 2 4 3
6 2 12 3

How to use groupby in a filter condition in pandas

I have the data below stored in a dataframe, and I want to remove the rows where id equals finalid and there are multiple rows for that same id.
Example:
df_target
id finalid month year count_ph count_sh
1 1 1 2012 12 20
1 2 1 2012 6 18
1 32 1 2012 6 2
2 2 1 2012 2 6
2 23 1 2012 2 6
3 3 1 2012 2 2
Output:
id finalid month year count_ph count_sh
1 2 1 2012 6 18
1 32 1 2012 6 2
2 23 1 2012 2 6
3 3 1 2012 2 2
The functionality is something like: remove the records matching the condition below and keep the rest as the final dataframe:
(df_target.groupby(['id','month','year']).size() > 1) & (df_target['id'] == df_target['finalid'])
I think you need transform to get a Series the same length as the original DataFrame, and ~ to invert the final boolean mask:
df = df_target[~((df_target.groupby(['id','month','year'])['id'].transform('size') > 1) &
                 (df_target['id'] == df_target['finalid']))]
Alternative solution:
df = df_target[((df_target.groupby(['id','month','year'])['id'].transform('size') <= 1) |
                (df_target['id'] != df_target['finalid']))]
print (df)
id finalid month year count_ph count_sh
1 1 2 1 2012 6 18
2 1 32 1 2012 6 2
4 2 23 1 2012 2 6
5 3 3 1 2012 2 2
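An equivalent formulation uses duplicated, where keep=False flags every row whose ('id', 'month', 'year') combination occurs more than once; a sketch of the same logic:
mask = (df_target.duplicated(subset=['id', 'month', 'year'], keep=False) &
        df_target['id'].eq(df_target['finalid']))
df = df_target[~mask]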
