Grouping using groupby based on certain conditions - python-3.x
I have the following dataframe:
data = pd.DataFrame({
'ID': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6],
'Date_Time': ['2010-01-01 12:01:00', '2010-01-01 01:27:33',
'2010-04-02 12:01:00', '2010-04-01 07:24:00', '2011-01-01 12:01:00',
'2011-01-01 01:27:33', '2013-01-01 12:01:00', '2014-01-01 12:01:00',
'2014-01-01 01:27:33', '2015-01-01 01:27:33', '2016-01-01 01:27:33',
'2011-01-01 01:28:00'],
'order': [2, 4, 5, 6, 7, 8, 9, 2, 3, 5, 6, 8],
'sort': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})
And I would like to get the following columns:
1- sum_order_total_1: the sum of the values in column order per ID, taken over the rows where sort is 1; the rows where sort is 0 get NaN
2- sum_order_total_0: the sum of the values in column order per ID, taken over the rows where sort is 0; the rows where sort is 1 get NaN
3- count_order_date_1: the sum of the values in column order per ID and per Date_Time day, taken over the rows where sort is 1; the rows where sort is 0 get NaN
4- count_order_date_0: the sum of the values in column order per ID and per Date_Time day, taken over the rows where sort is 0; the rows where sort is 1 get NaN
The expected results should look like the attached photo here:
The problem with groupby (and pd.pivot_table) is that they only do half of the job: they give you the numbers, but not in the format you want. To get the final format you can use apply.
For the totals per ID and sort value I used:
import numpy as np
import pandas as pd
# Pivot order per ID and sort, similar to the groupby you would write by hand.
# df is the DataFrame from the question (there called data).
data_total = pd.pivot_table(df, values='order', index=['ID'], columns=['sort'], aggfunc=np.sum)
data_total.reset_index(inplace=True)
Which results in the table:
sort ID 0 1
0 1 6.0 11.0
1 2 15.0 NaN
2 3 NaN 9.0
3 4 3.0 2.0
4 5 5.0 NaN
5 6 8.0 6.0
Now, using this as a lookup table (keyed on 'ID' and on 0 or 1 for sort), we can write a small function that fills in the right value:
def filter_count(data, row, sort_value):
    """ Select the count that belongs to the correct ID and sort combination. """
    if row['sort'] == sort_value:
        return data[data['ID'] == row['ID']][sort_value].values[0]
    return np.NaN
# Applying the above function for both sort values 0 and 1.
df['total_0'] = df.apply(lambda row: filter_count(data_total, row, 0), axis=1, result_type='expand')
df['total_1'] = df.apply(lambda row: filter_count(data_total, row, 1), axis=1, result_type='expand')
This leads to:
ID Date_Time order sort total_1 total_0
0 1 2010-01-01 12:01:00 2 1 11.0 NaN
1 1 2010-01-01 01:27:33 4 1 11.0 NaN
2 1 2010-04-02 12:01:00 5 1 11.0 NaN
3 1 2010-04-01 07:24:00 6 0 NaN 6.0
4 2 2011-01-01 12:01:00 7 0 NaN 15.0
5 2 2011-01-01 01:27:33 8 0 NaN 15.0
6 3 2013-01-01 12:01:00 9 1 9.0 NaN
7 4 2014-01-01 12:01:00 2 1 2.0 NaN
8 4 2014-01-01 01:27:33 3 0 NaN 3.0
9 5 2015-01-01 01:27:33 5 0 NaN 5.0
10 6 2016-01-01 01:27:33 6 1 6.0 NaN
11 6 2011-01-01 01:28:00 8 0 NaN 8.0
Now we can apply the same logic to the date, except that Date_Time also contains hours, minutes and seconds, which we strip off first:
# Since we are interested in a per-day basis, we remove the hour/minute/second part
df['order_day'] = pd.to_datetime(df['Date_Time']).dt.strftime('%Y/%m/%d')
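A side note, not required for the approach: if you prefer to keep a datetime dtype instead of a formatted string, dt.normalize() does the same job.
# Equivalent alternative that keeps datetime64 dtype instead of a string:
df['order_day'] = pd.to_datetime(df['Date_Time']).dt.normalize()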
Now applying the same trick as above, we create a new pivot table, based on the 'ID' and 'order_day':
data_date = pd.pivot_table(df, values='order', index=['ID', 'order_day'], columns=['sort'], aggfunc=np.sum)
data_date.reset_index(inplace=True)
Which is:
sort ID order_day 0 1
0 1 2010/01/01 NaN 6.0
1 1 2010/04/01 6.0 NaN
2 1 2010/04/02 NaN 5.0
3 2 2011/01/01 15.0 NaN
4 3 2013/01/01 NaN 9.0
5 4 2014/01/01 3.0 2.0
6 5 2015/01/01 5.0 NaN
7 6 2011/01/01 8.0 NaN
Writing a second function to fill in the correct value based on 'ID' and 'date':
def filter_date(data, row, sort_value):
    """ Select the sum that belongs to the correct ID, day and sort combination. """
    if row['sort'] == sort_value:
        return data[(data['ID'] == row['ID']) & (data['order_day'] == row['order_day'])][sort_value].values[0]
    return np.NaN
# Applying the above function for both sort values 0 and 1.
df['date_1'] = df.apply(lambda row: filter_date(data_date, row, 1), axis=1, result_type='expand')
df['date_0'] = df.apply(lambda row: filter_date(data_date, row, 0), axis=1, result_type='expand')
Now we only have to drop the temporary column 'order_day':
df.drop(labels=['order_day'], axis=1, inplace=True)
And the final answer becomes:
ID Date_Time order sort total_1 total_0 date_0 date_1
0 1 2010-01-01 12:01:00 2 1 11.0 NaN NaN 6.0
1 1 2010-01-01 01:27:33 4 1 11.0 NaN NaN 6.0
2 1 2010-04-02 12:01:00 5 1 11.0 NaN NaN 5.0
3 1 2010-04-01 07:24:00 6 0 NaN 6.0 6.0 NaN
4 2 2011-01-01 12:01:00 7 0 NaN 15.0 15.0 NaN
5 2 2011-01-01 01:27:33 8 0 NaN 15.0 15.0 NaN
6 3 2013-01-01 12:01:00 9 1 9.0 NaN NaN 9.0
7 4 2014-01-01 12:01:00 2 1 2.0 NaN NaN 2.0
8 4 2014-01-01 01:27:33 3 0 NaN 3.0 3.0 NaN
9 5 2015-01-01 01:27:33 5 0 NaN 5.0 5.0 NaN
10 6 2016-01-01 01:27:33 6 1 6.0 NaN NaN 6.0
11 6 2011-01-01 01:28:00 8 0 NaN 8.0 8.0 NaN
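As an aside, a more compact way to get the same four columns is groupby().transform combined with Series.where. This is only a sketch built on the question's frame (there called data), with column names following the wording of the question:
import pandas as pd
# data is the DataFrame constructed in the question.
data['Date'] = pd.to_datetime(data['Date_Time']).dt.normalize()  # temporary per-day key
total = data.groupby(['ID', 'sort'])['order'].transform('sum')
per_day = data.groupby(['ID', 'Date', 'sort'])['order'].transform('sum')
data['sum_order_total_1'] = total.where(data['sort'] == 1)
data['sum_order_total_0'] = total.where(data['sort'] == 0)
data['count_order_date_1'] = per_day.where(data['sort'] == 1)
data['count_order_date_0'] = per_day.where(data['sort'] == 0)
data = data.drop(columns='Date')
transform('sum') broadcasts each group's sum back to every row of the group, and where() blanks out the rows that belong to the other sort value, which reproduces the NaN pattern of the tables above.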
Related
How to avoid bfill or ffill when calculating pct_change with NaNs
For a df like below, I use pct_change() to calculate the rolling percentage changes:
price = [np.NaN, 10, 13, np.NaN, np.NaN, 9]
df = pd.DataFrame(price, columns=['price'])
df
Out[75]:
   price
0    NaN
1   10.0
2   13.0
3    NaN
4    NaN
5    9.0
But I get these unexpected results:
df.price.pct_change(periods=1, fill_method='bfill')
Out[76]:
0         NaN
1    0.000000
2    0.300000
3   -0.307692
4    0.000000
5    0.000000
Name: price, dtype: float64
df.price.pct_change(periods=1, fill_method='pad')
Out[77]:
0         NaN
1         NaN
2    0.300000
3    0.000000
4    0.000000
5   -0.307692
Name: price, dtype: float64
df.price.pct_change(periods=1, fill_method='ffill')
Out[78]:
0         NaN
1         NaN
2    0.300000
3    0.000000
4    0.000000
5   -0.307692
Name: price, dtype: float64
I would like the result to be NaN whenever a NaN is involved in the calculation, instead of the values being filled forward or backward first and then used. How can I achieve this? Thanks. The expected result:
0         NaN
1         NaN
2    0.300000
3         NaN
4         NaN
5         NaN
Name: price, dtype: float64
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html
Maybe you can compute the pct manually with diff and shift:
period = 1
pct = df.price.diff().div(df.price.shift(period))
print(pct)
# Output
0    NaN
1    NaN
2    0.3
3    NaN
4    NaN
5    NaN
Name: price, dtype: float64
Update: you can pass fill_method=None
period = 1
pct = df.price.pct_change(periods=period, fill_method=None)
print(pct)
# Output
0    NaN
1    NaN
2    0.3
3    NaN
4    NaN
5    NaN
Name: price, dtype: float64
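For reference, a self-contained sketch of both routes; recent pandas releases warn about the implicit filling in pct_change anyway, so fill_method=None is also the forward-compatible spelling:
import numpy as np
import pandas as pd

price = [np.nan, 10, 13, np.nan, np.nan, 9]
df = pd.DataFrame(price, columns=['price'])
# Manual route: nothing is filled, so a NaN on either side of the division propagates.
manual = df['price'].diff().div(df['price'].shift(1))
# Built-in route: fill_method=None switches the implicit forward-fill off.
builtin = df['price'].pct_change(periods=1, fill_method=None)
print(manual)
print(builtin)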
Replace only leading NaN values in Pandas dataframe
I have a dataframe of time series data, in which data reporting starts at different times (columns) for different observation units (rows). Prior to the first reported datapoint for each unit, the dataframe contains NaN values, e.g.
     0    1   2    3   4 ...
A  NaN  NaN   4    5   6 ...
B  NaN    7   8  NaN  10 ...
C  NaN    2  11   24  17 ...
I want to replace the leading (left-side) NaN values with 0, but only the leading ones (i.e. leaving the internal missing ones as NaN). So the result on the example above would be:
     0    1   2    3   4 ...
A    0    0   4    5   6 ...
B    0    7   8  NaN  10 ...
C    0    2  11   24  17 ...
(Note the retained NaN for row B, col 3.)
I could iterate through the dataframe row by row, identify the index of the first non-NaN value in each row, and replace everything left of that with 0. But is there a way to do this as a whole-array operation?
notna + cumsum by rows, cells with zeros are leading NaN:
df[df.notna().cumsum(1) == 0] = 0
df
     0    1   2     3   4
A  0.0  0.0   4   5.0   6
B  0.0  7.0   8   NaN  10
C  0.0  2.0  11  24.0  17
Here is another way using cumprod() and apply():
s = df.isna().cumprod(axis=1).sum(axis=1)
df.apply(lambda x: x.fillna(0, limit=s.loc[x.name]), axis=1)
Output:
     0    1     2     3     4
A  0.0  0.0   4.0   5.0   6.0
B  0.0  7.0   8.0   NaN  10.0
C  0.0  2.0  11.0  24.0  17.0
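A self-contained sketch of the first approach, using DataFrame.mask instead of boolean assignment (equivalent result; the example frame mirrors the one in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [np.nan, np.nan, np.nan],
                   1: [np.nan, 7, 2],
                   2: [4, 8, 11],
                   3: [5, np.nan, 24],
                   4: [6, 10, 17]}, index=list('ABC'))
# A cell is a leading NaN exactly when no non-NaN value has appeared yet in its row,
# i.e. the row-wise cumulative count of non-NaN cells is still 0.
leading = df.notna().cumsum(axis=1) == 0
print(df.mask(leading, 0))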
Drop columns that have a header but all rows are empty Python 3 & Pandas
I just could not figure this one out:
df.dropna(axis=1, how="all").dropna(axis=0, how="all")
All headers have data. How can I exclude the headers from a df.dropna(how="all") command? I am afraid this is going to be trivial, but help me out guys. Thanks, Levi
Okay, as I understand it, what you want is as follows:
drop any column where all rows contain NaN
drop any row where all values are NaN
So for example, given a dataframe df like:
   Id  Col1 Col2 Col3  Col4
0   1  25.0    A  NaN     6
1   2  15.0    B  NaN     7
2   3  23.0    C  NaN     8
3   4   5.0    D  NaN     9
4   5   NaN    E  NaN    10
convert the dataframe by:
df.dropna(axis=1, how="all", inplace=True)
df.dropna(axis=0, how="all", inplace=True)
which yields:
   Id  Col1 Col2  Col4
0   1  25.0    A     6
1   2  15.0    B     7
2   3  23.0    C     8
3   4   5.0    D     9
4   5   NaN    E    10
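To make the point about headers explicit, a small sketch (not part of the original answer): dropna only inspects cell values, so a column that has a header but no data is still dropped, and the header alone never counts as data.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2], 'Col3': [np.nan, np.nan]})
# 'Col3' has a header but only NaN values, so how="all" removes it.
print(df.dropna(axis=1, how="all"))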
Replace multiple columns' NaNs with other columns' values in Pandas
Given a dataframe as follows:
      date city  gdp  gdp1  gdp2  gross domestic product  pop  pop1  pop2
0  2001-03   bj  3.0   NaN   NaN                     NaN  7.0   NaN   NaN
1  2001-06   bj  5.0   NaN   NaN                     NaN  6.0   6.0   NaN
2  2001-09   bj  8.0   NaN   NaN                     8.0  4.0   4.0   NaN
3  2001-12   bj  7.0   NaN   7.0                     NaN  2.0   NaN   2.0
4  2001-03   sh  4.0   4.0   NaN                     NaN  3.0   NaN   NaN
5  2001-06   sh  5.0   NaN   NaN                     5.0  5.0   5.0   NaN
6  2001-09   sh  9.0   NaN   NaN                     NaN  4.0   4.0   NaN
7  2001-12   sh  3.0   3.0   NaN                     NaN  6.0   NaN   6.0
I want to replace NaNs from gdp and pop with values of gdp1, gdp2, gross domestic product and pop1, pop2 respectively.
      date city  gdp  pop
0  2001-03   bj    3    7
1  2001-06   bj    5    6
2  2001-09   bj    8    4
3  2001-12   bj    7    2
4  2001-03   sh    4    3
5  2001-06   sh    5    5
6  2001-09   sh    9    4
7  2001-12   sh    3    6
The following code works, but I wonder if it's possible to make it more concise, since I have many similar columns?
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp1']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp2']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gross domestic product']
df.loc[df['pop'].isnull(), 'pop'] = df['pop1']
df.loc[df['pop'].isnull(), 'pop'] = df['pop2']
df.drop(['gdp1', 'gdp2', 'gross domestic product', 'pop1', 'pop2'], axis=1)
The idea is to back-fill the missing values across the columns selected by DataFrame.filter. If more than one non-missing value per group is possible, this prioritizes the columns on the left; if you change .bfill(axis=1).iloc[:, 0] to .ffill(axis=1).iloc[:, -1], the columns on the right are prioritized instead:
#if the first column is gdp, pop
df['gdp'] = df.filter(like='gdp').bfill(axis=1)['gdp']
df['pop'] = df.filter(like='pop').bfill(axis=1)['pop']
#if any column may come first
df['gdp'] = df.filter(like='gdp').bfill(axis=1).iloc[:, 0]
df['pop'] = df.filter(like='pop').bfill(axis=1).iloc[:, 0]
But if at most one non-missing value per row is possible, you can use max, min, ...:
df['gdp'] = df.filter(like='gdp').max(axis=1)
df['pop'] = df.filter(like='pop').max(axis=1)
If you need to specify the column names by list:
gdp_c = ['gdp1', 'gdp2', 'gross domestic product']
pop_c = ['pop1', 'pop2']
df['gdp'] = df[gdp_c].bfill(axis=1).iloc[:, 0]
df['pop'] = df[pop_c].bfill(axis=1).iloc[:, 0]
df = df[['date', 'city', 'gdp', 'pop']]
print(df)
      date city  gdp  pop
0  2001-03   bj  3.0  7.0
1  2001-06   bj  5.0  6.0
2  2001-09   bj  8.0  4.0
3  2001-12   bj  7.0  2.0
4  2001-03   sh  4.0  3.0
5  2001-06   sh  5.0  5.0
6  2001-09   sh  9.0  4.0
7  2001-12   sh  3.0  6.0
Stack two pandas dataframes with different columns, keeping source dataframe as column, also
I have a couple of toy dataframes I can stack using df.append, but I need to keep the source dataframes as a column, as well. I can't seem to find anything about how to do that. Here's what I do have:
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8], "G": [7,8,9,10]})
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"], "F": [7,8,10,12]})
d2005
   A  B  C   G
0  1  2  3   7
1  2  4  5   8
2  3  5  7   9
3  4  6  8  10
d2006
   A  B  D   F
0  2  3  a   7
1  1  1  c   8
2  4  5  d  10
3  5  6  e  12
Then I can stack them like this:
d_combined = d2005.append(d2006, ignore_index=True, sort=True)
d_combined
   A  B    C    D     F     G
0  1  2  3.0  NaN   NaN   7.0
1  2  4  5.0  NaN   NaN   8.0
2  3  5  7.0  NaN   NaN   9.0
3  4  6  8.0  NaN   NaN  10.0
4  2  3  NaN    a   7.0   NaN
5  1  1  NaN    c   8.0   NaN
6  4  5  NaN    d  10.0   NaN
7  5  6  NaN    e  12.0   NaN
But what I really need is another column with the source dataframe added to the right end of d_combined. Something like this:
   A  B    C    D     G     F   From
0  1  2  3.0  NaN   7.0   NaN  d2005
1  2  4  5.0  NaN   8.0   NaN  d2005
2  3  5  7.0  NaN   9.0   NaN  d2005
3  4  6  8.0  NaN  10.0   NaN  d2005
4  2  3  NaN    a   NaN   7.0  d2006
5  1  1  NaN    c   NaN   8.0  d2006
6  4  5  NaN    d   NaN  10.0  d2006
7  5  6  NaN    e   NaN  12.0  d2006
Hopefully someone has a quick trick they can share. Thanks.
This gets what you want but there should be a more elegant way:
df_list = [d2005, d2006]
name_list = ['2005', '2006']
for df, name in zip(df_list, name_list):
    df['from'] = name
Then
d_combined = d2005.append(d2006, ignore_index=True)
d_combined
   A  B    C    D     F     G  from
0  1  2  3.0  NaN   NaN   7.0  2005
1  2  4  5.0  NaN   NaN   8.0  2005
2  3  5  7.0  NaN   NaN   9.0  2005
3  4  6  8.0  NaN   NaN  10.0  2005
4  2  3  NaN    a   7.0   NaN  2006
5  1  1  NaN    c   8.0   NaN  2006
6  4  5  NaN    d  10.0   NaN  2006
7  5  6  NaN    e  12.0   NaN  2006
Alternatively, you can set df.name at the time of creation of the df and use it in the for loop.
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8], "G": [7,8,9,10]})
d2005.name = 2005
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"], "F": [7,8,10,12]})
d2006.name = 2006
df_list = [d2005, d2006]
for df in df_list:
    df['from'] = df.name
I believe this can be simply achieved by adding the From column to the original dataframes themselves. So effectively,
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8], "G": [7,8,9,10]})
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"], "F": [7,8,10,12]})
Then,
d2005['From'] = 'd2005'
d2006['From'] = 'd2006'
And then you append,
d_combined = d2005.append(d2006, ignore_index=True, sort=True)
gives you something like this:
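Another route worth noting, since DataFrame.append is deprecated in newer pandas versions: pd.concat with keys= records the source frame in an index level, which can then be turned into a column. A sketch (the label names and column order here are my own choices):
import pandas as pd

d2005 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 4, 5, 6],
                      "C": [3, 5, 7, 8], "G": [7, 8, 9, 10]})
d2006 = pd.DataFrame({"A": [2, 1, 4, 5], "B": [3, 1, 5, 6],
                      "D": ["a", "c", "d", "e"], "F": [7, 8, 10, 12]})
# keys= adds an outer index level holding the source label for each frame.
combined = (pd.concat([d2005, d2006], keys=['d2005', 'd2006'], sort=True)
              .rename_axis(['From', None])   # name the new index level
              .reset_index(level='From')     # move it into a regular column
              .reset_index(drop=True))
print(combined)
The source label ends up as the first column here; reorder the columns afterwards if it should sit at the end.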