I have the following dataframe:
import pandas as pd

data = pd.DataFrame({
'ID': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6],
'Date_Time': ['2010-01-01 12:01:00', '2010-01-01 01:27:33',
'2010-04-02 12:01:00', '2010-04-01 07:24:00', '2011-01-01 12:01:00',
'2011-01-01 01:27:33', '2013-01-01 12:01:00', '2014-01-01 12:01:00',
'2014-01-01 01:27:33', '2015-01-01 01:27:33', '2016-01-01 01:27:33',
'2011-01-01 01:28:00'],
'order': [2, 4, 5, 6, 7, 8, 9, 2, 3, 5, 6, 8],
'sort': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})
And I would like to get the following columns:
1- sum_order_total_1, which sums up the values in column order grouped by column sort, in this case for the value 1 in column sort for each ID, and returns NaN for the zeros in column sort
2- sum_order_total_0, which sums up the values in column order grouped by column sort, in this case for the value 0 in column sort for each ID, and returns NaN for the ones in column sort
3- count_order_date_1, which sums up the values in column order for each ID grouped by column Date_Time for 1 in column sort, and returns NaN for 0 in column sort
4- count_order_date_0, which sums up the values in column order for each ID grouped by column Date_Time for 0 in column sort, and returns NaN for 1 in column sort
The expected results should look like the attached photo here:
The problem with groupby (and pd.pivot_table) is that they only do half of the job: they give you the numbers, but not in the format that you want. To finalize the format you can use apply.
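For example, a plain groupby gives the right numbers, but keyed by the group labels instead of aligned with the original rows:
data.groupby(['ID', 'sort'])['order'].sum()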
For the total counts I used:
import numpy as np
import pandas as pd

df = data  # the frame from the question
# Retrieve your data, similar to the groupby query you provided.
data_total = pd.pivot_table(df, values='order', index=['ID'], columns=['sort'], aggfunc=np.sum)
data_total.reset_index(inplace=True)
Which results in the table:
sort ID 0 1
0 1 6.0 11.0
1 2 15.0 NaN
2 3 NaN 9.0
3 4 3.0 2.0
4 5 5.0 NaN
5 6 8.0 6.0
Now, using this as a lookup table (indexed by 'ID', with columns 0 and 1 for sort), we can write a small function that fills in the right value:
def filter_count(data, row, sort_value):
    """Select the count that belongs to the correct ID and sort combination."""
    if row['sort'] == sort_value:
        return data[data['ID'] == row['ID']][sort_value].values[0]
    return np.nan
# Applying the above function for both sort values 0 and 1.
df['total_1'] = df.apply(lambda row: filter_count(data_total, row, 1), axis=1)
df['total_0'] = df.apply(lambda row: filter_count(data_total, row, 0), axis=1)
This leads to:
ID Date_Time order sort total_1 total_0
0 1 2010-01-01 12:01:00 2 1 11.0 NaN
1 1 2010-01-01 01:27:33 4 1 11.0 NaN
2 1 2010-04-02 12:01:00 5 1 11.0 NaN
3 1 2010-04-01 07:24:00 6 0 NaN 6.0
4 2 2011-01-01 12:01:00 7 0 NaN 15.0
5 2 2011-01-01 01:27:33 8 0 NaN 15.0
6 3 2013-01-01 12:01:00 9 1 9.0 NaN
7 4 2014-01-01 12:01:00 2 1 2.0 NaN
8 4 2014-01-01 01:27:33 3 0 NaN 3.0
9 5 2015-01-01 01:27:33 5 0 NaN 5.0
10 6 2016-01-01 01:27:33 6 1 6.0 NaN
11 6 2011-01-01 01:28:00 8 0 NaN 8.0
Now we can apply the same logic to the date, except that the date also contains hours, minutes and seconds, which can be stripped out using:
# Since we are interested in a per-day basis, we remove the hour/minute/second part
df['order_day'] = pd.to_datetime(df['Date_Time']).dt.strftime('%Y/%m/%d')
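Any representation that is constant within a day would do; for instance, dt.date is an equally valid choice here:
df['order_day'] = pd.to_datetime(df['Date_Time']).dt.date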
Now applying the same trick as above, we create a new pivot table, based on the 'ID' and 'order_day':
data_date = pd.pivot_table(df, values='order', index=['ID', 'order_day'], columns=['sort'], aggfunc=np.sum)
data_date.reset_index(inplace=True)
Which is:
sort ID order_day 0 1
0 1 2010/01/01 NaN 6.0
1 1 2010/04/01 6.0 NaN
2 1 2010/04/02 NaN 5.0
3 2 2011/01/01 15.0 NaN
4 3 2013/01/01 NaN 9.0
5 4 2014/01/01 3.0 2.0
6 5 2015/01/01 5.0 NaN
7 6 2011/01/01 8.0 NaN
8 6 2016/01/01 NaN 6.0
Writing a second function to fill in the correct value based on 'ID' and 'date':
def filter_date(data, row, sort_value):
    """Select the sum that belongs to the correct ID, day and sort combination."""
    if row['sort'] == sort_value:
        return data[(data['ID'] == row['ID']) & (data['order_day'] == row['order_day'])][sort_value].values[0]
    return np.nan
# Applying the above function for both sort values 0 and 1.
df['date_0'] = df.apply(lambda row: filter_date(data_date, row, 0), axis=1)
df['date_1'] = df.apply(lambda row: filter_date(data_date, row, 1), axis=1)
Now we only have to drop the temporary column 'order_day':
df.drop(labels=['order_day'], axis=1, inplace=True)
And the final answer becomes:
ID Date_Time order sort total_1 total_0 date_0 date_1
0 1 2010-01-01 12:01:00 2 1 11.0 NaN NaN 6.0
1 1 2010-01-01 01:27:33 4 1 11.0 NaN NaN 6.0
2 1 2010-04-02 12:01:00 5 1 11.0 NaN NaN 5.0
3 1 2010-04-01 07:24:00 6 0 NaN 6.0 6.0 NaN
4 2 2011-01-01 12:01:00 7 0 NaN 15.0 15.0 NaN
5 2 2011-01-01 01:27:33 8 0 NaN 15.0 15.0 NaN
6 3 2013-01-01 12:01:00 9 1 9.0 NaN NaN 9.0
7 4 2014-01-01 12:01:00 2 1 2.0 NaN NaN 2.0
8 4 2014-01-01 01:27:33 3 0 NaN 3.0 3.0 NaN
9 5 2015-01-01 01:27:33 5 0 NaN 5.0 5.0 NaN
10 6 2016-01-01 01:27:33 6 1 6.0 NaN NaN 6.0
11 6 2011-01-01 01:28:00 8 0 NaN 8.0 8.0 NaN
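For reference, the same four columns can also be built more compactly with groupby.transform combined with where, which keeps everything aligned to the original rows. A minimal sketch on the question's data:
import pandas as pd

df = data.copy()
day = pd.to_datetime(df['Date_Time']).dt.date

# per-(ID, sort) totals and per-(ID, day, sort) totals, broadcast back to each row
total = df.groupby(['ID', 'sort'])['order'].transform('sum')
per_day = df.groupby(['ID', day, 'sort'])['order'].transform('sum')

# keep the sum only on rows with the matching sort value, NaN elsewhere
df['total_1'] = total.where(df['sort'].eq(1))
df['total_0'] = total.where(df['sort'].eq(0))
df['date_1'] = per_day.where(df['sort'].eq(1))
df['date_0'] = per_day.where(df['sort'].eq(0))
This avoids the per-row lookups done by apply, which scan the pivot table once per row.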
import pandas as pd
data = {'col1': [1, 3, 3, 1, 2, 3, 2, 2]}
df = pd.DataFrame(data, columns=['col1'])
print(df)
col1
0 1
1 3
2 3
3 1
4 2
5 3
6 2
7 2
Expected result:
col1 newCol1
0 1 1
1 3 3
2 3 NaN
3 1 1
4 2 2
5 3 3
6 2 2
7 2 NaN
Try where combined with shift:
df['col2'] = df.col1.where(df.col1.ne(df.col1.shift()))
df
Out[191]:
col1 col2
0 1 1.0
1 3 3.0
2 3 NaN
3 1 1.0
4 2 2.0
5 3 3.0
6 2 2.0
7 2 NaN
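The intermediate condition makes the trick visible: ne against the shifted column flags the rows whose value differs from the previous row, and where keeps only those values, replacing the repeats with NaN:
mask = df.col1.ne(df.col1.shift())  # True where col1 differs from the previous row
print(mask.tolist())
# [True, True, False, True, True, True, True, False]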
The code snippet below:
import pandas as pd
df = pd.DataFrame(
    {'type': ['A', 'B', 'A', 'C', 'C', 'A'],
     'value': [5, 6, 7, 7, 9, 1]}
)
print(df)
Gives:
type value
0 A 5
1 B 6
2 A 7
3 C 7
4 C 9
5 A 1
I want this:
pd.DataFrame(
{'A': [5, 0, 7, 0, 0, 1],
'B': [0, 6, 0, 0, 0, 0],
'C': [0, 0, 0, 7, 9, 0]}
)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
I did try using for loops, but I am striving for something more efficient. Any help would be great!
Use get_dummies and multiply by the value column:
final_df=pd.get_dummies(df['type']).mul(df['value'],axis=0)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
Use Series.unstack for reshape:
df = df.set_index('type', append=True)['value'].unstack(fill_value=0).rename_axis(None, axis=1)
print (df)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
Or a numpy solution: multiply the indicator DataFrame created by get_dummies by the values as a numpy array:
df = pd.get_dummies(df['type']) * df['value'].values[:, None]
print (df)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
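Another equivalent route, as a sketch, is DataFrame.pivot against the existing row index, filling the gaps with 0:
df = df.pivot(columns='type', values='value').fillna(0).astype(int).rename_axis(None, axis=1)
print(df)
This prints the same A/B/C frame as above.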
I have a pandas dataframe as below:
flag a b c
0 1 5 1 3
1 1 2 1 3
2 1 3 0 3
3 1 4 0 3
4 1 5 5 3
5 1 6 0 3
6 1 7 0 3
7 2 6 1 4
8 2 2 1 4
9 2 3 1 4
10 2 4 1 4
I want to create a column 'd' based on the below condition:
1) For first row of each flag, if a>c, then d = b, else d = nan
2) For non-first row of each flag, if (a>c) & ((previous row of d is nan) | (b > previous row of d)), d=b, else d = prev row of d
I am expecting the below output:
flag a b c d
0 1 5 1 3 1
1 1 2 1 3 1
2 1 3 0 3 1
3 1 4 0 3 1
4 1 5 5 3 5
5 1 6 0 3 5
6 1 7 0 3 5
7 2 6 1 4 1
8 2 2 1 4 1
9 2 3 1 4 1
10 2 4 1 4 1
Here's how I would translate your logic:
import numpy as np

df['d'] = np.nan
# first row of each flag
s = df.flag.ne(df.flag.shift())
# where a > c
a_gt_c = df['a'].gt(df['c'])
# fill the first rows where a > c
df.loc[s & a_gt_c, 'd'] = df['b']
# mask for the second fill
mask = ((~s)                                # not a first row
        & a_gt_c                            # a > c
        & (df['d'].shift().isna()           # previous d is NaN
           | df['b'].gt(df['d']).shift()))  # or b > previous d
# fill those values:
df.loc[mask, 'd'] = df['b']
# ffill for the rest
df['d'] = df['d'].ffill()
Output:
flag a b c d
0 1 5 1 3 1.0
1 1 2 1 3 1.0
2 1 3 0 3 1.0
3 1 4 0 3 0.0
4 1 5 5 3 5.0
5 1 6 0 3 0.0
6 1 7 0 3 0.0
7 2 6 1 4 1.0
8 2 2 1 4 1.0
9 2 3 1 4 1.0
10 2 4 1 4 1.0
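For comparison, a literal row-by-row translation of the stated rules, as a plain loop sketch, carries d sequentially and therefore reproduces the expected column from the question (the 5 propagates through rows 4-6):
import numpy as np
import pandas as pd

def fill_d(group):
    out, prev = [], np.nan
    for a, b, c in zip(group['a'], group['b'], group['c']):
        # d = b when a > c and (previous d is NaN or b > previous d);
        # otherwise carry the previous d forward
        if a > c and (np.isnan(prev) or b > prev):
            prev = b
        out.append(prev)
    return pd.Series(out, index=group.index)

df['d_loop'] = df.groupby('flag', group_keys=False).apply(fill_d)
The vectorized version evaluates "previous d" before the pass fills it in, which is why the two outputs differ on rows 3, 5 and 6.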
I have a table like this, and I want to draw a histogram of how often each of the values 0, 1, 2 and 3 occurs across the whole table. Is there a way to do it?
You can apply melt and then hist.
For example:
df
A B C D
0 3 1 1 1
1 3 3 2 2
2 1 0 1 1
3 3 2 3 0
4 3 1 1 3
5 3 0 3 1
6 3 1 1 0
7 1 3 3 0
8 3 1 3 3
9 3 3 1 3
df.melt()['value'].value_counts()
3 18
1 14
0 5
2 3
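To actually draw it, the counts can be plotted directly; a minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

# bar chart of how often each value 0-3 occurs across the whole table
df.melt()['value'].value_counts().sort_index().plot.bar()
plt.show()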