How to perform conditional dataframe operations? - python-3.x

Given the DataFrame A:
import pandas as pd

A = pd.DataFrame([[1, 5, 2, 1, 2], [2, 4, 4, 1, 2], [3, 3, 1, 1, 2], [4, 2, 2, 3, 0],
                  [5, 1, 4, 3, -4], [1, 5, 2, 3, -20], [2, 4, 4, 2, 0], [3, 3, 1, 2, -1],
                  [4, 2, 2, 2, 0], [5, 1, 4, 2, -2]],
                 columns=['a', 'b', 'c', 'd', 'e'],
                 index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
How can I create a column 'f' that holds the last value in column 'e' before a change in value in column 'd', and keeps that value until the next change in 'd'? The output would be:
a b c d e f
1 1 5 2 1 2 nan
2 2 4 4 1 2 nan
3 3 3 1 1 2 nan
4 4 2 2 3 0 2
5 5 1 4 3 -4 2
6 1 5 2 3 -20 2
7 2 4 4 2 0 -20
8 3 3 1 2 -1 -20
9 4 2 2 2 0 -20
10 5 1 4 2 -2 -20
Edit: @Noobie presented a solution below that, when applied to real data, breaks down when column 'd' contains a value smaller than the previous one.

I think we should offer better native support for dealing with contiguous groups, but until then you can use the compare-cumsum-groupby pattern:
g = (A["d"] != A["d"].shift()).cumsum()
A["f"] = A["e"].groupby(g).last().shift().loc[g].values
which gives me
In [41]: A
Out[41]:
a b c d e f
1 1 5 2 1 2 NaN
2 2 4 4 1 2 NaN
3 3 3 1 1 2 NaN
4 4 2 2 2 0 2.0
5 5 1 4 2 -4 2.0
6 1 5 2 2 -20 2.0
7 2 4 4 3 0 -20.0
8 3 3 1 3 -1 -20.0
9 4 2 2 3 0 -20.0
10 5 1 4 3 -2 -20.0
This works because g is a count corresponding to each contiguous group of d values. Note that in this case, using the example you posted, g is the same as column "d", but that needn't be the case. Once we have g, we can use it to group column e:
In [55]: A["e"].groupby(g).last()
Out[55]:
d
1 2
2 -20
3 -2
Name: e, dtype: int64
and then
In [57]: A["e"].groupby(g).last().shift()
Out[57]:
d
1 NaN
2 2.0
3 -20.0
Name: e, dtype: float64
In [58]: A["e"].groupby(g).last().shift().loc[g]
Out[58]:
d
1 NaN
1 NaN
1 NaN
2 2.0
2 2.0
2 2.0
3 -20.0
3 -20.0
3 -20.0
3 -20.0
Name: e, dtype: float64
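The same broadcast can also be written with map on the run labels, which avoids the .loc[g].values step (a sketch, equivalent to the line above):
g = (A["d"] != A["d"].shift()).cumsum()           # label contiguous runs of 'd'
A["f"] = g.map(A["e"].groupby(g).last().shift())  # last 'e' of the previous run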

Easy, my friend. Unleash the POWER OF PANDAS!
A.sort_values(by='d', inplace=True)
A['lag'] = A.e.shift(1)
A['output'] = A.groupby('d').lag.transform(lambda x: x.iloc[0])
A
Out[57]:
a b c d e lag output
1 1 5 2 1 2 NaN NaN
2 2 4 4 1 2 2.0 NaN
3 3 3 1 1 2 2.0 NaN
4 4 2 2 2 0 2.0 2.0
5 5 1 4 2 -4 0.0 2.0
6 1 5 2 2 -20 -4.0 2.0
7 2 4 4 3 0 -20.0 -20.0
8 3 3 1 3 -1 0.0 -20.0
9 4 2 2 3 0 -1.0 -20.0
10 5 1 4 3 -2 0.0 -20.0
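For reference, a minimal sketch of why the sort-based approach above misfires on the question's data, where 'd' runs 1, 1, 1, 3, 3, 3, 2, 2, 2, 2 (B is a throwaway copy introduced here, starting from A as originally defined, before any in-place sort):
B = A.sort_values(by='d')   # the d == 2 rows now sit between the d == 1 and d == 3 rows
B['lag'] = B.e.shift(1)
B['output'] = B.groupby('d').lag.transform(lambda x: x.iloc[0])
# The d == 2 group now picks up the last 'e' of the d == 1 rows (2) instead of the last
# 'e' before the change in the original row order (-20), and the d == 3 group picks up
# an 'e' from a d == 2 row instead of 2.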

Related

Grouping using groupby based on certain conditions

I have the following dataframe:
data = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6],
    'Date_Time': ['2010-01-01 12:01:00', '2010-01-01 01:27:33',
                  '2010-04-02 12:01:00', '2010-04-01 07:24:00', '2011-01-01 12:01:00',
                  '2011-01-01 01:27:33', '2013-01-01 12:01:00', '2014-01-01 12:01:00',
                  '2014-01-01 01:27:33', '2015-01-01 01:27:33', '2016-01-01 01:27:33',
                  '2011-01-01 01:28:00'],
    'order': [2, 4, 5, 6, 7, 8, 9, 2, 3, 5, 6, 8],
    'sort': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})
And I would like to get the following columns:
1- sum_order_total_1, which sums up the values in column order grouped by column sort (for value 1 in sort) for each ID, and returns NaNs for the rows where sort is 0
2- sum_order_total_0, which sums up the values in column order grouped by column sort (for value 0 in sort) for each ID, and returns NaNs for the rows where sort is 1
3- count_order_date_1, which sums up the values in column order of each ID grouped by column Date_Time for 1 in column sort, and returns NaNs for the rows where sort is 0
4- count_order_date_0, which sums up the values in column order of each ID grouped by column Date_Time for 0 in column sort, and returns NaNs for the rows where sort is 1
The expected results should look like the attached photo:
The problem with groupby (and pd.pivot_table) is that they only do half of the job: they give you the numbers, but not in the format you want. To get the final format you can use apply.
For the total counts I used (df below refers to the question's data frame):
import numpy as np

# Retrieve your data, similar to the groupby query you provided.
data_total = pd.pivot_table(df, values='order', index=['ID'], columns=['sort'], aggfunc=np.sum)
data_total.reset_index(inplace=True)
Which results in the table:
sort ID 0 1
0 1 6.0 11.0
1 2 15.0 NaN
2 3 NaN 9.0
3 4 3.0 2.0
4 5 5.0 NaN
5 6 8.0 6.0
Now, using this as a lookup table (indexed by 'ID', with columns 0 and 1 for sort), we can write a small function that fills in the right value:
def filter_count(data, row, sort_value):
    """ Select the count that belongs to the correct ID and sort combination. """
    if row['sort'] == sort_value:
        return data[data['ID'] == row['ID']][sort_value].values[0]
    return np.NaN
# Applying the above function for both sort values 0 and 1.
df['total_0'] = df.apply(lambda row: filter_count(data_total, row, 0), axis=1, result_type='expand')
df['total_1'] = df.apply(lambda row: filter_count(data_total, row, 1), axis=1, result_type='expand')
This leads to:
ID Date_Time order sort total_1 total_0
0 1 2010-01-01 12:01:00 2 1 11.0 NaN
1 1 2010-01-01 01:27:33 4 1 11.0 NaN
2 1 2010-04-02 12:01:00 5 1 11.0 NaN
3 1 2010-04-01 07:24:00 6 0 NaN 6.0
4 2 2011-01-01 12:01:00 7 0 NaN 15.0
5 2 2011-01-01 01:27:33 8 0 NaN 15.0
6 3 2013-01-01 12:01:00 9 1 9.0 NaN
7 4 2014-01-01 12:01:00 2 1 2.0 NaN
8 4 2014-01-01 01:27:33 3 0 NaN 3.0
9 5 2015-01-01 01:27:33 5 0 NaN 5.0
10 6 2016-01-01 01:27:33 6 1 6.0 NaN
11 6 2011-01-01 01:28:00 8 0 NaN 8.0
Now we can apply the same logic to the date, except that the date also contains hours, minutes and seconds, which can be stripped out first:
# Since we are interested in a per-day basis, we remove the hour/minute/second part
df['order_day'] = pd.to_datetime(df['Date_Time']).dt.strftime('%Y/%m/%d')
Now applying the same trick as above, we create a new pivot table, based on the 'ID' and 'order_day':
data_date = pd.pivot_table(df, values='order', index=['ID', 'order_day'], columns=['sort'], aggfunc=np.sum)
data_date.reset_index(inplace=True)
Which is:
sort ID order_day 0 1
0 1 2010/01/01 NaN 6.0
1 1 2010/04/01 6.0 NaN
2 1 2010/04/02 NaN 5.0
3 2 2011/01/01 15.0 NaN
4 3 2013/01/01 NaN 9.0
5 4 2014/01/01 3.0 2.0
6 5 2015/01/01 5.0 NaN
7 6 2011/01/01 8.0 NaN
Writing a second function to fill in the correct value based on 'ID' and 'order_day':
def filter_date(data, row, sort_value):
    if row['sort'] == sort_value:
        return data[(data['ID'] == row['ID']) & (data['order_day'] == row['order_day'])][sort_value].values[0]
    return np.NaN
# Applying the above function for both sort values 0 and 1.
df['date_1'] = df.apply(lambda row: filter_date(data_date, row, 1), axis=1, result_type='expand')
df['date_0'] = df.apply(lambda row: filter_date(data_date, row, 0), axis=1, result_type='expand')
Now we only have to drop the temporary column 'order_day':
df.drop(labels=['order_day'], axis=1, inplace=True)
And the final answer becomes:
ID Date_Time order sort total_1 total_0 date_0 date_1
0 1 2010-01-01 12:01:00 2 1 11.0 NaN NaN 6.0
1 1 2010-01-01 01:27:33 4 1 11.0 NaN NaN 6.0
2 1 2010-04-02 12:01:00 5 1 11.0 NaN NaN 5.0
3 1 2010-04-01 07:24:00 6 0 NaN 6.0 6.0 NaN
4 2 2011-01-01 12:01:00 7 0 NaN 15.0 15.0 NaN
5 2 2011-01-01 01:27:33 8 0 NaN 15.0 15.0 NaN
6 3 2013-01-01 12:01:00 9 1 9.0 NaN NaN 9.0
7 4 2014-01-01 12:01:00 2 1 2.0 NaN NaN 2.0
8 4 2014-01-01 01:27:33 3 0 NaN 3.0 3.0 NaN
9 5 2015-01-01 01:27:33 5 0 NaN 5.0 5.0 NaN
10 6 2016-01-01 01:27:33 6 1 6.0 NaN NaN 6.0
11 6 2011-01-01 01:28:00 8 0 NaN 8.0 8.0 NaN
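For what it's worth, a more compact route to the same four columns is groupby().transform('sum') combined with where, which avoids the per-row apply (a sketch, assuming df holds the question's data frame):
import pandas as pd

day = pd.to_datetime(df['Date_Time']).dt.normalize()   # per-day key, hours stripped

# Per-ID totals split by sort, broadcast back to every row.
totals = df.groupby(['ID', 'sort'])['order'].transform('sum')
df['sum_order_total_1'] = totals.where(df['sort'].eq(1))
df['sum_order_total_0'] = totals.where(df['sort'].eq(0))

# Per-ID, per-day totals split by sort.
by_day = df.groupby(['ID', day, 'sort'])['order'].transform('sum')
df['count_order_date_1'] = by_day.where(df['sort'].eq(1))
df['count_order_date_0'] = by_day.where(df['sort'].eq(0))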

Replacing constant values with nan

import pandas as pd

data = {'col1': [1, 3, 3, 1, 2, 3, 2, 2]}
df = pd.DataFrame(data, columns=['col1'])
print(df)
col1
0 1
1 3
2 3
3 1
4 2
5 3
6 2
7 2
Expected result:
   col1  newCol1
0     1        1
1     3        3
2     3      NaN
3     1        1
4     2        2
5     3        3
6     2        2
7     2      NaN
Try where combined with shift:
df['col2'] = df.col1.where(df.col1.ne(df.col1.shift()))
df
Out[191]:
col1 col2
0 1 1.0
1 3 3.0
2 3 NaN
3 1 1.0
4 2 2.0
5 3 3.0
6 2 2.0
7 2 NaN
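An equivalent one-liner uses mask, which hides the repeated values instead of keeping the changed ones (a sketch on the same df):
df['col2'] = df['col1'].mask(df['col1'].eq(df['col1'].shift()))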

Replace column values to generate new data frame

The code snippet below:
import pandas as pd
pd.DataFrame(
{'type': ['A', 'B', 'A', 'C', 'C', 'A'],
'value': [5, 6, 7, 7, 9, 1]}
)
Gives:
type value
0 A 5
1 B 6
2 A 7
3 C 7
4 C 9
5 A 1
Want this:
pd.DataFrame(
{'A': [5, 0, 7, 0, 0, 1],
'B': [0, 6, 0, 0, 0, 0],
'C': [0, 0, 0, 7, 9, 0]}
)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
I did try using for loops, but I'm striving for something more efficient. Any help would be great!
Use get_dummies and multiply with the second column:
final_df=pd.get_dummies(df['type']).mul(df['value'],axis=0)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
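Note: in recent pandas versions (2.0+) get_dummies returns boolean dummy columns by default; the multiplication above still yields integers, but you can force the dtype explicitly if you prefer:
final_df = pd.get_dummies(df['type'], dtype=int).mul(df['value'], axis=0)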
Use Series.unstack for reshape:
df = df.set_index('type', append=True)['value'].unstack(fill_value=0).rename_axis(None, axis=1)
print (df)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0
Or a numpy solution: multiply the indicator DataFrame created by get_dummies by the numpy array of values:
df = pd.get_dummies(df['type']) * df['value'].values[:, None]
print (df)
A B C
0 5 0 0
1 0 6 0
2 7 0 0
3 0 0 7
4 0 0 9
5 1 0 0

Creating a variable using conditionals in Python using vectorization

I have a pandas dataframe as below,
flag a b c
0 1 5 1 3
1 1 2 1 3
2 1 3 0 3
3 1 4 0 3
4 1 5 5 3
5 1 6 0 3
6 1 7 0 3
7 2 6 1 4
8 2 2 1 4
9 2 3 1 4
10 2 4 1 4
I want to create a column 'd' based on the conditions below:
1) For the first row of each flag, if a > c, then d = b, else d = nan
2) For non-first rows of each flag, if (a > c) & ((previous row of d is nan) | (b > previous row of d)), then d = b, else d = previous row of d
I am expecting the below output:
flag a b c d
0 1 5 1 3 1
1 1 2 1 3 1
2 1 3 0 3 1
3 1 4 0 3 1
4 1 5 5 3 5
5 1 6 0 3 5
6 1 7 0 3 5
7 2 6 1 4 1
8 2 2 1 4 1
9 2 3 1 4 1
10 2 4 1 4 1
Here's how I would translate your logic:
df['d'] = np.nan
# first row of flag
s = df.flag.ne(df.flag.shift())
# where a > c
a_gt_c = df['a'].gt(df['c'])
# fill the first rows with a > c
df.loc[s & a_gt_c, 'd'] = df['b']
# mask for second fill
mask = ((~s)                               # not first rows
        & a_gt_c                           # a > c
        & (df['d'].shift().isna()          # previous d is null
           | df['b'].gt(df['d']).shift())  # or previous row's b > d
       )
# fill those values:
df.loc[mask, 'd'] = df['b']
# ffill for the rest
df['d'] = df['d'].ffill()
Output:
flag a b c d
0 1 5 1 3 1.0
1 1 2 1 3 1.0
2 1 3 0 3 1.0
3 1 4 0 3 0.0
4 1 5 5 3 5.0
5 1 6 0 3 0.0
6 1 7 0 3 0.0
7 2 6 1 4 1.0
8 2 2 1 4 1.0
9 2 3 1 4 1.0
10 2 4 1 4 1.0
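If you need to reproduce the expected output exactly (the output above differs from it in rows 3, 5 and 6), a plain per-group loop that follows the two stated conditions literally is a handy cross-check (a sketch; fill_d and d_check are names introduced here):
import numpy as np
import pandas as pd

def fill_d(group):
    out = []
    prev = np.nan
    for a, b, c in zip(group['a'], group['b'], group['c']):
        if a > c and (np.isnan(prev) or b > prev):
            prev = b              # condition holds: take b
        out.append(prev)          # otherwise carry the previous d forward
    return pd.Series(out, index=group.index)

df['d_check'] = df.groupby('flag', group_keys=False).apply(fill_d)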

Table-wise value count

I have a table like this, and I want to draw a histogram of how many 0s, 1s, 2s and 3s appear across the whole table. Is there a way to do it?
You can apply melt and then hist or value_counts. For example:
df
A B C D
0 3 1 1 1
1 3 3 2 2
2 1 0 1 1
3 3 2 3 0
4 3 1 1 3
5 3 0 3 1
6 3 1 1 0
7 1 3 3 0
8 3 1 3 3
9 3 3 1 3
df.melt()['value'].value_counts()
3 18
1 14
0 5
2 3
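To actually draw the histogram the question asks for, you can plot those counts (a sketch; assumes matplotlib is available):
import matplotlib.pyplot as plt

counts = df.melt()['value'].value_counts().sort_index()
counts.plot(kind='bar')   # one bar per distinct value (0, 1, 2, 3)
plt.show()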
