pandas remove rows with multiple criteria - python-3.x

Consider the following pandas Data Frame:
df = pd.DataFrame({
'case_id': [1050, 1050, 1050, 1050, 1051, 1051, 1051, 1051],
'elm_id': [101, 102, 101, 102, 101, 102, 101, 102],
'cid': [1, 1, 2, 2, 1, 1, 2, 2],
'fx': [736.1, 16.5, 98.8, 158.5, 272.5, 750.0, 333.4, 104.2],
'fy': [992.0, 261.3, 798.3, 452.0, 535.9, 838.8, 526.7, 119.4],
'fz': [428.4, 611.0, 948.3, 523.9, 880.9, 340.3, 890.7, 422.1]})
When printed, it looks like this:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
4 1051 1 101 272.5 535.9 880.9
5 1051 1 102 750.0 838.8 340.3
6 1051 2 101 333.4 526.7 890.7
7 1051 2 102 104.2 119.4 422.1
I need to remove rows where 'case_id' is in one list of values and 'cid' is in another. For simplicity, let's just use lists with a single value each: cases = [1051] and ids = [1] respectively. In this scenario I want the new DataFrame to have 6 rows of data. It should look like this, because there were two rows matching my criteria which should be removed:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
4 1051 2 101 333.4 526.7 890.7
5 1051 2 102 104.2 119.4 422.1
I've tried a few different things like:
df2 = df[(df.case_id != subcase) & (df.cid != commit_id)]
But this returns the inverse of what I was expecting:
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
I've also tried using .query(): df.query('(case_id != 1051) & (cid != 1)')
but got the same (2) rows of results.
Any help and/or explanations would be greatly appreciated.

Your code selects the rows that meet the criteria, it does not drop them. Also note that (df.case_id != subcase) & (df.cid != commit_id) keeps only rows where both conditions hold, which by De Morgan's law is the complement of (case_id == 1051) | (cid == 1), not of the AND you want to remove; that is why only two rows survive. You can drop the matching rows using .drop().
Use the following:
df.drop(df.loc[(df['case_id'].isin(cases)) & (df['cid'].isin(ids))].index)
Output:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
6 1051 2 101 333.4 526.7 890.7
7 1051 2 102 104.2 119.4 422.1
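An equivalent approach, sketched below under the assumption that cases and ids are the lists from the question, is to keep everything except the matching rows with an inverted boolean mask, avoiding the separate .drop() call:

cases = [1051]
ids = [1]

# Keep rows where it is NOT true that both case_id and cid match the lists
mask = df['case_id'].isin(cases) & df['cid'].isin(ids)
df2 = df[~mask]

This yields the same six rows; add .reset_index(drop=True) if you want the 0-5 numbering shown in the expected output.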

Related

Pivot a Two Column DataFrame With No Numeric Column To Aggregate On

I have a dataframe with input like this:
df1 = pd.DataFrame(
    {'StoreId': [244, 391, 246, 246, 130, 130],
     'PackageStatus': ['IN TRANSIT', 'IN TRANSIT', 'IN TRANSIT', 'IN TRANSIT', 'IN TRANSIT', 'COLLECTED']}
)
StoreId PackageStatus
0 244 IN TRANSIT
1 391 IN TRANSIT
2 246 IN TRANSIT
3 246 IN TRANSIT
4 130 IN TRANSIT
5 130 COLLECTED
The output I'm expecting should look like this, with the package statuses pivoted to columns and their counts becoming the values:
StoreId IN TRANSIT COLLECTED
244 1 0
391 1 0
246 2 0
130 1 1
All the examples I come across are with a third numeric column with which some aggregation (sum, mean, average etc) is done.
When I tried
df1.pivot_table(index='StoreId',values='PackageStatus', aggfunc='count')
I get the following instead:
PackageStatus
StoreId
130 2
244 1
246 2
391 1
In my case I need a simple transpose/pivot with the counts. How do I accomplish this? Thank you.
Use the columns="PackageStatus" parameter:
print(
df1.pivot_table(
index="StoreId", columns="PackageStatus", aggfunc="size", fill_value=0
)
)
Prints:
PackageStatus COLLECTED IN TRANSIT
StoreId
130 1 1
244 0 1
246 0 2
391 0 1
With .reset_index():
print(
df1.pivot_table(
index="StoreId", columns="PackageStatus", aggfunc="size", fill_value=0
)
.reset_index()
.rename_axis("", axis=1)
)
Prints:
StoreId COLLECTED IN TRANSIT
0 130 1 1
1 244 0 1
2 246 0 2
3 391 0 1
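A shorter alternative on the same df1, offered here as a sketch, is pd.crosstab, which counts co-occurrences directly (the column order follows the data, so reorder the columns if you need IN TRANSIT before COLLECTED):

counts = pd.crosstab(df1["StoreId"], df1["PackageStatus"])
print(counts.reset_index().rename_axis("", axis=1))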

How to update one dataframe's column by matching columns in two different dataframes in Pandas

I have two dataframes. I need to generate a report by matching columns in the two dataframes and updating a column in the first dataframe:
Sample Data
input_file = pd.DataFrame({'Branch' : ['GGN','MDU','PDR','VLR','AMB'],
'Inflow' : [0, 0, 0, 0, 0]})
month_inflow = pd.DataFrame({'Branch' : ['AMB','GGN','MDU','PDR','VLR'],
'Visits' : [124, 130, 150, 100, 112]})
input_file
Branch Inflow
0 GGN 0
1 MDU 0
2 PDR 0
3 VLR 0
4 AMB 0
month_inflow
Branch Visits
0 AMB 124
1 GGN 130
2 MDU 150
3 PDR 100
4 VLR 112
Expected Output:
input_file
Branch Inflow
1 GGN 130
2 MDU 150
3 PDR 100
4 VLR 112
5 AMB 124
I tried the merge option, but it brings along the 'Inflow' column, which is not required. I know I can drop it, but could someone let me know if there's a better way to get the desired output?
pd.merge(input_file, month_inflow, on = 'Branch')
Branch Inflow Visits
0 GGN 0 130
1 MDU 0 150
2 PDR 0 100
3 VLR 0 112
4 AMB 0 124
You can try
input_file.Inflow=input_file.Branch.map(month_inflow.set_index('Branch').Visits)
input_file
Out[145]:
Branch Inflow
0 GGN 130
1 MDU 150
2 PDR 100
3 VLR 112
4 AMB 124
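For context, a brief sketch of why this works (assuming Branch values are unique in month_inflow): set_index('Branch') turns Visits into a Series keyed by branch, and .map() then looks up each branch of input_file in that Series.

lookup = month_inflow.set_index('Branch')['Visits']      # Series mapping Branch -> Visits
input_file['Inflow'] = input_file['Branch'].map(lookup)  # same as the one-liner above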
Merge on "Branch" column and then drop "Inflow" from input file.
input_file = input_file.merge(month_inflow, on="Branch").drop('Inflow',1)
input_file
Branch Visits
0 GGN 130
1 MDU 150
2 PDR 100
3 VLR 112
4 AMB 124
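If you also want the merged column to keep the original 'Inflow' name shown in the expected output, a small variation (an assumption about the desired column name, not part of the original answer) is to rename 'Visits' during the merge:

input_file = (input_file.drop(columns='Inflow')
              .merge(month_inflow.rename(columns={'Visits': 'Inflow'}), on='Branch'))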

How to do cumulative mean and count in an easy way

I have the following dataframe in pandas:
data = {'call_put': ['C', 'C', 'P', 'C', 'P'], 'price': [10, 20, 30, 40, 50], 'qty': [11, 12, 11, 14, 9]}
df = pd.DataFrame(data)
df['amt'] = df.price * df.qty
call_put price qty amt
0 C 10 11 110
1 C 20 12 240
2 P 30 11 330
3 C 40 14 560
4 P 50 9 450
I want output something like the following, where the count and mean are cumulative within each call_put value ('C' or 'P') and the sum is a running total, calculated as shown:
call_put price qty amt cummcount cummmedian cummsum
C 10 11 110 1 110 110
C 20 12 240 2 175 ((110+240)/2 ) 350
P 30 11 330 1 330 680
C 40 14 560 3 303.33 (110+240+560)/3 1240
P 50 9 450 2 390 ((330+450)/2) 1690
Can it be done in some easy way without creating additional dataframes and functions?
Create a groupby object named g and use df.assign to add the new columns:
g=df.groupby('call_put')
final=df.assign(cum_count=g.cumcount().add(1),
cummedian=g['amt'].expanding().mean().reset_index(drop=True), cum_sum=df.amt.cumsum())
call_put price qty amt cum_count cummedian cum_sum
0 C 10 11 110 1 110.000000 110
1 C 20 12 240 2 175.000000 350
2 P 30 11 330 1 303.333333 680
3 C 40 14 560 3 330.000000 1240
4 P 50 9 450 2 390.000000 1690
Note: for P, the cummedian should be 390 since (330+450)/2 = 390.
For cum_count, look at df.groupby.cumcount().
For cummedian, check how expanding() works.
For cumsum, check df.cumsum().
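One possible refinement, sketched under the assumption that each row's cumulative mean should stay aligned with its own group: groupby + expanding produces a MultiIndex of (call_put, original row index), so dropping only the group level keeps the original index and lets assign place the values on the right rows (in the output above, rows 2 and 3 have swapped cummedian values because reset_index(drop=True) discards that alignment):

g = df.groupby('call_put')
final = df.assign(
    cum_count=g.cumcount().add(1),
    cummedian=g['amt'].expanding().mean().reset_index(level=0, drop=True),  # keep the original index
    cum_sum=df.amt.cumsum(),
)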
IIUC, this should work
df['cumcount'] = df.groupby('call_put').cumcount()
df['cummedian'] = df.groupby('call_put')['amt'].expanding().mean().reset_index(level=0, drop=True)
df['cumsum'] = df.groupby('call_put')['amt'].cumsum()
Thanks, the following solution works fine:
g=df.groupby('call_put')
final=df.assign(cum_count=g.cumcount().add(1),
cummedian=g['amt'].expanding().mean().reset_index(drop=True), cum_sum=df.amt.cumsum())
If I run the following without drop=True:
g['amt'].expanding().mean().reset_index()
why does the output show a level_1 column?
call_put level_1 amt
0 C 0 110.000000
1 C 1 175.000000
2 C 3 303.333333
3 P 2 330.000000
4 P 4 390.000000
g['amt'].expanding().mean().reset_index(drop=True)
0 110.000000
1 175.000000
2 303.333333
3 330.000000
4 390.000000
Name: amt, dtype: float64
Can you please explain in more detail?
Also, how do you add one more condition to the groupby clause? I tried:
g=df.groupby('call_put', 'price' < 50)
TypeError: '<' not supported between instances of 'str' and 'int'
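Two brief notes on these follow-ups, offered as a sketch rather than as part of the original answers. First, groupby + expanding returns a Series indexed by (call_put, original row number); a plain reset_index() moves both index levels into columns, and because the inner level has no name pandas labels it level_1, while drop=True simply discards the index instead. Second, groupby() does not accept a filter condition as a second positional argument, and 'price' < 50 compares a string with an int (hence the TypeError); the usual pattern is to filter the dataframe first and group afterwards:

g = df[df['price'] < 50].groupby('call_put')   # filter rows, then group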

Get unique values of a column in between a timeperiod in pandas after groupby

I have a requirement where I need to find all the unique values of merchant_store_id for a user on the same stampcard within a specific time period. I grouped by stampcard id and user id to get the dataframe based on that condition. Now I need to find the unique merchant_store_id values of this dataframe in an interval of 10 minutes from each entry.
My approach is to loop over the grouped dataframe, find all the indexes in each group, build a new dataframe covering the time from each index to index + 60 minutes, and then find the unique merchant_store_id's in it. If the number of unique merchant_store_id's is greater than 1, I append that slice to a final dataframe. The problem with this approach is that it works fine for small data, but for data of about 20,000 rows it raises a memory error on Linux and keeps running indefinitely on Windows. Below is my code:
fi_df = pd.DataFrame()
for i in df.groupby(["stamp_card_id", "merchant_id", "user_id"]):
    user_df = i[1]
    if len(user_df) > 1:
        # get list of unique indexes in that groupby df
        index = user_df.index.values
        for ind in index:
            fdf = user_df[ind:ind + np.timedelta64(1, 'h')]
            if len(fdf.merchant_store_id.unique()) > 1:
                fi_df = fi_df.append(fdf)
fi_df.drop_duplicates(keep="first").to_csv(csv_export_path)
Sample Data after group by is:
((117, 209, 'oZOfOgAgnO'), stamp_card_id stamp_time stamps_record_id user_id \
0 117 2018-10-14 16:48:03 1756 oZOfOgAgnO
1 117 2018-10-14 16:54:03 1759 oZOfOgAgnO
2 117 2018-10-14 16:58:03 1760 oZOfOgAgnO
3 117 2018-10-14 17:48:03 1763 oZOfOgAgnO
4 117 2018-10-14 18:48:03 1765 oZOfOgAgnO
5 117 2018-10-14 19:48:03 1767 oZOfOgAgnO
6 117 2018-10-14 20:48:03 1769 oZOfOgAgnO
7 117 2018-10-14 21:48:03 1771 oZOfOgAgnO
8 117 2018-10-15 22:48:03 1773 oZOfOgAgnO
9 117 2018-10-15 23:08:03 1774 oZOfOgAgnO
10 117 2018-10-15 23:34:03 1777 oZOfOgAgnO
merchant_id merchant_store_id
0 209 662
1 209 662
2 209 662
3 209 662
4 209 662
5 209 662
6 209 663
7 209 664
8 209 662
9 209 664
10 209 663 )
I have also tried the resampling method, but then I get the data bucketed by fixed time periods, so the condition of a user hitting multiple merchant_store_id's is missed around the end of each hour.
Any help would be appreciated. Thanks
If those are datetimes, you can filter with the following:
filtered_set = set(df[df["stamp_time"]>=x][df["stamp_time"]<=y]["col of interest"])
df[df["stamp_time"]>=x] filters the df
adding [df["stamp_time"]<=y] filters the filtered df
["merchant_store_id"] captures just the specified column (series)
and finally set() returns the unique list (set)
Specific to your code:
x = datetime(lowerbound) #pseudo-code
y = datetime(upperbound) #pseudo-code
filtered_set = set(fi_df[fi_df["stamp_time"]>=x][fi_df["stamp_time"]<=y]["col of interest"])
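Chained indexing like df[cond1][cond2] works here but can trigger a reindexing warning in recent pandas; a combined mask (or Series.between) does the same thing in one step. This is a sketch assuming stamp_time is already a datetime column and x, y are the bounds defined above:

mask = fi_df["stamp_time"].between(x, y)                    # x <= stamp_time <= y
filtered_set = set(fi_df.loc[mask, "merchant_store_id"])    # unique store ids in the window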

Add two more columns to csv file based on matching values of other csv

I have two csv files
csv1:
csv2:
What I need to do is:
Take each value of column c of the csv1 file and match it against the number column of csv2.
If any row of csv2 matches that number, add a new column c_text to csv1 containing the value of the text column from the matching row of csv2.
Repeat the above process for column d of csv1 and add a new column d_text to csv1.
Here is what I need at the end.
I am new to pandas. How can I do this using pandas?
You can use apply():
csv1['c_text'] = csv1['c'].apply(lambda x: csv2[csv2['number']==x]['text'].values[0])
csv1['d_text'] = csv1['d'].apply(lambda x: csv2[csv2['number']==x]['text'].values[0])
Yields:
a b c d c_text d_text
0 1 4 101 201 val1 val4
1 2 5 105 202 val2 val5
2 3 6 107 203 val3 val6
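A note on the apply() version: .values[0] raises an IndexError when a value of c or d has no match in csv2. A map-based variant (a sketch, assuming the number values in csv2 are unique) avoids the per-row filtering and simply yields NaN for missing numbers:

lookup = csv2.set_index('number')['text']   # Series mapping number -> text
csv1['c_text'] = csv1['c'].map(lookup)
csv1['d_text'] = csv1['d'].map(lookup)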
As an alternative using merge(), this will yield the same output:
csv1 = csv1.merge(csv2, left_on='c', right_on='number', how='left')
csv1 = csv1.merge(csv2, left_on='d', right_on='number', how='left')
csv1 = csv1.rename(columns={'text_x': 'c_text', 'text_y': 'd_text'})[['a','b','c','d','c_text','d_text']]
Here's something that will do the trick:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c':[101, 105, 107], 'd':[201, 202, 203]})
df2 = pd.DataFrame({'number': [101, 105, 107, 201, 202, 203, 205, 2010, 310], 'text': ["val_{x}".format(x=y + 1) for y in range(9)]})
df1
a b c d
0 1 4 101 201
1 2 5 105 202
2 3 6 107 203
df2
number text
0 101 val_1
1 105 val_2
2 107 val_3
3 201 val_4
4 202 val_5
5 203 val_6
6 205 val_7
7 2010 val_8
8 310 val_9
merged = df1.merge(df2, left_on='c', right_on='number', how='left')
merged
a b c d number text
0 1 4 101 201 101 val_1
1 2 5 105 202 105 val_2
2 3 6 107 203 107 val_3
output = merged.merge(df2, left_on='d', right_on='number', how='left')[['a', 'b', 'c', 'd', 'text_x', 'text_y']]
output
a b c d text_x text_y
0 1 4 101 201 val_1 val_4
1 2 5 105 202 val_2 val_5
2 3 6 107 203 val_3 val_6
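To get the c_text / d_text column names asked for in the question, one extra rename (not shown in the original answer) is enough:

output = output.rename(columns={'text_x': 'c_text', 'text_y': 'd_text'})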
What you want is the merge functionality of pandas. Assuming you have imported the pandas module with the usual shorthand, import pandas as pd, then:
csv1_with_text_col = pd.merge(csv1, csv2, left_on='c', right_on='number', how='left')
This will give you a new dataframe, csv1_with_text_col, with the columns from csv2 merged into csv1 where csv1['c'] == csv2['number']. Additionally, by specifying how='left', only rows from the left dataframe, csv1, will be kept.
You can then merge this new dataframe, csv1_with_text_col, with csv2 again but with left_on='d'.
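Putting that description into code, a minimal sketch of the two merges (dropping the intermediate number columns and renaming the text columns; these final names are an assumption, not part of the original answer):

step1 = pd.merge(csv1, csv2, left_on='c', right_on='number', how='left')
result = pd.merge(step1, csv2, left_on='d', right_on='number', how='left')
result = (result.drop(columns=['number_x', 'number_y'])
                .rename(columns={'text_x': 'c_text', 'text_y': 'd_text'}))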
