Pivot a Two Column DataFrame With No Numeric Column To Aggregate On - python-3.x

I have a dataframe with input like this:
import pandas as pd

df1 = pd.DataFrame(
    {'StoreId': [244, 391, 246, 246, 130, 130],
     'PackageStatus': ['IN TRANSIT', 'IN TRANSIT', 'IN TRANSIT',
                       'IN TRANSIT', 'IN TRANSIT', 'COLLECTED']}
)
StoreId PackageStatus
0 244 IN TRANSIT
1 391 IN TRANSIT
2 246 IN TRANSIT
3 246 IN TRANSIT
4 130 IN TRANSIT
5 130 COLLECTED
The output I'm expecting is to look like this with the package status pivoting to columns and their counts becoming the values:
StoreId IN TRANSIT COLLECTED
244 1 0
391 1 0
246 2 0
130 1 1
All the examples I come across involve a third numeric column on which some aggregation (sum, mean, etc.) is done.
When I tried
df1.pivot_table(index='StoreId',values='PackageStatus', aggfunc='count')
I get the following instead:
PackageStatus
StoreId
130 2
244 1
246 2
391 1
In my case I need a simple transpose/pivot with the count. How can I accomplish this? Thank you.

Use the columns="PackageStatus" parameter:
print(
    df1.pivot_table(
        index="StoreId", columns="PackageStatus", aggfunc="size", fill_value=0
    )
)
Prints:
PackageStatus COLLECTED IN TRANSIT
StoreId
130 1 1
244 0 1
246 0 2
391 0 1
With .reset_index():
print(
    df1.pivot_table(
        index="StoreId", columns="PackageStatus", aggfunc="size", fill_value=0
    )
    .reset_index()
    .rename_axis("", axis=1)
)
Prints:
StoreId COLLECTED IN TRANSIT
0 130 1 1
1 244 0 1
2 246 0 2
3 391 0 1
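Alternatively, pd.crosstab computes the same count table directly from the two columns:
print(pd.crosstab(df1['StoreId'], df1['PackageStatus']))
Prints:
PackageStatus COLLECTED IN TRANSIT
StoreId
130 1 1
244 0 1
246 0 2
391 0 1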

Related

Re-formatting a dataframe to show sequence number and time difference after a groupby

I have a pandas dataframe that has an identifier, a sequence number, and a timestamp.
For example:
MyIndex seq_no timestamp
1 181 7:56
1 182 7:57
1 183 7:59
2 184 8:01
2 185 8:04
3 186 8:05
3 187 8:08
3 188 8:10
I want to reformat it to show a per-index sequence number together with the time difference, something like:
MyIndex seq_no timediff
1 1 0
1 2 1
1 3 2
2 1 0
2 2 3
3 1 0
3 2 3
3 3 2
I know I can get the seq_no by doing
df.groupby("MyIndex")["seq_no"].rank(method="first", ascending=True)
but how do I get the time difference? Bonus points if you show me how to do the time difference between steps, or total timediff from the start.
I think the simplest way to get the difference is to convert the timestamp to a single unit (minutes here). You can then calculate the difference with groupby and shift.
import pandas as pd
from io import StringIO
data = """Index seq_no timestamp
1 181 7:56
1 182 7:57
1 183 7:59
2 184 8:01
2 185 8:04
3 186 8:05
3 187 8:08
3 188 8:10"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
# use cumcount to get new seq_no
df['seq_no_new'] = df.groupby('Index').cumcount() + 1
# can convert timestamp by splitting string
# and then casting to int
time = df['timestamp'].str.split(':', expand=True).astype(int)
df['time'] = time.iloc[:, 0] * 60 + time.iloc[:, 1]
# you then calculate the difference with groupby/shift
# fillna values with 0 and cast to int
df['timediff'] = (df['time'] - df.groupby('Index')['time'].shift(1)).fillna(0).astype(int)
# pick columns you want at the end
df = df.loc[:, ['Index', 'seq_no_new', 'timediff']]
Output
>>> df
Index seq_no_new timediff
0 1 1 0
1 1 2 1
2 1 3 2
3 2 1 0
4 2 2 3
5 3 1 0
6 3 2 3
7 3 3 2
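For the bonus part of the question, the total time difference from the start of each group falls out of the same machinery; a short sketch, to be run before the final column selection since it needs the intermediate 'time' column:
# elapsed minutes since the first row of each Index group
df['total_timediff'] = df['time'] - df.groupby('Index')['time'].transform('first')
# equivalently, the running sum of the per-step differences
df['total_timediff'] = df.groupby('Index')['timediff'].cumsum()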

How to update one dataframe's column by matching columns in two different dataframes in Pandas

I have two dataframes. I need to generate a report by matching columns in the two dataframes and updating a column in the first one:
Sample Data
input_file = pd.DataFrame({'Branch': ['GGN', 'MDU', 'PDR', 'VLR', 'AMB'],
                           'Inflow': [0, 0, 0, 0, 0]})
month_inflow = pd.DataFrame({'Branch': ['AMB', 'GGN', 'MDU', 'PDR', 'VLR'],
                             'Visits': [124, 130, 150, 100, 112]})
input_file
Branch Inflow
0 GGN 0
1 MDU 0
2 PDR 0
3 VLR 0
4 AMB 0
month_inflow
Branch Visits
0 AMB 124
1 GGN 130
2 MDU 150
3 PDR 100
4 VLR 112
Expected Output:
input_file
Branch Inflow
1 GGN 130
2 MDU 150
3 PDR 100
4 VLR 112
5 AMB 124
I tried using the merge option, but I get the 'Inflow' column, which is not required. I know I can drop it, but could someone let me know if there's a better way to get the desired output?
pd.merge(input_file, month_inflow, on = 'Branch')
Branch Inflow Visits
0 GGN 0 130
1 MDU 0 150
2 PDR 0 100
3 VLR 0 112
4 AMB 0 124
You can try
input_file.Inflow = input_file.Branch.map(month_inflow.set_index('Branch').Visits)
input_file
Out[145]:
Branch Inflow
0 GGN 130
1 MDU 150
2 PDR 100
3 VLR 112
4 AMB 124
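The map approach works because set_index('Branch') turns month_inflow into a Branch-to-Visits lookup Series; map then fills Inflow in place for each matching Branch, so no unwanted extra column appears and nothing needs to be dropped.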
Merge on "Branch" column and then drop "Inflow" from input file.
input_file = input_file.merge(month_inflow, on="Branch").drop('Inflow',1)
input_file
Branch Visits
0 GGN 130
1 MDU 150
2 PDR 100
3 VLR 112
4 AMB 124
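Note that this leaves the column named Visits; to match the expected output exactly, a rename can be appended:
input_file = input_file.rename(columns={'Visits': 'Inflow'})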

How to do cumulative mean and count in an easy way

I have the following dataframe in pandas:
data = {'call_put': ['C', 'C', 'P', 'C', 'P'], 'price': [10, 20, 30, 40, 50], 'qty': [11, 12, 11, 14, 9]}
df = pd.DataFrame(data)
df['amt'] = df.price * df.qty
call_put price qty amt
0 C 10 11 110
1 C 20 12 240
2 P 30 11 330
3 C 40 14 560
4 P 50 9 450
I want output something like the following, with the count, running mean ('cummedian') and cumulative sum computed based on whether the call_put value is 'C' or 'P':
call_put price qty amt cummcount cummmedian cummsum
C 10 11 110 1 110 110
C 20 12 240 2 175 ((110+240)/2 ) 350
P 30 11 330 1 330 680
C 40 14 560 3 303.33 (110+240+560)/3 1240
P 50 9 450 2 390 ((330+450)/2) 1690
Can it be done in some easy way without creating additional dataframes and functions?
Create a groupby object named g and use df.assign to add the new columns:
g = df.groupby('call_put')
final = df.assign(cum_count=g.cumcount().add(1),
                  cummedian=g['amt'].expanding().mean().reset_index(level=0, drop=True),
                  cum_sum=df.amt.cumsum())
call_put price qty amt cum_count cummedian cum_sum
0 C 10 11 110 1 110.000000 110
1 C 20 12 240 2 175.000000 350
2 P 30 11 330 1 330.000000 680
3 C 40 14 560 3 303.333333 1240
4 P 50 9 450 2 390.000000 1690
Note: reset_index(level=0, drop=True) drops only the group level, so the result keeps the original row index and aligns correctly inside assign; a plain reset_index(drop=True) assigns the values by position, which would swap the cummedian values of rows 2 and 3. For P, the cummedian is 390 since (330+450)/2 = 390.
For cum_count look at df.groupby.cumcount(); for cummedian check how expanding() works; for cumsum check df.cumsum().
IIUC, this should work:
df['cumcount'] = df.groupby('call_put').cumcount() + 1
df['cummedian'] = df.groupby('call_put')['amt'].expanding().mean().reset_index(level=0, drop=True)
df['cumsum'] = df['amt'].cumsum()
Thanks, the following solution is fine:
g = df.groupby('call_put')
final = df.assign(cum_count=g.cumcount().add(1),
                  cummedian=g['amt'].expanding().mean().reset_index(level=0, drop=True),
                  cum_sum=df.amt.cumsum())
If I run the following without drop=True:
g['amt'].expanding().mean().reset_index()
why is the output showing level_1?
call_put level_1 amt
0 C 0 110.000000
1 C 1 175.000000
2 C 3 303.333333
3 P 2 330.000000
4 P 4 390.000000
g['amt'].expanding().mean().reset_index(drop=True)
0 110.000000
1 175.000000
2 303.333333
3 330.000000
4 390.000000
Name: amt, dtype: float64
Can you please explain it in more detail?
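To expand on that: g['amt'].expanding().mean() returns a Series with a two-level MultiIndex, where level 0 is the group key call_put and level 1 is the original row index. That second level has no name, so a plain reset_index() materialises it as a column with the placeholder name level_1:
print(g['amt'].expanding().mean().index.names)  # ['call_put', None]
reset_index(drop=True) discards both levels and leaves a plain positional index (which is why values can end up misaligned when assigned back), while reset_index(level=0, drop=True) drops only the group key and keeps the original row labels, so alignment is preserved.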
How do you add one more condition to the groupby clause?
g=df.groupby('call_put', 'price' < 50)
TypeError: '<' not supported between instances of 'str' and 'int'
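The error actually comes from the expression 'price' < 50 itself: it compares the string 'price' with the integer 50 before groupby ever runs, and groupby's second positional argument is the axis, not a filter. Two sketches of what was probably intended:
g = df[df['price'] < 50].groupby('call_put')    # filter the rows first, then group
g = df.groupby(['call_put', df['price'] < 50])  # or use the condition as a second group key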

How to pass pandas series elements to another dataframe

I want to check if an error occurred.
I have these two dataframes, from Excel files:
Log_frame is a dataframe of log files, reporting recorded data and errors:
Time Voltage[V] Freq[Hz] Speed Motor_Stt: ErrNo
0 10:00 220 50 30 1 0
1 10:10 220 50 30 1 0
2 10:20 220 50 0 2 3601
3 10:30 220 47 0 1 1500
4 10:40 250 50 0 1 7707
5 10:50 220 50 0 2 3601
6 11:00 220 50 0 2 3601
7 11:10 220 47 0 1 1500
8 11:20 220 50 30 1 0
9 11:30 220 50 30 1 0
Dev_frame is the dataframe of error descriptions:
Fehler-Nr. Descr Cause
0 1500 Chk_Voltage Voltage out of range
1 7707 Chk_Freq. Freq. out of range
2 3601 Chk_Motor_Stt Motor_defec
3 7704 switch_trip chk_over_curr
From Log_frame I can check if, which and how many errors occurred during a day with:
Err_log = Log_frame['ErrNo']
p = Err_log[Err_log != 0].drop_duplicates(keep='first').reset_index(drop=True)
and this result is a pandas series:
<class 'pandas.core.series.Series'>
0 3601
1 1500
2 7707
I can "pass" first error (or second and all the other) by this:
Dev_Err = Dev_frame['Fehler-Nr.']
n = Dev_Err[Dev_Err == p.iloc[0]] #or 1, 2 and so on
I was wondering how to loop through p.iloc[i].
Should I use a for loop, or can this be done with a pandas function?
EDIT: e.g. if I put 1 in p.iloc[] I can get:
0 1500
if 2:
1 7707
No need to create a loop to check each value; you can use the isin method of pandas.DataFrame, as follows:
n = Dev_frame[Dev_frame['Fehler-Nr.'].isin(p)]['Fehler-Nr.']
which is going to return:
0 1500
1 7707
2 3601
Name: Fehler-Nr., dtype: int64
Ref: pandas.DataFrame.isin
If you're using pandas and going for for loops, you are usually doing it wrong: use pandas' vectorised operations instead. A simple example is df.apply(some_function, axis).
I'm not 100% convinced I understood what you're trying to achieve, but I believe you just want to merge/join the number of errors for a given error. If so, pandas.DataFrame.join() and pandas.merge() can help. Check the docs.
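If the end goal is a report with the description and cause of each error that occurred, a minimal merge along those lines (using the frames defined above) could look like:
report = pd.DataFrame({'Fehler-Nr.': p}).merge(Dev_frame, on='Fehler-Nr.', how='left')
This yields one row per distinct error in p, with Descr and Cause filled in from Dev_frame.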

pandas remove rows with multiple criteria

Consider the following pandas Data Frame:
df = pd.DataFrame({
    'case_id': [1050, 1050, 1050, 1050, 1051, 1051, 1051, 1051],
    'elm_id': [101, 102, 101, 102, 101, 102, 101, 102],
    'cid': [1, 1, 2, 2, 1, 1, 2, 2],
    'fx': [736.1, 16.5, 98.8, 158.5, 272.5, 750.0, 333.4, 104.2],
    'fy': [992.0, 261.3, 798.3, 452.0, 535.9, 838.8, 526.7, 119.4],
    'fz': [428.4, 611.0, 948.3, 523.9, 880.9, 340.3, 890.7, 422.1]})
When printed, it looks like this:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
4 1051 1 101 272.5 535.9 880.9
5 1051 1 102 750.0 838.8 340.3
6 1051 2 101 333.4 526.7 890.7
7 1051 2 102 104.2 119.4 422.1
I need to remove rows where 'case_id' equals a value in one list and 'cid' equals a value in another list. For simplicity, let's just use lists with a single value each: cases = [1051] and ids = [1]. In this scenario I want the new Data Frame to have (6) rows of data. It should look like this, because the two rows matching my criteria have been removed:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
4 1051 2 101 333.4 526.7 890.7
5 1051 2 102 104.2 119.4 422.1
I've tried a few different things like:
df2 = df[(df.case_id != subcase) & (df.cid != commit_id)]
But this returns the inverse of what I was expecting:
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
I've also tried using .query(): df.query('(case_id != 1051) & (cid != 1)')
but got the same (2) rows of results.
Any help and/or explanations would be greatly appreciated.
Your code looks for the rows that meet the criteria rather than dropping them. You can drop these rows using .drop().
Use the following:
df.drop(df.loc[(df['case_id'].isin(cases)) & (df['cid'].isin(ids))].index)
Output:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
6 1051 2 101 333.4 526.7 890.7
7 1051 2 102 104.2 119.4 422.1
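As to why the original attempt returned the inverse: by De Morgan's law the complement of (A & B) is (~A | ~B), so (df.case_id != subcase) & (df.cid != commit_id) keeps only the rows that fail both conditions instead of dropping the rows that meet both. Negating the combined mask with boolean indexing is therefore an equivalent alternative to .drop():
df2 = df[~(df['case_id'].isin(cases) & df['cid'].isin(ids))]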
