Creating multiple cohorts from a pivot table - python-3.x

I have a requirement like below.
The initial information is a list of gross adds.
   201910  201911  201912  202001  202002
0   20000   30000   32000   40000   36000
I have a pivot table as below.
   201910  201911  201912  202001  202002
0    1000    2000    2400    3200    1800
1     500     400     300     200     NaN
2     200     150     100     NaN     NaN
3     200     100     NaN     NaN     NaN
4     160     NaN     NaN     NaN     NaN
I need to generate a report like the one below.
Cohort01: 5%, 3%, 3%, 1%, 1%, 1%
From Cohort02 onwards, missing values are filled from the preceding cohorts: Cohort02's missing value takes the value of Cohort01 in that position.
Similarly, Cohort03 fills each of its NaN positions with the average of the corresponding values from Cohort01 and Cohort02.
Again, when calculating Cohort04, it takes the average of the previous two cohorts (the Cohort02 and Cohort03 values) to fill all three NaN positions.
Can anyone provide a Python solution for this?
The report should be generated as shown above, with all cohorts created separately.

You could try it like this:
# percentage of each month's gross adds (df_gross.iloc[0]), rounded to one decimal place
res = df.apply(lambda x: round(100 / (df_gross.iloc[0] / x), 1), axis=1)
print(res)
201910 201911 201912 202001 202002
0 5.0 6.7 7.5 8.0 5.0
1 2.5 1.3 0.9 0.5 NaN
2 1.0 0.5 0.3 NaN NaN
3 1.0 0.3 NaN NaN NaN
4 0.8 NaN NaN NaN NaN
for idx, col in enumerate(res.columns[1:], 1):
    res[col] = res[col].fillna((res.iloc[:, max(idx - 2, 0)] + res.iloc[:, idx - 1]) / 2)
print(res)
201910 201911 201912 202001 202002
0 5.0 6.7 7.50 8.000 5.0000
1 2.5 1.3 0.90 0.500 0.7000
2 1.0 0.5 0.30 0.400 0.3500
3 1.0 0.3 0.65 0.475 0.5625
4 0.8 0.8 0.80 0.800 0.8000
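For reference, a minimal sketch of the inputs assumed above; the question never defines df and df_gross, so the construction below is an assumption based on the two tables shown, and the cohorts dict at the end assumes the answer's res has already been computed:
import pandas as pd
import numpy as np

months = ['201910', '201911', '201912', '202001', '202002']

# gross adds as a one-row frame, so that df_gross.iloc[0] is a per-month Series
df_gross = pd.DataFrame([[20000, 30000, 32000, 40000, 36000]], columns=months)

# pivot table from the question; NaN where no value exists yet
df = pd.DataFrame([[1000, 2000, 2400, 3200, 1800],
                   [500, 400, 300, 200, np.nan],
                   [200, 150, 100, np.nan, np.nan],
                   [200, 100, np.nan, np.nan, np.nan],
                   [160, np.nan, np.nan, np.nan, np.nan]], columns=months)

# if each cohort is needed separately, one Series per column of res (run after the code above)
cohorts = {col: res[col] for col in res.columns}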

Related

Keep column when resampling hourly to daily data in pandas

I have a dataset of hourly weather observations in this format:
df = pd.DataFrame({ 'date': ['2019-01-01 09:30:00', '2019-01-01 10:00', '2019-01-02 04:30:00','2019-01-02 05:00:00','2019-07-04 02:00:00'],
'windSpeedHigh': [155,90,35,45,15],
'windSpeedHigh_Dir':['NE','NNW','SW','W','S']})
My goal is to find the highest wind speed each day and the wind direction associated with that maximum daily wind speed.
Using resample, I have successfully found the maximum wind speed for each day, but not its associated direction:
df['date'] = pd.to_datetime(df['date'])
df['windSpeedHigh'] = pd.to_numeric(df['windSpeedHigh'])
df_daily = df.resample('D', on='date')[['windSpeedHigh_Dir','windSpeedHigh']].max()
df_daily
Results in:
windSpeedHigh_Dir windSpeedHigh
date
2019-01-01 NNW 155.0
2019-01-02 W 45.0
2019-01-03 NaN NaN
2019-01-04 NaN NaN
2019-01-05 NaN NaN
... ... ...
2019-06-30 NaN NaN
2019-07-01 NaN NaN
2019-07-02 NaN NaN
2019-07-03 NaN NaN
2019-07-04 S 15.0
This is incorrect, as the resample also grabs the max() of 'windSpeedHigh_Dir' independently. For 2019-01-01 the direction associated with the maximum wind speed should be 'NE', not 'NNW', because the wind direction was 'NE' when the maximum wind speed occurred.
So my question is, is it possible for me to resample this dataset from half-hourly to daily maximum wind speed while keeping the wind direction associated with that speed?
Use DataFrameGroupBy.idxmax to get the index of the maximum wind speed per date first:
df_daily = df.loc[df.groupby(df['date'].dt.date)['windSpeedHigh'].idxmax()]
print (df_daily)
date windSpeedHigh windSpeedHigh_Dir
0 2019-01-01 09:30:00 155 NE
3 2019-01-02 05:00:00 45 W
4 2019-07-04 02:00:00 15 S
Then, to add a daily DatetimeIndex, use DataFrame.set_index with Series.dt.normalize and DataFrame.asfreq:
df_daily = df_daily.set_index(df_daily['date'].dt.normalize().rename('day')).asfreq('d')
print (df_daily)
date windSpeedHigh windSpeedHigh_Dir
day
2019-01-01 2019-01-01 09:30:00 155.0 NE
2019-01-02 2019-01-02 05:00:00 45.0 W
2019-01-03 NaT NaN NaN
2019-01-04 NaT NaN NaN
2019-01-05 NaT NaN NaN
... ... ...
2019-06-30 NaT NaN NaN
2019-07-01 NaT NaN NaN
2019-07-02 NaT NaN NaN
2019-07-03 NaT NaN NaN
2019-07-04 2019-07-04 02:00:00 15.0 S
[185 rows x 3 columns]
Your resample-based solution should also work with a custom function (because idxmax fails for empty groups), combined with DataFrame.join:
f = lambda x: x.idxmax() if len(x) > 0 else np.nan
df_daily = df.resample('D', on='date')['windSpeedHigh'].agg(f).to_frame('idx').join(df, on='idx')
print (df_daily)
idx date windSpeedHigh windSpeedHigh_Dir
date
2019-01-01 0.0 2019-01-01 09:30:00 155.0 NE
2019-01-02 3.0 2019-01-02 05:00:00 45.0 W
2019-01-03 NaN NaT NaN NaN
2019-01-04 NaN NaT NaN NaN
2019-01-05 NaN NaT NaN NaN
... ... ... ...
2019-06-30 NaN NaT NaN NaN
2019-07-01 NaN NaT NaN NaN
2019-07-02 NaN NaT NaN NaN
2019-07-03 NaN NaT NaN NaN
2019-07-04 4.0 2019-07-04 02:00:00 15.0 S
[185 rows x 4 columns]
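Not part of the answer above, but a further hedged sketch: the same per-day selection can be done by sorting on wind speed and keeping the last row of each calendar day, which avoids calling idxmax on empty groups altogether (srt and daily are hypothetical names):
# sort so the last row within each day is the one with the daily maximum speed
srt = df.sort_values('windSpeedHigh')
daily = srt.groupby(srt['date'].dt.normalize()).tail(1)
# reindex to one row per calendar day, filling missing days with NaN
daily = daily.set_index(daily['date'].dt.normalize().rename('day')).sort_index().asfreq('D')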

Comparing just the time component of two datetime64 columns

I am trying to subtract or compare only the time component of two datetime64 columns, but have been unsuccessful. I have tried using strftime with an exception block to catch NaTs, but no luck. Any help is much appreciated. I have attached the Python code below.
Column A Column B
1/1/1900 10:00 NaT
1/1/1900 10:30 NaT
1/1/1900 11:00 NaT
1/1/1900 9:00 2/6/2021 23:59
1/1/1900 11:00 2/6/2021 8:59
1/1/1900 9:30 2/6/2021 16:00
def convert(x):
    try:
        return x.strftime("%H:%M:%S")
    except ValueError:
        return x

df['B'].apply(convert) - df['A'].apply(convert)
I get the error TypeError: unsupported operand type(s) for -: 'NaTType' and 'str'
Convert both columns to pandas datetime using pd.to_datetime, then subtract them and inspect the pieces of the resulting timedeltas with Series.dt.components:
df['Column A'] = pd.to_datetime(df['Column A'])
df['Column B'] = pd.to_datetime(df['Column B'])
In [213]: (df['Column A'] - df['Column B']).dt.components
Out[213]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 -44232.0 9.0 1.0 0.0 0.0 0.0 0.0
4 -44231.0 2.0 1.0 0.0 0.0 0.0 0.0
5 -44232.0 17.0 30.0 0.0 0.0 0.0 0.0
From the above, you can extract hours, minutes, etc.. separately:
In [215]: (df['Column A'] - df['Column B']).dt.components.hours
Out[215]:
0 NaN
1 NaN
2 NaN
3 9.0
4 2.0
5 17.0
Name: hours, dtype: float64
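If only the time-of-day should be compared, ignoring the dates entirely, a hedged alternative (not shown in the answer above) is to turn each timestamp into a Timedelta since its own midnight before subtracting; NaT rows propagate automatically:
# time-of-day as a Timedelta since midnight, then compare
tod_a = df['Column A'] - df['Column A'].dt.normalize()
tod_b = df['Column B'] - df['Column B'].dt.normalize()
diff = tod_b - tod_a   # NaT wherever Column B is NaT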

Transpose DF columns based on column values - Pandas

My df looks like this,
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 1.0 2019-10-01 11.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 11.50
5 XYZ NaN NaN NaN
6 MMG 1.0 2021-10-01 11.75
7 MMG 2.0 2014-01-01 14.00
8 MMG 3.0 2021-10-01 12.50
9 MMG 1.0 2014-01-01 15.00
10 LKG NaN NaN NaN
11 LKG NaN NaN NaN
I need my output like this,
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 per_date_3 per_num_3
0 XYZ 1 2018-10-01 11.0 2 2017-08-01 15.25 NaN NaN NaN
1 XYZ 1 2019-10-01 11.25 2 2019-08-01 15.71 3 2020-10-01 11.50
2 XYZ NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 MMG 1 2021-10-01 11.75 2 2014-01-01 14.00 3 2021-10-01 12.50
5 MMG 1 2014-01-01 15.00 NaN NaN NaN NaN NaN NaN
6 LKG NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, the param column has repeating values, and the transposed column names are created from the per values. Also, a new record is started whenever per begins again at 1. How can I achieve this?
The main problem here is the NaNs in the last LKG group; first replace the missing values with a counter built from the cumulative sum of the NaN flags within each group, and assign it to a new column per1:
s = df['per'].isna().groupby(df['param']).cumsum()    # running count of NaNs within each param group
df = df.assign(per1=df['per'].fillna(s).astype(int))  # fill missing per values with that counter
print (df)
param per per_date per_num per1
0 XYZ 1.0 2018-10-01 11.00 1
1 XYZ 2.0 2017-08-01 15.25 2
2 XYZ 1.0 2019-10-01 11.25 1
3 XYZ 2.0 2019-08-01 15.71 2
4 XYZ 3.0 2020-10-01 11.50 3
5 XYZ NaN NaN NaN 1
6 MMG 1.0 2021-10-01 11.75 1
7 MMG 2.0 2014-01-01 14.00 2
8 MMG 3.0 2021-10-01 12.50 3
9 MMG 1.0 2014-01-01 15.00 1
10 LKG NaN NaN NaN 1
11 LKG NaN NaN NaN 2
Then create a MultiIndex whose group key comes from comparing per1 to 1 and taking the cumulative sum, and reshape by unstack:
g = df['per1'].eq(1).cumsum()
df = df.set_index(['param', 'per1',g]).unstack(1).sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1, drop=True).reset_index()
print (df)
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 \
0 LKG NaN NaN NaN NaN NaN NaN NaN
1 MMG 1.0 2021-10-01 11.75 2.0 2014-01-01 14.00 3.0
2 MMG 1.0 2014-01-01 15.00 NaN NaN NaN NaN
3 XYZ 1.0 2018-10-01 11.00 2.0 2017-08-01 15.25 NaN
4 XYZ 1.0 2019-10-01 11.25 2.0 2019-08-01 15.71 3.0
5 XYZ NaN NaN NaN NaN NaN NaN NaN
per_date_3 per_num_3
0 NaN NaN
1 2021-10-01 12.5
2 NaN NaN
3 NaN NaN
4 2020-10-01 11.5
5 NaN NaN
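If the original appearance order of param (XYZ, MMG, LKG) matters rather than the alphabetical order produced by the reshape, a hedged follow-up is to reorder via an ordered Categorical; the order list below is hypothetical and could also be taken with pd.unique from the original column:
order = ['XYZ', 'MMG', 'LKG']   # e.g. pd.unique(original_df['param'])
df['param'] = pd.Categorical(df['param'], categories=order, ordered=True)
df = df.sort_values('param').reset_index(drop=True)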

Dropna By Column by levels in multiindex and swap for non-na values

I am trying to do some transformations and am kind of stuck. Hopefully somebody can help me out here.
l0 a b c d e f
l1 1 2 1 2 1 2 1 2 1 2 1 2
0 NaN NaN NaN NaN 93.4 NaN NaN NaN NaN NaN 19.0 28.9
1 NaN 9.0 NaN NaN 43.5 32.0 NaN NaN NaN NaN NaN 3.4
2 NaN 5.0 NaN NaN 93.3 83.6 NaN NaN NaN NaN 59.5 28.2
3 NaN 19.6 NaN NaN 72.8 47.4 NaN NaN NaN NaN 31.5 67.2
4 NaN NaN NaN NaN NaN 62.5 NaN NaN NaN NaN NaN 1.8
I have a dataframe (shown above), and as you can see, there are multiple NaN values under a MultiIndex column. Selecting the columns along level = 0 (i.e. l0),
I would like to drop an entire column group if all of its values are NaN. So, in this case, the column groups
l0 = ['b', 'd', 'e'] # drop-cols
should be dropped from the DataFrame:
l0 a c f
l1 1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
This gives me the dataframe shown above. I would then like to slide values along the rows when all the entries before them are null (or, equivalently, swap values between adjacent column groups). For example, looking at index = 0, i.e. the first row:
l0 a c f
l1 1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
Since all the values in column a are null,
I would like to first slide / swap values between column a and column c,
and then repeat the same for the columns to the right, i.e. replace the entries in column c with those of column f and set all entries in column f to NaN, giving me:
l0 a c f
l1 1 2 1 2 1 2
0 93.4 NaN 19.0 28.9 NaN NaN
This is really to save memory for processing and storing information, as interchanging the labels ['a', 'b', 'c'...] does not change the meaning of the data.
EDIT: Any ideas for (2)?
I have managed to solve (1) with the following code:
for c in df.columns.get_level_values(0).unique():
    if df[c].isna().all().all():
        df = df.drop(columns=[c])
df
You can do this with all:
s=df.isnull().all(level=0,axis=1).all()
df.drop(s.index[s],axis=1,level=0)
Out[55]:
a c f
1 2 1 2 1 2
l1
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
groupby and filter
df.groupby(axis=1, level=0).filter(lambda d: ~d.isna().all().all())
a c f
1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
A little bit shorter
df.groupby(axis=1, level=0).filter(lambda d: ~np.all(d.isna()))
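Part (2) of the question, shifting the surviving column groups to the left within each row, is not covered by the answers above. A minimal sketch, assuming the columns keep the two-level (l0, l1) MultiIndex and that the left-to-right order of the groups should be preserved:
import numpy as np
import pandas as pd

def slide_left(row):
    # keep only the level-0 blocks that contain at least one value in this row
    blocks = [vals.to_numpy()
              for _, vals in row.groupby(level=0, sort=False)
              if not vals.isna().all()]
    flat = np.concatenate(blocks) if blocks else np.array([])
    # left-align the kept values and pad the tail with NaN
    out = np.full(len(row), np.nan)
    out[:len(flat)] = flat
    return pd.Series(out, index=row.index)

df2 = df.apply(slide_left, axis=1)   # df is the frame after dropping the all-NaN groups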

How To create Multiple Columns From Values of The Same Column?

I have a DataFrame, and I want to create new columns based on the values of one of its columns. In each of these new columns I want the values to be the count of repetitions of Plate over time.
So I have this DataFrame:
Val_Tra.head():
Plate EURO
Timestamp
2013-11-01 00:00:00 NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0
And the EURO column looks like this:
Veh_Tra.EURO.value_counts():
5 1590144
6 745865
4 625512
0 440834
3 243800
2 40664
7 14207
1 4301
And This My Desired Output:
Plate EURO_1 EURO_2 EURO_3 EURO_4 EURO_5 EURO_6 EURO_7
Timestamp
2013-11-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 1.0 NaN NaN NaN NaN NaN NaN
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad NaN NaN 1.0 NaN NaN NaN NaN
So basically, what I want is a count of each time a Plate value repeats itself for a specific EURO type at a given timestamp.
Any suggestions would be much appreciated, thank you.
This is more like a get_dummies problem
s=df.dropna().EURO.astype(int).astype(str).str.get_dummies().add_prefix('EURO')
df=pd.concat([df,s],axis=1,sort=True)
df
Out[259]:
Plate EURO EURO0 EURO6
2013-11-0100:00:00 NaN NaN NaN NaN
2013-11-0101:00:00 dcc2f657e897ffef752003469c688381 0.0 1.0 0.0
2013-11-0102:00:00 a5ac0c2f48ea80707621e530780139ad 6.0 0.0 1.0
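If the columns should be named with an underscore (EURO_0 ... EURO_7) and should exist even for categories missing from a given slice of data, a hedged variation of the same idea is to reindex over the full set of categories seen in the value_counts above:
cats = [f'EURO_{i}' for i in range(8)]   # 0-7, per the value_counts shown in the question
s = (df.dropna().EURO.astype(int).astype(str)
       .str.get_dummies()
       .add_prefix('EURO_')
       .reindex(columns=cats, fill_value=0))
out = pd.concat([df[['Plate']], s], axis=1, sort=True)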
