Groupby year-month and drop columns with all NaNs in Python - python-3.x

Based on the output dataframe from this link:
import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates,
                  columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])
models = df.columns[df.columns.str.endswith('_values')]

# function to calculate mape
def mape(y_true, y_pred):
    y_pred = np.array(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
                   axis=0) * 100

errors = (df.groupby(pd.Grouper(freq='M'))
            .apply(lambda x: mape(x[models], x[['target']])))

k = 2
n = len(models)
sorted_args = np.argsort(errors, axis=1) < k
res = pd.merge_asof(df[['target']], sorted_args,
                    left_index=True,
                    right_index=True,
                    direction='forward')
topk = df[models].where(res[models])
df = df.join(topk.add_suffix('_mape'))
df = df[['target', 'A_values_mape', 'B_values_mape', 'C_values_mape',
         'D_values_mape', 'E_values_mape']]
df
Out:
target A_values_mape ... D_values_mape E_values_mape
2013-02-26 1.281624 6.059783 ... 3.126731 NaN
2013-02-27 0.585713 1.789931 ... 7.843101 NaN
2013-02-28 9.638430 9.623960 ... 5.612724 NaN
2013-03-01 1.950960 NaN ... NaN 5.693051
2013-03-02 0.690563 NaN ... NaN 7.322250
... ... ... ... ...
2013-05-22 5.554824 NaN ... NaN 6.803052
2013-05-23 8.440801 NaN ... NaN 2.756443
2013-05-24 0.968086 NaN ... NaN 0.430184
2013-05-25 0.672555 NaN ... NaN 5.461017
2013-05-26 5.273122 NaN ... NaN 6.312104
How could I group by year-month, drop the columns that are all NaN within each group, and then rename the remaining columns to top_1, top_2, ..., top_k?
The final expected result could be like this if k=2:
Pseudocode:
df2 = df.filter(regex='_mape$').groupby(pd.Grouper(freq='M')).dropna(axis=1, how='all')
df2.columns = ['top_1', 'top_2', ..., 'top_k']
df.join(df2)
As @Quang Hoang commented in the last post, we might be able to use justify_nd to achieve that, but I don't know how. Thanks for your help in advance.
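For reference, here is a minimal sketch that follows the pseudocode above literally, assuming the *_mape columns produced by the first code block already exist in df; note that the kept columns come out in their original column order, not ranked by error, and group_keys=False is passed to keep a flat date index (its effect depends on your pandas version):
def drop_and_rename(grp):
    # keep only the columns that have at least one value in this month
    kept = grp.dropna(axis=1, how='all')
    kept.columns = [f'top_{i+1}' for i in range(kept.shape[1])]
    return kept

df2 = (df.filter(regex='_mape$')
         .groupby(pd.Grouper(freq='M'), group_keys=False)
         .apply(drop_and_rename))
out = df[['target']].join(df2)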
EDIT:
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates,
                  columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])
models = df.columns[df.columns.str.endswith('_values')]
k = 2
n = len(models)

def grpProc(grp):
    err = mape(grp[models], grp[['target']])
    # sort_args = np.argsort(err) < k
    # cols = models[sort_args]
    cols = err.nsmallest(k).index
    out_cols = [f'top_{i+1}' for i in range(k)]
    rv = grp.loc[:, cols]
    rv.columns = out_cols
    return rv

wrk = df.groupby(pd.Grouper(freq='M')).apply(grpProc)
res = df[['target']].join(wrk)
print(res)
Out:
target top_1 top_2
2013-02-26 1.281624 6.059783 9.972433
2013-02-27 0.585713 1.789931 0.968944
2013-02-28 9.638430 9.623960 6.165247
2013-03-01 1.950960 4.521452 5.693051
2013-03-02 0.690563 5.178144 7.322250
... ... ...
2013-05-22 5.554824 3.864723 6.803052
2013-05-23 8.440801 5.140268 2.756443
2013-05-24 0.968086 5.890717 0.430184
2013-05-25 0.672555 1.610210 5.461017
2013-05-26 5.273122 6.893207 6.312104

Actually, what you need is, for each group (by year / month):
- compute the errors locally for the current group,
- find the k "wanted" columns (calling argsort) and take the indicated columns from models,
- take the indicated columns from the current group and rename them to top_…,
- return what you have generated so far.
To do it, define a "group processing" function:
def grpProc(grp):
    err = mape(grp[models], grp[['target']])
    sort_args = np.argsort(err) < k
    cols = models[sort_args]
    out_cols = [f'top_{i+1}' for i in range(k)]
    rv = grp.loc[:, cols]
    rv.columns = out_cols
    return rv
Then, to generate top_… columns alone, apply this function to each group:
wrk = df.groupby(pd.Grouper(freq='M')).apply(grpProc)
And finally generate the expected result joining target column with wrk:
result = df[['target']].join(wrk)
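Note that, depending on your pandas version, groupby(...).apply may prepend the monthly group key as an extra index level when the applied function returns a DataFrame, which would break the join above. A hedged variant that keeps the plain date index is:
wrk = df.groupby(pd.Grouper(freq='M'), group_keys=False).apply(grpProc)
result = df[['target']].join(wrk)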
The first 15 rows, based on your source data, are:
target top_1 top_2
2013-02-26 1.281624 6.059783 3.126731
2013-02-27 0.585713 1.789931 7.843101
2013-02-28 9.638430 9.623960 5.612724
2013-03-01 1.950960 4.521452 5.693051
2013-03-02 0.690563 5.178144 7.322250
2013-03-03 6.177010 8.280144 6.174890
2013-03-04 1.263177 5.896541 4.422322
2013-03-05 5.888856 9.159396 8.906554
2013-03-06 2.013227 8.237912 3.075435
2013-03-07 8.482991 1.546148 6.476141
2013-03-08 7.986413 3.322442 4.738473
2013-03-09 5.944385 7.769769 0.631033
2013-03-10 7.543775 3.710198 6.787289
2013-03-11 5.816264 3.722964 6.795556
2013-03-12 3.054002 3.304891 8.258990
Edit
For the first group (2013-02-28) err contains:
A_values 48.759348
B_values 77.023855
C_values 325.376455
D_values 74.422508
E_values 60.602101
Note that the 2 lowest error values are 48.759348 and 60.602101,
so from this group you should probably take A_values (this is OK)
and E_values (instead of D_values).
So maybe the grpProc function, instead of:
sort_args = np.argsort(err) < k
cols = models[sort_args]
should contain:
cols = err.nsmallest(k).index
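As a quick, hedged check of the nsmallest-based selection, rebuilding err from the values quoted above:
err = pd.Series({'A_values': 48.759348, 'B_values': 77.023855,
                 'C_values': 325.376455, 'D_values': 74.422508,
                 'E_values': 60.602101})
print(err.nsmallest(2).index.tolist())   # ['A_values', 'E_values']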

Related

How to change float to date type in python? (ValueError: day is out of range for month)

I have the following column:
0 3012022.0
1 3012022.0
2 3012022.0
3 3012022.0
4 3012022.0
...
351 24032022.0
352 24032022.0
df.Data = df.Data.astype('str')
I converted the float to a string and I'm trying to convert it to a datetime:
df['data'] = pd.to_datetime(df['data'], format='%d%m%Y'+'.0').dt.strftime('%d-%m-%Y')
output:
ValueError: day is out of range for month
The code is:
os.chdir('/home/carol/upload')
for file in glob.glob("*.xlsx"):
    xls = pd.ExcelFile('/home/carol/upload/%s' % (file))
    if len(xls.sheet_names) > 1:
        list_sheets = []
        for i in xls.sheet_names:
            df = pd.read_excel(xls, i)
            list_sheets.append(df)
        df = pd.concat(list_sheets)
    else:
        df = pd.read_excel(xls)
    df = df[['Data', 'Frota', 'Placa', 'ValorFrete', 'ValorFaturado', 'CodFilial', 'NomeFilial']]
    df = df.dropna()
    df = df.apply(lambda x: x.astype(str).str.lower())
    df.columns = df.columns.str.lower()
    df.columns = df.columns.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
    df = df.apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
    df['data'] = pd.to_datetime(df['data'].astype(str).str.split('\.').str[0], format='%d%m%Y')
Convert to string, remove the decimal point and coerce to datetime. Code as follows:
df['data'] = pd.to_datetime(df['data'].astype(str).str.split(r'\.').str[0], format='%d%m%Y')
data
0 2022-01-30
1 2022-01-30
2 2022-01-30
3 2022-01-30
4 2022-01-30
351 2022-03-24
352 2022-03-24
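For a self-contained check, here is a small hedged sketch of the same idea applied to the two sample values quoted in the question (the raw-string pattern r'\.' avoids Python's invalid-escape warning):
import pandas as pd

df = pd.DataFrame({'data': [3012022.0, 24032022.0]})
df['data'] = pd.to_datetime(df['data'].astype(str).str.split(r'\.').str[0],
                            format='%d%m%Y')
print(df)
#         data
# 0 2022-01-30
# 1 2022-03-24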

Groupby year-month and find top N smallest standard deviation values columns in Python

With the sample data and code below, I'm trying to group by year-month and find the top k columns with the smallest std values among all the columns ending with _values:
import pandas as pd
import numpy as np
from statistics import stdev

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates,
                  columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])
k = 3  # set k as 3
value_cols = df.columns[df.columns.str.endswith('_values')]

def find_topK_smallest_std(group):
    std = stdev(group[value_cols])
    cols = std.nsmallest(k).index
    out_cols = [f'std_{i+1}' for i in range(k)]
    rv = group.loc[:, cols]
    rv.columns = out_cols
    return rv

df.groupby(pd.Grouper(freq='M'), dropna=False).apply(find_topK_smallest_std)
But it raises a TypeError; how could I fix this issue? Sincere thanks in advance.
Out:
TypeError: can't convert type 'str' to numerator/denominator
Reference link:
Groupby year-month and find top N smallest values columns in Python
In your solution, add DataFrame.apply so that stdev runs per column; if you need it per row, add axis=1:
def find_topK_smallest_std(group):
    # processing per column
    std = group[value_cols].apply(stdev)
    cols = std.nsmallest(k).index
    out_cols = [f'std_{i+1}' for i in range(k)]
    rv = group.loc[:, cols]
    rv.columns = out_cols
    return rv

df = df.groupby(pd.Grouper(freq='M'), dropna=False).apply(find_topK_smallest_std)
print(df)
std_1 std_2 std_3
2013-02-26 7.333694 3.126731 1.389472
2013-02-27 7.529254 7.843101 6.621605
2013-02-28 6.165574 5.612724 0.866300
2013-03-01 5.693051 3.711608 4.521452
2013-03-02 7.322250 4.763135 5.178144
... ... ...
2013-05-22 8.795736 3.864723 6.316478
2013-05-23 7.959282 5.140268 1.839659
2013-05-24 5.412016 5.890717 9.081583
2013-05-25 1.088414 1.610210 9.016004
2013-05-26 4.930571 6.893207 2.338785
[90 rows x 3 columns]
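For context, the original TypeError occurs because iterating over a DataFrame yields its column labels (strings), so stdev(group[value_cols]) receives strings rather than numbers. A hedged alternative that avoids statistics entirely is pandas' own DataFrame.std, which also computes the sample standard deviation (ddof=1) per column:
def find_topK_smallest_std(group):
    # per-column sample std, then pick the k smallest
    cols = group[value_cols].std().nsmallest(k).index
    rv = group.loc[:, cols]
    rv.columns = [f'std_{i+1}' for i in range(k)]
    return rv

df.groupby(pd.Grouper(freq='M'), dropna=False).apply(find_topK_smallest_std)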

Find list of column names having their minimum values in a given date range using Python

Given a dataset as follows and a date range from 2013-05-01 to 2013-05-15:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates, columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])
Out:
A_values B_values C_values D_values E_values target
2013-02-26 6.059783 7.333694 1.389472 3.126731 9.972433 1.281624
2013-02-27 1.789931 7.529254 6.621605 7.843101 0.968944 0.585713
2013-02-28 9.623960 6.165574 0.866300 5.612724 6.165247 9.638430
2013-03-01 5.743043 3.711608 4.521452 2.018502 5.693051 1.950960
2013-03-02 5.837040 4.763135 5.178144 8.230986 7.322250 0.690563
... ... ... ... ... ...
2013-05-22 8.795736 6.316478 0.427136 3.864723 6.803052 5.554824
2013-05-23 7.959282 1.839659 2.225667 5.140268 2.756443 8.440801
2013-05-24 5.412016 9.081583 7.212742 5.890717 0.430184 0.968086
2013-05-25 1.088414 9.016004 5.384490 1.610210 5.461017 0.672555
2013-05-26 4.930571 2.338785 9.823048 6.893207 6.312104 5.273122
First I filter the columns with df.filter(regex='_values$'); then I hope to return a list of the column names whose minimum value falls in the given date range (2013-05-01 to 2013-05-15), i.e., if column A_values's minimum value occurs on any day in this range, then A_values will be included in the returned list.
How could I achieve that in Pandas or Numpy? Thanks.
Use DataFrame.idxmin to get the datetime of the minimum value per column, and then filter the index with Series.between:
s = df.filter(regex='_values$').idxmin()
out = s[s.between('2013-05-01','2013-05-15')].index.tolist()
print (out)
['D_values']
EDIT:
df1 = df.filter(regex='_values$')
s1 = df1.idxmin()
s2 = df1.idxmax()
#removed tolist
out1 = s1[s1.between('2013-03-16','2013-03-31')].index
out2 = s2[s2.between('2013-05-01','2013-05-15')].index
out = out1.intersection(out2).tolist()
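A hedged sanity check of the EDIT above: inspect the date of each column's minimum and maximum, then the intersection, which keeps only the columns that satisfy both range conditions:
print(pd.DataFrame({'idx_of_min': s1, 'idx_of_max': s2}))
print(out)   # columns whose min is in 2013-03-16..31 AND max is in 2013-05-01..15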

Dataframe optimize groupby & calculated fields

I have a dataframe with the following structure:
import pandas as pd
import numpy as np
names = ['PersonA', 'PersonB', 'PersonC', 'PersonD','PersonE','PersonF']
team = ['Team1','Team2']
dates = pd.date_range(start = '2020-05-28', end = '2021-11-22')
df = pd.DataFrame({'runtime': np.repeat(dates, len(names)*len(team))})
df['name'] = len(dates)*len(team)*names
df['team'] = len(dates)*len(names)*team
df['A'] = 40 + 20*np.random.random(len(df))
df['B'] = .1 * np.random.random(len(df))
df['C'] = 1 +.5 * np.random.random(len(df))
I would like to create a dataframe that displays the calculated mean values over runtime periods such as the previous Week, Month, Year, and All-Time, such that it looks like this:
name | team | A_w | B_w | C_w | A_m | B_m | C_m | A_y | B_y | C_y | A_at | B_at | C_at
I have successfully added a calculated column for the mean value using the lambda method described here:
How do I create a new column from the output of pandas groupby().sum()?
e.g.:
df = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_at=lambda gdf: gdf['A'].mean()))
My output gives me an additional column:
runtime name team A B C A_at
0 2020-05-28 PersonA Team1 55.608186 0.027767 1.311662 49.957820
1 2020-05-28 PersonB Team2 43.481041 0.038685 1.144240 50.057015
2 2020-05-28 PersonC Team1 47.277667 0.012190 1.047263 50.151846
3 2020-05-28 PersonD Team2 41.995354 0.040623 1.087151 50.412061
4 2020-05-28 PersonE Team1 49.824062 0.036805 1.416110 50.073381
... ... ... ... ... ... ... ...
6523 2021-11-22 PersonB Team2 46.799963 0.069523 1.322076 50.057015
6524 2021-11-22 PersonC Team1 48.851620 0.007291 1.473467 50.151846
6525 2021-11-22 PersonD Team2 49.711142 0.051443 1.044063 50.412061
6526 2021-11-22 PersonE Team1 57.074027 0.095908 1.464404 50.073381
6527 2021-11-22 PersonF Team2 41.372381 0.059240 1.132346 50.094965
[6528 rows x 7 columns]
But this is where it gets messy...
I don't need the runtime column, and I am unsure how to clean this up so that it only lists the 'name' and 'team' columns. Additionally, the way I have been producing my source dataframe(s) is by recreating the entire dataframe in a for loop for each time period with:
for pt in runtimes[:d]:
    <insert dataframe creation for d# of runtimes>
    if d == 7:
        dfw = df.groupby(['name', 'team'], as_index=True).apply(lambda gdf: gdf.assign(A_w=lambda gdf: gdf['A'].mean()))
    if d == 30:
        dfm = df.groupby(['name', 'team'], as_index=True).apply(lambda gdf: gdf.assign(A_m=lambda gdf: gdf['A'].mean()))
I am then attempting to concatenate the outputs like so:
dfs = pd.concat([dfw, dfm])
This works "OK" when d < 30, but when I'm looking at 90-100 days, it creates a dataframe with 50000+ rows and concats it with each other dataframe. Is there a way to perform this operation for x# of previous runtime values in-place?
Any tips on how to make this more efficient would be greatly appreciated.
An update...
I have been able to formulate a decent output by doing the following:
dfs = pd.DataFrame(columns=['name', 'team'])
for pt in runtimes[:d]:
    if d == 7:
        df = <insert dataframe creation for d# of runtimes>
        dfw = df.groupby(['name', 'team'], as_index=True).apply(lambda gdf: gdf.assign(A_w=lambda gdf: gdf['A'].mean()))
        ...
        dfw = dfw[['name', 'A_w', 'B_w', 'C_w', 'team']]
        dfs = pd.merge(dfs, dfw, how='inner', on=['name', 'team'])
    if d == 30:
        df = <insert dataframe creation for d# of runtimes>
        dfm = df.groupby(['name', 'team'], as_index=True).apply(lambda gdf: gdf.assign(A_m=lambda gdf: gdf['A'].mean()))
        ...
        dfm = dfm[['name', 'A_m', 'B_m', 'C_m', 'team']]
        dfs = pd.merge(dfs, dfm, how='inner', on=['name', 'team'])
This gives me the output that I am expecting.
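For comparison, a hedged sketch of one way to build the same wide layout without recreating the dataframe for every period: filter df on runtime once per window, aggregate, and concatenate the per-window means side by side. The window lengths and the _w/_m/_y/_at suffixes below are assumptions that mirror the layout asked for above:
last = df['runtime'].max()
windows = {'_w': 7, '_m': 30, '_y': 365, '_at': None}   # None means all-time
parts = []
for suffix, days in windows.items():
    sub = df if days is None else df[df['runtime'] > last - pd.Timedelta(days=days)]
    # mean of A, B, C per name/team for this window, with a period suffix
    parts.append(sub.groupby(['name', 'team'])[['A', 'B', 'C']].mean().add_suffix(suffix))
dfs = pd.concat(parts, axis=1).reset_index()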

pandas.isnull() not working on decimal type?

Am I missing something, or is there an issue with pandas.isnull()?
>>> import pandas as pd
>>> import decimal
>>> d = decimal.Decimal('NaN')
>>> d
Decimal('NaN')
>>> pd.isnull(d)
False
>>> f = float('NaN')
>>> f
nan
>>> pd.isnull(f)
True
>>> pd.isnull(float(d))
True
The problem is that I have a dataframe with decimal.Decimal values in it, and df.dropna() doesn't remove the NaNs for this reason...
Yes, this isn't supported. You can use the property that NaN does not equal itself, which still holds for Decimal types:
In [20]:
import pandas as pd
import decimal
d = decimal.Decimal('NaN')
df = pd.DataFrame({'a':[d]})
df
Out[20]:
a
0 NaN
In [21]:
df['a'].apply(lambda x: x != x)
Out[21]:
0 True
Name: a, dtype: bool
So you can do:
In [26]:
df = pd.DataFrame({'a':[d,1,2,3]})
df[df['a'].apply(lambda x: x == x)]
Out[26]:
a
1 1
2 2
3 3
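As a hedged follow-up, if you prefer to keep using dropna, you can first map Decimal('NaN') (which is not equal to itself) to float NaN:
import numpy as np
import pandas as pd
import decimal

d = decimal.Decimal('NaN')
df = pd.DataFrame({'a': [d, 1, 2, 3]})
# replace Decimal NaN with float NaN so isnull/dropna recognise it
df['a'] = df['a'].apply(lambda x: np.nan if x != x else x)
print(df.dropna())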
