Get value from grouped data frame at the maximum of another column

Return and assign the value of column c based on the maximum value of column b in a grouped data frame, grouped on column a.
I'm trying to return and assign (ideally in a different column) the value of column c, determined by the row with the maximum in column b (which are dates). The dataframe is grouped by column a, but I keep getting errors. The data frame is relatively large, with over 16,000 rows.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['a','a','a','b','b','b','c','c','c'],
'b': ['2008-11-01', '2022-07-01', '2017-02-01', '2017-02-01', '2018-02-01', '2008-11-01', '2014-11-01', '2008-11-01', '2022-07-01'],
'c': [np.nan,6,8,2,1,np.nan,6,np.nan,7]})
df['b'] = pd.to_datetime(df['b'])
df
a b c
0 a 2008-11-01 NaN
1 a 2022-07-01 6.0
2 a 2017-02-01 8.0
3 b 2017-02-01 2.0
4 b 2018-02-01 1.0
5 b 2008-11-01 NaN
6 c 2014-11-01 6.0
7 c 2008-11-01 NaN
8 c 2022-07-01 7.0
I want the following result:
a b c d
0 a 2008-11-01 NaN 8.0
1 a 2022-07-01 6.0 8.0
2 a 2017-02-01 8.0 8.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
I've tried the following, but grouped series ('SeriesGroupBy') cannot be cast to another dtype:
df['d'] = df.groupby('a').b.astype(np.int64).max()[df.c].reset_index()
I've set column b to int before running the above code, but get an error saying none of the values in c are in the index.
None of [Index([....values in column c...\n dtype = 'object', name = 'a', length = 16280)] are in the [index]
I've tried expanding out the command but still get errors (which I just realized relate to no alternative values being given in the pd.where function, but I don't know how to give the correct value):
df= df.groupby('a')
df['d'] = df.['c'].where(grouped['b'] == grouped['b'].max())
I've also tried using solutions provided here, here, and here.
Any help or direction would be appreciated.

I'm assuming the first three values should be 6.0 (because maximum date for group a is 2022-07-01 with value 6):
df["d"] = df["a"].map(
df.groupby("a").apply(lambda x: x.loc[x["b"].idxmax(), "c"])
)
print(df)
Prints:
a b c d
0 a 2008-11-01 NaN 6.0
1 a 2022-07-01 6.0 6.0
2 a 2017-02-01 8.0 6.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
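A variant that avoids groupby.apply is sketched below, assuming the same sample frame as in the question. It looks up the row label of the latest date per group with idxmax, reads c on those rows, and maps the result back onto column a. The names latest_idx and c_at_latest are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "b": pd.to_datetime([
        "2008-11-01", "2022-07-01", "2017-02-01",
        "2017-02-01", "2018-02-01", "2008-11-01",
        "2014-11-01", "2008-11-01", "2022-07-01",
    ]),
    "c": [np.nan, 6, 8, 2, 1, np.nan, 6, np.nan, 7],
})

# Row label of the latest date in each group of "a".
latest_idx = df.groupby("a")["b"].idxmax()

# Value of "c" on those rows, keyed by the group label.
c_at_latest = df.loc[latest_idx].set_index("a")["c"]

# Broadcast the per-group value back to every row of the group.
df["d"] = df["a"].map(c_at_latest)
print(df)
```

If the row with the latest date has NaN in c, that NaN is what gets propagated to the whole group; filtering with dropna(subset=["c"]) first would be one way to change that, if it is the behaviour you want.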

Related

Calculate difference between consecutive date records at an ID level

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried grouping by col 1, but I'm not getting the intended result. Can anyone help?
Use GroupBy.cumcount for a counter per group of col 1 and reshape by DataFrame.set_index with Series.unstack, then use DataFrame.diff, remove the first column (which contains only NaNs) by DataFrame.iloc, convert timedeltas to days by Series.dt.days for each column, and change the column names by DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1',df.groupby('col 1').cumcount()])['col 2']
.unstack()
.diff(axis=1)
.iloc[:, 1:]
.apply(lambda x: x.dt.days)
.add_prefix('diff_')
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff with a counter for the new columns by DataFrame.assign, remove the rows where c1 is NaN with DataFrame.dropna, and reshape by DataFrame.pivot:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g = df.groupby('col 1').cumcount(),
c1 = df.groupby('col 1')['col 2'].diff().dt.days)
.dropna(subset=['c1'])
.pivot(index='col 1', columns='g', values='c1')
.add_prefix('diff_')
.rename_axis(None, axis=1)
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. pivot on cumcount, discarding the first difference of each group (always NaN)
df2 = df[["col 1", "diff", "cumcount"]]\
.pivot(index="col 1", columns="cumcount")\
.drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
After assigning cumcount, df looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

Rearrange columns in DataFrame

Having a DataFrame structured as follows:
country A B C D
0 Albany 5.2 4.7 253.75 4
1 China 7.5 3.4 280.72 3
2 Portugal 4.6 7.5 320.00 6
3 France 8.4 3.6 144.00 3
4 Greece 2.1 10.0 331.00 6
I wanted to get something like this:
cost A B
country C D C D
Albany 2.05 4 1.85 4
China 2.67 3 1.21 3
Portugal 1.44 6 2.34 6
France 5.83 3 2.50 3
Greece 0.63 6 3.02 6
That is, columns A and B should become top-level headers over C and D. D keeps its original value under each header, and C holds the header's value as a percentage of the original C. Example for Albany:
value C in A: (5.2/253.75)*100 = 2.05
value C in B: (4.7/253.75)*100 = 1.85
Is there any way to do it?
Thanks!
You can divide multiple columns (here A and B) by C with DataFrame.div, then use DataFrame.reindex with a MultiIndex created by MultiIndex.from_product, and finally set the D columns from the original column with MultiIndex slicers:
cols = ['A','B']
mux = pd.MultiIndex.from_product([cols, ['C', 'D']])
df1 = df[cols].div(df['C'], axis=0).mul(100).reindex(mux, axis=1, level=0)
idx = pd.IndexSlice
df1.loc[:, idx[:, 'D']] = df[['D'] * len(cols)].to_numpy()
#pandas below 0.24
#df1.loc[:, idx[:, 'D']] = df[['D'] * len(cols)].values
print (df1)
A B
C D C D
0 2.049261 4 1.852217 4
1 2.671701 3 1.211171 3
2 1.437500 6 2.343750 6
3 5.833333 3 2.500000 3
4 0.634441 6 3.021148 6
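Another way to build the same two-level header is sketched below, assuming the sample data from the question with country set as the index. It builds one small frame per header column and lets pd.concat create the MultiIndex from the dict keys. The names blocks and out are only illustrative, and the rounding to two decimals is there to match the requested output.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Albany", "China", "Portugal", "France", "Greece"],
    "A": [5.2, 7.5, 4.6, 8.4, 2.1],
    "B": [4.7, 3.4, 7.5, 3.6, 10.0],
    "C": [253.75, 280.72, 320.00, 144.00, 331.00],
    "D": [4, 3, 6, 3, 6],
}).set_index("country")

cols = ["A", "B"]

# One sub-frame per header column: C is the percentage of the original C,
# D is copied unchanged.
blocks = {
    col: pd.DataFrame({"C": df[col] / df["C"] * 100, "D": df["D"]})
    for col in cols
}

# Dict keys become the top level of the column MultiIndex.
out = pd.concat(blocks, axis=1).round(2)
print(out)
```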

How to fill NaN with user defined value in pandas dataframe

How to fill NaN with user defined value in pandas dataframe.
For text columns like A and B, a user-defined text such as 'Missing' should be imputed. For discrete numeric variables like C and D, the median value should be imputed. I have many columns like these, so I would like to apply the rule to all variables in the dataframe.
DF
A B C D
A0A1 Railway 10 NaN
A1A1 Shipping NaN 1
NaN Shipping 3 2
B1A1 NaN 1 7
DF out:
A B C D
A0A1 Railway 10 2
A1A1 Shipping 3 1
Missing Shipping 3 2
B1A1 Missing 1 7
You can pass a dict to fillna, with one fill value per column:
df.fillna({'A':'Miss','B':"Your2",'C':df.C.median(),'D':df.D.mean()})
Out[373]:
A B C D
0 A0A1 Railway 10.0 3.333333
1 A1A1 Shipping 3.0 1.000000
2 Miss Shipping 3.0 2.000000
3 B1A1 Your2 1.0 7.000000
Fun way!
d = {np.dtype('O'): 'Missing'}
df.fillna(df.dtypes.map(d).fillna(df.median(numeric_only=True)))
A B C D
0 A0A1 Railway 10.0 2.0
1 A1A1 Shipping 3.0 1.0
2 Missing Shipping 3.0 2.0
3 B1A1 Missing 1.0 7.0
First fill the numeric columns with their medians, then fill the remaining non-numeric NaNs with fillna:
df = df.fillna(df.median(numeric_only=True)).fillna('Missing')
print (df)
A B C D
0 A0A1 Railway 10.0 2.0
1 A1A1 Shipping 3.0 1.0
2 Missing Shipping 3.0 2.0
3 B1A1 Missing 1.0 7.0
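When there are many columns, it can be clearer to split them by dtype explicitly. The following is a minimal sketch, assuming the sample frame from the question. It selects numeric columns with select_dtypes and treats everything else as text; num_cols, txt_cols and filled are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["A0A1", "A1A1", np.nan, "B1A1"],
    "B": ["Railway", "Shipping", "Shipping", np.nan],
    "C": [10, np.nan, 3, 1],
    "D": [np.nan, 1, 2, 7],
})

# Numeric columns get their own median; every other column gets a fixed label.
num_cols = df.select_dtypes(include="number").columns
txt_cols = df.columns.difference(num_cols)

filled = df.copy()
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())
filled[txt_cols] = filled[txt_cols].fillna("Missing")
print(filled)
```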

Get number of rows for all combinations of attribute levels in Pandas

I have a dataframe with a bunch of categorical variables, where each row corresponds to a product. I wanted to find the number of rows for every combination of attribute levels and decided to run the following:
att1=list(frame_base.columns.values)
f1=att.groupby(att1,as_index=False).size().rename('counts').to_frame()
att1 is the list of all attributes. f1 does not seem to give the correct values, as f1.counts.sum() is not equal to the number of rows in the frame before the group by. Why doesn't this work?
One possible cause is rows containing NaN, which groupby excludes from the group keys by default. There may also be a typo: att is needed instead of frame_base:
att = pd.DataFrame({'A':[1,1,3,np.nan],
'B':[1,1,6,np.nan],
'C':[2,2,9,np.nan],
'D':[1,1,5,np.nan],
'E':[1,1,6,np.nan],
'F':[1,1,3,np.nan]})
print (att)
A B C D E F
0 1.0 1.0 2.0 1.0 1.0 1.0
1 1.0 1.0 2.0 1.0 1.0 1.0
2 3.0 6.0 9.0 5.0 6.0 3.0
3 NaN NaN NaN NaN NaN NaN
att1=list(att.columns.values)
f1=att.groupby(att1).size().reset_index(name='counts')
print (f1)
A B C D E F counts
0 1.0 1.0 2.0 1.0 1.0 1.0 2
1 3.0 6.0 9.0 5.0 6.0 3.0 1
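To check whether NaN rows explain the mismatch, the counts can be compared with and without them. The sketch below assumes pandas 1.1 or later, where groupby accepts dropna=False, and reuses the sample frame above; default_counts and all_counts are illustrative names.

```python
import numpy as np
import pandas as pd

att = pd.DataFrame({
    "A": [1, 1, 3, np.nan],
    "B": [1, 1, 6, np.nan],
    "C": [2, 2, 9, np.nan],
    "D": [1, 1, 5, np.nan],
    "E": [1, 1, 6, np.nan],
    "F": [1, 1, 3, np.nan],
})
att1 = list(att.columns)

# Default behaviour: rows with NaN in any grouping key are left out.
default_counts = att.groupby(att1).size().reset_index(name="counts")

# dropna=False keeps NaN as its own key, so the counts add up to len(att).
all_counts = att.groupby(att1, dropna=False).size().reset_index(name="counts")

print(default_counts["counts"].sum(), all_counts["counts"].sum(), len(att))
```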

How to get ranges of one column grouped by a class column in Pandas?

I'm practicing with Pandas and I want to get the ranges of a column from a dataframe by the values of another column.
An example dataset:
Points Grade
1 7.5 C
2 9.3 A
3 NaN A
4 1.3 F
5 8.7 B
6 9.5 A
7 7.9 C
8 4.5 F
9 8.0 B
10 6.8 D
11 5.0 D
I want to group the ranges of points for each grade so I can infer the missing values.
For that I need to get something like this:
Grade Points
A [9.5, 9.3]
B [8.7, 8.0]
C [7.5, 7.9]
D [6.8, 5.0]
F [1.3, 4.5]
I can get it with for loops and the like, but is there an easy way to do it with pandas?
I tried all the groupby combinations I know, with no luck. Any suggestions?
You can first filter df with notnull and then groupby and tolist with reset_index:
print(df)
Points Grade
0 7.5 C
1 9.3 A
2 NaN A
3 1.3 F
4 8.7 B
5 9.5 A
6 7.9 C
7 4.5 F
8 8.0 B
9 6.8 D
10 5.0 D
print(df['Points'].notnull())
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
Name: Points, dtype: bool
print(df.loc[df['Points'].notnull()])
Points Grade
0 7.5 C
1 9.3 A
3 1.3 F
4 8.7 B
5 9.5 A
6 7.9 C
7 4.5 F
8 8.0 B
9 6.8 D
10 5.0 D
print(df.loc[df['Points'].notnull()].groupby('Grade')['Points']
.apply(lambda x: x.tolist()).reset_index())
Grade Points
0 A [9.3, 9.5]
1 B [8.7, 8.0]
2 C [7.5, 7.9]
3 D [6.8, 5.0]
4 F [1.3, 4.5]
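If only the bounds of each grade are needed rather than the full list of values, a min/max aggregation may be enough. This is a minimal sketch, assuming the sample data above; min and max skip NaN by default, so no explicit filtering is needed here. The name ranges is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Points": [7.5, 9.3, np.nan, 1.3, 8.7, 9.5, 7.9, 4.5, 8.0, 6.8, 5.0],
    "Grade": list("CAAFBACFBDD"),
})

# Lowest and highest observed Points per Grade; NaN values are ignored.
ranges = df.groupby("Grade")["Points"].agg(["min", "max"])
print(ranges)
```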
