How do I also aggregate the 'reviewer' lists together with the average of 'quantities'?
For a data frame like the one below, I can successfully calculate the average of the quantities per group over every 3 years. How do I add an extra column that aggregates the values of the 'reviewer' column for every period as well? For example, for company 'A' in year 1993, the column would hold [[p1, p2], [p3, p2], [p4]].
df = pd.DataFrame(data=[
        ['A', 1990, 2, ['p1', 'p2']],
        ['A', 1991, 3, ['p3', 'p2']],
        ['A', 1993, 5, ['p4']],
        ['A', 2000, 4, ['p1', 'p5', 'p7']],
        ['B', 2000, 1, ['p3']],
        ['B', 2001, 2, ['p6', 'p9']],
        ['B', 2002, 3, ['p10', 'p1']]],
    columns=['company', 'year', 'quantity', 'reviewer'])
df['rolling_average'] = (df.groupby(['company'])
                           .rolling(3).agg({'quantity': 'mean'})
                           .reset_index(level=[0], drop=True))
The output currently looks like:
| index | company | year | quantity | reviewer | rolling_average |
| :---- | :------ | :--- | :------- | :------- | :-------------- |
| 0 | A | 1990 | 2 | [p1, p2] | NaN |
| 1 | A | 1991 | 3 | [p3, p2] | NaN |
| 2 | A | 1993 | 5 | [p4] | 3.33 |
| 3     | A       | 2000 | 4        | [p1, p5, p7] | 4.00            |
| 4 | B | 2000 | 1 | [p3] | NaN |
| 5 | B | 2001 | 2 | [p6, p9] | NaN |
| 6 | B | 2002 | 3 | [p10, p1]| 2.00 |
Since rolling cannot take non-numeric values, we need to define the rolling aggregation ourselves here:
import numpy as np

n = 3
# per company, collect the last n 'reviewer' lists (NaN until the window has n rows)
df['new'] = (df.groupby(['company'])['reviewer']
               .apply(lambda x: [x[y-n:y].tolist() if y >= n else np.nan for y in range(1, len(x)+1)])
               .explode().values)
df
company year quantity reviewer new
0 A 1990 2 [p1, p2] NaN
1 A 1991 3 [p3, p2] NaN
2 A 1993 5 [p4] [[p1, p2], [p3, p2], [p4]]
3 A 2000 4 [p1, p5, p7] [[p3, p2], [p4], [p1, p5, p7]]
4 B 2000 1 [p3] NaN
5 B 2001 2 [p6, p9] NaN
6 B 2002 3 [p10, p1] [[p3], [p6, p9], [p10, p1]]
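To sanity-check against the example in the question, the row for company 'A' in 1993 should now hold the three reviewer lists of that window:

print(df.loc[(df.company == 'A') & (df.year == 1993), 'new'].iloc[0])
# [['p1', 'p2'], ['p3', 'p2'], ['p4']]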
I have a dataframe with YYYYMM columns that contain monthly totals on the row level:
| yearM | feature | 201902 | 201903 | 201904 | 201905 |... ... ... 202009
|-------|----------|--------|--------|--------|--------|
| 0     | feature1 | NaN    | NaN    | 9.0    | 32.0   |
| 1 | feature2 | 1.0 | 1.0 | 1.0 | 4.0 |
| 2     | feature3 | NaN    | 1.0    | 4.0    | 8.0    |
| 3 | feature4 | 9.0 | 15.0 | 19.0 | 24.0 |
| 4 | feature5 | 33.0 | 67.0 | 99.0 | 121.0 |
| 5 | feature6 | 12.0 | 15.0 | 17.0 | 19.0 |
| 6 | feature7 | 1.0 | 8.0 | 15.0 | 20.0 |
| 7     | feature8 | NaN    | NaN    | 1.0    | 9.0    |
I would like to convert the totals to the monthly change. The feature column should be excluded as I need to keep the feature names. The yearM in the index is a result of pivoting a dataframe to get the YYYYMM on the column level.
This is what the output would look like:
| yearM | feature | 201902 | 201903 | 201904 | 201905 |... ... ... 202009
|-------|----------|--------|--------|--------|--------|
| 0     | feature1 | NaN    | 0.0    | 9.0    | 23.0   |
| 1     | feature2 | 1.0    | 0.0    | 0.0    | 3.0    |
| 2     | feature3 | NaN    | 1.0    | 3.0    | 4.0    |
| 3     | feature4 | 9.0    | 6.0    | 4.0    | 5.0    |
| 4 | feature5 | 33.0 | 34.0 | 32.0 | 22.0 |
| 5 | feature6 | 12.0 | 3.0 | 2.0 | 2.0 |
| 6 | feature7 | 1.0 | 7.0 | 7.0 | 5.0 |
| 7     | feature8 | NaN    | 0.0    | 1.0    | 8.0    |
The row level values now represent the change compared to the previous month instead of having the total for the month.
I know that I should start by filling the NaN rows in the starting column 201902 with 0:
df['201902'] = df['201902'].fillna(0)
I could also calculate them one by one with something similar to this:
df['201903'] = df['201903'].fillna(0) - df['201902'].fillna(0)
df['201904'] = df['201904'].fillna(0) - df['201903'].fillna(0)
...
...
Hopefully there's a smarter solution, though.
Use iloc or drop to exclude the feature column, then diff with axis=1 for row-wise differences:
monthly_change = df.iloc[:, 1:].fillna(0).diff(axis=1)
# or
# monthly_change = df.drop(['feature'], axis=1).fillna(0).diff(axis=1)
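If you also want to keep the feature column and the first month's totals (as in the expected output above), a hedged sketch along those lines, assuming the month columns are simply every column except 'feature', could be:

month_cols = df.columns.drop('feature')             # all YYYYMM columns
changes = df[month_cols].fillna(0).diff(axis=1)      # month-over-month change per row
changes[month_cols[0]] = df[month_cols[0]]           # first month has no predecessor: keep its totals
out = pd.concat([df[['feature']], changes], axis=1)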
I'm working in Python with a pandas DataFrame similar to:
REQUESET_ID | DESCR | TEST | TEST_DESC | RESULT |
1 | 1 | T1 | TEST_1 | 2.0 |
1 | 2 | T2 | TEST_2 | 92.0 |
2 | 1 | T1 | TEST_1 | 8.0 |
3 | 3 | T3 | TEST_3 | 12.0 |
3 | 4 | T4 | TEST_4 | 45.0 |
What I want is a final dataframe like this:
REQUESET_ID | DESCR_1 | TEST_1 | TEST_DESC_1 | RESULT_1 | DESCR_2 | TEST_2 | TEST_DESC_2 | RESULT_2 |
1 | 1 | T1 | TEST_1 | 2.0 | 2 | T2 | TEST_2 | 92.0 |
2 | 1 | T1 | TEST_1 | 8.0 | NaN | NaN | NaN | NaN |
3 | 3 | T3 | TEST_3 | 12.0 | 4 | T4 | TEST_4 | 45.0 |
How should I implement that as a method working with DataFrames? I understand that if I try to do it with a merge, instead of having the 4x2 columns added (because value_counts of REQUESET_ID returns 2), it will add the 4 columns for each entry in the request column.
Assign a new column with cumcount, then do set_index + unstack:
s = (df.assign(col=(df.groupby('REQUESET_ID').cumcount() + 1).astype(str))
       .set_index(['REQUESET_ID', 'col'])
       .unstack()
       .sort_index(level=1, axis=1))
s.columns = s.columns.map('_'.join)
s
DESCR_1 RESULT_1 TEST_1 ... RESULT_2 TEST_2 TEST_DESC_2
REQUESET_ID ...
1 1.0 2.0 T1 ... 92.0 T2 TEST_2
2 1.0 8.0 T1 ... NaN NaN NaN
3 3.0 12.0 T3 ... 45.0 T4 TEST_4
[3 rows x 8 columns]
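If the exact column order from the question is wanted and REQUESET_ID should become an ordinary column again, a small hedged follow-up (assuming at most two entries per request, as in the sample) could be:

order = [f'{c}_{i}' for i in ('1', '2') for c in ('DESCR', 'TEST', 'TEST_DESC', 'RESULT')]
s = s.reindex(columns=order).reset_index()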
I am having trouble computing the number of days in a row until a condition is met.
It is shown in the following table, where 'Gap done' is the messy result I obtained with the solution from there, and 'Expected Gap' is the output I want to obtain.
+--------+------------+---------------------+----------+--------------+
| Player | Result     | Date                | Gap done | Expected Gap |
+--------+------------+---------------------+----------+--------------+
| K2000  | Lose       | 2015-11-13 13:42:00 | NaN      | NaN/0        |
| K2000  | Lose       | 2016-03-23 16:40:00 | 131.0    | 131.0        |
| K2000  | Lose       | 2016-05-16 19:17:00 | 54.0     | 185.0        |
| K2000  | Win        | 2016-06-09 19:36:00 | 54.0     | 239.0 (*)    |
| K2000  | Win        | 2016-06-30 14:05:00 | 54.0     | 54.0 (**)    |
| K2000  | Lose       | 2016-07-29 16:20:00 | 29.0     | 29.0         |
| K2000  | Win        | 2016-10-08 17:48:00 | 29.0     | 58.0         |
| Kssis  | Lose       | 2007-02-25 15:05:00 | NaN      | NaN/0        |
| Kssis  | Lose       | 2007-04-25 16:07:00 | 59.0     | 59.0         |
| Kssis  | Not ranked | 2007-06-01 16:54:00 | 37.0     | 96.0         |
| Kssis  | Lose       | 2007-09-09 14:33:00 | 99.0     | 195.0        |
| Kssis  | Lose       | 2008-04-06 16:27:00 | 210.0    | 405.0        |
+--------+------------+---------------------+----------+--------------+
(*) he always lost before.
(**) because he won last time, it's 54 days between this date and the last time he won.
The issue with the solution there is that it does not really compute dates; it just happens that the dates in that example are always separated by one day.
Sure, I adapted it with
def sum_days_in_row_with_condition(g):
    sorted_g = g.sort_values(by='Date', ascending=True)
    condition = sorted_g['Result'] == 'Win'
    sorted_g['days-in-a-row'] = sorted_g['Date'].diff().dt.days.where(~condition).ffill()
    return sorted_g
But as I showed you, this is messy.
So I thought about a solution, but it needs global variables (outside the function), and that's a little tedious.
Can anyone help solve this problem in a simpler way?
Pandas version: 0.23.4 Python version: 3.7.4
IIUC, you need to find the boolean mask m1 marking rows where a 'Win' is preceded by another 'Win'. From m1, build a group ID s that starts a new group whenever that flag changes. Then group by Player and s and take the cumulative sum of 'Gap done':
m = df.Result.eq('Win')                     # rows where the player won
m1 = m & m.shift()                          # win immediately preceded by another win
s = m1.ne(m1.shift()).cumsum()              # new group ID each time that flag changes
df['Expected Gap'] = df.groupby(['Player', s])['Gap done'].cumsum()
Out[808]:
Player Result Date Gap done Expected Gap
0 K2000 Lose 2015-11-13 13:42:00 NaN NaN
1 K2000 Lose 2016-03-23 16:40:00 131.0 131.0
2 K2000 Lose 2016-05-16 19:17:00 54.0 185.0
3 K2000 Win 2016-06-09 19:36:00 54.0 239.0
4 K2000 Win 2016-06-30 14:05:00 54.0 54.0
5 K2000 Lose 2016-07-29 16:20:00 29.0 29.0
6 K2000 Win 2016-10-08 17:48:00 29.0 58.0
7 Kssis Lose 2007-02-25 15:05:00 NaN NaN
8    Kssis        Lose 2007-04-25 16:07:00     59.0          59.0
9    Kssis  Not ranked 2007-06-01 16:54:00     37.0          96.0
10 Kssis Lose 2007-09-09 14:33:00 99.0 195.0
11 Kssis Lose 2008-04-06 16:27:00 210.0 405.0
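As a side note, the 'Gap done' column itself can be computed from the real dates rather than assumed. A minimal sketch, assuming Date is (or can be parsed as) a datetime column and the frame is then sorted per player:

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Player', 'Date'])
# whole days since the player's previous game
df['Gap done'] = df.groupby('Player')['Date'].diff().dt.days

The 'Expected Gap' cumulative sum above can then be applied on top of this column.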
I would like to detect in a dataframe the start and end (Datetime) of consecutive sets of rows with all the values being NaN.
What is the best way to store the results in an array of tuples with the start and end of each set of datetimes with NaN values?
For example, using the dataframe below, the tuples should look like this:
missing_datetimes = [('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
                     ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
                     ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
Example of dataframe:
+------------+---------------------+------------+------------+
| geo_id | Datetime | Variable1 | Variable2 |
+------------+---------------------+------------+------------+
| 1 | 2018-10-10 18:00:00 | 20 | 10 |
| 2 | 2018-10-10 18:00:00 | 22 | 10 |
| 1 | 2018-10-10 19:00:00 | 20 | nan |
| 2 | 2018-10-10 19:00:00 | 21 | nan |
| 1 | 2018-10-10 20:00:00 | 30 | nan |
| 2 | 2018-10-10 20:00:00 | 30 | nan |
| 1 | 2018-10-10 21:00:00 | nan | 5 |
| 2 | 2018-10-10 21:00:00 | nan | 5 |
| 1 | 2018-10-10 22:00:00 | nan | nan |
| 1 | 2018-10-10 23:00:00 | nan | nan |
| 1 | 2018-10-11 00:00:00 | nan | nan |
| 1 | 2018-10-11 01:00:00 | 5 | 2 |
| 1 | 2018-10-11 02:00:00 | nan | nan |
| 1 | 2018-10-11 03:00:00 | 2 | 1 |
| 1 | 2018-10-11 04:00:00 | nan | nan |
+------------+---------------------+------------+------------+
Update: And what if some datetimes are duplicated?
You may need to use groupby with a condition:
# flag rows where every variable column is NaN; consecutive flagged rows fall into one group
s = df.drop(columns=['geo_id', 'Datetime']).isnull().all(axis=1)
df.loc[s, 'Datetime'].groupby((~s).cumsum()[s]).agg(['first', 'last']).apply(tuple, axis=1).tolist()
Out[89]:
[('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
 ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
 ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
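For the update about duplicated datetimes (several rows sharing one timestamp, e.g. one per geo_id), one possible adaptation is to collapse the frame to a single flag per timestamp and then reuse the same run-grouping idea. A sketch, assuming a timestamp only counts as missing when every value of every row carrying it is NaN, and that geo_id should not take part in that test:

# one boolean per unique timestamp: True when all values of all rows for that timestamp are NaN
s = (df.drop(columns='geo_id')
       .set_index('Datetime')
       .isnull()
       .groupby(level='Datetime')
       .all()
       .all(axis=1))

# group consecutive missing timestamps and keep the first/last of each run
missing_datetimes = (s.index[s].to_series()
                       .groupby((~s).cumsum()[s])
                       .agg(['first', 'last'])
                       .apply(tuple, axis=1)
                       .tolist())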