Create DataFrames or Dictionaries from Unique Values in Separate Columns - python-3.x

I'm a Python newbie trying to model stock trades from a DataFrame containing timestamped trade executions. For example, rows 0-2 below represent three executions of a single trade. I'm trying to isolate each group of executions that represents one trade. Here's a sample of the existing aggregate:
event side symbol shares price time trade_date
0 Execute Shrt TSLA 25 311.9500 10:29:51 2019-01-19
1 Execute Buy TSLA 20 310.7300 10:30:42 2019-01-19
2 Execute Buy TSLA 5 311.1173 10:31:15 2019-01-19
3 Execute Buy BYND 25 83.3027 11:06:15 2019-01-19
4 Execute Shrt BYND 13 84.0500 11:07:11 2019-01-19
5 Execute Sell BYND 12 83.2500 11:07:42 2019-01-19
6 Execute Buy NVDA 25 297.3400 12:07:42 2019-01-20
7 Execute Shrt AMZN 10 500.0100 12:09:12 2019-01-20
8 Execute Sell NVDA 25 296.7500 12:10:30 2019-01-20
9 Execute Buy AMZN 10 495.7500 12:11:15 2019-01-20
The approach in this post creates slices from unique values in a single column, but I'm unsure of how to make the second slice. With this approach applied, I have:
date_list = list(set(execs_df['trade_date']))  # Create list of dates from original DataFrame
by_date_dict = {date: execs_df.loc[execs_df['trade_date'] == date] for date in date_list}

for date in date_list:
    print(by_date_dict[date])
This produces the following, date-specific dictionaries:
side symbol shares price time trade_date p & l trades value
0 Shrt TSLA 25.0 311.9500 10:29:51 2019-11-01 NaN NaN 7798.7500
1 Buy TSLA 8.0 311.2000 10:30:31 2019-11-01 NaN NaN 2489.6000
2 Buy TSLA 8.0 310.7300 10:30:42 2019-11-01 NaN NaN 2485.8400
3 Buy TSLA 4.0 311.1173 10:31:15 2019-11-01 NaN NaN 1244.4692
4 Buy TSLA 5.0 311.5500 10:35:39 2019-11-01 NaN NaN 1557.7500
5 Shrt BYND 25.0 83.3027 11:06:15 2019-11-01 NaN NaN 2082.5675
6 Buy BYND 12.0 83.0500 11:06:43 2019-11-01 NaN NaN 996.6000
7 Buy BYND 13.0 83.2400 11:07:49 2019-11-01 NaN NaN 1082.1200
In terms of final output, I need the following:
side symbol shares price time trade_date p & l trades value
0 Shrt TSLA 25.0 311.9500 10:29:51 2019-11-01 NaN NaN 7798.7500
1 Buy TSLA 8.0 311.2000 10:30:31 2019-11-01 NaN NaN 2489.6000
2 Buy TSLA 8.0 310.7300 10:30:42 2019-11-01 NaN NaN 2485.8400
3 Buy TSLA 4.0 311.1173 10:31:15 2019-11-01 NaN NaN 1244.4692
4 Buy TSLA 5.0 311.5500 10:35:39 2019-11-01 NaN NaN 1557.7500
side symbol shares price time trade_date p & l trades value
0 Shrt BYND 25.0 83.3027 11:06:15 2019-11-01 NaN NaN 2082.5675
1 Buy BYND 12.0 83.0500 11:06:43 2019-11-01 NaN NaN 996.6000
2 Buy BYND 13.0 83.2400 11:07:49 2019-11-01 NaN NaN 1082.1200
etc...
Any pointers would be greatly appreciated.

Given your current dictionary of DataFrames, by_date_dict:
The following code builds a dict of dicts of DataFrames.
The top-level key is still the date.
Under each date key there is a key for each symbol (e.g. updated_df['2019-11-01']['BYND']).
updated_df = {k: {sym: v[v.symbol == sym] for sym in v.symbol.unique()} for k, v in by_date_dict.items()}
# structure
{date: {symbol: df,
        symbol: df,
        symbol: df},
 date: {symbol: df,
        symbol: df,
        symbol: df}}
for k, v in updated_df.items():
    print(k)
    for x, y in v.items():
        print(x)
        print(y.to_markdown())
2019-11-01
TSLA
| | side | symbol | shares | price | time | trade_date | pl | trades | value |
|---:|:-------|:---------|---------:|--------:|:---------|:-------------|-----:|---------:|--------:|
| 0 | Shrt | TSLA | 25 | 311.95 | 10:29:51 | 2019-11-01 | nan | nan | 7798.75 |
| 1 | Buy | TSLA | 8 | 311.2 | 10:30:31 | 2019-11-01 | nan | nan | 2489.6 |
| 2 | Buy | TSLA | 8 | 310.73 | 10:30:42 | 2019-11-01 | nan | nan | 2485.84 |
| 3 | Buy | TSLA | 4 | 311.117 | 10:31:15 | 2019-11-01 | nan | nan | 1244.47 |
| 4 | Buy | TSLA | 5 | 311.55 | 10:35:39 | 2019-11-01 | nan | nan | 1557.75 |
BYND
| | side | symbol | shares | price | time | trade_date | pl | trades | value |
|---:|:-------|:---------|---------:|--------:|:---------|:-------------|-----:|---------:|--------:|
| 5 | Shrt | BYND | 25 | 83.3027 | 11:06:15 | 2019-11-01 | nan | nan | 2082.57 |
| 6 | Buy | BYND | 12 | 83.05 | 11:06:43 | 2019-11-01 | nan | nan | 996.6 |
| 7 | Buy | BYND | 13 | 83.24 | 11:07:49 | 2019-11-01 | nan | nan | 1082.12 |
Access a specific key:
updated_df['2019-11-01']['BYND']
| | side | symbol | shares | price | time | trade_date | pl | trades | value |
|---:|:-------|:---------|---------:|--------:|:---------|:-------------|-----:|---------:|--------:|
| 5 | Shrt | BYND | 25 | 83.3027 | 11:06:15 | 2019-11-01 | nan | nan | 2082.57 |
| 6 | Buy | BYND | 12 | 83.05 | 11:06:43 | 2019-11-01 | nan | nan | 996.6 |
| 7 | Buy | BYND | 13 | 83.24 | 11:07:49 | 2019-11-01 | nan | nan | 1082.12 |
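If the goal is ultimately one DataFrame per (date, symbol) trade, a more compact alternative sketch (assuming execs_df has the trade_date and symbol columns shown above) is to group on both columns at once:
by_trade = {key: grp for key, grp in execs_df.groupby(['trade_date', 'symbol'])}
# each key is a (trade_date, symbol) tuple, e.g.:
# by_trade[('2019-11-01', 'BYND')]
This skips the intermediate per-date dictionary but gives the same per-trade slices.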

Related

rolling average and aggregate more than one column in pandas

How do I also aggregate the 'reviewer' lists together with the average of 'quantity'?
For a DataFrame like the one below, I can successfully calculate the average of the quantities per group over every 3 years. How do I add an extra column that aggregates the values of the 'reviewer' column for every period as well? For example, for company 'A' in year 1993, the column would be [[p1,p2],[p3,p2],[p4]].
df = pd.DataFrame(data=[
    ['A', 1990, 2, ['p1', 'p2']],
    ['A', 1991, 3, ['p3', 'p2']],
    ['A', 1993, 5, ['p4']],
    ['A', 2000, 4, ['p1', 'p5', 'p7']],
    ['B', 2000, 1, ['p3']],
    ['B', 2001, 2, ['p6', 'p9']],
    ['B', 2002, 3, ['p10', 'p1']]], columns=['company', 'year', 'quantity', 'reviewer'])
df['rolling_average'] = (df.groupby(['company'])
                           .rolling(3).agg({'quantity': 'mean'})
                           .reset_index(level=[0], drop=True))
The output currently looks like:
| index | company | year | quantity | reviewer | rolling_average |
| :---- | :------ | :--- | :------- | :------- | :-------------- |
| 0 | A | 1990 | 2 | [p1, p2] | NaN |
| 1 | A | 1991 | 3 | [p3, p2] | NaN |
| 2 | A | 1993 | 5 | [p4] | 3.33 |
| 3 | A | 2000 | 4 | [p1, p5, p7] | 4.00 |
| 4 | B | 2000 | 1 | [p3] | NaN |
| 5 | B | 2001 | 2 | [p6, p9] | NaN |
| 6 | B | 2002 | 3 | [p10, p1]| 2.00 |
Since rolling cannot take non-numeric values, we need to define the rolling window ourselves here:
import numpy as np  # np.nan is used to mark incomplete windows

n = 3
df['new'] = df.groupby(['company'])['reviewer'].apply(
    lambda x: [x[y-n:y].tolist() if y >= n else np.nan for y in range(1, len(x)+1)]).explode().values
df
company year quantity reviewer new
0 A 1990 2 [p1, p2] NaN
1 A 1991 3 [p3, p2] NaN
2 A 1993 5 [p4] [[p1, p2], [p3, p2], [p4]]
3 A 2000 4 [p1, p5, p7] [[p3, p2], [p4], [p1, p5, p7]]
4 B 2000 1 [p3] NaN
5 B 2001 2 [p6, p9] NaN
6 B 2002 3 [p10, p1] [[p3], [p6, p9], [p10, p1]]
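A more explicit sketch of the same idea (an assumption, not part of the answer above): loop over the groups and collect the previous n reviewer lists by position, which sidesteps positional Series slicing.
import numpy as np
import pandas as pd

n = 3
new_col = pd.Series(np.nan, index=df.index, dtype=object)
for _, g in df.groupby('company'):
    vals = g['reviewer'].tolist()
    for pos, idx in enumerate(g.index):
        if pos + 1 >= n:                      # window is full
            new_col.at[idx] = vals[pos - n + 1:pos + 1]
df['new'] = new_col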

Calculating the difference in value between columns

I have a dataframe with YYYYMM columns that contain monthly totals on the row level:
| yearM | feature | 201902 | 201903 | 201904 | 201905 |... ... ... 202009
|-------|----------|--------|--------|--------|--------|
| 0 | feature1 | NaN | NaN | 9.0 | 32.0 |
| 1 | feature2 | 1.0 | 1.0 | 1.0 | 4.0 |
| 2 | feature3 | NaN | 1.0 | 4.0 | 8.0 |
| 3 | feature4 | 9.0 | 15.0 | 19.0 | 24.0 |
| 4 | feature5 | 33.0 | 67.0 | 99.0 | 121.0 |
| 5 | feature6 | 12.0 | 15.0 | 17.0 | 19.0 |
| 6 | feature7 | 1.0 | 8.0 | 15.0 | 20.0 |
| 7 | feature8 | NaN | NaN | 1.0 | 9.0 |
I would like to convert the totals to the monthly change. The feature column should be excluded as I need to keep the feature names. The yearM in the index is a result of pivoting a dataframe to get the YYYYMM on the column level.
This is how the output should look:
| yearM | feature | 201902 | 201903 | 201904 | 201905 |... ... ... 202009
|-------|----------|--------|--------|--------|--------|
| 0 | feature1 | NaN | 0.0 | 9.0 | 23.0 |
| 1 | feature2 | 1.0 | 0.0 | 0.0 | 3.0 |
| 2 | feature3 | NaN | 1.0 | 3.0 | 5.0 |
| 3 | feature4 | 9.0 | 6.0 | 4.0 | 5.0 |
| 4 | feature5 | 33.0 | 34.0 | 32.0 | 22.0 |
| 5 | feature6 | 12.0 | 3.0 | 2.0 | 2.0 |
| 6 | feature7 | 1.0 | 7.0 | 7.0 | 5.0 |
| 7 | feature8 | NaN | 0.0 | 1.0 | 8.0 |
The row level values now represent the change compared to the previous month instead of having the total for the month.
I know that I should start by filling the NaN rows in the starting column 201902 with 0:
df['201902'] = df['201902'].fillna(0)
I could also calculate them one by one with something similar to this:
df['201902'] = df['201902'].fillna(0) - df['201901'].fillna(0)
df['201903'] = df['201903'].fillna(0) - df['201902'].fillna(0)
df['201904'] = df['201904'].fillna(0) - df['201903'].fillna(0)
...
...
Hopefully there's a smarter solution, though.
Use iloc or drop to exclude the feature column, then use diff with axis=1 to take differences across the columns within each row:
monthly_change = df.iloc[:, 1:].fillna(0).diff(axis=1)
# or
# monthly_change = df.drop(['feature'], axis=1).fillna(0).diff(axis=1)
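One detail to watch: diff(axis=1) leaves the first month as NaN. A minimal sketch (using a hypothetical cut-down version of the table above) that keeps the first month's totals and reattaches the feature column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'feature': ['feature1', 'feature2', 'feature4'],
                   '201902': [np.nan, 1.0, 9.0],
                   '201903': [np.nan, 1.0, 15.0],
                   '201904': [9.0, 1.0, 19.0],
                   '201905': [32.0, 4.0, 24.0]})

values = df.drop(columns='feature').fillna(0)
monthly_change = values.diff(axis=1)
monthly_change.iloc[:, 0] = values.iloc[:, 0]   # keep the first month's totals
result = pd.concat([df[['feature']], monthly_change], axis=1)
print(result)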

How to create new columns in pandas dataframe using column values?

I'm working in Python with a pandas DataFrame similar to:
REQUESET_ID | DESCR | TEST | TEST_DESC | RESULT |
1 | 1 | T1 | TEST_1 | 2.0 |
1 | 2 | T2 | TEST_2 | 92.0 |
2 | 1 | T1 | TEST_1 | 8.0 |
3 | 3 | T3 | TEST_3 | 12.0 |
3 | 4 | T4 | TEST_4 | 45.0 |
What I want is a final dataframe like this:
REQUESET_ID | DESCR_1 | TEST_1 | TEST_DESC_1 | RESULT_1 | DESCR_2 | TEST_2 | TEST_DESC_2 | RESULT_2 |
1 | 1 | T1 | TEST_1 | 2.0 | 2 | T2 | TEST_2 | 92.0 |
2 | 1 | T1 | TEST_1 | 8.0 | NaN | NaN | NaN | NaN |
3 | 3 | T3 | TEST_3 | 12.0 | 4 | T4 | TEST_4 | 45.0 |
How should I implement this as a method working with DataFrames? I understand that if I try to do it with a merge, instead of having 4x2 columns added (because value_counts on REQUESET_ID returns 2), it will add the 4 columns for each entry in the request column.
Assign a new column with cumcount, then do set_index + unstack:
s = df.assign(col=(df.groupby('REQUESET_ID').cumcount()+1).astype(str)) \
      .set_index(['REQUESET_ID', 'col']).unstack().sort_index(level=1, axis=1)
s.columns = s.columns.map('_'.join)
s
DESCR_1 RESULT_1 TEST_1 ... RESULT_2 TEST_2 TEST_DESC_2
REQUESET_ID ...
1 1.0 2.0 T1 ... 92.0 T2 TEST_2
2 1.0 8.0 T1 ... NaN NaN NaN
3 3.0 12.0 T3 ... 45.0 T4 TEST_4
[3 rows x 8 columns]
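For reference, a self-contained sketch of the same approach on the sample data from the question (the literals below are just a transcription of that table):
import pandas as pd

df = pd.DataFrame({'REQUESET_ID': [1, 1, 2, 3, 3],
                   'DESCR': [1, 2, 1, 3, 4],
                   'TEST': ['T1', 'T2', 'T1', 'T3', 'T4'],
                   'TEST_DESC': ['TEST_1', 'TEST_2', 'TEST_1', 'TEST_3', 'TEST_4'],
                   'RESULT': [2.0, 92.0, 8.0, 12.0, 45.0]})

# number the rows within each request, then pivot those row numbers into column suffixes
s = df.assign(col=(df.groupby('REQUESET_ID').cumcount() + 1).astype(str)) \
      .set_index(['REQUESET_ID', 'col']).unstack().sort_index(level=1, axis=1)
s.columns = s.columns.map('_'.join)   # flatten ('DESCR', '1') -> 'DESCR_1'
print(s)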

Difference of days between two dates until a condition is met

I'm having trouble computing the number of days in a row until a condition is met.
In the following table, 'Gap done' is the messy column I obtained with the solution from there, and 'Expected Gap' is the output I want to obtain.
+--------+------------+---------------------+----------+----------------------------------------------------------------------------------------------+
| Player | Result | Date | Gap done | Expected Gap |
+--------+------------+---------------------+----------+----------------------------------------------------------------------------------------------+
| K2000 | Lose | 2015-11-13 13:42:00 | Nan | Nan/0 |
| K2000 | Lose | 2016-03-23 16:40:00 | 131.0 | 131.0 |
| K2000 | Lose | 2016-05-16 19:17:00 | 54.0 | 185.0 |
| K2000 | Win | 2016-06-09 19:36:00 | 54.0 | 239.0 #he always lose before |
| K2000 | Win | 2016-06-30 14:05:00 | 54.0 | 54.0 #because he won last time, it's 54 days btw this current date and the last time he won. |
| K2000 | Lose | 2016-07-29 16:20:00 | 29.0 | 29.0 |
| K2000 | Win | 2016-10-08 17:48:00 | 29.0 | 58.0 |
| Kssis | Lose | 2007-02-25 15:05:00 | Nan | Nan/0 |
| Kssis | Lose | 2007-04-25 16:07:00 | 59.0 | 59.0 |
| Kssis | Not ranked | 2007-06-01 16:54:00 | 37.0 | 96.0 |
| Kssis | Lose | 2007-09-09 14:33:00 | 99.0 | 195.0 |
| Kssis | Lose | 2008-04-06 16:27:00 | 210.0 | 405.0 |
+--------+------------+---------------------+----------+----------------------------------------------------------------------------------------------+
The issue with the solution there is that it does not really compute dates; it only happens to work because the dates in that example are always separated by exactly 1 day.
Sure, I adapted it with:
def sum_days_in_row_with_condition(g):
    sorted_g = g.sort_values(by='date', ascending=True)
    condition = sorted_g['Result'] == 'Win'
    sorted_g['days-in-a-row'] = g.date.diff().dt.days.where(~condition).ffill()
    return sorted_g
But as I showed you, this is messy.
So I thought about a solution, but it needs global variables (outside the function), and that's a little tedious.
Can anyone help solve this problem in a simpler way?
Pandas version: 0.23.4 Python version: 3.7.4
IIUC, you need to find the boolean mask m1 marking rows where the result is a Win and the previous row is also a Win. From m1, create a group ID s to separate the Win streaks. Then split into groups and cumsum:
m = df.Result.eq('Win')                 # rows that are a Win
m1 = m & m.shift()                      # Win rows whose previous row is also a Win
s = m1.ne(m1.shift()).cumsum()          # new group id every time m1 flips, so the cumsum resets there
df['Expected Gap'] = df.groupby(['Player', s])['Gap done'].cumsum()
Out[808]:
Player Result Date Gap done Expected Gap
0 K2000 Lose 2015-11-13 13:42:00 NaN NaN
1 K2000 Lose 2016-03-23 16:40:00 131.0 131.0
2 K2000 Lose 2016-05-16 19:17:00 54.0 185.0
3 K2000 Win 2016-06-09 19:36:00 54.0 239.0
4 K2000 Win 2016-06-30 14:05:00 54.0 54.0
5 K2000 Lose 2016-07-29 16:20:00 29.0 29.0
6 K2000 Win 2016-10-08 17:48:00 29.0 58.0
7 Kssis Lose 2007-02-25 15:05:00 NaN NaN
8 Kssis Lose 2007-04-25 6:07:00 59.0 59.0
9 Kssis Not-ranked 2007-06-01 16:54:00 37.0 96.0
10 Kssis Lose 2007-09-09 14:33:00 99.0 195.0
11 Kssis Lose 2008-04-06 16:27:00 210.0 405.0
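Since the original complaint is that the gaps are not computed from the dates themselves, a minimal companion sketch (an assumption, not part of the answer above): derive 'Gap done' from the Date column per player before applying the cumsum logic.
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# days since the player's previous game; NaN on each player's first row
df['Gap done'] = df.groupby('Player')['Date'].diff().dt.days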

Detect consecutive timestamps with all rows with NaN values in pandas

I would like to detect in a dataframe the start and end (Datetime) of consecutive sets of rows with all the values being NaN.
What is the best way to store the results in a array of tuples with the start and end of each set of datetimes with NaN values?
For example, using the DataFrame below, the tuples should look like this:
missing_datetimes = [('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
                     ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
                     ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
Example of dataframe:
+------------+---------------------+------------+------------+
| geo_id | Datetime | Variable1 | Variable2 |
+------------+---------------------+------------+------------+
| 1 | 2018-10-10 18:00:00 | 20 | 10 |
| 2 | 2018-10-10 18:00:00 | 22 | 10 |
| 1 | 2018-10-10 19:00:00 | 20 | nan |
| 2 | 2018-10-10 19:00:00 | 21 | nan |
| 1 | 2018-10-10 20:00:00 | 30 | nan |
| 2 | 2018-10-10 20:00:00 | 30 | nan |
| 1 | 2018-10-10 21:00:00 | nan | 5 |
| 2 | 2018-10-10 21:00:00 | nan | 5 |
| 1 | 2018-10-10 22:00:00 | nan | nan |
| 1 | 2018-10-10 23:00:00 | nan | nan |
| 1 | 2018-10-11 00:00:00 | nan | nan |
| 1 | 2018-10-11 01:00:00 | 5 | 2 |
| 1 | 2018-10-11 02:00:00 | nan | nan |
| 1 | 2018-10-11 03:00:00 | 2 | 1 |
| 1 | 2018-10-11 04:00:00 | nan | nan |
+------------+---------------------+------------+------------+
Update: And what if some datetimes are duplicated?
You may need to use groupby with a condition:
s = df.set_index('Datetime').isnull().all(axis=1)   # True where every value in the row is NaN
# find the all-NaN rows; consecutive ones share a group id, so first/last give each run's bounds
(df.loc[s.values, 'Datetime']
   .groupby((~s).cumsum()[s].values)
   .agg(['first', 'last'])
   .apply(tuple, 1)
   .tolist())
Out[89]:
[('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
 ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
 ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
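For the update about duplicated datetimes, a rough sketch (an assumption, not covered by the answer above): reduce to one flag per Datetime first, so a timestamp only counts as missing when every row sharing it is all-NaN.
import pandas as pd

flags = (df.set_index('Datetime')
           .drop(columns='geo_id')
           .isnull().all(axis=1)      # per row: all values NaN?
           .groupby(level=0).all())   # per timestamp: all of its rows all-NaN?

groups = (~flags).cumsum()[flags]     # consecutive missing timestamps share a group id
missing_datetimes = (flags[flags].index.to_series()
                         .groupby(groups)
                         .agg(['first', 'last'])
                         .apply(tuple, 1)
                         .tolist())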
