How do I also aggregate the 'reviewer' lists together with the average of 'quantities'?
For a data frame like the one below, I can successfully calculate the average of the quantities per group over every 3 years. How do I add an extra column that aggregates the values of the 'reviewer' column for every period as well? For example, for company 'A' in year 1993, the column would hold [[p1, p2], [p3, p2], [p4]].
df = pd.DataFrame(data=[
        ['A', 1990, 2, ['p1', 'p2']],
        ['A', 1991, 3, ['p3', 'p2']],
        ['A', 1993, 5, ['p4']],
        ['A', 2000, 4, ['p1', 'p5', 'p7']],
        ['B', 2000, 1, ['p3']],
        ['B', 2001, 2, ['p6', 'p9']],
        ['B', 2002, 3, ['p10', 'p1']]],
    columns=['company', 'year', 'quantity', 'reviewer'])
df['rolling_average'] = (df.groupby(['company'])
                           .rolling(3).agg({'quantity': 'mean'})
                           .reset_index(level=[0], drop=True))
The output currently looks like:
| index | company | year | quantity | reviewer | rolling_average |
| :---- | :------ | :--- | :------- | :------- | :-------------- |
| 0 | A | 1990 | 2 | [p1, p2] | NaN |
| 1 | A | 1991 | 3 | [p3, p2] | NaN |
| 2 | A | 1993 | 5 | [p4] | 3.33 |
| 3     | A       | 2000 | 4        | [p1, p5, p7] | 4.00            |
| 4 | B | 2000 | 1 | [p3] | NaN |
| 5 | B | 2001 | 2 | [p6, p9] | NaN |
| 6 | B | 2002 | 3 | [p10, p1]| 2.00 |
Since rolling cannot take non-numeric values, we need to define the rolling aggregation ourselves here:
import numpy as np

n = 3
# per company, collect the last n 'reviewer' lists (NaN until the window has n rows)
df['new'] = (df.groupby(['company'])['reviewer']
               .apply(lambda x: [x[y-n:y].tolist() if y >= n else np.nan for y in range(1, len(x)+1)])
               .explode().values)
df
company year quantity reviewer new
0 A 1990 2 [p1, p2] NaN
1 A 1991 3 [p3, p2] NaN
2 A 1993 5 [p4] [[p1, p2], [p3, p2], [p4]]
3 A 2000 4 [p1, p5, p7] [[p3, p2], [p4], [p1, p5, p7]]
4 B 2000 1 [p3] NaN
5 B 2001 2 [p6, p9] NaN
6 B 2002 3 [p10, p1] [[p3], [p6, p9], [p10, p1]]
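To sanity-check against the example in the question, the row for company 'A' in 1993 should now hold the three reviewer lists of that window:

print(df.loc[(df.company == 'A') & (df.year == 1993), 'new'].iloc[0])
# [['p1', 'p2'], ['p3', 'p2'], ['p4']]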
I have a dataframe with YYYYMM columns that contain monthly totals on the row level:
| yearM | feature | 201902 | 201903 | 201904 | 201905 |... ... ... 202009
|-------|----------|--------|--------|--------|--------|
| 0     | feature1 | NaN    | NaN    | 9.0    | 32.0   |
| 1 | feature2 | 1.0 | 1.0 | 1.0 | 4.0 |
| 2     | feature3 | NaN    | 1.0    | 4.0    | 8.0    |
| 3 | feature4 | 9.0 | 15.0 | 19.0 | 24.0 |
| 4 | feature5 | 33.0 | 67.0 | 99.0 | 121.0 |
| 5 | feature6 | 12.0 | 15.0 | 17.0 | 19.0 |
| 6 | feature7 | 1.0 | 8.0 | 15.0 | 20.0 |
| 7     | feature8 | NaN    | NaN    | 1.0    | 9.0    |
I would like to convert the totals to the monthly change. The feature column should be excluded as I need to keep the feature names. The yearM in the index is a result of pivoting a dataframe to get the YYYYMM on the column level.
This is what the output would look like:
| yearM | feature | 201902 | 201903 | 201904 | 201905 |... ... ... 202009
|-------|----------|--------|--------|--------|--------|
| 0     | feature1 | NaN    | 0.0    | 9.0    | 23.0   |
| 1     | feature2 | 1.0    | 0.0    | 0.0    | 3.0    |
| 2     | feature3 | NaN    | 1.0    | 3.0    | 4.0    |
| 3     | feature4 | 9.0    | 6.0    | 4.0    | 5.0    |
| 4 | feature5 | 33.0 | 34.0 | 32.0 | 22.0 |
| 5 | feature6 | 12.0 | 3.0 | 2.0 | 2.0 |
| 6 | feature7 | 1.0 | 7.0 | 7.0 | 5.0 |
| 7     | feature8 | NaN    | 0.0    | 1.0    | 8.0    |
The row level values now represent the change compared to the previous month instead of having the total for the month.
I know that I should start by filling the NaN rows in the starting column 201902 with 0:
df['201902'] = df['201902'].fillna(0)
I could also calculate them one by one with something similar to this:
df['201903'] = df['201903'].fillna(0) - df['201902'].fillna(0)
df['201904'] = df['201904'].fillna(0) - df['201903'].fillna(0)
...
...
Hopefully there's a smarter solution, though.
Use iloc or drop to exclude the feature column, then diff with axis=1 for row-wise differences:
monthly_change = df.iloc[:, 1:].fillna(0).diff(axis=1)
# or
# monthly_change = df.drop(['feature'], axis=1).fillna(0).diff(axis=1)
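If you also want to keep the feature column and the first month's totals (as in the expected output above), a hedged sketch along those lines, assuming the month columns are simply every column except 'feature', could be:

month_cols = df.columns.drop('feature')             # all YYYYMM columns
changes = df[month_cols].fillna(0).diff(axis=1)      # month-over-month change per row
changes[month_cols[0]] = df[month_cols[0]]           # first month has no predecessor: keep its totals
out = pd.concat([df[['feature']], changes], axis=1)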
I'm working in Python with a pandas DataFrame similar to:
REQUESET_ID | DESCR | TEST | TEST_DESC | RESULT |
1 | 1 | T1 | TEST_1 | 2.0 |
1 | 2 | T2 | TEST_2 | 92.0 |
2 | 1 | T1 | TEST_1 | 8.0 |
3 | 3 | T3 | TEST_3 | 12.0 |
3 | 4 | T4 | TEST_4 | 45.0 |
What I want is a final dataframe like this:
REQUESET_ID | DESCR_1 | TEST_1 | TEST_DESC_1 | RESULT_1 | DESCR_2 | TEST_2 | TEST_DESC_2 | RESULT_2 |
1 | 1 | T1 | TEST_1 | 2.0 | 2 | T2 | TEST_2 | 92.0 |
2 | 1 | T1 | TEST_1 | 8.0 | NaN | NaN | NaN | NaN |
3 | 3 | T3 | TEST_3 | 12.0 | 4 | T4 | TEST_4 | 45.0 |
How should I implement that as a method working with DataFrames? I understand that if I try to do it with a merge, instead of having the 4x2 columns added (because value_counts of REQUESET_ID returns 2), it will add the 4 columns for each entry in the request column.
Assign a new column with cumcount, then do set_index + unstack:
s = (df.assign(col=(df.groupby('REQUESET_ID').cumcount() + 1).astype(str))
       .set_index(['REQUESET_ID', 'col'])
       .unstack()
       .sort_index(level=1, axis=1))
s.columns = s.columns.map('_'.join)
s
DESCR_1 RESULT_1 TEST_1 ... RESULT_2 TEST_2 TEST_DESC_2
REQUESET_ID ...
1 1.0 2.0 T1 ... 92.0 T2 TEST_2
2 1.0 8.0 T1 ... NaN NaN NaN
3 3.0 12.0 T3 ... 45.0 T4 TEST_4
[3 rows x 8 columns]
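If the exact column order from the question is wanted and REQUESET_ID should become an ordinary column again, a small hedged follow-up (assuming at most two entries per request, as in the sample) could be:

order = [f'{c}_{i}' for i in ('1', '2') for c in ('DESCR', 'TEST', 'TEST_DESC', 'RESULT')]
s = s.reindex(columns=order).reset_index()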
I am having trouble computing the number of days in a row until a condition is met.
It is shown in the following table, where 'Gap done' is the messy result I obtained with the solution from there, and 'Expected Gap' is the output I want to obtain.
+--------+------------+---------------------+----------+--------------+
| Player | Result     | Date                | Gap done | Expected Gap |
+--------+------------+---------------------+----------+--------------+
| K2000  | Lose       | 2015-11-13 13:42:00 | NaN      | NaN/0        |
| K2000  | Lose       | 2016-03-23 16:40:00 | 131.0    | 131.0        |
| K2000  | Lose       | 2016-05-16 19:17:00 | 54.0     | 185.0        |
| K2000  | Win        | 2016-06-09 19:36:00 | 54.0     | 239.0 (*)    |
| K2000  | Win        | 2016-06-30 14:05:00 | 54.0     | 54.0 (**)    |
| K2000  | Lose       | 2016-07-29 16:20:00 | 29.0     | 29.0         |
| K2000  | Win        | 2016-10-08 17:48:00 | 29.0     | 58.0         |
| Kssis  | Lose       | 2007-02-25 15:05:00 | NaN      | NaN/0        |
| Kssis  | Lose       | 2007-04-25 16:07:00 | 59.0     | 59.0         |
| Kssis  | Not ranked | 2007-06-01 16:54:00 | 37.0     | 96.0         |
| Kssis  | Lose       | 2007-09-09 14:33:00 | 99.0     | 195.0        |
| Kssis  | Lose       | 2008-04-06 16:27:00 | 210.0    | 405.0        |
+--------+------------+---------------------+----------+--------------+
(*) he always lost before.
(**) because he won last time, it's 54 days between this date and the last time he won.
The issue with the solution there is that it does not really compute dates; it just happens that the dates in that example are always separated by one day.
Sure, I adapted it with
def sum_days_in_row_with_condition(g):
    sorted_g = g.sort_values(by='Date', ascending=True)
    condition = sorted_g['Result'] == 'Win'
    sorted_g['days-in-a-row'] = sorted_g['Date'].diff().dt.days.where(~condition).ffill()
    return sorted_g
But as I showed you, this is messy.
So I thought about a solution, but it needs global variables (outside the function), and that's a little tedious.
Can anyone help solve this problem in a simpler way?
Pandas version: 0.23.4 Python version: 3.7.4
IIUC, you need to find the boolean mask m1 marking rows where a 'Win' is preceded by another 'Win'. From m1, build a group ID s that starts a new group whenever that flag changes. Then group by Player and s and take the cumulative sum of 'Gap done':
m = df.Result.eq('Win')                     # rows where the player won
m1 = m & m.shift()                          # win immediately preceded by another win
s = m1.ne(m1.shift()).cumsum()              # new group ID each time that flag changes
df['Expected Gap'] = df.groupby(['Player', s])['Gap done'].cumsum()
Out[808]:
Player Result Date Gap done Expected Gap
0 K2000 Lose 2015-11-13 13:42:00 NaN NaN
1 K2000 Lose 2016-03-23 16:40:00 131.0 131.0
2 K2000 Lose 2016-05-16 19:17:00 54.0 185.0
3 K2000 Win 2016-06-09 19:36:00 54.0 239.0
4 K2000 Win 2016-06-30 14:05:00 54.0 54.0
5 K2000 Lose 2016-07-29 16:20:00 29.0 29.0
6 K2000 Win 2016-10-08 17:48:00 29.0 58.0
7 Kssis Lose 2007-02-25 15:05:00 NaN NaN
8    Kssis        Lose 2007-04-25 16:07:00     59.0          59.0
9    Kssis  Not ranked 2007-06-01 16:54:00     37.0          96.0
10 Kssis Lose 2007-09-09 14:33:00 99.0 195.0
11 Kssis Lose 2008-04-06 16:27:00 210.0 405.0
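As a side note, the 'Gap done' column itself can be computed from the real dates rather than assumed. A minimal sketch, assuming Date is (or can be parsed as) a datetime column and the frame is then sorted per player:

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Player', 'Date'])
# whole days since the player's previous game
df['Gap done'] = df.groupby('Player')['Date'].diff().dt.days

The 'Expected Gap' cumulative sum above can then be applied on top of this column.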
I would like to detect in a dataframe the start and end (Datetime) of consecutive sets of rows with all the values being NaN.
What is the best way to store the results in an array of tuples with the start and end of each set of datetimes with NaN values?
For example, using the dataframe below, the tuples should look like this:
missing_datetimes = [('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
                     ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
                     ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
Example of dataframe:
+------------+---------------------+------------+------------+
| geo_id | Datetime | Variable1 | Variable2 |
+------------+---------------------+------------+------------+
| 1 | 2018-10-10 18:00:00 | 20 | 10 |
| 2 | 2018-10-10 18:00:00 | 22 | 10 |
| 1 | 2018-10-10 19:00:00 | 20 | nan |
| 2 | 2018-10-10 19:00:00 | 21 | nan |
| 1 | 2018-10-10 20:00:00 | 30 | nan |
| 2 | 2018-10-10 20:00:00 | 30 | nan |
| 1 | 2018-10-10 21:00:00 | nan | 5 |
| 2 | 2018-10-10 21:00:00 | nan | 5 |
| 1 | 2018-10-10 22:00:00 | nan | nan |
| 1 | 2018-10-10 23:00:00 | nan | nan |
| 1 | 2018-10-11 00:00:00 | nan | nan |
| 1 | 2018-10-11 01:00:00 | 5 | 2 |
| 1 | 2018-10-11 02:00:00 | nan | nan |
| 1 | 2018-10-11 03:00:00 | 2 | 1 |
| 1 | 2018-10-11 04:00:00 | nan | nan |
+------------+---------------------+------------+------------+
Update: And what if some datetimes are duplicated?
You may need to use groupby with a condition:
# flag rows where every variable column is NaN; consecutive flagged rows fall into one group
s = df.drop(columns=['geo_id', 'Datetime']).isnull().all(axis=1)
df.loc[s, 'Datetime'].groupby((~s).cumsum()[s]).agg(['first', 'last']).apply(tuple, axis=1).tolist()
Out[89]:
[('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
 ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
 ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
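For the update about duplicated datetimes (several rows sharing one timestamp, e.g. one per geo_id), one possible adaptation is to collapse the frame to a single flag per timestamp and then reuse the same run-grouping idea. A sketch, assuming a timestamp only counts as missing when every value of every row carrying it is NaN, and that geo_id should not take part in that test:

# one boolean per unique timestamp: True when all values of all rows for that timestamp are NaN
s = (df.drop(columns='geo_id')
       .set_index('Datetime')
       .isnull()
       .groupby(level='Datetime')
       .all()
       .all(axis=1))

# group consecutive missing timestamps and keep the first/last of each run
missing_datetimes = (s.index[s].to_series()
                       .groupby((~s).cumsum()[s])
                       .agg(['first', 'last'])
                       .apply(tuple, axis=1)
                       .tolist())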