Difference of days between two dates until a condition is met - python-3.x

I'm having trouble computing the number of days in a row until a condition is met.
This is shown in the following table, where 'Gap done' is the messy result I obtained with the solution from there, and 'Expected Gap' is the output I want to obtain.
+--------+------------+---------------------+----------+--------------+
| Player | Result     | Date                | Gap done | Expected Gap |
+--------+------------+---------------------+----------+--------------+
| K2000  | Lose       | 2015-11-13 13:42:00 | NaN      | NaN/0        |
| K2000  | Lose       | 2016-03-23 16:40:00 | 131.0    | 131.0        |
| K2000  | Lose       | 2016-05-16 19:17:00 | 54.0     | 185.0        |
| K2000  | Win        | 2016-06-09 19:36:00 | 54.0     | 239.0        | # he always lost before
| K2000  | Win        | 2016-06-30 14:05:00 | 54.0     | 54.0         | # because he won last time, it's 54 days between this date and the last time he won
| K2000  | Lose       | 2016-07-29 16:20:00 | 29.0     | 29.0         |
| K2000  | Win        | 2016-10-08 17:48:00 | 29.0     | 58.0         |
| Kssis  | Lose       | 2007-02-25 15:05:00 | NaN      | NaN/0        |
| Kssis  | Lose       | 2007-04-25 16:07:00 | 59.0     | 59.0         |
| Kssis  | Not ranked | 2007-06-01 16:54:00 | 37.0     | 96.0         |
| Kssis  | Lose       | 2007-09-09 14:33:00 | 99.0     | 195.0        |
| Kssis  | Lose       | 2008-04-06 16:27:00 | 210.0    | 405.0        |
+--------+------------+---------------------+----------+--------------+
The issue with the solution there is that it does not really compute dates; it only happens to work because the dates in that example are always exactly 1 day apart.
Sure, I adapted it with
def sum_days_in_row_with_condition(g):
    sorted_g = g.sort_values(by='date', ascending=True)
    condition = sorted_g['Result'] == 'Win'
    sorted_g['days-in-a-row'] = sorted_g['date'].diff().dt.days.where(~condition).ffill()
    return sorted_g
But as I showed you, this is messy.
So I thought about a solution, but it needs global variables (outside the function), and that's a little tedious.
Can anyone help solve this problem in a simpler way?
Pandas version: 0.23.4, Python version: 3.7.4

IIUC, you need a boolean mask m1 marking rows where the result is 'Win' and the previous row is also 'Win'. From m1, create a group ID s to separate the groups of wins, then split into groups and cumsum:
m = df.Result.eq('Win')             # rows that are wins
m1 = m & m.shift()                  # win immediately preceded by a win
s = m1.ne(m1.shift()).cumsum()      # new group ID whenever m1 flips
df['Expected Gap'] = df.groupby(['Player', s])['Gap done'].cumsum()
Out[808]:
   Player      Result                 Date  Gap done  Expected Gap
0   K2000        Lose  2015-11-13 13:42:00       NaN           NaN
1   K2000        Lose  2016-03-23 16:40:00     131.0         131.0
2   K2000        Lose  2016-05-16 19:17:00      54.0         185.0
3   K2000         Win  2016-06-09 19:36:00      54.0         239.0
4   K2000         Win  2016-06-30 14:05:00      54.0          54.0
5   K2000        Lose  2016-07-29 16:20:00      29.0          29.0
6   K2000         Win  2016-10-08 17:48:00      29.0          58.0
7   Kssis        Lose  2007-02-25 15:05:00       NaN           NaN
8   Kssis        Lose  2007-04-25 16:07:00      59.0          59.0
9   Kssis  Not ranked  2007-06-01 16:54:00      37.0          96.0
10  Kssis        Lose  2007-09-09 14:33:00      99.0         195.0
11  Kssis        Lose  2008-04-06 16:27:00     210.0         405.0
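Note: if you'd rather derive the gaps directly from the Date column instead of relying on the precomputed 'Gap done' (the original complaint was that the other solution doesn't really compute dates), here is a minimal sketch, assuming df holds the Player/Result/Date columns shown above:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Player', 'Date'])
gap = df.groupby('Player')['Date'].diff().dt.days  # days since the previous game
m = df['Result'].eq('Win')
m1 = m & m.shift()                 # win immediately preceded by a win
s = m1.ne(m1.shift()).cumsum()     # same group ID trick as above
df['Expected Gap'] = gap.groupby([df['Player'], s]).cumsum()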

Related

Calculating the difference in value between columns

I have a dataframe with YYYYMM columns that contain monthly totals on the row level:
| yearM | feature  | 201902 | 201903 | 201904 | 201905 | ... | 202009
|-------|----------|--------|--------|--------|--------|
| 0     | feature1 | NaN    | NaN    | 9.0    | 32.0   |
| 1     | feature2 | 1.0    | 1.0    | 1.0    | 4.0    |
| 2     | feature3 | NaN    | 1.0    | 4.0    | 8.0    |
| 3     | feature4 | 9.0    | 15.0   | 19.0   | 24.0   |
| 4     | feature5 | 33.0   | 67.0   | 99.0   | 121.0  |
| 5     | feature6 | 12.0   | 15.0   | 17.0   | 19.0   |
| 6     | feature7 | 1.0    | 8.0    | 15.0   | 20.0   |
| 7     | feature8 | NaN    | NaN    | 1.0    | 9.0    |
I would like to convert the totals to the monthly change. The feature column should be excluded as I need to keep the feature names. The yearM in the index is a result of pivoting a dataframe to get the YYYYMM on the column level.
This is what the output would look like:
| yearM | feature  | 201902 | 201903 | 201904 | 201905 | ... | 202009
|-------|----------|--------|--------|--------|--------|
| 0     | feature1 | NaN    | 0.0    | 9.0    | 23.0   |
| 1     | feature2 | 1.0    | 0.0    | 0.0    | 3.0    |
| 2     | feature3 | NaN    | 1.0    | 3.0    | 5.0    |
| 3     | feature4 | 9.0    | 6.0    | 4.0    | 5.0    |
| 4     | feature5 | 33.0   | 34.0   | 32.0   | 22.0   |
| 5     | feature6 | 12.0   | 3.0    | 2.0    | 2.0    |
| 6     | feature7 | 1.0    | 7.0    | 7.0    | 5.0    |
| 7     | feature8 | NaN    | 0.0    | 1.0    | 8.0    |
The row level values now represent the change compared to the previous month instead of having the total for the month.
I know that I should start by filling the NaN rows in the starting column 201902 with 0:
df['201902'] = df['201902'].fillna(0)
I could also calculate them one by one with something similar to this:
df['201903'] = df['201903'].fillna(0) - df['201902'].fillna(0)
df['201904'] = df['201904'].fillna(0) - df['201903'].fillna(0)
df['201905'] = df['201905'].fillna(0) - df['201904'].fillna(0)
...
...
Hopefully there's a smarter solution, though.
Use iloc or drop to exclude the feature column, then diff with axis=1 to take the difference between consecutive columns:
monthly_change = df.iloc[:, 1:].fillna(0).diff(axis=1)
# or
# monthly_change = df.drop(['feature'], axis=1).fillna(0).diff(axis=1)
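If you also want to keep the feature names and the first month's totals (diff leaves the first column as NaN), here is a small follow-up sketch, assuming df is the frame shown above:
import pandas as pd

months = df.columns.drop('feature')          # the YYYYMM columns
change = df[months].fillna(0).diff(axis=1)   # each month minus the previous month
change[months[0]] = df[months[0]].fillna(0)  # the first month has no prior month
result = pd.concat([df[['feature']], change], axis=1)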

Create DataFrames or Dictionaries from Unique Values in Separate Columns

I'm a Python newbie trying to model stock trades from a DataFrame containing timestamped trade executions. For example, index 0-2 below represent three executions of a single trade. I'm trying to isolate each group of executions that represent a trade. Here's a sample of the existing aggregate:
event side symbol shares price time trade_date
0 Execute Shrt TSLA 25 311.9500 10:29:51 2019-01-19
1 Execute Buy TSLA 20 310.7300 10:30:42 2019-01-19
2 Execute Buy TSLA 5 311.1173 10:31:15 2019-01-19
3 Execute Buy BYND 25 83.3027 11:06:15 2019-01-19
4 Execute Shrt BYND 13 84.0500 11:07:11 2019-01-19
5 Execute Sell BYND 12 83.2500 11:07:42 2019-01-19
6 Execute Buy NVDA 25 297.3400 12:07:42 2019-01-20
7 Execute Shrt AMZN 10 500.0100 12:09:12 2019-01-20
8 Execute Sell NVDA 25 296.7500 12:10:30 2019-01-20
9 Execute Buy AMZN 10 495.7500 12:11:15 2019-01-20
The approach in this post creates slices from unique values in a single column, but I'm unsure of how to make the second slice. With this approach applied, I have:
date_list = list(set(execs_df['trade_date']))  # unique dates from the original DataFrame
by_date_dict = {date: execs_df.loc[execs_df['trade_date'] == date] for date in date_list}
for date in date_list:
    print(by_date_dict[date])
This produces the following date-specific DataFrames:
side symbol shares price time trade_date p & l trades value
0 Shrt TSLA 25.0 311.9500 10:29:51 2019-11-01 NaN NaN 7798.7500
1 Buy TSLA 8.0 311.2000 10:30:31 2019-11-01 NaN NaN 2489.6000
2 Buy TSLA 8.0 310.7300 10:30:42 2019-11-01 NaN NaN 2485.8400
3 Buy TSLA 4.0 311.1173 10:31:15 2019-11-01 NaN NaN 1244.4692
4 Buy TSLA 5.0 311.5500 10:35:39 2019-11-01 NaN NaN 1557.7500
5 Shrt BYND 25.0 83.3027 11:06:15 2019-11-01 NaN NaN 2082.5675
6 Buy BYND 12.0 83.0500 11:06:43 2019-11-01 NaN NaN 996.6000
7 Buy BYND 13.0 83.2400 11:07:49 2019-11-01 NaN NaN 1082.1200
In terms of final output, I need the following:
side symbol shares price time trade_date p & l trades value
0 Shrt TSLA 25.0 311.9500 10:29:51 2019-11-01 NaN NaN 7798.7500
1 Buy TSLA 8.0 311.2000 10:30:31 2019-11-01 NaN NaN 2489.6000
2 Buy TSLA 8.0 310.7300 10:30:42 2019-11-01 NaN NaN 2485.8400
3 Buy TSLA 4.0 311.1173 10:31:15 2019-11-01 NaN NaN 1244.4692
4 Buy TSLA 5.0 311.5500 10:35:39 2019-11-01 NaN NaN 1557.7500
side symbol shares price time trade_date p & l trades value
0 Shrt BYND 25.0 83.3027 11:06:15 2019-11-01 NaN NaN 2082.5675
1 Buy BYND 12.0 83.0500 11:06:43 2019-11-01 NaN NaN 996.6000
2 Buy BYND 13.0 83.2400 11:07:49 2019-11-01 NaN NaN 1082.1200
etc...
Any pointers would be greatly appreciated.
Given your current dictionary of DataFrames, by_date_dict,
the following code will build a dict of dicts of DataFrames.
The top key is still the date.
Under each date key is a key for each symbol (e.g. updated_df['2019-11-01']['BYND']).
updated_df = {k: {sym: v[v.symbol == sym] for sym in v.symbol.unique()} for k, v in by_date_dict.items()}
# structure
{date: {symbol: df,
        symbol: df,
        symbol: df},
 date: {symbol: df,
        symbol: df,
        symbol: df}}
for k, v in updated_df.items():
    print(k)
    for x, y in v.items():
        print(x)
        print(y.to_markdown())
2019-11-01
TSLA
| | side | symbol | shares | price | time | trade_date | pl | trades | value |
|---:|:-------|:---------|---------:|--------:|:---------|:-------------|-----:|---------:|--------:|
| 0 | Shrt | TSLA | 25 | 311.95 | 10:29:51 | 2019-11-01 | nan | nan | 7798.75 |
| 1 | Buy | TSLA | 8 | 311.2 | 10:30:31 | 2019-11-01 | nan | nan | 2489.6 |
| 2 | Buy | TSLA | 8 | 310.73 | 10:30:42 | 2019-11-01 | nan | nan | 2485.84 |
| 3 | Buy | TSLA | 4 | 311.117 | 10:31:15 | 2019-11-01 | nan | nan | 1244.47 |
| 4 | Buy | TSLA | 5 | 311.55 | 10:35:39 | 2019-11-01 | nan | nan | 1557.75 |
BYND
| | side | symbol | shares | price | time | trade_date | pl | trades | value |
|---:|:-------|:---------|---------:|--------:|:---------|:-------------|-----:|---------:|--------:|
| 5 | Shrt | BYND | 25 | 83.3027 | 11:06:15 | 2019-11-01 | nan | nan | 2082.57 |
| 6 | Buy | BYND | 12 | 83.05 | 11:06:43 | 2019-11-01 | nan | nan | 996.6 |
| 7 | Buy | BYND | 13 | 83.24 | 11:07:49 | 2019-11-01 | nan | nan | 1082.12 |
Access a specific key:
updated_df['2019-11-01']['BYND']
| | side | symbol | shares | price | time | trade_date | pl | trades | value |
|---:|:-------|:---------|---------:|--------:|:---------|:-------------|-----:|---------:|--------:|
| 5 | Shrt | BYND | 25 | 83.3027 | 11:06:15 | 2019-11-01 | nan | nan | 2082.57 |
| 6 | Buy | BYND | 12 | 83.05 | 11:06:43 | 2019-11-01 | nan | nan | 996.6 |
| 7 | Buy | BYND | 13 | 83.24 | 11:07:49 | 2019-11-01 | nan | nan | 1082.12 |
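If you only need to iterate over the per-date, per-symbol slices, an alternative sketch (assuming execs_df is the original DataFrame) is a single groupby over both columns:
# same slices without building nested dicts by hand
for (date, symbol), trades in execs_df.groupby(['trade_date', 'symbol']):
    print(date, symbol)
    print(trades)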

Detect consecutive timestamps with all rows with NaN values in pandas

I would like to detect in a DataFrame the start and end (Datetime) of consecutive sets of rows whose values are all NaN.
What is the best way to store the results in an array of tuples with the start and end of each set of datetimes with NaN values?
For example, using the DataFrame below, the tuples should look like this:
missing_datetimes = [('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
                     ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
                     ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
Example of dataframe:
+------------+---------------------+------------+------------+
| geo_id | Datetime | Variable1 | Variable2 |
+------------+---------------------+------------+------------+
| 1 | 2018-10-10 18:00:00 | 20 | 10 |
| 2 | 2018-10-10 18:00:00 | 22 | 10 |
| 1 | 2018-10-10 19:00:00 | 20 | nan |
| 2 | 2018-10-10 19:00:00 | 21 | nan |
| 1 | 2018-10-10 20:00:00 | 30 | nan |
| 2 | 2018-10-10 20:00:00 | 30 | nan |
| 1 | 2018-10-10 21:00:00 | nan | 5 |
| 2 | 2018-10-10 21:00:00 | nan | 5 |
| 1 | 2018-10-10 22:00:00 | nan | nan |
| 1 | 2018-10-10 23:00:00 | nan | nan |
| 1 | 2018-10-11 00:00:00 | nan | nan |
| 1 | 2018-10-11 01:00:00 | 5 | 2 |
| 1 | 2018-10-11 02:00:00 | nan | nan |
| 1 | 2018-10-11 03:00:00 | 2 | 1 |
| 1 | 2018-10-11 04:00:00 | nan | nan |
+------------+---------------------+------------+------------+
Update: And what if some datetimes are duplicated?
You may need to use groupby with a condition:
# mark the rows where every variable column is NaN
s = df.drop(columns=['geo_id', 'Datetime']).isnull().all(axis=1)
# consecutive all-NaN rows share a group ID, so 'first' and 'last' give the run boundaries
df.loc[s, 'Datetime'].groupby((~s).cumsum()[s]).agg(['first', 'last']).apply(tuple, axis=1).tolist()
Out[89]:
[('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
 ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
 ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
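For the update about duplicated datetimes, one possible sketch (my assumption, not part of the original answer): a timestamp only counts as missing when every row carrying it is all-NaN, so collapse the duplicates first and then apply the same run-grouping:
# missing only if all of a timestamp's (possibly duplicated) rows are all-NaN
m = (df.drop(columns='geo_id')
       .set_index('Datetime')
       .isnull().all(axis=1)     # all-NaN per row
       .groupby(level=0).all())  # ...and across duplicate timestamps
missing_datetimes = (m[m].index.to_series()
                         .groupby((~m).cumsum()[m])
                         .agg(['first', 'last'])
                         .apply(tuple, axis=1)
                         .tolist())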

tensorflow timeseries different lengths

I'm trying to get a time series into TensorFlow to work with an LSTM. I have 4 files, but I'm not sure how to get them running together. The biggest problem I have is that my first dataset has 1 data point per year, while 2 others have monthly data that should be used for correlation to predict the first set. The 4th dataset just has some metadata like species and coordinates. Should I put them together somehow, and if so, how? Any advice in the right direction would be nice.
I already looked at the time-series documentation of TensorFlow and also tried to follow this guide: https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
but I struggle with getting the yearly and monthly data to fit together. I manage the data in R but run TensorFlow in Python; I'm more familiar with R in general.
Thank you all for being here!
Header samples of the Data structure:
File1.csv:
years | noaa-tree-2657 | noaa-tree-2658 | noaa-tree-2659 | noaa-tree-2662
1901 | 1.676948 | 1.305594 | 0.6756204 | 0.7149572
1902 | 1.562344 | 0.899884 | 0.5102933 | 0.6351094
1903 | 1.687270 | 1.354678 | 0.9899198 | 0.6158589
File2.csv:
noaa-tree-2657 | noaa-tree-2658 | noaa-tree-2659 | noaa-tree-2662 | noaa-tree-2664
1 6.41 | 1.85 | 0.33 | 8.61 | 6.07
2 10.45 | 3.20 | 0.38 | 8.58 | 5.30
3 10.81 | 4.30 | 1.50 | 9.34 | 8.50
File3.csv:
noaa-tree-2657 | noaa-tree-2658 | noaa-tree-2659 | noaa-tree-2662 | noaa-tree-2664
1 -0.3 | 11.0 | 10.1 | -22.4 | -15.1
2 -2.9 | 10.2 | 8.8 | -14.5 | -13.3
3 1.0 | 14.3 | 14.7 | -13.8 | -12.7
File4.csv:
noaa-tree-2657 | noaa-tree-2658 | noaa-tree-2659 | noaa-tree-2662 | noaa-tree-2664
1 QUPR | PSME | PSME | PCGL | THOC
2 280.28 | 249.65 | 250.08 | 298 | 280.72
3 39.1 | 31.45 | 32.72 | 56.55 | 48.47
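A minimal pandas sketch of one way to line the files up before windowing for the LSTM (assumptions on my part: File1.csv has a 'years' column with one row per year, and File2.csv/File3.csv hold 12 monthly rows per year in the same year order): aggregate the monthly predictors to yearly features and join them on the year.
import pandas as pd

yearly = pd.read_csv('File1.csv')   # 1 row per year
monthly = pd.read_csv('File2.csv')  # assumed: 12 rows per year, same year order
monthly['years'] = yearly['years'].repeat(12).values  # attach the year to each month
features = monthly.groupby('years').mean().add_suffix('_monthly_mean')
merged = yearly.set_index('years').join(features)     # one row per year for the model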

Create Line-Chart with different X-Values

I have a certain number of measurements, each in the following form:
Table A:
| Time [s] | Value |
| 0.5 | 2.0 |
| 50.3 | 33.7 |
| 100.0 | 25.5 |
Table B:
| Time [s] | Value |
| 1.3 | 12.7 |
| 27.8 | 25.0 |
| 97.5 | 20.0 |
| 100.0 | 7.1 |
Table C:
...
The time range is always the same, from 0.0 to 100.0 seconds, but as the example shows, the measurement points differ.
I now want to display the different measurements in one chart, with each table getting its own line graph and the X-axis displaying the time.
Is something like this possible in Excel?
Solved my problem by using a Scatter graph instead of a Line graph...
