Subtract two ECDF time series - python-3.x

Hi, I have an ECDF plot made with seaborn.
I obtain it by doing sns.ecdfplot(data=df2, x='time', hue='seg_oper', stat='count').
My dataframe is very simple:
In [174]: df2
Out[174]:
time seg_oper
265 18475 1->0:ADD['TX']
2342 78007 0->1:ADD['RX']
2399 78613 1->0:DELETE['TX']
2961 87097 0->1:ADD['RX']
2994 87210 0->1:ADD['RX']
... ... ...
330823 1002281 1->0:DELETE['TX']
331256 1003545 1->0:DELETE['TX']
331629 1004961 1->0:DELETE['TX']
332375 1006663 1->0:DELETE['TX']
333083 1008644 1->0:DELETE['TX']
[834 rows x 2 columns]
How can I subtract the series 0->1:ADD['RX'] from 1->0:DELETE['TX']?
I like seaborn because most of this data mangling is done inside the library, but in this case I need to subtract these two series ...
Thanks.

The first thing is to reproduce what seaborn does, but manually. After that (because I need to) I can subtract one series from the other.
Cumulative Count
First, we need a cumulative count for each series.
In [304]: df2['cum'] = df2.groupby(['seg_oper']).cumcount()
In [305]: df2
Out[305]:
time seg_oper cum
265 18475 1->0:ADD['TX'] 0
2961 87097 0->1:ADD['RX'] 1
2994 87210 0->1:ADD['RX'] 2
... ... ... ...
332375 1006663 1->0:DELETE['TX'] 413
333083 1008644 1->0:DELETE['TX'] 414
Pivot the data
Rearrange the DF.
In [307]: df3 = df2.pivot(index='time', columns='seg_oper', values='cum').reset_index()
In [308]: df3
Out[308]:
seg_oper time 0->1:ADD['RX'] 1->0:ADD['TX'] 1->0:DELETE['TX']
0 18475 NaN 0.0 NaN
1 78007 0.0 NaN NaN
2 78613 NaN NaN 0.0
3 87097 1.0 NaN NaN
4 87210 2.0 NaN NaN
.. ... ... ... ...
828 1002281 NaN NaN 410.0
829 1003545 NaN NaN 411.0
830 1004961 NaN NaN 412.0
831 1006663 NaN NaN 413.0
832 1008644 NaN NaN 414.0
[833 rows x 4 columns]
Fill the gaps
I'm assuming that each NaN can be filled by carrying the previous value in its column forward until the next observation.
df3 = df3.fillna(method='ffill')
(On recent pandas versions, df3.ffill() is the equivalent spelling.)
At this point, if you plot df3 you'll obtain the same picture as the sns.ecdfplot call above.
I still want to subtract one series from the other.
df3['diff'] = df3["0->1:ADD['RX']"] - df3["1->0:DELETE['TX']"]
df3.plot(x='time')
The plot of df3 (the per-series cumulative counts plus the diff column) is the result.
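For reference, here is the whole pipeline as a single self-contained sketch. The tiny df2 built inline is made-up stand-in data (the original 834-row frame isn't available), so only the shape of the steps matters:

import pandas as pd

# Made-up stand-in for the original (time, seg_oper) frame.
df2 = pd.DataFrame({
    'time': [18475, 78007, 78613, 87097, 87210, 95000],
    'seg_oper': ["1->0:ADD['TX']", "0->1:ADD['RX']", "1->0:DELETE['TX']",
                 "0->1:ADD['RX']", "0->1:ADD['RX']", "1->0:DELETE['TX']"],
})

# 1. Cumulative count per series (what ecdfplot(stat='count') accumulates).
df2['cum'] = df2.groupby('seg_oper').cumcount()

# 2. One column per series, indexed by time.
df3 = df2.pivot(index='time', columns='seg_oper', values='cum').reset_index()

# 3. Carry each step function forward across the gaps.
df3 = df3.ffill()

# 4. Subtract one step function from the other and plot everything.
df3['diff'] = df3["0->1:ADD['RX']"] - df3["1->0:DELETE['TX']"]
df3.plot(x='time')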
PS: I don't understand the downvote on the question. If someone can explain it, I'd appreciate it.

How can I speed up this pandas dataframe for loop computation?

I have the following dataframe of the BTC price for each minute from 2018-01-15 17:01:00 to 2020-10-31 09:59:00. As you can see, this is 1,468,379 rows of data, so my code needs to be optimized; otherwise computations can take a long time.
dfcondensed = df[["Open","Close","Buy", "Sell"]]
dfcondensed
Timestamp Open Close Buy Sell
2018-01-15 17:01:00 14174.00 14185.25 14185.11 NaN
2018-01-15 17:02:00 14185.11 14185.15 NaN NaN
2018-01-15 17:03:00 14185.25 14157.32 NaN NaN
2018-01-15 17:04:00 14177.52 14184.71 NaN NaN
2018-01-15 17:05:00 14185.03 14185.14 NaN NaN
... ... ... ... ...
2020-10-31 09:55:00 13885.00 13908.36 NaN NaN
2020-10-31 09:56:00 13905.38 13915.81 NaN NaN
2020-10-31 09:57:00 13909.02 13936.00 NaN NaN
2020-10-31 09:58:00 13936.00 13920.78 NaN NaN
2020-10-31 09:59:00 13924.56 13907.85 NaN NaN
1468379 rows × 4 columns
The algorithm that I'm trying to run is this:
PnL = []
for i in range(dfcondensed.shape[0]):
    if str(dfcondensed['Buy'].isnull().values[i]) == "False":
        for j in range(dfcondensed.shape[0]-i):
            if str(dfcondensed['Sell'].isnull().values[i+j]) == "False":
                PnL.append(((dfcondensed["Open"].iloc[i+j+1] - dfcondensed["Open"].iloc[i+1]) / dfcondensed["Open"].iloc[i+1]) * 100)
                break
Basically, to make it clear, what I'm trying to do is assess the profit/loss of buying and selling at the points marked in the Buy/Sell columns. In the first row the strategy being tested says buy at 14185.11, which was the open price at 2018-01-15 17:02:00. The algorithm should then look for when the strategy next tells it to sell and mark that down, then look for the next time it's told to buy and mark that down, then the next sell, and so on. By the end there were over 7,000 trades, and I want the profit per trade so I can do some analysis and improve my strategy.
Using the above code to get a PnL list runs for a long time, and I gave up waiting for it. How can I speed up the algorithm?
I found a way to speed up my loop using list comprehensions and a simplified pairing loop:
buylist = df["Buy"]
selllist = df["Sell"]
buylist = [x for x in buylist if str(x) != 'nan']
selllist = [x for x in selllist if str(x) != 'nan']
profit = []
for i in range(len(selllist)):
    profit.append((selllist[i] - buylist[i]) / buylist[i] * 100)
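Note that this version, unlike the original loop, prices each trade at the signal values themselves rather than at the next bar's Open, and it assumes buys and sells strictly alternate. Under those same assumptions the pairing needs no Python-level loop at all; a minimal sketch:

import pandas as pd

# Keep only the bars where a signal fired, in time order.
buys = df['Buy'].dropna().to_numpy()
sells = df['Sell'].dropna().to_numpy()

# Guard against a final buy that was never closed out by a sell.
n = min(len(buys), len(sells))

# Percentage profit per completed trade, fully vectorized.
profit = (sells[:n] - buys[:n]) / buys[:n] * 100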

Pivoting Repeating Time Series Data

I am trying to pivot this data in such a way that I get columns like AK_positive, AK_probableCases, AK_negative, AL_positive, and so on.
You can get the data with df = pd.read_csv('https://covidtracking.com/api/states/daily.csv').
After pivoting on date and state you get a MultiIndex on the columns; just flatten it into tuples using .to_flat_index(), and rearrange the tuple elements into new column names (the pivot step itself isn't shown here; see the sketch at the end of this answer).
df_pivoted.columns = [f"{i[1]}_{i[0]}" for i in df_pivoted.columns.to_flat_index()]
Result:
# start from April
df_pivoted[df_pivoted.index >= 20200401].head(5)
AK_positive AL_positive AR_positive ... WI_grade WV_grade WY_grade
date ...
20200401 133.0 1077.0 584.0 ... NaN NaN NaN
20200402 143.0 1233.0 643.0 ... NaN NaN NaN
20200403 157.0 1432.0 704.0 ... NaN NaN NaN
20200404 171.0 1580.0 743.0 ... NaN NaN NaN
20200405 185.0 1796.0 830.0 ... NaN NaN NaN
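For completeness, an end-to-end sketch; the pivot spec below (date as index, state as columns) is an assumption inferred from the desired AK_positive-style names, and the covidtracking.com endpoint may no longer serve data:

import pandas as pd

df = pd.read_csv('https://covidtracking.com/api/states/daily.csv')

# Wide layout: one row per date, a (metric, state) MultiIndex on the columns.
# Assumed pivot spec; the answer above starts from a df_pivoted like this.
df_pivoted = df.pivot(index='date', columns='state')

# Flatten the MultiIndex into 'STATE_metric' names, e.g. 'AK_positive'.
df_pivoted.columns = [f"{state}_{metric}"
                      for metric, state in df_pivoted.columns.to_flat_index()]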

TypeError: '(slice(None, 59, None), slice(None, None, None))' is an invalid key

I have the table below, and I want to remove the rows with NaN values.
date Open ... Real Lower Band Real Upper Band
0 2020-07-08 08:05:00 2.1200 ... NaN NaN
1 2020-07-08 09:00:00 2.1400 ... NaN NaN
2 2020-07-08 09:30:00 2.1800 ... NaN NaN
3 2020-07-08 09:35:00 2.2000 ... NaN NaN
4 2020-07-08 09:40:00 2.1710 ... NaN NaN
5 2020-07-08 09:45:00 2.1550 ... NaN NaN
These NaN values continue until row no. 58.
To remove them, I wrote the following code, but the above error occurred.
data.drop(data[:59,:],inplace= True)
print(data)
Please help me!
There are many options to choose from:
Drop rows by index label.
df.drop(list(range(59)), axis=0, inplace=True)
Drop rows that have NaNs in selected columns.
df.dropna(axis=0, subset=['Real Upper Band'], inplace=True)
Select rows to keep by index label slice
df = df.loc[59:, :] # 59 is the label in index, if index was date then replace 59 with corresponding datetime
Select rows to keep by integer index slice (similar to slicing a list)
df = df.iloc[59:, :] # 59 is the 0-index row number, regardless of what index is set on df
Filter with .loc and boolean array returned by .isna()
df = df.loc[~df['Real Upper Band'].isna(), :]
Remember that loc and iloc work with two dimensions when applied to dataframes; it is recommended to use the full slice : to avoid ambiguity and improve performance, according to the docs: https://pandas.pydata.org/docs/user_guide/indexing.html
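The label/position distinction between the two slicing options above matters whenever the index isn't the default 0..n-1 range; a small illustrative sketch with made-up labels:

import pandas as pd

# Index labels are 100, 101, 102 -- they are not positions.
df = pd.DataFrame({'a': [10, 20, 30]}, index=[100, 101, 102])

kept_by_label = df.loc[101:, :]    # rows whose label is 101 or later
kept_by_position = df.iloc[1:, :]  # rows at positions 1 and 2 -- the same two rows here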
You want to keep rows from the 59th on, so the shortest code you can run is:
data = data[59:]

Pandas append returns DF with NaN values

I'm appending data from a list to a pandas df. I keep getting NaN in my entries.
Based on what I've read, I think I might have to specify the data type for each column in my code.
dumps = []
features_df = pd.DataFrame()
for i in range(int(len(ids)/50)):
    dumps = sp.audio_features(ids[i*50:50*(i+1)])
    for i in range(len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index=True)
Expected results, something like-
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual-
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
For all rows
Calling append() in a tight loop isn't a great way to do this. Instead, you can construct an empty DataFrame and then use loc to specify an insertion point; the DataFrame index is used as the row label.
For example:
import pandas as pd
df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i
print(df)
time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
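A minimal sketch of the pattern the docs describe, accumulating plain rows and concatenating once at the end (the 'n' column is carried over from the example above):

import pandas as pd

df = pd.DataFrame(data=[], columns=['n'])

# Accumulate cheap Python objects instead of growing the DataFrame row by row.
rows = [{'n': i} for i in range(100)]

# One concatenation at the end, instead of 100 appends.
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)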

Pandas rolling mean: don't change numbers to NaN in DataFrame

I'm working with a pandas DataFrame which looks like this:
(N.B. the offset is set as the index of the DataFrame.)
offset X Y Z
0 -0.140137 -1.924316 -0.426758
10 -2.789123 -1.111212 -0.416016
20 -0.133789 -1.923828 -4.408691
30 -0.101112 -1.457891 -0.425781
40 -0.126465 -1.926758 -0.414062
50 -0.137207 -1.916992 -0.404297
60 -0.130371 -3.784591 -0.987654
70 -0.125000 -1.918457 -0.403809
80 -0.123456 -1.917480 -0.413574
90 -0.126465 -1.926758 -0.333554
I have applied a rolling mean with window size 5 to the DataFrame using the following code.
I need to keep the window size at 5, and I need values for the whole dataframe for all of the offset values (no NaNs).
df = df.rolling(center=False, window=5).mean()
Which gives me:
offset X Y Z
0.0 NaN NaN NaN
10.0 NaN NaN NaN
20.0 NaN NaN NaN
30.0 NaN NaN NaN
40.0 -0.658125 -1.668801 -1.218262
50.0 -0.657539 -1.667336 -1.213769
60.0 -0.125789 -2.202012 -1.328097
70.0 -0.124031 -2.200938 -0.527121
80.0 -0.128500 -2.292856 -0.524679
90.0 -0.128500 -2.292856 -0.508578
I would like the DataFrame to keep the first values, the ones that become NaN, unchanged, and have the rest of the values be the result of the rolling mean. Is there a simple way I could do this? Thanks
i.e.
offset X Y Z
0.0 -0.140137 -1.924316 -0.426758
10.0 -2.789123 -1.111212 -0.416016
20.0 -0.133789 -1.923828 -4.408691
30.0 -0.101112 -1.457891 -0.425781
40.0 -0.658125 -1.668801 -1.218262
50.0 -0.657539 -1.667336 -1.213769
60.0 -0.125789 -2.202012 -1.328097
70.0 -0.124031 -2.200938 -0.527121
80.0 -0.128500 -2.292856 -0.524679
90.0 -0.128500 -2.292856 -0.508578
You can fill with the original df:
df.rolling(center=False, window=5).mean().fillna(df)
Out:
X Y Z
offset
0 -0.140137 -1.924316 -0.426758
10 -2.789123 -1.111212 -0.416016
20 -0.133789 -1.923828 -4.408691
30 -0.101112 -1.457891 -0.425781
40 -0.658125 -1.668801 -1.218262
50 -0.657539 -1.667336 -1.213769
60 -0.125789 -2.202012 -1.328097
70 -0.124031 -2.200938 -0.527121
80 -0.128500 -2.292856 -0.524679
90 -0.128500 -2.292856 -0.508578
There is also an argument, min_periods, that you can use. If you pass min_periods=1, it will take the first value as it is, the second value as the mean of the first two, etc. It might make more sense in some cases.
df.rolling(center=False, window=5, min_periods=1).mean()
Out:
X Y Z
offset
0 -0.140137 -1.924316 -0.426758
10 -1.464630 -1.517764 -0.421387
20 -1.021016 -1.653119 -1.750488
30 -0.791040 -1.604312 -1.419311
40 -0.658125 -1.668801 -1.218262
50 -0.657539 -1.667336 -1.213769
60 -0.125789 -2.202012 -1.328097
70 -0.124031 -2.200938 -0.527121
80 -0.128500 -2.292856 -0.524679
90 -0.128500 -2.292856 -0.508578
Assuming you don't have other rows that are all NaN, you can identify which rows are all NaN in your rolling df, and replace them with the corresponding rows from the original. Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(13, 5))
df_rolling = df.rolling(center=False, window=5).mean()
# identify which rows are all NaN
idx = df_rolling.index[df_rolling.isnull().all(1)]
# replace those rows with the original data
df_rolling.loc[idx, :] = df.loc[idx, :]
