How do I iterate through 2 columns of pandas dataframe data? - python-3.x

I have imported some data and calculated the 5 day, 8 day and 21 day moving averages.
OPEN HIGH LOW LAST ma5 ma8 ma21
Date
11/23/2009 88.84 89.19 88.58 88.97 NaN NaN NaN
11/24/2009 88.97 89.07 88.36 88.50 NaN NaN NaN
11/25/2009 88.50 88.63 87.22 87.35 NaN NaN NaN
11/26/2009 87.35 87.48 86.30 86.59 NaN NaN NaN
11/27/2009 86.59 87.02 84.83 86.53 87.588 NaN NaN
11/30/2009 87.17 87.17 85.87 86.41 87.076 NaN NaN
Then I have iterated through the 5 day moving average (ma5) to work out whether the average is rising (+1), falling (-1) or constant (0), using:
ma5x = [0,]
lastItem = ma5[0]
for currItem in ma5[1:]:
    if currItem > lastItem:
        ma5x.append(1)
    elif currItem < lastItem:
        ma5x.append(-1)
    else:
        ma5x.append(0)
    lastItem = currItem
However, how do I iterate through 2 columns of data? For instance, how would I check whether the 8 day moving average (ma8) and the 21 day moving average (ma21) are both rising together (+1), both falling together (-1), or moving in different directions (0)?
Secondly, how do I then add this data to the original dataframe? I'm not sure how to concat the second dataframe created above, because the original data doesn't have a column index for the first 'Date' column. Many thanks.

Use the zip function to iterate through two or more items simultaneously.
ma = []
ma5Last = ma5[0]
ma8Last = ma8[0]
for ma5Curr, ma8Curr in zip(ma5[1:], ma8[1:]):
    if ma5Curr > ma5Last and ma8Curr > ma8Last:
        ma.append(1)
    elif ma5Curr < ma5Last and ma8Curr < ma8Last:
        ma.append(-1)
    else:
        ma.append(0)
    ma5Last = ma5Curr
    ma8Last = ma8Curr
To combine the new dataframe with the original dataframe, use merge:
origData = origData.merge(otherData)
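If you prefer to avoid the explicit loop, a vectorized sketch using numpy.sign on the differenced columns could look like the following. This assumes df is the original dataframe shown above; the column name ma8_21 is just an illustrative choice.
import numpy as np

# +1 where the average rose since the previous row, -1 where it fell, 0 where unchanged (or NaN)
ma8_dir = np.sign(df['ma8'].diff()).fillna(0)
ma21_dir = np.sign(df['ma21'].diff()).fillna(0)

# +1 if both are rising, -1 if both are falling, 0 otherwise
df['ma8_21'] = np.where((ma8_dir == 1) & (ma21_dir == 1), 1,
                        np.where((ma8_dir == -1) & (ma21_dir == -1), -1, 0))
Assigning the result as a new column also addresses the second part of the question, since the values stay aligned with the original Date index.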

Related

How to find the correspondence of unique values between 2 tables?

I am fairly new to Python and I am trying to create a new function to work on my project.
The function aims to detect which unique values in a column are present in the corresponding column of another table.
First, the function keeps only the unique values of the two tables, then merges them into a new dataframe.
The rest is where it gets complicated, because I would like to return which row is missing and from which table.
If you have any other leads or approaches, I'm also interested.
Here is my code :
def correspondance_cle(df1, df2, col):
    df11 = pd.DataFrame(df1[col].unique())
    df11.columns = [col]
    df11['test1'] = 1
    df21 = pd.DataFrame(df2[col].unique())
    df21.columns = [col]
    df21['test2'] = 1
    df3 = pd.merge(df11, df21, on=col, how='outer')
    df3 = df3.loc[(df3['test1'].isna() == True) | (df3['test2'].isna() == True), :]
    df3.info()
    for row in df3[col]:
        if df3['test1'].isna() == True:
            print(row, "is not in df1")
        else:
            print(row, 'is not in df2')
Thanks to everyone who took the time to read the post.
First use an outer join, removing duplicates with Series.drop_duplicates and calling Series.reset_index so the original indices are not lost:
df1 = pd.DataFrame({'a':[1,2,5,5]})
df2 = pd.DataFrame({'a':[2,20,5,8]})
col = 'a'
df = (df1[col].drop_duplicates().reset_index()
          .merge(df2[col].drop_duplicates().reset_index(),
                 indicator=True,
                 how='outer',
                 on=col))
print (df)
index_x a index_y _merge
0 0.0 1 NaN left_only
1 1.0 2 0.0 both
2 2.0 5 2.0 both
3 NaN 20 1.0 right_only
4 NaN 8 3.0 right_only
Then filter rows by helper column _merge:
print (df[df['_merge'].eq('left_only')])
index_x a index_y _merge
0 0.0 1 NaN left_only
print (df[df['_merge'].eq('right_only')])
index_x a index_y _merge
3 NaN 20 1.0 right_only
4 NaN 8 3.0 right_only
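To produce the messages asked for in the question, a small sketch built on the _merge column (assuming the df produced above and col = 'a') might be:
for _, row in df[df['_merge'] != 'both'].iterrows():
    if row['_merge'] == 'left_only':
        print(row[col], 'is not in df2')   # the value exists only in df1
    else:
        print(row[col], 'is not in df1')   # the value exists only in df2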

How can I speed up this pandas dataframe for loop computation?

I have the following dataframe of the BTC price for each minute from 2018-01-15 17:01:00 to 2020-10-31 09:59:00. As you can see, this is 1,468,379 rows of data, so my code needs to be optimized, otherwise computations can take a long time.
dfcondensed = df[["Open","Close","Buy", "Sell"]]
dfcondensed
Timestamp Open Close Buy Sell
2018-01-15 17:01:00 14174.00 14185.25 14185.11 NaN
2018-01-15 17:02:00 14185.11 14185.15 NaN NaN
2018-01-15 17:03:00 14185.25 14157.32 NaN NaN
2018-01-15 17:04:00 14177.52 14184.71 NaN NaN
2018-01-15 17:05:00 14185.03 14185.14 NaN NaN
... ... ... ... ...
2020-10-31 09:55:00 13885.00 13908.36 NaN NaN
2020-10-31 09:56:00 13905.38 13915.81 NaN NaN
2020-10-31 09:57:00 13909.02 13936.00 NaN NaN
2020-10-31 09:58:00 13936.00 13920.78 NaN NaN
2020-10-31 09:59:00 13924.56 13907.85 NaN NaN
1468379 rows × 4 columns
The algorithm that I'm trying to run is this:
PnL = []
for i in range(dfcondensed.shape[0]):
    if str(dfcondensed['Buy'].isnull().values[i]) == "False":
        for j in range(dfcondensed.shape[0] - i):
            if str(dfcondensed['Sell'].isnull().values[i+j]) == "False":
                PnL.append(((dfcondensed["Open"].iloc[i+j+1] - dfcondensed["Open"].iloc[i+1]) / dfcondensed["Open"].iloc[i+1]) * 100)
                break
Basically, to make it clear, what I'm trying to do is assess the profit/loss of buying and selling at the points marked in the Buy/Sell columns. In the first row the strategy being tested says buy at 14185.11, which was the open price at 2018-01-15 17:02:00. The algorithm should then look for when the strategy tells it to sell and mark that down, then look for the next time it is told to buy and mark that down, then the next sell, and so on. By the end there were over 7,000 different trades. I want to see the profit per trade so I can do some analysis and improve my strategy.
Using the above code to get a PnL list seems to run for a long time and I gave up waiting for it. How can I speed up the algorithm?
I found a way to speed up my loop using list-comprehensions and unrolled loops:
buylist = df["Buy"]
selllist = df["Sell"]
buylist = [x for x in buylist if str(x) != 'nan']
selllist = [x for x in selllist if str(x) != 'nan']

profit = []
for i in range(len(selllist)):
    profit.append((selllist[i] - buylist[i]) / buylist[i] * 100)
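The same pairing can be done with pandas operations alone. This is only a sketch and assumes, as the answer above does, that buys and sells strictly alternate so that the i-th buy matches the i-th sell:
buys = df["Buy"].dropna().reset_index(drop=True)
sells = df["Sell"].dropna().reset_index(drop=True)

# pair the i-th buy with the i-th sell and compute the percentage profit per trade
n = min(len(buys), len(sells))
profit = ((sells[:n] - buys[:n]) / buys[:n] * 100).tolist()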

Assign array values to NaN Dataframe Pandas

I am trying to fill a dataframe which originally has NaN values with the same number of values taken from an array. All the values in the dictionary leagueList (NFL,NBA, etc.) are individual dataframes.
Sorry, I can't place them here as the post will become too long.
The idea behind the loop below is to get the series of paired t-tests (p_value) between all leagues in the dataframe and compare them based on columns called 'win_loss_ratio'.
The resulting array, which has the same number of values as the empty dataframe, should be used to replace the NaN values in that dataframe, but I am stuck on this part. How can this be accomplished?
leagueList={'NFL':NFL,'NBA':NBA,'NHL':NHL,'MLB':MLB}
df = pd.DataFrame(columns = leagueList, index = leagueList)
print(df)
NFL NBA NHL MLB
NFL NaN NaN NaN NaN
NBA NaN NaN NaN NaN
NHL NaN NaN NaN NaN
MLB NaN NaN NaN NaN
# Double loop for making all possible league combinations
for a in leagueList.values():
    for b in leagueList.values():
        df_comb = pd.merge(a, b, left_index=True, right_index=True, how='inner')
        teststat, p_value = stats.ttest_rel(df_comb[['win_loss_ratio_x']], df_comb[['win_loss_ratio_y']])
        print(p_value)
[nan]
[0.94179205]
[0.03088317]
[0.80206949]
[0.94179205]
[nan]
[0.02229705]
[0.95053998]
[0.03088317]
[0.02229705]
[nan]
[0.00070784]
[0.80206949]
[0.95053998]
[0.00070784]
[nan]
Put the p-values into a list and then either use .fillna or just construct the DataFrame directly:
import pandas as pd
from scipy import stats

# some sample data
NFL = pd.DataFrame([.5,.6,.7], columns=['win_loss_ratio'])
NBA = pd.DataFrame([.7,.5,.3], columns=['win_loss_ratio'])
NHL = pd.DataFrame([.4,.3,.2], columns=['win_loss_ratio'])
MLB = pd.DataFrame([.9,.8,.9], columns=['win_loss_ratio'])

leagueList = {'NFL':NFL, 'NBA':NBA, 'NHL':NHL, 'MLB':MLB}

# Double loop for making all possible league combinations
rows = []
for a in leagueList.values():
    for b in leagueList.values():
        df_comb = pd.merge(a, b, left_index=True, right_index=True, how='inner')
        teststat, p_value = stats.ttest_rel(df_comb[['win_loss_ratio_x']], df_comb[['win_loss_ratio_y']])
        rows.append(p_value[0])

n = len(leagueList)
data = [rows[i * n:(i + 1) * n] for i in range((len(rows) + n - 1) // n)]
df = pd.DataFrame(data, columns=leagueList, index=leagueList)
Output:
print (df.to_string())
NFL NBA NHL MLB
NFL NaN 0.622036 0.12169 0.057191
NBA 0.622036 NaN 0.07418 0.092735
NHL 0.121690 0.074180 NaN 0.013560
MLB 0.057191 0.092735 0.01356 NaN
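Alternatively, since the all-NaN frame from the question already exists, you can fill it in place. A small sketch, assuming the rows list and n from above are available (the name df_empty is just illustrative):
import numpy as np

df_empty = pd.DataFrame(columns=leagueList, index=leagueList)   # the all-NaN frame from the question
df_empty.loc[:, :] = np.array(rows).reshape(n, n)               # overwrite the NaNs with the p-values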

TypeError: '(slice(None, 59, None), slice(None, None, None))' is an invalid key

I have the table below, from which I want to remove the rows with NaN values.
date Open ... Real Lower Band Real Upper Band
0 2020-07-08 08:05:00 2.1200 ... NaN NaN
1 2020-07-08 09:00:00 2.1400 ... NaN NaN
2 2020-07-08 09:30:00 2.1800 ... NaN NaN
3 2020-07-08 09:35:00 2.2000 ... NaN NaN
4 2020-07-08 09:40:00 2.1710 ... NaN NaN
5 2020-07-08 09:45:00 2.1550 ... NaN NaN
These NaN values continue until row no. 58.
For this, I wrote the following code. But the above error occurred.
data.drop(data[:59,:],inplace= True)
print(data)
Please help me!
There are many options to choose from:
Drop rows by index label.
df.drop(list(range(59)), axis=0, inplace=True)
Drop if nans in selected columns.
df.dropna(axis=0, subset=['Real Upper Band'], inplace=True)
Select rows to keep by index label slice
df = df.loc[59:, :] # 59 is the label in index, if index was date then replace 59 with corresponding datetime
Select rows to keep by integer index slice (similar to slicing a list)
df = df.iloc[59:, :] # 59 is the 0-index row number, regardless of what index is set on df
Filter with .loc and boolean array returned by .isna()
df = df.loc[~df['Real Upper Band'].isna(), :]
Remember that loc and iloc work with two dimensions when applied to dataframes. It is recommended to use the full slice : to avoid ambiguity and improve performance, according to the docs: https://pandas.pydata.org/docs/user_guide/indexing.html
You want to keep the rows from the 59th one onwards, so the shortest code you can run is:
data = data[59:]

Pandas append returns DF with NaN values

I'm appending data from a list to pandas df. I keep getting NaN in my entries.
Based on what I've read I think I might have to mention the data type for each column in my code.
dumps = []
features_df = pd.DataFrame()
for i in range(int(len(ids)/50)):
    dumps = sp.audio_features(ids[i*50:50*(i+1)])
    for i in range(len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index=True)
Expected results, something like-
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual-
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
For all rows
The append() strategy in a tight loop isn't a great way to do this. Rather, you can construct an empty DataFrame and then use loc to specify an insertion point; the DataFrame index should be used.
For example:
import pandas as pd

df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i

print(df)
time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
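A sketch of that list-then-concat idea applied to the question's case (assuming sp.audio_features and ids behave as described in the question) might look like:
import pandas as pd

rows = []
for i in range(int(len(ids) / 50)):
    # each call returns a list of dicts, one per track
    rows.extend(sp.audio_features(ids[i * 50:50 * (i + 1)]))

# build the DataFrame in one step; the dict keys become the column names
features_df = pd.DataFrame(rows)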
