How can I find out which operands are supported before looping through pandas dataframe? - python-3.x

I am attempting to iterate over rows in a Series within a Pandas DataFrame. I would like to take the value in each row of the column csv_df['Strike'] and plug it into variable K, which gets called in function a.
Then, I want the output a1 and a2 to be put into their own columns within the DataFrame.
I am receiving the error: TypeError: unsupported operand type(s) for *: 'int' and 'zip', and I figure that if I can find out which operand types are supported, I could convert a1 and a2 to one of those types.
Am I thinking about this correctly?
Note: S is just a static number as the df is just one row, while K has many rows.
Code is below:
from scipy.stats import norm
from math import sqrt, exp, log, pi
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like
import fix_yahoo_finance as yf
yf.pdr_override()
import numpy as np
import datetime
from pandas_datareader import data, wb
import matplotlib.pyplot as plt
#To get data:
start = datetime.datetime.today()
end = datetime.datetime.today()
df = data.get_data_yahoo('AAPL', start, end) #puts data into a pandas dataframe
csv_df = pd.read_csv('./AAPL_TEST.csv')
for row in csv_df.itertuples():
    def a(S, K):
        a1 = 100 * K
        a2 = S
        return a1
    S = df['Adj Close'].items()
    K = csv_df['strike'].items()
    a1, a2 = a(S, K)
    df['new'] = a1
    df['new2'] = a2

It seems an alternate way of doing what you want would be to apply your method to each data frame separately, as in:
df = data.get_data_yahoo('AAPL', start, end)
csv_df = pd.read_csv('./AAPL_TEST.csv')
df['new'] = csv_df['strike'].apply(lambda x: 100 * x)
df['new2'] = df['Adj Close']
Perhaps, applying the calculation directly to the Pandas Series (a column of your data frame) is a way to avoid defining a method that is used only once.
Plus, I wouldn't define a method within a loop as you have.
Cheers
P.S. I believe you have forgotten to return both values in your method:
def a(S, K):
    a1 = 100 * K
    a2 = S
    return (a1, a2)
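For context on the original error: Series.items() returns an iterator of (index, value) pairs (a zip object in older pandas versions), not the numeric values, which is why multiplying it by an int fails. A minimal sketch of the difference:

import pandas as pd

s = pd.Series([10.0, 20.0], name='strike')
print(type(s.items()))  # an iterator of (index, value) pairs, not numbers
# 100 * s.items()       # would raise: unsupported operand type(s) for *
print(100 * s)          # vectorized: multiplies every element, returns a Series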

Related

Altair/Vega-Lite heatmap: Filter top k

I am attempting to create a heatmap and retain only the top 5 samples based on the mean of the relative abundance. I am able to sort the heatmap properly, but I can't figure out how to retain only the top 5, in this case samples c, e, b, y, a. I am pasting a subset of the df with the image. I've tried myriad permutations of the "Top K Items Tutorial" link at the altair-viz website. I'd prefer to use altair to do the filtering if possible, as opposed to filtering the df itself in the python code.
Dataframe:
,Lowest_Taxon,p1,p2,p3,p4,p5,p6,p7
0,a,0.03241281,0.0,0.467738067,3.14456785,0.589519651,13.5744323,0.0
1,b,0.680669,9.315121951,2.848404893,13.99058458,2.139737991,16.60779366,7.574639383
2,c,40.65862829,1.244878049,71.01223315,4.82197541,83.18777293,0.0,0.0
3,d,0.0,0.0,0.0,0.548471137,0.272925764,0.925147183,0.0
4,e,0.090755867,13.81853659,5.205085152,27.75721011,1.703056769,19.6691898,12.27775914
5,f,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,g,0.187994295,0.027317073,0.0,0.0,0.0,0.02242781,0.0
7,h,0.16854661,0.534634146,1.217318302,7.271813154,1.73580786,0.57751612,0.57027843
8,i,0.142616362,2.528780488,1.163348525,0.34279446,0.0,0.0,0.0
9,j,1.711396344,0.694634146,0.251858959,4.273504274,0.087336245,1.721334455,0.899027172
10,k,0.0,1.475121951,0.0,0.0,0.0,5.573310906,0.0
11,l,0.194476857,0.253658537,1.517150396,2.413273002,0.949781659,5.147182506,1.650452868
12,m,0.0,1.736585366,0.0,0.063988299,0.0,8.42724979,0.623951694
13,n,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,o,4.68689226,0.12097561,0.0,0.0,0.0,0.0,0.0
15,p,0.0,0.885853659,0.0,0.0,0.0,0.913933277,0.046964106
16,q,0.252819914,0.050731707,0.023986568,0.0,0.087336245,0.0,0.0
17,r,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,s,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,t,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,u,0.0,0.058536585,0.089949628,0.356506239,0.0,0.285954584,1.17410265
21,v,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,w,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,x,1.471541553,2.396097561,0.593667546,0.278806161,0.065502183,0.280347631,0.952700436
24,y,0.0,0.32,0.0,0.461629873,0.0,7.804878049,18.38980208
25,z,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Code block:
import pandas as pd
import numpy as np
import altair as alt
from vega_datasets import data
from altair_saver import save
# Read in the file and fill empty cells with zero
df = pd.read_excel(r"path\to\df")
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
# Tell altair to plot as many rows as is necessary
alt.data_transformers.disable_max_rows()
alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
)
If you know what you want to show is the entries with c, e, b, y and a (and it will not change later) you could simply apply a transform_filter on the field Lowest_Taxon.
If you want to calculate on the spot which ones make it into the top five, it needs a bit more effort, i.e. a combination of joinaggregate, window and filter transforms.
For both I paste an example below. By the way, I converted the original data that you pasted into a csv file, which is imported by the code snippets. You can make it easier for others to use your pandas toy data by providing it as a dict, which can then be read directly in the code.
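For illustration, a hypothetical inline version of the first three rows as a dict (values copied from the pasted data):

import pandas as pd

# hypothetical inline version of the first three rows of the pasted data
df = pd.DataFrame({
    'Lowest_Taxon': ['a', 'b', 'c'],
    'p1': [0.03241281, 0.680669, 40.65862829],
    'p2': [0.0, 9.315121951, 1.244878049],
})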
Simple approach:
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()
df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_filter(
    alt.FieldOneOfPredicate(field='Lowest_Taxon', oneOf=['c', 'e', 'b', 'y', 'a'])
)
Flexible approach:
Set n to however many of the top entries you want to see:
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()
df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
n = 5 # number of entries to display
alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_joinaggregate(
    mean_rel_ab='mean(Relative_abundance)',
    count_of_samples='valid(Relative_abundance)',
    groupby=['Lowest_Taxon']
).transform_window(
    rank='rank(mean_rel_ab)',
    sort=[alt.SortField('mean_rel_ab', order='descending')],
    frame=[None, None]
).transform_filter(
    alt.datum.rank <= (n - 1) * alt.datum.count_of_samples + 1
)
The filter works because all melted rows of a taxon are tied on mean_rel_ab and therefore share one rank, so the rows of the n-th ranked taxon all carry rank (n - 1) * count_of_samples + 1.

Apply function on a Pandas Dataframe

I have a code block (C01) that calculates the 21-period moving average of a given stock on the Brazilian exchange (IBOV - B3). I then created a for loop that flags an asset as being in an upward trend after six consecutive rises of the moving average (a working hypothesis; in practice more variables would be needed to determine this).
However, I want to run this check for more than one asset, as in C02: that is, apply a function to each column of my dataframe and return only the names of the assets that are in an upward trend (in this case, the column names). I tried to turn the for loop into a function and apply it with pandas 'apply' to each column (axis=1; I also tried axis='columns'). But I'm having an error creating the function: when I execute it via apply, I get "ValueError: Lengths must match to compare". How can I fix this?
Thanks for your attention.
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
from mpl_finance import candlestick_ohlc
from datetime import datetime
import matplotlib.dates as mpl_dates
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
#STOCK
ativo = 'WEGE3.SA'
acao2 = ativo.upper()
#START AND END ANALYSIS
inicio = '2020-1-1'
fim = '2021-1-27'
#MAKE DATAFRAME
df00 = wb.DataReader(acao2, data_source='yahoo', start=inicio, end=fim)
df00.index.names = ['Data']
df= df00.copy(deep=True)
df['Data'] = df.index.map(mdates.date2num)
# MOVING AVERAGE
df['ema21'] = df['Close'].ewm(span=21, adjust=False).mean()
df['ema72'] = df['Close'].ewm(span=72, adjust=False).mean()
#DF PLOT
df1=df
df2=df[-120:]
#TREND RULE
alta = 1
for i in range(6):
    if (df2.ema21[-i-1] < df2.ema21[-i-2]):
        alta = 0
baixa = 1
for i in range(6):
    if (df2.ema21[-i-1] > df2.ema21[-i-2]):
        baixa = 0
if (alta == 1 and baixa == 0):
    a1 = ativo.upper() + ' HIGH TREND'
elif (alta == 0 and baixa == 1):
    a1 = ativo.upper() + ' LOW TREND!'
else:
    a1 = ativo.upper() + ' UNDEFINED'
#PLOT RESULTS
print("---------------------------------------")
print(a1)
print("---------------------------------------")
ohlc = df[['Data', 'Open', 'High', 'Low', 'Close']]
f1, ax = plt.subplots(figsize=(14, 8))
# plot the candlesticks
candlestick_ohlc(ax, ohlc.values, width=.6, colorup='green', colordown='red')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
label_ = acao2.upper() + ' EMA21'
label_2 = acao2.upper() + ' EMA72'
ax.plot(df.index, df1['ema21'], color='black', label=label_)
ax.plot(df.index, df1['ema72'], color='blue', label=label_2)
ax.grid(False)
ax.legend()
ax.grid(True)
plt.title(acao2.upper() + ' : Gráfico Diário')
plt.show(block=True)
#C02
#START/END ANALISYS
inicio = '2020-1-1'
fim = '2021-1-27'
#STOCKS
ativos = ['SAPR11.SA','WEGE3.SA']
#DATAFRAME
mydata = pd.DataFrame()
for t in ativos:
    mydata[t] = wb.DataReader(t, data_source='yahoo', start=inicio, end=fim)['Close']
df2 = mydata
#MOVING AVERAGE
df3 = df2.apply(lambda x: x.rolling(window=21).mean())
#MAKE FUNCTION
def trend(x):
    tendencia_alta = 1
    for i in range(6):
        if (df3.columns[-i-1:] > df3.columns[-i-2:]):
            tendencia_alta = 0
    print()
    if (alta == 1 and baixa == 0):
        a1 = ativo.upper() + ' HIGH TREND'
    elif (alta == 0 and baixa == 1):
        a1 = ativo.upper() + ' LOW TREND!'
    else:
        a1 = ativo.upper() + ' UNDEFINED'
#TRYING TO APPLY THE FUNCTION IN EVERY DF3 COLUMN
df3.apply(trend, axis=1)
Something like:
def myfunc(x):
    # do things here where x is the group of rows sent to the function
    # instead of df['column'], you'll use x['column']
    # because you are passing the rows into x
    return x

df.groupby('yourcolumn').apply(myfunc)
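Adapting that pattern to C02 above, here is a minimal column-wise sketch (an assumption about the intended logic, not the original poster's code): with the default axis=0, df3.apply passes each ticker's moving-average column to the function as a Series, so the comparisons run on values rather than on df3.columns.

def trend(x):
    # x is one column of df3: the 21-period moving average for a single ticker
    last = x.dropna().iloc[-7:]  # last 7 values give 6 consecutive comparisons
    ups = [last.iloc[i] < last.iloc[i + 1] for i in range(len(last) - 1)]
    downs = [last.iloc[i] > last.iloc[i + 1] for i in range(len(last) - 1)]
    if ups and all(ups):
        return x.name.upper() + ' HIGH TREND'
    elif downs and all(downs):
        return x.name.upper() + ' LOW TREND!'
    return x.name.upper() + ' UNDEFINED'

# the default axis=0 sends each column (one ticker) to trend()
print(df3.apply(trend))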

Finding Specific word in a pandas column and assigning to a new column and replicate the row

I am trying to find specific words in a pandas column and assign them to a new column; a cell may contain two or more of the words. Once I find a match, I want to replicate the row, creating one row per matched word.
import pandas as pd
import numpy as np
import re
wizard = pd.read_excel(r'C:\Python\L\Book1.xlsx',
                       sheet_name='Sheet1',
                       header=0)
test_set = {'941', '942',}
test_set2={'MN','OK','33/3305'}
wizard['ZTYPE'] = wizard['Comment'].apply(lambda x: any(i in test_set for i in x.split()))
wizard['ZJURIS']=wizard['Comment'].apply(lambda x: any(i in test_set2 for i in x.split()))
wizard_new = pd.DataFrame(np.repeat(wizard.values,3,axis=0))
wizard_new.columns = wizard.columns
wizard_new.head()
I am getting True and False values; however, I am unable to do the split.
The sample data below shows what the column looks like. I need to find anything like '33/3305'; the year could be entered as '19' or '2019', and the quarter could be entered as 'Q1', '1Q', 'Q 1' or '1 Q', together with my test-set lists.
import itertools
ZJURIS = dict(list(itertools.chain(*[[(y_, x) for y_ in y] for x, y in wizard.comment()])))
def to_category(x):
    for w in x.lower().split(" "):
        if w in ZJURIS:
            return ZJURIS[w]
    return None
Finally, apply the method on the column and save the result to a new one:
wizard["ZJURIS"] = wizard["comment"].apply(to_category)
I tried the above solution, but it did not work.
Any suggestions on how to get the code to work?
Sample data.
data = {'ID': ['351362278576', '351539320880', '351582465214', '351609744560', '351708198604'],
        'BU': ['SBS', 'MAS', 'NAS', 'ET', 'SBS'],
        'Comment': ['940/941/w2-W3NYSIT/SUI33/3305/2019/1q', 'OK SUI 2Q19',
                    '941 - 3Q2019NJ SIT - 3Q2019NJ SUI/SDI - 3Q2019', 'IL,SUI,2016Q4,2017Q1,2017Q2',
                    '1Q2019 PA 39/5659 39/2476', 'UT SIT 1Q19-3Q19']}
df = pd.DataFrame(data)
The expected output, based on this sample data set, was attached as an image in the original post.
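No answer was recorded for this question, but as a hedged sketch of the pattern matching described above (the regexes below are illustrative assumptions, not from the original post):

import re
import pandas as pd

# illustrative patterns (assumptions): quarters like 'Q1', '1Q', 'Q 1', '1 Q',
# and jurisdiction codes like '33/3305'
quarter_re = r'[1-4]\s?Q|Q\s?[1-4]'
jurisdiction_re = r'\d{2}/\d{4}'

s = pd.Series(['940/941/w2-W3NYSIT/SUI33/3305/2019/1q', 'OK SUI 2Q19'])
print(s.str.findall(quarter_re, flags=re.IGNORECASE))  # ['1q'], ['2Q']
print(s.str.findall(jurisdiction_re))                  # ['33/3305'], []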

calculation of distance matrix in a faster approach

I have a dataframe
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
a = {'b':['cat','bat','cat','cat','bat','No Data','bat','No Data']}
df11 = pd.DataFrame(a,index=['x1','x2','x3','x4','x5','x6','x7','x8'])
and I have a distance function:
def distancemetric(x):
    list1 = x['b'].tolist()
    result11 = []
    sortlist11 = [process.extract(ele, list1, limit=11000000, scorer=fuzz.token_set_ratio) for ele in list1]
    d11 = [dict(element) for element in sortlist11]
    finale11 = [(k, element123[k]) for k in list1 for element123 in d11]
    result11.extend([x[1] for x in finale11])
    final_result11 = np.reshape(result11, (len(x.index), len(x.index)))
    return final_result11
I call the function with:
values1 = distancemetric(df11)
Here the token_set_ratio scorer compares only two strings at a time; when I pass an array of strings, it gives me an average, which I don't need.
This code works, but it is slow. Is there any way to make it run faster?
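No answer was posted in this thread, but one common speedup (a hedged sketch, not from the original discussion) is to score each unique pair of strings only once and then broadcast the results back to the full matrix, since the column contains many repeated values:

import numpy as np
from fuzzywuzzy import fuzz

def distancemetric_fast(x):
    # reuses df11 from the question above
    vals = x['b'].tolist()
    uniq = sorted(set(vals))
    # score each unique pair once instead of every row pair
    scores = {(u, v): fuzz.token_set_ratio(u, v) for u in uniq for v in uniq}
    n = len(vals)
    return np.array([[scores[vals[i], vals[j]] for j in range(n)] for i in range(n)])

values1 = distancemetric_fast(df11)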

Python3, with pandas.dataframe, how to select certain data by some rules to show

I have a pandas.dataframe, and I want to select certain data by some rules.
The following code generates the dataframe:
import datetime
import pandas as pd
import numpy as np
today = datetime.date.today()
dates = list()
for k in range(10):
    a_day = today - datetime.timedelta(days=k)
    dates.append(np.datetime64(a_day))
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(10, 3)),
                  columns=('other1', 'actual', 'other2'),
                  index=['{}'.format(i) for i in range(10)])
df.insert(0, 'dates', dates)
df['err_m'] = np.random.rand(10, 1)*0.1
df['std'] = np.random.rand(10, 1)*0.05
df['gain'] = np.random.rand(10, 1)
Now, I want to select data by the following rules:
1. compute the sum of 'err_m' and 'std', then sort the df so that the sum is descending
2. from the result of step 1, select the rows where 'actual' is > 50
Thanks
Create a new column and then sort by this one:
df['errsum'] = df['err_m'] + df['std']
# Return a sorted dataframe
df_sorted = df.sort_values('errsum', ascending=False)
Then select the rows you want:
# Create a boolean mask, True where the condition is met
selector = df_sorted['actual'] > 50
# Return a view of the sorted dataframe with only the rows you want
df_sorted[selector]
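Equivalently, the two steps can be chained into one expression (a sketch of the same logic):

result = (df.assign(errsum=df['err_m'] + df['std'])
            .sort_values('errsum', ascending=False)
            .query('actual > 50'))
print(result)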
