Compare Percent Change values inside a for loop - python-3.x

I have a list like the one below.
I want to be able to compare the percent change of QQQ at 9:35 to various other stocks like AAPL and AMD at the same time: check whether the percent change of AAPL at 9:35 is greater than the percent change of QQQ at 9:35, then the same for AMD at 9:35, and then at 9:40, 9:45, and so on.
I want to do this in Python.
This is what I have so far, but it's not quite correct:
import pandas as pd
import time
import yfinance as yf
import datetime as dt
from pandas_datareader import data as pdr
from collections import Counter
from tkinter import Tk
from tkinter.filedialog import askopenfilename
import os
from pandas import ExcelWriter

d1 = dt.datetime(2020, 8, 5, 9, 0, 0)
d2 = dt.datetime(2020, 8, 5, 16, 0, 0)
pc = Counter()
filePath = r"C:\Users\Adil\Documents\Python Test - ET\Data\Trail.xlsx"
stocklist = pd.read_excel(filePath)

for i in stocklist.index:
    symbol = stocklist['Symbol'][i]
    date = stocklist['Date'][i]
    close = stocklist['Close'][i]
    pc = stocklist['PercentChange'][i]
    if (pc[i] > pc['QQQ']):
        print(pc[i])

Alright, I got an explanation of what the OP wants from a comment:
Yes so i want to see if within a given time 5 min time period if a
stock performed better than QQQ
The first thing you need to do is make it so you can look up your information by time and symbol. Here is how I would do that:
my_data = {}
for i in stocklist.index:
    symbol = stocklist['Symbol'][i]
    date = stocklist['Date'][i]
    pc = stocklist['PercentChange'][i]
    my_data[symbol, date] = pc
That makes a dictionary where you can look up percent changes by calling my_data['ABCD', 'datetime'].
Then, I would make a list of all the times.
time_set = set()
for i in stocklist.index:
    date = stocklist['Date'][i]
    time_set.add(date)
times = list(time_set)
times.sort()
If you are pressed for computer resources, you could combine those two loops and run them together (see the single-pass sketch at the end of this answer), but I think having them separate makes the code easier to understand.
And then do the same thing for symbols:
sym_set = set()
for i in stocklist.index:
    symbol = stocklist['Symbol'][i]
    sym_set.add(symbol)
symbols = list(sym_set)
symbols.sort()
Once again, you could have made this set during the first for-loop, but this way you can see what we are trying to accomplish a bit better.
The last thing to do is actually make the comparisons:
for i in times:
    qs = my_data['QQQ', i]
    for j in symbols:
        if j != 'QQQ':  # skip comparing QQQ to itself
            which = "better" if my_data[j, i] > qs else "worse"
            print(j + " did " + which + " than QQQ at " + str(i))
Now, this just prints the information to the console; you should replace the print call with however you want to output it. (Also, I assumed higher is better; I hope that was right.)
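As mentioned above, the two set-building loops can be folded into the first pass over the data. A single-pass sketch, assuming the same stocklist columns as in the answer:

my_data = {}
time_set = set()
sym_set = set()
for i in stocklist.index:
    symbol = stocklist['Symbol'][i]
    date = stocklist['Date'][i]
    my_data[symbol, date] = stocklist['PercentChange'][i]
    time_set.add(date)
    sym_set.add(symbol)
times = sorted(time_set)    # sorted() returns a new sorted list
symbols = sorted(sym_set)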

Related

Fuzzywuzzy match 2 columns... script keeps running

I'm trying to match 2 columns of ~50,000 instances with Fuzzywuzzy.
Column A (companies) contains company names, with some typos. Column B (correct) contains the correct company names.
I'm trying to match the typo ones with the correct ones. When running my script below, the kernel keeps executing for hours and doesn't produce a result.
Any ideas on how to improve it?
Many thanks!
Update link to files: https://fromsmash.com/STLz.VEub2-ct
import pandas as pd
from fuzzywuzzy import process, fuzz
import matplotlib.pyplot as plt

correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")

actual_comp = []
similarity = []
for i in companies.Customers:
    ratio = process.extract(i, correct.Correct, limit=1)
    actual_comp.append(ratio[0][0])
    similarity.append(ratio[0][1])

companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)
There are a couple of things you can change to improve the performance:
Use RapidFuzz instead of Fuzzywuzzy, since it implements the same algorithms but is quite a bit faster (I am the author).
The process functions preprocess all strings you pass to them (lowercase them, remove non-alphanumeric characters, and trim whitespace). Right now you're preprocessing correct.Correct len(companies.Customers) times, which costs a lot of time and could instead be done once, before the loop.
You're only using the best match, so it is better to use process.extractOne instead of process.extract. This is more readable, and inside extractOne RapidFuzz uses the results of previous comparisons to improve the performance.
The following snippet implements these changes for your code. Keep in mind that you're still performing 50k² comparisons, so while this should be a lot faster than your current solution, it will still take a while.
import pandas as pd
from rapidfuzz import process, fuzz, utils
import matplotlib.pyplot as plt

correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")

actual_comp = []
similarity = []

# preprocess the correct company names once, up front
company_mapping = {company: utils.default_process(company) for company in correct.Correct}

for customer in companies.Customers:
    _, score, comp = process.extractOne(
        utils.default_process(customer),
        company_mapping,
        processor=None)
    actual_comp.append(comp)
    similarity.append(score)

companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)
Out of interest, I ran a quick benchmark calculating the average runtime using your datasets. On my machine each lookup requires around 1 second with this solution (so a total of around 4.7 hours), while your previous solution took around 55 seconds per lookup (so a total of around 10.8 days).
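If you want to sanity-check the per-lookup time on your own machine before committing to the full run, a minimal sketch that times a small sample (it assumes the companies frame and company_mapping dict from the snippet above):

import time
from rapidfuzz import process, utils

sample = companies.Customers.head(10)
start = time.perf_counter()
for customer in sample:
    process.extractOne(utils.default_process(customer), company_mapping, processor=None)
print((time.perf_counter() - start) / len(sample), "seconds per lookup")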

Creating a P/L Column in Pandas from an Open and Close Price based on whether it is a long or short position

This one has been bothering me for a while. I have all the pieces (I think) that work individually to create the output I'm looking for (calculating a profit and loss for a stock), but when put together they return nothing.
The dataframe itself is pretty self-explanatory, so I haven't included an example. Basically the series includes Stock Symbol, Opening Time, Opening Price, Closing Time, Closing Price, and whether it was a long or short position.
Here's my code to calculate the P-L for a long position:
import pandas as pd
from yahoo_fin import stock_info as si
from datetime import datetime, timedelta, date
import time

def create_df3():
    return pd.read_excel('Base_Sheet.xlsx', sheet_name="Closed_Pos", header=0)

def update_price(sym):
    return si.get_live_price(sym)

long_pl_calc = ((df3['Close_Price']) / (df3['Entry_Price'])) - 1
close_long_pl = df3['P-L'].isnull and (df3['Long_Short'] == 'Long')
for row in df3.iterrows():
    if close_long_pl is True:
        return df3['P-L'].apply(long_pl_calc)
If I print long_pl_calc or close_long_pl, I get exactly what I expect. However, when I iterate through the series to return the calculation, I still end up with a NaN value (but not an error).
Any help would be appreciated! I already know the solution I came to is terrible, but I've also tried at least a dozen other iterations with no success either.
Create a column df3['Long'] with 1 for the dates you are long and 0 for the rest. Then, for your long P&L (you can do the same for the short side, but don't forget to take the opposite sign of the daily return), you can do:
df3['P&L Long'] = ((df3['Close_Price'] / df3['Entry_Price']) - 1) * df3['Long']
Then your df3['P-L'] will be:
df3['P-L'] = df3['P&L Long'] + df3['P&L Short']
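Putting the answer together end to end, a minimal sketch, assuming Long_Short holds the strings 'Long' and 'Short' and the column names are those from the question:

import pandas as pd

df3 = pd.read_excel('Base_Sheet.xlsx', sheet_name='Closed_Pos', header=0)

ret = df3['Close_Price'] / df3['Entry_Price'] - 1         # simple return of each trade
df3['Long'] = (df3['Long_Short'] == 'Long').astype(int)   # 1 for long positions, 0 for short

df3['P&L Long'] = ret * df3['Long']                       # long legs keep the sign of the return
df3['P&L Short'] = -ret * (1 - df3['Long'])               # short legs flip it
df3['P-L'] = df3['P&L Long'] + df3['P&L Short']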

How do you take one dataframe that covers multiple years, and break it into a separate DF for each year

I already looked on SE and couldn't find an answer to my question. I am still new to this.
I am trying to take a purchasing CSV file and break it into separate dataframes for each year.
For example, if I have a listing with full dates in MM/DD/YYYY format, I am trying to separate them into dataframes for each year, like Ord2015, Ord2014, etc.
I tried to convert the full date into just the year, and also attempted to use slicing to only look at the last four characters of the date, to no avail.
Here is my current (incomplete) attempt:
import pandas as pd
import csv
import numpy as np
import datetime as dt
import re

purch1 = pd.read_csv('purchases.csv')

# Remove unneeded fluff
del_colmn = ['pid', 'notes', 'warehouse_id', 'env_notes', 'budget_notes']
purch1 = purch1.drop(del_colmn, 1)

# break down by year only
purch1.sort_values(by=['order_date'])
Ord2015 = ()
Ord2014 = ()
for purch in purch1:
    Order2015.add(purch1['order_date'] == 2015)
Per request by @anon01, here are the results of the code you had me run. I only used a sample of four rows, as that was all I was initially playing with... The record has almost 20k lines, so I only pulled aside a few to play with.
'{"pid":{"0":75,"2":95,"3":117,"1":82},"env_id":{"0":12454,"2":12532,"3":12623,"1":12511},"ord_date":{"0":"10\/2\/2014","2":"11\/22\/2014","3":"2\/17\/2015","1":"11\/8\/2014"},"cost_center":{"0":"Ops","2":"Cons","3":"Net","1":"Net"},"dept":{"0":"Ops","2":"Cons","3":"Ops","1":"Ops"},"signing_mgr":{"0":"M. Dodd","2":"L. Price","3":"M. Dodd","1":"M. Dodd"},"check_num":{"0":null,"2":null,"3":null,"1":82301.0},"rec_date":{"0":"10\/11\/2014","2":"12\/2\/2014","3":"3\/1\/2015","1":"11\/20\/2014"},"model":{"0":null,"2":null,"3":null,"1":null},"notes":{"0":"Shipped to east WH","2":"Rec'd by L.Price","3":"Shipped to Client (1190)","1":"Rec'd by K. Wilson"},"env_notes":{"0":"appr by K.Polt","2":"appr by S. Crane","3":"appr by K.Polt","1":"appr by K.Polt"},"budget_notes":{"0":null,"2":"OOB expense","3":"Bill to client","1":null},"cost_year":{"0":2014.0,"2":2015.0,"3":null,"1":2014.0}}'
You can add parse_dates to read_csv to convert the column to datetimes, and then create a dictionary of DataFrames, dfs; for selecting, use the keys:
purch1 = pd.read_csv('purchases.csv', parse_dates=['ord_date'])
dfs = dict(tuple(purch1.groupby(purch1['ord_date'].dt.year)))

Ord2015 = dfs[2015]
Ord2016 = dfs[2016]
It is not recommended, but it is possible to create DataFrames from the year groups:
for i, g in purch1.groupby(purch1['ord_date'].dt.year):
    globals()['Ord' + str(i)] = g
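A safer access pattern than globals(), sticking with the dfs dictionary from above: iterate over it, or use .get() so a year with no rows doesn't raise a KeyError:

for year, frame in dfs.items():
    print(year, len(frame))

Ord2014 = dfs.get(2014)  # None instead of a KeyError if there are no 2014 rows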

Pandas for Excel and selenium loop

I am trying to print out values from Excel; the values are numbers. My goal is to read these values and search each one in Google, one by one. The loop should stop for x seconds when the value is nan, then skip that nan and keep moving on to the next value.
Problems faced:
It is printing out in scientific notation format.
I want to stop doing something when there is a nan in the Excel column.
I copy UPC[i] into the Google search, but I want to copy it only once, because I want the design to open a new tab and then copy the second UPC[i].
My solution:
I have lambda x: '%0.2f' % x inside set_option to make it print out xxxxxx.00 with 2 decimals. I'd rather have an int, but it's already better than scientific notation format.
I used an if to check whether the value in upc[i] equals 'nan' (nan is what I got from the print), but it still prints out the range of 20 values with nan.
I can't think of anything for this one yet.
Code:
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
import msvcrt
import datetime
import time

driver = webdriver.Chrome()

# Settings
pd.set_option('display.width', 10, 'display.max_rows', 10, 'display.max_colwidth', 100,
              'display.width', 10, 'display.float_format', lambda x: '%0.2f' % x)

df = pd.read_excel(r"BARCODE.xlsx", skiprows=2, sheet_name='products id')

# 'Unnamed: 1' is an empty column header; I just didn't input UPC as the title in Excel.
upc = df['Unnamed: 1']

# I can't print as integer... it will always have a trailing .0
print(upc[0:20])

count = len(upc)
i = 0
for i in range(count):
    if upc[i] == 'nan':
        'skip for x seconds and continue, i am not sure how to do yet'
    else:
        print(int(upc[i]))
        driver.get('https://www.google.com')
        driver.find_element_by_name('q').send_keys(int(upc[i]))
    i = i + 1
Print out:
3337872411991.0
3433422408159.0
3337875598071.0
3337872412516.0
3337875518451.0
3337875613491.0
3337872413025.0
3337875398961.0
3337872410208.0
nan <- I want the program to stop here so I can do something else.
3337872411991.0
3433422408159.0
3337875598071.0
3337872412516.0
3337875518451.0
3337875613491.0
3337872413025.0
3337875398961.0
3337872410208.0
nan
Name: Unnamed: 1, Length: 20, dtype: float64
3337872411991
3433422408159
3337875598071
3337872412516
3337875518451
etc....
I googled number formatting, such as setting the printing format, but I got confused between .format and lambda.
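(On the .format-versus-lambda point: display.float_format just needs a callable, so the two styles below are interchangeable; a quick sketch:)

import pandas as pd

pd.set_option('display.float_format', '{:.0f}'.format)        # str.format style, no decimals
pd.set_option('display.float_format', lambda x: '%0.0f' % x)  # equivalent lambda style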
It is printing out in scientific notation format
It seems you have numbers like UPCs and EANs.
You can probably solve that by treating the numbers as text instead. If you always need length 13, you can fix that by padding with zeroes at the start.
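A minimal sketch of that idea; the file name, skiprows, and the Unnamed: 1 column are taken from the question:

import pandas as pd

df = pd.read_excel('BARCODE.xlsx', skiprows=2, sheet_name='products id')
upc = df['Unnamed: 1'].dropna()

# go float -> int64 -> str to drop the trailing .0, then left-pad to 13 digits
upc_text = upc.astype('int64').astype(str).str.zfill(13)
print(upc_text.head())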
Want to stop doing something when it's nan in Excel
The simplest solution could be to use input() and accept any character to continue executing your code. But if you just want a pause of a few seconds, time.sleep() is good as well.
Copy UPC[i] into the Google search, but copy it only once, opening a new tab before copying the second UPC[i]
Some points you may want to reconsider:
Iterating in Python can be done with enumerate() if you need index values. If you do not need the index, you can simply drop it: for value in data_frame['UPC']:
With Selenium you can directly scrape results instead of using new tabs.
Below is a working example (at least on my machine with Python 3, Windows 10, and the Chrome exe driver).
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys

# Settings
pd.set_option('display.width', 10, 'display.max_rows', 10, 'display.max_colwidth', 100,
              'display.width', 10, 'display.float_format', lambda x: '%0.2f' % x)

data_frame = pd.read_excel('test.xlsx', sheet_name='products id', skip_blank_lines=False)

# I have the chrome driver as an exe, so this is how I need to inject it to get the driver out
driver = webdriver.Chrome('chromedriver.exe')
google = 'https://www.google.com'

for index, value in enumerate(data_frame['UPC']):  # named the column in the excel file
    if pd.isna(value):
        print('{}: zzz'.format(index))
        sleep(2)  # sleeps for 2 seconds; use input() if you want to wait indefinitely instead
    else:
        print('{}: {} {}'.format(index, value, type(value)))
        # since the given values are float, you can convert them to int
        value = int(value)
        driver.get(google)
        google_search = driver.find_element_by_name('q')
        google_search.send_keys(value)
        google_search.send_keys('\uE007')  # this is "ENTER" for committing your search in google, or Keys.ENTER
        sleep(0.5)
        # you may want to wait a bit until the page loads fully, then scrape the info you want
        # also consider using try-except blocks in case something unexpected happens

        # if you want to open a new tab (windows + chrome driver):
        # open a link in a new window - workaround
        helping_link = driver.find_element_by_link_text('Help')
        actions = ActionChains(driver)
        actions.key_down(Keys.CONTROL).click(helping_link).key_up(Keys.CONTROL).perform()
        driver.switch_to.window(driver.window_handles[-1])

# close your instance of chrome driver, or leave it if you need your tabs
# driver.close()
Check this post:
if pd.isna(upc[i]):
    time.sleep(3)
Check out this post, which boils down to:
driver.execute_script("window.open('https://www.google.com');")
driver.switch_to.window(driver.window_handles[-1])

Adding numerical values from dict to a new column in a Pandas DataFrame

I am practicing machine learning and working with a movie/rating dataset. I am trying to create a new column in the dataframe which numerically identifies each genre (around 1,300 of them). My logic was to create a dictionary of the unique genres, labeling each with an integer, and then write a for loop to iterate through each row of the dataframe, checking the genre of each and assigning the appropriate value to a new column named "genre_id". However, this has been causing an infinite loop which I cannot even break with Ctrl-C. Same issue when working in Jupyter (Interrupt Kernel fails to stop it). Below is a summarized version of my approach.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

movies_data = pd.read_csv("C://mypython/moviedata/movies.csv")
ratings_data = pd.read_csv("C://mypython/moviedata/ratings.csv")
joined = pd.merge(movies_data, ratings_data, how='inner', on=['movieId'])
print(joined.head())

pd.options.display.float_format = '{:,.2f}'.format

genres = joined['genres'].unique()
genre_dict = {}
Id = 1
for i in genres:
    genre_dict[i] = Id
    Id += 1

joined['genre_id'] = 0
increment = 0
for i in joined['genres']:
    if i in genre_dict:
        joined['genre_id'][increment] = genre_dict[i]
    increment += 1
I know I should probably be taking a smaller sample to work with, as there are about 20,000,000 rows in the dataset, but I figured I'd try this as an exercise.
I also receive the "setting values from copy" warning, though this hasn't caused me issues in the past on my other projects. Any thoughts on how to do this would be greatly appreciated.
EDIT: Found a solution using the Series map feature.
joined['genre_id'] = joined.genres.map(genre_dict)
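For reference, pandas can also build the integer ids directly with factorize, which is equivalent to the dictionary-plus-map approach above:

joined['genre_id'] = pd.factorize(joined['genres'])[0] + 1  # codes start at 0, so add 1 to match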
I don't have enough reputation to comment, so I'm posting this as a suggestion: the standard procedure for handling categorical values in a dataset is the built-in sklearn.preprocessing.OneHotEncoder, which does the work you wanted to do.
For a better understanding with examples, check One Hot Encode Sequence Data in Python. Let me know if this works for you.
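A minimal sketch of that approach, using the joined frame from the question (OneHotEncoder returns a sparse matrix by default, with one column per unique genre string):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
genre_matrix = enc.fit_transform(joined[['genres']])  # shape: (n_rows, n_unique_genres)
print(genre_matrix.shape)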
