Python: Identify invalid online link for a zip file - python-3.x

I am trying to automate stock price data extraction from https://www.nseindia.com/. Data is stored as a zip file, and the URL for the zip file varies by date. If the stock market is closed on a given date, e.g. weekends and holidays, there is no file/URL.
I want to identify invalid links (links that don't exist) and skip to the next link.
This is a valid link -
path = 'https://archives.nseindia.com/content/historical/EQUITIES/2021/MAY/cm05MAY2021bhav.csv.zip'
This is an invalid link - (as 1st May is a weekend and stock market is closed for the day)
path2 = 'https://archives.nseindia.com/content/historical/EQUITIES/2021/MAY/cm01MAY2021bhav.csv.zip'
This is what I do to extract the data
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
import pandas as pd
import datetime
start_date = datetime.date(2021, 5, 3)
end_date = datetime.date(2021, 5, 7)
delta = datetime.timedelta(days=1)
final = pd.DataFrame()
while start_date <= end_date:
    print(start_date)
    day = start_date.strftime('%d')
    month = start_date.strftime('%b').upper()
    year = start_date.strftime('%Y')
    start_date += delta
    path = 'https://archives.nseindia.com/content/historical/EQUITIES/' + year + '/' + month + '/cm' + day + month + year + 'bhav.csv.zip'
    file = 'cm' + day + month + year + 'bhav.csv'
    try:
        with urlopen(path) as f:
            with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
                foofile = myzipfile.open(file)
                df = pd.read_csv(foofile)
                final = final.append(df)  # append returns a new frame; reassign it
    except:
        print(file + ' not there')
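As a side note, the string concatenation above can be collapsed into a couple of strftime calls. A sketch (`bhav_url` is just an illustrative helper name):

```python
import datetime

def bhav_url(d: datetime.date) -> str:
    # e.g. 2021-05-05 -> .../2021/MAY/cm05MAY2021bhav.csv.zip
    stamp = d.strftime('%d%b%Y').upper()  # '05MAY2021'
    return ('https://archives.nseindia.com/content/historical/EQUITIES/'
            + d.strftime('%Y') + '/' + d.strftime('%b').upper()
            + '/cm' + stamp + 'bhav.csv.zip')

print(bhav_url(datetime.date(2021, 5, 5)))
```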
If the path is invalid, Python hangs and I have to restart it. I am not able to handle the error or identify the invalid link while looping over multiple dates.
What I have tried so far to differentiate between valid and invalid links -
# Attempt 1
import os
os.path.exists(path)
os.path.isfile(path)
os.path.isdir(path)
os.path.islink(path)
# output is False for both Path and Path2
# Attempt 2
import validators
validators.url(path)
# output is True for both Path and Path2
# Attempt 3
import requests
site_ping = requests.get(path)
site_ping.status_code < 400
# Output for path is True, but Python hangs when I run requests.get(path2) and I have to restart every time.
Thanks for your help in advance.

As suggested by SuperStormer, adding a timeout to the request solved the issue:
try:
    with urlopen(zipFileURL, timeout=5) as f:
        with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
            foofile = myzipfile.open(file)
            df = pd.read_csv(foofile)
            final = final.append(df)
except:
    print(file + ' not there')
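Beyond the timeout, known-closed days can be skipped up front: weekends never have a bhavcopy file, so filtering them out avoids most of the failing requests. A sketch (holidays would still fall through to the except branch):

```python
import datetime

def trading_days(start, end):
    # Yield only Monday-Friday dates between start and end (inclusive)
    d = start
    while d <= end:
        if d.weekday() < 5:  # 0 = Monday ... 4 = Friday
            yield d
        d += datetime.timedelta(days=1)

days = list(trading_days(datetime.date(2021, 5, 1), datetime.date(2021, 5, 7)))
```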

Related

pandas datareader. Save all data to one dataframe

I am new to Python and I am having trouble getting data into one dataframe. I have the following code:
I have the following code.
from pandas_datareader import data as pdr
from datetime import date
from datetime import timedelta
import yfinance as yf
yf.pdr_override()
import pandas as pd

# tickers list
ticker_list = ['0P0001A532.CO', '0P00018Q4V.CO', '0P00017UBI.CO', '0P00000YYT.CO', 'PFIBAA.CO', 'PFIBAB.CO', 'PFIBAC.CO', 'PFIDKA.CO', 'PFIGLA.CO', 'PFIMLO.CO', 'PFIKRB.CO', '0P00019SMI.F', 'WEKAFKI.CO', '0P0001CICW.CO', 'WEISTA.CO', 'WEISTS.CO', 'WEISA.CO', 'WEITISOP.CO']
today = date.today()
# We can get data by our choice by days bracket
if date.today().weekday() == 0:
    # Friday. If it is Monday we do not have a price since it is based on the previous day's close.
    start_date = (today + timedelta((4 + today.weekday()) % 7)) - timedelta(days=7)
else:
    start_date = today - timedelta(days=1)
files = []
allData = []
dafr_All = []

def getData(ticker):
    print(ticker)
    data = pdr.get_data_yahoo(ticker, start=start_date, end=(today + timedelta(days=2)))['Adj Close']
    dataname = ticker + '_' + str(today)
    files.append(dataname)
    allData.append(data)
    SaveData(data, dataname)

# Create a data folder in your current dir.
def SaveData(df, filename):
    df.to_csv('./data/' + filename + '.csv')

# This loop will iterate over the ticker list, pass one ticker to get data, and save that data as a file.
for tik in ticker_list:
    getData(tik)
for i in range(0, 11):
    df1 = pd.read_csv('./data/' + str(files[i]) + '.csv')
    print(df1.head())
I get several CSV files containing the adjusted close values (where an adjusted close exists).
I want to save all the data to a dataframe where the first column consists of tickers and the second column consists of adjusted close values. The dataframe then needs to be exported to a CSV file.
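One way to get that two-column layout is to build a small frame per ticker and stack them with `pd.concat`. A sketch, with made-up series standing in for the `pdr.get_data_yahoo` output:

```python
import pandas as pd

# Stand-ins for the per-ticker 'Adj Close' series collected in allData
series = {
    'AAA.CO': pd.Series([10.0, 10.5]),
    'BBB.CO': pd.Series([20.0]),
}

# One frame per ticker, then stack into a single long frame
frames = [pd.DataFrame({'Ticker': t, 'Adj Close': s.values}) for t, s in series.items()]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv('all_tickers.csv', index=False)
```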

Python how to add repeating values to list

What I am trying to figure out is how to append "Cases" and "Deaths" to each day, so that it starts with "1/19/2020 Cases" and "1/19/2020 Deaths", then "1/20/2020 Cases", etc. It seems the append function does not work for this, and I don't know how else to add it. My eventual goal is to make this a pandas dataframe.
import pandas as pd
dates = pd.date_range(start = '1/19/2020', end = '12/31/2021')
lst = dates.repeat(repeats = 2)
print(lst)
Thanks
If I am not mistaken, I don't think there's a way to do it purely with pandas. However, with Python and datetime, you can do so:
import pandas as pd
from datetime import timedelta, date

def daterange(start_date, end_date):
    # Credit: https://stackoverflow.com/a/1060330/10640517
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

dates = []
start_date = date(2020, 1, 19)  # Start date here
end_date = date(2021, 12, 31)  # End date here
for single_date in daterange(start_date, end_date):
    dates.append(single_date.strftime("%m/%d/%Y") + " Cases")
    dates.append(single_date.strftime("%m/%d/%Y") + " Deaths")
pdates = pd.DataFrame(dates)
print(pdates)
Is this what you want? If not, I can delete it.
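For what it's worth, the same labels can also be built from a pandas `date_range` with a plain list comprehension. A sketch over a short range:

```python
import pandas as pd

dates = pd.date_range(start='1/19/2020', end='1/21/2020')
# Two labels per date, in order: Cases first, then Deaths
labels = [d.strftime('%m/%d/%Y') + suffix
          for d in dates
          for suffix in (' Cases', ' Deaths')]
pdates = pd.DataFrame(labels)
```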

How can I speed these API queries up?

I am feeding a long list of inputs to a function that calls an API to retrieve data. My list has around 40,000 unique inputs. Currently, the function returns output every 1-2 seconds or so. Quick maths tells me that it would take over 10+ hours before my function is done. I therefore want to speed this process up, but I have struggled to find a solution. I am quite a beginner, so threading/pooling is quite difficult for me. I hope someone is able to help me out here.
The function:
import quandl
import datetime
import numpy as np

quandl.ApiConfig.api_key = 'API key here'

def get_data(issue_date, stock_ticker):
    # Prepare var
    stock_ticker = "EOD/" + stock_ticker
    # Volatility
    date_1 = datetime.datetime.strptime(issue_date, "%d/%m/%Y")
    pricing_date = date_1 + datetime.timedelta(days=-40)  # -40 days of issue date
    volatility_date = date_1 + datetime.timedelta(days=-240)  # -240 days of issue date (-40, -240 range)
    # Check if code exists: if not -> return empty array
    try:
        stock = quandl.get(stock_ticker, start_date=volatility_date, end_date=pricing_date)  # get pricing data
    except quandl.errors.quandl_error.NotFoundError:
        return []
    daily_close = stock['Adj_Close'].pct_change()  # returns using adj. close
    stock_vola = np.std(daily_close) * np.sqrt(252)  # annualized volatility
    # Average price
    stock_pricing_date = date_1 + datetime.timedelta(days=-2)  # -2 days of issue date
    stock_pricing_date2 = date_1 + datetime.timedelta(days=-12)  # -12 days of issue date
    stock_price = quandl.get(stock_ticker, start_date=stock_pricing_date2, end_date=stock_pricing_date)
    stock_price_average = np.mean(stock_price['Adj_Close'])  # get average price
    # Amihud's liquidity measure
    liquidity_pricing_date = date_1 + datetime.timedelta(days=-20)
    liquidity_pricing_date2 = date_1 + datetime.timedelta(days=-120)
    stock_data = quandl.get(stock_ticker, start_date=liquidity_pricing_date2, end_date=liquidity_pricing_date)
    p = np.array(stock_data['Adj_Close'])
    returns = np.array(stock_data['Adj_Close'].pct_change())
    dollar_volume = np.array(stock_data['Adj_Volume'] * p)
    illiq = np.divide(returns, dollar_volume)
    print(np.nanmean(illiq))
    illiquidity_measure = np.nanmean(illiq, dtype=float) * (10 ** 6)  # multiply by 10^6 for expositional purposes
    return [stock_vola, stock_price_average, illiquidity_measure]
I then use a separate script to select my CSV file, in which each row contains an issue_date and a stock_ticker:
import function
import csv
import tkinter as tk
from tkinter import filedialog

# Open File Dialog
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
# Load Spreadsheet data
f = open(file_path)
csv_f = csv.reader(f)
next(csv_f)
result_data = []
# Iterate
for row in csv_f:
    try:
        return_data = function.get_data(row[1], row[0])
        if len(return_data) != 0:
            # print(return_data)
            result_data_loc = [row[1], row[0]]
            result_data_loc.extend(return_data)
            result_data.append(result_data_loc)
    except AttributeError:
        print(row[0])
        print('\n\n')
        print(row[1])
        continue
if result_data:
    with open('results.csv', mode='w', newline='') as result_file:
        csv_writer = csv.writer(result_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for result in result_data:
            # print(result)
            csv_writer.writerow(result)
else:
    print("No results found!")
It is quite messy, but like I mentioned before, I am definitely a beginner. Speeding this up would greatly help me.
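Since each call spends most of its time waiting on the network, a thread pool is the usual first step: `ThreadPoolExecutor.map` runs several requests concurrently and returns the results in input order. A minimal sketch, with a stand-in function in place of the real `get_data`:

```python
from concurrent.futures import ThreadPoolExecutor

def get_data_row(row):
    # Stand-in for function.get_data(row[1], row[0]); the real version
    # would make the quandl calls and return [vola, avg_price, illiq]
    issue_date, ticker = row
    return [ticker, len(ticker)]

rows = [('01/05/2021', 'AAPL'), ('02/05/2021', 'MSFT')]

# max_workers bounds how many API calls are in flight at once;
# keep it modest to respect the API's rate limits
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(get_data_row, rows))
```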

Series format pandas

import pandas as pd
from datetime import datetime
import os

# get username
user = os.getlogin()

def file_process():
    data = pd.read_excel('C:\\Users\\' + user + '\\My Documents\\XINVST.xls')
    # Change the date and time formatting
    data["INVDAT"] = data["INVDAT"].apply(lambda x: datetime.combine(x, datetime.min.time()))
    data["INVDAT"] = data["INVDAT"].dt.strftime("%m-%d-%Y")
    print(data)
    # output to new file
    # new_data = data
    # new_data.to_excel('C:\\Users\\' + user + '\\Desktop\\XINVST.xls', index=None)

if __name__ == '__main__':
    file_process()
I'm trying to format the INVDAT column into a proper date format like 11/25/19. I've tried multiple solutions but keep running into errors like this one: TypeError: combine() argument 1 must be datetime.date, not int. I then tried to convert the integer to a date type, but that errors as well.
Or you can simply use df["INVDAT"] = pd.to_datetime(df["INVDAT"], format="%m/%d/%y"); in this case you don't need the datetime package. For further information you should consult the docs.
data['INVDAT'] = data['INVDAT'].astype('str')
data["INVDAT"] = pd.to_datetime(data["INVDAT"])
data["INVDAT"] = data["INVDAT"].dt.strftime("%m/%d/%Y")
This solution works, but if the date representation has a single-digit month like 12519 (expected output 1/25/19), it fails. I tried using a conditional to add a 0 to the front if len() < 6, but it gives me an error that the dtype is int64.
import pandas as pd
import os

# get username
user = os.getlogin()

def file_process():
    data = pd.read_excel('C:\\Users\\' + user + '\\My Documents\\XINVST.xls')
    # Change the date and time formatting
    data['INVDAT'] = data['INVDAT'].astype('str')
    length = len(data['INVDAT'])
    data['INVDAT'].pop(length - 1)
    for i in data['INVDAT'].str.len():
        if i <= 5:
            data['INVDAT'] = data['INVDAT'].apply(lambda x: '{0:0>6}'.format(x))
            length = len(data['INVDAT'])
            data['INVDAT'].pop(length - 1)
            data["INVDAT"] = pd.to_datetime(data["INVDAT"])
            data["INVDAT"] = data["INVDAT"].dt.strftime("%m/%d/%Y")
        else:
            data["INVDAT"] = pd.to_datetime(data["INVDAT"])
            data["INVDAT"] = data["INVDAT"].dt.strftime("%m/%d/%Y")
    # output to new file
    new_data = data
    new_data.to_excel('C:\\Users\\' + user + '\\Desktop\\XINVST.xls', index=None)

if __name__ == '__main__':
    file_process()
This is the solution; it's sloppy, but it works.
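The per-row loop can be avoided entirely: `str.zfill(6)` pads every value at once, after which a single `to_datetime` with an explicit format handles both 5- and 6-digit dates. A sketch with made-up INVDAT values:

```python
import pandas as pd

# Hypothetical integer dates: 112519 -> 11/25/19, 12519 -> 1/25/19
data = pd.DataFrame({'INVDAT': [112519, 12519]})

# Zero-pad to six digits so single-digit months parse, then convert once
padded = data['INVDAT'].astype(str).str.zfill(6)
data['INVDAT'] = pd.to_datetime(padded, format='%m%d%y').dt.strftime('%m/%d/%Y')
```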

How to convert a list of tuples to a csv file

Very new to programming, and I was trying to create an amortization table. I found some great questions and answers on here, but now I am stuck trying to convert the results into a CSV file.
from datetime import date
from collections import OrderedDict
from dateutil.relativedelta import *
import csv

def amortization_schedule(rate, principal, period):
    start_date = date.today()
    # defining the monthly payment for a loan
    payment = -float(principal / ((((1 + (rate / 12)) ** period) - 1) / ((rate / 12) * (1 + (rate / 12)) ** period)))
    beg_balance = principal
    end_balance = principal
    period = 1
    while end_balance > 0 and period <= 60 * 12:
        # Recalculate the interest based on the current balance
        interest_paid = round((rate / 12) * beg_balance, 2)
        # Determine payment based on whether or not this period will pay off the loan
        payment = round(min(payment, beg_balance + interest_paid), 2)
        principal = round(-payment - interest_paid, 2)
        yield OrderedDict([('Month', start_date),
                           ('Period', period),
                           ('Begin Balance', beg_balance),
                           ('Payment', payment),
                           ('Principal', principal),
                           ('Interest', interest_paid),
                           ('End Balance', end_balance)])
        # increment the counter, date and balance
        period += 1
        start_date += relativedelta(months=1)
        beg_balance = end_balance
I attempted to use this link as part of my solution but ended up with a csv that looked like the following:
M,o,n,t,h
P,e,r,i,o,d
B,e,g,i,n, ,B,a,l,a,n,c,e
P,a,y,m,e,n,t
P,r,i,n,c,i,p,a,l
I,n,t,e,r,e,s,t
E,n,d, ,B,a,l,a,n,c,e
Here is my code for the conversion to csv.
for start_date, period, beg_balance, payment, principal, \
        interest_paid, end_balance in amortization_schedule(user_rate,
                                                            user_principal, user_period):
    start_dates.append(start_date)
    periods.append(period)
    beg_balances.append(beg_balance)
    payments.append(payment)
    principals.append(principal)
    interest_paids.append(interest_paid)
    end_balances.append(end_balance)

with open('amortization.csv', 'w') as outfile:
    csvwriter = csv.writer(outfile)
    csvwriter.writerow(start_dates)
    csvwriter.writerow(periods)
    csvwriter.writerow(beg_balances)
    csvwriter.writerow(payments)
    csvwriter.writerow(principals)
    csvwriter.writerow(interest_paids)
    csvwriter.writerow(end_balances)
Any help would be appreciated!
with open('amortization.csv', 'w', newline='') as outfile:
    fieldnames = ['Month', 'Period', 'Begin Balance', 'Payment',
                  'Principal', 'Interest', 'End Balance']
    csvwriter = csv.DictWriter(outfile, fieldnames)
    csvwriter.writeheader()
    for line in amortization_schedule(user_rate, user_principal, user_period):
        csvwriter.writerow(line)
Code for writing the CSV file. collections.OrderedDict is a dictionary, so you should use csv.DictWriter to write it. Since each yielded row is already a dictionary, you should not need all of the lines you have for the conversion to CSV.
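To make the DictWriter approach concrete, here is a self-contained sketch writing a couple of stand-in rows (an in-memory buffer replaces the file so the shape of the output is easy to see):

```python
import csv
import io

fieldnames = ['Month', 'Period', 'Payment']
rows = [  # stand-ins for what amortization_schedule() yields
    {'Month': '2021-05-01', 'Period': 1, 'Payment': 188.71},
    {'Month': '2021-06-01', 'Period': 2, 'Payment': 188.71},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()  # one header line, then one line per dict
writer.writerows(rows)
output = buf.getvalue()
```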
