EmptyDataError: No columns to parse from file (generating files with a "for" loop in Python) - python-3.x

The following code obtains specific data from an internet financial portal (Morningstar). I obtain data for different companies, in this case Dutch companies, each represented by a ticker.
import pandas as pd
import numpy as np

def financials_download(ticker, report, frequency):
    if frequency == "A" or frequency == "a":
        frequency = "12"
    elif frequency == "Q" or frequency == "q":
        frequency = "3"
    url = 'http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=' + ticker + '&region=usa&culture=en-US&cur=USD&reportType=' + report + '&period=' + frequency + '&dataType=R&order=desc&columnYear=5&rounding=3&view=raw&r=640081&denominatorView=raw&number=3'
    df = pd.read_csv(url, skiprows=1, index_col=0)
    return df

def ratios_download(ticker):
    url = 'http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=' + ticker + '&region=usa&culture=en-US&cur=USD&order=desc'
    df = pd.read_csv(url, skiprows=2, index_col=0)
    return df
holland = ("AALBF", "ABN", "AEGOF", "AHODF", "AKZO", "ALLVF", "AMSYF", "ASML",
           "KKWFF", "KDSKF", "GLPG", "GTOFF", "HINKF", "INGVF", "KPN", "NN",
           "LIGHT", "RANJF", "RDLSF", "RDS.A", "SBFFF", "UNBLF", "UNLVF",
           "VOPKF", "WOLTF")

def finance(country):
    for ticker in country:
        frequency = "a"
        df1 = financials_download(ticker, 'bs', frequency)
        df2 = financials_download(ticker, 'is', frequency)
        df3 = ratios_download(ticker)
        d1 = df1.loc['Total assets']
        if "EBITDA" in df2.index:
            d2 = df2.loc["EBITDA"]
        else:
            d2 = None
        if "Revenue USD Mil" in df3.index:
            d3 = df3.loc["Revenue USD Mil"]
        else:
            d3 = df3.loc["Revenue EUR Mil"]
        d4 = df3.loc["Operating Margin %"]
        d5 = df3.loc["Return on Assets %"]
        d6 = df3.loc["Return on Equity %"]
        d7 = df3.loc["EBT Margin"]
        d8 = df3.loc["Net Margin %"]
        d9 = df3.loc["Free Cash Flow/Sales %"]
        if d2 is not None:
            d1 = d1.to_frame().T
            d2 = d2.to_frame().T
            d3 = d3.to_frame().T
            d4 = d4.to_frame().T
            d5 = d5.to_frame().T
            d6 = d6.to_frame().T
            d7 = d7.to_frame().T
            d8 = d8.to_frame().T
            d9 = d9.to_frame().T
            df_new = pd.concat([d1, d2, d3, d4, d5, d6, d7, d8, d9])
        else:
            d1 = d1.to_frame().T
            d3 = d3.to_frame().T
            d4 = d4.to_frame().T
            d5 = d5.to_frame().T
            d6 = d6.to_frame().T
            d7 = d7.to_frame().T
            d8 = d8.to_frame().T
            d9 = d9.to_frame().T
            df_new = pd.concat([d1, d3, d4, d5, d6, d7, d8, d9])
        df_new.to_csv(ticker + '.csv')
The problem is that when I use a for loop to go through all the tickers in the variable holland and generate a csv document for each of them, it returns the following error:
File "pandas/_libs/parsers.pyx", line 565, in
pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:6260)
EmptyDataError: No columns to parse from file
On the other hand, it runs without error if I just request one company ticker after another.
I'd really appreciate it if you could help me.

When you run your script several times, it fails on different tickers and different calls. That indicates the problem is not tied to a specific ticker; rather, the server sometimes returns a response the csv reader cannot load into a data frame. You can address this with Python's error handling routines, e.g. in your financials_download function:
df = ""
i = 0
# any data in df yet?
while len(df) == 0:
    # try to download the data and load it into df
    try:
        df = pd.read_csv(url, skiprows=1, index_col=0)
    # not successful? count the failed attempt
    except pd.errors.EmptyDataError:
        i += 1
        print("Trial", i, "failed")
        # five attempts failed? unlikely that this server will respond
        if i == 5:
            print("ticker", ticker, ": server is down")
            break
# print("downloaded", ticker)
# print("financial download data frame:")
# print(df)
This tries five times to retrieve the data for the ticker; if all attempts fail, it prints a message saying so. But now you have to handle this situation in your main program and adjust it, because some of the data frames will be empty.
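A minimal sketch of one way to handle that in the main program (safe_download is a hypothetical helper name, not part of the question's code; in the real script it would wrap financials_download and ratios_download, and the loop would skip any ticker whose data never arrived):

```python
def safe_download(download, *args, attempts=5):
    """Call a download function; return None if every attempt fails."""
    for i in range(attempts):
        try:
            return download(*args)
        except Exception as exc:  # e.g. pandas.errors.EmptyDataError
            print("attempt", i + 1, "failed:", exc)
    return None

# Inside finance(), skip tickers whose data stayed empty:
# df1 = safe_download(financials_download, ticker, 'bs', frequency)
# if df1 is None:
#     continue
```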
For this kind of basic debugging, I would point you to a blog post.

Related

Comparing user input to csv value

So I have this csv data
Medium narrow body, £8, 2650, 180, 8
Large narrow body, £7, 5600, 220, 10
Medium wide body, £5, 4050, 406, 14
The data I need to use are the numbers all the way on the right, which have been given the field name 'first_class', and the ones second from the right, field name 'Capacity',
and I have made this code
import csv

def menu():
    print("""
    1. Enter airport details
    2. Enter flight details
    3. Enter price plan and calculate profit
    4. Clear data
    5. Quit
    """)

if b == '2':
    a1 = input('Enter the type of aircraft: ')
    airplane_info = open('airplane.csv', 'r')
    csvreader = csv.DictReader(airplane_info, delimiter=',',
                               fieldnames=('Body_type', 'Running_cost', 'Max_flight', 'Capacity', 'first_class'))
    for row in csvreader:
        if row['Body_type'] == a1:
            print(row)
        if row['Body_type'] != a1:
            print('Wrong aircraft type')
            flag = False
        else:
            d1 = input('Enter number of first class seats on the aircraft')
            if d1 != 0:
(That flag sends the user back to the options menu; ignore it.)
Now I need to take the aircraft type the user entered and use its 'first_class' field together with the number of first class seats the user enters. Say the user enters the aircraft type 'Medium wide body', which has 14 first class seats. When the user is then asked for the number of first class seats and enters a number lower than 14, an error message should pop up. How would I do that? Should I load the csv data into an array and then use it for the comparison?
Here is a quick example using the pandas library. You will need to install it:
pip install --user pandas
Using pandas you can parse the csv into a dataframe object and then work on it as you desire:
import pandas as pd

df = pd.read_csv('airplane.csv', names=["Body_type", "Running_cost", "Max_flight", "Capacity", "first_class"])
a1 = input('Enter the type of aircraft: ')
if a1 in df.Body_type.values:
    body = df[df.Body_type == a1]
    d1 = int(input('Enter number of first class seats on the aircraft: '))
    if d1 < body["first_class"].values[0]:
        print("Error")
        # ...
else:
    print('Wrong aircraft type')

How to scrape a table and its links

What I want to do is to take the following website
https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html
And pick the year of execution, follow the Last Statement link, and retrieve the statement... perhaps I would create two dictionaries, both keyed by the execution number.
Afterwards, I would classify the statements by length, besides "flagging" the refusals to give one or the cases where none was given.
Finally, everything would be compiled in a SQLite database, and I would display a graph that shows how many statements, clustered by type, have been given each year.
Beautiful Soup seems to be the path to follow. I'm already having trouble with just printing the year of execution... Of course, I'm not ultimately interested in printing the years of execution, but it seems like a good way of checking whether my code is properly locating the tags I want.
tags = soup('td')
for tag in tags:
    print(tag.get('href', None))
Why does the previous code only print None?
Thanks beforehand.
Use pandas to get and manipulate the table. The links are static; by that I mean they can easily be recreated from the offender's first and last name.
Then you can use requests and BeautifulSoup to scrape each offender's last statement, which are quite moving.
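As an aside, on why the loop in the question prints only None: href is an attribute of the <a> elements nested inside each table cell, not of the <td> elements themselves. A small self-contained sketch (the HTML snippet is made up, shaped like the page's table):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML in the shape of the executed-offenders table.
html_doc = """
<table>
  <tr>
    <td>566</td>
    <td><a href="/death_row/dr_info/smithjohn.html">Offender Information</a></td>
    <td><a href="/death_row/dr_info/smithjohnlast.html">Last Statement</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# soup('td') yields the cells, which have no href attribute, hence None;
# select the <a> tags inside the cells instead.
links = [a.get("href") for a in soup.select("td a")]
print(links)
```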
Here's how:
import requests
import pandas as pd

def clean(first_and_last_name: list) -> str:
    # strip suffixes before removing spaces, otherwise ", Jr." can never match
    name = "".join(first_and_last_name).replace(", Jr.", "").replace(", Sr.", "")
    return name.replace(" ", "").replace("'", "").lower()

base_url = "https://www.tdcj.texas.gov/death_row"
response = requests.get(f"{base_url}/dr_executed_offenders.html")

df = pd.read_html(response.text, flavor="bs4")
df = pd.concat(df)
df.rename(columns={"Link": "Offender Information", "Link.1": "Last Statement URL"}, inplace=True)

df["Offender Information"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)
df["Last Statement URL"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)

df.to_csv("offenders.csv", index=False)
This gets you a CSV of the table with the two link columns rebuilt as full URLs.
EDIT:
I actually went ahead and added the code that fetches all offenders' last statements.
import random
import time

import pandas as pd
import requests
from lxml import html

base_url = "https://www.tdcj.texas.gov/death_row"
response = requests.get(f"{base_url}/dr_executed_offenders.html")
statement_xpath = '//*[@id="content_right"]/p[6]/text()'

def clean(first_and_last_name: list) -> str:
    # strip suffixes before removing spaces, otherwise ", Jr." can never match
    name = "".join(first_and_last_name).replace(", Jr.", "").replace(", Sr.", "")
    return name.replace(" ", "").replace("'", "").lower()

def get_last_statement(statement_url: str) -> str:
    page = requests.get(statement_url).text
    statement = html.fromstring(page).xpath(statement_xpath)
    text = next(iter(statement), "")
    return " ".join(text.split())

df = pd.read_html(response.text, flavor="bs4")
df = pd.concat(df)
df.rename(
    columns={"Link": "Offender Information", "Link.1": "Last Statement URL"},
    inplace=True,
)
df["Offender Information"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)
df["Last Statement URL"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)

offender_data = list(
    zip(
        df["First Name"],
        df["Last Name"],
        df["Last Statement URL"],
    )
)

statements = []
for item in offender_data:
    *names, url = item
    print(f"Fetching statement for {' '.join(names)}...")
    statements.append(get_last_statement(statement_url=url))
    time.sleep(random.randint(1, 4))

df["Last Statement"] = statements
df.to_csv("offenders_data.csv", index=False)
This will take a while because the code "sleeps" for anywhere between 1 and 4 seconds per request, so the server doesn't get abused.
Once this gets done, you'll end up with a .csv file with all offenders' data and their statements, if there was one.
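The question also mentions compiling everything into a SQLite database; with pandas that is essentially one call. A sketch under assumptions (the table name statements and the sample rows are made up; in practice you would read the offenders_data.csv produced above and connect to a file instead of ":memory:"):

```python
import sqlite3

import pandas as pd

# Hypothetical sample shaped like the scraped data; in practice:
# df = pd.read_csv("offenders_data.csv")
df = pd.DataFrame({
    "Execution": [553, 554],
    "Last Name": ["Doe", "Roe"],
    "First Name": ["John", "Jane"],
    "Last Statement": ["", "I am at peace."],
})

conn = sqlite3.connect(":memory:")  # use a file path for a persistent DB
df.to_sql("statements", conn, if_exists="replace", index=False)

# e.g. count how many offenders actually gave a statement
rows = conn.execute(
    'SELECT COUNT(*) FROM statements WHERE LENGTH("Last Statement") > 0'
).fetchone()
print(rows)
conn.close()
```

From there, grouping by year and statement type for the graph is a plain SQL or pandas aggregation.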

How can I speed these API queries up?

I am feeding a long list of inputs into a function that calls an API to retrieve data. My list has around 40,000 unique inputs. Currently, the function returns output every 1-2 seconds or so. Quick maths tells me it would take over 10+ hours before my function is done. I therefore want to speed this process up, but am struggling to find a solution. I am quite a beginner, so threading/pooling is quite difficult for me. I hope someone is able to help me out here.
The function:
import quandl
import datetime
import numpy as np

quandl.ApiConfig.api_key = 'API key here'

def get_data(issue_date, stock_ticker):
    # Prepare var
    stock_ticker = "EOD/" + stock_ticker
    # Volatility
    date_1 = datetime.datetime.strptime(issue_date, "%d/%m/%Y")
    pricing_date = date_1 + datetime.timedelta(days=-40)       # -40 days of issue date
    volatility_date = date_1 + datetime.timedelta(days=-240)   # -240 days of issue date (-40,-240 range)
    # Check if code exists: if not -> return empty array
    try:
        stock = quandl.get(stock_ticker, start_date=volatility_date, end_date=pricing_date)  # get pricing data
    except quandl.errors.quandl_error.NotFoundError:
        return []
    daily_close = stock['Adj_Close'].pct_change()    # returns using adj. close
    stock_vola = np.std(daily_close) * np.sqrt(252)  # annualized volatility
    # Average price
    stock_pricing_date = date_1 + datetime.timedelta(days=-2)    # -2 days of issue date
    stock_pricing_date2 = date_1 + datetime.timedelta(days=-12)  # -12 days of issue date
    stock_price = quandl.get(stock_ticker, start_date=stock_pricing_date2, end_date=stock_pricing_date)
    stock_price_average = np.mean(stock_price['Adj_Close'])  # get average price
    # Amihud's liquidity measure
    liquidity_pricing_date = date_1 + datetime.timedelta(days=-20)
    liquidity_pricing_date2 = date_1 + datetime.timedelta(days=-120)
    stock_data = quandl.get(stock_ticker, start_date=liquidity_pricing_date2, end_date=liquidity_pricing_date)
    p = np.array(stock_data['Adj_Close'])
    returns = np.array(stock_data['Adj_Close'].pct_change())
    dollar_volume = np.array(stock_data['Adj_Volume'] * p)
    illiq = np.divide(returns, dollar_volume)
    print(np.nanmean(illiq))
    illiquidity_measure = np.nanmean(illiq, dtype=float) * (10 ** 6)  # multiply by 10^6 for expositional purposes
    return [stock_vola, stock_price_average, illiquidity_measure]
I then use a separate script to select my csv file with the list of rows, each row containing the issue_date and stock_ticker:
import function
import csv
import tkinter as tk
from tkinter import filedialog

# Open file dialog
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()

# Load spreadsheet data
f = open(file_path)
csv_f = csv.reader(f)
next(csv_f)
result_data = []

# Iterate
for row in csv_f:
    try:
        return_data = function.get_data(row[1], row[0])
        if len(return_data) != 0:
            # print(return_data)
            result_data_loc = [row[1], row[0]]
            result_data_loc.extend(return_data)
            result_data.append(result_data_loc)
    except AttributeError:
        print(row[0])
        print('\n\n')
        print(row[1])
        continue

if result_data is not None:
    with open('resuls.csv', mode='w', newline='') as result_file:
        csv_writer = csv.writer(result_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for result in result_data:
            # print(result)
            csv_writer.writerow(result)
else:
    print("No results found!")
It is quite messy, but like I mentioned before, I am definitely a beginner. Speeding this up would greatly help me.
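Since get_data spends nearly all of its time waiting on the network, a thread pool is usually the easiest win. A sketch using only the standard library (the get_data here is a trivial stand-in for the question's API-bound function, and rows stands in for the parsed csv rows):

```python
from concurrent.futures import ThreadPoolExecutor

def get_data(issue_date, stock_ticker):
    # Stand-in for the question's API-bound function.
    return [issue_date, stock_ticker]

rows = [("01/01/2020", "AAPL"), ("02/01/2020", "MSFT"), ("03/01/2020", "GOOG")]

# Each worker thread handles one row; while one thread waits on the API,
# the others keep working, and map() preserves the input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda r: get_data(*r), rows))

print(results)
```

One caveat: APIs typically rate-limit clients, so keep max_workers modest and keep the per-call error handling from get_data in place.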

Summarize non-zero values (or any values) from a pandas dataframe with timestamps - From_Time & To_Time

I have a dataframe, given below.
I want to extract all the non-zero values from each column and summarize them like this:
If any value is repeated for a period of time, the starting time of the value should go in the 'FROM' column and the end time in the 'TO' column, with the column name in the 'BLK-ASB-INV' column and the value itself in the 'Scount' column. To that end I have started writing code like this:
import pandas as pd

df = pd.read_excel("StringFault_Bagewadi_16-01-2020.xlsx")
df = df.set_index(['Date (+05:30)'])
cols = ['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res = pd.DataFrame(columns=cols)
for col in df.columns:
    ss = df[col].iloc[df[col].to_numpy().nonzero()[0]]
    .......
After that I am unable to work out how to approach the desired output. Is there any way to do this in Python? Thanks in advance for any help.
Finally I have solved my problem; the code given below works for me.
import pandas as pd

df = pd.read_excel("StringFault.xlsx")
df = df.set_index(['Date (+05:30)'])
cols = ['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res = pd.DataFrame(columns=cols)
for col in df.columns:
    device = []
    for i in range(len(df[col])):
        if df[col][i] == 0:
            pass
        else:
            if i < len(df[col]) - 1 and df[col][i] == df[col][i + 1]:
                try:
                    if df[col].index[i] > device[2]:
                        continue
                except IndexError:
                    device.append(df[col].name)
                    device.append(df[col][i])
                    device.append(df[col].index[i])
                    continue
            else:
                if len(device) == 3:
                    device.append(df[col].index[i])
                    res = res.append({'BLK-ASB-INV': device[0], 'Scount': device[1], 'FROM': device[2], 'TO': device[3]}, ignore_index=True)
                    device = []
                else:
                    device.append(df[col].name)
                    device.append(df[col][i])
                    if i == 0:
                        device.append(df[col].index[i])
                    else:
                        device.append(df[col].index[i - 1])
                    device.append(df[col].index[i])
                    res = res.append({'BLK-ASB-INV': device[0], 'Scount': device[1], 'FROM': device[2], 'TO': device[3]}, ignore_index=True)
                    device = []
For reference, here is the output dataframe.
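The same summary can be produced more idiomatically by labelling consecutive runs of equal values with a cumulative sum and grouping on that label. A sketch with made-up sample data, since the original xlsx isn't available (the column name here is a placeholder):

```python
import pandas as pd

# Made-up sample: a timestamp index and one fault-count column.
idx = pd.date_range("2020-01-16 10:00", periods=6, freq="15min")
df = pd.DataFrame({"BLK01-ASB01-INV1": [0, 3, 3, 0, 5, 5]}, index=idx)

rows = []
for col in df.columns:
    s = df[col]
    run_id = (s != s.shift()).cumsum()  # new label whenever the value changes
    for _, run in s.groupby(run_id):
        if run.iloc[0] == 0:            # skip runs of zeros
            continue
        rows.append({"BLK-ASB-INV": col,
                     "Scount": run.iloc[0],
                     "FROM": run.index[0],
                     "TO": run.index[-1]})

res = pd.DataFrame(rows, columns=["BLK-ASB-INV", "Scount", "FROM", "TO"])
print(res)
```

Each non-zero run becomes one row, with its first and last timestamps in FROM and TO.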

Unable to retrieve data from frame

I am trying to retrieve specific data from a data frame with a particular condition, but it shows an empty data frame. I am new to data science and trying to learn it. Here is my code.
file = open('/home/jeet/files1/files/ch03/adult.data', 'r')

def chr_int(a):
    if a.isdigit():
        return int(a)
    else:
        return 0

data = []
for line in file:
    data1 = line.split(',')
    if len(data1) == 15:
        data.append([chr_int(data1[0]), data1[1],
                     chr_int(data1[2]), data1[3],
                     chr_int(data1[4]), data1[5],
                     data1[6], data1[7], data1[8],
                     data1[9], chr_int(data1[10]),
                     chr_int(data1[11]),
                     chr_int(data1[12]),
                     data1[13], data1[14]])

import pandas as pd

df = pd.DataFrame(data)
df.columns = ['age', 'type-employer', 'fnlwgt', 'education', 'education_num',
              'marital', 'occupation', 'relationship', 'race', 'sex',
              'capital_gain', 'capital_loss', 'hr_per_week', 'country', 'income']

ml = df[(df.sex == 'Male')]  # here I retrieve the data for males
ml1 = df[(df.sex == 'Male') & (df.income == '>50K\n')]
print(ml1.head())            # here I print that data
fm = df[(df.sex == 'Female')]
fm1 = df[(df.sex == 'Female') & (df.income == '>50K\n')]
Output:
Empty DataFrame
Columns: [age, type-employer, fnlwgt, education, education_num, marital, occupation, relationship, race, sex, capital_gain, capital_loss, hr_per_week, country, income]
Index: []
What's wrong with the code? Why is the data frame empty?
If you check the values carefully, you may see the problem:
print(df.income.unique())
>>> [' <=50K\n' ' >50K\n']
There is a space in front of each value, so either the values should be processed to get rid of these spaces, or the code should be modified like this:
ml1 = df[(df.sex == 'Male') & (df.income == ' >50K\n')]
fm1 = df[(df.sex == 'Female') & (df.income == ' <=50K\n')]
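Alternatively, stripping the whitespace (and the trailing newline) once keeps the later comparisons clean. A minimal sketch with inline sample data shaped like the parsed adult.data:

```python
import pandas as pd

# Sample frame with the same artifacts as the parsed file:
# leading spaces and trailing newlines in the income column.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "income": [" >50K\n", " <=50K\n", " <=50K\n"],
})

# Normalize once, then compare against clean labels.
df["income"] = df["income"].str.strip()

ml1 = df[(df.sex == "Male") & (df.income == ">50K")]
print(len(ml1))
```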
