So I have this CSV data:
Medium narrow body, £8, 2650, 180, 8
Large narrow body, £7, 5600, 220, 10
Medium wide body, £5, 4050, 406, 14
The data I need are the numbers all the way on the right, which have the field name 'first_class'; the second column from the right has the field name 'Capacity'.
and I have written this code:
import csv

def menu():
    print("""
    1. Enter airport details
    2. Enter flight details
    3. Enter price plan and calculate profit
    4. Clear data
    5. Quit
    """)

if b == '2':
    a1 = input('Enter the type of aircraft: ')
    airplane_info = open('airplane.csv', 'r')
    csvreader = csv.DictReader(airplane_info, delimiter=',',
                               fieldnames=('Body_type', 'Running_cost',
                                           'Max_flight', 'Capacity', 'first_class'))
    for row in csvreader:
        if row['Body_type'] == a1:
            print(row)
        if row['Body_type'] != a1:
            print('Wrong aircraft type')
            flag = False
        else:
            d1 = input('Enter number of first class seats on the aircraft')
            if d1 != 0:
(That flag was sending the user back to the options menu; ignore it.)
Now I need to take the aircraft type that the user enters and use its 'first_class' field together with the number of first class seats the user enters. Say the user enters the aircraft type 'Medium wide body', which has 14 first class seats. When the user is then asked to enter the number of first class seats and enters fewer than 14, an error message should pop up. How would I do it? Would I read the CSV data into an array and then use it for the comparison?
Here is a quick example using the Pandas library. You will need to install it first:
pip install --user pandas
Using pandas you can parse the CSV into a DataFrame object and then work on it as you like:
import pandas as pd

df = pd.read_csv('airplane.csv',
                 names=["Body_type", "Running_cost", "Max_flight",
                        "Capacity", "first_class"])

a1 = input('Enter the type of aircraft: ')
if a1 in df.Body_type.values:
    body = df[df.Body_type == a1]
    d1 = int(input('Enter number of first class seats on the aircraft: '))
    if d1 < body["first_class"].values[0]:
        print("Error")
        # ...
else:
    print('Wrong aircraft type')
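If you would rather stick with the csv module you are already using, the same check works without pandas. A minimal sketch, assuming the airplane.csv layout and field names from the question:
import csv

# Build a lookup table of first-class seat counts keyed by body type.
first_class_seats = {}
with open('airplane.csv', newline='') as f:
    reader = csv.DictReader(f, fieldnames=('Body_type', 'Running_cost',
                                           'Max_flight', 'Capacity', 'first_class'))
    for row in reader:
        first_class_seats[row['Body_type']] = int(row['first_class'])

a1 = input('Enter the type of aircraft: ')
if a1 in first_class_seats:
    d1 = int(input('Enter number of first class seats on the aircraft: '))
    if d1 < first_class_seats[a1]:
        print('Error: this aircraft has', first_class_seats[a1], 'first class seats')
else:
    print('Wrong aircraft type')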
Python 3: I am trying to code a bingo game that asks the user for 1-3 players, assigns each player a name, and then creates their bingo card by choosing 25 elements from the list listOfStrings. The list contains 53 elements, all strings. I am getting the error "ValueError: Sample larger than population or is negative", but 25 < 53? Is the size of my list incorrect, or do I have to assign each element a number? I'm not sure why this is happening. I'm new to programming, so it's probably a simple mistake or misunderstanding.
import urllib.request
import random

listOfStrings = []

def createList():
    try:
        with urllib.request.urlopen('https://www.cs.queensu.ca/home/cords2/bingo.txt') as f:
            d = f.read().decode('utf-8')
            split = d.split("\n")
            listOfStrings = split
    except urllib.error.URLError as e:
        print(e.reason)

def players(listOfStrings):
    noPlayers = input("How many players would like to play Bingo? Choose 1-3: ")
    if noPlayers == "1":
        name = input("What is the name of player 1? ")
        player1card = random.sample(list(listOfStrings), 25)

createList()
players(listOfStrings)
print(player1card)
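For what it's worth, the likely cause is that listOfStrings = split inside createList creates a new local variable rather than updating the module-level list, so players() receives the still-empty global list and random.sample is asked for 25 items from a population of 0. A minimal sketch of one fix, returning the list instead of relying on a global:
def createList():
    try:
        with urllib.request.urlopen('https://www.cs.queensu.ca/home/cords2/bingo.txt') as f:
            return f.read().decode('utf-8').split("\n")
    except urllib.error.URLError as e:
        print(e.reason)
        return []  # fall back to an empty list if the download fails

listOfStrings = createList()
players(listOfStrings)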
I'm wondering what data structure would be appropriate to store information about chemical elements that I have in a text file. My program should read and process input from the user: if the user enters an integer, the program should display the symbol and name of the element with that number of protons; if the user enters a string, the program should display the number of protons for the element with that name or symbol.
The text file is formatted as below:
# element.txt
1,H,Hydrogen
2,He,Helium
3,Li,Lithium
4,Be,Beryllium
...
I thought of a dictionary, but figured that mapping a string to a list could be tricky, since my program has to respond differently depending on whether the user provides an integer or a string.
You shouldn't be worried about the "performance" of looking for an element:
There are no more than 200 elements, which is a small number for a computer;
Since the program interacts with a human user, the human will be orders of magnitude slower than the computer anyway.
Option 1: pandas.DataFrame
Hence I suggest a simple pandas DataFrame:
import pandas as pd

# element.txt has no header row, so supply the column names explicitly
# (otherwise the first element would be consumed as the header).
df = pd.read_csv('element.txt', header=None, names=['Number', 'Symbol', 'Name'])

def get_column_and_key(s):
    s = s.strip()
    try:
        k = int(s)
        return 'Number', k
    except ValueError:
        if len(s) <= 2:
            return 'Symbol', s
        else:
            return 'Name', s

def find_element(s):
    column, key = get_column_and_key(s)
    return df[df[column] == key]
def play():
    keep_going = True
    while keep_going:
        s = input('>>>> ')
        if s.startswith('q'):
            keep_going = False
        else:
            print(find_element(s))

if __name__ == '__main__':
    play()
See also:
Finding elements in a pandas dataframe
Option 2: three redundant dicts
One of Python's most used data structures is the dict. Here we have three different possible keys, so we'll use three dicts.
import csv

elements_by_num = {}
elements_by_symbol = {}
elements_by_name = {}

with open('element.txt', 'r') as f:
    data = csv.reader(f)
    for row in data:
        num, symbol, name = int(row[0]), row[1], row[2]
        elements_by_num[num] = num, symbol, name
        elements_by_symbol[symbol] = num, symbol, name
        elements_by_name[name] = num, symbol, name

def get_dict_and_key(s):
    s = s.strip()
    try:
        k = int(s)
        return elements_by_num, k
    except ValueError:
        if len(s) <= 2:
            return elements_by_symbol, s
        else:
            return elements_by_name, s

def find_element(s):
    d, key = get_dict_and_key(s)
    return d[key]

def play():
    keep_going = True
    while keep_going:
        s = input('>>>> ')
        if s.startswith('q'):
            keep_going = False
        else:
            print(find_element(s))

if __name__ == '__main__':
    play()
You are right that it is tricky. However, I suggest you just make three dictionaries. You could certainly store the data in a 2D list, but that would be much harder to build and access than three dicts. If you want, you can join the three dicts into one; I personally wouldn't, but the final choice is always up to you.
weight = {1: ("H", "Hydrogen"), 2: ...}
symbol = {"H": (1, "Hydrogen"), "He": ...}
name = {"Hydrogen": (1, "H"), "Helium": ...}
If you want to get into databases and some QLs, I suggest looking into sqlite3. It's a classic, thus it's well documented.
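To make that concrete, here is a minimal sqlite3 sketch (assuming element.txt contains clean num,symbol,name rows as shown above); a single table answers all three kinds of lookup:
import csv
import sqlite3

conn = sqlite3.connect(':memory:')  # use a file path instead for persistence
conn.execute('CREATE TABLE elements (num INTEGER, symbol TEXT, name TEXT)')
with open('element.txt', newline='') as f:
    rows = [(int(num), symbol, name) for num, symbol, name in csv.reader(f)]
conn.executemany('INSERT INTO elements VALUES (?, ?, ?)', rows)

def find_element(key):
    # One query covers number, symbol, and name lookups.
    return conn.execute(
        'SELECT num, symbol, name FROM elements '
        'WHERE num = ? OR symbol = ? OR name = ?',
        (key, key, key)).fetchone()

print(find_element('He'))  # (2, 'He', 'Helium')
print(find_element(3))     # (3, 'Li', 'Lithium')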
Edit 12/07/19: The problem was in fact not with the pandas rename function, but with the fact that I did not return the DataFrame from the function, so the column change did not exist when printing, i.e.:
def change_column_names(as_pandas, old_name, new_name):
    as_pandas.rename(columns={old_name: new_name}, inplace=True)
    return as_pandas  # <- This was missing
Please see the user comment below and upvote it; they found this error for me.
Alternatively, you can continue reading.
The data can be downloaded from this link, but I have added a sample dataset below. The formatting of the file is not a typical CSV file; I believe it may have been an assessment piece related to the Hidden Decision Tree article. I have given the portion of the code that solves the issues surrounding the format of the text file, as mentioned above, and allows the user to rename a column.
The problem occurred when I tried to create a renaming function:
def change_column_names(as_pandas, old_name, new_name):
    as_pandas.rename(columns={old_name: new_name}, inplace=True)
However, it seemed to work when I hard-coded the column names inside the rename function:
def change_column_names(as_pandas):
    as_pandas.rename(columns={'Unique Pageviews': 'Page_Views'}, inplace=True)
    return as_pandas
Sample Dataset
Title URL Date Unique Pageviews
oupUrl=tutorials 18-Apr-15 5608
"An Exclusive Interview with Data Expert, John Bottega" http://www.datasciencecentral.com/forum/topics/an-exclusive-interview-with-data-expert-john-bottega?groupUrl=announcements 10-Jun-14 360
Announcing Composable Analytics http://www.datasciencecentral.com/forum/topics/announcing-composable-analytics 15-Jun-14 367
Announcing the release of Spark 1.5 http://www.datasciencecentral.com/forum/topics/announcing-the-release-of-spark-1-5 12-Sep-15 156
Are Extreme Weather Events More Frequent? The Data Science Answer http://www.datasciencecentral.com/forum/topics/are-extreme-weather-events-more-frequent-the-data-science-answer 5-Oct-15 204
Are you interested in joining the University of California for an empiricalstudy on 'Big Data'? http://www.datasciencecentral.com/forum/topics/are-you-interested-in-joining-the-university-of-california-for-an 7-Feb-13 204
Are you smart enough to work at Google? http://www.datasciencecentral.com/forum/topics/are-you-smart-enough-to-work-at-google 11-Oct-15 3625
"As a software engineer, what's the best skill set to have for the next 5-10years?" http://www.datasciencecentral.com/forum/topics/as-a-software-engineer-what-s-the-best-skill-set-to-have-for-the- 12-Feb-16 2815
A Statistician's View on Big Data and Data Science (Updated) http://www.datasciencecentral.com/forum/topics/a-statistician-s-view-on-big-data-and-data-science-updated-1 21-May-14 163
A synthetic variance designed for Hadoop and big data http://www.datasciencecentral.com/forum/topics/a-synthetic-variance-designed-for-hadoop-and-big-data?groupUrl=research 26-May-14 575
A Tough Calculus Question http://www.datasciencecentral.com/forum/topics/a-tough-calculus-question 10-Feb-16 937
Attribution Modeling: Key Analytical Strategy to Boost Marketing ROI http://www.datasciencecentral.com/forum/topics/attribution-modeling-key-concept 24-Oct-15 937
Audience expansion http://www.datasciencecentral.com/forum/topics/audience-expansion 6-May-13 223
Automatic use of insights http://www.datasciencecentral.com/forum/topics/automatic-use-of-insights 27-Aug-15 122
Average length of dissertations by higher education discipline. http://www.datasciencecentral.com/forum/topics/average-length-of-dissertations-by-higher-education-discipline 4-Jun-15 1303
This is the full code that produces the KeyError:
import csv
import pandas as pd

def change_column_names(as_pandas):
    as_pandas.rename(columns={'Unique Pageviews': 'Page_Views'}, inplace=True)

def change_column_names(as_pandas, old_name, new_name):
    as_pandas.rename(columns={old_name: new_name}, inplace=True)

def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader

# Get each column of data including the heading and separate each element,
# i.e. Title, URL, Date, Page Views,
# and save to string_of_rows with a comma separator for storage as a csv
# file.
def get_columns_of_data(*args):
    # Function that accepts variable length arguments
    string_of_rows = str()
    num_cols = len(args)
    try:
        if num_cols > 0:
            for number, element in enumerate(args):
                if number == (num_cols - 1):
                    string_of_rows = string_of_rows + element + '\n'
                else:
                    string_of_rows = string_of_rows + element + ','
    except UnboundLocalError:
        print('Empty file \'or\' No arguments received, cannot be zero')
    return string_of_rows

def open_file(file_name):
    try:
        with open(file_name) as csv_file_in, open('HDT_data5.txt', 'w') as csv_file_out:
            csv_read = csv.reader(csv_file_in, delimiter='\t')
            for row in csv_read:
                try:
                    row[0] = row[0].replace(',', '')
                    csv_file_out.write(get_columns_of_data(*row))
                except TypeError:
                    continue
        print("The file name '{}' was successfully opened and read".format(file_name))
    except IOError:
        print('File not found \'OR\' Not in current directory\n')

# All acronyms used in variable naming correspond to the function at time
# of return from function, e.g. csv_list being a list of the csv file
# contents; the remainder, 'st', of csv_list_st = split_title().
def main():
    open_file('HDTdata3.txt')
    multi_sets = open_as_dataframe('HDT_data5.txt')
    # change_column_names(multi_sets)
    change_column_names(multi_sets, 'Old_Name', 'New_Name')
    print(multi_sets)

main()
I cleaned up your code so it would run. You were changing the column names but not returning the result. Try the following:
import pandas as pd
import numpy as np
import math
import csv

def set_new_columns(as_pandas):
    titles_list = ['Year > 2014', 'Forum', 'Blog', 'Python', 'R',
                   'Machine_Learning', 'Data_Science', 'Data',
                   'Analytics']
    for number, word in enumerate(titles_list):
        as_pandas.insert(len(as_pandas.columns), titles_list[number], 0)

def title_length(as_pandas):
    # Insert new column header then count the number of letters in 'Title'
    as_pandas.insert(len(as_pandas.columns), 'Title_Length', 0)
    as_pandas['Title_Length'] = as_pandas['Title'].map(str).apply(len)

# Although it is a log, the difference logX1 - logX2 behaves like a
# percentage change, so you can think of it as the percentage change in
# Page Views. The map function applies the lambda to every row in the
# 'Page_Views' column.
def log_page_view(as_pandas):
    # Insert new column header
    as_pandas.insert(len(as_pandas.columns), 'Log_Page_Views', 0)
    as_pandas['Log_Page_Views'] = as_pandas['Page_Views'].map(
        lambda x: math.log(1 + float(x)))

def change_to_numeric(as_pandas):
    # Check for missing values then convert the column to numeric.
    as_pandas = as_pandas.replace(r'^\s*$', np.nan, regex=True)
    as_pandas['Page_Views'] = pd.to_numeric(as_pandas['Page_Views'],
                                            errors='coerce')
    return as_pandas  # replace() returns a copy, so hand it back to the caller

def change_column_names(as_pandas):
    as_pandas.rename(columns={'Unique Pageviews': 'Page_Views'}, inplace=True)
    return as_pandas

def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader

# Get each column of data including the heading and separate each element,
# i.e. Title, URL, Date, Page Views,
# and save to string_of_rows with a comma separator for storage as a csv
# file.
def get_columns_of_data(*args):
    # Function that accepts variable length arguments
    string_of_rows = str()
    num_cols = len(args)
    try:
        if num_cols > 0:
            for number, element in enumerate(args):
                if number == (num_cols - 1):
                    string_of_rows = string_of_rows + element + '\n'
                else:
                    string_of_rows = string_of_rows + element + ','
    except UnboundLocalError:
        print('Empty file \'or\' No arguments received, cannot be zero')
    return string_of_rows

def open_file(file_name):
    try:
        with open(file_name) as csv_file_in, open('HDT_data5.txt', 'w') as csv_file_out:
            csv_read = csv.reader(csv_file_in, delimiter='\t')
            for row in csv_read:
                try:
                    row[0] = row[0].replace(',', '')
                    csv_file_out.write(get_columns_of_data(*row))
                except TypeError:
                    continue
        print("The file name '{}' was successfully opened and read".format(file_name))
    except IOError:
        print('File not found \'OR\' Not in current directory\n')

def main():
    open_file('HDTdata3.txt')
    multi_sets = open_as_dataframe('HDT_data5.txt')
    multi_sets = change_column_names(multi_sets)
    multi_sets = change_to_numeric(multi_sets)
    log_page_view(multi_sets)
    title_length(multi_sets)
    set_new_columns(multi_sets)
    print(multi_sets)

main()
The following code obtains specific data from an internet financial portal (Morningstar). I obtain data from different companies, in this case from Dutch companies. Each one is represented by a ticker.
import pandas as pd
import numpy as np

def financials_download(ticker, report, frequency):
    if frequency == "A" or frequency == "a":
        frequency = "12"
    elif frequency == "Q" or frequency == "q":
        frequency = "3"
    url = ('http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=' + ticker
           + '&region=usa&culture=en-US&cur=USD&reportType=' + report
           + '&period=' + frequency
           + '&dataType=R&order=desc&columnYear=5&rounding=3&view=raw&r=640081'
           + '&denominatorView=raw&number=3')
    df = pd.read_csv(url, skiprows=1, index_col=0)
    return df

def ratios_download(ticker):
    url = ('http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=' + ticker
           + '&region=usa&culture=en-US&cur=USD&order=desc')
    df = pd.read_csv(url, skiprows=2, index_col=0)
    return df
holland=("AALBF","ABN","AEGOF", "AHODF", "AKZO","ALLVF","AMSYF","ASML","KKWFF","KDSKF","GLPG","GTOFF","HINKF","INGVF","KPN","NN","LIGHT","RANJF","RDLSF","RDS.A","SBFFF", "UNBLF", "UNLVF", "VOPKF", "WOLTF")
def finance(country):
    for ticker in country:
        frequency = "a"
        df1 = financials_download(ticker, 'bs', frequency)
        df2 = financials_download(ticker, 'is', frequency)
        df3 = ratios_download(ticker)
        d1 = df1.loc['Total assets']
        if "EBITDA" in df2.index:
            d2 = df2.loc["EBITDA"]
        else:
            d2 = None
        if "Revenue USD Mil" in df3.index:
            d3 = df3.loc["Revenue USD Mil"]
        else:
            d3 = df3.loc["Revenue EUR Mil"]
        d4 = df3.loc["Operating Margin %"]
        d5 = df3.loc["Return on Assets %"]
        d6 = df3.loc["Return on Equity %"]
        d7 = df3.loc["EBT Margin"]
        d8 = df3.loc["Net Margin %"]
        d9 = df3.loc["Free Cash Flow/Sales %"]
        # Transpose each series to a one-row frame and stack them;
        # d2 is skipped when EBITDA was not available.
        rows = [d.to_frame().T for d in (d1, d2, d3, d4, d5, d6, d7, d8, d9)
                if d is not None]
        df_new = pd.concat(rows)
        df_new.to_csv(ticker + '.csv')
The problem is that when I use a for loop to go through all the tickers in the variable holland, generating a CSV document for each of them, it returns the following error:
File "pandas/_libs/parsers.pyx", line 565, in
pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:6260)
EmptyDataError: No columns to parse from file
On the other hand, it runs without error if I just select one company ticker after the other.
I'd really appreciate it if you could help me.
When you run your script several times, it fails on different tickers and on different calls. This is an indication that the problem is not associated with a specific ticker, but rather that the CSV reader call sometimes doesn't return a value that can be read into the data frame. You can address this problem by using Python's error handling routines, e.g. for your financials_download function:
df = ""
i = 0
#some data in df?
while len(df) == 0:
#try to download data and load them into df
try:
df = pd.read_csv(url, skiprows=1, index_col=0)
#not successful? Count failed attempts
except:
i += 1
print("Trial", i, "failed")
#five attempts failed? Unlikely that this server will respond
if i == 5:
print("ticker", ticker, ": server is down")
break
#print("downloaded", ticker)
#print("financial download data frame:")
#print(df)
This tries five times to retrieve the data for the ticker and, if that fails, prints a message that it was not successful. But now you have to deal with this situation in your main program and adjust it, because some of the data frames will be empty.
For this kind of basic debugging, I would point you to a blog post.
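To sketch one way of doing that (assuming financials_download and ratios_download are changed as above, so a failed download leaves df empty), you could skip incomplete tickers in finance():
def finance(country):
    for ticker in country:
        df1 = financials_download(ticker, 'bs', 'a')
        df2 = financials_download(ticker, 'is', 'a')
        df3 = ratios_download(ticker)
        # A download that failed five times comes back empty; skip the ticker.
        if any(len(df) == 0 for df in (df1, df2, df3)):
            print("skipping", ticker, "- incomplete data")
            continue
        ...  # build and save df_new as before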
I am trying to select specific fields from my Qdata.txt file and use fields[2] to calculate the average for each year separately. My code gives only the total average.
The data file looks like this (the first day of a year is 101 and the last is 1231):
Date 3700300 6701500
20000101 21.00 223.00
20000102 20.00 218.00
. .
20001231 7.40 104.00
20010101 6.70 104.00
. .
20130101 8.37 111.63
. .
20131231 45.00 120.98
import sys

td = open("Qdata.txt", "r")  # open file Qdata
total = 0
count = 0
row1 = True
for row in td:
    if row1:
        row1 = False  # row1 is the header line
    else:
        fields = row.split()
        try:
            total = total + float(fields[2])
            count = count + 1
        # Errors.
        except IndexError:
            continue
        except ValueError:
            print("File is incorrect.")
            sys.exit()
print("Average in 2000 was: ", total / count)
You could use itertools.groupby using the first four characters as the key for grouping.
with open("data.txt") as f:
next(f) # skip first line
groups = itertools.groupby(f, key=lambda s: s[:4])
for k, g in groups:
print(k, [s.split() for s in g])
This gives you the entries grouped by year, for further processing.
Output for your example data:
2000 [['20000101', '21.00', '223.00'], ['20000102', '20.00', '218.00'], ['20001231', '7.40', '104.00']]
2001 [['20010101', '6.70', '104.00']]
2013 [['20130101', '8.37', '111.63'], ['20131231', '45.00', '120.98']]
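From there, per-year averages are a short step. A minimal follow-up sketch, assuming the file is sorted by date (as in the sample) and contains only well-formed data rows:
import itertools

with open("data.txt") as f:
    next(f)  # skip the header line
    for year, g in itertools.groupby(f, key=lambda s: s[:4]):
        values = [float(row.split()[2]) for row in g]
        print("Average in {} was: {:.2f}".format(year, sum(values) / len(values)))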
You could create a dict (or even a defaultdict) for total and count instead:
import sys
from collections import defaultdict

td = open("Qdata.txt", "r")  # open file Qdata
total = defaultdict(float)
count = defaultdict(int)
row1 = True
for row in td:
    if row1:
        row1 = False  # row1 is the header line
    else:
        fields = row.split()
        try:
            year = int(fields[0][:4])
            total[year] += float(fields[2])
            count[year] += 1
        # Errors.
        except IndexError:
            continue
        except ValueError:
            print("File is incorrect.")
            sys.exit()
print("Average in 2000 was: ", total[2000] / count[2000])
Every year separately? You have to divide your input into groups; something like this might be what you want:
from collections import defaultdict

row1 = True
year_sums = defaultdict(list)
for row in td:
    if row1:
        row1 = False
        continue
    fields = row.split()
    year = fields[0][:4]
    year_sums[year].append(float(fields[2]))

for year in year_sums:
    average = sum(year_sums[year]) / len(year_sums[year])
    print("Average in {} was: {}".format(year, average))
That is just some example code; I don't know for sure that it works, but it should give you an idea of what you can do. year_sums is a defaultdict containing lists of values grouped by year. You can then use it for other statistics if you want.
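For instance, a small sketch of what such "other statistics" could look like, using only the standard library's statistics module on the year_sums built above:
import statistics

# Per-year summaries from the year_sums defaultdict above.
for year, values in sorted(year_sums.items()):
    print(year, statistics.mean(values), min(values), max(values))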