creating objects with attributes from a text file - python-3.x

I'm making a program that reads information about some football teams from a text file (named 30eapril.txt) and uses this data to create team objects. How can I make the program handle however many teams are in the text file and create an object for each of them? The code I've written so far works but has a lot of repetitive parts!
class team:
    def __init__(self, teamdata):
        self.name = teamdata[0]
        self.wins = teamdata[1]
        self.drawn = teamdata[2]
        self.losses = teamdata[3]
    def __repr__(self):
        return self.name.ljust(15) + '{} {} {}'.format(self.wins, self.drawn, self.losses)
laglista = []
with open('30eapril.txt', 'rt') as file:
    for line in file:
        laglista.append(line)
team1data = (laglista[0]).split()
team2data = (laglista[1]).split()
team3data = (laglista[2]).split()
team4data = (laglista[3]).split()
lag1 = team(team1data)
lag2 = team(team2data)
lag3 = team(team3data)
lag4 = team(team4data)
print(lag1)
print(lag2)
print(lag3)
print(lag4)
This is what was in the text file:
Arsenal 2 1 0
Manchester 2 0 0
Liverpool 0 1 2
Newcastle 0 0 2
Hope that someone can help!
//Peter

Shortened code: (could certainly be even better)
#!/usr/bin/env python3
class team:
    def __init__(self, teamdata):
        self.name, self.wins, self.drawn, self.losses = teamdata
    def __repr__(self):
        return self.name.ljust(15) + '{} {} {}'.format(self.wins, self.drawn, self.losses)
lag = []
with open('30eapril.txt', 'rt') as file:
    for line in file:
        lag.append(team(line.split()))
#print("Number of teams: " + str(len(lag)))
for l in lag:
    print(l)
You don't need to know the number of lines of your file.
With the same content of '30eapril.txt', the output is:
$ ./test_script3.py
Arsenal 2 1 0
Manchester 2 0 0
Liverpool 0 1 2
Newcastle 0 0 2
The same script run on '30eapril.txt' with an extra line:
$ ./test_script3.py
Arsenal 2 1 0
Manchester 2 0 0
Liverpool 0 1 2
Newcastle 0 0 2
AnotherClub 1 0 2
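As a minimal sketch of the same idea, the version below also converts the three counts to integers and skips blank lines. The names Team.from_line and read_teams are hypothetical helpers introduced here for illustration, and io.StringIO stands in for the real file so the example runs on its own. It assumes each team name is a single word.

```python
import io


class Team:
    def __init__(self, name, wins, drawn, losses):
        self.name = name
        self.wins = wins
        self.drawn = drawn
        self.losses = losses

    @classmethod
    def from_line(cls, line):
        # Assumes the layout "name wins drawn losses" with a one-word name.
        name, wins, drawn, losses = line.split()
        return cls(name, int(wins), int(drawn), int(losses))

    def __repr__(self):
        return self.name.ljust(15) + '{} {} {}'.format(self.wins, self.drawn, self.losses)


def read_teams(lines):
    # One object per non-blank line, however many lines there are.
    return [Team.from_line(line) for line in lines if line.strip()]


# Stand-in for open('30eapril.txt'); any iterable of lines works.
sample = io.StringIO(
    "Arsenal 2 1 0\nManchester 2 0 0\nLiverpool 0 1 2\nNewcastle 0 0 2\n\n"
)
teams = read_teams(sample)
for t in teams:
    print(t)
```

With the real file, read_teams(open('30eapril.txt')) inside a with block would do the same. Storing integers instead of strings makes later arithmetic, such as computing points, straightforward.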

Related

Hackerrank - why is my output being written one character at a time?

I am solving the following "Vertical Sticks" HackerRank challenge: https://www.hackerrank.com/challenges/vertical-sticks/problem?isFullScreen=true&h_r=next-challenge&h_v=zen&h_r=next-challenge&h_v=zen
Here is my solution:
def solve(y):
    out = []
    x = list(itertools.permutations(y))
    for yp in x:
        arr = []
        arr.append(1)
        for i in range(int(1),int(len(yp))):
            #flag = 0
            for j in range(int(i-1),int(-1),int(-1)):
                if yp[j] >= yp[i]:
                    arr.append(i-j)
                    #flag+=1
                    break
                if j==0:
                    arr.append(i+1)
        out.append(sum(arr))
    p = round((sum(out)/len(out)),2)
    pp = "%0.2f" % (p)
    print(pp)
    return pp
if __name__ == '__main__':
    fptr = open(os.environ['OUTPUT_PATH'], 'w')
    t = int(input().strip())
    for t_itr in range(t):
        y_count = int(input().strip())
        y = list(map(int, input().rstrip().split()))
        result = solve(y)
        fptr.write('\n'.join(map(str, result)))
        fptr.write('\n')
    fptr.close()
My print(pp) output comes out correctly for the test case as:
4.33
3.00
4.00
6.00
5.80
11.15
But my return pp stdout comes out as:
4
.
3
3
3
.
0
0
4
.
0
0
6
.
0
0
5
.
8
0
1
1
.
1
5
i.e. one character per line, and is classified incorrect. Could somebody point me into the direction of why this is?
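The return value of solve is already a string. join iterates over its argument, and iterating over a string yields its individual characters, so '\n'.join(map(str, result)) puts a newline between every character. Write the string as it is instead, for example fptr.write(result + '\n').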
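The effect is easy to reproduce in isolation. In this small sketch, solve_stub is a hypothetical stand-in for solve that returns an already formatted string:

```python
def solve_stub():
    # Stand-in for solve(): returns an already formatted string.
    return "%0.2f" % 4.333


result = solve_stub()

# join() iterates over the string, so each character becomes its own "line".
per_char = '\n'.join(map(str, result))

# Writing the string itself keeps the number on one line.
fixed = result + '\n'

print(repr(per_char))
print(repr(fixed))
```

The first repr shows a newline between every character, which matches the one-character-per-line output in the question; the second is the single line the judge expects.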

Why do I get a view error when enumerating a Dataframe

Why do I get a "view" error:
ndf = pd.DataFrame()
ndf['Signals'] = [1,1,1,1,1,0,0,0,0,0]
signals_diff = ndf.Signals.diff()
ndf['Revals'] = [101,102,105,104,105,106,107,108,109,109]
ndf['Entry'] = 0
for i,element in enumerate(signals_diff):
    if (i==0):
        ndf.iloc[i]['Entry'] = ndf.iloc[i]['Revals']
    elif (element == 0):
        ndf.iloc[i]['Entry'] = ndf.iloc[i - 1]['Entry']
    else:
        ndf.iloc[i]['Entry'] = ndf.iloc[i]['Revals']
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
ndf.iloc[i]['Entry'] = ndf.iloc[i]['Revals']
Instead of iloc, use loc:
ndf = pd.DataFrame()
ndf['Signals'] = [1,1,1,1,1,0,0,0,0,0]
signals_diff = ndf.Signals.diff()
ndf['Revals'] = [101,102,105,104,105,106,107,108,109,109]
ndf['Entry'] = 0
for i,element in enumerate(signals_diff):
    if (i==0):
        ndf.loc[i,'Entry'] = ndf.loc[i,'Revals']
    elif (element == 0):
        ndf.loc[i,'Entry'] = ndf.loc[i - 1,'Entry']
    else:
        ndf.loc[i,'Entry'] = ndf.loc[i,'Revals']
This solves the problem, but note that loc selects by index label rather than by position. It works here only because the DataFrame has the default integer index, where labels and positions coincide; with a different index you might not get the expected result.
Do not chain indexers like ndf.iloc[i]['Entry'] when assigning. The first step, ndf.iloc[i], may return a copy, so the assignment can end up modifying that copy instead of ndf.
That said, your code can be rewritten as:
ndf['Entry'] = ndf['Revals'].where(signals_diff != 0).ffill()
Output:
   Signals  Revals  Entry
0        1     101  101.0
1        1     102  101.0
2        1     105  101.0
3        1     104  101.0
4        1     105  101.0
5        0     106  106.0
6        0     107  106.0
7        0     108  106.0
8        0     109  106.0
9        0     109  106.0
To keep using position-based indexing, look up the column positions with get_indexer:
for i,element in enumerate(signals_diff):
    if (i==0):
        ndf.iloc[i,ndf.columns.get_indexer(['Entry'])] = ndf.iloc[i,ndf.columns.get_indexer(['Revals'])]
    elif (element == 0):
        ndf.iloc[i,ndf.columns.get_indexer(['Entry'])] = ndf.iloc[i - 1,ndf.columns.get_indexer(['Entry'])]
    else:
        ndf.iloc[i,ndf.columns.get_indexer(['Entry'])] = ndf.iloc[i,ndf.columns.get_indexer(['Revals'])]
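The approaches above can be compared side by side. The sketch below rebuilds the question's frame, fills Entry in a loop using a single iloc call with a row and a column position (via columns.get_loc, a variation on the get_indexer idea, not taken from the answers), and checks the result against the where/ffill one-liner. The names entry_col and revals_col are introduced here for illustration.

```python
import pandas as pd

ndf = pd.DataFrame({
    'Signals': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    'Revals': [101, 102, 105, 104, 105, 106, 107, 108, 109, 109],
})
signals_diff = ndf['Signals'].diff()
ndf['Entry'] = 0

# Integer positions of the two columns, looked up once.
entry_col = ndf.columns.get_loc('Entry')
revals_col = ndf.columns.get_loc('Revals')

for i, element in enumerate(signals_diff):
    if i > 0 and element == 0:
        # Signal unchanged: carry the previous entry forward.
        ndf.iloc[i, entry_col] = ndf.iloc[i - 1, entry_col]
    else:
        # First row or signal changed: take the current Revals value.
        ndf.iloc[i, entry_col] = ndf.iloc[i, revals_col]

# Loop-free equivalent from the answer above.
vectorized = ndf['Revals'].where(signals_diff != 0).ffill()

print(ndf['Entry'].tolist())
print(vectorized.tolist())
```

Because each assignment is one indexing operation on ndf itself, there is no intermediate object for pandas to warn about. The vectorized version returns floats, since where introduces NaN before the forward fill.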

Max Points on a Line with python 3

algorithm question:
Given n points on a 2D plane, find the maximum number of points that lie on the same straight line.
Example 1:
Input: [[1,1],[2,2],[3,3]]
Output: 3
Explanation:
^
|
| o
| o
| o
+------------->
0 1 2 3 4
Example 2:
Input: [[1,1],[3,2],[5,3],[4,1],[2,3],[1,4]]
Output: 4
The working Python 3 code is below. I am wondering why snippet 1,
d[slope] = d.get(slope, 1) + 1
works, while snippet 2 gives the wrong result for Example 2, even though the two look equivalent:
if slope in d:
    d[slope] += 1
else:
    d[slope] = 1
from typing import List
class Solution:
    def gcd(self, a, b):
        if b == 0:
            return a
        return self.gcd(b, a%b)
    def get_slope(self, p1, p2):
        dx = p1[0] - p2[0]
        dy = p1[1] - p2[1]
        c = self.gcd(dx, dy)
        dx /= c
        dy /= c
        return str(dy) + "/" + str(dx)
    def is_same_points(self, p1:List[int], p2:List[int]):
        return p1[0] == p2[0] and p1[1] == p2[1]
    def maxPoints(self, points: List[List[int]]) -> int:
        if not points:
            return 0
        n = len(points)
        count = 1
        for i in range(0, n):
            d = {}
            duped = 0
            localmax = 1
            p1 = points[i]
            for j in range(i+1, n):
                p2 = points[j]
                if self.is_same_points(p1, p2):
                    duped += 1
                else:
                    slope = self.get_slope(p1, p2)
                    # 1) not work: output is 3 in example 2
                    # if slope in d:
                    #     d[slope] += 1
                    # else:
                    #     d[slope] = 1
                    # 2) works: correct output 4 for example 2
                    d[slope] = d.get(slope, 1) + 1
                    localmax = max(localmax, d[slope])
            count = max(count, localmax + duped)
        return count
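Interesting problem and nice solution.
The commented-out code does not work because of this line:
else:
    d[slope] = 1 ## correct would be d[slope] = 2
Any two points lie on a common line, so the first time a slope is seen there are already two points on that line: p1 and p2. Starting the count at 1 counts only one of them, so the final answer comes out one too small. d.get(slope, 1) + 1 evaluates to 2 the first time, which is why snippet 1 works.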
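The difference between the two snippets shows up when they are run on the same list of slopes. This is a small sketch: the function names are hypothetical, and the slope strings are the ones get_slope produces for p1 = [3,2] against the later points [5,3], [4,1], [2,3], [1,4] of Example 2.

```python
def count_snippet_2(slopes):
    # Snippet 2 from the question: the first sighting stores 1.
    d = {}
    for s in slopes:
        if s in d:
            d[s] += 1
        else:
            d[s] = 1  # counts p2 only and forgets the anchor point p1
    return d


def count_snippet_1(slopes):
    # Snippet 1: the default of 1 stands for p1, so the first sighting stores 2.
    d = {}
    for s in slopes:
        d[s] = d.get(s, 1) + 1
    return d


slopes = ['1.0/2.0', '1.0/-1.0', '1.0/-1.0', '1.0/-1.0']

print(count_snippet_2(slopes))
print(count_snippet_1(slopes))
```

Snippet 2 reports 3 for the slope shared by three later points, while snippet 1 reports 4 because it also counts p1. Changing the else branch to d[slope] = 2 makes the two agree.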

Optimizing using Pandas Data Frame

I have the following function that loads a csv into a data frame then does some calculations. It takes about 4-5 minutes to do calculation on the csv with a little over 100,000 lines. I was hoping there is a faster way.
def calculate_adeck_errors(in_file):
    print(f'Starting Data Calculations: {datetime.datetime.now().strftime("%I:%M%p on %B %d, %Y")}')
    pd.set_option('display.max_columns', 12)
    # read in the raw csv
    adeck_df = pd.read_csv(in_file)
    #print(adeck_df)
    #extract only the carq items and remove duplicates
    carq_data = adeck_df[(adeck_df.MODEL == 'CARQ') & (adeck_df.TAU == 0)].drop_duplicates(keep='last')
    #print(carq_data)
    #remove carq items from original
    final_df = adeck_df[adeck_df.MODEL != 'CARQ']
    #print(final_df)
    row_list = []
    for index, row in carq_data.iterrows():
        position_time = row['POSDATETIME']
        for index, arow in final_df.iterrows():
            if arow['POSDATETIME'] == position_time:
                # match, so do calculations
                storm_id = arow['STORMID']
                model_base_time = arow['MODELDATETIME']
                the_hour = arow['TAU']
                the_model = arow['MODEL']
                point1 = float(row['LAT']), float(row['LON'])
                point2 = float(arow['LAT']), float(arow['LON'])
                if arow['LAT'] == 0.0:
                    dist_error = None
                else:
                    dist_error = int(round(haversine(point1, point2, miles=True)))
                if arow['WIND'] != 0:
                    wind_error = int(abs(int(row['WIND']) - int(arow['WIND'])))
                else:
                    wind_error = None
                if arow['PRES'] != 0:
                    pressure_error = int(abs(int(row['PRES']) - int(arow['PRES'])))
                else:
                    pressure_error = None
                lat_carq = row['LAT']
                lon_carq = row['LON']
                lat_model = arow['LAT']
                lon_model = arow['LON']
                wind_carq = row['WIND']
                wind_model = arow['WIND']
                pres_carq = row['PRES']
                pres_model = arow['PRES']
                row_list.append([storm_id, model_base_time, the_model, the_hour, lat_carq, lon_carq, lat_model, lon_model, dist_error,
                                 wind_carq, wind_model, wind_error, pres_carq, pres_model, pressure_error])
    result_df = pd.DataFrame(row_list)
    result_df = result_df.where((pd.notnull(result_df)), None)
    result_cols = ['StormID', 'ModelBasetime', 'Model' , 'Tau',
                   'LatCARQ', 'LonCARQ', 'LatModel', 'LonModel', 'DistError',
                   'WindCARQ', 'WindModel','WindError',
                   'PresCARQ', 'PresModel','PresError']
    result_df.columns = result_cols
calculate_adeck_errors(infile)
To clarify what I'm doing:
1. The CARQ entries are the control (actual).
2. The other models are the guesses.
3. I'm comparing the control (CARQ) to the guesses to see what their errors are.
4. The basis of the comparison is the MODELBASETIME = POSBASETIME
5. A sample file I'm processing is here: http://vortexweather.com/downloads/adeck/aal062018.csv
I was hoping there is a faster way than the one I'm using, or another pandas method besides iterrows.
Many thanks for any suggestions.
Bryan
This code takes about 10 seconds to run your entire dataset!
The code looks very similar to what you have written, except that the inner loop is replaced by a Boolean mask and column-wise operations inside main_function. For background, see the article "Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects".
2018-09-13_adeck_error_calculations.ipynb
import pandas as pd
import numpy as np
import datetime
from haversine import haversine
def main_function(df, row):
    """
    The main difference here is that everything is vectorized
    Returns: DataFrame
    """
    df_new = pd.DataFrame()
    df_storage = pd.DataFrame()
    pos_datetime = df.POSDATETIME.isin([row['POSDATETIME']]) # creates a Boolean map
    array_len = len(pos_datetime)
    new_index = pos_datetime.index
    df_new['StormID'] = df.loc[pos_datetime, 'STORMID']
    df_new['ModelBaseTime'] = df.loc[pos_datetime, 'MODELDATETIME']
    df_new['Model'] = df.loc[pos_datetime, 'MODEL']
    df_new['Tau'] = df.loc[pos_datetime, 'TAU']
    # Distance
    df_new['LatCARQ'] = pd.DataFrame(np.full((array_len, 1), row['LAT']), index=new_index).loc[pos_datetime, 0]
    df_new['LonCARQ'] = pd.DataFrame(np.full((array_len, 1), row['LON']), index=new_index).loc[pos_datetime, 0]
    df_new['LatModel'] = df.loc[pos_datetime, 'LAT']
    df_new['LonModel'] = df.loc[pos_datetime, 'LON']
    def calc_dist_error(row):
        return round(haversine((row['LatCARQ'], row['LonCARQ']), (row['LatModel'], row['LonModel']), miles=True)) if row['LatModel'] != 0.0 else None
    df_new['DistError'] = df_new.apply(calc_dist_error, axis=1)
    # Wind
    df_new['WindCARQ'] = pd.DataFrame(np.full((array_len, 1), row['WIND']), index=new_index).loc[pos_datetime, 0]
    df_new['WindModel'] = df.loc[pos_datetime, 'WIND']
    df_storage['row_WIND'] = pd.DataFrame(np.full((array_len, 1), row['WIND']), index=new_index).loc[pos_datetime, 0]
    df_storage['df_WIND'] = df.loc[pos_datetime, 'WIND']
    def wind_error_calc(row):
        # abs() matches the absolute wind error computed in the question
        return abs(row['row_WIND'] - row['df_WIND']) if row['df_WIND'] != 0 else None
    df_new['WindError'] = df_storage.apply(wind_error_calc, axis=1)
    # Air Pressure
    df_new['PresCARQ'] = pd.DataFrame(np.full((array_len, 1), row['PRES']), index=new_index).loc[pos_datetime, 0]
    df_new['PresModel'] = df.loc[pos_datetime, 'PRES']
    df_storage['row_PRES'] = pd.DataFrame(np.full((array_len, 1), row['PRES']), index=new_index).loc[pos_datetime, 0]
    df_storage['df_PRES'] = df.loc[pos_datetime, 'PRES']
    def pres_error_calc(row):
        return abs(row['row_PRES'] - row['df_PRES']) if row['df_PRES'] != 0 else None
    df_new['PresError'] = df_storage.apply(pres_error_calc, axis=1)
    del(df_storage)
    return df_new
def calculate_adeck_errors(in_file):
    """
    Returns: DataFrame
    """
    print(f'Starting Data Calculations: {datetime.datetime.now().strftime("%I:%M:%S%p on %B %d, %Y")}')
    pd.set_option('display.max_columns', 20)
    pd.set_option('display.max_rows', 300)
    # read in the raw csv
    adeck_df = pd.read_csv(in_file)
    adeck_df['MODELDATETIME'] = pd.to_datetime(adeck_df['MODELDATETIME'], format='%Y-%m-%d %H:%M')
    adeck_df['POSDATETIME'] = pd.to_datetime(adeck_df['POSDATETIME'], format='%Y-%m-%d %H:%M')
    #extract only the carq items and remove duplicates
    carq_data = adeck_df[(adeck_df.MODEL == 'CARQ') & (adeck_df.TAU == 0)].drop_duplicates(keep='last')
    print('Len carq_data: ', len(carq_data))
    #remove carq items from original
    final_df = adeck_df[adeck_df.MODEL != 'CARQ']
    print('Len final_df: ', len(final_df))
    # collect the per-row results and concatenate once
    # (DataFrame.append was removed in pandas 2.0)
    frames = []
    for index, row in carq_data.iterrows():
        frames.append(main_function(final_df, row)) # function call
    df_out_new = pd.concat(frames, sort=False)
    df_out_new = df_out_new.reset_index(drop=True)
    df_out_new = df_out_new.where((pd.notnull(df_out_new)), None)
    print(f'Finishing Data Calculations: {datetime.datetime.now().strftime("%I:%M:%S%p on %B %d, %Y")}')
    return df_out_new
in_file = 'aal062018.csv'
df = calculate_adeck_errors(in_file)
>>>Starting Data Calculations: 02:18:30AM on September 13, 2018
>>>Len carq_data: 56
>>>Len final_df: 137999
>>>Finishing Data Calculations: 02:18:39AM on September 13, 2018
print(len(df))
>>>95630
print(df.head(20))
Please don't forget to check the accepted solution. Enjoy!
Looks like you are creating two dataframes out of the same dataframe, and then processing them. Two things that may cut your time.
First, you are iterating over both dataframes and checking for a condition:
for _, row in carq_data.iterrows():
    for _, arow in final_df.iterrows():
        if arow['POSDATETIME'] == row['POSDATETIME']:
            # do something by using both tables
This is essentially an implementation of a join. You are joining carq_data with final_df on 'POSDATETIME'.
As a first step, you should merge the tables:
merged = carq_data.merge(final_df, on=['POSDATETIME'])
At this point you will get one row for every pair of rows that share the same 'POSDATETIME'. In the example below, column a plays the role of POSDATETIME:
>>> a
   a   b
0  1  11
1  1  33
>>> b
   a  b
0  1  2
1  1  3
2  1  4
>>> merged = a.merge(b, on=['a'])
>>> merged
   a  b_x  b_y
0  1   11    2
1  1   11    3
2  1   11    4
3  1   33    2
4  1   33    3
5  1   33    4
Now, to do your conditional calculations, you can use the apply() function.
First, define a function:
def calc_dist_error(row):
    return int(round(haversine(row['b_x'], row['b_y'], miles=True))) if row['a'] != 0.0 else None
Then apply it to every row:
merged['dist_error'] = merged.apply(calc_dist_error, axis=1)
Continuing my small example:
>>> merged['c'] = [1, 0, 0, 0, 2, 3]
>>> merged
   a  b_x  b_y  c
0  1   11    2  1
1  1   11    3  0
2  1   11    4  0
3  1   33    2  0
4  1   33    3  2
5  1   33    4  3
>>> def foo(row):
...     return row['b_x'] - row['b_y'] if row['c'] != 0 else None
...
>>> merged['dist_error'] = merged.apply(foo, axis=1)
>>> merged
   a  b_x  b_y  c  dist_error
0  1   11    2  1         9.0
1  1   11    3  0         NaN
2  1   11    4  0         NaN
3  1   33    2  0         NaN
4  1   33    3  2        30.0
5  1   33    4  3        29.0
This should help you reduce the run time (you can measure it with %timeit). Hope this helps!
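Applied to the column names from the question, the merge idea might look like the sketch below. The data is a tiny made-up sample, not taken from the real file, and only the wind error is computed; the suffixes 'Model' and 'CARQ' and the resulting column names WINDModel and WINDCARQ are choices made here for illustration.

```python
import pandas as pd

# Made-up sample rows; the column names follow the question.
adeck_df = pd.DataFrame({
    'MODEL': ['CARQ', 'AVNO', 'HWRF', 'CARQ', 'AVNO'],
    'TAU': [0, 12, 24, 0, 12],
    'POSDATETIME': ['2018-08-30 12:00', '2018-08-30 12:00', '2018-08-30 12:00',
                    '2018-08-31 00:00', '2018-08-31 00:00'],
    'WIND': [30, 25, 0, 35, 45],
})

carq = adeck_df[(adeck_df['MODEL'] == 'CARQ') & (adeck_df['TAU'] == 0)]
models = adeck_df[adeck_df['MODEL'] != 'CARQ']

# The join replaces the nested loops: every model row is paired with the
# CARQ row that has the same POSDATETIME.
merged = models.merge(carq[['POSDATETIME', 'WIND']], on='POSDATETIME',
                      suffixes=('Model', 'CARQ'))

# Column-wise absolute error; rows where the model wind is 0 become NaN,
# mirroring the "else None" branch in the question.
wind_error = (merged['WINDCARQ'] - merged['WINDModel']).abs()
merged['WindError'] = wind_error.where(merged['WINDModel'] != 0)

print(merged)
```

The pressure error could follow the same pattern. The distance error would still need a haversine computation per row, or a NumPy implementation of the formula if apply turns out to be the bottleneck.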

Sorting a text document python 3

Here's the text document: The first string is the type of metal, the second is the amount of the metal bars, the third is the weight, and the fourth is the value.
Gold 1 5 750
Silver 1 1 400
Rhodium 1 4 500
Platinum 1 6 1000
I have to sort this list by value using insertion sort. Here's what I have so far
def sortMetalsByValuePerBar(metals):
    for i in range(1,len(metals)):
        j = i
        while j > 0 and metals[j-1] > metals[j]:
            metals[j - 1], metals[j] = metals[j], metals[j - 1]
            j -= 1
    return metals
Is this correct?
Try to learn from this solution.
data="""
Gold 1 5 750
Silver 1 1 400
Rhodium 1 4 500
Platinum 1 6 1000
"""
data = filter(None, data.splitlines())
data = [l.split() for l in data]
data = [ (l[0], int(l[1]), int(l[2]), int(l[3])) for l in data ]
def insertion_sort(l, keyfunc=lambda i:i):
    for i in range(1, len(l)):
        j = i-1
        key = l[i]
        # check j >= 0 first so that l[j] is never read with j == -1
        while j >= 0 and keyfunc(l[j]) > keyfunc(key):
            l[j+1] = l[j]
            j -= 1
        l[j+1] = key
insertion_sort(data, keyfunc=lambda l: l[3])
for l in data:
    print(l)
# Output:
# ('Silver', 1, 1, 400)
# ('Rhodium', 1, 4, 500)
# ('Gold', 1, 5, 750)
# ('Platinum', 1, 6, 1000)
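The swap-based loop from the question also works once it compares the value field rather than whole entries. The sketch below assumes each line is parsed into a (name, bars, weight, value) tuple first; the function name sort_metals_by_value is introduced here for illustration. If metals holds the raw lines as strings, metals[j-1] > metals[j] compares them alphabetically, which is shown for contrast.

```python
def sort_metals_by_value(metals):
    # metals: list of (name, bars, weight, value) tuples, sorted in place.
    for i in range(1, len(metals)):
        j = i
        # Compare only the value field (index 3), not the whole tuple.
        while j > 0 and metals[j - 1][3] > metals[j][3]:
            metals[j - 1], metals[j] = metals[j], metals[j - 1]
            j -= 1
    return metals


lines = ["Gold 1 5 750", "Silver 1 1 400", "Rhodium 1 4 500", "Platinum 1 6 1000"]

metals = []
for line in lines:
    name, bars, weight, value = line.split()
    metals.append((name, int(bars), int(weight), int(value)))

# Comparing raw strings orders the lines alphabetically, not by value.
print(sorted(lines))

for metal in sort_metals_by_value(metals):
    print(metal)
```

Converting the numbers with int matters as well: compared as strings, '1000' sorts before '400'.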

Resources