exception handling attempt in pandas - python-3.x

I am having difficulty creating two columns, "Home Score" and "Away Score", in the wikipedia table I am trying to parse.
I tried the following script with two try-except-else statements to see if that would work.
import pandas as pd
test_matches = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
test_matches = test_matches[1]
test_matches['Year'] = test_matches['Date'].str[-4:].apply(pd.to_numeric)
test_matches_worst = test_matches[(test_matches['Winner'] != 'Wales') & (test_matches['Year'] >= 2007) & (test_matches['Competition'].str.contains('Nations'))]
try:
    test_matches_worst['Home Score'] = test_matches_worst['Score'].str.split("–").str[0].apply(pd.to_numeric)
except:
    print("let's try again")
else:
    test_matches_worst['Home Score'] = test_matches_worst['Score'].str.split("-").str[0].apply(pd.to_numeric)
try:
    test_matches_worst['Away Score'] = test_matches_worst['Score'].str.split("–").str[1].apply(pd.to_numeric)
except:
    print("let's try again")
else:
    test_matches_worst['Away Score'] = test_matches_worst['Score'].str.split("-").str[1].apply(pd.to_numeric)
test_matches_worst['Margin'] = (test_matches_worst['Home Score'] - test_matches_worst['Away Score']).abs()
test_matches_worst.sort_values('Margin', ascending=False).reset_index(drop = True)#.head(20)
However, I receive a KeyError, and the "Home Score" column is not displayed in the dataframe when I shorten the code. What is the best way to handle this particular table and to generate the columns that I want? Any assistance on this would be greatly appreciated. Thanks in advance.

The problem with the data you collected is the dash character. In every row except the last, the score separator is an en dash (U+2013), not a hyphen (U+002D), so split on either character:
sep = r'[-\u2013]'
# df is test_matches_worst
df[['Home Score','Away Score']] = df['Score'].str.split(sep, expand=True).astype(int)
df['Margin'] = df['Home Score'].sub(df['Away Score']).abs()
Output:
>>> df[['Score', 'Home Score', 'Away Score', 'Margin']]
Score Home Score Away Score Margin
565 9–19 9 19 10
566 21–9 21 9 12
567 32–21 32 21 11
568 23–20 23 20 3
593 21–16 21 16 5
595 15–17 15 17 2
602 30–17 30 17 13
604 20–26 20 26 6
605 27–12 27 12 15
614 19–26 19 26 7
618 28–9 28 9 19
644 22–30 22 30 8
656 26–3 26 3 23
658 29–18 29 18 11
666 16–21 16 21 5
679 16–16 16 16 0
682 25–21 25 21 4
693 16–21 16 21 5
694 29–13 29 13 16
696 20–18 20 18 2
704 12–6 12 6 6
705 37–27 37 27 10
732 24–14 24 14 10
733 23–27 23 27 4
734 33–30 33 30 3
736 10–14 10 14 4
737 32–9 32 9 23
739 13–24 13 24 11
745 32–30 32 30 2
753 29-7 29 7 22
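The effect of the character class can be checked without pandas. This is only a minimal sketch using the standard library's re module; the helper name split_score is made up for illustration and is not part of the answer above.

```python
import re

# Matches either a hyphen-minus (U+002D) or an en dash (U+2013).
SEP = re.compile(r"[-\u2013]")

def split_score(score):
    """Split a score such as '9\u201319' or '29-7' into two integers."""
    home, away = SEP.split(score.strip(), maxsplit=1)
    return int(home), int(away)

# One score written with an en dash, one with a plain hyphen.
for raw in ["9\u201319", "29-7"]:
    home, away = split_score(raw)
    print(raw, home, away, abs(home - away))
```

Both spellings of the separator end up in the same two integer fields, which is what the single str.split call above relies on.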
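Note: you will probably receive a SettingWithCopyWarning, because test_matches_worst is a filtered selection of test_matches.
To avoid it, take an explicit copy of the filtered frame by appending .copy() to the expression that creates test_matches_worst.
Bonus
Pandas functions such as to_datetime, to_timedelta, or to_numeric accept a Series, so you can avoid apply: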
test_matches['Year'] = pd.to_numeric(test_matches['Date'].str[-4:])
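As for the try/except/else attempt in the question: an else block runs only when the try block raised nothing, so the hyphen fallback placed there never runs when the en dash split fails, and the column is never created (hence the KeyError later). If a fallback is wanted at all, it belongs in the except branch. A small sketch on plain strings, with a hypothetical helper name parse_home_score:

```python
def parse_home_score(score):
    """Return the number before the separator, trying en dash first."""
    try:
        # "29-7".split("\u2013") leaves "29-7" intact, so int() raises ValueError.
        return int(score.split("\u2013")[0])
    except ValueError:
        # The fallback goes here, not in an else block:
        # else only runs when the try block succeeded.
        return int(score.split("-")[0])

print(parse_home_score("9\u201319"))
print(parse_home_score("29-7"))
```

Catching ValueError specifically, rather than using a bare except, keeps unrelated errors visible.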

Related

Create new column in data frame by interpolating other column in between a particular date range - Pandas

I have a DataFrame with the following data.
Date y
0 2020-06-14 127
1 2020-06-15 216
2 2020-06-16 4
3 2020-06-17 90
4 2020-06-18 82
5 2020-06-19 70
6 2020-06-20 59
7 2020-06-21 48
8 2020-06-22 23
9 2020-06-23 25
10 2020-06-24 24
11 2020-06-25 22
12 2020-06-26 19
13 2020-06-27 10
14 2020-06-28 18
15 2020-06-29 157
16 2020-06-30 16
17 2020-07-01 14
18 2020-07-02 343
The code to create the data frame.
# Create a dummy dataframe
import pandas as pd
import numpy as np
y0 = [127,216,4,90, 82,70,59,48,23,25,24,22,19,10,18,157,16,14,343]
def initial_forecast(data):
    data['y'] = y0
    return data
# Initial date dataframe
df_dummy = pd.DataFrame({'Date': pd.date_range('2020-06-14', periods=19, freq='1D')})
# Dates
start_date = df_dummy.Date.iloc[1]
print(start_date)
end_date = df_dummy.Date.iloc[17]
print(end_date)
# Adding y0 in the dataframe
df_dummy = initial_forecast(df_dummy)
df_dummy
From the above I would like to interpolate the data for a particular date range.
I would like to interpolate linearly between 2020-06-17 and 2020-06-27.
That is, from 2020-06-17 to 2020-06-27 the 'y' value changes from 90 to 10 in 10 steps, so on average it decreases by 8 in each step:
(90 - 10) / 10 (number of steps) = 8 per step
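The step arithmetic can be checked in plain Python. This is just a sketch of the calculation described here, not the interpolation code itself; the variable names are illustrative.

```python
start_value, end_value, steps = 90, 10, 10

# Change per step: (10 - 90) / 10 = -8.0
step = (end_value - start_value) / steps

# Eleven values: the start, nine intermediate points, and the end.
values = [start_value + step * i for i in range(steps + 1)]
print(values)
```

The resulting sequence 90, 82, 74, ..., 18, 10 is what the y_new column should contain for the eleven dates in the range.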
The expected output:
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343
Note: In the remaining date range y_new value should be same as y value.
I tried the code below, but it does not give the desired output:
# Function
def df_interpolate(df, start_date, end_date):
    df["Date"]=pd.to_datetime(df["Date"])
    df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'y_new'] = np.nan
    df['y_new'] = df['y'].interpolate().round()
    return df
df1 = df_interpolate(df_dummy, '2020-06-17', '2020-06-27')
With a few tweaks to your function it works: use np.where to create the new column, drop the = from your comparisons so that the two endpoints keep their values, and cast to int as in your expected output.
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df['y_new'] = np.where((df['Date'] > start_date) & (df['Date'] < end_date), np.nan, df['y'])
    df['y_new'] = df['y_new'].interpolate().round().astype(int)
    return df
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343

random number generator issue: removing None at the end [closed]

I created a lottery number generator with Python 3.7. It shows however None at the end of each try. Here's my code.
import random
def lotto_gen():
    n = 1
    while n < 7:
        print(random.randint(1, 45), end='\t')
        n += 1
    return
for numbers in range(100):
    print(lotto_gen())
And the result goes like this:
6 12 42 37 13 44 None
36 31 32 41 4 30 None
20 31 38 42 14 19 None
8 18 29 22 34 29 None
26 34 15 1 20 38 None
10 17 28 35 22 38 None
23 34 42 22 4 43 None
25 16 17 36 17 4 None
44 8 20 1 43 43 None
29 32 9 2 8 5 None
16 44 35 17 42 10 None
5 1 39 28 21 40 None
35 25 12 31 23 21 None
13 25 9 10 41 7 None
12 34 14 36 27 5 None
32 30 12 5 41 14 None
23 30 5 30 7 9 None
38 25 6 17 17 20 None
12 1 13 10 30 32 None
15 1 3 23 28 6 None
1 2 24 33 36 31 None
28 13 42 39 9 39 None
41 44 2 9 41 34 None
25 19 30 26 8 44 None
39 36 44 4 22 7 None
7 44 29 38 1 8 None
37 6 44 6 41 11 None
29 29 23 40 23 36 None
25 39 30 40 40 4 None
28 14 33 4 15 34 None
41 35 7 26 30 24 None
10 34 26 45 12 10 None
32 6 45 16 24 18 None
14 7 8 26 32 4 None
22 43 40 3 20 31 None
6 42 38 11 18 20 None
6 40 5 18 25 29 None
37 19 26 19 45 41 None
39 8 17 19 17 22 None
I want to remove that None. Can someone tell me how I can edit my code?
Rakesh has given the correct answer, but I would like to explain why your code isn't working. Note that you initialize n = 1 inside lotto_gen() and the while condition is n < 7, so the loop body runs only 6 times and each call generates 6 random numbers.
The reason you see None at the end is that you print the value returned by lotto_gen(), but the return statement in your function is empty, so the function returns None and that is what gets printed.
To remove the None, don't print inside the function; instead, build a list of the values for each call and return it. If you want 7 values per call rather than 6, also initialize n as n = 0. The code then looks like this:
import random
def lotto_gen():
    n = 0
    a=[]
    while n < 7:
        a.append(random.randint(1, 45))
        n += 1
    return a
for numbers in range(100):
    print(lotto_gen())
You can use this approach too and my code will execute faster as well! :P
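To see the None behaviour in isolation, here is a small sketch. The function names and the count, low, high and seed parameters are made up for illustration; they are not from the question.

```python
import random

def draw_without_value():
    # A bare `return` (or no return statement at all) hands None to the caller.
    return

def draw_numbers(count=6, low=1, high=45, seed=None):
    # Hypothetical helper: returns the numbers instead of printing them.
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(count)]

print(draw_without_value())  # this is where the stray None comes from
print("\t".join(str(n) for n in draw_numbers()))
```

Returning the values and printing them once, at the call site, leaves nothing for print to report as None.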
This is one approach.
Ex:
import random
def lotto_gen():
    return "\t".join(str(random.randint(1, 45)) for _ in range(6))
for numbers in range(100):
    print(lotto_gen())

Scraping html data from a web site with <li> tags

I am trying to get data from this lottery website:
https://www.lotterycorner.com/tx/lotto-texas/2019
The data I would like to scrape is the dates and the winning numbers for 2017 to 2019. Then I would like to convert the data into a list and save it to a CSV or Excel file.
I apologize if my question isn't clear; I am new to Python. Here is the code I tried, but I don't know what to do after this:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2017')
soup = BeautifulSoup(page.content,'html.parser')
week = soup.find(class_='win-number-table row no-brd-reduis')
dates = (week.find_all(class_='win-nbr-date col-sm-3 col-xs-4'))
wn = (week.find_all(class_='nbr-grp'))
I would like my result to be something like this:
Don't use BeautifulSoup if there are table tags. It's much easier to let pandas do the work for you (read_html parses tables with lxml or BeautifulSoup under the hood).
import pandas as pd
years = [2017, 2018, 2019]
frames = []
for year in years:
    url = 'https://www.lotterycorner.com/tx/lotto-texas/%s' % year
    table = pd.read_html(url)[0][1:]
    win_nums = table.loc[:,1].str.split(" ",expand=True).reset_index(drop=True)
    dates = pd.DataFrame(list(table.loc[:,0]), columns=['date'])
    frames.append(dates.merge(win_nums, left_index=True, right_index=True))
# DataFrame.append was removed in pandas 2.0, so concatenate the yearly frames instead
df = pd.concat(frames, ignore_index=True)
df['date']= pd.to_datetime(df['date'])
df = df.sort_values('date').reset_index(drop=True)
df.to_csv('file.csv', index=False, header=False)
Output:
print (df)
date 0 1 2 3 4 5
0 2017-01-04 5 7 36 39 40 44
1 2017-01-07 2 5 14 18 26 27
2 2017-01-11 4 13 16 19 43 51
3 2017-01-14 7 8 10 18 47 48
4 2017-01-18 6 11 17 37 40 49
5 2017-01-21 2 13 17 39 41 46
6 2017-01-25 1 14 19 32 37 46
7 2017-01-28 5 7 30 48 51 52
8 2017-02-01 12 19 26 29 37 54
9 2017-02-04 8 13 19 25 26 29
10 2017-02-08 10 15 47 49 51 52
11 2017-02-11 24 25 26 29 41 53
12 2017-02-15 1 4 5 43 53 54
13 2017-02-18 5 11 14 21 38 44
14 2017-02-22 4 8 21 27 52 53
15 2017-02-25 16 37 42 46 49 54
16 2017-03-01 3 24 33 34 45 51
17 2017-03-04 2 4 5 17 48 50
18 2017-03-08 15 19 24 33 34 47
19 2017-03-11 5 6 24 28 29 37
20 2017-03-15 4 11 19 27 32 46
21 2017-03-18 12 15 16 23 38 43
22 2017-03-22 3 5 15 27 36 52
23 2017-03-25 21 25 27 30 36 48
24 2017-03-29 7 9 11 18 23 43
25 2017-04-01 3 21 28 33 38 52
26 2017-04-05 8 20 21 26 51 52
27 2017-04-08 10 11 12 47 48 52
28 2017-04-12 5 26 30 31 46 54
29 2017-04-15 2 11 36 40 42 53
.. ... .. .. .. .. .. ..
265 2019-07-20 3 35 38 45 50 51
266 2019-07-24 2 9 16 22 46 49
267 2019-07-27 1 2 6 8 20 53
268 2019-07-31 20 24 34 36 41 44
269 2019-08-03 6 17 18 20 26 34
270 2019-08-07 1 3 16 22 31 35
271 2019-08-10 18 19 27 36 48 52
272 2019-08-14 22 23 29 36 39 49
273 2019-08-17 14 18 21 23 40 44
274 2019-08-21 18 28 29 36 48 52
275 2019-08-24 11 31 42 48 50 52
276 2019-08-28 9 21 40 42 49 53
277 2019-08-31 5 7 30 41 44 54
278 2019-09-04 4 26 36 37 45 50
279 2019-09-07 22 23 31 33 40 42
280 2019-09-11 8 11 12 30 31 49
281 2019-09-14 1 3 24 28 31 41
282 2019-09-18 3 24 26 29 45 50
283 2019-09-21 2 20 31 43 45 54
284 2019-09-25 5 9 26 38 41 44
285 2019-09-28 16 18 39 45 49 54
286 2019-10-02 9 26 39 42 47 49
287 2019-10-05 6 10 18 24 32 37
288 2019-10-09 14 18 19 27 33 41
289 2019-10-12 3 11 15 29 44 49
290 2019-10-16 12 15 25 39 46 49
291 2019-10-19 19 29 41 46 50 51
292 2019-10-23 4 5 11 35 44 50
293 2019-10-26 1 2 26 41 42 54
294 2019-10-30 10 11 28 31 40 53
[295 rows x 7 columns]
The code below creates one CSV file per year, with all headers and values; in this example there will be 3 files: data_2017.csv, data_2018.csv and data_2019.csv.
You can add another year to years = ['2017', '2018', '2019'] if needed.
Winning Numbers are formatted as 1-2-3-4-5.
from bs4 import BeautifulSoup
import requests
import pandas as pd
base_url = 'https://www.lotterycorner.com/tx/lotto-texas/'
years = ['2017', '2018', '2019']
# reuse one session (and its connection) for all requests
with requests.Session() as s:
    for year in years:
        data = []
        page = s.get(f'{base_url}{year}')
        soup = BeautifulSoup(page.content, 'html.parser')
        rows = soup.select(".win-number-table tr")
        headers = [td.text.strip() for td in rows[0].find_all("td")]
        # remove header line
        del rows[0]
        for row in rows:
            td = [td.text.strip() for td in row.select("td")]
            # replace whitespaces in Winning Numbers with -
            td[headers.index("Winning Numbers")] = '-'.join(td[headers.index("Winning Numbers")].split())
            data.append(td)
        df = pd.DataFrame(data, columns=headers)
        df.to_csv(f'data_{year}.csv')
To save only Winning Numbers, replace df.to_csv(f'data_{year}.csv') with:
df.to_csv(f'data_{year}.csv', columns=["Winning Numbers"], index=False, header=False)
Example output for 2017, only Winning Numbers, no header:
9-14-16-27-45-51 2-4-15-38-48-53 8-22-23-29-34-36
6-10-11-22-30-45 5-10-16-22-26-46 12-14-19-34-39-47
4-5-10-21-34-40 1-25-35-42-48-51
This should export the data you need in a csv file:
from bs4 import BeautifulSoup
from csv import writer
import requests
page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2019')
soup = BeautifulSoup(page.content,'html.parser')
header = {
    'date': 'win-nbr-date col-sm-3 col-xs-4',
    'winning numbers': 'nbr-grp',
    'jackpot': 'win-nbr-jackpot col-sm-3 col-xs-3',
}
table = []
for header_key, header_value in header.items():
    items = soup.find_all(class_=f"{header_value}")
    column = [','.join(item.get_text().split()) if header_key=='winning numbers'
              else ''.join(item.get_text().split()) if header_key == 'jackpot'
              else item.get_text() for item in items]
    table.append(column)
rows = list(zip(*table))
# newline="" stops the csv module from writing blank lines between rows on Windows
with open("winning numbers.csv", "w", newline="") as f:
    csv_writer = writer(f)
    csv_writer.writerow(header)
    for row in rows:
        csv_writer.writerow(row)
header is a dictionary mapping what will be your csv headers to their html class values
In the for loop we're building up the data per column. Some special handling was required for "winning numbers" and "jackpot", where I'm replacing any whitespace/hidden characters with comma/empty string.
Each column will be added to a list called table. We write everything in a csv file, but as csv writes one row at a time, we need to prepare our rows using the zip function (rows = list(zip(*table)))
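The column-to-row step can be tried on its own. Below is a minimal sketch in which the values in table are made-up placeholders rather than scraped data, and an in-memory buffer stands in for the file.

```python
import csv
import io

# Placeholder columns standing in for the scraped values (not real draws).
table = [
    ["01/02/2019", "01/05/2019"],        # date column
    ["1,2,3,4,5,6", "7,8,9,10,11,12"],   # winning numbers column
]

# zip(*table) pairs up the i-th entry of every column, giving one tuple per row.
rows = list(zip(*table))

buffer = io.StringIO()
csv_writer = csv.writer(buffer)
csv_writer.writerow(["date", "winning numbers"])
csv_writer.writerows(rows)
print(buffer.getvalue())
```

Note that zip stops at the shortest column, so if one class matches fewer elements than the others, the extra entries are silently dropped.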
Here is a concise way with bs4 4.7.1+ that uses :not to exclude header and zip to combine columns for output. Results are as on page. Session is used for efficiency of tcp connection re-use.
import requests, re, csv
from bs4 import BeautifulSoup as bs
dates = []; winning_numbers = []
with requests.Session() as s:
    for year in range(2017, 2020):
        r = s.get(f'https://www.lotterycorner.com/tx/lotto-texas/{year}')
        soup = bs(r.content, 'html.parser')
        dates.extend([i.text for i in soup.select('.win-nbr-date:not(.blue-bg)')])
        winning_numbers.extend([re.sub(r'\s+','-',i.text.strip()) for i in soup.select('.nbr-list')])
with open("lottery.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['date','numbers'])
    for row in zip(dates, winning_numbers):
        w.writerow(row)
This one works:
import requests
from bs4 import BeautifulSoup
import re
def remove_html_tags(text):
    # drop anything that looks like an HTML tag
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
def main():
    page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2018')
    soup = BeautifulSoup(page.content,'html.parser')
    week = soup.find(class_='win-number-table row no-brd-reduis')
    wn = (week.find_all(class_='nbr-grp'))
    file = open ("vit.txt","w+")
    for winning_number in wn:
        line = remove_html_tags(str(winning_number.contents).strip('[]'))
        line = line.replace(" ", "")
        file.write(line + "\n")
    file.close()
if __name__ == "__main__":
    main()
This part of the code loops through the wn variable and writes every line to the "vit.txt" file:
for winning_number in wn:
    line = remove_html_tags(str(winning_number.contents).strip('[]'))
    line = line.replace(" ", "")
    file.write(line + "\n")
file.close()
The "stripping" of the <li> tags could be probably done better, e.g. there should be an elegant way to save the winning_number to a list and print the list with 1 line.

printing a string like a matrix

Trying to let the user input a number, and print a table according to the square of its size. Here's an example.
Size--> 3
0 1 2
3 4 5
6 7 8
Size--> 4
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15
Size--> 6
 0  1  2  3  4  5
 6  7  8  9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
Size--> 9
 0  1  2  3  4  5  6  7  8
 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44
45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62
63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80
Here is the code that I have tried:
length=int(input('Size--> '))
size=length*length
biglist=[]
for i in range(size):
    biglist.append(i)
biglist = [str(i) for i in biglist]
for i in range(0, len(biglist), length):
    print(' '.join(biglist[i: i+length]))
But instead, here's what I got:
Size--> 3
0 1 2
3 4 5
6 7 8
Size--> 4
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Size--> 6
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
As you can see the rows are not aligned properly like the example.
What's the simplest way of presenting it in a proper alignment? Thx :)
Use a format specification with right alignment (here inside an f-string).
strlen is the width reserved for each number: the number of digits in the largest value, plus one for the separating space.
length = int(input('Size--> '))
size = length*length
biglist = []
for i in range(size):
    biglist.append(i)
biglist = [str(i) for i in biglist]
strlen = len(str(length**2-1))+1
for i in range(0, len(biglist), length):
    # print(' '.join(biglist[i: i+length]))
    for x in biglist[i: i+length]:
        print(f"{x:>{strlen}}", end='')
    print()
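The same right alignment can also be written with str.rjust. This is a sketch under the assumption that the column width should equal the number of digits of the largest value (length*length - 1), with a single space between columns; format_grid is a hypothetical helper name.

```python
def format_grid(length):
    """Return the rows of a length x length grid as right-aligned strings."""
    width = len(str(length * length - 1))  # digits in the largest number
    lines = []
    for start in range(0, length * length, length):
        row = range(start, start + length)
        lines.append(" ".join(str(n).rjust(width) for n in row))
    return lines

print("\n".join(format_grid(4)))
```

Returning the lines instead of printing inside the loop makes the layout easy to check or reuse; unlike the strlen version above, this one adds no leading space in front of the widest numbers.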

R: Reversing the data in a time series object

I figured out a way to backcast (ie. predicting the past) with a time series. Now I'm just struggling with the programming in R.
I would like to reverse the time series data so that I can forecast the past. How do I do this?
Say the original time series object looks like this:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 116 99 115 101 112 120 120 110 143 136 147 142
2009 117 114 133 134 139 147 147 131 125 143 136 129
I want it to look like this for the 'backcasting':
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 129 136 143 125 131 147 147 139 134 133 114 117
2009 142 147 136 143 110 120 120 112 101 115 99 116
Note, I didn't forget to change the years - I am basically mirroring/reversing the data and keeping the years, then going to forecast.
I hope this can be done in R? Or should I export and do it in Excel somehow?
Try this:
tt <- ts(1:24, start = 2008, freq = 12)
tt[] <- rev(tt)
ADDED. This also works and does not modify tt :
replace(tt, TRUE, rev(tt))
You can just coerce the matrix to a vector, reverse it, and make it a matrix again. Here's an example:
mat <- matrix(seq(24),nrow=2,byrow=TRUE)
> mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 1 2 3 4 5 6 7 8 9 10 11 12
[2,] 13 14 15 16 17 18 19 20 21 22 23 24
> matrix( rev(mat), nrow=nrow(mat) )
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 24 23 22 21 20 19 18 17 16 15 14 13
[2,] 12 11 10 9 8 7 6 5 4 3 2 1
I found this post by Hyndman at http://www.r-bloggers.com/backcasting-in-r/ and am basically pasting in his solution, which in my opinion provides a complete answer to your question.
library(forecast)
x <- WWWusage
h <- 20
f <- frequency(x)
# Reverse time
revx <- ts(rev(x), frequency=f)
# Forecast
fc <- forecast(auto.arima(revx), h)
plot(fc)
# Reverse time again
fc$mean <- ts(rev(fc$mean),end=tsp(x)[1] - 1/f, frequency=f)
fc$upper <- fc$upper[h:1,]
fc$lower <- fc$lower[h:1,]
fc$x <- x
# Plot result
plot(fc, xlim=c(tsp(x)[1]-h/f, tsp(x)[2]))
