I have a dataframe with the following data:
Date y
0 2020-06-14 127
1 2020-06-15 216
2 2020-06-16 4
3 2020-06-17 90
4 2020-06-18 82
5 2020-06-19 70
6 2020-06-20 59
7 2020-06-21 48
8 2020-06-22 23
9 2020-06-23 25
10 2020-06-24 24
11 2020-06-25 22
12 2020-06-26 19
13 2020-06-27 10
14 2020-06-28 18
15 2020-06-29 157
16 2020-06-30 16
17 2020-07-01 14
18 2020-07-02 343
Here is the code to create the dataframe:
# Create a dummy dataframe
import pandas as pd
import numpy as np

y0 = [127, 216, 4, 90, 82, 70, 59, 48, 23, 25, 24, 22, 19, 10, 18, 157, 16, 14, 343]

def initial_forecast(data):
    data['y'] = y0
    return data

# Initial date dataframe
df_dummy = pd.DataFrame({'Date': pd.date_range('2020-06-14', periods=19, freq='1D')})

# Dates
start_date = df_dummy.Date.iloc[1]
print(start_date)
end_date = df_dummy.Date.iloc[17]
print(end_date)

# Adding y0 to the dataframe
df_dummy = initial_forecast(df_dummy)
df_dummy
From the above I would like to linearly interpolate the data for a particular date range: from 2020-06-17 to 2020-06-27.
That is, from 2020-06-17 to 2020-06-27 the 'y' values should change from 90 to 10 in 10 equal steps, so each step reduces the value by (90 - 10) / 10 = 8.
The expected output:
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343
Note: in the remaining date range, the y_new value should be the same as the y value.
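As a quick sanity check on the step arithmetic above, the eleven values from 2020-06-17 to 2020-06-27 can be generated directly with np.linspace (a sketch, independent of any interpolation code):

```python
import numpy as np

# 11 dates inclusive from 2020-06-17 to 2020-06-27 means 10 equal steps of (90-10)/10 = 8
steps = np.linspace(90, 10, 11)
print(steps)  # [90. 82. 74. 66. 58. 50. 42. 34. 26. 18. 10.]
```

These match the y_new column in the expected output exactly.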
I tried the code below, but it does not give the desired output:
# Function
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'y_new'] = np.nan
    df['y_new'] = df['y'].interpolate().round()
    return df

df1 = df_interpolate(df_dummy, '2020-06-17', '2020-06-27')
With some tweaks your function works: np.where to create the new column, removing the = from your comparisons (so the endpoints keep their original values), and casting to int as per your expected output.
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df['y_new'] = np.where((df['Date'] > start_date) & (df['Date'] < end_date), np.nan, df['y'])
    df['y_new'] = df['y_new'].interpolate().round().astype(int)
    return df
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343
I am trying to get data from this lottery website:
https://www.lotterycorner.com/tx/lotto-texas/2019
The data I would like to scrape is the dates and the winning numbers for 2017 to 2019. Then I would like to convert the data into a list and save it to a CSV or Excel file.
I apologize if my question isn't clear; I am new to Python. Here is the code I tried, but I don't know what to do after this:
page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2017')
soup = BeautifulSoup(page.content,'html.parser')
week = soup.find(class_='win-number-table row no-brd-reduis')
dates = (week.find_all(class_='win-nbr-date col-sm-3 col-xs-4'))
wn = (week.find_all(class_='nbr-grp'))
I would like my result to be something like this:
Don't use BeautifulSoup directly if there are table tags. It's much easier to let pandas do the work for you (pd.read_html uses BeautifulSoup to parse tables under the hood).
import pandas as pd

years = [2017, 2018, 2019]
df = pd.DataFrame()
for year in years:
    url = 'https://www.lotterycorner.com/tx/lotto-texas/%s' % year
    table = pd.read_html(url)[0][1:]
    win_nums = table.loc[:, 1].str.split(" ", expand=True).reset_index(drop=True)
    dates = pd.DataFrame(list(table.loc[:, 0]), columns=['date'])
    table = dates.merge(win_nums, left_index=True, right_index=True)
    # df.append is deprecated/removed in recent pandas; pd.concat does the same job
    df = pd.concat([df, table], sort=True).reset_index(drop=True)

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date').reset_index(drop=True)
df.to_csv('file.csv', index=False, header=False)
Output:
print (df)
date 0 1 2 3 4 5
0 2017-01-04 5 7 36 39 40 44
1 2017-01-07 2 5 14 18 26 27
2 2017-01-11 4 13 16 19 43 51
3 2017-01-14 7 8 10 18 47 48
4 2017-01-18 6 11 17 37 40 49
5 2017-01-21 2 13 17 39 41 46
6 2017-01-25 1 14 19 32 37 46
7 2017-01-28 5 7 30 48 51 52
8 2017-02-01 12 19 26 29 37 54
9 2017-02-04 8 13 19 25 26 29
10 2017-02-08 10 15 47 49 51 52
11 2017-02-11 24 25 26 29 41 53
12 2017-02-15 1 4 5 43 53 54
13 2017-02-18 5 11 14 21 38 44
14 2017-02-22 4 8 21 27 52 53
15 2017-02-25 16 37 42 46 49 54
16 2017-03-01 3 24 33 34 45 51
17 2017-03-04 2 4 5 17 48 50
18 2017-03-08 15 19 24 33 34 47
19 2017-03-11 5 6 24 28 29 37
20 2017-03-15 4 11 19 27 32 46
21 2017-03-18 12 15 16 23 38 43
22 2017-03-22 3 5 15 27 36 52
23 2017-03-25 21 25 27 30 36 48
24 2017-03-29 7 9 11 18 23 43
25 2017-04-01 3 21 28 33 38 52
26 2017-04-05 8 20 21 26 51 52
27 2017-04-08 10 11 12 47 48 52
28 2017-04-12 5 26 30 31 46 54
29 2017-04-15 2 11 36 40 42 53
.. ... .. .. .. .. .. ..
265 2019-07-20 3 35 38 45 50 51
266 2019-07-24 2 9 16 22 46 49
267 2019-07-27 1 2 6 8 20 53
268 2019-07-31 20 24 34 36 41 44
269 2019-08-03 6 17 18 20 26 34
270 2019-08-07 1 3 16 22 31 35
271 2019-08-10 18 19 27 36 48 52
272 2019-08-14 22 23 29 36 39 49
273 2019-08-17 14 18 21 23 40 44
274 2019-08-21 18 28 29 36 48 52
275 2019-08-24 11 31 42 48 50 52
276 2019-08-28 9 21 40 42 49 53
277 2019-08-31 5 7 30 41 44 54
278 2019-09-04 4 26 36 37 45 50
279 2019-09-07 22 23 31 33 40 42
280 2019-09-11 8 11 12 30 31 49
281 2019-09-14 1 3 24 28 31 41
282 2019-09-18 3 24 26 29 45 50
283 2019-09-21 2 20 31 43 45 54
284 2019-09-25 5 9 26 38 41 44
285 2019-09-28 16 18 39 45 49 54
286 2019-10-02 9 26 39 42 47 49
287 2019-10-05 6 10 18 24 32 37
288 2019-10-09 14 18 19 27 33 41
289 2019-10-12 3 11 15 29 44 49
290 2019-10-16 12 15 25 39 46 49
291 2019-10-19 19 29 41 46 50 51
292 2019-10-23 4 5 11 35 44 50
293 2019-10-26 1 2 26 41 42 54
294 2019-10-30 10 11 28 31 40 53
[295 rows x 7 columns]
The code below creates one CSV file per year, with all headers and values; in the example below there will be three files: data_2017.csv, data_2018.csv and data_2019.csv.
You can add another year to years = ['2017', '2018', '2019'] if needed.
Winning Numbers are formatted as 1-2-3-4-5.
from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'https://www.lotterycorner.com/tx/lotto-texas/'
years = ['2017', '2018', '2019']

with requests.session() as s:
    for year in years:
        data = []
        page = s.get(base_url + year)  # use the session so the TCP connection is re-used
        soup = BeautifulSoup(page.content, 'html.parser')
        rows = soup.select(".win-number-table tr")
        headers = [td.text.strip() for td in rows[0].find_all("td")]
        # remove header line
        del rows[0]
        for row in rows:
            td = [td.text.strip() for td in row.select("td")]
            # replace whitespace in Winning Numbers with -
            td[headers.index("Winning Numbers")] = '-'.join(td[headers.index("Winning Numbers")].split())
            data.append(td)
        df = pd.DataFrame(data, columns=headers)
        df.to_csv(f'data_{year}.csv')
To save only the Winning Numbers, replace the df.to_csv call with:
df.to_csv(f'data_{year}.csv', columns=["Winning Numbers"], index=False, header=False)
Example output for 2017, only Winning Numbers, no header (one draw per line):
9-14-16-27-45-51
2-4-15-38-48-53
8-22-23-29-34-36
6-10-11-22-30-45
5-10-16-22-26-46
12-14-19-34-39-47
4-5-10-21-34-40
1-25-35-42-48-51
This should export the data you need in a csv file:
from bs4 import BeautifulSoup
from csv import writer
import requests

page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2019')
soup = BeautifulSoup(page.content, 'html.parser')

header = {
    'date': 'win-nbr-date col-sm-3 col-xs-4',
    'winning numbers': 'nbr-grp',
    'jackpot': 'win-nbr-jackpot col-sm-3 col-xs-3',
}

table = []
for header_key, header_value in header.items():
    items = soup.find_all(class_=f"{header_value}")
    column = [','.join(item.get_text().split()) if header_key == 'winning numbers'
              else ''.join(item.get_text().split()) if header_key == 'jackpot'
              else item.get_text() for item in items]
    table.append(column)

rows = list(zip(*table))
with open("winning numbers.csv", "w", newline="") as f:
    csv_writer = writer(f)
    csv_writer.writerow(header)
    for row in rows:
        csv_writer.writerow(row)
header is a dictionary mapping what will become your CSV headers to their HTML class values.
In the for loop we build up the data per column. Some special handling is required for "winning numbers" and "jackpot", where any whitespace/hidden characters are replaced with a comma/empty string.
Each column is added to a list called table. We then write everything to a CSV file, but since csv writes one row at a time, we first need to prepare our rows using the zip function (rows = list(zip(*table))).
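To make the zip(*table) step concrete: it transposes a list of columns into a list of rows. A minimal illustration with made-up values:

```python
# Two columns of equal length: dates and winning numbers
table = [['2019-01-02', '2019-01-05'],       # dates column
         ['1-2-3-4-5-6', '7-8-9-10-11-12']]  # winning-numbers column

# zip(*table) pairs up the i-th element of every column into one row
rows = list(zip(*table))
print(rows)  # [('2019-01-02', '1-2-3-4-5-6'), ('2019-01-05', '7-8-9-10-11-12')]
```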
Here is a concise way with bs4 4.7.1+ that uses :not() to exclude the header row and zip to combine the columns for output. Results are as on the page. A Session is used for efficiency of TCP connection re-use.
import requests, re, csv
from bs4 import BeautifulSoup as bs

dates = []
winning_numbers = []

with requests.Session() as s:
    for year in range(2017, 2020):
        r = s.get(f'https://www.lotterycorner.com/tx/lotto-texas/{year}')
        soup = bs(r.content, 'html.parser')
        dates.extend([i.text for i in soup.select('.win-nbr-date:not(.blue-bg)')])
        winning_numbers.extend([re.sub(r'\s+', '-', i.text.strip()) for i in soup.select('.nbr-list')])

with open("lottery.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['date', 'numbers'])
    for row in zip(dates, winning_numbers):
        w.writerow(row)
This one works:
import requests
from bs4 import BeautifulSoup
import re

def main():
    page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2018')
    soup = BeautifulSoup(page.content, 'html.parser')
    week = soup.find(class_='win-number-table row no-brd-reduis')
    wn = week.find_all(class_='nbr-grp')
    file = open("vit.txt", "w+")
    for winning_number in wn:
        line = remove_html_tags(str(winning_number.contents).strip('[]'))
        line = line.replace(" ", "")
        file.write(line + "\n")
    file.close()

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

main()
This part of the code loops through the wn variable and writes every line to the "vit.txt" file:
for winning_number in wn:
    line = remove_html_tags(str(winning_number.contents).strip('[]'))
    line = line.replace(" ", "")
    file.write(line + "\n")
file.close()
The "stripping" of the <li> tags could probably be done better; e.g. there should be a more elegant way to collect the winning numbers into a list and write each one out in a single line.
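For that more elegant variant, get_text on each <li> avoids regex tag-stripping entirely. A sketch, assuming the same nbr-grp/li markup as the site (the HTML fragment here is a made-up stand-in):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the site's nbr-grp markup
html = '<ul class="nbr-grp"><li>4</li><li>8</li><li>15</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# One string per group: join the <li> texts, no regex needed
numbers = ['-'.join(li.get_text(strip=True) for li in ul.find_all('li'))
           for ul in soup.find_all(class_='nbr-grp')]
print(numbers)  # ['4-8-15']
```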
I have a dataset that, within each group of a second column, I need to filter so that rows are kept only from the point a value first exceeds a threshold onward. Here is my (non-working) attempt:
df2 = df.groupby(['UWI']).[df.DIP > 85].reset_index(drop = True)
where I have a dataframe that looks like this:
UWI DIP
0 17 70
1 17 80
2 17 90
3 17 80
4 17 83
5 2 62
6 2 75
7 2 87
8 2 91
I want the returned dataframe to look like this:
UWI DIP
0 17 90
1 17 80
2 17 83
3 2 87
4 2 91
This is a large dataframe so efficiency would be appreciated.
IIUC, using cummax:
df[df.DIP.gt(85).groupby(df['UWI']).cummax()]
UWI DIP
2 17 90
3 17 80
4 17 83
7 2 87
8 2 91
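To see why this works: df.DIP.gt(85) marks the rows over the threshold, and the group-wise cumulative max turns every row after the first True in each group into True as well. A small sketch of the mechanics on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'UWI': [17, 17, 17, 2, 2],
                   'DIP': [70, 90, 80, 62, 87]})

over = df.DIP.gt(85)                     # [False, True, False, False, True]
mask = over.groupby(df['UWI']).cummax()  # [False, True, True, False, True]
print(df[mask])                          # keeps 90, 80 (group 17) and 87 (group 2)
```

Both gt and cummax are vectorized, so this stays efficient on a large dataframe.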
I have a list of 1 column and 50 rows.
I want to divide it into 5 segments, and each segment has to become a column of a dataframe. I do not want NaN values to appear (figure 2). How can I solve that?
Like this:
df = pd.DataFrame(result_list)
AWA=df[:10]
REM=df[10:20]
S1=df[20:30]
S2=df[30:40]
SWS=df[40:50]
result = pd.concat([AWA, REM, S1, S2, SWS], axis=1)
result
Figure2
You can use numpy's reshape function:
import numpy as np
import pandas as pd

result_list = [i for i in range(50)]
pd.DataFrame(np.reshape(result_list, (10, 5), order='F'))
Out:
0 1 2 3 4
0 0 10 20 30 40
1 1 11 21 31 41
2 2 12 22 32 42
3 3 13 23 33 43
4 4 14 24 34 44
5 5 15 25 35 45
6 6 16 26 36 46
7 7 17 27 37 47
8 8 18 28 38 48
9 9 19 29 39 49
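order='F' fills column-by-column (Fortran order), which is what places each 10-row segment into its own column; the default order='C' would fill row-by-row instead. A quick comparison on a smaller list:

```python
import numpy as np

a = np.arange(6)
# Fortran order: fill down each column first
print(np.reshape(a, (3, 2), order='F'))  # [[0 3] [1 4] [2 5]]
# C order (default): fill across each row first
print(np.reshape(a, (3, 2), order='C'))  # [[0 1] [2 3] [4 5]]
```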
I am trying to calculate how many calls came back within the 95th percentile of time. Below is my result set. I am working with Excel 2010.
Milliseconds Number
0 1702
1 15036
2 14262
3 13190
4 9137
5 5635
6 3742
7 2628
8 1899
9 1298
10 963
11 727
12 503
13 415
14 311
15 235
16 204
17 140
18 109
19 83
20 72
21 55
22 52
23 35
24 33
25 25
26 15
27 18
28 14
29 15
30 13
31 19
32 23
33 19
34 21
35 20
36 25
37 26
38 13
39 12
40 10
41 17
42 6
43 7
44 8
45 4
46 7
47 9
48 11
49 12
50 9
51 9
52 9
53 8
54 10
55 10
56 11
57 3
58 7
59 7
60 2
61 5
62 7
63 5
64 5
65 2
66 3
67 2
68 1
70 1
71 2
72 1
73 4
74 1
75 1
76 1
77 3
80 1
81 1
85 1
87 2
93 1
96 1
100 1
107 1
112 1
116 1
125 1
190 1
356 1
450 1
492 1
497 1
554 1
957 1
Some background on what the above data means:
1702 calls came back in 0 milliseconds
15036 calls came back in 1 millisecond
14262 calls came back in 2 milliseconds
etc.
So to calculate the 95th percentile from the above data, I am using this formula in Excel 2010:
=PERCENTILE.EXC(IF(TRANSPOSE(ROW(INDIRECT("1:"&MAX(H$2:H$96))))<=H$2:H$96,A$2:A$96),0.95)
Can anyone tell me whether the way I am doing this in Excel 2010 is right or not?
I get a 95th percentile of 10 with the above approach.
Thanks for the help.
That's essentially the same question you asked here, with the formula I suggested. As per my last comments on that question, the formula should work OK as long as you enter it with CTRL+SHIFT+ENTER (it's an array formula). I also get 10 as the answer for this example using that formula.
I think you can verify manually that that is indeed the correct answer: put a running total in an adjacent column and you can see where the 95th percentile is reached.
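The running-total check can also be scripted: the 95th percentile of a frequency table is the smallest millisecond bucket at which the cumulative count reaches 95% of all calls. A sketch with a small illustrative table (made-up counts, not the data above; the full table works the same way):

```python
# Illustrative frequency table: {milliseconds: number of calls}
counts = {0: 2, 1: 5, 2: 3, 3: 1, 10: 1}
total = sum(counts.values())

# Walk the buckets in order, accumulating a running total of calls
cumulative = 0
for ms in sorted(counts):
    cumulative += counts[ms]
    if cumulative >= 0.95 * total:
        print(f"95th percentile reached at {ms} ms")
        break
```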