My first post on Stack Overflow, and I am also fairly new to the world of Beautiful Soup, so please be kind.
I am trying to learn web scraping, and in particular I want to be able to extract data from a government website. See the code extract below:
import requests
from bs4 import BeautifulSoup

url = 'https://data.gov.sg/dataset/rainfall-monthly-total?view_id=be92ce92-d02e-4920-9479-ff6da87b45e8&resource_id=778814b8-1b96-404b-9ac9-68d6c00e637b'

# Persists parameters across requests
s = requests.Session()
r = s.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# Look for the iframe and grab its src attribute
iframe_src = soup.select_one("iframe").attrs["src"]
r = s.get(iframe_src)
print("Data table is extracted from:", iframe_src)
soup = BeautifulSoup(r.content, "html.parser")
# print(soup.prettify())

rows = soup.find_all("tr", {'class': ['odd', 'even']})
dat = []
for row in rows:
    dat.append([td.get_text() for td in row.find_all('td')])  # collect cell text, not lazy map objects
As you can see, I was just trying to directly extract the table data, with the eventual goal of converting it to CSV. However, for some reason, I can't seem to see any table data when I execute the code.
Upon further reading, I learnt that BeautifulSoup does not execute JavaScript. I suspect that is the cause, and my preliminary investigation of the data set seems to indicate this is the case, but I am not sure. Thus:
1) Is the JavaScript rendering really the reason for the lack of output?
2) I read that Selenium can be used; how do I incorporate it here?
3) I have multiple pages; how do I get all the data into one CSV table?
Thank you in advance.
If you are learning to web scrape with Beautiful Soup, I think you have the right idea to go after a table (although once you learn to extract <table> tags with Beautiful Soup, you should really use pandas, as it'll make things a lot easier). Tables are a good way to understand how to work through tags in HTML.
I would advise starting with a different site. You are correct that this page is rendered with JavaScript, so you would need to have the HTML rendered first (by something like Selenium) before you can parse it. Since you are just starting off, don't overcomplicate things by having to learn bs4 and Selenium at the same time. Here's a good site to practice scraping a table; still a little tricky, but it's a static table to play on.
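That said, since you asked where Selenium would fit in: the pattern is render first, parse second. Here is a minimal sketch (it assumes a chromedriver on your PATH, and uses a fixed sleep as a crude stand-in for a proper wait):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://data.gov.sg/dataset/rainfall-monthly-total?view_id=be92ce92-d02e-4920-9479-ff6da87b45e8&resource_id=778814b8-1b96-404b-9ac9-68d6c00e637b'

driver = webdriver.Chrome()  # assumes chromedriver is installed and on your PATH
driver.get(url)
time.sleep(5)                # crude wait for the JavaScript to finish rendering
html = driver.page_source    # the rendered HTML, iframe tags and all
driver.quit()

# From here on, parse exactly as before with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")

For the iframe content you would still need a second driver.get() on the iframe's src, just as in your requests version.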
This table can be pulled from a separate request URL that returns a JSON response. You can take that JSON and easily turn it into a table with pandas (see the code below). An API like this, when available, is the preferred method: the data comes in a nice JSON structure (sometimes not as nice, but at least consistent), and you don't necessarily have to iterate through "pages":
Code:
import pandas as pd
import requests
url = 'https://data.gov.sg/dataset/5942f8bd-4240-4f68-acd2-a5a276958237/resource/778814b8-1b96-404b-9ac9-68d6c00e637b/data'
payload = {'limit': '2000'}  # request up to 2000 records so the whole series comes back in one response
# Persists parameters across requests
s = requests.Session()
jsonData = s.get(url, params=payload).json()
df = pd.DataFrame(jsonData['records'])
Output:
print (df)
_id month total_rainfall
0 459 2020-03 108.8
1 458 2020-02 65
2 457 2020-01 88.4
3 456 2019-12 421.5
4 455 2019-11 137.4
5 454 2019-10 176.8
6 453 2019-09 22.8
7 452 2019-08 11.8
8 451 2019-07 12.2
9 450 2019-06 173.8
10 449 2019-05 69
11 448 2019-04 174.8
12 447 2019-03 72.2
13 446 2019-02 31.6
14 445 2019-01 63.6
15 444 2018-12 172.6
16 443 2018-11 169.6
17 442 2018-10 234.4
18 441 2018-09 144.4
19 440 2018-08 121.6
20 439 2018-07 143.2
21 438 2018-06 182.6
22 437 2018-05 132.2
23 436 2018-04 61.2
24 435 2018-03 44.6
25 434 2018-02 14.8
26 433 2018-01 287
27 432 2017-12 371.2
28 431 2017-11 268.6
29 430 2017-10 120.8
.. ... ... ...
429 30 1984-06 255
430 29 1984-05 186.5
431 28 1984-04 153.1
432 27 1984-03 361.3
433 26 1984-02 470.4
434 25 1984-01 251.2
435 24 1983-12 370.6
436 23 1983-11 228.8
437 22 1983-10 212.7
438 21 1983-09 170.8
439 20 1983-08 262.2
440 19 1983-07 190
441 18 1983-06 94
442 17 1983-05 160.8
443 16 1983-04 33.6
444 15 1983-03 18.6
445 14 1983-02 5.6
446 13 1983-01 246
447 12 1982-12 494.1
448 11 1982-11 134
449 10 1982-10 39.3
450 9 1982-09 54.1
451 8 1982-08 169.5
452 7 1982-07 76.3
453 6 1982-06 59.5
454 5 1982-05 102.2
455 4 1982-04 157
456 3 1982-03 160.8
457 2 1982-02 27.8
458 1 1982-01 107.1
[459 rows x 3 columns]
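Since your eventual goal is a CSV, the DataFrame writes straight out; the filename here is just illustrative:

df.to_csv('rainfall_monthly_total.csv', index=False)

That also covers the multiple-pages concern: the JSON request above returns the whole series in one response, so there is only one table to write.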
Related
I have a text file consisting of data that is separated by tab-delimited columns. There are many ways to read data from the file into Python, but I am specifically trying to use a method similar to the one outlined below. When using a context manager like with open(...) as ..., I've seen that the general approach is to have all of the subsequent code indented within the with statement. Yet when defining a function, the return statement is usually placed at the same indentation as the first line of code within the function (excluding cases with awkward if-else branches). In this case, both approaches work. Is one method considered correct or generally preferred over the other?
import numpy as np

def read_in(fpath, contents=[], row_limit=np.inf):
    """
    fpath is file location + filename + '.txt'
    contents is the initial data that the file data will be appended to
    row_limit is the maximum number of rows to be read (in case one would like to not read in every row)
    """
    nrows = 0
    with open(fpath, 'r') as f:
        for row in f:
            if nrows < row_limit:
                contents.append(row.split())
                nrows += 1
            else:
                break
        # return contents  # option 1: return inside the with block
    return contents  # option 2: return outside the with block
Below is a snippet of the text file I am using for this example.
1996 02 08 05 17 49 263 70 184 247 126 0 -6.0 1.6e+14 2.7e+28 249
1996 02 12 05 47 26 91 53 160 100 211 236 2.0 1.3e+15 1.6e+29 92
1996 02 17 02 06 31 279 73 317 257 378 532 9.9 3.3e+14 1.6e+29 274
1996 02 17 05 18 59 86 36 171 64 279 819 27.9 NaN NaN 88
1996 02 19 05 15 48 98 30 266 129 403 946 36.7 NaN NaN 94
1996 03 02 04 11 53 88 36 108 95 120 177 1.0 1.5e+14 8.7e+27 86
1996 03 03 04 12 30 99 26 186 141 232 215 2.3 1.6e+14 2.8e+28 99
And below is a sample call.
fpath = "/Users/.../sample_data.txt"
data_in = read_in(fpath)
for i in range(len(data_in)):
    print(data_in[i])
(I realize that it's better to read data in chunks of pre-defined size, but the number of characters per row of data varies, so I'm instead trying to give the user control over the number of rows read in; one could read in a subset of the rows at a time, append them to contents, and continually pass them back into read_in, possibly in a loop, if the file is large enough. That said, I'd love to know if I'm wrong about this approach as well, though this isn't my main question.)
If your function needs to do some other things after reading or writing the file, you usually do that outside the with block, so essentially you need to return outside the with block too.
However, if the purpose of your function is just to read in a file, you can return within the with block or outside it. I believe neither method is preferred in this case.
I don't really understand your second question.
You can also put the return inside the with context.
On exiting the context, the cleanup is done. This is the power of with: you don't need to check all possible exit paths. Note that the exit handler is also called when an exception is raised inside the with block.
But if the file is empty (as an example), you should still return something. In such a case your code is clearer if it follows the principle of a single exit path. If you have to handle reaching the end of the file without finding something important, I would put the normal return inside the with context and handle the special case after it.
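As a minimal sketch of the two placements (find_first is a hypothetical helper, not from the question):

def find_first(fpath, target):
    with open(fpath) as f:
        for line in f:
            if target in line:
                return line  # returning inside the with block is fine: the file is still closed on exit
    return None  # the not-found / empty-file case is handled after the block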
I would like to convert 'bytes' data into a Pandas dataframe.
The data looks like this (few first lines):
(b'#Settlement Date,Settlement Period,CCGT,OIL,COAL,NUCLEAR,WIND,PS,NPSHYD,OCGT'
b',OTHER,INTFR,INTIRL,INTNED,INTEW,BIOMASS\n2017-01-01,1,7727,0,3815,7404,3'
b'923,0,944,0,2123,948,296,856,238,\n2017-01-01,2,8338,0,3815,7403,3658,16,'
b'909,0,2124,998,298,874,288,\n2017-01-01,3,7927,0,3801,7408,3925,0,864,0,2'
b'122,998,298,816,286,\n2017-01-01,4,6996,0,3803,7407,4393,0,863,0,2122,998'
The column headers appear at the top; each subsequent line is a timestamp and numbers.
Is there a straightforward way to do this?
Thank you very much
@Paula Livingstone: this seems to work:

import pandas as pd

s = str(bytes_data, 'utf-8')
with open("data.txt", "w") as file:
    file.write(s)
df = pd.read_csv('data.txt')

Maybe this can be done without using a file in between.
I had the same issue and found the StringIO module (https://docs.python.org/2/library/stringio.html) from the answer here: How to create a Pandas DataFrame from a string.
Try something like:

from io import StringIO
import pandas as pd

s = str(bytes_data, 'utf-8')
data = StringIO(s)
df = pd.read_csv(data)
You can also use BytesIO directly:

from io import BytesIO
import pandas as pd

df = pd.read_csv(BytesIO(bytes_data))

This will save you the step of transforming bytes_data to a string.
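For example, with a trimmed-down version of the bytes from the question (shortened here for illustration):

from io import BytesIO
import pandas as pd

bytes_data = (b'#Settlement Date,Settlement Period,CCGT,OIL\n'
              b'2017-01-01,1,7727,0\n'
              b'2017-01-01,2,8338,0\n')
df = pd.read_csv(BytesIO(bytes_data))  # columns come straight from the header row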
OK cool, your input formatting is quite awkward, but the following works:

import pandas as pd

with open('file.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')  # read the file in as one string

# Strip the b'...' wrappers, rejoin the chunks, split rows on the literal \n markers, then split on commas
df = pd.Series(" ".join(data.strip(' b\'').strip('\'').split('\' b\'')).split('\\n')).str.split(',', expand=True)
print(df)

This produces the following:
0 1 2 3 4 5 6 7 \
0 #Settlement Date Settlement Period CCGT OIL COAL NUCLEAR WIND PS
1 2017-01-01 1 7727 0 3815 7404 3923 0
2 2017-01-01 2 8338 0 3815 7403 3658 16
3 2017-01-01 3 7927 0 3801 7408 3925 0
8 9 10 11 12 13 14 15
0 NPSHYD OCGT OTHER INTFR INTIRL INTNED INTEW BIOMASS
1 944 0 2123 948 296 856 238
2 909 0 2124 998 298 874 288
3 864 0 2122 998 298 816 286 None
In order for this to work you will need to ensure that your input file contains only complete rows. For this reason I removed the partial row for the purposes of the test.
Since you have said that the data source is an HTTP GET request, the initial read would take place using pandas.read_html.
More detail on this can be found here. Note specifically the section on io (io : str or file-like).
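If the GET response body is the raw CSV bytes shown in the question, you can also skip the intermediate file entirely by combining requests with the BytesIO approach above (the URL here is hypothetical):

import requests
import pandas as pd
from io import BytesIO

resp = requests.get('https://example.com/generation-by-fuel.csv')  # hypothetical endpoint
df = pd.read_csv(BytesIO(resp.content))  # resp.content is the raw bytes payload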
Dataset Sample
I have a data set like the attached picture, where I want only the observations that have the same numsecur every year.
How do I do this with SAS PROC SQL? Will this be easier to do in STATA? If so, what procedure can I use?
You look like a new user to Stack Overflow. Welcome. Your question is getting downvoted for at least three reasons:
1) It's not really clear what you want from your description of the problem and the data you're providing.
2) You haven't shown any attempts at what you've tried.
3) Providing your data as a picture is not great. If you're going to provide data, it's most helpful to provide it so it's easy for others to consume in their program.
After all, you're asking for our help, so make it easier for us to help you. If you included something like the following, we would just have to copy and paste to create your dataset to work with:
DATA test;
INPUT ID YEAR EXEC SUM;
DATALINES;
1573 1997 50 1080
1581 1997 51 300
1598 1996 54 80
1598 1998 54 80
1598 1999 54 80
1602 1996 55 112.6
1602 1997 55 335.965
;
RUN;
That being said, the following MAY give you what you're looking for, but it's only a guess, as I'm not sure this is really what you're asking:

proc sql noprint;
    create table testout as
    select *, count(*) as cnt
    from test
    group by sum
    having cnt > 1;
quit;
Are you asking to show all rows where the same SUM is used, or something else?
Assuming I understand your question correctly, you would like to keep the observations from the same company/individual only if the company has the same numsecur every year. So, here is what I would try using STATA:
input ID YEAR EXEC SUM
1573 1997 50 1080 //
1581 1997 51 300 //
1598 1996 54 80 //
1598 1998 54 80 //
1598 1999 54 80 //
1602 1996 55 112.6 //
1602 1997 55 335.965 //
1575 1997 50 1080 //
1575 1998 51 1080 //
1595 1996 54 80 //
1595 1998 54 30 //
1595 1999 54 80 //
1605 1996 55 112.6 //
1605 1997 55 335.965 //
end
bysort ID SUM: gen drop = cond(_N==1, 0, _n)  // 0 if the (ID, SUM) pair is unique, else its rank in the group
drop if drop==0                               // keep only repeated (ID, SUM) pairs
The results show (based on my data):
ID YEAR EXEC SUM drop
1. 1575 1997 50 1080 1
2. 1575 1998 51 1080 2
3. 1595 1999 54 80 1
4. 1595 1996 54 80 2
5. 1598 1996 54 80 1
6. 1598 1998 54 80 2
7. 1598 1999 54 80 3
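For comparison, if a Python toolchain is an option, the same keep-only-repeated-(ID, SUM)-groups logic is short in pandas (a sketch built on the sample rows above):

import pandas as pd

df = pd.DataFrame({
    'ID':   [1573, 1581, 1598, 1598, 1598, 1602, 1602],
    'YEAR': [1997, 1997, 1996, 1998, 1999, 1996, 1997],
    'EXEC': [50, 51, 54, 54, 54, 55, 55],
    'SUM':  [1080, 300, 80, 80, 80, 112.6, 335.965],
})

# Keep rows whose (ID, SUM) pair occurs more than once,
# i.e. the same SUM repeated across years for an ID
kept = df[df.groupby(['ID', 'SUM'])['SUM'].transform('size') > 1]
print(kept)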
My gstat program for calculating the empirical semivariogram on the Walker Lake data is as follows:

library(sp)     # for coordinates()
library(gstat)  # for variogram()

data <- read.table("C:/Users/chandan/Desktop/walk470.csv", sep=",", header=TRUE)
coordinates(data) <- ~x+y
walk.var1 <- variogram(v ~ x+y, data=data, width=5, cutoff=100)
The result is as follows:
np dist gamma
1 105 3.836866 32312.63
2 459 8.097102 44486.82
3 1088 12.445035 60230.48
4 985 17.874264 76491.36
5 1579 22.227711 75103.67
6 1360 27.742246 83595.83
7 1747 32.291155 91248.20
8 1447 37.724524 97610.65
9 2233 42.356048 85857.03
10 1794 47.537644 93263.63
11 2180 52.295711 98282.98
12 2075 57.601882 91589.39
13 2848 62.314646 91668.70
14 2059 67.627847 95803.45
15 2961 72.310575 91975.76
16 2240 77.648900 95858.87
17 3067 82.379802 88123.56
18 2463 87.641359 87568.94
19 2746 92.334788 97991.56
20 2425 97.754121 93914.31
I have written my own version of the same problem using the classical sample variogram estimator. The number of points and the dist values come out exactly as in the gstat output, but the gamma values are not the same. Why is that, and what should I do to make my output exactly match gstat's?
Thanks in advance...
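For reference, the classical (Matheron) estimator that gstat's variogram() computes by default is, in LaTeX notation:

\hat{\gamma}(h) = \frac{1}{2\,|N(h)|} \sum_{(i,j) \in N(h)} \left( z(x_i) - z(x_j) \right)^2

where N(h) is the set of point pairs whose separation falls in the lag bin around h. If the np and dist columns already match gstat's, the first thing to check is the factor of 1/2: gamma values that are exactly double (or half) gstat's usually mean the variogram was computed instead of the semivariogram.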
I am new to Excel, but how can I get an estimate for the values in 2013 of something like this:
I need an estimate that is an extrapolation of the values, according to a linear regression on the counterparts observed in recent years.
Thanks
To answer this, I plotted the data in two ways: (a) showing each year separately, and (b) showing all the data as one line through time. The graphs are as follows:
Looking at the first graph, if there is any seasonality in the data, it's not very strong. However, looking at all the data plotted on one line through time, it looks as though there is an upward trend. So my suggestion is to do the most basic regression and fit a straight line to the data. The graph with the trend line added is as follows:
In numbers, the results are:
Month   Data   Best fit straight line
Jan-10 218 232.7
Feb-10 251 235.0
Mar-10 221 237.1
Apr-10 241 239.4
May-10 261 241.7
Jun-10 227 244.0
Jul-10 253 246.3
Aug-10 266 248.6
Sep-10 238 250.9
Oct-10 255 253.2
Nov-10 238 255.5
Dec-10 219 257.7
Jan-11 263 260.0
Feb-11 239 262.4
Mar-11 255 264.5
Apr-11 297 266.8
May-11 299 269.0
Jun-11 256 271.4
Jul-11 292 273.6
Aug-11 247 275.9
Sep-11 254 278.2
Oct-11 258 280.5
Nov-11 264 282.8
Dec-11 301 285.1
Jan-12 319 287.4
Feb-12 314 289.7
Mar-12 274 291.9
Apr-12 325 294.2
May-12 319 296.4
Jun-12 339 298.8
Jul-12 339 301.0
Aug-12 271 303.3
Sep-12 310 305.7
Oct-12 291 307.9
Nov-12 259 310.2
Dec-12 286 312.5
Jan-13 314.8
Feb-13 317.1
Mar-13 319.2
Apr-13 321.5
May-13 323.8
Jun-13 326.1
Jul-13 328.4
Aug-13 330.7
Sep-13 333.0
Oct-13 335.2
Nov-13 337.6
Dec-13 339.8
There are different ways you can apply linear regression. You could, for example, use all your data points to create an equation to predict all the subsequent months. However, if there are yearly cycles, you might want to just use the data for each January to estimate the next January, each February to estimate February, and so on. To keep it simple, let's just work with January for now. In order to keep the numbers smaller, I'm just going to use the last two digits of the year:
X Y
10 218
11 263
12 319
Next calculate 4 different sums:
S[x] = Sum of all Xs = 33
S[y] = Sum of all Ys = 800
S[xx] = Sum of X squared = 100 + 121 + 144 = 365
S[xy] = Sum of X*Y = 2180 + 2893 + 3828 = 8901
Calculate slope and intercept:
N = Number of data points sampled = 3
M = Slope = (N*S[xy] - S[x]*S[y])/(N*S[xx] - S[x]^2)
M = (3*8901 - 33*800)/(3*365 - 33^2) = 303/6 = 50.5
B = Intercept = (S[y] - M*S[x])/N
B = (800 - 50.5*33)/3 = -866.5/3 ≈ -289
Therefore the equation for January would be:
Y = M*X + B
Y = 50.5*X - 289
Calculate for the year 2013:
Y = 50.5*13 - 289 = 367.5 ≈ 368
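To sanity-check that arithmetic, here are the same sums in Python (just the formulas above, nothing new):

# Least-squares fit through the three January points (X = last two digits of the year)
xs = [10, 11, 12]
ys = [218, 263, 319]

n = len(xs)
s_x = sum(xs)                              # S[x]  = 33
s_y = sum(ys)                              # S[y]  = 800
s_xx = sum(x * x for x in xs)              # S[xx] = 365
s_xy = sum(x * y for x, y in zip(xs, ys))  # S[xy] = 8901

m = (n * s_xy - s_x * s_y) / (n * s_xx - s_x ** 2)  # slope M = 50.5
b = (s_y - m * s_x) / n                             # intercept B = -288.83...

print(m * 13 + b)  # January 2013 estimate: 367.66..., rounds to 368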
Start by plotting your data. Decide what kind of function will be a good fit.
You can either create a fit for each month or try to create one that has both year and month as independent variables.
Let's assume that a polynomial fit for each month will work for you, with m being the year:
y = c0 + c1*m + c2*m^2
So for January:
218 = c0 + c1*2010 + c2*2010^2
263 = c0 + c1*2011 + c2*2011^2
319 = c0 + c1*2012 + c2*2012^2
So now you have three equations in three unknowns. Solve for (c0, c1, c2) and then substitute m = 2013 for your extrapolation.
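A quick sketch of that solve in Python for the January column (numpy's vander builds the 1, m, m^2 columns):

import numpy as np

# Quadratic through the three January observations
years = np.array([2010.0, 2011.0, 2012.0])
y = np.array([218.0, 263.0, 319.0])

A = np.vander(years, 3, increasing=True)   # columns: 1, m, m^2
c0, c1, c2 = np.linalg.solve(A, y)

print(round(c0 + c1 * 2013 + c2 * 2013 ** 2))  # extrapolated January 2013 -> 386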
Here are the results I get:
Month 2010 2011 2012 2013
1 218 263 319 386
2 251 239 314 476
3 221 255 274 278
4 241 297 325 325
5 261 299 319 321
6 227 256 339 476
7 253 292 339 394
8 266 247 271 338
9 238 254 310 406
10 255 258 291 354
11 238 264 259 223
12 219 301 286 174
See how you do.