Getting information from a website using python - web

I am new to Python. My friend and I are working on a project where we need to extract the runs a batsman scored in his last match, but we are stuck on how to get that information using Python. So far we only know how to fetch the source code of the web page. How do we go one step further? Any help will be appreciated!

Check out the requests package:
https://pypi.python.org/pypi/requests
From the site:
Requests is an Apache2 Licensed HTTP library, written in Python, for human beings.
Most existing Python modules for sending HTTP requests are extremely verbose and cumbersome. Python’s builtin urllib2 module provides most of the HTTP capabilities you should need, but the api is thoroughly broken. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.
Things shouldn’t be this way. Not in Python.
>>> r = requests.get('https://api.github.com', auth=('user', 'pass'))
>>> r.status_code
204
>>> r.headers['content-type']
'application/json'
>>> r.text
...
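Going one step further than fetching the source means parsing the HTML and picking out the cell you need. Here is a minimal sketch using only the standard library's html.parser. The HTML snippet, the class name "runs" and the RunsParser helper are all hypothetical, since the real scorecard page and its markup are not given in the question. In practice you would feed it r.text from a requests call instead of SAMPLE_HTML.

```python
from html.parser import HTMLParser

# Hypothetical scorecard markup; a real page will use different tags and classes.
SAMPLE_HTML = """
<table>
  <tr><td class="player">A. Batsman</td><td class="runs">57</td></tr>
  <tr><td class="player">B. Batsman</td><td class="runs">12</td></tr>
</table>
"""


class RunsParser(HTMLParser):
    """Collect the number found in every <td class="runs"> cell."""

    def __init__(self):
        super().__init__()
        self.in_runs_cell = False
        self.runs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "td" and dict(attrs).get("class") == "runs":
            self.in_runs_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_runs_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a "runs" cell
        if self.in_runs_cell and data.strip():
            self.runs.append(int(data.strip()))


parser = RunsParser()
parser.feed(SAMPLE_HTML)
print(parser.runs)
```

A dedicated parser such as Beautiful Soup usually needs less code for the same job; the standard library version is shown only because it runs without installing anything.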

Related

How do I scrape data from this specific website?

I am trying to get some data out of this website.
http://asphaltoilmarket.com/index.php/state-index-tracker/
I am trying to get the data using the following code but it times out.
import requests
asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/')
This website opens with no problems in the browser and also I can get data from other websites (with different structure) using this code, but my code does not work with this website. I am not sure what changes I need to make.
I can also download the data in Excel and in another tool (Alteryx), which issues a GET request through curl.
They likely don't want you to scrape their site.
The response code is a quick indication of that.
>>> import requests
>>> asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/')
>>> asphalt_r
<Response [406]>
406 = Not Acceptable
>>> asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/', headers={"User-Agent": "curl/7.54"})
>>> asphalt_r
<Response [200]>
Read and follow their AUP & Terms of Service.
Working does not equal permission.
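If the site's terms do allow automated access, the two ideas above (look at the status code, send an explicit User-Agent) can be sketched with the standard library alone. The User-Agent string, the build_request and fetch_status helpers, and the example.com URL are hypothetical placeholders. The same headers dict can be passed to requests.get through its headers argument.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent; identify your script honestly.
HEADERS = {"User-Agent": "my-scraper/0.1 (contact: you@example.com)"}


def build_request(url):
    """Return a Request object that carries the custom headers."""
    return urllib.request.Request(url, headers=HEADERS)


def fetch_status(url, timeout=10):
    """Return the HTTP status code, including error codes such as 406."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is still available on the exception
        return err.code


# Building the request does not touch the network
req = build_request("http://example.com/")
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```

Passing a timeout also avoids the script hanging indefinitely when a server never answers.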

Why does this simple Python 3 XPath web scraping script not work?

I was following a web scraping tutorial from here: http://docs.python-guide.org/en/latest/scenarios/scrape/
It looks pretty straightforward, and before I did anything else I just wanted to see if the sample code would run. I'm trying to find the URIs for the images on this site.
http://www.bvmjets.com/
This actually might be a really bad example. I was trying to do this with a more complex site but decided to dumb it down a bit so I could understand what was going on.
Following the instructions, I got the XPath for one of the images.
/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img
The whole script looks like:
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
images = tree.xpath('/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img')
print(images)
But when I run this, the list is empty. I've looked at the XPath docs and I've tried various alterations to the XPath, but I get nothing each time.
I don't think I can answer your question directly, but I noticed that the images on the page you are targeting are sometimes wrapped differently. I'm unfamiliar with XPath myself, and wasn't able to get the number selector to work, despite this post. Here are a couple of examples to try:
tree.xpath('//html//body//div//div//table//tr//td//div//a//img[@src]')
or
tree.xpath('//table//tr//td//div//img[@src]')
or
tree.xpath('//img[@src]') # 68 images
The key to this is building up slowly. Find all the images, then find the images wrapped in the tag you are interested in, and so on, until you are confident you can find only the images you are interested in.
Note that the [@src] predicate keeps only images that have a src attribute, so we can read the source of each image. Using this post we can now download any or all of the images we want:
import shutil
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
# images that sit inside links opening in a new tab
cool_images = tree.xpath("//a[@target='_blank']//img[@src]")
source_url = page.url + cool_images[5].attrib['src']
path = 'cool_plane_image.jpg'  # path on disk
r = requests.get(source_url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True  # decompress gzip/deflate while reading
        shutil.copyfileobj(r.raw, f)
I would highly recommend looking at Beautiful Soup. For me, this has helped my amateur web scraping ventures. Have a look at this post for a relevant starting point.
This may not be the answer you are looking for, but hopefully it is a starting point and of some use to you. Best of luck!
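The build-up-slowly approach can be tried offline. One frequent reason a path copied from browser developer tools returns nothing is that browsers insert a tbody element into tables even when the HTML source has none; whether that is the cause on this particular site is an assumption. The sketch below uses the standard library's xml.etree.ElementTree on a small, well-formed, hypothetical snippet rather than lxml on the real page, so the markup and file names are made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed markup as the server might send it (no <tbody>).
SOURCE = """
<html><body><table>
  <tr><td><a target="_blank" href="big.jpg"><img src="images/jet1.jpg"/></a></td></tr>
  <tr><td><img src="images/logo.png"/></td></tr>
</table></body></html>
"""

root = ET.fromstring(SOURCE.strip())

# Path as a browser would report it, including a <tbody> the source lacks
with_tbody = root.findall("./body/table/tbody/tr/td/a/img")

# Build up slowly: every image first, then only images inside target="_blank" links
all_images = root.findall(".//img[@src]")
linked_images = root.findall(".//a[@target='_blank']/img[@src]")

# urljoin resolves relative paths more safely than string concatenation
urls = [urljoin("http://www.bvmjets.com/", img.get("src")) for img in linked_images]

print(len(with_tbody), len(all_images), urls)
```

If the tbody path comes back empty while the looser paths match, dropping tbody from the copied XPath is worth trying against the real page.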

I want to extract the latitude, longitude and accuracy from https://mycurrentlocation.net/ using Python 3 with the BeautifulSoup and requests modules

My basic idea is to make a simple script in order to get geographic coordinates using this site: https://mycurrentlocation.net/.
When I run my script, the attribute column is empty and the program doesn't return the list correctly.
Please help me! :)
import requests
from bs4 import BeautifulSoup as bs
site = "https://mycurrentlocation.net/"
def track_the_location():
    page = requests.get(site)
    page = bs(page.text, "html.parser")
    # find() returns a single tag; find_all() returns a list, which has no .text
    latitude = page.find("td", {"id": "latitude"})
    longitude = page.find("td", {"id": "longitude"})
    accuracy = page.find("td", {"id": "accuracy"})
    return [latitude.text, longitude.text, accuracy.text]
print(track_the_location())
I think the problem is that you are running it from a local script. A script behaves differently from a browser, because it doesn't provide any location information to the website. The needed information is supplied at runtime, so unless the site offers an API where you can specify your information manually, you need to simulate a browser session to get the data.
A possible solution would be Selenium, which helps you simulate a browser session. See the Selenium documentation for further reading.
Hope I could help you. Have a nice day.
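To see why the cells come back empty, it helps to look at what a script actually receives. The sketch below parses a hypothetical piece of static HTML in which the cells exist but hold no text yet, which is what a page looks like before its JavaScript has run. The markup and the CellText helper are assumptions for illustration, not the real page.

```python
from html.parser import HTMLParser

# Hypothetical static HTML as a script would receive it: the cells exist,
# but a browser would only fill them in later with JavaScript.
STATIC_HTML = '<table><tr><td id="latitude"></td><td id="longitude"></td></tr></table>'


class CellText(HTMLParser):
    """Record the text found inside each <td> that has an id attribute."""

    def __init__(self):
        super().__init__()
        self.current_id = None
        self.cells = {}

    def handle_starttag(self, tag, attrs):
        cell_id = dict(attrs).get("id")
        if tag == "td" and cell_id:
            self.current_id = cell_id
            self.cells[cell_id] = ""

    def handle_endtag(self, tag):
        if tag == "td":
            self.current_id = None

    def handle_data(self, data):
        if self.current_id:
            self.cells[self.current_id] += data


parser = CellText()
parser.feed(STATIC_HTML)
print(parser.cells)
```

Both values are empty strings: the tags are found, but there is nothing in them until a browser executes the page's script.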

Python 3.6.4 - urllib.request.urlopen certificate verify failed error

Trying to teach myself how to use urllib.request in Python 3.6.4 - however, I can't seem to get a basic example to work. Below is the code that I am running, copied straight from Python's documentation at this link.
>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
...     print(f.read(300))
A picture of the error I get is here. It tells me that the SSL certificate verify failed (I'm unsure of what this means).
I don't think there is anything wrong with the code I am running, but maybe I'm missing a step to setting up the environment. From what I can tell, it should be as simple as running those few lines of code. Any help is greatly appreciated.
As a quick background, I've taken 2 computer science courses at university, so I am by no means an expert, but I do have a pretty solid understanding of basic programming with Python. I'm trying to use this to scrape data in conjunction with BeautifulSoup.
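The error means Python could not validate the site's certificate against a set of trusted certificate authorities. A minimal diagnostic sketch follows, assuming the cause is a missing or unreadable CA bundle on this installation (a common situation, but not confirmed for this setup). The fetch_start helper is a hypothetical name.

```python
import ssl
import urllib.request

# Default context: verifies the server certificate and the host name
context = ssl.create_default_context()

# Where this Python installation looks for CA certificates; if these paths are
# missing or empty, certificate verification is likely to fail
paths = ssl.get_default_verify_paths()
print(paths.cafile, paths.capath)


def fetch_start(url, nbytes=300):
    """Read the first nbytes of url using the verifying context above."""
    with urllib.request.urlopen(url, context=context) as f:
        return f.read(nbytes)
```

Calling fetch_start('https://www.python.org/') should then either succeed or raise the same verification error, which narrows the problem down to the certificate store rather than the code.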

python 3.3 basic error

I have Python 3.3 installed.
I am using the example from the Python site:
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
The only thing that happens when I run it is that I get this:
======RESTART=========
I know I am a rookie, but I figured the example from Python's own website should work.
It doesn't. What am I doing wrong? Eventually I want to run the script from the website below, but I think urllib will not work as it is used on that site. Can someone tell me if the code will work with Python 3.3?
http://flowingdata.com/2007/07/09/grabbing-weather-underground-data-with-beautifulsoup/
I think I see what's probably going on. You're likely using IDLE, and when it starts a new run of a program, it prints the
======RESTART=========
line to tell you that a fresh program is starting. That means that all the variables currently defined are reset and/or deleted, as appropriate.
Since your program didn't print any output, you didn't see anything.
The two lines I suggested adding were just tests to figure out what was going on, they're not needed in general. [Unless the window itself is automatically closing, which it shouldn't.] But as a rule, if you want to see output, you'll have to print what you're interested in.
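As a small illustration of that rule, here is the same read followed by explicit print calls. The data: URL is an assumption used as a stand-in for http://python.org/ so the snippet runs without a network connection; it needs Python 3.4 or later.

```python
import urllib.request

# Assumption: a data: URL stands in for http://python.org/ so this runs offline
response = urllib.request.urlopen("data:text/plain,hello")
html = response.read()

# Without a print call, a script run from IDLE shows only the RESTART line
print(len(html))
print(html)
```

With a real URL, printing html[:300] instead of the whole body keeps the output readable.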
Your example works for me. However, I suggest using requests instead of urllib.
To simplify the example you linked to, it would look like:
from bs4 import BeautifulSoup
import requests
resp = requests.get("http://www.wunderground.com/history/airport/KBUF/2007/12/16/DailyHistory.html")
soup = BeautifulSoup(resp.text, "html.parser")
