I am trying to get some data out of this website.
http://asphaltoilmarket.com/index.php/state-index-tracker/
I am trying to get the data using the following code, but it times out.
import requests
asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/')
This website opens with no problems in the browser, and I can get data from other websites (with different structures) using this same code, but it does not work with this website. I am not sure what changes I need to make.
Also, I could download the data in Excel and in another tool (Alteryx) that issues the GET through curl.
They likely don't want you to scrape their site.
The response code is a quick indication of that.
>>> import requests
>>> asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/')
>>> asphalt_r
<Response [406]>
406 = Not Acceptable
>>> asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/', headers={"User-Agent": "curl/7.54"})
>>> asphalt_r
<Response [200]>
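If the 406 clears with a different User-Agent, here is a minimal sketch of pulling the table data itself, assuming the index is served as a plain HTML table (pandas.read_html is my suggestion here, not something the site documents):

import requests
import pandas as pd

# identify as curl, since the default requests User-Agent gets a 406
headers = {"User-Agent": "curl/7.54"}
resp = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/', headers=headers)
resp.raise_for_status()

# parse every <table> on the page into a list of DataFrames
tables = pd.read_html(resp.text)
print(tables[0].head())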
Read and follow their AUP & Terms of Service.
Working does not equal permission.
I am trying to make an Amazon product cost tracker, but I am not able to get past extracting the required information; the selector just returns an empty list.
import requests
import bs4

res = requests.get("https://www.amazon.com/dp/B087R79S92/ref=twister_B08LD4TDF3?_encoding=UTF8&th=1")
soup = bs4.BeautifulSoup(res.text, "lxml")
x = soup.select("span.a-offscreen")  # the visually hidden price span on Amazon product pages
print(x)  # prints an empty list
Note: First of all - ALWAYS take a deeper look into the soup you are cooking.
You will find the following information:
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
Yes, there are ways to get the content:
Follow instructions and check the API
Set some headers on your request to show you're "not a robot" (see the sketch after this list); this won't work forever
Use Selenium
...
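For the header approach, a minimal sketch (the header values are just an example of a browser-like identity, not anything Amazon documents):

import requests
import bs4

# browser-like headers; the default requests User-Agent tends to be served a bot page
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
res = requests.get("https://www.amazon.com/dp/B087R79S92", headers=headers)
soup = bs4.BeautifulSoup(res.text, "lxml")
print(soup.select("span.a-offscreen"))  # may still be empty if Amazon serves a captcha page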
I have very little experience but a lot of persistence. One of my hobbies is football, and I help at a local team. I'm a big fan of statistics, and unfortunately the only way to collect data from the local football federation is by web scraping. I have seen that Python with the beautifulsoup package can help, but I cannot identify the tags, as I believe these are in a table.
My goal is to automatically collect the information to build up a database with players, fixtures, teams, referees, ... and to build up stats such as how many times a player has been in the starting line-up, when teams are more likely to score a goal, when they are likely to concede one, ...
I have two links for reference.
The first is for the general fixtures of a given group.
The second is for the details of a given match.
Any clues or where to start with the pattern would be great.
You should first get into web scraping using Python native libraries like requests to contact the page you want. Then, depending on the page, you should use bs4 (BeautifulSoup) to find what you are looking for inside the page. Then you want to transform and refine that information into variables (lists or dataframes, based on your need) and finally save those variables into a dataset. Here is an example code:
import requests
from bs4 import BeautifulSoup  # assuming you already have bs4 installed

page = requests.get('http://www.DESTINATIONSITE.com/BLAHBLAH')  # requests needs the scheme (http:// or https://)
soup = BeautifulSoup(page.text, 'html.parser')
soup_str = soup.text  # the text of the whole page as one string
Up to this point, you have the string value of the entire page at hand. Now you want to use regular expressions or other Python code to separate the information you want from soup_str and store it in variable(s).
For the rest of the process, if you want to save that data, I suggest you look into libraries like pandas.
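For that last step, a minimal sketch of the pandas part (the column names and values here are made up for illustration):

import pandas as pd

# suppose your parsing produced parallel lists of values
players = ['Player A', 'Player B']
goals = [1, 0]

df = pd.DataFrame({'player': players, 'goals': goals})
df.to_csv('match_stats.csv', index=False)  # persist the dataset for later use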
I have made some progress:
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

# send request
url = 'http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# read acta
acta_text = []
acta_text_element = soup.find_all(class_='acta-table')
for item in acta_text_element:
    acta_text.append(item.text)
When I print the items I get many \n characters.
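One way to get rid of those is to let BeautifulSoup handle the whitespace; a minimal sketch of the same loop:

for item in acta_text_element:
    # get_text with strip=True drops the layout newlines around each fragment
    acta_text.append(item.get_text(separator=' ', strip=True))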
Mahdi_J, thanks for the initial response. I have already started on the approach:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup = BeautifulSoup(c,"html.parser")
What I need is some support on where to start with the parsing. From the URL I need to collect the players for both teams, the goals, and the bookings, so I can save them elsewhere. What I'm having trouble with is how to parse. I have very little skill and need a starting point for the parsing.
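As a starting point, one approach is to walk each acta-table row by row and keep one clean string per cell; a minimal sketch (which cells hold the players, goals and bookings is an assumption you will need to check against the real page):

tables = soup.find_all(class_='acta-table')
for table in tables:
    for row in table.find_all('tr'):
        # one stripped string per cell, without the layout newlines
        cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
        if cells:
            print(cells)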
I am trying to scrape a website to extract values.
I get text back in the response, but I cannot see any of the values from the website in it.
How do I get the values, please?
I have used basic code from Stack Overflow as a test to explore the data. The code is posted below. It works on other sites, but not on this site.
import requests

url = 'https://www.ishares.com/uk/individual/en/products/253741/'
data = requests.get(url).text

with open('F:\\webScrapeFolder\\out.txt', 'w') as f:
    print(data.encode("utf-8"), file=f)

print('--- end ---')
There is no error message.
The file is written correctly.
However, I do not see any of the numbers.
Check with this URL instead:
url = "https://www.ishares.com/uk/individual/en/products/253741/?switchLocale=y&siteEntryPassthrough=true"
and try to get the response. If you still cannot get what is expected, please describe in more detail what data is needed.
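In other words, something like this; the extra query parameters appear to skip the site-entry interstitial, which would explain why the plain URL comes back without the numbers:

import requests

url = "https://www.ishares.com/uk/individual/en/products/253741/?switchLocale=y&siteEntryPassthrough=true"
data = requests.get(url).text

# quick check: is one of the figures you expect in the HTML now?
print('NAV' in data)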
I am using Python BeautifulSoup to extract some data from a famous song site.
Here is the snippet of code:
import requests
from bs4 import BeautifulSoup
url= 'https://gaana.com/playlist/gaana-dj-bollywood-top-50-1'
res = requests.get(url)
while res.status_code != 200:
    try:
        res = requests.get(url)  # note: the variable url, not the string literal 'url'
    except:
        pass
print(res)
soup = BeautifulSoup(res.text,'lxml')
songs = soup.find_all('meta',{'property':'music:song'})
print (songs[0])
Here is the sample output:
<Response [200]>
<meta content="https://gaana.com/song/o-saathi" property="music:song"/>
Now I want to extract the URL inside the content attribute as a string, so that I can use that URL further in my program.
Can someone please help me?
It's in the comments, but I just want to explain: BeautifulSoup returns most results as a list or other iterable object. You show that you understand this in your code by using songs[0], but what songs[0] gives you is a Tag whose attributes are accessed like a dictionary.
As explained in this StackOverflow post, you need to query not only songs[0] but also the attribute within it (the attribute name and its value form a key-value pair, which is the chief way to get data out of a dictionary-like object).
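Concretely, for the tag in your output, a minimal sketch:

# songs[0] is the <meta> tag; its attributes are accessed like dictionary keys
song_url = songs[0]['content']
print(song_url)  # https://gaana.com/song/o-saathi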
Last note: while I've been a big fan of BeautifulSoup4 for basic web scraping, you may want to consider the lxml library. It's pretty well documented; to really take advantage of it you have to learn XPath expressions, which are sort of like regex for XML/HTML; but for advanced scraping it's probably the best option short of Selenium, and it returns cleaner data than bs4.
Good luck!
I am new to Python. I am working on a project with my friend, where we need to extract the runs a batsman scored in his last match, but we are stuck on how to get that info using Python. What we know is how to get the source code of the web page. How do we go one step further? Any help will be appreciated!
Check out the requests package:
https://pypi.python.org/pypi/requests
From the site:
Requests is an Apache2 Licensed HTTP library, written in Python, for human beings.
Most existing Python modules for sending HTTP requests are extremely verbose and cumbersome. Python’s builtin urllib2 module provides most of the HTTP capabilities you should need, but the api is thoroughly broken. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.
Things shouldn’t be this way. Not in Python.
>>> r = requests.get('https://api.github.com', auth=('user', 'pass'))
>>> r.status_code
204
>>> r.headers['content-type']
'application/json'
>>> r.text
...
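To go one step further than the source code, the usual pattern is to feed what requests returns into a parser such as BeautifulSoup; here is a minimal sketch (the URL and the class name are placeholders, since they depend on whichever scorecard site you use):

import requests
from bs4 import BeautifulSoup

# placeholder URL: substitute the scorecard page you are actually using
page = requests.get('https://example.com/scorecard')
soup = BeautifulSoup(page.text, 'html.parser')

# placeholder selector: inspect the page in your browser to find the real class
for cell in soup.find_all('td', class_='batsman-runs'):
    print(cell.get_text(strip=True))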