Unable to get values from a GET request - python-3.x

I am trying to scrape a website to extract values.
I can get text back in the response, but I cannot see any of the values shown on the website in that response.
How do I get the values, please?
I have used basic code from Stack Overflow as a test to explore the data. The code is posted below. It works on other sites, but not on this one.
import requests
url = 'https://www.ishares.com/uk/individual/en/products/253741/'
data = requests.get(url).text
with open('F:\\webScrapeFolder\\out.txt', 'w') as f:
    print(data.encode("utf-8"), file=f)
print('--- end ---')
There is no error message.
The file is written correctly.
However, I do not see any of the numbers.

Try this URL instead:
url = "https://www.ishares.com/uk/individual/en/products/253741/?switchLocale=y&siteEntryPassthrough=true"
Fetch the response again. If you still cannot get what you expect, can you explain in more detail what you need?
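As a rough sketch of the suggestion above, the two query parameters can be built separately instead of being typed into the URL by hand. This uses only the standard library; `build_url` is a hypothetical helper name, not something from the original answer.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ishares.com/uk/individual/en/products/253741/"


def build_url(base, params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return f"{base}?{urlencode(params)}"


# The two parameters come from the URL suggested in the answer.
url = build_url(BASE_URL, {"switchLocale": "y", "siteEntryPassthrough": "true"})
print(url)
```

With requests, passing the same dict as `params=` to `requests.get(BASE_URL, params=...)` should produce an equivalent request.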

Related

Image download with Python

I'm trying to download images with Python 3.9.1.
Apart from the first 2-3 images, every downloaded image is only 1 KB in size. How do I download all the pictures? Please help me.
Sample book: http://web2.anl.az:81/read/page.php?bibid=568450&pno=1
import urllib.request
import os
bibID = input("ID: ")
first = int(input("First page: "))
last = int(input("Last page: "))
if not os.path.exists(bibID):
    os.makedirs(bibID)
for i in range(first,last+1):
    url=f"http://web2.anl.az:81/read/img.php?bibid={bibID}&pno={i}"
    urllib.request.urlretrieve(url,f"{bibID}/{i}.jpg")
There doesn't seem to be an issue with your script. The problem is the API you are hitting and the sequence of requests it requires.
A GET http://web2.anl.az:81/read/img.php?bibid=568450&pno=<page> on its own doesn't work right away. Instead, it returns No FILE HERE
This happens because image retrieval is linked to your cookie. You first need to start the read session that is created when you first visit the page and click the TƏSDİQLƏYIRƏM button.
From what I could tell, you need to do the following:
1. POST http://web2.anl.az:81/read/page.php?bibid=568450 with a Content-Type: multipart/form-data body. It should have a single key/value pair, approve: TƏSDİQLƏYIRƏM. This starts a session and generates a cookie, which you have to send as a header with all of your requests from then on.
E.g.
requests.post('http://web2.anl.az:81/read/page.php?bibid=568450', files=dict(approve='TƏSDİQLƏYIRƏM'))
2. Do the following in your for-loop of pages:
a. GET http://web2.anl.az:81/read/page.php?bibid=568450&pno=<page number> - the page won't show up if you don't do this first
b. GET http://web2.anl.az:81/read/img.php?bibid=568450&pno=<page number> - finally get the image!
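The steps above could be sketched roughly as follows. This is untested against the real site. `download_pages` is a hypothetical helper, and `session` is assumed to be an object that keeps cookies between calls and exposes `post` and `get` the way `requests.Session` does.

```python
import os

BASE = "http://web2.anl.az:81/read"


def download_pages(session, bib_id, first, last, out_dir=None):
    """Save page images first..last, reusing one cookie-carrying session.

    `session` is assumed to behave like requests.Session (post/get methods
    that remember cookies between calls).
    """
    out_dir = out_dir or str(bib_id)
    os.makedirs(out_dir, exist_ok=True)

    # Step 1: submit the approval form once so the server sets a session cookie.
    session.post(f"{BASE}/page.php", params={"bibid": bib_id},
                 files={"approve": "TƏSDİQLƏYIRƏM"})

    saved = []
    for pno in range(first, last + 1):
        query = {"bibid": bib_id, "pno": pno}
        # Step 2a: open the reader page first, as the answer describes.
        session.get(f"{BASE}/page.php", params=query)
        # Step 2b: only then request the image itself.
        image = session.get(f"{BASE}/img.php", params=query)
        path = os.path.join(out_dir, f"{pno}.jpg")
        with open(path, "wb") as f:
            f.write(image.content)
        saved.append(path)
    return saved
```

A call would look something like `download_pages(requests.Session(), "568450", 1, 10)` after `import requests`. Using one session object means the cookie from step 1 is sent automatically on every later request, so there is no need to copy it into headers by hand.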

How do I scrape data from this specific website?

I am trying to get some data out of this website.
http://asphaltoilmarket.com/index.php/state-index-tracker/
I am trying to get the data using the following code but it times out.
import requests
asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/')
This website opens with no problems in the browser and also I can get data from other websites (with different structure) using this code, but my code does not work with this website. I am not sure what changes I need to make.
Also, I was able to download the data in Excel and in another tool (Alteryx), which uses GET via curl.
They likely don't want you to scrape their site.
The response code is a quick indication of that.
>>> import requests
>>> asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/')
>>> asphalt_r
<Response [406]>
406 = Not Acceptable
>>> asphalt_r = requests.get('http://asphaltoilmarket.com/index.php/state-index-tracker/', headers={"User-Agent": "curl/7.54"})
>>> asphalt_r
<Response [200]>
Read and follow their AUP & Terms of Service.
Working does not equal permission.
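To make the status check from the session above a bit more explicit, here is a small sketch that turns a numeric status code into its standard reason phrase using only the standard library. `describe_status` is a hypothetical helper name.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status code followed by its standard reason phrase."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes outside the registered set have no standard phrase.
        return f"{code} (unrecognized status)"


print(describe_status(406))
print(describe_status(200))
```

With requests, the number to pass in would be `asphalt_r.status_code`. Checking it before parsing the body makes it obvious when the server has refused the request.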

Extract only first post content from URL that has multiple tumblr posts with PYTHON

I am trying to extract only the actual content/text from a given input URL using the newspaper package in Python 3. I have succeeded in doing so, but one of my URLs contains multiple Tumblr posts on the same page.
From the URL below I want the content of the first post only, i.e. the paragraph starting with "The Karnataka Assembly election 2018 result is close to being known as vote counting is underway on Tuesday, "
https://poonamparekh.tumblr.com/post/173920050130/karnataka-election-results-modi-rallies-set-to
When I extract content from the above URL, I get the content of the 6th post as my output instead of the first post. That's not what I need; I require the first post as my output. Can anyone help me achieve this?
Here is my code:
from newspaper import Article
url="https://poonamparekh.tumblr.com/post/173920050130/karnataka-election-results-modi-rallies-set-to"
print(url)
article = Article(url, language='en')
article.download()
article.download_state
print('articlee_state : ',article.download_state)
if article.download_state == 2:
    try:
        article.parse()
        result=article.text[0]
        print(result[:150])
        if result=='':
            print('----MESSAGE : No description written for this post')
    except Exception as e:
        print(e)

Newbie webscraping with python3

This is my first attempt at web scraping in Python, to extract some links from a webpage.
This is the webpage I am interested in getting some data from:
http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5
I am interested in extracting all instances of the following from the above webpage:
href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352"
I have written the following regex to extract all matches of this type of link:
r"href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\""
Here is the quick code I have written to try to extract all the regex matches:
#!/usr/bin/python3
import re
import requests
url = "http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5"
page = requests.get(url)
l = re.findall(r'href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\"', page.text)
print(l)
When I run the above code I get the following output:
./links2.py
[]
When I inspect the webpage using the developer tools in the browser I can see these links, but when I try to extract the text I am interested in (href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352") using my Python 3 script I get no matches.
Am I downloading the webpage correctly? How do I make sure I am getting all of the webpage from within my script? I have a feeling I am missing parts of the page when using requests to fetch it.
Any help, please.

Making a webcrawler - Won't go into my for-loop

I'm making a webcrawler for fun. Basically what I want to do for example is to crawl this page
http://www.premierleague.com/content/premierleague/en-gb/matchday/results.html?paramClubId=ALL&paramComp_8=true&paramSeasonId=2010&view=.dateSeason
and first of all get all the home teams. Here is my code:
import requests
from bs4 import BeautifulSoup

def urslit_spider(max_years):
    year = 2010
    while year <= max_years:
        url = 'http://www.premierleague.com/content/premierleague/en-gb/matchday/results.html?paramClubId=ALL&paramComp_8=true&paramSeasonId=' + str(year) + '&view=.dateSeason'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class' : 'clubs rHome'}):
            lid = link.string
            print(lid)
        year += 1
I've found out that the code won't enter the for loop. It gives me no error, but it doesn't do anything. I tried to search for this but can't find what's wrong.
The link you provided redirected me to the homepage. Tinkering with the URL, I got to http://br.premierleague.com/en-gb/matchday/results.html
At this URL I get all the home team names using
soup.findAll('td', {'class' : 'home'}):
How can I navigate to the link you provided? Maybe the HTML is different on that page.
Edit: It looks like the content of this website is loaded from this URL: http://br.premierleague.com/pa-services/api/football/lang_en_gb/i18n/competition/fandr/api/gameweek/1.json
By tinkering with the URL parameters, you can find lots of information.
I still can't open the URL you provided, it keeps redirecting me. But on the page I linked, I can't extract the table info from the HTML (with BeautifulSoup) because the page gathers that info from the JSON above.
The best thing to do is to use that JSON to get the information you need. My advice is to use Python's json package.
If you are new to JSON, you can use this website to make the JSON more readable: https://jsonformatter.curiousconcept.com/
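As a minimal sketch of the json-package advice, the snippet below parses a JSON string and prints it with indentation, which is a local alternative to an online formatter. The sample payload is made up for illustration; the real keys in the gameweek JSON are not known here. `pretty` is a hypothetical helper name.

```python
import json


def pretty(raw_text):
    """Parse a JSON string and return it re-serialized with indentation."""
    return json.dumps(json.loads(raw_text), indent=2, sort_keys=True)


# Made-up sample payload; the real gameweek JSON will have different keys.
sample = '{"example_key": [1, 2], "another": {"nested": true}}'
print(pretty(sample))
```

With requests, `requests.get(url).json()` should return the already-parsed structure, which can then be explored as ordinary Python dicts and lists.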
