I need to parse a webpage that contains a lot of images, and each request takes a long time because of them.
Can I use requests.get to fetch only the HTML content, without waiting for the images?
When you GET a page, you only download the page itself anyway.
import requests
url = 'https://stackoverflow.com/questions/40394209/python-requests-how-to-get-a-page-without-downloading-all-images'
# This will yield only the HTML code
response = requests.get(url)
print(response.text)
The page's HTML contains references to the images, but the GET request does not fetch them; a browser downloads the images only by issuing a separate request for each one.
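If you later need the image URLs themselves (still without downloading the files), you can parse them out of the HTML. A minimal sketch, assuming Beautiful Soup is installed:

import requests
from bs4 import BeautifulSoup

url = 'https://stackoverflow.com/questions/40394209/python-requests-how-to-get-a-page-without-downloading-all-images'
response = requests.get(url)

# Collect the src of every <img> tag; nothing here fetches the image files.
soup = BeautifulSoup(response.text, 'html.parser')
image_urls = [img.get('src') for img in soup.find_all('img')]
print(image_urls)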
Related
Help me, please. Is there any way to parse my own page of products on Amazon or on other websites?
When I try to parse data from ordinary pages, it works, but I can't parse my own page of products I have bought. I have used requests and Beautiful Soup.
I am trying to scrape a website. When I use the Scrapy shell with a Chrome user agent, I get the same response (HTML content) as I see in the browser.
But when I crawl the same link with a Scrapy script and the default user_agent, I get a different response whose content I don't need; and when I change the user_agent to the one I used in the shell, it shows a 404 error.
Please help me, I am really stuck on this.
I have tried many user agents, and I also changed the concurrent requests setting, but nothing works.
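For reference, the usual way to pin the user agent in a Scrapy spider is the USER_AGENT setting. A minimal sketch; the start URL and the UA string below are placeholders, not the asker's actual values:

import scrapy

class PageSpider(scrapy.Spider):
    name = 'page'
    start_urls = ['https://example.com/']  # placeholder URL

    # Per-spider override of Scrapy's USER_AGENT setting.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/96.0.4664.110 Safari/537.36'),
    }

    def parse(self, response):
        # Log the page title to verify the server returned the expected content.
        self.logger.info('Title: %s', response.css('title::text').get())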
I have an Ember application that breaks if I try to access a sub-route URL directly. If I navigate to the home page and then to the sub-route, it loads properly.
The issue with direct access is that the page prints the raw HTML to the screen instead of parsing it, loading the JS libraries, and showing the normal content.
I work for a large corporation and can't post the site code, but I would appreciate help with common/likely causes I can troubleshoot.
I had added a Content-Type: application/json header on my server side, and it wasn't constrained to just API calls. For some reason this only affected pages that were accessed directly, not the home page. I moved the header so it is only set on API responses, and all is well.
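The poster doesn't name their server stack, but here is a minimal Flask sketch (an assumption for illustration, not their code) of the fix: send application/json only on API routes, and let page routes keep text/html so the browser renders them:

from flask import Flask, jsonify

app = Flask(__name__)

# API route: jsonify sets Content-Type: application/json here only.
@app.route('/api/items')
def api_items():
    return jsonify(items=['a', 'b'])

# Catch-all page route (covers directly accessed sub-routes): the default
# text/html Content-Type lets the browser render the app shell normally.
@app.route('/', defaults={'path': ''})
@app.route('/<path:path>')
def index(path):
    return '<!doctype html><html><body><div id="app"></div></body></html>'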
I want to get the address of a page after a redirect. I have the following code:
from urllib.request import urlopen

url = 'https://simple.wikipedia.org/wiki/Gcd'
print(urlopen(url).geturl())
But it doesn't work: it prints https://simple.wikipedia.org/wiki/Gcd, while it should print https://simple.wikipedia.org/wiki/Greatest_common_divisor.
So, what is the problem with it?
There is actually no problem. The URL you get when opening https://simple.wikipedia.org/wiki/Gcd is exactly that URL. The only way for the URL to change would be an HTTP redirect, and if you look at the response from that URL, you can see that it returns a plain 200 status code. So there is no redirect.
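You can verify this yourself by inspecting the response (at the time of writing, this prints 200):

from urllib.request import urlopen

url = 'https://simple.wikipedia.org/wiki/Gcd'
response = urlopen(url)
print(response.status)   # 200: the server answered directly, no redirect
print(response.geturl()) # still https://simple.wikipedia.org/wiki/Gcd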
However, when you open the URL in the browser, the URL does get changed to https://simple.wikipedia.org/wiki/Greatest_common_divisor. How does this happen when there is no redirect?
This is actually a new MediaWiki feature that rewrites the URL in the browser using the History API. It simply replaces the URL displayed in the browser, without making a new request and without being a true HTTP redirect.
It's functionality that only works in modern browsers with JavaScript enabled; otherwise you would stay on the Gcd URL, which was also the behavior of older versions of MediaWiki.
You can learn more about this new MediaWiki feature in the Phabricator task T37045.
As for your “problem” with it, you should consider communicating with MediaWiki through the MediaWiki API, which will also tell you when a page is a redirect.
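For example, a query with the redirects parameter resolves the wiki redirect for you; a minimal sketch using requests:

import requests

API = 'https://simple.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'titles': 'Gcd',
    'redirects': 1,   # ask the API to resolve wiki redirects
    'format': 'json',
}
data = requests.get(API, params=params).json()

# The 'redirects' list maps each redirect source to its target page.
for r in data.get('query', {}).get('redirects', []):
    print(r['from'], '->', r['to'])  # Gcd -> Greatest common divisor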
I'm using the request library in Node.js to pull images from a website and save them to disk. Instead of returning a 404 for missing images, the site redirects to the homepage and serves HTML. Is there a proper way to detect this change/redirect without parsing the body of the response?
You can check the Content-Type sent by the server in response.headers['content-type'].
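The question is about Node.js, but the idea is language-agnostic; here is the same check sketched with Python's requests for illustration (the image URL is a placeholder):

import requests

url = 'https://example.com/images/photo.jpg'  # placeholder URL
r = requests.get(url)

# A non-empty history means at least one redirect was followed.
if r.history:
    print('redirected to', r.url)

# An image URL that serves HTML is the site's "soft 404" homepage.
content_type = r.headers.get('Content-Type', '')
if content_type.startswith('image/'):
    with open('photo.jpg', 'wb') as f:
        f.write(r.content)
else:
    print('not an image:', content_type)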