Scraping data requiring scrolling down with scrapy in python

I have a scraping project I threw together with scrapy in python. Now I've come across certain data that are only loaded onto the page as a user scrolls down. How do I emulate this in my scrapy spiders?

Inspect the page (press F12) in Google Chrome or a similar browser.
Click on the Network tab.
Scroll down the page until the new data is rendered.
As the new data is rendered in your browser, a new request should appear in the Network panel.
The response could be anything depending on the site, but most of the time it's JSON.
Click on that request in the Network panel and copy its Request URL.
Back in scrapy, you can send a request to this URL to get the dynamically loaded data.

Related

Web scraping: How to tell in general if a page has content rendered in javascript

How can you tell in general when a website is rendering content in javascript? I typically use bs4 to scrape, and when I can't find a tag, I'm not sure if it's because the content is javascript rendered (which bs4 can't detect) or because I did something wrong.

Compare the output of your request with the html returned from a browser request. In Chrome and Firefox, press F12 and the developer tools will appear. Under the Network tab you can see all the requests that have been made. If the Network tab is empty, refresh the page. The response to the first request in the Network tab should match the response you received from the Python request. If it doesn't match, either your request differs from the browser request, or javascript is doing some post-processing.
Subsequent requests in the Network tab may come from running javascript, iframes, images, and much more.
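
A quick way to apply this comparison from Python is to check whether text you can see in the browser is already present in the raw HTML. The sketch below uses only the standard library; the helper names and the User-Agent value are illustrative assumptions, not part of any particular scraping library.

```python
import urllib.request


def fetch_raw_html(url, timeout=10.0):
    """Download the page as a plain HTTP client sees it (no JavaScript is run)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def appears_in_raw_html(raw_html, visible_text):
    """Return True if text seen in the browser already exists in the raw HTML."""
    return visible_text.casefold() in raw_html.casefold()
```

For example, appears_in_raw_html(fetch_raw_html(url), "some text from the page") returning False suggests the text is added later by javascript. Treat it as a hint rather than proof, since the text may also be split across tags or written with HTML entities.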

Chrome / opera browser extension on page load, load another page in background to get data

Is there a way to get data from other pages? I mean, when you open a specific page, the script searches the web for data and then inserts that data into the page you loaded.
For example, say you are looking to buy a chair and you are on one seller's web page. The extension looks up prices from other sellers and shows a graph or values, and you can click on a value to go to that other seller's page.
From what I understand you can manipulate the DOM of opened tabs, but is there a way to load the DOM of another page that is not open in a tab?

how to provide request_url with specification while web scraping using python

I'm on a web page with url=x.
The url of that web page doesn't change after I give my preferences (like selecting options) or after I click the button on the page.
Problem:
Before performing the above actions, no data is displayed; after them, the web page displays the data.
Context:
I'm trying to scrape data from a web page using python,
and I'm stuck at providing the request_url with the above specifications.
If I provide request_url=x, it fetches no data because I have provided no specifications.
How do I provide those specifications while requesting the url?
Kindly address how to specify pressing the button as well.

It sounds like you're trying to scrape data that only appears after real navigation actions, such as filling in form fields, clicking buttons, or posting data, and that may depend on whatever javascript the page runs, but you don't know the specification of the request that posts the data.
My approach would be to automate a real browser using selenium, find the button via xpath or id, and call click() on it.
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("http://www.google.com")
# Assume the button has the ID "submit" :)
driver.find_element(By.ID, "submit").click()
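
If the browser's Network tab shows that clicking the button simply submits the selected options as a form POST, a lighter alternative to driving a browser may be to build that request yourself. The following is only a sketch of that swapped-in technique: the URL and the field names "category" and "sort" are hypothetical and must be replaced with what the Network tab actually shows.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(url, fields):
    """Build (but do not send) a POST request carrying URL-encoded form fields."""
    body = urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical URL and field names; copy the real ones from the Network tab.
request = build_form_request(
    "https://example.com/search",
    {"category": "chairs", "sort": "price"},
)
```

Passing the request to urllib.request.urlopen would send it. This only works when the site accepts the request without state set up by javascript (tokens, cookies, and so on); otherwise the selenium approach is the safer choice.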

Is there any way to disable self-refresh of page in ChromeDriver by Selenium?

I am using ChromeDriver to navigate to web pages, but the page refreshes itself at short intervals. I don't want that to happen, because the structure of the page reverts to the original after refreshing.
Specifically, I use driver.get_log() to keep loading the network logs, and the structure of the page determines which logs the driver gets. That is why I need the page not to refresh itself.
link of page:https://www.bet365.com/?lng=10&cb=10581211257#/IP/

Can I preload content in a Webbrowser control?

I have a VC++ MFC dialog application with a web browser ActiveX control. I have "Next" and "Prev" buttons that let the user navigate through a list of pre-defined URLs, whose content is shown in the web browser control. Since I know the list of URLs from the start, I would like to pre-load the content in some way while the user is looking at one page, so that when they click on "Next", the content has already been fetched and can be shown immediately instead of waiting for the page to load. I have not found any documentation on how to do that so far. Does anyone have any ideas on how to achieve this? I was thinking of having a second, invisible web browser control where I pre-load the next URL, but it would be tricky to handle the user clicking Next while the next URL is still loading in the other control.
