Scraping in Scrapy without pagination links

I'm trying to paginate through a review site that doesn't have hyperlinked pagination buttons. Code snippet for the pagination button:
<span data-page-number="2" data-offset="5" class="nav next taLnk " onclick="(ta.prwidgets.getjs(this,'handlers')).paginate(this); ta.trackEventOnPage('NORTH_STAR_PAGINATION', 'next', '2', 0);" data-page="LqWQeVsSVuWy3KkAoWMUKvKmmmmWxfWiEoWrGVQhIpMgQJIQxGCSsVEtSIgQfSIgWwGScJMVc2GSJQwVCCtgcsJCSJB"><div class="ui_button primary ">Next</div></span>
However, there is a pagination structure in the URL, for example foobar.com/page1.
I'd like to avoid using a headless browser. Since I'm visiting many of these pages, I can't manually inspect the page length of each one.
However, I do know there are 10 reviews per page, and the first page states both the review count and the number of pages. Is there a way I can use this scraped information to paginate via URL logic from my initial start URL? Thanks!
(FYI: going to a URL with a page number that doesn't exist redirects back to page 1.)
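One way to turn the scraped counts into page URLs, as a minimal sketch. The `/page{n}` pattern comes from the foobar.com/page1 example above; the function name `page_urls` and the `https://` scheme are assumptions for illustration only.

```python
import math

REVIEWS_PER_PAGE = 10  # stated in the question


def page_urls(base_url, review_count, per_page=REVIEWS_PER_PAGE):
    """Return the URLs of pages 2..N, derived from the total review count.

    Page 1 is the start URL itself, so it is not repeated here.
    """
    last_page = math.ceil(review_count / per_page)
    return [f"{base_url}/page{n}" for n in range(2, last_page + 1)]


urls = page_urls("https://foobar.com", review_count=47)
print(urls)
```

In a Scrapy spider, the callback for the start URL could yield one `scrapy.Request` per generated URL. Capping the range at the computed last page matters here, because a page number that doesn't exist redirects back to page 1.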

Related

How to fetch website links when they're not numerically ordered

Using BeautifulSoup, it's easy to fetch URLs that follow a certain numeric order. However, how do I fetch URLs when they are organized differently, as on https://mongolia.mid.ru/en_US/novosti, which has articles like
https://mongolia.mid.ru/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/24-avgusta-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-ministrom-energetiki-mongolii-n-tavinbeh?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1?
Websites such as these are weird because when you first open the link, there is a » Бусад мэдээ ("Other news") button to go to the next page of articles. But once you click it, you get Previous and Next buttons instead, which is disorganized.
How do I fetch all the news articles from websites like these (https://mongolia.mid.ru/en_US/novosti or https://mongolia.mid.ru/ru_RU/)?
It seems that the » Бусад мэдээ button on https://mongolia.mid.ru/ru_RU/ just redirects to https://mongolia.mid.ru/en_US/novosti. So why not start from the latter?
To scrape all the news, just go page by page using the link from the Next button.
If you want it to be more programmatic, check the differences in the query parameters and you'll see that _101_INSTANCE_hfCjAfLBKGW0_cur is set to the current page's number (starting from 1).
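A minimal sketch of building the page URLs from that parameter. The parameter name is the one quoted above; the function name `news_page_url` is hypothetical, and whether the site also needs the other `p_p_*` query parameters is not verified here.

```python
from urllib.parse import urlencode

BASE = "https://mongolia.mid.ru/en_US/novosti"
PAGE_PARAM = "_101_INSTANCE_hfCjAfLBKGW0_cur"  # page number, starting from 1


def news_page_url(page):
    """Build the listing URL for a given page number."""
    return f"{BASE}?{urlencode({PAGE_PARAM: page})}"


for page in range(1, 4):
    print(news_page_url(page))
```

Each listing page can then be fetched and parsed with BeautifulSoup to collect the article links, stopping when a page yields no new links.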

How to scrape a page and then perform a button click to go to next page for scraping using Selenium and BeautifulSoup

I am scraping a web page which has a table with child tr and td tags. I am able to scrape the first page properly, but to go to the next page I need a button click, and I need some help understanding how to do that. I am using Selenium and BeautifulSoup to get the page response.
The HTML for the button tag is as follows:
<input type="submit" name="RadGrid1$ctl00$ctl03$ctl01$ctl28" value=" " onclick="return false;" title="Next Page" class="rgPageNext">
The sample code that I have tried:
for i in range(0, 14):
    # code for scraping 1 page
    # (some code here)
    btn = driver.find_element_by_xpath(xpath)
    btn.click()
The button click goes to the next page but is not able to scrape info for any of the 2-14 pages. I have tried putting my scraping 1 page code in a for loop and added the button click logic at the end. It scrapes 1st page, performs a button click but doesn't go through to the next page. Instead it loops back to page 1.
As far as I can tell, you haven't given BeautifulSoup the new page. After the click, take the current URL (or page source) of the new page from the driver and scrape that. Only then will it scrape the new page; otherwise the soup still holds the old page and scrapes that one again.
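A rough sketch of that loop, assuming the page content is re-read from the driver after every click. `scrape_all_pages` and `parse_page` are hypothetical names; `parse_page` stands in for the BeautifulSoup code that scrapes one page and returns a list of rows. The string `"xpath"` is the value of Selenium's `By.XPATH`, and the fixed `time.sleep` is only a crude placeholder for waiting until the new table has loaded.

```python
import time


def scrape_all_pages(driver, parse_page, next_xpath, n_pages, wait_seconds=2.0):
    """Scrape n_pages pages, clicking the 'Next Page' button between them."""
    results = []
    for page in range(n_pages):
        # Re-read the HTML from the driver on every iteration; a soup built
        # before the click still holds the previous page.
        results.extend(parse_page(driver.page_source))
        if page < n_pages - 1:
            driver.find_element("xpath", next_xpath).click()
            time.sleep(wait_seconds)  # crude wait for the new page to render
    return results
```

With the button shown in the question, a call might look like `scrape_all_pages(driver, parse_page, "//input[@class='rgPageNext']", 14)`.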

How to click a button and scrape text from a website using python scrapy

I have used Python Scrapy to extract data from a website, and I am able to scrape most of the details of the site. My main problem is that I am not able to extract all the product reviews. I can only extract the top 4 reviews displayed on the page; to get the other reviews I have to open a pop-up window that has all the reviews. I looked for an 'href' for the pop-up window but was not able to find one. This is the link that I tried to scrape. The reviews and ratings are at the bottom of the page: https://www.coursera.org/learn/big-data-introduction
Can anyone help me by explaining how to extract the reviews from this pop-up window? Another thing to note is that the pop-up uses infinite scrolling.
Thanks in advance.
Scrapy, unlike tools like Selenium and PhantomJS, does not drive a full web browser in the background. You cannot just click a button.
You need to understand what the button does (e.g. does it simply submit a form? Does it do something with JavaScript? Etc.) and reproduce the functionality in your own code.
For example, you might need to read the content of a script element, apply regular expressions to it to pull a URL from a string literal, then make a new HTTP request to that URL, then pull the data you want from the new DOM.
... and then repeat for the next “page” of the infinite scroll.
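To illustrate the script-element step, here is a minimal sketch using only the standard library. The HTML snippet, the key name `reviewsUrl`, and the URL in it are invented for illustration; the real page has to be inspected to find where (and whether) such a URL appears.

```python
import re

# Invented example of a page that embeds an endpoint URL in inline JavaScript.
SAMPLE_HTML = """
<html><body>
<script>var config = {"reviewsUrl": "https://example.com/api/reviews?page=1"};</script>
</body></html>
"""

# Capture the string literal assigned to the (hypothetical) reviewsUrl key.
URL_PATTERN = re.compile(r'"reviewsUrl"\s*:\s*"([^"]+)"')


def find_reviews_url(html):
    """Return the URL found in the inline script, or None if it is absent."""
    match = URL_PATTERN.search(html)
    return match.group(1) if match else None


print(find_reviews_url(SAMPLE_HTML))
```

In a Scrapy callback, the extracted URL would be passed to a new `scrapy.Request`, and the same step repeated for each further "page" of the infinite scroll.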

scrapy pagination without href

I created a spider that takes the information from the table below, but I cannot switch to the previous table because the button does not have an "href". How do I do this?
https://br.soccerway.com/teams/italy/as-roma/1241/
previous button without href
<a rel="previous" class="previous " id="page_team_1_block_team_matches_summary_7_previous">« anterior</a>
If you look at the network inspector in your browser, you can see an XHR request being made when you click the next button.
That request returns a JSON response containing the HTML changes.
You need to reverse engineer how the page generates the URL of that request:
https://br.soccerway.com/a/block_team_matches_summary?block_id=page_team_1_block_team_matches_summary_7&callback_params=%7B%22page%22%3A0%2C%22bookmaker_urls%22%3A%7B%2213%22%3A%5B%7B%22link%22%3A%22http%3A%2F%2Fwww.bet365.com%2Fhome%2F%3Faffiliate%3D365_371546%22%2C%22name%22%3A%22Bet%20365%22%7D%5D%7D%2C%22block_service_id%22%3A%22team_summary_block_teammatchessummary%22%2C%22team_id%22%3A1241%2C%22competition_id%22%3A0%2C%22filter%22%3A%22all%22%2C%22new_design%22%3Afalse%7D&action=changePage&params=%7B%22page%22%3A1%7D
And then you can use that to retrieve following pages.
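As a sketch of that reverse engineering: URL-decoding the query string above shows that `callback_params` and `params` are JSON objects, and that `params` carries the page number. The code below rebuilds the URL for an arbitrary page. The function name `matches_page_url` is hypothetical, the `bookmaker_urls` entry is left out for brevity, and whether the server accepts the request without it (and which page values are valid) is an assumption to check against the live site.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://br.soccerway.com/a/block_team_matches_summary"

# Values copied from the decoded query string (bookmaker_urls omitted).
CALLBACK_PARAMS = {
    "page": 0,
    "block_service_id": "team_summary_block_teammatchessummary",
    "team_id": 1241,
    "competition_id": 0,
    "filter": "all",
    "new_design": False,
}


def matches_page_url(page):
    """Build the XHR URL that asks the block to change to the given page."""
    query = {
        "block_id": "page_team_1_block_team_matches_summary_7",
        "callback_params": json.dumps(CALLBACK_PARAMS, separators=(",", ":")),
        "action": "changePage",
        "params": json.dumps({"page": page}, separators=(",", ":")),
    }
    return f"{ENDPOINT}?{urlencode(query)}"


print(matches_page_url(1))
```

Since the response is JSON containing the new HTML, it would need to be decoded with `json.loads` before the table markup can be handed to a selector.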

htaccess Redirecting or Rewriting to a form results page with variables intact

I am building a website which calls for a page selector on product search results. The page selector currently adds a forward slash and a number (representing the page) to the end of the current URL.
e.g. If I am browsing Washing Machines on "/laundry/Washing-Machines" and I click page 2 on the selector it takes me to "/laundry/Washing-Machines/2" and page 2 loads, this is working fine.
Now, the problem I am having...
I have a form in the sidebar where the user can filter Range Cooker search results by brand, fuel type, size and colour. The form gathers the products from the database that meet the search criteria, and displays the results alongside the page selector.
If I leave the form values as default and submit the form I am presented with the results on "/cooking/Range-Cookers/Search?brand=0&type=0&size=0&colour=0" but when I click page 2 on the selector I am taken to "cooking/Range-Cookers/2" which presents me with a 404. If I add "&page=2" to the end of the original URL I am presented with page 2.
Since the page selector is a php include and works fine for every product except the results from my Range Cooker form, I would rather find a solution that leaves the selector php intact.
Is there any way I can add a redirect to .htaccess which would take a link from my page selector e.g. "cooking/Range-Cookers/5" and correctly apply it to the current URL with all form variables intact e.g. "cooking/Range-Cookers/Search?brand=1&type=2&size=0&colour=0&page=5"?
I have experience in HTML, CSS and PHP, but I am new to editing .htaccess and would appreciate any insight into how I can accomplish this. Thanks.
You cannot do this with .htaccess, because the information is not available when the request hits Apache (or .htaccess, for that matter).
When you click the link for page 2, the client requests the URL in the associated href attribute. It doesn't provide any other information available on the current page. If you want this information transmitted, you must modify the link for page 2 from
cooking/Range-Cookers/2
to
cooking/Range-Cookers/Search?brand=1&type=2&size=0&colour=0&page=2
when you deliver the page to the client. Same goes for any other information you need for following pages.

Resources