Scraping infinite scrolling pages using scrapy - python-3.x

I need help scraping an infinite-scrolling page. For now I have hard-coded pageNumber=100, which gets me the names from 100 pages.
But I want to crawl all pages to the end. The page uses infinite scrolling, and being new to Scrapy I haven't been able to do this. I have been trying for the past 2 days.
class StorySpider(scrapy.Spider):
    name = 'story-spider'
    start_urls = ['https://www.storytel.com/in/en/categories/3-Crime?pageNumber=100']

    def parse(self, response):
        for quote in response.css('div.gridBookTitle'):
            item = {
                'name': quote.css('a::attr(href)').extract_first()
            }
            yield item
The original link is https://www.storytel.com/in/en/categories/1-Children. I see that the pageNumber variable is inside a script tag, in case that helps to find the solution.
Any help would be appreciated. Thanks in advance!!

If you search for the <link rel="next" href=''> element with XPath,
you will find the pagination option. With its help you can add the pagination code.
Here is an example of the pagination pattern:
next_page = response.xpath('//link[@rel="next"]/@href').get()
if next_page:
    next_page_url = response.urljoin(next_page)
    yield scrapy.Request(next_page_url, callback=self.parse)
Hope this helps.
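If the page does not expose a <link rel="next"> element, an alternative sketch is to keep incrementing the pageNumber parameter from the question until a page comes back empty. The selectors and URL are taken from the question; the empty-page stop condition is an assumption about the site:
import scrapy


class StorySpider(scrapy.Spider):
    name = 'story-spider'
    base_url = 'https://www.storytel.com/in/en/categories/3-Crime?pageNumber={}'
    start_urls = [base_url.format(1)]

    def parse(self, response):
        books = response.css('div.gridBookTitle')
        for quote in books:
            yield {'name': quote.css('a::attr(href)').extract_first()}

        # Keep incrementing pageNumber until a page comes back with no book entries
        # (assumed to mean the end of the listing has been reached).
        if books:
            current_page = int(response.url.split('pageNumber=')[-1])
            yield scrapy.Request(self.base_url.format(current_page + 1), callback=self.parse)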

Related

How to collect URL links for pages that are not numerically ordered

When URLs are in numeric order, it's simple to fetch all the articles on a given website.
However, when we have a website such as https://mongolia.mid.ru/en_US/novosti where there are articles with URLs like
https://mongolia.mid.ru/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/10-iula-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-i-ministra-inostrannyh-del-mongolii-n-enhtajv?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
How do I fetch all the article URLs on this website, where there's no numeric order whatsoever?
There's order to that chaos.
If you take a good look at the source code you'll surely notice the next button. If you click it and inspect the url (it's long, I know) you'll see there's a value at the very end of it - _cur=1. This is the number of the current page you're at.
The problem, however, is that you don't know how many pages there are, right? But, you can programmatically keep checking for a url in the next button and stop when there are no more pages to go to.
Meanwhile, you can scrape for article urls while you're at the current page.
Here's how to do it:
import requests
from lxml import html

url = "https://mongolia.mid.ru/en_US/novosti"
next_page_xpath = '//*[@class="pager lfr-pagination-buttons"]/li[2]/a/@href'
article_xpath = '//*[@class="title"]/a/@href'


def get_page(url):
    return requests.get(url).content


def extractor(page, xpath):
    return html.fromstring(page).xpath(xpath)


def head_option(values):
    return next(iter(values), None)


articles = []
while True:
    page = get_page(url)
    print(f"Checking page: {url}")
    articles.extend(extractor(page, article_xpath))
    next_page = head_option(extractor(page, next_page_xpath))
    if next_page == 'javascript:;':
        break
    url = next_page

print(f"Scraped {len(articles)}.")
# print(articles)
This gets you 216 article urls. If you want to see the article urls, just uncomment the last line - # print(articles)
Here's a sample of 2:
['https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/24-avgusta-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-ministrom-energetiki-mongolii-n-tavinbeh?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1', 'https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/19-avgusta-2020-goda-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-zamestitelem-ministra-inostran?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1']

Having trouble with a scrapy script (selecting links)

I am using Scrapy and am having trouble with my script. It works fine in the shell:
scrapy shell "www.redacted.com"
There I use response.xpath("//li[@a data-urltype()"]).extract and am able to scrape 200 or so links from the page.
Here is the code from the webpage I am trying to scrape:
<a data-urltype="/view" data-mce-href="http://www.redacted.aspx?ID=xxxxxxxxxx" data-linktype="external" href="http://www.redacted.com/Home/wfContent.aspx?xxxxxxxxxxxxx" data-val="http://www.redacted.gov/Home/wfContent.aspx?xxxxxxxxxxxx" target="_blank">link text</a>
My problem is with the script (posted below). I know the "a data-val" is wrong.
import scrapy
from ..items import LinkscrapeItem


class Linkscrape(scrapy.Spider):
    name = 'lnkscrapespider'
    start_urls = [
        'https://www.redacted.com'
    ]

    def parse(self, response):
        items = LinkscrapeItem()
        links = response.xpath("a data-val").xpath.extract()
        for links in links:
            items['links'] = links
            yield {
                'links': links
            }
You don't need to use .xpath() twice:
links = response.xpath("//li/a/@data-val").extract()
# or
links = response.xpath("//li/a/@data-val").getall()
Also, the code below doesn't make sense (maybe you need for link in links?):
for links in links:
    items['links'] = links
    yield {
        'links': links
    }
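A corrected version of that loop might look like this (a sketch, assuming LinkscrapeItem has a links field as in the question):
def parse(self, response):
    links = response.xpath("//li/a/@data-val").getall()
    for link in links:
        # build one item per extracted link instead of overwriting the same variable
        item = LinkscrapeItem()
        item['links'] = link
        yield item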
If you are going to scrape data-val from the a tag, use the xpath below:
links = response.xpath("//li/a/@data-val").extract()

How to get a Scrapy request to go to the last page of the website?

I just need to make a Scrapy request that requests the last page of the website.
I can't create a Scrapy request that goes to the last page. I have tried the code below.
last_page = response.css('li.next a::attr(href)').get()
if next_page is None:
    yield scrapy.Request(last_page, callback=self.parse)
The crawler is expected to go straight to the last page; from there I would do some manipulations.
I believe the way to go would be to inspect the source code to find the "Next" page link and use this function in parse:
current_page = # current page link
next_page = # the "Next" link scraped with a css selector
if next_page is None:
    yield response.follow(current_page, callback=self.manipulation)

def manipulation(self, response):
    # your code here
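A more concrete sketch of that idea: keep following the "Next" link from parse, and only hand the response over to a manipulation callback once no "Next" link is left (the li.next a selector is an assumption; adapt it to the actual site):
def parse(self, response):
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        # Not on the last page yet: keep following the "Next" link.
        yield response.follow(next_page, callback=self.parse)
    else:
        # No "Next" link left, so this response is the last page.
        yield from self.manipulation(response)

def manipulation(self, response):
    # do your manipulations on the last page here, yielding items
    yield {'last_page_url': response.url}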

How to make "tri-directional" scraping?

I want to scrape a business directory with Scrapy and Python 3.
You know the concept of "bi-directional" scraping:
1st direction => scraping the URLs of the item detail pages (detail page of business A, detail page of business B, etc.) displayed on one page of results.
2nd direction => scraping the URLs of the results pagination (page 1, page 2, page 3, etc.).
I guess you understand this stuff.
The business directory website I want to scrape has a 3rd "direction": all the businesses I want to scrape are organised by ALPHABET.
I need to click on an ALPHABET letter to get thousands of businesses displayed across several pages of pagination (please see the pictures attached for better understanding).
So I added the alphabet urls by hand in the start_urls. But it didn't work. Have a look at my code:
class AnnuaireEntreprisesSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    # here I added the list of ALPHABET pages, each of which contains several pages of results per letter
    start_urls = ['http://www.example.com/entreprises-0-9.html',
                  'http://www.example.com/entreprises-a.html',
                  'http://www.example.com/entreprises-b.html',
                  'http://www.example.com/entreprises-c.html',
                  'http://www.example.com/entreprises-d.html',
                  'http://www.example.com/entreprises-e.html',
                  'http://www.example.com/entreprises-f.html',
                  'http://www.example.com/entreprises-g.html',
                  'http://www.example.com/entreprises-h.html',
                  'http://www.example.com/entreprises-i.html',
                  'http://www.example.com/entreprises-j.html',
                  'http://www.example.com/entreprises-k.html',
                  'http://www.example.com/entreprises-l.html',
                  'http://www.example.com/entreprises-m.html',
                  'http://www.example.com/entreprises-n.html',
                  'http://www.example.com/entreprises-o.html',
                  'http://www.example.com/entreprises-p.html',
                  'http://www.example.com/entreprises-q.html',
                  'http://www.example.com/entreprises-r.html',
                  'http://www.example.com/entreprises-s.html',
                  'http://www.example.com/entreprises-t.html',
                  'http://www.example.com/entreprises-u.html',
                  'http://www.example.com/entreprises-v.html',
                  'http://www.example.com/entreprises-w.html',
                  'http://www.example.com/entreprises-x.html',
                  'http://www.example.com/entreprises-y.html',
                  'http://www.example.com/entreprises-z.html'
                  ]

    def parse(self, response):
        urls = response.xpath("//a[@class='btn-fiche dcL ']/@href").extract()
        for url in urls:
            # here I scrape the urls of the business detail pages
            absolute_url = response.urljoin(url)
            print('Voici absolute url :' + absolute_url)
            yield Request(absolute_url, callback=self.parse_startup)

        next_page = response.xpath("//a[@class='nextPages']/@href").get() or ''
        if next_page:
            # here I scrape the pagination urls
            absolute_next_page = response.urljoin(next_page)
            print('Voici absolute url NEXT PAGE :' + absolute_next_page)
            yield response.follow(next_page, callback=self.parse)

    def parse_startup(self, response):
        print("Parse_startup details!!!")
        # and here I scrape the details of the business
I am a beginner who started to learn Scrapy a few weeks ago.
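One way to avoid listing the letter pages by hand is to generate start_urls programmatically, for example (a sketch, assuming the example.com URL pattern shown above holds for every letter):
import string

import scrapy


class AnnuaireEntreprisesSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    # the "0-9" page plus one page per letter of the alphabet
    start_urls = ['http://www.example.com/entreprises-0-9.html'] + [
        'http://www.example.com/entreprises-{}.html'.format(letter)
        for letter in string.ascii_lowercase
    ]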

Scraping 'next' page after finishing in the main one using Rules

I'm trying to make a spider that scrapes products from a page and, when finished, scrapes the next page of the catalog, the one after that, and so on.
I got all the products from a page (I'm scraping amazon) with
rules = {
    Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[contains(@class, "a-link-normal") and contains(@class, "a-text-normal")]')),
         callback='parse_item', follow=False)
}
And that works just fine. The problem is that I should go to the 'next' page and keep scraping.
What I tried to do is a rule like this
rules = {
    # Next Button
    Rule(LinkExtractor(allow=(), restrict_xpaths=('(//li[@class="a-normal"]/a/@href)[2]'))),
}
The problem is that the XPath returns (for example, from this page: https://www.amazon.com/s?k=mac+makeup&lo=grid&page=2&crid=2JQQNTWC87ZPV&qid=1559841911&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_2)
/s?k=mac+makeup&lo=grid&page=3&crid=2JQQNTWC87ZPV&qid=1559841947&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_3
Which would be the URL for the next page but without the www.amazon.com.
I think that my code is not working because I'm missing the www.amazon.com before the url above.
Any idea how to make this work? Maybe the approach I took is not the right one.
Try using urljoin.
link = "/s?k=mac+makeup&lo=grid&page=3&crid=2JQQNTWC87ZPV&qid=1559841947&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_3"
new_link = response.urljoin(link)
The following spider is a possible solution. The main idea is to use the parse_links function to collect the links to the individual product pages and yield requests for them to the parse function; it also yields a request for the next page back to parse_links until you've crawled through all the pages.
class AmazonSpider(scrapy.Spider):
    start_urls = ['https://www.amazon.com/s?k=mac+makeup&lo=grid&crid=2JQQNTWC87ZPV&qid=1559870748&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_1']
    wrapper_xpath = '//*[@id="search"]/div[1]/div[2]/div/span[3]/div[1]/div'  # Product wrapper
    link_xpath = './/div/div/div/div[2]/div[2]/div/div[1]/h2/a/@href'  # Link xpath
    np_xpath = '(//li[@class="a-normal"]/a/@href)[2]'  # Next page xpath

    def parse_links(self, response):
        for li in response.xpath(self.wrapper_xpath):
            link = li.xpath(self.link_xpath).extract_first()
            link = response.urljoin(link)
            yield scrapy.Request(link, callback=self.parse)

        next_page = response.xpath(self.np_xpath).extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse_links)
        else:
            print("next_page is none")
