How to make "tri-directional" scraping? - python-3.x

I want to scrape a business directory with Scrapy and Python 3.
You probably know the concept of "bi-directional" scraping:
1st direction => scraping the URLs of the detail pages of each item (detail page of business A, detail page of business B, etc.) listed on one results page.
2nd direction => scraping the pagination URLs of the results pages (page 1, page 2, page 3, etc.).
The business directory I want to scrape has a 3rd "direction": all the businesses on this website are organised by ALPHABET.
I need to click on a letter to get thousands of businesses spread over several paginated results pages (please see the attached pictures for a better understanding).
So I added the alphabet URLs by hand to start_urls, but it didn't work. Have a look at my code:
import scrapy
from scrapy import Request

class AnnuaireEntreprisesSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    # Here I added by hand the list of ALPHABET pages, each of which contains several pages of results per letter
    start_urls = ['http://www.example.com/entreprises-0-9.html',
                  'http://www.example.com/entreprises-a.html',
                  'http://www.example.com/entreprises-b.html',
                  'http://www.example.com/entreprises-c.html',
                  'http://www.example.com/entreprises-d.html',
                  'http://www.example.com/entreprises-e.html',
                  'http://www.example.com/entreprises-f.html',
                  'http://www.example.com/entreprises-g.html',
                  'http://www.example.com/entreprises-h.html',
                  'http://www.example.com/entreprises-i.html',
                  'http://www.example.com/entreprises-j.html',
                  'http://www.example.com/entreprises-k.html',
                  'http://www.example.com/entreprises-l.html',
                  'http://www.example.com/entreprises-m.html',
                  'http://www.example.com/entreprises-n.html',
                  'http://www.example.com/entreprises-o.html',
                  'http://www.example.com/entreprises-p.html',
                  'http://www.example.com/entreprises-q.html',
                  'http://www.example.com/entreprises-r.html',
                  'http://www.example.com/entreprises-s.html',
                  'http://www.example.com/entreprises-t.html',
                  'http://www.example.com/entreprises-u.html',
                  'http://www.example.com/entreprises-v.html',
                  'http://www.example.com/entreprises-w.html',
                  'http://www.example.com/entreprises-x.html',
                  'http://www.example.com/entreprises-y.html',
                  'http://www.example.com/entreprises-z.html'
                  ]

    def parse(self, response):
        urls = response.xpath("//a[@class='btn-fiche dcL ']/@href").extract()
        for url in urls:
            # Here I scrape the URLs of the business detail pages
            absolute_url = response.urljoin(url)
            print('Here is the absolute url: ' + absolute_url)
            yield Request(absolute_url, callback=self.parse_startup)

        next_page = response.xpath("//a[@class='nextPages']/@href").get() or ''
        if next_page:
            # Here I scrape the pagination URLs
            absolute_next_page = response.urljoin(next_page)
            print('Here is the absolute url of the NEXT PAGE: ' + absolute_next_page)
            yield response.follow(next_page, callback=self.parse)

    def parse_startup(self, response):
        print("Parse_startup details!!!")
        # and here I scrape the details of the business
I am a beginner who started to learn Scrapy a few weeks ago.
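For illustration, a minimal sketch of how the three "directions" could fit together, generating the alphabet start URLs programmatically instead of listing them by hand; the URL pattern and the CSS classes are the (assumed) ones from the question:
import string

import scrapy


class AlphabetDirectorySpider(scrapy.Spider):
    """Sketch only: URL pattern and selectors are assumptions taken from the question."""
    name = 'alphabet_example'
    allowed_domains = ['example.com']

    def start_requests(self):
        # 3rd direction: one request per alphabet section (plus the 0-9 page)
        sections = ['0-9'] + list(string.ascii_lowercase)
        for section in sections:
            yield scrapy.Request(
                f'http://www.example.com/entreprises-{section}.html',
                callback=self.parse,
            )

    def parse(self, response):
        # 1st direction: detail pages listed on the current results page
        for href in response.xpath("//a[@class='btn-fiche dcL ']/@href").getall():
            yield response.follow(href, callback=self.parse_detail)

        # 2nd direction: pagination within the current letter
        next_page = response.xpath("//a[@class='nextPages']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        # The business details would be extracted here
        yield {'url': response.url}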

Related

How to collect URL links for pages that are not numerically ordered

When URLs are ordered numerically, it's simple to fetch all the articles on a given website.
However, when we have a website such as https://mongolia.mid.ru/en_US/novosti where there are articles with URLs like
https://mongolia.mid.ru/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/10-iula-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-i-ministra-inostrannyh-del-mongolii-n-enhtajv?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
How do I fetch all the article URLs on this website, where there is no numeric ordering whatsoever?
There's order to that chaos.
If you take a good look at the source code you'll surely notice the next button. If you click it and inspect the url (it's long, I know) you'll see there's a value at the very end of it - _cur=1. This is the number of the current page you're at.
The problem, however, is that you don't know how many pages there are, right? But, you can programmatically keep checking for a url in the next button and stop when there are no more pages to go to.
Meanwhile, you can scrape for article urls while you're at the current page.
Here's how to do it:
import requests
from lxml import html

url = "https://mongolia.mid.ru/en_US/novosti"
next_page_xpath = '//*[@class="pager lfr-pagination-buttons"]/li[2]/a/@href'
article_xpath = '//*[@class="title"]/a/@href'


def get_page(url):
    return requests.get(url).content


def extractor(page, xpath):
    return html.fromstring(page).xpath(xpath)


def head_option(values):
    return next(iter(values), None)


articles = []
while True:
    page = get_page(url)
    print(f"Checking page: {url}")
    articles.extend(extractor(page, article_xpath))
    next_page = head_option(extractor(page, next_page_xpath))
    if next_page == 'javascript:;':
        break
    url = next_page

print(f"Scraped {len(articles)}.")
# print(articles)
This gets you 216 article urls. If you want to see the article urls, just uncomment the last line - # print(articles)
Here's a sample of 2:
['https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/24-avgusta-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-ministrom-energetiki-mongolii-n-tavinbeh?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1', 'https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/19-avgusta-2020-goda-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-zamestitelem-ministra-inostran?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1']
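Since the rest of this page is about Scrapy, here is a rough sketch of the same stop-when-next-is-'javascript:;' logic as a Scrapy spider; the XPath selectors come from the answer above, while the spider scaffolding is assumed:
import scrapy


class NovostiSpider(scrapy.Spider):
    """Sketch of the same pagination logic expressed as a Scrapy spider."""
    name = 'novosti'
    start_urls = ['https://mongolia.mid.ru/en_US/novosti']

    def parse(self, response):
        # Collect article URLs on the current page
        for href in response.xpath('//*[@class="title"]/a/@href').getall():
            yield {'url': href}

        # Follow the "next" button until it degrades to 'javascript:;'
        next_page = response.xpath(
            '//*[@class="pager lfr-pagination-buttons"]/li[2]/a/@href').get()
        if next_page and next_page != 'javascript:;':
            yield response.follow(next_page, callback=self.parse)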

Scraping infinite scrolling pages using scrapy

I want help in scraping infinite scrolling pages. For now, I have entered pageNumber = 100, which helps me in getting the name from 100 pages.
But I want to crawl all the pages to the end. Since the page has infinite scrolling and I am new to Scrapy, I have not been able to do this. I have been trying for the past 2 days.
class StorySpider(scrapy.Spider):
    name = 'story-spider'
    start_urls = ['https://www.storytel.com/in/en/categories/3-Crime?pageNumber=100']

    def parse(self, response):
        for quote in response.css('div.gridBookTitle'):
            item = {
                'name': quote.css('a::attr(href)').extract_first()
            }
            yield item
The original link is https://www.storytel.com/in/en/categories/1-Children. I see that the pageNumber variable is inside a script tag, if that helps to find the solution.
Any help would be appreciated. Thanks in advance!!
If you look for an element like <link rel="next" href=''> in the page source,
you will find the pagination link. With its XPath you can add the pagination code.
Here is an example of handling the pagination:
next_page = response.xpath('//link[@rel="next"]/@href').get()
if next_page:
    next_page_url = response.urljoin(next_page)
    yield scrapy.Request(next_page_url, callback=self.parse)
I hope it helps.
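Putting the question and the answer together, a sketch of the spider with the suggested <link rel="next"> pagination added; whether Storytel actually emits that element is an assumption, and the item selector is the one from the question:
import scrapy


class StorySpider(scrapy.Spider):
    """Sketch: the question's selectors plus the suggested <link rel="next"> pagination."""
    name = 'story-spider'
    start_urls = ['https://www.storytel.com/in/en/categories/1-Children']

    def parse(self, response):
        # Items on the current page (same selector as in the question)
        for quote in response.css('div.gridBookTitle'):
            yield {'name': quote.css('a::attr(href)').get()}

        # Follow the next page, if the page advertises one
        next_page = response.xpath('//link[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)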

How to get a Scrapy request to go to the last page of the website?

I just need to make a Scrapy request for the last page of the website.
I can't create a Scrapy request that goes to the last page. I have tried the code below.
last_page = response.css('li.next a::attr(href)').get()
if next_page is None:
    yield scrapy.Request(last_page, callback=self.parse)
The crawler is expected to go straight to the last page; from there I would do some manipulations.
I believe the way to go would be to inspect the page source to find the "Next" page link and use logic like this in parse:
current_page = response.url  # link of the page currently being parsed
next_page = response.css('li.next a::attr(href)').get()  # scraping the "Next" link with a CSS selector
if next_page is None:
    # no "Next" link means this is the last page
    yield response.follow(current_page, callback=self.manipulation)

def manipulation(self, response):
    # your code here
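If the pagination bar links to the last page directly, the page-by-page walk can be skipped; a minimal sketch, where the 'li.last' selector and the start URL are purely assumptions about the site's markup:
import scrapy


class LastPageSpider(scrapy.Spider):
    """Sketch: jump straight to the last page if the pagination bar links to it."""
    name = 'last_page_example'
    start_urls = ['http://www.example.com/listing.html']  # placeholder URL

    def parse(self, response):
        # 'li.last' is an assumed class name; inspect the real pagination markup
        last_page = response.css('li.last a::attr(href)').get()
        if last_page:
            yield response.follow(last_page, callback=self.manipulation)

    def manipulation(self, response):
        # manipulations on the last page go here
        yield {'last_page_url': response.url}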

Going through specified items in a page using scrapy

I'm running into some trouble trying to enter and analyze several items within a page.
I have a certain page which contains items; the code looks something like this:
class Spider(CrawlSpider):
    name = 'spider'
    maxId = 20
    allowed_domain = ['www.domain.com']
    start_urls = ['http://www.startDomain.com']
In the start URL, I have some items that all follow, in XPath, the path used below (within the startDomain):
def start_requests(self):
    for i in range(self.maxId):
        yield Request('//*[@id="result_{0}"]/div/div/div/div[2]/div[1]/div[1]/a/h2'.format(i), callback=self.parse_item)
I'd like to find a way to access each of these links (the ones tied to result_{number}) and then scrape the contents of that particular item.
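A Scrapy Request needs a URL, not an XPath, so one option is to extract the href behind each result_{i} block on the start page and follow it. A sketch, reusing the question's XPath up to the <a> element; everything else is assumed:
import scrapy


class ItemsSpider(scrapy.Spider):
    """Sketch: extract and follow the links behind each result_{i} block."""
    name = 'items_example'
    maxId = 20
    start_urls = ['http://www.startDomain.com']

    def parse(self, response):
        for i in range(self.maxId):
            # Same path as in the question, but stopping at the <a> to read its href
            href = response.xpath(
                '//*[@id="result_{0}"]/div/div/div/div[2]/div[1]/div[1]/a/@href'.format(i)
            ).get()
            if href:
                yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # Scrape the contents of the individual item here
        yield {'url': response.url}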

Python web crawler doesn't crawl all pages

I'm trying to make a web crawler that crawls a set number of pages, but it only crawls the first page, and prints it as many times as the number of pages I want to crawl.
def web_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.forbes.com/global2000/list/#page:' + str(page) + '_sort:0_direction:asc_search:_filter:All%20industries_' \
              'filter:All%20countries_filter:All%20states'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a'):
            if link.parent.name == 'td':
                href = link.get('href')
                x = href[11:len(href)-1]
                company_list.append(x)
        page += 1
        print(page)
    return company_list
Edit: Did it another way.
In case you want the dataset, you can use your browser's developer tools to find which network resources are used: click Record network traffic and refresh the page to see how the table is populated. In this case I found the following URL:
https://www.forbes.com/forbesapi/org/global2000/2020/position/true.json?limit=2000
Does that help you?
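For example, a minimal sketch that fetches that JSON endpoint with requests; the structure of the response is not documented here, so the code only inspects it before any extraction:
import requests

# Endpoint found via the browser's network tab (see the URL above)
URL = "https://www.forbes.com/forbesapi/org/global2000/2020/position/true.json?limit=2000"

# A browser-like User-Agent is an assumption; some sites reject the default one
resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()
data = resp.json()

# The key layout of the JSON is not shown in the answer, so inspect it first
# and then drill down to the list of companies accordingly.
print(list(data.keys()) if isinstance(data, dict) else type(data))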
