Get results of Scrapy spiders in variable - python-3.x

I'm trying to run a Scrapy spider and an SDK call to another resource inside Django. The main idea is to collect the results from both of them in one list once they are ready and output it to a view. The SDK works synchronously, so there are no issues there. But I could not get the results from the spider. Could anyone point me to the correct solution?
My code to run the parsers looks like this:
from multiprocessing import Process

from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks, returnValue

# Parser, Parser1 and BASE_PATH come from the surrounding project (not shown here).

class scrapyParser(Parser):
    def __init__(self, keywords=None, n_items=None):
        super().__init__(keywords, n_items)

    def parse(self):
        result = []
        if not super().parse():
            return False

        crawler = UrlCrawlerScript(Parser1, result, [BASE_PATH + self.keywords])
        crawler.start()
        crawler.join()

        print(crawler.outputResponse)
        return result[:self.n_items]


class UrlCrawlerScript(Process):
    def __init__(self, spider, result, urls):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(spider, settings=settings)
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider
        self.urls = urls
        self.outputResponse = result

    @inlineCallbacks
    def cycle_run(self):
        yield self.crawler.crawl(Parser1, outputResponse=self.outputResponse, start_urls=self.urls)
        returnValue(self.outputResponse)

    def run(self):
        result = self.cycle_run()
        result.addCallback(print)
        reactor.run()
The spider code is very simple and follows this template:
import scrapy

class Parser1(scrapy.Spider):
    name = 'items'
    allowed_domains = ['domain.com']

    def parse(self, response):
        ...
        # parsing the page
        for item in row_data:
            scraped_info = {
                ...
            }
            self.outputResponse.append(scraped_info)
So I could not get anything in the output of parse; it returns an empty list. However, I'm at the very beginning of my way with async calls in Python and the Twisted framework, so it's highly possible I just messed something up.

After trying a lot of different code snippets and looking through SO answers, I finally found an easy and elegant solution: using scrapyscript.
from scrapyscript import Job, Processor

class scrapyParser(Parser):
    def __init__(self, keywords=None, n_items=None):
        super().__init__(keywords, n_items)

    def parse(self):
        result = []
        if not super().parse():
            return False

        processor = Processor(settings=None)
        job1 = Job(Parser1, url=URL1 + self.keywords)
        job2 = Job(Parser2, url=URL2 + self.keywords)
        return processor.run([job1, job2])
Source: https://stackoverflow.com/a/62902603/1345788
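For completeness, here is a minimal, self-contained sketch of the scrapyscript pattern (the DemoSpider class and URL below are hypothetical placeholders, not part of the code above): whatever the spiders yield is collected and handed back as a plain list by processor.run(), which is what makes the results available in a variable.

import scrapy
from scrapyscript import Job, Processor

class DemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the pattern.
    name = 'demo'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Anything yielded here ends up in the list returned by processor.run().
        yield {'title': response.css('title::text').get()}

processor = Processor(settings=None)
results = processor.run([Job(DemoSpider)])  # blocks until the spider finishes
print(results)  # a list of the scraped dicts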

Related

PyTest how to properly mock imported ContextManager class and its function?

This is my sample code:
from path.lib import DBInterface

class MyClass:
    def __init__(self):
        self.something = "something"

    def _my_method(self, some_key, new_setup):
        with DBInterface(self.something) as ic:
            current_setup = ic.get(some_key)
        if current_setup != new_setup:
            with DBInterface(self.something) as ic:
                ic.set(new_setup)

    def public_method(self, some_key, new_setup):
        return self._my_method(some_key, new_setup)
(My actual code is a bit more complex, but I can't post it here publicly. :)
Now, what I want to do is completely mock the imported DBInterface class, because I do not want my unit tests to touch the DB at all.
BUT I also need ic.get(some_key) to return some value, or, more precisely, I need to set the value it returns, because that's the point of my unit tests: to check whether the method behaves properly according to the value returned from the DB.
This is how far I got:
class TestMyClass:
    def test_extractor_register(self, mocker):
        fake_db = mocker.patch.object('my_path.my_lib.DBInterface')
        fake_db.get.return_value = None
        # spy_obj = mocker.spy(MyClass, "_my_method")
        test_class = MyClass()
        # Test new registration in _extractor_register
        result = test_class.public_method(Tconf.test_key, Tconf.test_key_setup)
        fake_db.assert_has_calls([call().__enter__().get(Tconf.test_key),
                                  call().__enter__().set(Tconf.test_key, Tconf.test_key_setup)])
        # spy_obj.assert_called_with(ANY, Tconf.test_key, Tconf.test_key_setup)
        assert result.result_status.status_code == Tconf.status_ok.status_code
        assert result.result_data == MyMethodResult.new_reg
But I am unable to set a return value for call().__enter__().get(Tconf.test_key).
I have been trying many approaches:
fake_db.get.return_value = None
fake_db.__enter__().get.return_value = None
fake_db.__enter__.get = Mock(return_value=None)
mocker.patch.object(MyClass.DBInterface, "get").return_value = None
None of these actually works, and I am running out of options I can think of.
Without having more code or errors that are being produced, it's tough to provide a conclusive answer.
However, if you truly only need to specify a return value for set(), I would recommend using MagicMock via patch:
from unittest.mock import patch

@patch("<MyClassFile>.DBInterface", autospec=True)
def test_extractor_register(mock_db):
    mock_db.set.return_value = "some key"
    # Rest of test code
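If the goal is specifically to control what ic.get(...) returns inside the with block, the value has to be attached to the mock's __enter__ return value, because that is the object the "as ic" clause binds to. A minimal sketch, assuming DBInterface is patched in the module where MyClass imports it (the my_module path below is a placeholder):

from unittest.mock import patch

from my_module import MyClass  # placeholder import path

@patch("my_module.DBInterface")
def test_get_value_is_configurable(mock_db):
    # DBInterface(...) returns mock_db.return_value; "as ic" is bound to
    # __enter__()'s return value, so get() is configured on that object.
    mock_db.return_value.__enter__.return_value.get.return_value = "stored-setup"

    MyClass().public_method("some_key", "new_setup")

    mock_db.return_value.__enter__.return_value.get.assert_called_once_with("some_key")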

Scrapy doesn't find custom function

I have implemented my own function for excluding URLs which contain certain words. However, when I call it inside my parse method, Scrapy tells me that the function is not defined, even though it is. I didn't use the Rule object since I get the URLs I want to scrape from an API. Here is my setup:
class IbmSpiderSpider(scrapy.Spider):
    ...

    def checkUrlForWords(text):
        ...
        return flag

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('resultset').get('searchresults').get('searchresultlist')
        for result in results:
            url = result.get('url')
            if (checkUrlForWords(url)==True): continue
            yield scrapy.Request(url, self.parse_content, meta={'title': result.get('title')})
Please help
Use self.checkUrlForWords, since this is a method inside a class. Using plain checkUrlForWords will lead to errors. Just add self to the method's parameters and to the call:
def checkUrlForWords(self, text):
    ...
    return flag
Your function is defined inside your class. Use:
IbmSpiderSpider.checkUrlForWords(url)
Your function looks like a static method; you can use the appropriate decorator and call it with self.checkUrlForWords:
class IbmSpiderSpider(scrapy.Spider):
    ...

    @staticmethod
    def checkUrlForWords(text):
        ...
        return flag

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('resultset').get('searchresults').get('searchresultlist')
        for result in results:
            url = result.get('url')
            if (self.checkUrlForWords(url)==True): continue
            yield scrapy.Request(url, self.parse_content, meta={'title': result.get('title')})
You can also define your function outside your class, in the same .py file:
def checkUrlForWords(text):
    ...
    return flag

class IbmSpiderSpider(scrapy.Spider):
    ...

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('resultset').get('searchresults').get('searchresultlist')
        for result in results:
            url = result.get('url')
            if (checkUrlForWords(url)==True): continue
            ....

How to go to following link from a for loop?

I am using Scrapy to scrape a website. I am in a loop where every item has a link, and I want to follow that link on every iteration of the loop.
import scrapy

class MyDomainSpider(scrapy.Spider):
    name = 'My_Domain'
    allowed_domains = ['MyDomain.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        Colums = response.xpath('//*[@id="tab-5"]/ul/li')
        for colom in Colums:
            title = colom.xpath('//*[@class="lng_cont_name"]/text()').extract_first()
            address = colom.xpath('//*[@class="adWidth cont_sw_addr"]/text()').extract_first()
            con_address = address[9:-9]
            url = colom.xpath('//*[@id="tab-5"]/ul/li/@data-href').extract_first()
            print(url)
            print('*********************')
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        print('000000000000000000')
        a = response.xpath('//*[@class="fn"]/text()').extract_first()
        print(a)
I have tried something like this, but the zeros print only once while the stars print 10 times. I want the second function to run every time the loop runs.
You are probably doing something that is not intended. With
url = colom.xpath('//*[@id="tab-5"]/ul/li/@data-href').extract_first()
inside the loop, url always results in the same value. By default, Scrapy filters out duplicate requests (see here). If you really want to scrape the same URL multiple times, you can disable the filtering at the request level with the dont_filter=True argument to the scrapy.Request constructor. However, I think that what you really want is to go like this (only the relevant part of the code is shown):
def parse(self, response):
    Colums = response.xpath('//*[@id="tab-5"]/ul/li')
    for colom in Colums:
        url = colom.xpath('./@data-href').extract_first()
        yield scrapy.Request(url, callback=self.parse_dir_contents)
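If re-scraping the same URL on every iteration really is intended, the request-level override mentioned above is just an extra keyword argument (sketch only):

yield scrapy.Request(url, callback=self.parse_dir_contents, dont_filter=True)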

Scrapy: how to pass links

I cannot pass the links on: when starting the spider, I'm not getting any data. Please help with the code; I'm a beginner in Scrapy.
import scrapy
from movie.items import AfishaCinema

class AfishaCinemaSpider(scrapy.Spider):
    name = 'afisha-cinema'
    allowed_domains = ['kinopoisk.ru']
    start_urls = ['https://www.kinopoisk.ru/premiere/ru/']

    def parse(self, response):
        links = response.css('div.textBlock>span.name_big>a').xpath(
            '@href').extract()
        for link in links:
            yield scrapy.Request(link, callback=self.parse_moov,
                                 dont_filter=True)

    def parse_moov(self, response):
        item = AfishaCinema()
        item['name'] = response.css('h1.moviename-big::text').extract()
The reason you are not getting the data is that you don't yield any items from your parse_moov method. As per the documentation, a parse callback must return an iterable of Requests and/or dicts or Item objects. So add
yield item
at the end of your parse_moov method.
Also, to be able to run your code, I had to modify
yield scrapy.Request(link, callback=self.parse_moov, dont_filter=True)
to
yield scrapy.Request(response.urljoin(link), callback=self.parse_moov, dont_filter=True)
in the parse method, otherwise I was getting errors:
ValueError: Missing scheme in request url: /film/monstry-na-kanikulakh-3-more-zovyot-2018-950968/
(That's because the Request constructor needs an absolute URL, while the page contains relative URLs.)

yield scrapy.Request does not return the title

I am new to Scrapy and am trying to use it to practice crawling websites. However, even though I followed the code provided by the tutorial, it does not return any results. It looks like yield scrapy.Request does not work. My code is as follows:
import scrapy
from bs4 import BeautifulSoup
from apple.items import AppleItem

class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        appleitem = AppleItem()
        appleitem['title'] = res.select('h1')[0].text
        appleitem['content'] = res.select('.trans')[0].text
        appleitem['time'] = res.select('.gggs time')[0].text
        return appleitem
It shows that the spider was opened and closed, but it returns nothing. The Python version is 3.6. Can anyone please help? Thanks.
EDIT I
The crawl log can be reached here.
EDIT II
Maybe changing the code as below will make the issue clearer:
import scrapy
from bs4 import BeautifulSoup

class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        print(res.select('#h1')[0].text)
The code should print out the URL and the title separately, but it does not return anything.
Your log states:
2017-07-10 19:12:47 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.appledaily.com.tw': http://www.appledaily.com.tw/realtimenews/article/life/20170710/1158177/oBike%E7%A6%81%E5%81%9C%E6%A9%9F%E8%BB%8A%E6%A0%BC%E3%80%80%E6%96%B0%E5%8C%97%E7%81%AB%E9%80%9F%E5%86%8D%E5%85%AC%E5%91%8A6%E5%8D%80%E7%A6%81%E5%81%9C>
Your spider is set to:
allowed_domains = ['appledaily.com']
So it should probably be:
allowed_domains = ['appledaily.com.tw']
It seems like the content you are targeting in your parse method (i.e. the list items with class rtddt) is generated dynamically -- it can be inspected, for example, using Chrome, but it is not present in the HTML source (which is what Scrapy obtains as a response).
You will have to use something to render the page for Scrapy first. I would recommend Splash together with the scrapy-splash package.
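As a rough sketch of what the request side could look like once scrapy-splash is installed and configured (the SPLASH_URL and middleware settings from the scrapy-splash README are assumed to be in place; the wait value is arbitrary):

import scrapy
from scrapy_splash import SplashRequest

class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com.tw']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def start_requests(self):
        for url in self.start_urls:
            # Splash renders the JavaScript before the response reaches parse().
            yield SplashRequest(url, callback=self.parse, args={'wait': 2})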
