Scrapy doesn't find custom function - python-3.x

I have implemented my own function for excluding URLs which contain certain words. However, when I call it inside my parse method, Scrapy tells me that the function is not defined, even though it is. I didn't use the Rule object since I get the URLs I want to scrape from an API. Here is my setup:
class IbmSpiderSpider(scrapy.Spider):
    ...

    def checkUrlForWords(text):
        ...
        return flag

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('resultset').get('searchresults').get('searchresultlist')
        for result in results:
            url = result.get('url')
            if (checkUrlForWords(url)==True): continue
            yield scrapy.Request(url, self.parse_content, meta={'title': result.get('title')})
Please help

Use self.checkUrlForWords, since this is a method defined inside the class. Calling plain checkUrlForWords will lead to a NameError. Just add self to both the method's parameters and the call:
def checkUrlForWords(self, text):
    ...
    return flag

Your function is defined inside your class. Use:
IbmSpiderSpider.checkUrlForWords(url)
Your function looks like a static method; you can use the @staticmethod decorator and then call it with self.checkUrlForWords:
class IbmSpiderSpider(scrapy.Spider):
    ...

    @staticmethod
    def checkUrlForWords(text):
        ...
        return flag

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('resultset').get('searchresults').get('searchresultlist')
        for result in results:
            url = result.get('url')
            if (self.checkUrlForWords(url)==True): continue
            yield scrapy.Request(url, self.parse_content, meta={'title': result.get('title')})

You can also define your function outside of your class, in the same .py file:
def checkUrlForWords(text):
    ...
    return flag

class IbmSpiderSpider(scrapy.Spider):
    ...

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('resultset').get('searchresults').get('searchresultlist')
        for result in results:
            url = result.get('url')
            if (checkUrlForWords(url)==True): continue
            ...
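For completeness, a minimal sketch of what such a module-level checkUrlForWords helper could look like. The word list below is a hypothetical placeholder, since the original question never shows the function body:

EXCLUDED_WORDS = ['login', 'signup']  # hypothetical list of words to exclude

def checkUrlForWords(text):
    """Return True if the URL contains any of the excluded words."""
    return any(word in text for word in EXCLUDED_WORDS)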

Related

Get results of Scrapy spiders in variable

I am trying to run a Scrapy spider and an SDK call to another resource inside Django. The main idea is to collect the results from both of them in one list once they are ready and output it to a view. The SDK works synchronously, so there are no issues there. But I could not get the results from the spider. Could anyone point me to the correct solution?
My code to run the parsers looks like this:
# Parser and BASE_PATH come from the asker's own project
from multiprocessing import Process
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks, returnValue

class scrapyParser(Parser):
    def __init__(self, keywords=None, n_items=None):
        super().__init__(keywords, n_items)

    def parse(self):
        result = []
        if not super().parse():
            return False
        crawler = UrlCrawlerScript(Parser1, result, [BASE_PATH + self.keywords])
        crawler.start()
        crawler.join()
        print(crawler.outputResponse)
        return result[:self.n_items]

class UrlCrawlerScript(Process):
    def __init__(self, spider, result, urls):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(spider, settings=settings)
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider
        self.urls = urls
        self.outputResponse = result

    @inlineCallbacks
    def cycle_run(self):
        yield self.crawler.crawl(Parser1, outputResponse=self.outputResponse, start_urls=self.urls)
        returnValue(self.outputResponse)

    def run(self):
        result = self.cycle_run()
        result.addCallback(print)
        reactor.run()
The spider code is very simple and follows this template:
import scrapy

class Parser1(scrapy.Spider):
    name = 'items'
    allowed_domains = ['domain.com']

    def parse(self, response):
        ...
        # parsing page
        for item in row_data:
            scraped_info = {
                ...
            }
            self.outputResponse.append(scraped_info)
So I could not get anything in the output of parse; it returns an empty list. However, I'm at the very beginning of my way with async calls in Python and the Twisted framework, so it's highly possible I just messed something up.
After trying a lot of different code snippets and looking through SO answers, I finally found an easy and elegant solution: using scrapyscript.
class scrapyParser(Parser):
    def __init__(self, keywords=None, n_items=None):
        super().__init__(keywords, n_items)

    def parse(self):
        result = []
        if not super().parse():
            return False
        processor = Processor(settings=None)
        job1 = Job(Parser1, url=URL1 + self.keywords)
        job2 = Job(Parser2, url=URL2 + self.keywords)
        return processor.run([job1, job2])
Source: https://stackoverflow.com/a/62902603/1345788
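For reference, a minimal self-contained sketch of the scrapyscript pattern, assuming the package is installed (pip install scrapyscript). The spider below is a placeholder, and Processor.run is expected to block and return the items the spiders yield:

from scrapy import Spider
from scrapyscript import Job, Processor

class TitleSpider(Spider):
    # hypothetical spider used only to illustrate the Job/Processor flow
    name = 'title_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

if __name__ == '__main__':
    processor = Processor(settings=None)
    items = processor.run([Job(TitleSpider)])  # blocks until the crawl finishes
    print(items)  # list of dicts yielded by the spider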

how to use python decorator with argument?

I would like to define a decorator that will register classes by a name given as an argument of the decorator. I have read many examples on Stack Overflow and elsewhere that show how to write such (tricky) code, but when adapted to my needs my code fails to produce the expected result. Here is the code:
import functools

READERS = {}

def register(typ):
    def decorator_register(kls):
        @functools.wraps(kls)
        def wrapper_register(*args, **kwargs):
            READERS[typ] = kls
        return wrapper_register
    return decorator_register

@register(".pdb")
class PDBReader:
    pass

@register(".gro")
class GromacsReader:
    pass

print(READERS)
This code produces an empty dictionary, while I would expect a dictionary with two entries. Do you have any idea what is wrong with my code?
Taking arguments (via (...)) and decoration (via @) both result in calls of functions. Each "stage" of taking arguments or decoration maps to one call and thus one nested function in the decorator definition. register is a three-stage decorator and takes as many calls to trigger its innermost code. Of these,
the first is the argument ((".pdb")),
the second is the class definition (@... class), and
the third is the class call/instantiation (PDBReader(...)).
This last stage is broken, as it does not instantiate the class.
In order to store the class itself in the dictionary, store it at the second stage. As the instances are not to be stored, remove the third stage.
def register(typ):  # first stage: file extension
    """Create a decorator to register its target for the given `typ`"""
    def decorator_register(kls):  # second stage: Reader class
        """Decorator to register its target `kls` for the previously given `typ`"""
        READERS[typ] = kls
        return kls  # <<< return class to preserve it
    return decorator_register
Take note that the result of a decorator replaces its target. Thus, you should generally return the target itself or an equivalent object. Since in this case the class is returned immediately, there is no need to use functools.wraps.
READERS = {}

def register(typ):  # first stage: file extension
    """Create a decorator to register its target for the given `typ`"""
    def decorator_register(kls):  # second stage: Reader class
        """Decorator to register its target `kls` for the previously given `typ`"""
        READERS[typ] = kls
        return kls  # <<< return class to preserve it
    return decorator_register

@register(".pdb")
class PDBReader:
    pass

@register(".gro")
class GromacsReader:
    pass

print(READERS)  # {'.pdb': <class '__main__.PDBReader'>, '.gro': <class '__main__.GromacsReader'>}
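As a usage sketch for this first answer, here is how such a registry would typically be consumed; the extension-based lookup below is an assumption for illustration, not part of the original answer:

import os

def get_reader(path):
    """Look up and instantiate the reader class registered for the file's extension."""
    ext = os.path.splitext(path)[1]
    return READERS[ext]()  # instantiate the registered class

reader = get_reader("structure.pdb")  # -> a PDBReader instance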
If you don't actually call the code that the decorator is "wrapping", then the "inner" function will not fire, and you will not create an entry inside of READERS. However, even if you create instances of PDBReader or GromacsReader, the values inside of READERS will be the classes themselves, not instances of them.
If you want to do the latter, you have to change wrapper_register to something like this:
def register(typ):
    def decorator_register(kls):
        @functools.wraps(kls)
        def wrapper_register(*args, **kwargs):
            READERS[typ] = kls(*args, **kwargs)
            return READERS[typ]
        return wrapper_register
    return decorator_register
I added a simple __init__/__repr__ inside the classes to visualize it better:
#register(".pdb")
class PDBReader:
def __init__(self, var):
self.var = var
def __repr__(self):
return f"PDBReader({self.var})"
#register(".gro")
class GromacsReader:
def __init__(self, var):
self.var = var
def __repr__(self):
return f"GromacsReader({self.var})"
And then we initialize some objects:
x = PDBReader("Inside of PDB")
z = GromacsReader("Inside of Gromacs")

print(x)        # Output: PDBReader(Inside of PDB)
print(z)        # Output: GromacsReader(Inside of Gromacs)
print(READERS)  # Output: {'.pdb': PDBReader(Inside of PDB), '.gro': GromacsReader(Inside of Gromacs)}
If you don't want to store the initialized object in READERS, however, you will still need to return an initialized object; otherwise, when you try to initialize the object, it will return None.
You can then simply change wrapper_register to:
def wrapper_register(*args, **kwargs):
    READERS[typ] = kls
    return kls(*args, **kwargs)
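With that variant, instantiation still works while the registry maps extensions to the classes rather than to instances; a short check (the variable name is just illustrative):

x = PDBReader("Inside of PDB")   # calling the decorated name registers the class...
print(x)        # ...and still returns an instance: PDBReader(Inside of PDB)
print(READERS)  # {'.pdb': <class '__main__.PDBReader'>} -- the class, not the instance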

How to scrape data from the main listing page as well as the detail page for that particular listing using scrapy

I am crawling a website with property listings, and the "Buy/Rent" offering type is only found on the listing page. I have extracted the other data from the detail page by passing each URL from the parse method to the parse_property method; however, I am not able to get the offering type from the main listing page.
I have tried to do it the same way I parsed the individual URLs (the commented code).
def parse(self, response):
    properties = response.xpath('//div[@class="property-information-address"]/a')
    for property in properties:
        url = property.xpath('./@href').extract_first()
        yield Request(url, callback=self.parse_property, meta={'URL': url})
        # TODO: offering
        # offering=response.xpath('//div[@class="property-status"]')
        # for of in offerings:
        #     offering=of.xpath('./a/text()').extract_first()
        #     yield Request(offering, callback=self.parse_property, meta={'Offering':offering})
    next_page = response.xpath('//div[@class="pagination"]/a/@href')[-2].extract()
    yield Request(next_page, callback=self.parse)

def parse_property(self, response):
    l = ItemLoader(item=NPMItem(), response=response)
    url = response.meta.get('URL')
    #offer=response.meta.get('Offering')
    l.add_value('URL', response.url)
    #l.add_value('Offering', response.offer)
You can try to rely on an element which is higher in the DOM tree and scrape both the property type and the link from there. Check this code example, it works:
def parse(self, response):
    properties = response.xpath('//div[@class="property-listing"]')
    for property in properties:
        url = property.xpath('.//div[@class="property-information-address"]/a/@href').get()
        ptype = property.xpath('.//div[@class="property-status"]/a/text()').get()
        yield response.follow(url, self.parse_property, meta={'ptype': ptype})
    next_page = response.xpath('//link[@rel="next"]/@href').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

def parse_property(self, response):
    print('======')
    print(response.meta['ptype'])
    print('======')
    # build your item here, printing is only to show content of `ptype`
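As a side note, in Scrapy 1.7+ cb_kwargs is generally preferred over meta for passing data to callbacks; a hedged sketch of the same idea (selectors unchanged from the answer above):

def parse(self, response):
    for property in response.xpath('//div[@class="property-listing"]'):
        url = property.xpath('.//div[@class="property-information-address"]/a/@href').get()
        ptype = property.xpath('.//div[@class="property-status"]/a/text()').get()
        yield response.follow(url, self.parse_property, cb_kwargs={'ptype': ptype})

def parse_property(self, response, ptype):
    # `ptype` arrives as a keyword argument instead of via response.meta
    yield {'url': response.url, 'offering': ptype}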

How to go to following link from a for loop?

I am using Scrapy to scrape a website. I am in a loop where every item has a link, and I want to follow that link for every item in the loop.
import scrapy

class MyDomainSpider(scrapy.Spider):
    name = 'My_Domain'
    allowed_domains = ['MyDomain.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        Colums = response.xpath('//*[@id="tab-5"]/ul/li')
        for colom in Colums:
            title = colom.xpath('//*[@class="lng_cont_name"]/text()').extract_first()
            address = colom.xpath('//*[@class="adWidth cont_sw_addr"]/text()').extract_first()
            con_address = address[9:-9]
            url = colom.xpath('//*[@id="tab-5"]/ul/li/@data-href').extract_first()
            print(url)
            print('*********************')
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        print('000000000000000000')
        a = response.xpath('//*[@class="fn"]/text()').extract_first()
        print(a)
I have tried something like this, but the zeros print only once while the stars print 10 times. I want the second function to run every time the loop runs.
You are probably doing something that is not intended. With
url = colom.xpath('//*[@id="tab-5"]/ul/li/@data-href').extract_first()
inside the loop, url always results in the same value. By default, Scrapy filters out duplicate requests (see here). If you really want to scrape the same URL multiple times, you can disable the filtering at the request level with the dont_filter=True argument to the scrapy.Request constructor. However, I think that what you really want is to go like this (only the relevant part of the code is left, and see the sketch after it for dont_filter):
def parse(self, response):
    Colums = response.xpath('//*[@id="tab-5"]/ul/li')
    for colom in Colums:
        url = colom.xpath('./@data-href').extract_first()
        yield scrapy.Request(url, callback=self.parse_dir_contents)
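And in case scraping the same URL repeatedly really is the intent, a minimal sketch of the dont_filter option mentioned above:

# Bypass Scrapy's built-in duplicate-request filter for this request only
yield scrapy.Request(url, callback=self.parse_dir_contents, dont_filter=True)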

Scrapy: how to pass links

I cannot pass the links. When starting the spider, I'm not getting any data. Please help with the code. I'm a beginner with Scrapy.
import scrapy
from movie.items import AfishaCinema

class AfishaCinemaSpider(scrapy.Spider):
    name = 'afisha-cinema'
    allowed_domains = ['kinopoisk.ru']
    start_urls = ['https://www.kinopoisk.ru/premiere/ru/']

    def parse(self, response):
        links = response.css('div.textBlock>span.name_big>a').xpath(
            '@href').extract()
        for link in links:
            yield scrapy.Request(link, callback=self.parse_moov,
                                 dont_filter=True)

    def parse_moov(self, response):
        item = AfishaCinema()
        item['name'] = response.css('h1.moviename-big::text').extract()
The reason you are not getting the data is that you don't yield any items from your parse_moov method. As per the documentation, the parse method must return an iterable of Requests and/or dicts or Item objects. So add
yield item
at the end of your parse_moov method.
Also, to be able to run your code, I had to modify
yield scrapy.Request(link, callback=self.parse_moov, dont_filter=True)
to
yield scrapy.Request(response.urljoin(link), callback=self.parse_moov, dont_filter=True)
in the parse method, otherwise I was getting errors:
ValueError: Missing scheme in request url: /film/monstry-na-kanikulakh-3-more-zovyot-2018-950968/
(That's because Request constructor needs absolute URL while the page contains relative URLs.)
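As an alternative, response.follow (available since Scrapy 1.4) accepts relative URLs directly, so a hedged sketch of the same loop without the explicit urljoin:

def parse(self, response):
    for link in response.css('div.textBlock>span.name_big>a::attr(href)').getall():
        # response.follow resolves relative URLs against the current page
        yield response.follow(link, callback=self.parse_moov, dont_filter=True)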
