I have a piece of code to test scrapy. My goal is to use scrapy without having to call the scrapy command from the terminal, so I can embed this code somewhere else.
The code is the following:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.item import Item, Field
from scrapy.crawler import CrawlerProcess
import json


class JsonWriterPipeline(object):

    file = None

    def open_spider(self, spider):
        self.file = open('items.json', 'wb')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


class StackItem(Item):
    title = Field()
    url = Field()


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions?pagesize=50&sort=newest"]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = StackItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract()[0]
            yield item


if __name__ == '__main__':
    settings = dict()
    settings['USER_AGENT'] = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    settings['ITEM_PIPELINES'] = {'JsonWriterPipeline': 1}

    process = CrawlerProcess(settings=settings)
    spider = StackSpider()
    process.crawl(spider)
    process.start()
As you can see, the code is self-contained and I override two settings: USER_AGENT and ITEM_PIPELINES. However, when I set breakpoints in the JsonWriterPipeline class, I see that the spider runs but the breakpoints are never reached, so the custom pipeline is not being used.
How can this be fixed?
I get 2 errors when running your script with scrapy 1.3.2 and Python 3.5.
First:
Unhandled error in Deferred:
2017-02-21 13:47:23 [twisted] CRITICAL: Unhandled error in Deferred:
2017-02-21 13:47:23 [twisted] CRITICAL:
Traceback (most recent call last):
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/utils/misc.py", line 39, in load_object
dot = path.rindex('.')
ValueError: substring not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/crawler.py", line 72, in crawl
self.engine = self._create_engine()
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/crawler.py", line 97, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/core/engine.py", line 70, in __init__
self.scraper = Scraper(crawler)
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/utils/misc.py", line 41, in load_object
raise ValueError("Error loading object '%s': not a full path" % path)
ValueError: Error loading object 'JsonWriterPipeline': not a full path
You need to give the full import path for the pipeline class. For example here, since the class is defined in the launcher script itself, the __main__ namespace works:
settings['ITEM_PIPELINES'] = {'__main__.JsonWriterPipeline': 1}
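If the pipeline class lived in its own module instead of the launcher script, the value would be that module's dotted import path. A minimal sketch, with myproject.pipelines as a hypothetical module name:
settings['ITEM_PIPELINES'] = {'myproject.pipelines.JsonWriterPipeline': 1}  # hypothetical module path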
Second (with the pipeline path fix above), you get loads of:
2017-02-21 13:47:52 [scrapy.core.scraper] ERROR: Error processing {'title': 'Apply Remote Commits to a Local Pull Request',
'url': '/questions/42367647/apply-remote-commits-to-a-local-pull-request'}
Traceback (most recent call last):
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "test.py", line 20, in process_item
self.file.write(line)
TypeError: a bytes-like object is required, not 'str'
which you can fix by writing the item JSON as bytes:
def process_item(self, item, spider):
    line = json.dumps(dict(item)) + "\n"
    self.file.write(line.encode('ascii'))
    return item
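An alternative sketch (my own variant, not part of the original answer): open the output file in text mode so that the str returned by json.dumps can be written without encoding. Only the two methods shown change; the rest of the pipeline stays as above.
def open_spider(self, spider):
    # text mode ('w') instead of 'wb', so plain str lines can be written directly
    self.file = open('items.json', 'w')

def process_item(self, item, spider):
    line = json.dumps(dict(item)) + "\n"
    self.file.write(line)
    return item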
Related
I am trying to run some code from the 'Learning Scrapy' book and ran into some errors. The code that I ran:
import scrapy
from ..items import PropertiesItem
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, Join
from urllib.parse import urlparse


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = (
        'http://localhost:9312/properties/property_000000.html',
    )

    def parse(self, response):
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()', MapCompose(str.strip, str.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()', MapCompose(lambda i: i.replace(',', ''), float), re='[,.0-9]+')
        l.add_xpath('description', '//*[@itemprop="description"][1]/text()', MapCompose(str.strip), Join())
        l.add_xpath('address', '//*[@itemtype="http://schema.org/Place"][1]/text()', MapCompose(str.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src', MapCompose(lambda i: urlparse.urljoin(response.url, i)))
        return l.load_item()
And the error I got:
Traceback (most recent call last):
File "c:\users\sadat\appdata\local\programs\python\python39\lib\site-packages\twisted\internet\defer.py", line 857, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "c:\users\sadat\appdata\local\programs\python\python39\lib\site-packages\scrapy\spiders\__init__.py", line 67, in _parse
return self.parse(response, **kwargs)
File "C:\Users\Sadat\Desktop\scrapybook\properties\properties\spiders\basic.py", line 23, in parse
l.add_xpath('image_urls', '//*[#itemprop="image"][1]/#src', 'image_urls', '//*[#itemprop="image"][1]/#src',MapCompose(lambda i: urlparse.urljoin(response.url, i)))
File "c:\users\sadat\appdata\local\programs\python\python39\lib\site-packages\itemloaders\__init__.py", line 350, in add_xpath
self.add_value(field_name, values, *processors, **kw)
File "c:\users\sadat\appdata\local\programs\python\python39\lib\site-packages\itemloaders\__init__.py", line 183, in add_value
value = self.get_value(value, *processors, **kw)
File "c:\users\sadat\appdata\local\programs\python\python39\lib\site-packages\itemloaders\__init__.py", line 246, in get_value
proc = wrap_loader_context(proc, self.context)
File "c:\users\sadat\appdata\local\programs\python\python39\lib\site-packages\itemloaders\common.py", line 11, in wrap_loader_context
if 'loader_context' in get_func_args(function):
File "c:\users\sadat\appdata\local\programs\python\python39\lib\site-packages\itemloaders\utils.py", line 53, in get_func_args
raise TypeError('%s is not callable' % type(func))
TypeError: <class 'str'> is not callable
2022-11-02 16:04:47 [scrapy.core.engine] INFO: Closing spider (finished)
Specifically, the code that was giving an error initially was this snippet: MapCompose(unicode.strip, unicode.title); there are multiple of them. After some digging, I found out that in later versions of Python, str is used instead of unicode. But even after using str I am getting this error. I need help solving it. Thanks.
Please note that I am using:
Python 3.9.4
Scrapy 2.6.1
VS Code 1.72
I was expecting Scrapy to provide clean scraped data via the Items, not this error.
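For context (my reading of the traceback above, not a confirmed fix): line 23 of basic.py passes the string 'image_urls' and the XPath a second time as extra positional arguments, and itemloaders treats every positional argument after the XPath as a processor, so it rejects the string because it is not callable. A minimal sketch of that call with only callables as processors, assuming the loader l and response from the spider above:
# Positional arguments after the XPath must be callable processors,
# e.g. MapCompose(...), never plain strings.
l.add_xpath(
    'image_urls',
    '//*[@itemprop="image"][1]/@src',
    MapCompose(lambda i: urljoin(response.url, i)),  # assumes: from urllib.parse import urljoin
)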
I am trying to get the serving_url for a project that uses django-storages with Google Cloud Storage for media files.
I am trying to serve the files with get_serving_url, but I get a silent failure, with no text logged in the exception handler.
The blob key is generated correctly from what I can see;
however, the call image = images.get_serving_url(blobkey, secure_url=True) raises an exception with no error text.
This is what I have done:
# storage_backends.py
class GoogleCloudMediaStorage(GoogleCloudStorage):
    """GoogleCloudStorage suitable for Django's Media files."""

    def __init__(self, *args, **kwargs):
        if not settings.MEDIA_URL:
            raise Exception('MEDIA_URL has not been configured')
        kwargs['bucket_name'] = setting('GS_MEDIA_BUCKET_NAME')
        super(GoogleCloudMediaStorage, self).__init__(*args, **kwargs)

    # this works fine
    def url(self, name):
        """.url that doesn't call Google."""
        return urljoin(settings.MEDIA_URL, name)

    # https://programtalk.com/python-examples/google.appengine.api.images.get_serving_url/_
    # This does not work yet
    def serving_url(self, name):
        logging.info('serving url called')
        if settings.DEBUG:
            return urljoin(settings.MEDIA_URL, name)
        else:
            # Your app's GCS bucket and any folder structure you have used.
            try:
                logging.info('trying to get serving_url')
                filename = settings.GS_MEDIA_BUCKET_NAME + '/' + name
                logging.info(filename)
                blobkey = blobstore.create_gs_key('/gs/' + filename)
                logging.info('This is a blobkey')
                logging.info(blobkey)
                image = images.get_serving_url(blobkey, secure_url=True)
                return image
            except Exception as e:
                logging.warn('didnt work')
                logging.warn(e)
                return urljoin(settings.MEDIA_URL, name)
I have appengine-python-standard installed, and I have wrapped my application:
#main.py
from antiques_project.wsgi import application
from google.appengine.api import wrap_wsgi_app
app = wrap_wsgi_app(application)
I also have this in my app.yaml
app_engine_apis: true
I have tried to generate the blob key in different ways (with and without the bucket).
I have also tried secure_url=False and True.
So far nothing seems to work.
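A small sketch (my suggestion, not something from the original post) of the difference that matters here: logging.exception logs the active exception together with its full traceback, while logging.warn(e) prints an empty message when the exception carries no text, which is why the failure looks silent. Swapping the two warn calls in serving_url for a single logging.exception call would surface a traceback like the one in the EDIT below.
import logging

logging.basicConfig(level=logging.INFO)


def risky():
    # Stand-in for images.get_serving_url raising an exception with no message text.
    raise RuntimeError()


try:
    risky()
except Exception as e:
    logging.warning('didnt work')
    logging.warning(e)                   # prints an empty message: the exception has no text
    logging.exception('risky() failed')  # same event, but with the full traceback attached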
EDIT:
Got a traceback in the logs:
Traceback (most recent call last):
File "/layers/google.python.pip/pip/lib/python3.10/site-packages/google/appengine/api/images/init.py", line 2013, in get_serving_url_hook
rpc.check_success()
File "/layers/google.python.pip/pip/lib/python3.10/site-packages/google/appengine/api/apiproxy_stub_map.py", line 614, in check_success
self.__rpc.CheckSuccess()
File "/layers/google.python.pip/pip/lib/python3.10/site-packages/google/appengine/api/apiproxy_rpc.py", line 149, in CheckSuccess
raise self.exception
File "/layers/google.python.pip/pip/lib/python3.10/site-packages/google/appengine/runtime/default_api_stub.py", line 276, in _CaptureTrace
f(**kwargs)
File "/layers/google.python.pip/pip/lib/python3.10/site-packages/google/appengine/runtime/default_api_stub.py", line 269, in _SendRequest
raise self._TranslateToError(parsed_response)
File "/layers/google.python.pip/pip/lib/python3.10/site-packages/google/appengine/runtime/default_api_stub.py", line 138, in _TranslateToError
raise apiproxy_errors.ApplicationError(response.application_error.code,
google.appengine.runtime.apiproxy_errors.ApplicationError: ApplicationError: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/srv/config/storage_backends.py", line 50, in serving_url
image = images.get_serving_url(blobkey, secure_url=True)
File "/layers/google.python.pip/pip/lib/python3.10/site-packages/google/appengine/api/images/init.py", line 1911, in get_serving_url
return rpc.get_result()
File "/layers/google.python.pip/pip/lib/python3.10/site-packages/google/appengine/api/apiproxy_stub_map.py", line 648, in get_result
return self.__get_result_hook(self)
File "/layers/google.python.pip/pip/lib/python3.10/site-packages/google/appengine/api/images/init.py", line 2015, in get_serving_url_hook
raise _ToImagesError(e, readable_blob_key)
google.appengine.api.images.TransformationError
Context: I want to create attributes of an object class in parallel by distributing them across the available cores. This question was answered in this post here by using the Python multiprocessing Pool.
The MRE for my task is the following, using Pyomo 6.4.1:
from pyomo.environ import *
import os
import multiprocessing
from multiprocessing import Pool
from multiprocessing.managers import BaseManager, NamespaceProxy
import types


class ObjProxy(NamespaceProxy):
    """Returns a proxy instance for any user defined data-type. The proxy instance will have the namespace and
    functions of the data-type (except private/protected callables/attributes). Furthermore, the proxy will be
    picklable and its state can be shared among different processes. """

    def __getattr__(self, name):
        result = super().__getattr__(name)
        if isinstance(result, types.MethodType):
            def wrapper(*args, **kwargs):
                return self._callmethod(name, args, kwargs)
            return wrapper
        return result


@classmethod
def create(cls, *args, **kwargs):
    # Register class
    class_str = cls.__name__
    BaseManager.register(class_str, cls, ObjProxy, exposed=tuple(dir(cls)))

    # Start a manager process
    manager = BaseManager()
    manager.start()

    # Create and return this proxy instance. Using this proxy allows sharing of state between processes.
    inst = eval("manager.{}(*args, **kwargs)".format(class_str))
    return inst


ConcreteModel.create = create


class A:
    def __init__(self):
        self.model = ConcreteModel.create()

    def do_something(self, var):
        if var == 'var1':
            self.model.var1 = var
        elif var == 'var2':
            self.model.var2 = var
        else:
            print('other var.')

    def do_something2(self, model, var_name, var_init):
        model.add_component(var_name, var_init)

    def init_var(self):
        print('Sequentially')
        self.do_something('var1')
        self.do_something('test')
        print(self.model.var1)
        print(vars(self.model).keys())

        # Trying to create the attributes in parallel
        print('\nParallel')
        self.__sets_list = [(self.model, 'time', Set(initialize=[x for x in range(1, 13)])),
                            (self.model, 'customers', Set(initialize=['c1', 'c2', 'c3'])),
                            (self.model, 'finish_bulks', Set(initialize=['b1', 'b2', 'b3', 'b4'])),
                            (self.model, 'fermentation_types', Set(initialize=['ft1', 'ft2', 'ft3', 'ft4'])),
                            (self.model, 'fermenters', Set(initialize=['f1', 'f2', 'f3'])),
                            (self.model, 'ferm_plants', Set(initialize=['fp1', 'fp2', 'fp3', 'fp4'])),
                            (self.model, 'plants', Set(initialize=['p1', 'p2', 'p3', 'p4', 'p5'])),
                            (self.model, 'gran_plants', Set(initialize=['gp1', 'gp2', 'gp3', 'gp4', 'gp4']))]

        with Pool(7) as pool:
            pool.starmap(self.do_something2, self.__sets_list)

        self.model.time.pprint()
        self.model.customers.pprint()


def main():  # The main part run from another file
    obj = A()
    obj.init_var()
    # Call other methods to create other attributes and the solver step.
    # The other methods are similar to do_something2() just changing the var_init to Var() and Constraint().


if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    main = main()
Output
Sequentially
other var.
var1
dict_keys(['_tls', '_idset', '_token', '_id', '_manager', '_serializer', '_Client', '_owned_by_manager', '_authkey', '_close'])
Parallel
WARNING: Element gp4 already exists in Set gran_plants; no action taken
time : Size=1, Index=None, Ordered=Insertion
Key : Dimen : Domain : Size : Members
None : 1 : Any : 12 : {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
customers : Size=1, Index=None, Ordered=Insertion
Key : Dimen : Domain : Size : Members
None : 1 : Any : 3 : {'c1', 'c2', 'c3'}
I changed the number of parallel processes for testing; sometimes it raises different errors, and other times it runs without errors. This is confusing to me, and I have not figured out what the problem behind it is. I did not find another post with a similar problem, but I saw some posts discussing that pickle does not handle large data well. The errors that I sometimes get are the following:
Error 1
Unserializable message: Traceback (most recent call last):
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/managers.py", line 300, in serve_client
send(msg)
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
SystemError: <method 'dump' of '_pickle.Pickler' objects> returned NULL without setting an error
Error 2
Unserializable message: Traceback (most recent call last):
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/managers.py", line 300, in serve_client
send(msg)
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
RuntimeError: dictionary changed size during iteration
Error 3
*** Reference count error detected: an attempt was made to deallocate the type 32727 (? ***
*** Reference count error detected: an attempt was made to deallocate the type 32727 (? ***
*** Reference count error detected: an attempt was made to deallocate the type 32727 (? ***
Unserializable message: Traceback (most recent call last):
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/managers.py", line 300, in serve_client
send(msg)
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
numpy.core._exceptions._ArrayMemoryError: <unprintble MemoryError object>
Error 4
Unserializable message: Traceback (most recent call last):
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/managers.py", line 300, in serve_client
send(msg)
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/.../anaconda3/envs/.../lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'WeakSet.__init__.<locals>._remove'
So there are different errors, and the behavior is not stable. I hope that someone has had and solved this problem. Furthermore, if someone has implemented another strategy for this task, please feel free to post your answer here.
Thanks.
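A minimal sketch of one alternative strategy (my assumption, not something from the original post): keep the Pyomo objects out of the worker processes entirely, let the pool return only plain picklable data, and add the components in the main process. The function name build_set_data is hypothetical.
from multiprocessing import Pool

from pyomo.environ import ConcreteModel, Set


def build_set_data(name, members):
    # Runs in a worker process: returns only plain, picklable data.
    return name, list(members)


def main():
    specs = [
        ('time', range(1, 13)),
        ('customers', ['c1', 'c2', 'c3']),
        ('plants', ['p1', 'p2', 'p3', 'p4', 'p5']),
    ]

    with Pool(3) as pool:
        results = pool.starmap(build_set_data, specs)

    # The model and its components exist only in the main process,
    # so nothing Pyomo-specific ever has to be pickled.
    model = ConcreteModel()
    for name, members in results:
        model.add_component(name, Set(initialize=members))

    model.time.pprint()


if __name__ == '__main__':
    main()
In this sketch the per-set work is trivial, so the pool only pays off if the real preparation of each component's data is expensive.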
I'm trying to pull some API data and save it for later use.
How would I properly handle errors with this code block:
# import modules
import requests
import json

# test api data
url = 'https://pipl.ir/v1/getPerson'

# error handling
try:
    url_response = requests.get(url, timeout=3)
    url_response.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print("Http Error:", errh)
except requests.exceptions.ConnectionError as errc:
    print("Error Connecting:", errc)
except requests.exceptions.Timeout as errt:
    print("Timeout Error:", errt)
except requests.exceptions.RequestException as err:
    print("OOps: Something Else", err)

# json dictionary
json_data = url_response.json()

# print api json response
print(json.dumps(json_data, indent=4, sort_keys=True))
This works if I get a valid JSON response; if not, I get a result like:
Http Error: 404 Client Error: Not Found for url: https://google.com/fakesite
Traceback (most recent call last):
File "/home/telendrith/python/blapi.py", line 19, in <module>
json_data = url_response.json()
File "/usr/lib/python3/dist-packages/requests/models.py", line 897, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
And .. I'm back at square nothing.
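A minimal sketch of one way to handle this (my suggestion, not from the original post): treat the except branches as terminal for the request so the JSON parsing only runs on a successful response, and guard the parse itself against non-JSON bodies.
import json
import sys

import requests

url = 'https://pipl.ir/v1/getPerson'

try:
    url_response = requests.get(url, timeout=3)
    url_response.raise_for_status()
except requests.exceptions.RequestException as err:
    # Base class of HTTPError, ConnectionError, Timeout, etc.
    print("Request failed:", err)
    sys.exit(1)

try:
    json_data = url_response.json()
except ValueError as err:
    # .json() raises a ValueError subclass when the body is not valid JSON.
    print("Response was not valid JSON:", err)
    sys.exit(1)

print(json.dumps(json_data, indent=4, sort_keys=True))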
I would like to print out tweets which have the #Berlin hashtag in them. How can I rewrite the code? I can't find sample code in Python 3 for this.
I have the following problem:
from tweepy.streaming import StreamListener
import tweepy
from tweepy import Stream
from tweepy import OAuthHandler

consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''


# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)


if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    # This line filters the Twitter stream to capture tweets containing the keyword 'Berlin'
    stream.filter(track=['Berlin'])
And then I got this error at the end:
Traceback (most recent call last):
File "test.py", line 31, in <module>
stream.filter(track=['Berlin'])
File "/home/ubuntu/tweepy/tweepy/streaming.py", line 430, in filter
self._start(async)
File "/home/ubuntu/tweepy/tweepy/streaming.py", line 346, in _start
self._run()
File "/home/ubuntu/tweepy/tweepy/streaming.py", line 286, in _run
raise exception
File "/home/ubuntu/tweepy/tweepy/streaming.py", line 255, in _run
self._read_loop(resp)
File "/home/ubuntu/tweepy/tweepy/streaming.py", line 298, in _read_loop
line = buf.read_line().strip()
File "/home/ubuntu/tweepy/tweepy/streaming.py", line 171, in read_line
self._buffer += self._stream.read(self._chunk_size)
TypeError: Can't convert 'bytes' object to str implicitly
This is related to a known bug in tweepy (#615); the fix is taken from a post in that issue.
In streaming.py:
I changed line 161 to
self._buffer += self._stream.read(read_len).decode('UTF-8', 'ignore')
and line 171 to
self._buffer += self._stream.read(self._chunk_size).decode('UTF-8', 'ignore')
The file you need to change on Windows is located under \Python 3.5\Lib\site-packages\tweepy.
For Ubuntu you need: '/usr/lib/python3.5/dist-packages/tweepy'.
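Rather than guessing the path by hand, a small sketch (my addition) that asks Python where the installed tweepy package actually lives:
import os
import tweepy

# Directory that contains streaming.py for the tweepy install you are running.
print(os.path.dirname(tweepy.__file__))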