Scrapy not exporting data to Elasticsearch

I want to index my items in Elasticsearch, and I found this.
But when I try to crawl a site I get the following error:
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 70, in process_item
self.index_item(item)
File "/usr/local/lib/python2.7/dist-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 52, in index_item
local_id = hashlib.sha1(item[uniq_key]).hexdigest()
File "/home/javed/.local/lib/python2.7/site-packages/scrapy/item.py", line 50, in getitem
return self._values[key]
exceptions.KeyError: 'url'

Since you didn't paste your spider code, I can only assume things.
One assumption would be that you didn't set the required field in your items. The items need to have the field named in ELASTICSEARCH_UNIQ_KEY, and its value has to be unique. The simplest thing might be to use the URL:
# somewhere deep in your callback,
# where you create and yield your item
...
myitem['url'] = response.url
return myitem
and make sure to set this in settings.py:
ELASTICSEARCH_UNIQ_KEY = 'url'
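For completeness, here is a minimal sketch of what that setup assumes elsewhere in the project (the item class name, index name and host below are placeholders, and the exact setting names may differ between versions of scrapy-elasticsearch, so check the version you installed):

# items.py - the item must actually declare the field used as the unique key
import scrapy

class MyItem(scrapy.Item):
    url = scrapy.Field()
    # ... other fields ...

# settings.py - enable the pipeline and point it at your cluster
ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500,
}
ELASTICSEARCH_SERVER = 'localhost'
ELASTICSEARCH_PORT = 9200
ELASTICSEARCH_INDEX = 'my-index'
ELASTICSEARCH_TYPE = 'item'
ELASTICSEARCH_UNIQ_KEY = 'url'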

I simply commented out this line in my settings.py file (the field is optional according to the official documentation):
#ELASTICSEARCH_UNIQ_KEY = 'url' # Custom unique key

Related

Django model object as parameter for celery task raises EncodeError - 'object of type someModelName is not JSON serializable'

I'm working with a Django project (I'm pretty new to Django) and running into an issue passing a model object between my view and a Celery task.
I am taking input from a form which contains several ModelChoiceField fields and using the selected objects in a Celery task. When I queue the task (from the post method in the view) using someTask.delay(x, y, z), where x, y and z are various objects from the form's ModelChoiceFields, I get the error object of type <someModelName> is not JSON serializable.
That said, if I create a simple test function and pass any of the same objects from the form into it, I get the expected behavior: the name of the object selected in the form is logged.
def test(object):
    logger.debug(object.name)
I have done some digging based on the above error and found Django serializers, which allow for a workaround: serializing the object using serializers.serialize('json', [template]) in the view before passing it to the Celery task.
I can then access the object in the Celery task by using template = json.loads(template)[0].get('fields') to get its required bits as a dictionary. While this works, it does seem a bit inelegant, and I wanted to see if there is something I am missing here.
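To make that concrete, here is a condensed sketch of that workaround (someTask and template follow the question; the @shared_task decorator and the 'name' field are assumptions):

import json

from celery import shared_task
from django.core import serializers


@shared_task
def someTask(template_json):
    # Deserialize the JSON produced by serializers.serialize() back into
    # a plain dict of the model's fields.
    fields = json.loads(template_json)[0].get('fields')
    return fields.get('name')

# In the view, before queueing:
#   template = form.cleaned_data['template']   # a model instance from a ModelChoiceField
#   someTask.delay(serializers.serialize('json', [template]))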
I'm obviously open to any feedback/guidance here; however, my main questions are:
Why do I get the object ... is not JSON serializable error when passing a model object into a Celery task, but not when passing it to my simple test function?
Is the approach of using Django serializers before queueing the Celery task considered acceptable/correct, or is there a cleaner way to achieve this goal?
Any suggestions would be greatly appreciated.
Traceback:
I tried to post the full traceback here as well, but including it caused the post to get flagged as 'this looks like spam'.
Internal Server Error: /build/
Traceback (most recent call last):
File "/home/tech/sandbox_project/venv/lib/python3.8/site-packages/kombu/serialization.py", line 49, in _reraise_errors
yield
File "/home/tech/sandbox_project/venv/lib/python3.8/site-packages/kombu/serialization.py", line 220, in dumps
payload = encoder(data)
File "/home/tech/sandbox_project/venv/lib/python3.8/site-packages/kombu/utils/json.py", line 65, in dumps
return _dumps(s, cls=cls or _default_encoder,
File "/usr/lib/python3.8/json/__init__.py", line 234, in dumps
return cls(
File "/usr/lib/python3.8/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python3.8/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/home/tech/sandbox_project/venv/lib/python3.8/site-packages/kombu/utils/json.py", line 55, in default
return super().default(o)
File "/usr/lib/python3.8/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Template is not JSON serializable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/tech/sandbox_project/venv/lib/python3.8/site-packages/django/core/handlers/exception.py", line 47, in inner
response = get_response(request)
Add these lines to settings.py:
# Project/settings.py
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
Then, instead of passing the object itself, send JSON containing its id/pk. If you're using a model instance, call the task like this:
test.delay({'pk': 1})
A Django model instance is not available in the Celery environment, as the worker runs in a different process.
How can you get the model instance inside the task, then? You can do something like this:
from celery import shared_task

def import_django_instance():
    """
    Makes the Django environment available
    to tasks!!
    """
    import django
    import os
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'Project.settings')
    django.setup()

# task
@shared_task(name="simple_task")
def simple_task(data):
    import_django_instance()
    from app.models import AppModel
    pk = data.get('pk')
    instance = AppModel.objects.get(pk=pk)
    # your operation on `instance` here
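On the calling side, a short usage sketch (the form field name 'template' is hypothetical):

def queue_task(form):
    # `form` is a validated Django form from the view.
    instance = form.cleaned_data['template']
    simple_task.delay({'pk': instance.pk})  # pass only JSON-serializable data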

How to Handle Popup Windows That Occur within Selenium Python

So I have an issue where, I am trying to automate an Import on an application that has no API.
As a result, I have to do like 30 navigation clicks just to get to what I want (Exaggeration).
However, I am trying to basically automate the clicks that will allow me to upload a specific file.
As a result, I almost get to the part where I have to select the specific test build I want to import the file with. There is a field where I need to do a send_keys to find the correct import build I have to upload. The field element looks like this:
<input class="lookupInput" type="text" name="brTestScoreImportLookupInput" id="brTestScoreImportLookupInput" style="width: 100px;" tabindex="1" onkeydown="return lookupKeyPressed(event,'','simptbrws000.w')" origvalue="" det="true" aria-labelledby="" autocomplete="off">
However I don't think my code is properly handling the window as it pops-up from the prior selection.
The field I need to update can be seen in the picture I uploaded.
Furthermore, the XPath for the field is //*[@id='brTestScoreImportLookupInput'].
You can find the full code here.
The main thing is that I have to enter TSI into that File ID field and then hit Enter on my keyboard to populate the correct import utility I need. Once I do that, the import utilities filter down and I need to select a specific File ID.
The main code that should be controlling this:
# Click on Test Score Import Wizard - TW
# Test Wizard XPATH = //a[@id='tree1-3-link']/span
element = WebDriverWait(browser, 20).until(
    EC.element_to_be_clickable((By.XPATH, "//a[@id='tree1-3-link']/span")))
element.click()
# Send test_upload and Send Keys
# Field XPATH = //*[@id='brTestScoreImportLookupInput']
test_lookup = browser.find_element_by_id("brTestScoreImportLookupInput")
test_lookup.send_keys(test_upload)
If you want to visit the repository code, click on the link above.
Any help would be greatly appreciated.
Traceback (most recent call last):
File ".\skyward_collegeboard_TSI_import.py", line 115, in <module>
test_lookup = browser.find_element_by_id("brTestScoreImportLookupInput")
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 360, in find_element_by_id
return self.find_element(by=By.ID, value=id_)
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 976, in find_element
return self.execute(Command.FIND_ELEMENT, {
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="brTestScoreImportLookupInput"]"}
(Session info: chrome=80.0.3987.122)
So I was able to accomplish this with the following method, using both Selenium and pynput.
from pynput.keyboard import Key, Controller

keyboard = Controller()

# Browser switches to the popup window
WebDriverWait(browser, 10).until(EC.number_of_windows_to_be(2))
browser.switch_to.window(browser.window_handles[-1])
# Send test_upload and send keys
# Field XPATH = //*[@id='brTestScoreImportLookupInput']
test_lookup = browser.find_element_by_id("brTestScoreImportLookupInput")
test_lookup.send_keys(test_upload)
# Press and release the Enter key
keyboard.press(Key.enter)
keyboard.release(Key.enter)
Essentially I had to switch to that popup window.
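For what it's worth, Selenium alone can usually send the Enter key as well, which would avoid the pynput dependency; a sketch (untested against this particular application):

from selenium.webdriver.common.keys import Keys

test_lookup = browser.find_element_by_id("brTestScoreImportLookupInput")
test_lookup.send_keys(test_upload)
test_lookup.send_keys(Keys.ENTER)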

"RuntimeError: dictionary keys changed during iteration" when attempting to call add_attachment for Jira issue object, but dictionaries aren't used?

I am attempting to just add a csv file to my issues as a test, but I keep receiving the error:
RuntimeError: dictionary keys changed during iteration
Here is the code (I've removed the parameters for server, username and password):
from jira import JIRA
options = {"server": "serverlinkgoeshere"}
jira = JIRA(options, basic_auth=('username', 'password'))
issuesList = jira.search_issues(jql_str='', startAt=0, maxResults=100)
for issue in issuesList:
    with open("./csv/Adobe.csv", 'rb') as f:
        jira.add_attachment(issue=issue, attachment=f)
    f.close()
I'm at a loss, I'm not changing any dictionary keys in my code. Here is the full error message:
Traceback (most recent call last):
File "C:/Users/USER/PycharmProjects/extractor/main/jiraCSVDupdate.py", line 8, in <module>
jira.add_attachment(issue=issue, attachment=f)
File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\jira\client.py", line 126, in wrapper
result = func(*arg_list, **kwargs)
File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\jira\client.py", line 787, in add_attachment
url, data=m, headers=CaseInsensitiveDict({'content-type': m.content_type, 'X-Atlassian-Token': 'nocheck'}), retry_data=file_stream)
File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\jira\utils\__init__.py", line 41, in __init__
for key, value in super(CaseInsensitiveDict, self).items():
RuntimeError: dictionary keys changed during iteration
References:
Jira add_attachment example:
https://jira.readthedocs.io/en/master/examples.html#attachments
add_attachment source code:
https://jira.readthedocs.io/en/master/_modules/jira/client.html#JIRA.add_attachment
The root of the problem is found in jira/utils/__init__.py:
for key, value in super(CaseInsensitiveDict, self).items():
    if key != key.lower():
        self[key.lower()] = value
        self.pop(key, None)
This is a programming mistake: one should not modify the data structure that is being iterated over. Therefore, this requires a patch upstream; that is the proper long-term fix.
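A minimal standalone reproduction of the same error (on Python 3.8+), mirroring the loop above:

headers = {'content-type': 'text/plain', 'X-Atlassian-Token': 'nocheck'}
for key, value in headers.items():
    if key != key.lower():
        headers[key.lower()] = value
        headers.pop(key, None)  # RuntimeError: dictionary keys changed during iteration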
In the meantime, I suggest a monkey patch:
import jira.client

class CaseInsensitiveDict(dict):
    def __init__(self, *args, **kw):
        super(CaseInsensitiveDict, self).__init__(*args, **kw)
        for key, value in self.copy().items():
            if key != key.lower():
                self[key.lower()] = value
                self.pop(key, None)

jira.client.CaseInsensitiveDict = CaseInsensitiveDict
The trick here is that you iterate over a copy of the dict, via self.copy().items(), rather than the original self.
For reference, my package version: jira==2.0.0.
This should be fixed as of jira lib version 3.1.1 (https://github.com/pycontribs/jira/commit/a83cc8f447fa4f9b6ce55beca8b4aee4a669c098).
So, assuming you use a requirements file, edit your requirements.txt to contain
jira>=3.1.1
and install it with pip install -r requirements.txt; otherwise use:
pip install jira==3.1.1
This works as a fix using Python 3.9.
The dictionary keys in the following lines in client.py in the Jira site-packages are not all lower case.
headers=CaseInsensitiveDict({'content-type': None, 'X-Atlassian-Token': 'nocheck'}))
url, data=m, headers=CaseInsensitiveDict({'content-type': m.content_type, 'X-Atlassian-Token': 'nocheck'}), retry_data=file_stream)
This works as a solution: alter the dictionary keys so they are all lower case. The attachment can then be added to the Jira ticket.
headers=CaseInsensitiveDict({'content-type': None, 'x-atlassian-token': 'nocheck'}))
url, data=m, headers=CaseInsensitiveDict({'content-type': m.content_type, 'x-atlassian-token': 'nocheck'}), retry_data=file_stream)

Python Scrapy: Crawl from local file: Content-Type undefined

I want to let Scrapy crawl local HTML files but am stuck because the response headers lack the Content-Type field. I've followed the tutorial here: Use Scrapy to crawl local XML file - Start URL local file address. So basically, I am pointing Scrapy to local URLs such as file:///Users/felix/myfile.html.
However, Scrapy then crashes, since it looks like (on macOS) the resulting response object does not contain the required Content-Type header.
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/felix/IdeaProjects/news-please/newsplease/__init__.py
[scrapy.core.scraper:158|ERROR] Spider error processing <GET file:///Users/felix/IdeaProjects/news-please/newsplease/0a2199bdcef84d2bb2f920cf042c5919> (referer: None)
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Users/felix/IdeaProjects/news-please/newsplease/crawler/spiders/download_crawler.py", line 33, in parse
if not self.helper.parse_crawler.content_type(response):
File "/Users/felix/IdeaProjects/news-please/newsplease/helper_classes/parse_crawler.py", line 116, in content_type
if not re.match('text/html', response.headers.get('Content-Type').decode('utf-8')):
AttributeError: 'NoneType' object has no attribute 'decode'
Someone suggested running a simple HTTP server (see Python Scrapy on offline (local) data), but that is not an option, mainly because of the overhead of running another server.
I need to use Scrapy in the first place, as we have a larger framework that uses Scrapy. We plan to add the ability to crawl local files to that framework. However, since there are several questions on SO about crawling local files (see the links above), I assume this problem is of general interest.
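For context, a minimal sketch of the kind of spider setup this refers to (the spider name and file path are placeholders):

import scrapy

class LocalFileSpider(scrapy.Spider):
    name = 'local_files'
    start_urls = ['file:///Users/felix/myfile.html']

    def parse(self, response):
        # Local file:// responses may come back without a Content-Type header,
        # which is what trips the content_type() check in news-please.
        yield {'title': response.css('title::text').extract_first()}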
You can actually fork news-please (or patch your local copy) so that the function def content_type(self, response) in newsplease/helper_classes/parse_crawler.py always returns True when the response comes from local storage.
The new file will look like this:
def content_type(self, response):
    """
    Ensures the response is of type text/html.
    :param obj response: The scrapy response
    :return bool: Determines whether the response is of the correct type
    """
    if response.url.startswith('file:///'):
        return True
    if not re.match('text/html', response.headers.get('Content-Type').decode('utf-8')):
        self.log.warn("Dropped: %s's content is not of type "
                      "text/html but %s", response.url,
                      response.headers.get('Content-Type'))
        return False
    else:
        return True

gmail API: TypeError: sequence item 0: expected str instance, bytes found

I'm trying to download one message using the GMail API. Below is my traceback:
pdiracdelta#pdiracdelta-Laptop:~/GMail Metadata$ ./main.py
<oauth2client.client.OAuth2Credentials object at 0x7fd6306c4d30>
False
Traceback (most recent call last):
File "./main.py", line 105, in <module>
main()
File "./main.py", line 88, in main
service = discovery.build('gmail', 'v1', http=http)
File "/usr/lib/python3/dist-packages/oauth2client/util.py", line 137, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/lib/python3/dist-packages/googleapiclient/discovery.py", line 197, in build
resp, content = http.request(requested_url)
File "/usr/lib/python3/dist-packages/oauth2client/client.py", line 562, in new_request
redirections, connection_type)
File "/usr/lib/python3/dist-packages/httplib2/__init__.py", line 1138, in request
headers = self._normalize_headers(headers)
File "/usr/lib/python3/dist-packages/httplib2/__init__.py", line 1106, in _normalize_headers
return _normalize_headers(headers)
File "/usr/lib/python3/dist-packages/httplib2/__init__.py", line 194, in _normalize_headers
return dict([ (key.lower(), NORMALIZE_SPACE.sub(value, ' ').strip()) for (key, value) in headers.items()])
File "/usr/lib/python3/dist-packages/httplib2/__init__.py", line 194, in <listcomp>
return dict([ (key.lower(), NORMALIZE_SPACE.sub(value, ' ').strip()) for (key, value) in headers.items()])
TypeError: sequence item 0: expected str instance, bytes found
And below is a snippet of code which produces the credential object and boolean print just before the Traceback. It confirms that the credentials object is valid and is being used as suggested by Google:
credentials = get_credentials()
print(credentials)
print(str(credentials.invalid))
http = credentials.authorize(httplib2.Http())
service = discovery.build('gmail', 'v1', http=http)
What is going wrong here? It seems to me that I am not at fault, since the problem can be traced back to service = discovery.build('gmail', 'v1', http=http) which uses nothing but valid information (implying one of the packages used further in the stack cannot handle this valid information). Is this a bug, or am I doing something wrong?
UPDATE: it seems that the _normalize_headers function has now been patched. Updating your python version should fix the problem (I'm using 3.6.7 now).
Solved with help from Padraic Cunningham, who identified the problem as an encoding issue. I solved it by applying .decode('utf-8') to the header keys and values (headers is a dict) if they are bytes objects (which are apparently UTF-8 encoded), turning them into Python 3 strings. This is probably due to some Python 2/3 mixing in the Google API client.
The fix also includes changing all code from the Google API examples to Python 3 code (e.g. exception handling), but most importantly my workaround involves editing /usr/lib/python3/dist-packages/httplib2/__init__.py at lines 193-194, redefining the _normalize_headers(headers) function as:
def _normalize_headers(headers):
    # Iterate over a snapshot of the keys so the dict can be modified safely.
    for key in list(headers):
        # If a key is not a str, it is ASSUMED to be UTF-8 encoded bytes,
        # as it used to be in Python 2.
        if not isinstance(key, str):
            newkey = key.decode('utf-8')
            headers[newkey] = headers[key]
            del headers[key]
            key = newkey
        if not isinstance(headers[key], str):
            headers[key] = headers[key].decode('utf-8')
    return dict([(key.lower(), NORMALIZE_SPACE.sub(value, ' ').strip()) for (key, value) in headers.items()])
WARNING: this workaround is obviously quite dirty as it involves editing files from the httplib2 package. If someone finds a better fix, please post it here.
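For reference, the same decoding idea written as a standalone helper (a sketch only; it illustrates the transformation rather than patching httplib2 itself):

def decode_headers(headers):
    """Return a copy of `headers` with any bytes keys/values decoded as UTF-8 strings."""
    decoded = {}
    for key, value in headers.items():
        if isinstance(key, bytes):
            key = key.decode('utf-8')
        if isinstance(value, bytes):
            value = value.decode('utf-8')
        decoded[key] = value
    return decoded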
