Scrapy's pagination error

Hi guys, I'm getting the following pagination error while trying to scrape a website:
2017-07-27 18:30:21 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.pedidosja.com.br/restaurantes/sao-paulo?a=rua%20tenente%20negr%C3%A3o%20200&cep=04530030&doorNumber=200&bairro=Itaim%20Bibi&lat=-23.585202&lng=-46.671715199999994> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/Documents/Spiders/pedidosYa/pedidosYa/spiders/pedidosya.py", line 35, in parse
next_page_url = response.urljoin(next_page_url)
File "/usr/local/lib/python3.5/dist-packages/scrapy/http/response/text.py", line 82, in urljoin
return urljoin(get_base_url(self), url)
File "/usr/lib/python3.5/urllib/parse.py", line 416, in urljoin
base, url, _coerce_result = _coerce_args(base, url)
File "/usr/lib/python3.5/urllib/parse.py", line 112, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
2017-07-27 18:30:21 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-27 18:30:21 [scrapy.extensions.feedexport] INFO: Stored csv feed (13 items) in: test3.csv
2017-07-27 18:30:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 653,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 62571,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 7, 27, 23, 30, 21, 221038),
'item_scraped_count': 13,
'log_count/DEBUG': 16,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'memusage/max': 49278976,
'memusage/startup': 49278976,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2017, 7, 27, 23, 30, 17, 538310)}
2017-07-27 18:30:21 [scrapy.core.engine] INFO: Spider closed (finished)
The spider is raising a TypeError: "Cannot mix str and non-str arguments". I'm not very experienced in Python, so I would also appreciate some resources where I could learn about this type of error. Below you will find the code of the spider.
# -*- coding: utf-8 -*-
import scrapy
from pedidosYa.items import PedidosyaItem
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose


class PedidosyaSpider(scrapy.Spider):
    name = 'pedidosya'
    allowed_domains = ['www.pedidosya.com.br']
    start_urls = [
        'https://www.pedidosja.com.br/restaurantes/sao-paulo?a=rua%20tenente%20negr%C3%A3o%20200&cep=04530030&doorNumber=200&bairro=Itaim%20Bibi&lat=-23.585202&lng=-46.671715199999994']

    def parse(self, response):
        # need to define wrapper
        for wrapper in response.css('.restaurant-wrapper.peyaCard.show.with_tags'):
            l = ItemLoader(item=PedidosyaItem(), selector=wrapper)
            l.add_css('Name', 'a.arrivalName::text')
            l.add_css('Menu1', 'span.categories > span::text', MapCompose(str.strip))
            l.add_css('Menu2', 'span.categories > span + span::text', MapCompose(str.strip))
            l.add_css('Menu3', 'span.categories > span + span + span::text', MapCompose(str.strip))
            l.add_css('Address', 'span.address::text', MapCompose(str.strip))
            l.add_css('DeliveryTime', 'i.delTime::text', MapCompose(str.strip))
            l.add_css('CreditCard', 'ul.content_credit_cards > li > img::attr(alt)', MapCompose(str.strip))
            l.add_css('DeliveryCost', 'div.shipping > i::text', MapCompose(str.strip))
            l.add_css('Rankink', 'span.ranking i::text', MapCompose(str.strip))
            l.add_css('No', 'span.ranking a::text', MapCompose(str.strip))
            l.add_css('Sponsor', 'span.grey_small.not-logged::text', MapCompose(str.strip))
            l.add_css('DeliveryMinimun', 'div.minDelivery::text', MapCompose(str.strip))
            l.add_css('Distance', 'div.distance i::text', MapCompose(str.strip))
            yield l.load_item()

        next_page_url = response.css('li.arrow.next > a ::attr(href)').extract()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
Thank you in advance and have a wonderful day!!

next_page_url = response.css('li.arrow.next > a ::attr(href)').extract()
                                                              ^^^^^^^^^^
if next_page_url:
    next_page_url = response.urljoin(next_page_url)
                    ^^^^^^^^^^^^^^^^
Here you are calling urljoin on a list, because the extract() method used to build next_page_url returns a list of all matched values, even if there is only one.
To remedy this, use extract_first() instead:
next_page_url = response.css('li.arrow.next > a ::attr(href)').extract_first()
                                                              ^^^^^^^^^^^^^^^^

The problem is in this line:
next_page_url = response.css('li.arrow.next > a::attr(href)').extract()
because the extract() method always returns a list of results, even if it finds just one. Either use the extract_first() method, which will give you just the first result:
next_page_url = response.css('li.arrow.next > a::attr(href)').extract_first()
or get the first element of the results list yourself:
next_page_url = response.css('li.arrow.next > a::attr(href)').extract()[0]
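Putting the fix into the spider, a minimal sketch of the corrected end of parse() (same selector as the question; only the pagination lines change):

# extract_first() yields a single string (or None), so urljoin() receives a str
next_page_url = response.css('li.arrow.next > a ::attr(href)').extract_first()
if next_page_url:
    next_page_url = response.urljoin(next_page_url)
    yield scrapy.Request(url=next_page_url, callback=self.parse)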

Related

Error when trying to run via multiprocessing in Python 3

The following code works fine
[process_data(item, data_frame_list[item]) for item in data_frame_list if data_frame_list[item].shape[0] > 5]
I'm trying to convert this code to run in parallel
pool_obj = multiprocessing.Pool()
[pool_obj.map(process_data,item, data_frame_list[item]) for item in data_frame_list if data_frame_list[item].shape[0] > 5]
This results in errors
Traceback (most recent call last):
File "/home/pyuser/PycharmProjects/project_sample/testyard_2.py", line 425, in <module>
[pool_obj.map(process_data,item, data_frame_list[item]) for item in data_frame_list if data_frame_list[item].shape[0] > 5]
File "/home/pyuser/PycharmProjects/project_sample/testyard_2.py", line 425, in <listcomp>
[pool_obj.map(process_data,item, data_frame_list[item]) for item in data_frame_list if data_frame_list[item].shape[0] > 5]
File "/usr/lib/python3.8/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.8/multiprocessing/pool.py", line 485, in _map_async
result = MapResult(self, chunksize, len(iterable), callback,
File "/usr/lib/python3.8/multiprocessing/pool.py", line 797, in __init__
if chunksize <= 0:
File "/home/pyuser/PycharmProjects/project_sample/venv/lib/python3.8/site-packages/pandas/core/ops/common.py", line 69, in new_method
return method(self, other)
File "/home/pyuser/PycharmProjects/project_sample/venv/lib/python3.8/site-packages/pandas/core/arraylike.py", line 44, in __le__
return self._cmp_method(other, operator.le)
File "/home/pyuser/PycharmProjects/project_sample/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 6849, in _cmp_method
new_data = self._dispatch_frame_op(other, op, axis=axis)
File "/home/pyuser/PycharmProjects/project_sample/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 6888, in _dispatch_frame_op
bm = self._mgr.apply(array_op, right=right)
File "/home/pyuser/PycharmProjects/project_sample/venv/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 325, in apply
applied = b.apply(f, **kwargs)
File "/home/pyuser/PycharmProjects/project_sample/venv/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 382, in apply
result = func(self.values, **kwargs)
File "/home/pyuser/PycharmProjects/project_sample/venv/lib/python3.8/site-packages/pandas/core/ops/array_ops.py", line 284, in comparison_op
res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
File "/home/pyuser/PycharmProjects/project_sample/venv/lib/python3.8/site-packages/pandas/core/ops/array_ops.py", line 73, in comp_method_OBJECT_ARRAY
result = libops.scalar_compare(x.ravel(), y, op)
File "pandas/_libs/ops.pyx", line 107, in pandas._libs.ops.scalar_compare
TypeError: '<=' not supported between instances of 'str' and 'int'
I'm not able to work out what is incorrect with what I've done. Could I please request some guidance?
I used a different library that is easier to use. Everything is working now.
from joblib import Parallel, delayed
import multiprocessing
Parallel(n_jobs=multiprocessing.cpu_count())(delayed(process_data)(item, data_frame_list[item]) for item in data_frame_list if data_frame_list[item].shape[0] > 5)
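For reference, the original traceback also hints at why the Pool.map version failed: Pool.map() takes a single iterable, and its third positional argument is chunksize, so data_frame_list[item] ended up being compared against 0 inside the pool internals. If you want to stay with multiprocessing.Pool, a rough sketch using starmap (assuming process_data takes the key and the DataFrame as two arguments, as in the original list comprehension) would be:

import multiprocessing

# process_data and data_frame_list are assumed to be defined as in the question
if __name__ == '__main__':
    # build (key, dataframe) argument tuples first, then let starmap unpack them
    tasks = [(item, data_frame_list[item])
             for item in data_frame_list
             if data_frame_list[item].shape[0] > 5]
    with multiprocessing.Pool() as pool_obj:
        results = pool_obj.starmap(process_data, tasks)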

Scrapy NotSupported and TimeoutError

My goal is to find out each and every link that contains daraz.com.bd/shop/
What I have tried so far is below.
import scrapy


class LinksSpider(scrapy.Spider):
    name = 'links'
    allowed_domains = ['daraz.com.bd']
    extracted_links = []
    shop_list = []

    def start_requests(self):
        start_urls = 'https://www.daraz.com.bd'
        yield scrapy.Request(url=start_urls, callback=self.extract_link)

    def extract_link(self, response):
        str_response_content_type = str(response.headers.get('content-type'))
        if str_response_content_type == "b'text/html; charset=utf-8'":
            links = response.xpath("//a/@href").extract()
            for link in links:
                link = link.lstrip("/")
                if ("https://" or "http://") not in link:
                    link = "https://" + str(link)
                split_link = link.split('.')
                if "daraz.com.bd" in link and link not in self.extracted_links:
                    self.extracted_links.append(link)
                    if len(split_link) > 1:
                        if "www" in link and "daraz" in split_link[1]:
                            yield scrapy.Request(url=link, callback=self.extract_link, dont_filter=True)
                        elif "www" not in link and "daraz" in split_link[0]:
                            yield scrapy.Request(url=link, callback=self.extract_link, dont_filter=True)
                if "daraz.com.bd/shop/" in link and link not in self.shop_list:
                    yield {
                        "links": link
                    }
Here is my settings.py file:
BOT_NAME = 'chotosite'
SPIDER_MODULES = ['chotosite.spiders']
NEWSPIDER_MODULE = 'chotosite.spiders'
ROBOTSTXT_OBEY = False
REDIRECT_ENABLED = False
DOWNLOAD_DELAY = 0.25
USER_AGENT = 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36'
AUTOTHROTTLE_ENABLED = True
What problem am I facing?
It stops automatically after collecting only 6-7 links that contain daraz.com.bd/shop/.
User timeout caused connection failure: Getting https://www.daraz.com.bd/kettles/ took longer than 180.0 seconds..
INFO: Ignoring response <301 https://www.daraz.com.bd/toner-and-mists/>: HTTP status code is not handled or not allowed
How do I solve these issues? Please help me.
If you have some other idea for reaching my goal, I will be more than happy. Thank you...
Here are some console logs:
2020-12-04 22:21:23 [scrapy.extensions.logstats] INFO: Crawled 891 pages (at 33 pages/min), scraped 6 items (at 0 items/min)
2020-12-04 22:22:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.daraz.com.bd/kettles/> (failed 1 times): User timeout caused connection failure: Getting https://www.daraz.com.bd/kettles/ took longer than 180.0 seconds..
2020-12-04 22:22:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.daraz.com.bd/kettles/> (referer: https://www.daraz.com.bd)
2020-12-04 22:22:11 [scrapy.core.engine] INFO: Closing spider (finished)
2020-12-04 22:22:11 [scrapy.extensions.feedexport] INFO: Stored csv feed (6 items) in: dara.csv
2020-12-04 22:22:11 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
'downloader/exception_type_count/scrapy.exceptions.NotSupported': 1,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 3,
'downloader/request_bytes': 565004,
'downloader/request_count': 896,
'downloader/request_method_count/GET': 896,
'downloader/response_bytes': 39063472,
'downloader/response_count': 892,
'downloader/response_status_count/200': 838,
'downloader/response_status_count/301': 45,
'downloader/response_status_count/302': 4,
'downloader/response_status_count/404': 5,
'elapsed_time_seconds': 828.333752,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 12, 4, 16, 22, 11, 864492),
'httperror/response_ignored_count': 54,
'httperror/response_ignored_status_count/301': 45,
'httperror/response_ignored_status_count/302': 4,
'httperror/response_ignored_status_count/404': 5,
'item_scraped_count': 6,
'log_count/DEBUG': 901,
'log_count/ERROR': 1,
'log_count/INFO': 78,
'memusage/max': 112971776,
'memusage/startup': 53370880,
'request_depth_max': 5,
'response_received_count': 892,
'retry/count': 3,
'retry/reason_count/twisted.internet.error.TimeoutError': 3,
'scheduler/dequeued': 896,
'scheduler/dequeued/memory': 896,
'scheduler/enqueued': 896,
'scheduler/enqueued/memory': 896,
'start_time': datetime.datetime(2020, 12, 4, 16, 8, 23, 530740)}
2020-12-04 22:22:11 [scrapy.core.engine] INFO: Spider closed (finished)
You can use a LinkExtractor object to extract all the links. Then you can filter for the links you want.
In your Scrapy shell:
scrapy shell https://www.daraz.com.bd
from scrapy.linkextractors import LinkExtractor
l = LinkExtractor()
links = l.extract_links(response)
for link in links:
    print(link.url)
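If you prefer to do the filtering inside the spider itself, here is a rough sketch built on the same LinkExtractor idea (the spider name and the yielded item format are illustrative, not taken from the original code):

import scrapy
from scrapy.linkextractors import LinkExtractor


class ShopLinksSpider(scrapy.Spider):
    # illustrative name; domain and start URL are reused from the question
    name = 'shop_links'
    allowed_domains = ['daraz.com.bd']
    start_urls = ['https://www.daraz.com.bd']

    def parse(self, response):
        # LinkExtractor collects every link on the page, restricted to the allowed domains
        for link in LinkExtractor(allow_domains=self.allowed_domains).extract_links(response):
            if "daraz.com.bd/shop/" in link.url:
                # yield the shop links as items, as in the question
                yield {"links": link.url}
            else:
                # keep crawling other daraz.com.bd pages to discover more links
                yield scrapy.Request(link.url, callback=self.parse)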

Can't iterate over multiprocessing.managers.DictProxy

I can't iterate over a multiprocessing.managers.DictProxy through pytest-parallel, although it works fine in plain Python.
I found this issue reported at https://bugs.python.org/issue9733, but since managers.py is read-only I cannot make changes there. Has anybody faced this issue before? How can I resolve it?
test_Run.py
from multiprocessing import Process, Manager

def f(d, l):
    d[1] = '1'
    d['2'] = 2
    d[0.25] = None
    l.reverse()

if __name__ == '__main__':
    with Manager() as manager:
        d = manager.dict()
        l = manager.list(range(10))

        p = Process(target=f, args=(d, l))
        p.start()
        p.join()

        print(d)
        print(l)
If you run this directly with Python, it works:
(venv) [tivo@localhost src]$ python test_run.py
{0.25: None, 1: '1', '2': 2}
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
(venv) [tivo@localhost src]$
EDITED:
If you run this with pytest using the following code:
from multiprocessing import Process, Manager

def test_f():
    d, l = {}, []
    d[1] = '1'
    d['2'] = 2
    d[0.25] = None
    print(d)

    with Manager() as manager:
        d = manager.dict()
        l = manager.list(range(10))
        l.reverse()
        print(l)

        p = Process(target=f)
        p.start()
        p.join()
(venv) [tivo@localhost src]$ pytest -v -s test_run.py
collecting ... [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
{0.25: None, 1: '1', '2': 2}
collected 1 item
test_run.py::test_f {0.25: None, 1: '1', '2': 2}
PASSED
(venv) [tivo@localhost src]$
But if you run it through pytest along with the pytest-parallel package, it throws an error:
(venv) [tivo@localhost src]$ pytest -v -s --tests-per-worker auto --workers auto test_run.py
===================================================================== test session starts ======================================================================
platform linux -- Python 3.4.4, pytest-4.5.0, py-1.8.0, pluggy-0.11.0 -- /home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/bin/python3
cachedir: .pytest_cache
rootdir: /home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/src, inifile: pytest.ini
plugins: xdist-1.28.0, remotedata-0.3.1, pipeline-0.3.0, parallel-0.0.9, forked-1.0.2, flake8-1.0.4, cov-2.7.1
collecting ... [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
{0.25: None, 1: '1', '2': 2}
collected 1 item
pytest-parallel: 2 workers (processes), 0 test per worker (thread)
Traceback (most recent call last):
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/bin/pytest", line 10, in <module>
sys.exit(main())
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/_pytest/config/__init__.py", line 79, in main
return config.hook.pytest_cmdline_main(config=config)
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/pluggy/hooks.py", line 289, in __call__
return self._hookexec(self, self.get_hookimpls(), kwargs)
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/pluggy/manager.py", line 68, in _hookexec
return self._inner_hookexec(hook, methods, kwargs)
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/pluggy/manager.py", line 62, in <lambda>
firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/pluggy/callers.py", line 208, in _multicall
return outcome.get_result()
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/pluggy/callers.py", line 80, in get_result
raise ex[1].with_traceback(ex[2])
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/pluggy/callers.py", line 187, in _multicall
res = hook_impl.function(*args)
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/_pytest/main.py", line 242, in pytest_cmdline_main
return wrap_session(config, _main)
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/_pytest/main.py", line 235, in wrap_session
session=session, exitstatus=session.exitstatus
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/pluggy/hooks.py", line 289, in __call__
return self._hookexec(self, self.get_hookimpls(), kwargs)
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/pluggy/manager.py", line 68, in _hookexec
return self._inner_hookexec(hook, methods, kwargs)
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/pluggy/manager.py", line 62, in <lambda>
firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/pluggy/callers.py", line 203, in _multicall
gen.send(outcome)
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/_pytest/terminal.py", line 678, in pytest_sessionfinish
self.summary_stats()
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/_pytest/terminal.py", line 876, in summary_stats
(line, color) = build_summary_stats_line(self.stats)
File "/home/tivo/workspace/ServicePortal/autotestscripts/CAT/scripts/ServerQE/brat/venv/lib/python3.4/site-packages/_pytest/terminal.py", line 1034, in build_summary_stats_line
for found_type in stats:
File "<string>", line 2, in __getitem__
File "/usr/local/lib/python3.4/multiprocessing/managers.py", line 747, in _callmethod
raise convert_to_error(kind, result)
KeyError: 0
(venv) [tivo@localhost src]$
I have the following packages installed:
pytest-parallel==0.0.9
pytest-pipeline==0.3.0
Q. What workaround can I do to get the above code to PASS without error logs? The issue is that the results are not telling me how many test cases PASSED.
Why is it reporting pytest-parallel: 2 workers (processes), 1 test per worker (thread)? I have provided only one function there!
HINT:
If I add the flag --workers 1, the above error does not appear, but it usually fails my scripts, and hence I am forced to use --tests-per-worker 1 along with it. But then there is no parallelism!

Efficient Find Next Greater in Another Array

Is it possible to remove the for loops in this function and get a speed up in the process? I have not been able to get the same results with vector methods for this function. Or is there another option?
import numpy as np

indices = np.array(
    [814, 935, 1057, 3069, 3305, 3800, 4093, 4162, 4449])
within = np.array(
    [193, 207, 243, 251, 273, 286, 405, 427, 696,
     770, 883, 896, 1004, 2014, 2032, 2033, 2046, 2066,
     2079, 2154, 2155, 2156, 2157, 2158, 2159, 2163, 2165,
     2166, 2167, 2183, 2184, 2208, 2210, 2212, 2213, 2221,
     2222, 2223, 2225, 2226, 2227, 2281, 2282, 2338, 2401,
     2611, 2612, 2639, 2640, 2649, 2700, 2775, 2776, 2785,
     3030, 3171, 3191, 3406, 3427, 3527, 3984, 3996, 3997,
     4024, 4323, 4331, 4332])

def get_first_ind_after(indices, within):
    """returns array of the first index after each listed in indices

    indices and within must be sorted ascending
    """
    first_after_leading = []
    for index in indices:
        for w_ind in within:
            if w_ind > index:
                first_after_leading.append(w_ind)
                break
    # convert to np array
    first_after_leading = np.array(first_after_leading).flatten()
    return np.unique(first_after_leading)
It should return the next greater number for each entry in the indices array, if there is one.
# Output:
[ 883 1004 2014 3171 3406 3984 4323]
Here's one based on np.searchsorted -
def next_greater(indices, within):
    idx = np.searchsorted(within, indices)
    idxv = idx[idx < len(within)]
    idxv_unq = np.unique(idxv)
    return within[idxv_unq]
Alternatively, idxv_unq could be computed like so and should be more efficient -
idxv_unq = idxv[np.r_[True,idxv[:-1] != idxv[1:]]]
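As a quick sanity check against the sample arrays in the question:

print(next_greater(indices, within))
# expected output, as listed in the question: [ 883 1004 2014 3171 3406 3984 4323]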
Try this:
[within[within>x][0] if len(within[within>x])>0 else 0 for x in indices]
As in,
In [35]: import numpy as np
...: indices = np.array([814, 935, 1057, 3069, 3305, 3800, 4093, 4162, 4449])
...:
...: within = np.array(
...: [193, 207, 243, 251, 273, 286, 405, 427, 696,
...: 770, 883, 896, 1004, 2014, 2032, 2033, 2046, 2066,
...: 2079, 2154, 2155, 2156, 2157, 2158, 2159, 2163, 2165,
...: 2166, 2167, 2183, 2184, 2208, 2210, 2212, 2213, 2221,
...: 2222, 2223, 2225, 2226, 2227, 2281, 2282, 2338, 2401,
...: 2611, 2612, 2639, 2640, 2649, 2700, 2775, 2776, 2785,
...: 3030, 3171, 3191, 3406, 3427, 3527, 3984, 3996, 3997,
...: 4024, 4323, 4331, 4332])
In [36]: [within[within>x][0] if len(within[within>x])>0 else 0 for x in indices]
Out[36]: [883, 1004, 2014, 3171, 3406, 3984, 4323, 4323, 0]
This is the Pythonic approach called a list comprehension; it's a shortened version of a foreach loop. So if I were to expand this out:
result = []
for x in indices:
    # This next line is a boolean index into the array; it returns all of the items in the array that have a value greater than x
    y = within[within > x]
    # At this point, y is an array of all the items which are larger than x. Since you wanted the first of these items, we'll just take the first item off of this new array, but it is possible that y is empty (there are no values that match the condition), so there is a check for that
    if len(y) > 0:
        z = y[0]
    else:
        z = 0  # or None or whatever you like
    # Now add this value to the array that we are building
    result.append(z)
# Now result has the array
I wrote it this way because it uses vectorized operations (i.e. the boolean mask) and also leverages a list comprehension, which is a much cleaner, simpler way to write a foreach loop that returns an array.

How to save scraped data in a DB?

I'm trying to save scraped data in a DB but got stuck.
First I saved the scraped data to a CSV file, and I'm using the glob library to find the newest CSV and upload the data from that CSV into the DB.
I'm not sure what I'm doing wrong here; please find the code and error below.
I have created a table yahoo_data in the DB with the same column names as the CSV and my code's output.
import scrapy
from scrapy.http import Request
import MySQLdb
import os
import csv
import glob


class YahooScrapperSpider(scrapy.Spider):
    name = 'yahoo_scrapper'
    allowed_domains = ['in.news.yahoo.com']
    start_urls = ['http://in.news.yahoo.com/']

    def parse(self, response):
        news_url = response.xpath('//*[@class="Mb(5px)"]/a/@href').extract()
        for url in news_url:
            absolute_url = response.urljoin(url)
            yield Request(absolute_url, callback=self.parse_text)

    def parse_text(self, response):
        Title = response.xpath('//meta[contains(@name,"twitter:title")]/@content').extract_first()
        # response.xpath('//*[@name="twitter:title"]/@content').extract_first(), this also works
        Article = response.xpath('//*[@class="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm"]/text()').extract()
        yield {'Title': Title,
               'Article': Article}

    def close(self, reason):
        csv_file = max(glob.iglob('*.csv'), key=os.path.getctime)
        mydb = MySQLdb.connect(host='localhost',
                               user='root',
                               passwd='prasun',
                               db='books')
        cursor = mydb.cursor()
        csv_data = csv.reader(csv_file)
        row_count = 0
        for row in csv_data:
            if row_count != 0:
                cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', row)
            row_count += 1
        mydb.commit()
        cursor.close()
Getting this error:
ana. It should be directed not to disrespect the Sikh community and hurt its sentiments by passing such arbitrary and uncalled for orders," said Badal.', 'The SAD president also "brought it to the notice of the Haryana chief minister that Article 25 of the constitution safeguarded the rights of all citizens to profess and practices the tenets of their faith."', '"Keeping these facts in view I request you to direct the Haryana Public Service Commission to rescind its notification and allow Sikhs as well as candidates belonging to other religions to sport symbols of their faith during all examinations," said Badal. (ANI)']}
2019-04-01 16:49:41 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-01 16:49:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (25 items) in: items.csv
2019-04-01 16:49:41 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method YahooScrapperSpider.close of <YahooScrapperSpider 'yahoo_scrapper' at 0x2c60f07bac8>>
Traceback (most recent call last):
File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\MySQLdb\cursors.py", line 201, in execute
query = query % args
TypeError: not enough arguments for format string
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "C:\Users\prasun.j\Desktop\scrapping\scrapping\spiders\yahoo_scrapper.py", line 44, in close
cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', row)
File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\MySQLdb\cursors.py", line 203, in execute
raise ProgrammingError(str(m))
MySQLdb._exceptions.ProgrammingError: not enough arguments for format string
2019-04-01 16:49:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7985,
'downloader/request_count': 27,
'downloader/request_method_count/GET': 27,
'downloader/response_bytes': 2148049,
'downloader/response_count': 27,
'downloader/response_status_count/200': 26,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 1, 11, 19, 41, 350717),
'item_scraped_count': 25,
'log_count/DEBUG': 53,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'request_depth_max': 1,
'response_received_count': 26,
'scheduler/dequeued': 27,
'scheduler/dequeued/memory': 27,
'scheduler/enqueued': 27,
'scheduler/enqueued/memory': 27,
'start_time': datetime.datetime(2019, 4, 1, 11, 19, 36, 743594)}
2019-04-01 16:49:41 [scrapy.core.engine] INFO: Spider closed (finished)
This error
MySQLdb._exceptions.ProgrammingError: not enough arguments for format string
seems to be caused by the row you passed not containing enough values for the format string.
You can try printing the row to understand what is going wrong.
Anyway, if you want to save scraped data to a DB, I suggest writing a simple item pipeline that exports the data to the DB directly, without going through a CSV.
For further information about item pipelines, see http://doc.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline
You can find a useful example at Writing items to a MySQL database in Scrapy
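For example, a minimal sketch of such a pipeline (the table, columns, and connection settings are taken from the question; the class name and everything else are illustrative and would need to be adapted):

import MySQLdb

class MySQLStorePipeline:
    def open_spider(self, spider):
        self.conn = MySQLdb.connect(host='localhost', user='root',
                                    passwd='prasun', db='books')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Article is scraped as a list of text fragments, so join it into one string
        self.cursor.execute(
            'INSERT IGNORE INTO yahoo_data (Title, Article) VALUES (%s, %s)',
            (item.get('Title'), ' '.join(item.get('Article', []))))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

You would then enable it in settings.py with something like ITEM_PIPELINES = {'yourproject.pipelines.MySQLStorePipeline': 300} (the module path here is hypothetical).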
It seems like you are passing a list where separate, comma-delimited parameters are expected.
Try adding an asterisk to the 'row' variable, changing:
cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', row)
to:
cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', *row)
