Post request/Form submission with scrapy leads to Error 404 - python-3.x

I am learning how to build a spider with Scrapy to scrape this webpage: https://www.beesmart.city. To get access, it is necessary to submit the login form at https://www.beesmart.city/login.
Inspecting the page, I didn't find a CSRF token, so I assume it is not needed in this case.
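A quick sanity check is to list every hidden `<input>` in the form, since a CSRF token is normally a hidden field. A minimal sketch using only the standard library, run against a made-up sample form (not beesmart.city's real markup):

```python
# List all hidden <input> fields of a form to confirm whether a CSRF token
# (or any other required hidden value) is present. The sample form below is
# invented for illustration.
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value", "")

sample_form = """
<form action="/login" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="email" name="email">
  <input type="password" name="password">
</form>
"""

collector = HiddenInputCollector()
collector.feed(sample_form)
print(collector.hidden)  # {'authenticity_token': 'abc123'}
```

If the dictionary comes back empty on the real page, the form either has no token or the fields are injected later by JavaScript.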
I used the following code:
import scrapy

class BeesmartLogin(scrapy.Spider):
    name = 'beesmart_login'
    login_url = 'https://www.beesmart.city/login'
    start_urls = [login_url]

    def parse(self, response):
        # extract the csrf token value
        # create a python dictionary with the form values
        data = {
            'email': '<...#...>',
            'password': '********',
        }
        # submit a POST request to it
        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_quotes)

    def parse_quotes(self, response):
        print("Parsing is done")
However, when I run it I get a 404. Does anyone have an idea why? I am very thankful for any help.
Tobi :)
The full response is:
(base) C:\Users\tobia\OneDrive\Documents\Python Scripts\Webscraping>scrapy runspider beesmart_login.py
2019-04-05 15:21:10 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: scrapybot)
2019-04-05 15:21:10 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17134-SP0
2019-04-05 15:21:10 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2019-04-05 15:21:10 [scrapy.extensions.telnet] INFO: Telnet Password: ef43803864a4422d
2019-04-05 15:21:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-04-05 15:21:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 15:21:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 15:21:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 15:21:11 [scrapy.core.engine] INFO: Spider opened
2019-04-05 15:21:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-05 15:21:11 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 15:21:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.beesmart.city/login> (referer: None)
2019-04-05 15:21:11 [scrapy.core.engine] DEBUG: Crawled (404) <POST https://www.beesmart.city/login> (referer: https://www.beesmart.city/login)
2019-04-05 15:21:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.beesmart.city/login>: HTTP status code is not handled or not allowed
2019-04-05 15:21:11 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-05 15:21:11 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 657,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 10500,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 5, 13, 21, 11, 770670),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/404': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 9,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 4, 5, 13, 21, 11, 50573)}
2019-04-05 15:21:11 [scrapy.core.engine] INFO: Spider closed (finished)

If you are looking to understand CSRF tokens, they have been explained in the following SO question.
As furas explained, it's not possible to log in with pure Scrapy when the login is handled by JavaScript.
I recommend that you read the following article about web scraping JavaScript pages using Python.
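When a login is driven by JavaScript, the form fields on the page are often a red herring: the browser usually sends the credentials as a JSON POST to some API endpoint, which you can find in the browser's Network tab while logging in manually. The endpoint path and field names below are assumptions, not beesmart.city's actual API:

```python
# Sketch of replicating a JSON login POST. The endpoint "/api/login" and the
# payload keys are hypothetical -- copy the real ones from the Network tab.
import json
import urllib.request

payload = {"email": "user@example.com", "password": "********"}
req = urllib.request.Request(
    "https://www.beesmart.city/api/login",  # hypothetical endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would actually send it; we only build it here.
print(req.get_method())  # POST
```

The Scrapy equivalent would be `yield scrapy.Request(url, method='POST', body=json.dumps(payload), headers={'Content-Type': 'application/json'}, callback=...)`.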

Related

Unsure how to troubleshoot Scrapy terminal output

I'm attempting to use scrapy, which is somewhat new to me. I built (what I thought was) a simple spider which does the following:
from scrapy.spiders import CrawlSpider

class SuperSpider(CrawlSpider):
    name = 'KYM_entries'
    start_urls = ['https://knowyourmeme.com/memes/all/page/1']

    def parse(self, response):
        for entry in response.xpath('/html/body/div[3]/div/div[3]/section'):
            yield {
                # The link to a meme entry page on Know Your Meme
                'entry_link': entry.xpath('./td[2]/a/@href').get()
            }
Then I run the following in a terminal window:
$ scrapy crawl KYM_entries -O practice.csv
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
2022-12-26 20:08:04 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: KYM_spider)
2022-12-26 20:08:04 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.13, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0], pyOpenSSL 21.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 3.4.8, Platform Linux-5.15.0-56-generic-x86_64-with-glibc2.35
2022-12-26 20:08:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'KYM_spider',
'NEWSPIDER_MODULE': 'KYM_spider.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['KYM_spider.spiders']}
2022-12-26 20:08:04 [py.warnings] WARNING: /usr/local/lib/python3.10/dist-packages/scrapy/utils/request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2022-12-26 20:08:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-12-26 20:08:04 [scrapy.extensions.telnet] INFO: Telnet Password: 97ac3d17f1e4cea1
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-26 20:08:04 [scrapy.core.engine] INFO: Spider opened
2022-12-26 20:08:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-26 20:08:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-26 20:08:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/robots.txt> (referer: None)
2022-12-26 20:08:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/memes/all/page/1> (referer: None)
2022-12-26 20:08:05 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-26 20:08:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 466,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 11690,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.953839,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 27, 1, 8, 5, 833510),
'httpcompression/response_bytes': 45804,
'httpcompression/response_count': 2,
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 65228800,
'memusage/startup': 65228800,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 12, 27, 1, 8, 4, 879671)}
2022-12-26 20:08:05 [scrapy.core.engine] INFO: Spider closed (finished)
This returns an empty CSV, which I suppose means either something is wrong with the xpath, or there is something wrong with the connection to Know Your Meme. However, beyond the 200 code saying it is connecting to the site, I'm unsure how to troubleshoot what is happening here.
So I have a couple of questions, one directly about my issue, and one more broadly about this output:
Is there a way to see at what point my script is failing to retrieve the specified data in the xpath for this particular case?
Is there a simple guide or reference for how to read scrapy output?
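One low-tech way to see where a selector stops matching is to run it against a saved or simplified copy of the page (interactively, `scrapy shell <url>` does the same job). A toy illustration with stdlib ElementTree, on invented markup, of why a long absolute path is brittle while a class-based search survives layout changes:

```python
# Compare a brittle absolute path with a class-based search on a tiny
# invented HTML sample. ElementTree's limited XPath is enough to show the idea.
import xml.etree.ElementTree as ET

sample = """
<html><body>
  <div class="entry-grid-body">
    <a class="photo" href="/memes/example-entry">Example</a>
  </div>
</body></html>
"""
root = ET.fromstring(sample)

# The absolute path from the spider assumes an exact nesting and finds nothing:
print(root.findall("./body/div[3]/div/div[3]/section"))  # []

# A class-based search keeps working when the nesting shifts:
links = [a.get("href") for a in root.findall(".//a[@class='photo']")]
print(links)  # ['/memes/example-entry']
```

The same drill in `scrapy shell` (with `response.xpath(...)` / `response.css(...)`) tells you exactly which step of a selector returns nothing.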
I have looked into your code. There are a few issues with the selectors. I have updated the CSS selector and removed the XPath. The meme URLs are relative, so I added the urljoin method to make them absolute. I also added a start_requests method, as my version of Scrapy is 2.6.0; if you are using a lower version of Scrapy (e.g. 1.6.0) you can remove this method.
from scrapy import Request
from scrapy.spiders import CrawlSpider

class SuperSpider(CrawlSpider):
    name = 'KYM_entries'
    start_urls = ['https://knowyourmeme.com/memes/all/page/1']

    def start_requests(self):
        yield Request(self.start_urls[0], callback=self.parse)

    def parse(self, response):
        for entry in response.css('.entry-grid-body .photo'):
            yield {
                # The link to a meme entry page on Know Your Meme
                'entry_link': response.urljoin(entry.css('::attr(href)').get())
            }
The code is working fine now. Below is the output.
2022-12-27 13:14:52 [scrapy.core.engine] INFO: Spider opened
2022-12-27 13:14:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-27 13:14:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-27 13:14:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/memes/all/page/1> (referer: None)
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/mayinquangcao'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/this-is-x-bitch-we-clown-in-this-muthafucka-betta-take-yo-sensitive-ass-back-to-y'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/choo-choo-charles'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/bug-fables-the-everlasting-sapling'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/onii-holding-a-picture'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/vintage-recipe-videos'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/ytpmv-elf'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/i-just-hit-a-dog-going-70mph-on-my-truck'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/women-dodging-accountability'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/grinchs-ultimatum'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/where-is-idos-black-and-white'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/basilisk-time'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/rankinbass-productions'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/error143'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/whatsapp-university'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/messi-autism-speculation-messi-is-autistic'}

Stuck in a loop / bad Xpath when doing scraping of website

I'm trying to scrape data from this website: https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM
I have made the following script for the initial data:
import scrapy

class WaiascrapSpider(scrapy.Spider):
    name = 'waiascrap'
    allowed_domains = ['clsaa-dc.org']
    start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']

    def parse(self, response):
        rows = response.xpath("//tr")
        for row in rows:
            day = rows.xpath("(//tr/td[@class='time']/span)[1]/text()").get()
            time = rows.xpath("//tr/td[@class='time']/span/time/text()").get()
            yield {
                'day': day,
                'time': time,
            }
However, the data I'm getting is repeated, as if I weren't iterating through the for loop at all:
PS C:\Users\gasgu\PycharmProjects\ScrapingProject\projects\waia> scrapy crawl waiascrap
2021-08-20 15:25:11 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: waia)
2021-08-20 15:25:11 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19042-SP0
2021-08-20 15:25:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-08-20 15:25:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'waia', 'NEWSPIDER_MODULE': 'waia.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['waia.spiders']}
2021-08-20 15:25:11 [scrapy.extensions.telnet] INFO: Telnet Password: 9299b6be5840b21c
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-08-20 15:25:11 [scrapy.core.engine] INFO: Spider opened
2021-08-20 15:25:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-20 15:25:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-08-20 15:25:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://aa-dc.org/robots.txt> (referer: None)
2021-08-20 15:25:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> (referer: None)
2021-08-20 15:25:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
EDIT:
Now it's working. There was a combination of the errors pointed out by @Prophet and a problem with my XPath.
My working code is below:
import scrapy

class WaiascrapSpider(scrapy.Spider):
    name = 'waiascrap'
    allowed_domains = ['clsaa-dc.org']
    start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']

    def parse(self, response):
        rows = response.xpath("//tr")
        for row in rows:
            day = row.xpath(".//td[@class='time']/span/text()").get()
            time = row.xpath(".//td[@class='time']/span/time/text()").get()
            yield {
                'day': day,
                'time': time,
            }
To select an element inside another element you have to put a dot (.) in front of the XPath expression, meaning "from here".
Otherwise it will bring you the first match of (//tr/td[@class='time']/span)[1]/text() on the entire page each time, as you see.
Also, since you are iterating over each row, it should be row.xpath(...), not rows.xpath, because rows is a list of elements while each row is a single element.
Putting both fixes together, each row is queried with a dot-relative XPath:
def parse(self, response):
    rows = response.xpath("//tr")
    for row in rows:
        day = row.xpath(".//td[@class='time']/span/text()").get()
        time = row.xpath(".//td[@class='time']/span/time/text()").get()
        yield {
            'day': day,
            'time': time,
        }
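A tiny standalone demonstration of the dot-relative point, using stdlib ElementTree in place of parsel so it runs anywhere:

```python
# Each row is queried with a dot-relative path, so every iteration yields
# that row's own value rather than the page's first match.
import xml.etree.ElementTree as ET

table = ET.fromstring(
    "<table>"
    "<tr><td class='time'><span>Sunday</span></td></tr>"
    "<tr><td class='time'><span>Monday</span></td></tr>"
    "</table>"
)

per_row = [
    row.find("./td[@class='time']/span").text
    for row in table.findall(".//tr")
]
print(per_row)  # ['Sunday', 'Monday']
```

With the original absolute expression, every iteration would have returned the page-wide first match ('Sunday') instead.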

apscheduler+scrapy+asyncio Can't execute first task smoothly

Versions:
Python 3.7, Scrapy 2.1.0, APScheduler 3.6.1
I created a simple spider for testing:
# -*- coding: utf-8 -*-
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://stackoverflow.com//']
    custom_settings = {
        'EXTENSIONS': {
            'scrapy.extensions.logstats.LogStats': None,
        },
        'TELNETCONSOLE_ENABLED': False,
        'LOG_LEVEL': 'INFO'
    }

    def parse(self, response):
        self.logger.info('parse--------------------------')
I run it with the following code:
from datetime import datetime
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from scrapy.utils.reactor import install_reactor
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
from apscheduler.schedulers.twisted import TwistedScheduler
from twisted.internet import reactor
configure_logging()
scheduler = TwistedScheduler(reactor=reactor)
process = CrawlerProcess(get_project_settings())
scheduler.add_job(process.crawl, 'interval', args=['test'], minutes=1, next_run_time=datetime.now())
scheduler.start()
reactor.run()
I want the spider to run immediately, and then run every minute.
The spider opens immediately, but it seems to sleep until the next scheduled run:
2020-05-25 14:34:33 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: testspider)
2020-05-25 14:34:33 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 20.3.0, Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-7-6.1.7601-SP1
2020-05-25 14:34:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-05-25 14:34:33 [apscheduler.scheduler] INFO: Adding job tentatively -- it will be properly scheduled when the scheduler starts
2020-05-25 14:34:33 [apscheduler.scheduler] INFO: Added job "CrawlerRunner.crawl" to job store "default"
2020-05-25 14:34:33 [apscheduler.scheduler] INFO: Scheduler started
2020-05-25 14:34:33 [apscheduler.scheduler] DEBUG: Looking for jobs to run
2020-05-25 14:34:33 [apscheduler.scheduler] DEBUG: Next wakeup is due at 2020-05-25 14:35:33.270223+08:00 (in 59.966950 seconds)
2020-05-25 14:34:33 [apscheduler.executors.default] INFO: Running job "CrawlerRunner.crawl (trigger: interval[0:01:00], next run at: 2020-05-25 14:35:33 CST)" (scheduled at 2020-05-25 14:34:33.270223+08:00)
2020-05-25 14:34:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'testspider',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'testspider.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['testspider.spiders'],
'TELNETCONSOLE_ENABLED': False}
2020-05-25 14:34:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats']
2020-05-25 14:34:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-25 14:34:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-25 14:34:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-25 14:34:33 [scrapy.core.engine] INFO: Spider opened
2020-05-25 14:34:33 [apscheduler.executors.default] INFO: Job "CrawlerRunner.crawl (trigger: interval[0:01:00], next run at: 2020-05-25 14:35:33 CST)" executed successfully
2020-05-25 14:35:33 [apscheduler.executors.default] INFO: Running job "CrawlerRunner.crawl (trigger: interval[0:01:00], next run at: 2020-05-25 14:36:33 CST)" (scheduled at 2020-05-25 14:35:33.270223+08:00)
2020-05-25 14:35:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'testspider',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'testspider.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['testspider.spiders'],
'TELNETCONSOLE_ENABLED': False}
2020-05-25 14:35:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats']
2020-05-25 14:35:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-25 14:35:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-25 14:35:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-25 14:35:33 [scrapy.core.engine] INFO: Spider opened
2020-05-25 14:35:33 [apscheduler.executors.default] INFO: Job "CrawlerRunner.crawl (trigger: interval[0:01:00], next run at: 2020-05-25 14:36:33 CST)" executed successfully
2020-05-25 14:35:34 [test] INFO: parse--------------------------
2020-05-25 14:35:34 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-25 14:35:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 764,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 26324,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 61.426001,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 25, 6, 35, 34, 905538),
'log_count/INFO': 17,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 5, 25, 6, 34, 33, 479537)}
2020-05-25 14:35:34 [scrapy.core.engine] INFO: Spider closed (finished)
2020-05-25 14:35:35 [test] INFO: parse--------------------------
2020-05-25 14:35:35 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-25 14:35:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 764,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 26322,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 1.62243,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 25, 6, 35, 35, 9694),
'log_count/INFO': 13,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 5, 25, 6, 35, 33, 387264)}
2020-05-25 14:35:35 [scrapy.core.engine] INFO: Spider closed (finished)
If the AsyncioSelectorReactor is not used, the first task works smoothly.
I had the same problem, and I found the cause and a solution.
First the solution: it seems that scrapy.utils.reactor.install_reactor uses asyncioreactor from the twisted.internet package and asyncio as global variables, and fails silently if it can't find them. So the right way to go would be:
# asyncio reactor installation (CORRECT) - `reactor` must not be defined at this point
# https://docs.scrapy.org/en/latest/_modules/scrapy/utils/reactor.html?highlight=asyncio%20reactor#
import scrapy
import asyncio
from twisted.internet import asyncioreactor
scrapy.utils.reactor.install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
is_asyncio_reactor_installed = scrapy.utils.reactor.is_asyncio_reactor_installed()
print(f"Is asyncio reactor installed: {is_asyncio_reactor_installed}")
from twisted.internet import reactor
However, the following set of instructions fails:
# asyncio reactor BAD INSTALLED (INCORRECT) Import order IS important
import scrapy
import asyncio
from twisted.internet import reactor
scrapy.utils.reactor.install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
is_asyncio_reactor_installed = scrapy.utils.reactor.is_asyncio_reactor_installed()
print(f"Is asyncio reactor installed: {is_asyncio_reactor_installed}")
It's bad behavior; it should not fail silently like this. I hope the Scrapy developers make this right soon.
APScheduler first waits out the interval and only then runs the scheduled function, so it's working correctly. If you want it to also run immediately, add next_run_time=datetime.now() to your add_job() call.
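The timing can be mimicked with the standard library's sched module: an interval job alone fires only after the first interval elapses, and next_run_time=datetime.now() amounts to scheduling one extra immediate run (the 50 ms interval below stands in for the 1-minute one):

```python
# Mimic APScheduler's behaviour: the interval job waits one interval, and an
# extra zero-delay entry plays the role of next_run_time=datetime.now().
import sched
import time

runs = []
scheduler = sched.scheduler(time.monotonic, time.sleep)

interval = 0.05  # stands in for minutes=1
scheduler.enter(0, 1, lambda: runs.append("immediate"))        # next_run_time=now()
scheduler.enter(interval, 1, lambda: runs.append("interval"))  # normal first run
scheduler.run()

print(runs)  # ['immediate', 'interval']
```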
I found a workaround: use another task to "activate" it. For example, add the task:
sched.add_job(lambda :print('activate'), 'interval', minutes=1, next_run_time=datetime.now() + timedelta(seconds=5))
I found this old question, and it seems the TwistedScheduler doesn't work well with the Scrapy framework.
This is my solution, using a separate thread just for APScheduler:
import threading
from twisted.internet import reactor
from apscheduler.schedulers.blocking import BlockingScheduler
from scrapy.crawler import CrawlerRunner
runner = CrawlerRunner(YOUR_SCRAPY_SETTINGS)  # substitute your Scrapy settings object
scheduler = BlockingScheduler()
scheduler.add_job(
lambda: reactor.callFromThread(lambda: runner.crawl('test')),
trigger='cron', minute='*/1',
)
scheduler_thread = threading.Thread(target=lambda: scheduler.start())
scheduler_thread.start()
reactor.run()
scheduler.shutdown()
References:
https://docs.scrapy.org/en/latest/topics/practices.html
https://docs.twisted.org/en/twisted-18.7.0/core/howto/threading.html
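For readers wondering why the snippet goes through reactor.callFromThread instead of calling runner.crawl directly: Twisted's reactor is not thread-safe, so the scheduler thread must hand callables over to the reactor's own thread. The same hand-off pattern can be sketched with only the standard library (hypothetical names, not Twisted code), where the main loop plays the role of the reactor:

```python
import queue
import threading

# Thread-safe queue of callables; a stand-in for reactor.callFromThread's
# internal hand-off mechanism.
calls = queue.Queue()

def call_from_thread(fn):
    # Safe to call from any thread: just enqueue the work.
    calls.put(fn)

results = []

def scheduler_thread():
    # Worker thread posts jobs instead of touching shared state directly.
    for i in range(3):
        call_from_thread(lambda i=i: results.append(i))
    call_from_thread(None)  # sentinel: stop the loop

threading.Thread(target=scheduler_thread).start()

# "Reactor" loop on the main thread: drains the queue and runs each callable.
while True:
    fn = calls.get()
    if fn is None:
        break
    fn()

print(results)  # -> [0, 1, 2]
```

The point of the pattern is that only one thread ever executes the posted callables, which is exactly the guarantee the real reactor needs.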

Getting sqlite3.OperationalError: unrecognized token: ":" when calling DB Update

Getting an error when trying to update a column in the db in my pipelines file, set_data_update function.
What I am trying to do: use get_data to return the url and price, and for each url returned, call set_data_update, which swaps the existing new_price into old_price and then puts the newly scraped price into new_price. My call to set_data_update in get_data always runs twice; it should run once, because at the moment I only have one row in the DB, for the 2nd URL:
"https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10".
Also I see an error in the traceback
sqlite3.OperationalError: unrecognized token: ":"
products.json
{
"itemdata": [
{ "url": "https://www.amazon.com/dp/B07GWKT87L/?coliid=I36XKNB8MLE3&colid=KRASGH7290D0&psc=0&ref_=lv_ov_lig_dp_it#customerReview",
"title": "coffee_maker_black_and_decker",
"name": "Cobi Maguire",
"email": "cobi@noemail.com"
},
{ "url": "https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10",
"title": "coffee_maker_hamilton_beach",
"name": "Ryan Murphy",
"email": "ryan@noemail.com"
}
]
}
Error traceback (full console output):
(price_monitor) C:\Users\hassy\Documents\python_venv\price_monitor\price_monitor>scrapy crawl price_monitor
2019-06-15 17:00:10 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: price_monitor)
2019-06-15 17:00:10 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 16:07:46) [MSC v.1900 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17134-SP0
2019-06-15 17:00:10 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'price_monitor', 'NEWSPIDER_MODULE': 'price_monitor.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['price_monitor.spiders'], 'USER_AGENT': 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
2019-06-15 17:00:10 [scrapy.extensions.telnet] INFO: Telnet Password: 3c0578dfed20521c
2019-06-15 17:00:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-06-15 17:00:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-06-15 17:00:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-06-15 17:00:10 [scrapy.middleware] INFO: Enabled item pipelines:
['price_monitor.pipelines.PriceMonitorPipeline']
2019-06-15 17:00:10 [scrapy.core.engine] INFO: Spider opened
2019-06-15 17:00:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-15 17:00:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-15 17:00:11 [scrapy.core.engine] DEBUG: Crawled (200) https://www.amazon.com/robots.txt> (referer: None)
2019-06-15 17:00:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to https://www.amazon.com/BLACK-DECKER-CM4202S-Programmable-Coffeemaker/dp/B07GWKT87L> from https://www.amazon.com/dp/B07GWKT87L/?coliid=I36XKNB8MLE3&colid=KRASGH7290D0&psc=0&ref_=lv_ov_lig_dp_it#customerReview>
2019-06-15 17:00:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB> from https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10>
2019-06-15 17:00:12 [scrapy.core.engine] DEBUG: Crawled (200) https://www.amazon.com/BLACK-DECKER-CM4202S-Programmable-Coffeemaker/dp/B07GWKT87L> (referer: None)
2019-06-15 17:00:12 [scrapy.core.engine] DEBUG: Crawled (200) https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB> (referer: None)
Printing rows
('https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10', '$37.99')
calling func
2019-06-15 17:00:12 [scrapy.core.scraper] ERROR: Error processing {'email': 'ryan@noemail.com',
'name': 'Ryan Murphy',
'price': '$49.99',
'title': 'BLACK+DECKER CM4202S Select-A-Size Easy Dial Programmable '
'Coffeemaker, Extra Large 80 ounce Capacity, Stainless Steel',
'url': 'h'}
Traceback (most recent call last):
File "c:\users\hassy\documents\python_venv\price_monitor\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 37, in process_item
self.get_data(item)
File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 60, in get_data
self.set_data_update(item, url, new_price)
File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 88, in set_data_update
{'old_price': old_price, 'new_price': item['price']})
sqlite3.OperationalError: unrecognized token: ":"
Printing rows
('https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10', '$37.99')
calling func
2019-06-15 17:00:12 [scrapy.core.scraper] ERROR: Error processing {'email': 'ryan@noemail.com',
'name': 'Ryan Murphy',
'price': '$34.99',
'title': 'Hamilton Beach 46310 Programmable Coffee Maker, 12 Cups, Black',
'url': 'h'}
Traceback (most recent call last):
File "c:\users\hassy\documents\python_venv\price_monitor\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 37, in process_item
self.get_data(item)
File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 60, in get_data
self.set_data_update(item, url, new_price)
File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 88, in set_data_update
{'old_price': old_price, 'new_price': item['price']})
sqlite3.OperationalError: unrecognized token: ":"
2019-06-15 17:00:12 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-15 17:00:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1888,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 261495,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 6, 15, 21, 0, 12, 534906),
'log_count/DEBUG': 5,
'log_count/ERROR': 2,
'log_count/INFO': 9,
'response_received_count': 3,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2019, 6, 15, 21, 0, 10, 799145)}
2019-06-15 17:00:12 [scrapy.core.engine] INFO: Spider closed (finished)
(price_monitor) C:\Users\hassy\Documents\python_venv\price_monitor\price_monitor>
pipelines.py
import sqlite3
class PriceMonitorPipeline(object):
def __init__(self):
self.create_connection()
self.create_table()
def create_connection(self):
self.conn = sqlite3.connect("price_monitor.db")
self.curr = self.conn.cursor()
def process_item(self, item, spider):
# self.store_data(item)
print("printing items")
print(item['title'])
print(item['price'])
self.get_data(item)
return item
def get_data(self, item):
""" Check if the row already exists for this url """
rows = 0
url = ''
new_price = ''
self.rows = rows
self.url = url
self.new_price = new_price
self.curr.execute("""select url, new_price from price_monitor WHERE url =:url""",
{'url': item['url']})
rows = self.curr.fetchone()
print("Printing rows")
print(rows)
rows_url = rows[0]
new_price = rows[1]
if rows is not None:
for item['url'] in rows_url:
print("calling func")
self.set_data_update(item, url, new_price)
else:
pass
def set_data_update(self, item, url, new_price):
url = 'https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10'
old_price = new_price
price = item['price']
print("printing old price")
print(old_price)
print("New Price".format(item['price']))
self.curr.execute("""update price_monitor SET old_price=: old_price, new_price=: new_price
WHERE url=: url""",
{'old_price': old_price, 'new_price': price})
self.conn.commit()
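As an aside on the error itself: sqlite3 named placeholders must be written with no space between the colon and the name (:old_price, not ": old_price"), and every placeholder in the statement, including :url, must appear in the parameters dict. A minimal, self-contained sketch (in-memory database and a hypothetical row, not the asker's actual schema) of a working UPDATE with named parameters:

```python
import sqlite3

# In-memory DB with a simplified version of the price_monitor table.
conn = sqlite3.connect(":memory:")
curr = conn.cursor()
curr.execute("CREATE TABLE price_monitor (url TEXT, old_price TEXT, new_price TEXT)")
curr.execute(
    "INSERT INTO price_monitor VALUES (?, ?, ?)",
    ("https://www.amazon.com/example-product", "", "$37.99"),
)

# Named placeholders: ':name' with no space, and ALL placeholders
# (old_price, new_price, url) supplied in the dict.
curr.execute(
    """UPDATE price_monitor SET old_price=:old_price, new_price=:new_price
       WHERE url=:url""",
    {"old_price": "$37.99", "new_price": "$34.99", "url": "https://www.amazon.com/example-product"},
)
conn.commit()

row = curr.execute("SELECT old_price, new_price FROM price_monitor").fetchone()
print(row)  # -> ('$37.99', '$34.99')
```

Writing "SET old_price=: old_price" makes SQLite see a bare ":" token, which is what produces "unrecognized token: ':'" in the traceback above.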
items.py
import scrapy
class AmazonItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
url = scrapy.Field()
title = scrapy.Field()
price = scrapy.Field()
name = scrapy.Field()
email = scrapy.Field()
Spider
import scrapy
import json
import sys
from ..items import AmazonItem
class MySpider(scrapy.Spider):
name = 'price_monitor'
newlist = []
start_urls = []
itemdatalist = []
with open('C:\\Users\\hassy\\Documents\\python_venv\\price_monitor\\price_monitor\\products.json') as f:
data = json.load(f)
itemdatalist = data['itemdata']
for item in itemdatalist:
start_urls.append(item['url'])
def start_requests(self):
for item in MySpider.start_urls:
yield scrapy.Request(url=item, callback=self.parse)
def parse(self, response):
for url in MySpider.start_urls:
scrapeitem = AmazonItem()
title = response.css('span#productTitle::text').extract_first()
title = title.strip()
price = response.css('span#priceblock_ourprice::text').extract_first()
scrapeitem['title'] = title
scrapeitem['price'] = price
for item in MySpider.data['itemdata']:
url = item['url']
name = item['name']
email = item['email']
scrapeitem['url'] = url
scrapeitem['name'] = name
scrapeitem['email'] = email
yield scrapeitem

Scrapy crawl spider does not follow all the links and does not populate Item loader

I am trying to follow the links on this website (https://minerals.usgs.gov/science/mineral-deposit-database/#products) and scrape the title from each one. However, it doesn't work! The spider does not seem to follow the links!
CODE
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import datetime
import socket
from depositsusa.items import DepositsusaItem
from scrapy.loader import ItemLoader
class DepositsSpider(CrawlSpider):
name = 'deposits'
allowed_domains = ['web']
start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]
rules = (
Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
callback='parse'),
)
def parse(self, response):
i = ItemLoader(item=DepositsusaItem(), response=response)
i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
i.add_value('url', response.url)
i.add_value('project', self.settings.get('BOT_NAME'))
i.add_value('spider', self.name)
i.add_value('server', socket.gethostname())
i.add_value('date', datetime.datetime.now())
return i.load_item()
items
import scrapy
from scrapy.item import Item, Field
class DepositsusaItem(Item):
# main fields
name = Field()
# Housekeeping Fields
url = Field()
project = Field()
spider = Field()
server = Field()
date = Field()
pass
OUTPUT
(base) C:\Users\User\Documents\Python WebCrawling Learing
Projects\depositsusa>scrapy crawl deposits
2018-11-17 00:29:48 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot:
depositsusa)
2018-11-17 00:29:48 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2
2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.7.0, Python
3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)],
pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform
Windows-10-10.0.17134-SP0
2018-11-17 00:29:48 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME':
'depositsusa', 'NEWSPIDER_MODULE': 'depositsusa.spiders', 'ROBOTSTXT_OBEY':
True, 'SPIDER_MODULES': ['depositsusa.spiders']}
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled downloader
middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-11-17 00:29:48 [scrapy.core.engine] INFO: Spider opened
2018-11-17 00:29:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0
pages/min), scraped 0 items (at 0 items/min)
2018-11-17 00:29:48 [scrapy.extensions.telnet] DEBUG: Telnet console
listening on 127.0.0.1:6024
2018-11-17 00:29:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://minerals.usgs.gov/robots.txt> (referer: None)
2018-11-17 00:29:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://minerals.usgs.gov/science/mineral-deposit-database/#products>
(referer: None)
2018-11-17 00:29:49 [scrapy.core.scraper] DEBUG: Scraped from <200
https://minerals.usgs.gov/science/mineral-deposit-database/>
{'date': [datetime.datetime(2018, 11, 17, 0, 29, 49, 832526)],
'project': ['depositsusa'],
'server': ['DESKTOP-9CUE746'],
'spider': ['deposits'],
'url': ['https://minerals.usgs.gov/science/mineral-deposit-database/']}
2018-11-17 00:29:49 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-17 00:29:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 475,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 25123,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 16, 23, 29, 49, 848053),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 11, 16, 23, 29, 48, 520273)}
2018-11-17 00:29:49 [scrapy.core.engine] INFO: Spider closed (finished)
I am quite new to Python, so what seems to be the problem? Is it something to do with link extraction or the parse function?
You have to change a couple of things.
First, when you use a CrawlSpider, you can't have a callback named parse as you would override the CrawlSpider's parse: https://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
Secondly, you want to have the correct list of allowed_domains.
Try something like this:
class DepositsSpider(CrawlSpider):
name = 'deposits'
allowed_domains = ['doi.org']
start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]
rules = (
Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
callback='parse_x'),
)
def parse_x(self, response):
i = ItemLoader(item=DepositsusaItem(), response=response)
i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
i.add_value('url', response.url)
i.add_value('project', self.settings.get('BOT_NAME'))
i.add_value('spider', self.name)
i.add_value('server', socket.gethostname())
i.add_value('date', datetime.datetime.now())
return i.load_item()

Resources