and thanks in advance.
I'm attempting to use scrapy, which is somewhat new to me. I built (what I thought was) a simple spider which does the following:
class SuperSpider(CrawlSpider):
name = 'KYM_entries'
start_urls = ['https://knowyourmeme.com/memes/all/page/1']
def parse(self, response):
for entry in response.xpath('/html/body/div[3]/div/div[3]/section'):
yield {
# The link to a meme entry page on Know Your Meme
'entry_link': entry.xpath('./td[2]/a/#href').get()
}
Then I run the following in a terminal window:
$ scrapy crawl KYM_entries -O practice.csv
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
2022-12-26 20:08:04 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: KYM_spider)
2022-12-26 20:08:04 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.13, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0], pyOpenSSL 21.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 3.4.8, Platform Linux-5.15.0-56-generic-x86_64-with-glibc2.35
2022-12-26 20:08:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'KYM_spider',
'NEWSPIDER_MODULE': 'KYM_spider.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['KYM_spider.spiders']}
2022-12-26 20:08:04 [py.warnings] WARNING: /usr/local/lib/python3.10/dist-packages/scrapy/utils/request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2022-12-26 20:08:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-12-26 20:08:04 [scrapy.extensions.telnet] INFO: Telnet Password: 97ac3d17f1e4cea1
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-26 20:08:04 [scrapy.core.engine] INFO: Spider opened
2022-12-26 20:08:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-26 20:08:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-26 20:08:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/robots.txt> (referer: None)
2022-12-26 20:08:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/memes/all/page/1> (referer: None)
2022-12-26 20:08:05 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-26 20:08:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 466,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 11690,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.953839,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 27, 1, 8, 5, 833510),
'httpcompression/response_bytes': 45804,
'httpcompression/response_count': 2,
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 65228800,
'memusage/startup': 65228800,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 12, 27, 1, 8, 4, 879671)}
2022-12-26 20:08:05 [scrapy.core.engine] INFO: Spider closed (finished)
This returns an empty CSV, which I suppose means either something is wrong with the xpath, or there is something wrong with the connection to Know Your Meme. However, beyond the 200 code saying it is connecting to the site, I'm unsure how to troubleshoot what is happening here.
So I have a couple questions, one more direct to my issue, and one more broadly interested in this output:
Is there a way to see at what point my script is failing to retrieve the specified data in the xpath for this particular case?
Is there a simple guide or reference for how to read scrapy output?
I have looked into your code. There are a few issues with the selectors/XPath. I have updated the CSS selector and removed the XPATH. meme URLs are relative URLs so I have added urljoin method to make these URLs absolute URLs. I have added start_request method as my version of scrapy is 2.6.0. if you are using a lower version of scrapy (1.6.0) you can remove this method.
class SuperSpider(CrawlSpider):
name = 'KYM_entries'
start_urls = ['https://knowyourmeme.com/memes/all/page/1']
def start_requests(self):
yield Request(self.start_urls[0], callback=self.parse)
def parse(self, response):
for entry in response.css('.entry-grid-body .photo'):
yield {
# The link to a meme entry page on Know Your Meme
'entry_link': response.urljoin(entry.css('::attr(href)').get())
}
The code is working fine now. Below is the output.
2022-12-27 13:14:52 [scrapy.core.engine] INFO: Spider opened
2022-12-27 13:14:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-27 13:14:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-27 13:14:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/memes/all/page/1> (referer: None)
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/mayinquangcao'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/this-is-x-bitch-we-clown-in-this-muthafucka-betta-take-yo-sensitive-ass-back-to-y'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/choo-choo-charles'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/bug-fables-the-everlasting-sapling'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/onii-holding-a-picture'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/vintage-recipe-videos'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/ytpmv-elf'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/i-just-hit-a-dog-going-70mph-on-my-truck'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/women-dodging-accountability'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/grinchs-ultimatum'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/where-is-idos-black-and-white'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/basilisk-time'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/rankinbass-productions'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/error143'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/whatsapp-university'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/messi-autism-speculation-messi-is-autistic'}
Related
I'm new to Scrapy but I'm running into an issue forming an accurate selector based on scrapy's tutorial code basically I'm trying to list all business,their Address and their website. But when I try to list them only one result comes out (if i set all of them to getall then i'm getting all of them just they are thrown there randomly and i need them in format:
{"address": "mazowieckie, Warszawa", "name": "Dom Development S.A.", "link": "domd.pl"})
Here is code that I use:
class RynekMainSpider(scrapy.Spider):
name = "RynekMain"
start_urls = [
'https://rynekpierwotny.pl/deweloperzy/?page=1',
]
def parse(self, response):
for quote in response.css('ul.rp-1qtpzi4'):
yield {
'address': quote.css('address.rp-o9b83y::text').get(),
'name': quote.css('h2.rp-69f2r4::text').get(),
'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
}
```
Thanks in advance.
You are getting only one output because the element selection/locator strategy ul.rp-1qtpzi4 is incorrect meaning it's not selecting all the lising from the entire page but the correct selection like
.rp-y89gny.eboilu01 ul li select all 24 items
import scrapy
from scrapy.crawler import CrawlerProcess
class RynekMainSpider(scrapy.Spider):
name = "RynekMain"
start_urls = [
'https://rynekpierwotny.pl/deweloperzy/?page=1']
def parse(self, response):
for quote in response.css('.rp-y89gny.eboilu01 ul li'):
yield {
'address': quote.css('address.rp-o9b83y::text').get(),
'name': quote.css('h2.rp-69f2r4::text').get(),
'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
}
if __name__ == "__main__":
process =CrawlerProcess()
process.crawl(RynekMainSpider)
process.start()
Output:
{'address': 'mazowieckie, Warszawa', 'name': 'Dom Development S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/dom-development-sa-955/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Ronson Development Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/ronson-development-sp-z-oo-863/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Echo Investment S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/echo-investment-sa-7478/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław, Psie Pole', 'name': 'INTER-ES Deweloper', 'link': 'https://rynekpierwotny.pl/deweloperzy/inter-es-deweloper-928/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'śląskie, Bielsko-Biała', 'name': 'Murapol S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/murapol-sa-884/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Robyg S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/robyg-sa-888/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'śląskie, cieszyński, Cieszyn', 'name': 'ATAL S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/atal-sa-1084/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'zachodniopomorskie, Szczecin', 'name': 'Assethome – Przedstawiciel Dewelopera', 'link': 'https://rynekpierwotny.pl/deweloperzy/asset-home-przedstawiciel-dewelopera-7429/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Hreit', 'link': 'https://rynekpierwotny.pl/deweloperzy/hreit-7892/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław', 'name': 'Develia', 'link': 'https://rynekpierwotny.pl/deweloperzy/develia-1048/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław, Fabryczna', 'name': 'PROFIT Development', 'link': 'https://rynekpierwotny.pl/deweloperzy/profit-development-940/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Novisa Development Sp. z o.o. Sp. J.', 'link': 'https://rynekpierwotny.pl/deweloperzy/novisa-development-sp-z-oo-sp-j-484/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'pomorskie, Gdańsk', 'name': 'Robyg', 'link': 'https://rynekpierwotny.pl/deweloperzy/robyg-grupa-deweloperska-4251/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Arche S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/arche-sp-z-oo-934/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'warmińsko-mazurskie, ełcki, Ełk', 'name': 'Rutkowski Development Sp. J.', 'link': 'https://rynekpierwotny.pl/deweloperzy/rutkowski-development-sp-j-1846/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Cordia Polska', 'link': 'https://rynekpierwotny.pl/deweloperzy/cordia-polska-3824/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Budlex Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/budlex-sp-z-oo-1684/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'pomorskie, Gdańsk', 'name': 'Euro Styl S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/euro-styl-sa-964/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'łódzkie, Skierniewice', 'name': 'JHM DEVELOPMENT S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/jhm-development-sa-892/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław', 'name': 'Lokum Deweloper', 'link': 'https://rynekpierwotny.pl/deweloperzy/lokum-deweloper-948/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'podlaskie, Łomża', 'name': 'Eldor Bud Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/eldor-bud-sp-z-oo-4355/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Nexity Polska Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/nexity-polska-sp-z-oo-2856/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Spravia', 'link': 'https://rynekpierwotny.pl/deweloperzy/spravia-1236/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'małopolskie, Kraków', 'name': 'Bryksy', 'link': 'https://rynekpierwotny.pl/deweloperzy/bryksy-914/'}
'item_scraped_count': 24,,
response.css('ul.rp-1qtpzi4') will get you the container of the items, and not the items (li tag) themselves. So you're looping over the container (once) and getting just the first item.
Change it to:
import scrapy
class RynekMainSpider(scrapy.Spider):
name = "RynekMain"
start_urls = [
'https://rynekpierwotny.pl/deweloperzy/?page=1',
]
def parse(self, response):
for quote in response.css('ul.rp-1qtpzi4 li'):
yield {
'address': quote.css('address.rp-o9b83y::text').get(),
'name': quote.css('h2.rp-69f2r4::text').get(),
'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
}
I'm trying to scrape data from this website: https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM
I have made the following script for the intial data:
import scrapy
class WaiascrapSpider(scrapy.Spider):
name = 'waiascrap'
allowed_domains = ['clsaa-dc.org']
start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']
def parse(self, response):
rows = response.xpath("//tr")
for row in rows:
day = rows.xpath("(//tr/td[#class='time']/span)[1]/text()").get()
time = rows.xpath("//tr/td[#class='time']/span/time/text()").get()
yield{
'day': day,
'time': time,
}
however the data I'm getting is repeated, like if I'm not navigating the For cycle:
PS C:\Users\gasgu\PycharmProjects\ScrapingProject\projects\waia>
scrapy crawl waiascrap 2021-08-20 15:25:11 [scrapy.utils.log] INFO:
Scrapy 2.5.0 started (bot: waia) 2021-08-20 15:25:11
[scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5,
cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python
3.9.6 (t ags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography
3.4.7, Platform Windows-10-
10.0.19042-SP0 2021-08-20 15:25:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor 2021-08-20
15:25:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME':
'waia', 'NEWSPIDER_MODULE': 'waia.spiders', 'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['waia.spiders']} 2021-08-20 15:25:11
[scrapy.extensions.telnet] INFO: Telnet Password: 9299b6be5840b21c
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats'] 2021-08-20 15:25:11
[scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2021-08-20
15:25:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2021-08-20 15:25:11
[scrapy.middleware] INFO: Enabled item pipelines: [] 2021-08-20
15:25:11 [scrapy.core.engine] INFO: Spider opened 2021-08-20 15:25:11
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min) 2021-08-20 15:25:11
[scrapy.extensions.telnet] INFO: Telnet console listening on
127.0.0.1:6023 2021-08-20 15:25:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://aa-dc.org/robots.txt> (referer: None) 2021-08-20
15:25:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> (referer: None)
2021-08-20 15:25:16 [scrapy.core.scraper] DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:19 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:22 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:26 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:29 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:32 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:35 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:39 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'}
EDIT:
now it's working, there was a combination of the errors marked by #Prophet, and a problem with my Xpath.
I'm putting my code working below:
import scrapy
class WaiascrapSpider(scrapy.Spider):
name = 'waiascrap'
allowed_domains = ['clsaa-dc.org']
start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']
def parse(self, response):
rows = response.xpath("//tr")
for row in rows:
day = row.xpath(".//td[#class='time']/span/text()").get()
time = row.xpath(".//td[#class='time']/span/time/text()").get()
yield {
'day': day,
'time': time,
}
To select element inside element you have to put a dot . in front of the XPath expression saying "from here".
Otherwise it will bring you the first match of (//tr/td[#class='time']/span)[1]/text() on the entire page each time, as you see.
Also, since you are iterating per each row it should be row.xpath..., not rows.xpath since rows is a list of elements while each row is an element.
Also, to apply search on a web element according to XPath locator you should use find_element_by_xpath method, not xpath.
def parse(self, response):
rows = response.xpath("//tr")
for row in rows:
day = row.find_element_by_xpath(".(//tr/td[#class='time']/span)[1]/text()").get()
time = row.find_element_by_xpath("//.tr/td[#class='time']/span/time/text()").get()
yield{
'day': day,
'time': time,
}
I'm creating a web scraper and want to callback to get sub-pages, but it seems not working correctly and no result, any one can help ?
Here is my code
class YellSpider(scrapy.Spider):
name = "yell"
start_urls = [url, ]
def parse(self, response):
pageNum = 0
pages = len(response.xpath(".//div[#class='col-sm-14 col-md-16 col-lg-14 text-center']/*"))
for page in range(pages):
pageNum = page + 1
for x in range(5):
num = random.randint(5, 8)
time.sleep(num)
for item in response.xpath(".//a[#href][contains(#href,'/#view=map')][contains(#href,'/biz/')]"):
subcategory = base_url + item.xpath("./#href").extract_first().replace("/#view=map", "")
sub_req = scrapy.Request(subcategory, callback=self.parse_details)
yield sub_req
next_page = base_url + "ucs/UcsSearchAction.do?&selectedClassification=" + classification + "&keywords=" + keyword + "&location=" + city + "&pageNum=" + str(
pageNum + 1)
if next_page:
yield scrapy.Request(next_page, self.parse)
def parse_details(self, sub_req):
for x in range(5):
num = random.randint(1, 5)
# time.sleep(num)
name = sub_req.xpath(".//h1[#class='text-h1 businessCard--businessName']/text()").extract_first()
address = " ".join(
sub_req.xpath(".//span[#class='address'][#itemprop='address']/child::node()/text()").extract())
telephone = sub_req.xpath(".//span[#class='business--telephoneNumber']/text()").extract_first()
web = sub_req.xpath(
".//div[#class='row flexColumns-sm-order-8 floatedColumns-md-right floatedColumns-lg-right floatedColumns-md-19 floatedColumns-lg-19']//a[#itemprop='url']/#href").extract_first()
hours = ""
overview = ""
yield {
'Name': name,
'Address': address,
'Telephone': telephone,
'Web Site': web
}
I want to callback from response to parse_details
I expect to loop over all adds in sub_req and scrape the data from it
and here is the logfile:
2019-06-17 15:50:33 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Web_Scraper)
2019-06-17 15:50:33 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-10-10.0.16299-SP0
2019-06-17 15:50:33 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Web_Scraper', 'LOG_FILE': 'output.log', 'NEWSPIDER_MODULE': 'Web_Scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Web_Scraper.spiders']}
2019-06-17 15:50:33 [scrapy.extensions.telnet] INFO: Telnet Password: 91c423015a2cd984
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-06-17 15:50:34 [scrapy.core.engine] INFO: Spider opened
2019-06-17 15:50:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-17 15:50:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-17 15:50:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/robots.txt> (referer: None)
2019-06-17 15:50:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1> (referer: None)
2019-06-17 15:50:49 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.yell.com//biz/pennine-tuition-services-liverpool-9465687> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/pennine-tuition-services-liverpool-9465687/> from <GET https://www.yell.com//biz/pennine-tuition-services-liverpool-9465687>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/home-maths-tutoring-liverpool-7467622/> from <GET https://www.yell.com//biz/home-maths-tutoring-liverpool-7467622>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/kumon-maths-and-english-wallasey-8945913/> from <GET https://www.yell.com//biz/kumon-maths-and-english-wallasey-8945913>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/maths-tution-wallasey-7574939/> from <GET https://www.yell.com//biz/maths-tution-wallasey-7574939>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/patrick-haslam-tutoring-liverpool-8777349/> from <GET https://www.yell.com//biz/patrick-haslam-tutoring-liverpool-8777349>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/maths-tution-wallasey-8327361/> from <GET https://www.yell.com//biz/maths-tution-wallasey-8327361>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525/> from <GET https://www.yell.com//biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/tutor-services-liverpool-liverpool-8134849/> from <GET https://www.yell.com//biz/tutor-services-liverpool-liverpool-8134849>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/advanced-maths-tutorials-liverpool-3755223/> from <GET https://www.yell.com//biz/advanced-maths-tutorials-liverpool-3755223>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/kumon-maths-and-english-liverpool-8903743/> from <GET https://www.yell.com//biz/kumon-maths-and-english-liverpool-8903743>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985/> from <GET https://www.yell.com//biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/pennine-tuition-services-liverpool-9465687/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/north-west-tutors-liverpool-901511208/> from <GET https://www.yell.com//biz/north-west-tutors-liverpool-901511208>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/home-maths-tutoring-liverpool-7467622/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/triple-m-education-prenton-8934754/> from <GET https://www.yell.com//biz/triple-m-education-prenton-8934754>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/kumon-maths-and-english-wallasey-8945913/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/pennine-tuition-services-liverpool-9465687/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/maths-tution-wallasey-7574939/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/maths-tuition-wallasey-901339881/> from <GET https://www.yell.com//biz/maths-tuition-wallasey-901339881>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/home-maths-tutoring-liverpool-7467622/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/tutor-services-liverpool-liverpool-8134849/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/kumon-maths-and-english-wallasey-8945913/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/patrick-haslam-tutoring-liverpool-8777349/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/maths-tution-wallasey-8327361/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/advanced-maths-tutorials-liverpool-3755223/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/maths-tution-wallasey-7574939/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/activett-liverpool-901311152/> from <GET https://www.yell.com//biz/activett-liverpool-901311152>
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/kumon-maths-and-english-liverpool-8903743/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=4> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/tutor-services-liverpool-liverpool-8134849/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/liz-beattie-tutoring-prenton-7618961/> from <GET https://www.yell.com//biz/liz-beattie-tutoring-prenton-7618961>
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/askademia-liverpool-8680035> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/patrick-haslam-tutoring-liverpool-8777349/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/maths-tution-wallasey-8327361/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/north-west-tutors-liverpool-901511208/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/advanced-maths-tutorials-liverpool-3755223/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/triple-m-education-prenton-8934754/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/1-2-1-tutoring-liverpool-901224945> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/kumon-maths-and-english-liverpool-8903743/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/maths-tuition-wallasey-901339881/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/dmw-tuition-ltd-liverpool-6887458> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/askademia-liverpool-8680035>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/wallasey-tuition-wallasey-7390339> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/zenitheducators-liverpool-7791342> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/explore-learning-liverpool-901511688> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/north-west-tutors-liverpool-901511208/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/edes-educational-centre-liverpool-8380869> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/triple-m-education-prenton-8934754/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/love2teach-liverpool-liverpool-9678322> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/activett-liverpool-901311152/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/1-2-1-tutoring-liverpool-901224945>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/maths-tuition-wallasey-901339881/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/guaranteed-grades-liverpool-7368523> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/dmw-tuition-ltd-liverpool-6887458>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/wallasey-tuition-wallasey-7390339>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/liz-beattie-tutoring-prenton-7618961/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/zenitheducators-liverpool-7791342>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=4)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/explore-learning-liverpool-901511688>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/edes-educational-centre-liverpool-8380869>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/love2teach-liverpool-liverpool-9678322>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/activett-liverpool-901311152/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/guaranteed-grades-liverpool-7368523>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/liz-beattie-tutoring-prenton-7618961/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-17 15:50:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 21544,
'downloader/request_count': 45,
'downloader/request_method_count/GET': 45,
'downloader/response_bytes': 257613,
'downloader/response_count': 45,
'downloader/response_status_count/200': 29,
'downloader/response_status_count/301': 16,
'dupefilter/filtered': 51,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 6, 17, 13, 50, 57, 376036),
'item_scraped_count': 25,
'log_count/DEBUG': 71,
'log_count/INFO': 9,
'request_depth_max': 3,
'response_received_count': 29,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 44,
'scheduler/dequeued/memory': 44,
'scheduler/enqueued': 44,
'scheduler/enqueued/memory': 44,
'start_time': datetime.datetime(2019, 6, 17, 13, 50, 34, 114473)}
2019-06-17 15:50:57 [scrapy.core.engine] INFO: Spider closed (finished)
I am learning how to build a spider using scrapy to scrape this webpage: https://www.beesmart.city. To have access, it is necessary to do a form submission here: https://www.beesmart.city/login
Inspecting the page, I didn't find a CSRF Token so I guess it is not needed in this case.
I used the following code:
import scrapy
class BeesmartLogin(scrapy.Spider):
name = 'beesmart_login'
login_url = 'https://www.beesmart.city/login'
start_urls = [login_url]
def parse(self, response):
# extract the csrf token value
# create a python dictionary with the form values
data = {
'email': '<...#...>',
'password': '********',
}
# submit a POST request to it
yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_quotes)
def parse_quotes(self, response):
print("Parsing is done")
When I run it I get a 404 however. Does anyone have an idea why? I am very thankful for any help.
Tobi :)
The full response is:
(base) C:\Users\tobia\OneDrive\Documents\Python Scripts\Webscraping>scrapy runspider beesmart_login.py
2019-04-05 15:21:10 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: scrapybot)
2019-04-05 15:21:10 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17134-SP0
2019-04-05 15:21:10 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2019-04-05 15:21:10 [scrapy.extensions.telnet] INFO: Telnet Password: ef43803864a4422d
2019-04-05 15:21:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-04-05 15:21:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 15:21:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 15:21:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 15:21:11 [scrapy.core.engine] INFO: Spider opened
2019-04-05 15:21:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-05 15:21:11 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 15:21:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.beesmart.city/login> (referer: None)
2019-04-05 15:21:11 [scrapy.core.engine] DEBUG: Crawled (404) <POST https://www.beesmart.city/login> (referer: https://www.beesmart.city/login)
2019-04-05 15:21:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.beesmart.city/login>: HTTP status code is not handled or not allowed
2019-04-05 15:21:11 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-05 15:21:11 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 657,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 10500,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 5, 13, 21, 11, 770670),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/404': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 9,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 4, 5, 13, 21, 11, 50573)}
2019-04-05 15:21:11 [scrapy.core.engine] INFO: Spider closed (finished)
If you are looking to understand CSFR Tokens - they have been explained on the following SO question
As furas explained it's not possible to use pure scrapy to login over javascript.
I recommend that you read the following article about Web scraping javascript using python
I am trying to take the links for this website (https://minerals.usgs.gov/science/mineral-deposit-database/#products) and scrape the title from each one. However it doen't work! The spider does not seem to follow the links!
CODE
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import datetime
import socket
from depositsusa.items import DepositsusaItem
from scrapy.loader import ItemLoader
class DepositsSpider(CrawlSpider):
name = 'deposits'
allowed_domains = ['web']
start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-
database/#products', ]
rules = (
Rule(LinkExtractor(restrict_xpaths='//*[#id="products"][1]/p/a'),
callback='parse'),
)
def parse(self, response):
i = ItemLoader(item=DepositsusaItem(), response=response)
i.add_xpath('name', '//*[#class="container"][1]/header/h1/text()')
i.add_value('url', response.url)
i.add_value('project', self.settings.get('BOT_NAME'))
i.add_value('spider', self.name)
i.add_value('server', socket.gethostname())
i.add_value('date', datetime.datetime.now())
return i.load_item()
items
import scrapy
from scrapy.item import Item, Field
class DepositsusaItem(Item):
# main fields
name = Field()
# Housekeeping Fields
url = Field()
project = Field()
spider = Field()
server = Field()
date = Field()
pass
OUTPUT
(base) C:\Users\User\Documents\Python WebCrawling Learing
Projects\depositsusa>scrapy crawl deposits
2018-11-17 00:29:48 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot:
depositsusa)
2018-11-17 00:29:48 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2
2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.7.0, Python
3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)],
pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform
Windows-10-10.0.17134-SP0
2018-11-17 00:29:48 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME':
'depositsusa', 'NEWSPIDER_MODULE': 'depositsusa.spiders', 'ROBOTSTXT_OBEY':
True, 'SPIDER_MODULES': ['depositsusa.spiders']}
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled downloader
middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-11-17 00:29:48 [scrapy.core.engine] INFO: Spider opened
2018-11-17 00:29:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0
pages/min), scraped 0 items (at 0 items/min)
2018-11-17 00:29:48 [scrapy.extensions.telnet] DEBUG: Telnet console
listening on 127.0.0.1:6024
2018-11-17 00:29:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://minerals.usgs.gov/robots.txt> (referer: None)
2018-11-17 00:29:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://minerals.usgs.gov/science/mineral-deposit-database/#products>
(referer: None)
2018-11-17 00:29:49 [scrapy.core.scraper] DEBUG: Scraped from <200
https://minerals.usgs.gov/science/mineral-deposit-database/>
{'date': [datetime.datetime(2018, 11, 17, 0, 29, 49, 832526)],
'project': ['depositsusa'],
'server': ['DESKTOP-9CUE746'],
'spider': ['deposits'],
'url': ['https://minerals.usgs.gov/science/mineral-deposit-database/']}
2018-11-17 00:29:49 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-17 00:29:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 475,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 25123,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 16, 23, 29, 49, 848053),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 11, 16, 23, 29, 48, 520273)}
2018-11-17 00:29:49 [scrapy.core.engine] INFO: Spider closed (finished)
I am quite new at python, so what seems to be the problem? Is it something to do with linkextraction or the parse function?
You have to change a couple of things.
First, when you use a CrawlSpider, you can't have a callback named parse as you would override the CrawlSpider's parse: https://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
Secondly, you want to have the correct list of allowed_domains.
Try something like this:
class DepositsSpider(CrawlSpider):
name = 'deposits'
allowed_domains = ['doi.org']
start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]
rules = (
Rule(LinkExtractor(restrict_xpaths='//*[#id="products"][1]/p/a'),
callback='parse_x'),
)
def parse_x(self, response):
i = ItemLoader(item=DepositsusaItem(), response=response)
i.add_xpath('name', '//*[#class="container"][1]/header/h1/text()')
i.add_value('url', response.url)
i.add_value('project', self.settings.get('BOT_NAME'))
i.add_value('spider', self.name)
i.add_value('server', socket.gethostname())
i.add_value('date', datetime.datetime.now())
return i.load_item()