Hi, I have the following spider:
import scrapy


class TREC_spider(scrapy.Spider):
    "use this spider to obtain the proper tagged questions from http://cogcomp.org/Data/QA/QC/"
    name = "TREC"
    start_urls = ["http://cogcomp.org/Data/QA/QC/train_5500.label"]

    def parse(self, response):
        for question in response.selector.xpath("/html/body/pre/text()"):
            yield question
I set ROBOTSTXT_OBEY to False, but I still get the following output on my prompt:
2018-12-25 14:02:06 [scrapy.core.engine] INFO: Spider opened
2018-12-25 14:02:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-25 14:02:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on <insert address here>
2018-12-25 14:02:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://cogcomp.org/Data/QA/QC/train_5500.label> (referer: None)
2018-12-25 14:02:07 [scrapy.core.engine] INFO: Closing spider (finished)
How can I get my spider to actually crawl the page?
You need to return items or dictionaries. Try changing yield question to:
yield {'question': question.get()}
Your XPath doesn't match because the response is actually a TextResponse: that URL does not return HTML, it returns text/plain.
You will likely want to yield response.body_as_unicode() (response.text in recent Scrapy versions), or to actually chop the response up into lines before yielding them as structured data.
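For example, a minimal sketch of the second option (the "label question" line format is an assumption based on the TREC label files, and the field names are made up):

import scrapy


class TRECSpider(scrapy.Spider):
    name = "TREC"
    start_urls = ["http://cogcomp.org/Data/QA/QC/train_5500.label"]

    def parse(self, response):
        # response is a TextResponse; response.text is the decoded body
        for line in response.text.splitlines():
            if not line.strip():
                continue
            # assumed format: "LABEL:subtype question text ?"
            label, _, question = line.partition(" ")
            yield {"label": label, "question": question}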
I want my running spider to be closed immediately, without processing any scheduled requests. I've tried the following approaches to no avail:
Raising CloseSpider in callback functions (roughly as in the sketch below)
Calling spider.crawler.engine.close_spider(spider, 'reason') in a downloader middleware
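For reference, the first approach looked roughly like this (a sketch, assuming a 429-based condition; not my exact callback):

from scrapy.exceptions import CloseSpider

def parse(self, response):
    if response.status == 429:
        # CloseSpider shuts the spider down gracefully, but requests that are
        # already scheduled or in flight may still be processed first
        raise CloseSpider("Too many requests")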
The script is automated: it runs several spiders in a loop. I want the running spider to be closed instantly when it meets a certain condition, and the program to continue with the rest of the spiders in the loop.
Is there a way to drop the requests from the scheduler queue?
I have included a snippet where I'm trying to terminate the spider:
class TooManyRequestsMiddleware:
    def process_response(self, request, response, spider):
        if response.status == 429:
            spider.crawler.engine.close_spider(
                spider,
                f"Too many requests! Response status code: {response.status}"
            )
        elif 'change_spotted' in list(spider.kwargs.keys()):
            print("Attempting to close down the spider")
            spider.crawler.engine.close_spider(spider, "Spider is terminated!")
        return response
I know there are discussions about whether scraping LinkedIn is allowed or not, but from the following article:
https://www.forbes.com/sites/emmawoollacott/2019/09/10/linkedin-data-scraping-ruled-legal/#787286c31b54
I think it is safe to say that scraping publicly available data from LinkedIn is legal.
Now, I am trying to scrape job searches for a specific job title in a specific region.
So far so good, everything works, except that the number of scraped jobs is limited to 25.
I am trying to use the following trick: inside the URL I pass a parameter &start=X, with X going from 0 to 25, 50, and so on.
In the browser, this allows me to go to the next page view and extract jobs from there.
However, with Scrapy this method doesn't work.
The code is as follows:
import requests
from scrapy.http import TextResponse

res = requests.get('https://www.linkedin.com/jobs/search/?keywords={}&location={}&start=25'.format(job, location))
response = TextResponse(res.url, body=res.text, encoding='utf-8')
print("processing:" + response.url)
Output:
processing:https://www.linkedin.com/jobs/search/?keywords=Data+Scientist&location=Brussels&start=0
Even if I hardcode it to 25 (page 2), it sets it to 0.
Any idea on how to solve this?
Just disable the RedirectMiddleware with the REDIRECT_ENABLED=0 setting in the scrapy shell:
scrapy shell -s REDIRECT_ENABLED=0 "https://www.linkedin.com/jobs/search/?keywords=Data+Scientist&location=Brussels&start=75"
2019-10-24 21:50:09 [scrapy.core.engine] DEBUG: Crawled (303) <GET https://www.linkedin.com/jobs/search/?keywords=Data+Scientist&location=Brussels&start=75> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0684AB30>
[s] item {}
[s] request <GET https://www.linkedin.com/jobs/search/?keywords=Data+Scientist&location=Brussels&start=75>
In [2]: fetch('https://www.linkedin.com/jobs/search/?keywords=Data+Scientist&location=Brussels&start=50')
2019-10-24 21:56:39 [scrapy.core.engine] DEBUG: Crawled (303) <GET https://www.linkedin.com/jobs/search/?keywords=Data+Scientist&location=Brussels&start=50> (referer: None)
In [3]: fetch('https://www.linkedin.com/jobs/search/?keywords=Data+Scientist&location=Brussels&start=100')
2019-10-24 21:56:49 [scrapy.core.engine] DEBUG: Crawled (303) <GET https://www.linkedin.com/jobs/search/?keywords=Data+Scientist&location=Brussels&start=100> (referer: None)
For example, if you want the redirect middleware to ignore 301 and 302 responses (and pass them through to your spider) you can do this:
class MySpider(CrawlSpider):
    handle_httpstatus_list = [301, 302]
This middleware handles redirection of requests based on response status.
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect
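If you want the same behaviour inside a spider rather than the shell, a minimal sketch (the spider name and callback body are placeholders, not from the question):

import scrapy


class LinkedinJobsSpider(scrapy.Spider):
    name = "linkedin_jobs"  # placeholder name
    handle_httpstatus_list = [303]  # pass 303 responses to the callback
    custom_settings = {"REDIRECT_ENABLED": False}  # same as -s REDIRECT_ENABLED=0
    start_urls = [
        "https://www.linkedin.com/jobs/search/?keywords=Data+Scientist&location=Brussels&start=50",
    ]

    def parse(self, response):
        self.logger.info("Got %s for %s", response.status, response.url)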
When running Scrapy from my own script that loads URLs from a database and follows all internal links on those websites, I run into a problem. I need to know which start_url is currently being used, as I have to maintain consistency with a database (SQL DB). But when Scrapy reads the built-in list called start_urls to get the links to crawl and those websites redirect immediately, a problem occurs: Scrapy crawls the start_urls and follows all internal links found there, but later I can only determine the currently visited URL, not the start_url where Scrapy started out.
Other answers from the web are wrong, cover other use cases, or are deprecated, as there seems to have been a change in Scrapy's code last year.
MWE:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess


class CustomerSpider(CrawlSpider):
    name = "my_crawler"
    rules = [Rule(LinkExtractor(unique=True), callback="parse_obj"), ]

    def parse_obj(self, response):
        print(response.url)  # find current start_url and do something


a = CustomerSpider
# I want to re-identify upb.de in the crawling process in process.crawl(a),
# but it is redirected immediately. I have to hand over the start_urls this
# way, as I use the class CustomerSpider in another class.
a.start_urls = ["https://upb.de", "https://spiegel.de"]
a.allowed_domains = ["upb.de", "spiegel.de"]

process = CrawlerProcess()
process.crawl(a)
process.start()
Here, I provide an MWE where Scrapy (my crawler) receives a list of URLs, as I have to do it. An example redirecting URL is https://upb.de, which redirects to https://uni-paderborn.de.
I am searching for an elegant way of handling this, as I want to make use of Scrapy's numerous features such as parallel crawling. Thus, I do not want to additionally use something like the requests library. I want to find the Scrapy start_url that is currently being used internally (in the Scrapy library).
I appreciate your help.
Ideally, you would set a meta property on the original request, and reference it later in the callback. Unfortunately, CrawlSpider doesn't support passing meta through a Rule (see #929).
You're best off building your own spider instead of subclassing CrawlSpider. Start by passing your start_urls in as a parameter to process.crawl, which makes them available as a property on the instance. Within the start_requests method, yield a new Request for each URL, including the database key as a meta value.
When parse receives the response from loading your url, run a LinkExtractor on it, and yield a request for each one to scrape it individually. Here, you can again pass meta, propagating your original database key down the chain.
The code looks like this:
from scrapy.spiders import Spider
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess


class CustomerSpider(Spider):
    name = 'my_crawler'

    def start_requests(self):
        for url in self.root_urls:
            yield Request(url, meta={'root_url': url})

    def parse(self, response):
        links = LinkExtractor(unique=True).extract_links(response)
        for link in links:
            yield Request(
                link.url, callback=self.process_link, meta=response.meta)

    def process_link(self, response):
        print({
            'root_url': response.meta['root_url'],
            'resolved_url': response.url,
        })


a = CustomerSpider
a.allowed_domains = ['upb.de', 'spiegel.de']

process = CrawlerProcess()
process.crawl(a, root_urls=['https://upb.de', 'https://spiegel.de'])
process.start()

# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/video/'}
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/netzwelt/netzpolitik/'}
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/thema/buchrezensionen/'}
Currently I have written a very simple spider, as follows:
class QASpider(CrawlSpider):
    name = "my-spider"
    handle_httpstatus_list = [400, 401, 402, 403, 404, 405, 406, 407, 408, 409,
                              410, 411, 412, 413, 414, 415, 416, 417, 418, 419,
                              420, 421, 422, 423, 424, 426, 428, 429, 431, 451,
                              500, 501, 502, 503, 504, 505, 506, 507, 508, 510, 511]
    allowed_domains = ["local-02"]
    start_urls = preview_starting_urls
    rules = [Rule(LinkExtractor(), callback='parse_url', follow=True)]

    def parse_url(self, response):
        # Some operations
preview_starting_urls contains the URLs I intend to start crawling from, and the spider works just fine as long as I get response code 200 from the starting URL. But when any of the starting URLs returns a 503, the parse_url method is not called.
I figured that this behavior occurs because Scrapy does not call my own callbacks if the request to a start_url fails, so I tried defining the default callback method:
def parse(self, response):
    return self.parse_url(response)
But this resulted in my spider crawling only the start_urls (and in sending some other Scrapy requests, like the one for robots.txt) and nothing else.
The point is that when I do not define the default parse callback, I do not get to process any of the start_urls if they return a response code different from 200. If I define parse as written above, the spider does not crawl all the URLs it would crawl without parse defined.
How do I force Scrapy to call my callback even for start_urls that return a response code different from 200?
Edit: I am also open to suggestions on how to fill handle_httpstatus_list with values elegantly.
Catching errors in Scrapy is pretty simple. Just create a function to be called when an error occurs, and pass it as the errback when creating a Request. As you want to do this for the starting URLs, you will have to define the start_requests method yourself so you can set the errback on the yielded Requests:
# replaces start_urls
# error_function() is called when an error occurs
def start_requests(self):
    urls = preview_starting_urls
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse_url, errback=self.error_function)

def error_function(self, failure):
    self.logger.error(repr(failure))
    # write your error parse code here
errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter.
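As for the edit about filling handle_httpstatus_list elegantly, a short sketch (assuming you really want every 4xx and 5xx status to reach your callback, which is broader than your original list):

from scrapy.spiders import CrawlSpider


class QASpider(CrawlSpider):
    name = "my-spider"
    # let every 4xx/5xx response through instead of listing codes by hand
    handle_httpstatus_list = list(range(400, 600))

There is also the handle_httpstatus_all request meta key if you prefer to allow all statuses on a per-request basis.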
I have noticed there are several ways to initiate HTTP connections for web scraping. I am not sure whether some are more recent and up-to-date ways of coding, or whether they are just different modules with different advantages and disadvantages. More specifically, I am trying to understand the differences between the following two approaches, and what would you recommend?
1) Using urllib3:
from urllib3 import PoolManager
from bs4 import BeautifulSoup

http = PoolManager()
r = http.urlopen('GET', url, preload_content=False)
soup = BeautifulSoup(r, "html.parser")
2) Using requests:
import requests
from bs4 import BeautifulSoup

html = requests.get(url).content
soup = BeautifulSoup(html, "html5lib")
What sets these two options apart, besides the simple fact that they require importing different modules?
Under the hood, requests uses urllib3 to do most of the http heavy lifting. When used properly, it should be mostly the same unless you need more advanced configuration.
Except, in your particular example they're not the same:
In the urllib3 example, you're re-using connections whereas in the requests example you're not re-using connections. Here's how you can tell:
>>> import requests
>>> requests.packages.urllib3.add_stderr_logger()
2016-04-29 11:43:42,086 DEBUG Added a stderr logging handler to logger: requests.packages.urllib3
>>> requests.get('https://www.google.com/')
2016-04-29 11:45:59,043 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:45:59,158 DEBUG "GET / HTTP/1.1" 200 None
>>> requests.get('https://www.google.com/')
2016-04-29 11:45:59,815 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:45:59,925 DEBUG "GET / HTTP/1.1" 200 None
To start re-using connections like in a urllib3 PoolManager, you need to make a requests session.
>>> session = requests.session()
>>> session.get('https://www.google.com/')
2016-04-29 11:46:49,649 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:46:49,771 DEBUG "GET / HTTP/1.1" 200 None
>>> session.get('https://www.google.com/')
2016-04-29 11:46:50,548 DEBUG "GET / HTTP/1.1" 200 None
Now it's equivalent to what you were doing with http = PoolManager(). One more note: urllib3 is a lower-level, more explicit library, so you explicitly create a pool, and you'll explicitly need to specify your SSL certificate location, for example. It's an extra line or two of work, but also a fair bit more control if that's what you're looking for.
All said and done, the comparison becomes:
1) Using urllib3:
import urllib3, certifi
from bs4 import BeautifulSoup

http = urllib3.PoolManager(ca_certs=certifi.where())
html = http.request('GET', url).data
soup = BeautifulSoup(html, "html5lib")
2) Using requests:
import requests
from bs4 import BeautifulSoup

session = requests.session()
html = session.get(url).content
soup = BeautifulSoup(html, "html5lib")