How to find the current start_url in Scrapy CrawlSpider? - python-3.x

When running Scrapy from my own script that loads URLs from a database and follows all internal links on those websites, I hit a snag. I need to know which start_url is currently being used, because I have to keep things consistent with a SQL database. The problem: Scrapy takes the built-in list 'start_urls' as the list of links to follow, and when one of those websites has an immediate redirect, things break. For example, once Scrapy starts crawling the start_urls and follows all internal links found there, I can later only determine the currently visited URL, not the start_url Scrapy started out from.
Other answers from the web are wrong, cover other use cases, or are deprecated, as there seems to have been a change in Scrapy's code last year.
MWE:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

class CustomerSpider(CrawlSpider):
    name = "my_crawler"
    rules = [Rule(LinkExtractor(unique=True), callback="parse_obj")]

    def parse_obj(self, response):
        print(response.url)  # find current start_url and do something

a = CustomerSpider
# I want to re-identify upb.de in the crawling process in process.crawl(a),
# but it is redirected immediately. I have to hand over the start_urls this
# way, as I use the class CustomerSpider in another class.
a.start_urls = ["https://upb.de", "https://spiegel.de"]
a.allowed_domains = ["upb.de", "spiegel.de"]

process = CrawlerProcess()
process.crawl(a)
process.start()
This MWE shows how Scrapy (my crawler) receives a list of URLs the way I have to pass them. An example redirecting URL is https://upb.de, which redirects to https://uni-paderborn.de.
I am searching for an elegant way of handling this, as I want to keep using Scrapy's numerous features such as parallel crawling. I therefore do not want to pull in something like the requests library on top. I want to find the start_url that Scrapy is currently using internally (inside the Scrapy library).
I appreciate your help.

Ideally, you would set a meta property on the original request, and reference it later in the callback. Unfortunately, CrawlSpider doesn't support passing meta through a Rule (see #929).
You're better off building your own spider instead of subclassing CrawlSpider. Start by passing your start_urls to process.crawl as a parameter, which makes them available as a property on the instance. Within the start_requests method, yield a new Request for each URL, including the database key as a meta value.
When parse receives the response from loading your URL, run a LinkExtractor on it and yield a Request for each extracted link to scrape it individually. Here you can again pass meta, propagating your original database key down the chain.
The code looks like this:
from scrapy.spiders import Spider
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

class CustomerSpider(Spider):
    name = 'my_crawler'

    def start_requests(self):
        for url in self.root_urls:
            yield Request(url, meta={'root_url': url})

    def parse(self, response):
        links = LinkExtractor(unique=True).extract_links(response)
        for link in links:
            yield Request(
                link.url, callback=self.process_link, meta=response.meta)

    def process_link(self, response):
        print({
            'root_url': response.meta['root_url'],
            'resolved_url': response.url
        })

a = CustomerSpider
a.allowed_domains = ['upb.de', 'spiegel.de']

process = CrawlerProcess()
process.crawl(a, root_urls=['https://upb.de', 'https://spiegel.de'])
process.start()
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/video/'}
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/netzwelt/netzpolitik/'}
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/thema/buchrezensionen/'}
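One caveat about redirects (an addition beyond the answer above): when a start URL such as https://upb.de gets redirected, Scrapy's RedirectMiddleware records the URL(s) the request passed through in the redirect_urls meta key, so the root response can also report both where it started and where it landed. A minimal sketch extending the parse method above:

def parse(self, response):
    # response.meta still carries 'root_url' from start_requests; if a redirect
    # happened, RedirectMiddleware also added 'redirect_urls' listing the URL(s)
    # the request went through (the original start URL first).
    print({
        'root_url': response.meta['root_url'],
        'landed_on': response.url,
        'redirected_from': response.meta.get('redirect_urls', []),
    })
    links = LinkExtractor(unique=True).extract_links(response)
    for link in links:
        yield Request(link.url, callback=self.process_link, meta=response.meta)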

Related

Unable to fetch web elements in pytest-bdd behave in python

So, I'm in the middle of creating a test-automation framework using pytest-bdd and behave in Python 3.10.
I have some code, but I'm not able to fetch the web element from the portal. The error doesn't say anything about it. Here is the error I'm getting in the console.
Let me share the code here too for better understanding.
demo.feature
Feature: Simple first feature
  #test1
  Scenario: Very first test scenario
    Given Launch Chrome browser
    When open my website
    Then verify that the logo is present
    And close the browser
test_demo.py
from behave import *
from selenium import webdriver
from selenium.webdriver.common.by import By
# from pytest_bdd import scenario, given, when, then
import time

import variables
import xpaths
from pages.functions import *
import chromedriver_autoinstaller

# @scenario('../features/demo.feature', 'Very first test scenario')
# def test_eventLog():
#     pass

@given('Launch Chrome browser')
def given_launchBrowser(context):
    launchWebDriver(context)
    print("DEBUG >> Browser launches successfully.")

@when('Open my website')
def when_openSite(context):
    context.driver.get(variables.link)
    # context.driver.get(variables.nitsanon)
    print("DEBUG >> Portal opened successfully.")

@then('verify that the logo is present')
def then_verifyLogo(context):
    time.sleep(5)
    status = context.driver.find_element(By.XPATH, xpaths.logo_xpath)
    # status = findElement(context, xpaths.logo_xpath)
    print('\n\n\n\n', status, '\n\n\n\n')
    assert status is True, 'No logo present'
    print("DEBUG >> Logo validated successfully.")

@then('close the browser')
def then_closeBrowser(context):
    context.driver.close()
variables.py
link = 'https://nitin-kr.onrender.com/'
xpaths.py
logo_xpath = "//*[@id='logo']/div"
requirements.txt
behave~=1.2.6
selenium~=4.4.3
pytest~=7.1.3
pytest-bdd~=6.0.1
Let me know if you need any more information. I'm very eager to create an automation testing framework without using any OOP concepts.
The thing is just that I'm not able to fetch the web elements, and I can't use Selenium methods like find_element(By.XPATH, XPATH).send_keys(VALUE).
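Two things stand out in the logo step above: find_element returns a WebElement rather than a boolean, so assert status is True fails even when the logo is found, and an explicit wait is usually more reliable than time.sleep(5). A hedged sketch of that step, assuming the behave-style context object and the xpaths module from the question:

from behave import then
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import xpaths  # the xpaths.py from the question

@then('verify that the logo is present')
def then_verifyLogo(context):
    # wait up to 10 seconds for the logo instead of sleeping a fixed 5 seconds
    logo = WebDriverWait(context.driver, 10).until(
        EC.visibility_of_element_located((By.XPATH, xpaths.logo_xpath))
    )
    # is_displayed() returns a boolean; comparing a WebElement to True always fails
    assert logo.is_displayed(), 'No logo present'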

Stop Scrapy from converting "—" in the URL to its encoded form

One of the websites that I am working on frequently uses "—" in URLs, and Scrapy converts it into an encoded form before processing; I am trying to convert it back to "—" by adding a few lines to the default downloader middleware. It does print everything fine, but Scrapy converts it back again, which eventually results in a 404.
'DOWNLOADER_MIDDLEWARES': {
    'something.middlewares.MyDownloaderMiddleware': 540,
}
middlewares.py
from urllib.parse import unquote
from html import escape, unescape

class MyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        new_url = unescape(unquote(request.url))
        print(new_url)
        request = request.replace(url=new_url)
        return None
I tried returning the request instead of None from the middleware, but that doesn't seem to work either.
Solution:
I placed a couple of .replace() calls in the spider code to put the literal "—" back, and it is working now, though my first approach was to do it via the middleware, which would have been a cleaner approach.
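For what it's worth, the re-conversion most likely happens inside Request itself: every scrapy.Request runs its URL through w3lib's safe_url_string when it is constructed, so a literal em dash gets percent-encoded again the moment request.replace(url=new_url) builds the replacement request. A small illustration (an assumption about the cause, not part of the original post):

from scrapy import Request

# The em dash in the path comes back percent-encoded (%E2%80%94), because
# Request normalizes its URL with safe_url_string on construction.
r = Request("https://example.com/foo—bar")
print(r.url)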

With Flask-Restplus how to create a POST api without making the URL same with GET

I'm new to Flask and Flask-RestPlus. I'm creating a web API where I want to keep my POST URLs different from the GET URLs that are visible in Swagger. For example, in Flask-RestPlus:
@api.route('/my_api/<int:id>')
class SavingsModeAction(Resource):
    @api.expect(MyApiModel)
    def post(self):
        pass  # my code goes here

    def get(self, id):
        pass  # my code goes here
So in Swagger, the URLs for both APIs would look like:
GET: /my_api/{id}
POST: /my_api/{id}
But I have absolutely no use for the {id} part in my POST api, and it may confuse a user about whether the call updates an existing record or creates a new one; the purpose of the api is just to create.
It would be better to use query params,
like GET: /my_api?id=
Your code above would then look like:
from flask import request

@api.route('/my_api')
class SavingsModeAction(Resource):
    @api.expect(MyApiModel)
    def post(self):
        ...

    def get(self):
        _id = request.args.get("_id", type=int)
        ...
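If the {id} should stay on GET and only disappear from POST, another option (a sketch using the same flask-restplus setup, with a hypothetical stand-in for MyApiModel) is to split the two methods across two Resource classes registered on different routes:

from flask import Flask
from flask_restplus import Api, Resource, fields

app = Flask(__name__)
api = Api(app)

# hypothetical stand-in for the question's MyApiModel
MyApiModel = api.model('MyApiModel', {'name': fields.String})

@api.route('/my_api')
class SavingsModeCollection(Resource):
    @api.expect(MyApiModel)
    def post(self):
        ...  # create a new record; Swagger shows POST /my_api with no {id}

@api.route('/my_api/<int:id>')
class SavingsModeItem(Resource):
    def get(self, id):
        ...  # fetch one record; Swagger shows GET /my_api/{id}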

How to get resource path in flask-RESTPlus?

I am fairly new to working with Flask and flask-RESTPlus. I have the following, and it is not clear to me how I can determine which path was used in the GET request:
ns = api.namespace('sample', description='get stuff')

@ns.route(
    '/resource-settings/<string:address>',
    '/unit-settings/<string:address>',
    '/resource-proposals/<string:address>',
    '/unit-proposals/<string:address>')
@ns.param('address', 'The address to decode')
class Decode(Resource):
    @ns.doc(id='Get the decoded result of a block address')
    def get(self, address):
        # How do I know what get path was called?
        pass
A better solution would be to use the request context. To get the full path, you can do:
from flask import request

def get(self, address):
    # How do I know what get path was called?
    print(request.full_path)
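If the goal is to know which of the four registered routes matched, rather than the concrete URL, the matched rule is also available on the request context. A sketch, assuming the Decode resource from the question:

from flask import request

def get(self, address):
    # request.path is the concrete path that was requested;
    # request.url_rule is the route pattern that matched it,
    # e.g. '/sample/resource-settings/<string:address>'
    print(request.path)
    print(str(request.url_rule))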
Through lots of digging I found url_for in the flask package.
It still feels a bit wonky, but I can create a fully qualified link with:
result = api.base_url + url_for('resource-settings', address=id)
So this works and I get the desired results.

Scrapy start_urls request returns 503, how to catch it?

Currently I have written a very simple spider, as follows:
class QASpider(CrawlSpider):
    name = "my-spider"
    handle_httpstatus_list = [400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,426,428,429,431,451,500,501,502,503,504,505,506,507,508,510,511]
    allowed_domains = ["local-02"]
    start_urls = preview_starting_urls
    rules = [Rule(LinkExtractor(), callback='parse_url', follow=True)]

    def parse_url(self, response):
        # Some operations
preview_starting_urls holds the URLs I intend to start crawling from, and the spider works just fine as long as the starting URLs return response code 200. But when any of the starting URLs returns a 503, the parse_url method is not called.
I figured this happens because Scrapy does not call my own callbacks when the request to a start_url fails, so I tried defining the default callback method:
def parse(self, response):
    self.parse_url(response)
But this resulted in my spider crawling only the start_urls (and sending some other Scrapy requests, like for robots.txt and similar) and nothing else.
The point is: when I do not define the default parse callback, I do not get to process any of the start_urls that return a response code other than 200. If I define parse as written above, the spider does not crawl all the URLs it would crawl without parse defined.
How do I force Scrapy to call my callback even for start_urls that return a response other than 200?
Edit: I am also open to suggestions on how to fill handle_httpstatus_list with values elegantly.
Catching errors in Scrapy is pretty simple. Just create a function to be called when an error occurs and pass it as the errback when building a Request. Since you want to do this for the starting URLs, you will have to define start_requests yourself so that you control the yielded Requests:
# replaces start_urls; error_function() is called when an error occurs
# (needs "import scrapy" at the top of the spider module for scrapy.Request)
def start_requests(self):
    urls = preview_starting_urls
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse_url, errback=self.error_function)

def error_function(self, failure):
    self.logger.error(repr(failure))
    # write your error parse code
errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter.
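As for the edit about filling handle_httpstatus_list elegantly: the list can be generated instead of typed out, or the HttpError middleware can be told to let every response through. A couple of hedged options (standard Scrapy settings, not taken from the original answer):

class QASpider(CrawlSpider):
    # Option 1: generate the status codes instead of listing them by hand
    handle_httpstatus_list = list(range(400, 600))

    # Option 2: let all non-2xx responses reach callbacks via the
    # HTTPERROR_ALLOW_ALL setting (then handle_httpstatus_list is unnecessary)
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}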
