Stop Scrapy from converting "—" to "%E2%80%94" in the URL - python-3.x

One of the websites I am working on frequently uses "—" in its URLs, and Scrapy percent-encodes it (to %E2%80%94) before processing the request. I am trying to convert it back to "—" by adding a few lines in a custom downloader middleware; the middleware does print the corrected URL, but Scrapy converts it back again, which eventually results in a 404.
'DOWNLOADER_MIDDLEWARES': {
    'something.middlewares.MyDownloaderMiddleware': 540,
},
middlewares.py
from urllib.parse import unquote
from html import escape, unescape

class MyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        new_url = unescape(unquote(request.url))
        print(new_url)
        request = request.replace(url=new_url)
        return None
I tried returning the request instead of None in the middleware, but that doesn't seem to work either.
Solution:
I have placed a couple of .replace() calls in the spider code to swap the encoded form back to "—"; it's working now, though my first approach, doing it via middleware, would have been the cleaner one.
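As background on why the middleware change kept being undone (my explanation, not part of the original answer): Scrapy normalizes every request URL with w3lib's safe_url_string, which percent-encodes non-ASCII characters, so anything a middleware decodes is re-encoded as soon as a Request is rebuilt from it. A quick illustration with a made-up URL:

# Minimal illustration: Scrapy passes request URLs through w3lib's
# safe_url_string, which percent-encodes non-ASCII characters such as "—".
from w3lib.url import safe_url_string

print(safe_url_string("https://example.com/articles/foo—bar"))
# -> https://example.com/articles/foo%E2%80%94bar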

Related

First Scrapy crawler works, subsequent crawlers in sequence fail

I have a script setup like below:
try:
from Xinhua import Xinhua
except:
error_message("Xinhua")
try:
from China_Daily import China_Daily
except:
error_message("China Daily")
try:
from Global_Times import Global_Times
except:
error_message("Global Times")
try:
from Peoples_Daily import Peoples_Daily
except:
error_message("People's Daily")
The purpose is to run a Scrapy crawler for each site, process the results, and upload those results to a database. When I run each script individually, each portion works fine. When I run them from the block of code I've outlined, however, only the first Scrapy crawler actually works properly. All of the subsequent ones attempt to access the sites they are supposed to but don't return any results. I don't even get proper error messages back, just some "DEBUG... 200 None" and "[scrapy.crawler] INFO: Overridden settings: {}" lines. I also don't think the issue is my IP or anything being blocked; as soon as the crawlers fail, I immediately launch them individually and they work great.
My guess is that the first crawler is leaving some settings behind that are interfering with the subsequent ones, but I haven't been able to find anything. I can rearrange the order of their execution and it is always the first in line that works while the others fail.
Any thoughts?
I fixed the issue by combining the crawlers into one script and running them sequentially with CrawlerRunner and the Twisted reactor (most likely the original setup failed because the Twisted reactor cannot be restarted within the same process once it has stopped).
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner

# xh_crawl_results etc. are output file paths defined elsewhere in the script
spider_settings = [
    {"FEEDS": {xh_crawl_results: {'format': 'json', 'overwrite': True}}},
    {"FEEDS": {cd_crawl_results: {'format': 'json', 'overwrite': True}}},
    {"FEEDS": {gt_crawl_results: {'format': 'json', 'overwrite': True}}},
    {"FEEDS": {pd_crawl_results: {'format': 'json', 'overwrite': True}}},
]

process_xh = CrawlerRunner(spider_settings[0])
process_cd = CrawlerRunner(spider_settings[1])
process_gt = CrawlerRunner(spider_settings[2])
process_pd = CrawlerRunner(spider_settings[3])

@defer.inlineCallbacks
def crawl():
    yield process_xh.crawl(XinhuaSpider)
    yield process_cd.crawl(ChinaDailySpider)
    yield process_gt.crawl(GlobalTimesSpider)
    yield process_pd.crawl(PeoplesDailySpider)
    reactor.stop()

print("Scraping started.")
crawl()
reactor.run()
print("Scraping completed.")

Request using trio asks returns a different response than with requests and aiohttp

Right, hello, so I'm trying to integrate Opticard (loyalty card services) into my webapp using trio and asks (https://asks.readthedocs.io/).
So I'd like to send a request to their inquiry api:
Here goes using requests:
import requests
r = requests.post("https://merchant.opticard.com/public/do_inquiry.asp", data={'company_id':'Dt', 'FirstCardNum1':'foo', 'g-recaptcha-response':'foo','submit1':'Submit'})
This returns "Invalid ReCaptcha", which is normal and what I want.
Same thing using aiohttp:
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.post(url, data={'company_id': 'Dt', 'FirstCardNum1': 'foo',
                                       'g-recaptcha-response': 'foo', 'submit1': 'Submit'}) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'https://merchant.opticard.com/public/do_inquiry.asp')
        print(html)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
Now this also returns "Invalid ReCaptcha", so that's all nice and good.
However now, using trio/asks:
import asks
import trio

async def example():
    r = await asks.post('https://merchant.opticard.com/public/do_inquiry.asp',
                        data={'company_id': 'Dt', 'FirstCardNum1': 'foo',
                              'g-recaptcha-response': 'foo', 'submit1': 'Submit'})
    print(r.text)

trio.run(example)
This returns a completely different response: 'Your session has expired to protect your account. Please login again.' The same error/message can be reproduced normally by requesting an invalid URL such as 'https://merchant.opticard.com/do_inquiry.asp' instead of 'https://merchant.opticard.com/public/do_inquiry.asp'.
I have no idea where this error is coming from. I tried setting headers, cookies, and encoding; nothing seems to make it work. I tried replicating the issue, but the only way I managed to reproduce the result with aiohttp and requests was by setting an incorrect URL like 'https://merchant.opticard.com/do_inquiry.asp' instead of 'https://merchant.opticard.com/public/do_inquiry.asp'.
This must be an issue with asks, maybe due to encoding or formatting, but I've been using asks for over a year and have never had a simple POST request with data return a different result than it does everywhere else. I'm baffled; if it were a formatting error on asks' part, how could this be the first time something like this has happened in over a year of use?
This is a bug in how asks handles redirection when a non-absolute Location header is received.
The server returns a 302 redirect with Location: inquiry.asp?..., while asks expects it to be a full URL. You may want to file a bug report against asks.
How did I find this? A good way to go is to use a proxy (e.g. mitmproxy) to inspect the traffic. However, asks doesn't support proxies, so I turned to Wireshark instead and used a program to extract the TLS keys so Wireshark could decrypt the traffic.
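For reference, a conforming client resolves a relative Location header against the URL of the request that triggered the redirect, which is what requests and aiohttp evidently do here. A small sketch of mine with a hypothetical Location value:

# A relative Location header is resolved against the original request URL;
# asks apparently skips this step in this case.
from urllib.parse import urljoin

request_url = "https://merchant.opticard.com/public/do_inquiry.asp"
location = "inquiry.asp?session=expired"  # hypothetical relative Location value

print(urljoin(request_url, location))
# -> https://merchant.opticard.com/public/inquiry.asp?session=expired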

How to find the current start_url in Scrapy CrawlSpider?

When running Scrapy from my own script that loads URLs from a database and follows all internal links on those websites, I run into a problem. I need to know which start_url is currently being used, because I have to keep things consistent with a database (SQL DB). The problem: when Scrapy takes the built-in start_urls list as the links to crawl, and those websites redirect immediately, I can later only determine the URL currently being visited, not the start_url Scrapy started out from.
Other answers from the web are wrong, cover other use cases, or are deprecated, as there seems to have been a change in Scrapy's code last year.
MWE:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

class CustomerSpider(CrawlSpider):
    name = "my_crawler"
    rules = [Rule(LinkExtractor(unique=True), callback="parse_obj"), ]

    def parse_obj(self, response):
        print(response.url)  # find current start_url and do something

a = CustomerSpider
# I want to re-identify upb.de in the crawling process in process.crawl(a),
# but it is redirected immediately. I have to hand over the start_urls this
# way, as I use the class CustomerSpider in another class.
a.start_urls = ["https://upb.de", "https://spiegel.de"]
a.allowed_domains = ["upb.de", "spiegel.de"]

process = CrawlerProcess()
process.crawl(a)
process.start()
Here, I provide an MWE where Scrapy (my crawler) receives a list of URLs, as it has to in my setup. An example redirecting URL is https://upb.de, which redirects to https://uni-paderborn.de.
I am searching for an elegant way of handling this, as I want to make use of Scrapy's numerous features such as parallel crawling. I therefore do not want to additionally use something like the requests library. I want to find the start_url that Scrapy is currently using internally (in the Scrapy library).
I appreciate your help.
Ideally, you would set a meta property on the original request and reference it later in the callback. Unfortunately, CrawlSpider doesn't support passing meta through a Rule (see #929).
You're best off building your own spider instead of subclassing CrawlSpider. Start by passing your start_urls in as a parameter to process.crawl, which makes them available as a property on the instance. Within the start_requests method, yield a new Request for each URL, including the database key as a meta value.
When parse receives the response from loading your URL, run a LinkExtractor on it and yield a Request for each extracted link to scrape it individually. Here you can again pass meta, propagating your original database key down the chain.
The code looks like this:
from scrapy.spiders import Spider
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

class CustomerSpider(Spider):
    name = 'my_crawler'

    def start_requests(self):
        for url in self.root_urls:
            yield Request(url, meta={'root_url': url})

    def parse(self, response):
        links = LinkExtractor(unique=True).extract_links(response)
        for link in links:
            yield Request(
                link.url, callback=self.process_link, meta=response.meta)

    def process_link(self, response):
        print({
            'root_url': response.meta['root_url'],
            'resolved_url': response.url
        })

a = CustomerSpider
a.allowed_domains = ['upb.de', 'spiegel.de']

process = CrawlerProcess()
process.crawl(a, root_urls=['https://upb.de', 'https://spiegel.de'])
process.start()

# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/video/'}
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/netzwelt/netzpolitik/'}
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/thema/buchrezensionen/'}
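A possible variant of the spider above, which is my addition rather than part of the original answer and assumes Scrapy 1.7 or newer: cb_kwargs can carry the root URL instead of meta, keeping it out of the request metadata that other components also touch.

# Hedged sketch using cb_kwargs (Scrapy 1.7+) instead of meta to carry the
# original start URL down the chain; names mirror the spider above.
from scrapy import Request
from scrapy.spiders import Spider
from scrapy.linkextractors import LinkExtractor

class CustomerSpider(Spider):
    name = 'my_crawler'

    def start_requests(self):
        for url in self.root_urls:
            yield Request(url, cb_kwargs={'root_url': url})

    def parse(self, response, root_url):
        for link in LinkExtractor(unique=True).extract_links(response):
            yield Request(link.url, callback=self.process_link,
                          cb_kwargs={'root_url': root_url})

    def process_link(self, response, root_url):
        print({'root_url': root_url, 'resolved_url': response.url})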

How to get resource path in flask-RESTPlus?

I am fairly new to working with Flask and flask-RESTPlus. I have the following, and it is not clear to me how I can determine which path was used in the GET request:
ns = api.namespace('sample', description='get stuff')

@ns.route(
    '/resource-settings/<string:address>',
    '/unit-settings/<string:address>',
    '/resource-proposals/<string:address>',
    '/unit-proposals/<string:address>')
@ns.param('address', 'The address to decode')
class Decode(Resource):
    @ns.doc(id='Get the decoded result of a block address')
    def get(self, address):
        # How do I know what get path was called?
        pass
A better solution would be to use the request context. To get the full path, you can do:
from flask import request

def get(self, address):
    # The request context exposes the path that was called
    print(request.full_path)
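A few other request-context attributes can also tell the registered routes apart. This is a small sketch of mine reusing the Decode resource from the question, not part of the original answer:

# Hedged sketch: request attributes that reveal which route matched.
from flask import request
from flask_restplus import Resource

class Decode(Resource):
    def get(self, address):
        print(request.path)           # concrete path, e.g. /sample/unit-settings/abc
        print(request.full_path)      # path plus query string
        print(request.url_rule.rule)  # the rule that matched, e.g. /sample/unit-settings/<string:address>
        return {'matched_rule': request.url_rule.rule}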
Through lots of digging I found url_for in the flask package.
Still feels a bit wonky, but I can create a fully qualified link with:
result = api.base_url + url_for('resource-settings', address=id)
So this works and I get the desired results.

Flask - url_for automatically escapes '=' to '%3D'

So... I'm having some issues with Flask's url_for. The code still works, but when users navigate to a link that was generated by url_for, the link looks bad in the address bar.
Namely, I have a decorated view function as follows:
#app.route("/")
#app.route("/page=<int:number")
def index(number=0):
return "Index Page: {}".format(number)
This all works fine except when I try to generate a url for that route. Calling:
url_for("index", number=10)
Yields: domain.tld:80/page%3D10
Is there any way to circumvent this issue? I'd like for '=' to be returned instead of '%3D' when it's built into the route itself.
I only noticed it was doing this when I was testing it in an assert and discovered that the routes were ending up different from what I expected them to be.
At the moment, I have my test case work around the issue by using urllib.parse.unquote to fix the URL for testing purposes. I could probably just do that for all URLs, since I won't have any user input to worry about causing problems... but the encoding is there for a reason, so... :P
One option you have is to not build the parameter in to the route itself, but use query parameters instead:
from flask import Flask, request, url_for

app = Flask(__name__)

@app.route("/")
def index():
    page = request.args.get('page', 0, type=int)
    print(url_for("index", page=10))  # yields /?page=10
    return "Index Page: {}".format(page)

app.run(debug=True)
By making use of query parameters for the route, you avoid the issue of Flask encoding the = sign in the route definition.
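Another option, which is my suggestion rather than part of the original answer: keep the value in the path but separate it with a slash, so there is no = for url_for to percent-encode.

# Hedged alternative: a plain path segment avoids the '=' character entirely.
from flask import Flask, url_for

app = Flask(__name__)

@app.route("/page/<int:number>")
def index(number=0):
    return "Index Page: {}".format(number)

with app.test_request_context():
    print(url_for("index", number=10))  # -> /page/10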
