scrapy: restricting link extraction to the request domain - python-3.x

I have a scrapy project which uses a list of URLs from different domains as the seeds, but for any given page, I only want to follow links in the same domain as that page's URL (so the usual LinkExtractor(accept='example.com') approach wouldn't work). I'm surprised I couldn't find a solution on the web, as I'd expect this to be a common task. The best I could come up with was defining this in the spider file and referring to it in the Rules:
class CustomLinkExtractor(LinkExtractor):
    def get_domain(self, url):
        # https://stackoverflow.com/questions/9626535/get-protocol-host-name-from-url
        return '.'.join(tldextract.extract(url)[1:])

    def extract_links(self, response):
        domain = self.get_domain(response.url)
        # https://stackoverflow.com/questions/40701227/using-scrapy-linkextractor-to-locate-specific-domain-extensions
        return list(
            filter(
                lambda link: self.get_domain(link.url) == domain,
                super(CustomLinkExtractor, self).extract_links(response)
            )
        )
But that doesn't work (the spider goes off-domain).
Now I'm trying to use the process_request option in the Rule:
rules = (
    Rule(LinkExtractor(deny_domains='twitter.com'),
         callback='parse_response',
         process_request='check_r_r_domains',
         follow=True,
         ),
)
and
def check_r_r_domains(request, response):
    domain0 = '.'.join(tldextract.extract(request.url)[1:])
    domain1 = '.'.join(tldextract.extract(response.url)[1:])
    log('TEST:', domain0, domain1)
    if (domain0 == domain1) and (domain0 != 'twitter.com'):
        return request
    log(domain0, ' != ', domain1)
    return None
but I get an exception because it's passing self to the method (the spider has no url attribute); when I add self to the method signature, I get an exception that the response positional argument is missing. If I instead pass process_request=self.check_r_r_domains, I get an error because self isn't defined at the point where I set the rules!

If you are using Scrapy 1.7.0 or later, you can pass Rule a process_request callable that receives both the request and the response; check the URLs of both and drop the request (return None) if the domains do not match.
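For example, a minimal sketch of that approach (the spider and method names here are illustrative, not from the original project; it assumes tldextract is installed):

import tldextract
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

def get_domain(url):
    # e.g. 'https://sub.example.co.uk/page' -> 'example.co.uk'
    return tldextract.extract(url).registered_domain

class SameDomainSpider(CrawlSpider):
    name = 'same_domain'
    start_urls = ['https://example.com/', 'https://example.org/']  # illustrative seeds

    rules = (
        Rule(LinkExtractor(),
             callback='parse_response',
             process_request='filter_offsite',  # resolved to the bound method by name
             follow=True),
    )

    # Scrapy 1.7+ calls this with both the new request and the response it was extracted from
    def filter_offsite(self, request, response):
        if get_domain(request.url) == get_domain(response.url):
            return request  # same registered domain as the page the link was found on
        return None         # off-domain link: drop it

    def parse_response(self, response):
        pass  # extraction logic goes here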

Oops, it turns out that conda on the server I'm using had installed a 1.6 version of scrapy. I've forced it to install 1.8.0 from conda-forge and I think it's working now.

TurboGears2: TypeError: redirect() got an unexpected keyword argument 'pagename'

This error occurs when following the TG2 Wiki20 tutorial.
I could not see an easy answer here related to TG2, and this issue could be quite confusing, as it occurs during the official tutorial.
The problem arises when using the tg.redirect method, as described in the tutorial:
@expose('wiki20.templates.page')
def _default(self, pagename="FrontPage"):
    from sqlalchemy.exc import InvalidRequestError
    try:
        page = DBSession.query(Page).filter_by(pagename=pagename).one()
    except InvalidRequestError:
        raise redirect("notfound", pagename=pagename)
    content = publish_parts(page.data, writer_name="html")["html_body"]
    root = url('/')
    content = wikiwords.sub(r'<a href="%s\1">\1</a>' % root, content)
    return dict(content=content, wikipage=page)
Trying to use the method above, copied exactly from the tutorial, returns the error.
The issue is clearly with the keyword arguments accepted by the redirect method.
I was able to solve this problem by checking the docs for the redirect method itself:
tg.controllers.util.redirect(base_url='/', params=None,
    redirect_with=<class 'tg.exceptions.HTTPFound'>, scheme=None)
You can see that it accepts a params parameter (and NOT **kwargs; perhaps **kwargs was accepted in an earlier version). Unfortunately, the params argument is not type-hinted, but I guessed it should be a dict, and was correct.
The working method, as far as I can tell, should look like this:
@expose('wiki20.templates.page')
def _default(self, pagename="FrontPage"):
    from sqlalchemy.exc import InvalidRequestError
    try:
        page = DBSession.query(Page).filter_by(pagename=pagename).one()
    except InvalidRequestError:
        raise redirect("notfound", params={'pagename': pagename})
    content = publish_parts(page.data, writer_name="html")["html_body"]
    root = url('/')
    content = wikiwords.sub(r'<a href="%s\1">\1</a>' % root, content)
    return dict(content=content, wikipage=page)
EDIT:
There is a "latest" version of the tutorial, which DOES include the change above, exactly as it is here (but it wasn't the default version for me when I first accessed the tutorial for some reason).
I will leave this here in case it helps any other googlers who didn't check their version numbers!

Using multiple parsers while scraping a page

I have searched some of the questions regarding this topic, but I couldn't find a solution to my problem.
I'm currently trying to use multiple parsers on a site, depending on the product I want to search for. After trying some methods, I ended up with this:
With this start request:
def start_requests(self):
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        yield Request(self.search_url.format(keyword))
That gets into my normal parse_item.
What I want to do in this parse_item is to check the item's category (laptop, tablet, etc.) and hand the page off to the right parser:
def parse_item(self, response):
    # I get the item's category for the if/else
    category = re.sub('Back to search results for |"', '', response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
    # Get the product link, for example (https://www.amazon.com/Lenovo-T430s-Performance-Professional-Refurbished/dp/B07L4FR92R/ref=sr_1_7?s=pc&ie=UTF8&qid=1545829464&sr=1-7&keywords=laptop)
    urlProducto = response.request.url
    # This can be done in a nicer way, just trying out if it works atm
    if category == 'Laptop':
        yield response.follow(urlProducto, callback = parse_laptop)
With:
def parse_laptop(self, response):
    # Parse things
Any suggestions? The error I get when running this code is 'parse_laptop' is not defined. I have already tried putting parse_laptop above parse_item, and I still get the same error.
You need to refer to a method and not a function, so just change it like this:
yield response.follow(urlProducto, callback = self.parse_laptop)
yield response.follow(urlProducto, callback = parse_laptop)
This is the request, and here is your function, def parse_laptop(self, response). You have probably noticed that your parse_laptop function requires the self object,
so please modify your request to:
yield response.follow(urlProducto, callback = self.parse_laptop)
This should do the work.
Thanks.
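More generally, if there will be several categories, something like the following sketch keeps the dispatch in one place (the category names, parser methods, and item fields here are illustrative, not part of the original code; the category cleanup from the question is omitted):

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    def parse_item(self, response):
        # Category text, extracted roughly as in the question
        category = response.xpath(
            'normalize-space(//span[contains(@class, "a-list-item")]//a/text())'
        ).extract_first()
        # Map each category to its own parse method; fall back to a generic one
        parsers = {
            'Laptop': self.parse_laptop,
            'Tablet': self.parse_tablet,
        }
        parse = parsers.get(category, self.parse_generic)
        # The page is already downloaded, so delegate to the chosen parser
        # directly instead of re-requesting the same URL
        yield from parse(response)

    def parse_laptop(self, response):
        # laptop-specific extraction would go here
        yield {'category': 'Laptop', 'url': response.url}

    def parse_tablet(self, response):
        # tablet-specific extraction would go here
        yield {'category': 'Tablet', 'url': response.url}

    def parse_generic(self, response):
        # fallback extraction
        yield {'url': response.url}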

python3, Trying to get an output from my function I defined, need some guidance

I found a pretty cool ASN API tool that allows me to supply an AS number, and it will go out and pull down the subnets that relate to that ASN.
Here is rough but partial code. I am defining a function, asnfinder, that takes an ASNNUMBER parameter (to which I will supply the number through another file).
When I call url here, it just gives me an n...
What I'm trying to do here is append my str(ASNNUMBER) to the end of the ?q= parameter in the URL.
Once I do that, I'd like to display my results and output them to a file.
import requests

def asnfinder(ASNNUMBER):
    print('n\n######## Running ASNFinder ########\n')
    url = 'https://api.hackertarget.com/aslookup?q=' + str(ASNNUMBER)
    response = requests.get(url)
The results I'd like to get are the output of the GET request I'm performing; currently all I see is:
## Running ASNFinder
n
Try writing something like this:
import requests

def asnfinder(ASNNUMBER):
    print('\n\n######## Running ASNFinder ########\n')
    url = 'https://api.hackertarget.com/aslookup?q=' + str(ASNNUMBER)
    response = requests.get(url)
    data = response.text
    print(data)
    with open('filename', 'w') as f:  # open in write mode, not 'r'
        f.write(data)
It should work fine.
P.S. If it helped ya, please make sure you mark this as the answer :)
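If it's useful, here is a hedged sketch of how the function might also return the data and be called from another file (the AS number and output filename below are placeholders, not from the original post):

import requests

def asnfinder(ASNNUMBER):
    url = 'https://api.hackertarget.com/aslookup?q=' + str(ASNNUMBER)
    response = requests.get(url)
    data = response.text
    with open('asn_output.txt', 'w') as f:   # placeholder filename
        f.write(data)
    return data                              # return it so callers can display or reuse it

if __name__ == '__main__':
    print(asnfinder(15169))                  # placeholder AS number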

Using scrapy to find specific text from multiple websites

I would like to crawl/check multiple websites (on the same domain) for a specific keyword. I have found this script, but I can't figure out how to add the specific keyword to be searched for. What the script needs to do is find the keyword and report which link it was found on. Could anyone point me to where I could read more about this?
I have been reading scrapy's documentation, but I can't seem to find this.
Thank you.
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        **parsing data from the webpage**
You'll need to use some parser or regex to find the text you are looking for inside the response body.
Every Scrapy callback method receives the response body inside the response object, which you can inspect with response.body (for example inside the parse method). Then you'll have to use a regex or, better, XPath or CSS selectors to reach your text, based on the structure of the page you crawled.
Scrapy lets you use the response object as a Selector, so you can go to the title of the page with response.xpath('//head/title/text()') for example.
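As a rough illustration (the keyword, spider name, and start URL here are placeholders, not from the original script), the parse callback might look something like this:

import scrapy

KEYWORD = 'my keyword'  # placeholder: the text you want to find

class KeywordSpider(scrapy.Spider):
    name = 'keyword_check'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/page/1000']  # placeholder

    def parse(self, response):
        # Join the visible text nodes of the page and search them for the keyword
        page_text = ' '.join(response.xpath('//body//text()').extract())
        if KEYWORD in page_text:
            # Report which link (URL) the keyword was found on
            yield {'url': response.url, 'keyword': KEYWORD}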
Hope it helped.

Using Jinja2 with CherryPy 3.2

I have Python 3.2 set up with Apache via mod_wsgi. I have CherryPy 3.2 serving a simple "Hello World" web page. I'd like to start templating using Jinja2 as I build out the site. I'm new to Python and therefore don't know much about Python, CherryPy, or Jinja.
Using the code below, I can load the site root (/) and the products page (/products) with their basic text. That at least lets me know I've got Python, mod_wsgi, and CherryPy set up somewhat properly.
Because the site will have many pages, I'd like to implement the Jinja template in a way that prevents me from having to declare and render the template in each page handler class. As far as I can tell, the best way to do that is by wrapping the PageHandler, similar to these examples:
http://docs.cherrypy.org/dev/concepts/dispatching.html#replacing-page-handlers
http://docs.cherrypy.org/stable/refman/_cptools.html#cherrypy._cptools.HandlerWrapperTool
I've implemented the code in the second example, but it doesn't change anything.
[more details after code]
wsgi_handler.py - A mash-up of a few tutorials and examples
import sys, os
abspath = os.path.dirname(__file__)
sys.path.append(abspath)
sys.path.append(abspath + '/libs')
sys.path.append(abspath + '/app')
sys.stdout = sys.stderr
import atexit
import threading
import cherrypy
from cherrypy._cptools import HandlerWrapperTool
from libs.jinja2 import Environment, PackageLoader
# Import from custom module
from core import Page, Products
cherrypy.config.update({'environment': 'embedded'})
env = Environment(loader=PackageLoader('app', 'templates'))
# This should wrap the PageHandler
def interpolator(next_handler, *args, **kwargs):
    template = env.get_template('base.html')
    response_dict = next_handler(*args, **kwargs)
    return template.render(**response_dict)
# Put the wrapper in place(?)
cherrypy.tools.jinja = HandlerWrapperTool(interpolator)
# Configure site routing
root = Page()
root.products = Products()
# Load the application
application = cherrypy.Application(root, '', abspath + '/app/config')
/app/config
[/]
request.dispatch: cherrypy.dispatch.MethodDispatcher()
core module classes
class Page:
    exposed = True

    def GET(self):
        return "got Page"

    def POST(self, name, password):
        return "created"

class Products:
    exposed = True

    def GET(self):
        return "got Products"

    def POST(self, name, password):
        return "created"
Based on what I read on a Google Group I figured I might need to "turn on" the Jinja tool, so I updated my config to this:
/app/config
[/]
tools.jinja.on = True
request.dispatch: cherrypy.dispatch.MethodDispatcher()
After updating the config, the site root and products pages display a CherryPy-generated "500 Internal Server Error" page. No detailed error messages are found in the logs (at least not in the logs I'm aware of).
Unless it came pre-installed, I know I probably need the Jinja Tool that's out there, but I don't know where to put it or how to enable it. How do I do that?
Am I going about this the right way, or is there some better way?
Edit (21-May-2012):
Here is the Jinja2 template I'm working with:
<!DOCTYPE html>
<html>
  <head>
    <title>{{ title }}</title>
  </head>
  <body>
    <h1>Hello World</h1>
  </body>
</html>
I figured it out.
In the interpolator function, the next_handler call invokes the original PageHandler (Page.GET or Products.GET in this case). Those original PageHandlers return strings, while the interpolator function treats the response like a Python dict (dictionary), hence the double asterisk when it is passed to template.render as **response_dict.
The Jinja template has a placeholder for title, so it needs a title to be defined. Passing a plain string to the render function doesn't define what title should be. We need to pass an actual dict to the render function (or nothing at all, but what good is that?).
Note: For either of these fixes, the jinja tool does need to be enabled, as shown in the question by setting tools.jinja.on to True in the config.
Quick Fix
Define the title as the render function is called. To do this I need to change this line:
return template.render(**response_dict) # passing the string as dict - bad
to this:
return template.render(title=response_dict) # pass as string and assign to title
Like this, the template renders with my PageHandler text as the page title.
Better Fix
Because the template will grow to be more complex, one render function probably won't always be able to correctly assign the necessary placeholders. It's probably a good idea to let the original page handler return an actual dict with the template's many placeholders assigned.
Leave the interpolator function as it was:
def interpolator(next_handler, *args, **kwargs):
    template = env.get_template('base.html')
    response_dict = next_handler(*args, **kwargs)
    return template.render(**response_dict)
Update Page and Products to return actual dicts that define the values for the template's placeholders:
class Page:
    exposed = True

    def GET(self):
        dict = {'title': "got Page"}
        return dict

    def POST(self, name, password):
        # This should be updated too
        # I just haven't used it yet
        return "created"

class Products:
    exposed = True

    def GET(self):
        dict = {'title': "got Products"}
        return dict

    def POST(self, name, password):
        # This should be updated too
        # I just haven't used it yet
        return "created"
Like this, the template renders with my PageHandler title text as the page title.
There's also an updated Jinja2 tool found on a repository of various recipes the community has contributed to.
