I'm currently having some issues trying to adapt my Scrapy program. What I'm trying to do is make a different parser run depending on the "site" I'm on.
Currently I have this start request:
def start_requests(self):
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        yield Request(self.search_url.format(keyword))
I want to find a way to call different parsers for extracting the data from the page, depending on which keyword I get from the txt file.
Is there a way to accomplish this?
What about matching the callback depending on the keyword you get inside start_requests? Something like:
def start_requests(self):
    keyword_callback = {
        'keyword1': self.parse_keyword1,
        'keyword2': self.parse_keyword2,
    }
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        keyword = keyword.strip()  # readlines() keeps the trailing newline
        yield Request(self.search_url.format(keyword), callback=keyword_callback[keyword])
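If some keywords in the file have no dedicated parser, a small variation of the same idea (just a sketch, assuming the spider's default parse method is an acceptable fallback) is to use dict.get() with a fallback callback:

def start_requests(self):
    keyword_callback = {
        'keyword1': self.parse_keyword1,
        'keyword2': self.parse_keyword2,
    }
    with open('productosABuscar.txt') as txtfile:
        keywords = [line.strip() for line in txtfile]
    for keyword in keywords:
        # unknown keywords fall back to the spider's default parse()
        yield Request(self.search_url.format(keyword),
                      callback=keyword_callback.get(keyword, self.parse))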
I have over a dozen spiders in a Scrapy project, with a variety of items being extracted from different sources. Among other elements, I mostly have to copy the same regex code over and over again in each spider, for example:
item['element'] = re.findall('my_regex', response.text)
I use this regex to get the same element, which is defined in my Scrapy items. Is there a way to avoid copying it? Where do I put this in the project so that I don't have to copy it into each spider and only add the ones that are different?
My project structure is the default one.
Any help is appreciated, thanks in advance.
So if I understand your question correctly, you want to use the same regular expression across multiple spiders.
You can do this:
Create a Python module called something like regex_to_use.
Inside that module, place your regular expression.
Example:
# regex_to_use.py
regex_one = 'test'
You can then access this expression in your spiders.
# spider.py
import regex_to_use
import re as regex
find_string = regex.search(regex_to_use.regex_one, ' this is a test')
print(find_string)
# output
<re.Match object; span=(11, 15), match='test'>
You could also do something like this in your regex_to_use module
# regex_to_use.py
import re as regex

class CustomRegularExpressions(object):
    def __init__(self, text):
        """
        :param text: string containing the variable to search for
        """
        self._text = text

    def search_text(self):
        find_xyx = regex.search('test', self._text)
        return find_xyx
and you would call it this way in your spiders:
# spider.py
from regex_to_use import CustomRegularExpressions
find_word = CustomRegularExpressions('this is a test').search_text()
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>
If you have multiple regular expressions you could do something like this:
# regex_to_use.py
import re as regex

class CustomRegularExpressions(object):
    def __init__(self, text):
        """
        :param text: string containing the variable to search for
        """
        self._text = text

    def search_text(self, regex_to_use):
        regular_expressions = {"regex_one": 'test', "regex_two": 'test_2'}
        expression = ''.join([v for k, v in regular_expressions.items() if k == regex_to_use])
        find_xyx = regex.search(expression, self._text)
        return find_xyx
# spider.py
from regex_to_use import CustomRegularExpressions
find_word = CustomRegularExpressions('this is a test').search_text('regex_one')
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>
You can also use a staticmethod in the class CustomRegularExpressions
# regex_to_use.py
import re as regex

class CustomRegularExpressions:
    @staticmethod
    def search_text(regex_to_use, text_to_search):
        regular_expressions = {"regex_one": 'test', "regex_two": 'test_2'}
        expression = ''.join([v for k, v in regular_expressions.items() if k == regex_to_use])
        find_xyx = regex.search(expression, text_to_search)
        return find_xyx
# spider.py
from regex_to_use import CustomRegularExpressions
# find_word would be replaced with item['element']
# this is a test would be replaced with response.text
find_word = CustomRegularExpressions.search_text('regex_one', 'this is a test')
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>
If you add a docstring to the function search_text(), you can also document which regular expressions are available in the Python dictionary.
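For example, a docstring on search_text() (a sketch of the same method shown above; only the documentation is new) could list the available keys:

def search_text(self, regex_to_use):
    """
    Search self._text with one of the named regular expressions.

    Available keys:
        regex_one -- matches 'test'
        regex_two -- matches 'test_2'

    :param regex_to_use: one of the keys listed above
    :return: an re.Match object, or None if nothing matched
    """
    regular_expressions = {"regex_one": 'test', "regex_two": 'test_2'}
    expression = ''.join([v for k, v in regular_expressions.items() if k == regex_to_use])
    return regex.search(expression, self._text)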
Showing how all this works...
This is a Python project that I wrote and published. Take a look at the folder utilities. In that folder I have functions that I can use throughout my code without having to copy and paste the same code over and over.
There is a lot of common data that is useful across multiple spiders, like regexes or even XPath expressions.
It's a good idea to isolate them.
You can use something like this:
/project
    /site_data
        handle_responses.py
        ...
    /spiders
        your_spider.py
        ...
Isolate functionalities with a common purpose.
# handle_responses.py
# imports ...
from re import search

def get_specific_common_data(text: str):
    # it's probably a good idea to handle predictable errors here (try/except)
    return search('your_regex', text)
And just use that functionality wherever it is needed.
# your_spider.py
# imports ...
import scrapy

from site_data.handle_responses import get_specific_common_data

class YourSpider(scrapy.Spider):
    # ... previous code

    def your_method(self, response):
        # ... previous code
        item['element'] = get_specific_common_data(response.text)
Try to keep it simple and do what you need to solve your problem.
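The same approach works for shared XPath expressions. A minimal sketch (the constant name and the XPath itself are assumptions, not taken from the original project) could look like:

# site_data/handle_responses.py
# hypothetical shared XPath for a field used by several spiders
PRODUCT_TITLE_XPATH = '//h1[@class="product-title"]/text()'

# your_spider.py
from site_data.handle_responses import PRODUCT_TITLE_XPATH

def your_method(self, response):
    # ... previous code
    item['title'] = response.xpath(PRODUCT_TITLE_XPATH).get()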
I can copy the regex into multiple spiders instead of importing an object from other .py files; I understand those have their use case, but here I don't want to add anything to any of the spiders and still want the element in the results.
There are some good answers to this, but they don't really solve the problem, so after searching for days I have come to this solution. I hope it's useful for others looking for a similar answer.
# middlewares.py
import re
from yourproject.items import YourItem

# find the process_spider_output function and add your element
def process_spider_output(self, response, result, spider):
    item = YourItem()
    item['element'] = re.findall('my_regex', response.text)
Now uncomment the middleware in settings.py:
# settings.py
SPIDER_MIDDLEWARES = {
    'yourproject.middlewares.YoursprojectMiddleware': 543,
}
For each spider you will get the element in the result data. I am still searching for a better solution and will update the answer, because this approach slows the spider down.
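For reference, here is a minimal sketch of the complete middleware class the snippet above would live in (the class and item names are assumptions based on the default project template, not the original code):

# middlewares.py
import re

from yourproject.items import YourItem

class YoursprojectMiddleware:
    def process_spider_output(self, response, result, spider):
        # pass through everything the spider itself yielded
        for element in result:
            yield element
        # add one extra item per response carrying the common field
        item = YourItem()
        item['element'] = re.findall('my_regex', response.text)
        yield item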
I am in the middle of learning Scrapy right now and am building a simple scraper for a real estate site. With this code I am trying to scrape all of the URLs for the real estate listings of a specific city. I have run into the following error with my code: "Cannot mix str and non-str arguments".
I believe I have isolated my problem to the following part of my code:
props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()
If I use the extract_first() function instead of the extract() function in the props XPath assignment, the code kind of works: it grabs the first link for the property on each page. However, this ultimately is not what I want. I believe I have the XPath call correct, as the code runs if I use the extract_first() method.
Can someone explain what I am doing wrong here? I have listed my full code below.
import scrapy
from scrapy.http import Request

class AdvancedSpider(scrapy.Spider):
    name = 'advanced'
    allowed_domains = ['www.realtor.com']
    start_urls = ['http://www.realtor.com/realestateandhomes-search/Houston_TX/']

    def parse(self, response):
        props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()
        for prop in props:
            absolute_url = response.urljoin(props)
            yield Request(absolute_url, callback=self.parse_props)
        next_page_url = response.xpath('//a[@class = "next"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url)

    def parse_props(self, response):
        pass
Please let me know if I can clarify anything.
You are passing the props list of strings to response.urljoin(), but you meant prop instead:
for prop in props:
    absolute_url = response.urljoin(prop)
Alecxe is right; it was a simple oversight in the spelling of the iterator in your loop. You can use the following notation:
for prop in response.xpath('//div[@class = "address ellipsis"]/a/@href').extract():
    yield scrapy.Request(response.urljoin(prop), callback=self.parse_props)
It's cleaner, and you're not instantiating "absolute_url" on every loop iteration. On a larger scale, that would help you save some memory.
I have searched some of the questions regarding this topic, but I couldn't find a solution to my problem.
I'm currently trying to use multiple parsers on a site, depending on the product I want to search for. After trying some methods, I ended up with this:
With this start request:
def start_requests(self):
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        yield Request(self.search_url.format(keyword))
That goes into my normal parse_item.
What I want to do, with this parse_item (by checking the item category, like laptop, tablet, etc.), is:
def parse_item(self, response):
    # I get the item's category for the if/else
    category = re.sub('Back to search results for |"', '', response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
    # Get the product link, for example (https://www.amazon.com/Lenovo-T430s-Performance-Professional-Refurbished/dp/B07L4FR92R/ref=sr_1_7?s=pc&ie=UTF8&qid=1545829464&sr=1-7&keywords=laptop)
    urlProducto = response.request.url
    # This can be done in a nicer way, just trying out if it works atm
    if category == 'Laptop':
        yield response.follow(urlProducto, callback = parse_laptop)
With:
def parse_laptop(self, response):
    # Parse things
Any suggestions? The error I get when running this code is: 'parse_laptop' is not defined. I have already tried putting parse_laptop above parse_item, and I still get the same error.
You need to refer to a method and not a function, so just change it like this:
yield response.follow(urlProducto, callback = self.parse_laptop)
yield response.follow(urlProducto, callback = parse_laptop)
This is the request, and here is your function: def parse_laptop(self, response). You have probably noticed that your parse_laptop function requires the self object.
So please modify your request to:
yield response.follow(urlProducto, callback = self.parse_laptop)
This should do the work.
Thanks.
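To make the distinction concrete, here is a rough sketch of how both callbacks would sit as methods on the same spider class (the class name and the method bodies are placeholders, not the original code):

import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'

    def parse_item(self, response):
        category = 'Laptop'  # placeholder for the category extraction shown above
        urlProducto = response.request.url
        if category == 'Laptop':
            # self.parse_laptop refers to the method on this spider instance
            yield response.follow(urlProducto, callback=self.parse_laptop)

    def parse_laptop(self, response):
        # parse laptop-specific fields here
        pass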
I would like to crawl/check multiple websites (on the same domain) for a specific keyword. I have found this script, but I can't find out how to add the specific keyword to be searched for. What the script needs to do is find the keyword and report in which link it was found. Could anyone point me to where I could read more about this?
I have been reading Scrapy's documentation, but I can't seem to find this.
Thank you.
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url = URL % i, callback=self.parse)

    def parse(self, response):
        **parsing data from the webpage**
You'll need to use some parser or regex to find the text you are looking for inside the response body.
Every Scrapy callback method receives the response body inside the response object, which you can check with response.body (for example inside the parse method). Then you'll have to use a regex or, better, XPath or CSS selectors to go to the path of your text, knowing the XML structure of the page you crawled.
Scrapy lets you use the response object as a Selector, so you can go to the title of the page with response.xpath('//head/title/text()') for example.
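As a rough sketch (the keyword and the yielded fields are assumptions, not part of the original script), the parse callback could check the body for the keyword and yield the URL where it is found:

def parse(self, response):
    keyword = 'my keyword'  # hypothetical search term
    # quick membership check against the decoded body text
    if keyword in response.text:
        yield {
            'url': response.url,
            'title': response.xpath('//head/title/text()').get(),
        }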
Hope it helped.
Python newbie here. I'm trying to search a video API which, for some reason, won't allow me to search video titles containing certain characters such as : or |.
Currently, I have a function which calls the video API library and searches by title, which looks like this:
def videoNameExists(vidName):
    vidName = vidName.encode("utf-8")
    bugFixVidName = vidName.replace(":", "")
    search_url = 'http://cdn-api.ooyala.com/v2/syndications/49882e719/feed?pcode=1xeGMxOt7GBjZPp2'.format(bugFixVidName)  # this URL is altered to protect privacy for this post
Is there an alternative to .replace() (or a way to use it that I'm missing) that would let me search for more than one sub-string at the same time?
Take a look at the Python re module, specifically at the method re.sub().
Here's an example for your case:
import re

def videoNameExists(vidName):
    vidName = vidName.encode("utf-8")
    # bugFixVidName = vidName.replace(":", "")
    bugFixVidName = re.sub(r'[:|]', "", vidName)
    search_url = 'http://cdn-api.ooyala.com/v2/syndications/49882e719/feed?pcode=1xeGMxOt7GBjZPp2'.format(bugFixVidName)  # this URL is altered to protect privacy for this post
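For example, a quick check of that character class on an assumed sample title (not a real title from the API) shows both characters removed in one pass:

import re

print(re.sub(r'[:|]', "", "Show Title: Part 1 | Episode 2"))
# output
# Show Title Part 1  Episode 2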