I have generated a new project and have a single Python file containing my spider.
The layout is:
import scrapy
from scrapy.http import *
import json
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
import unicodedata
from scrapy import signals
from pydispatch import dispatcher
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field
class TrainerItem(Item):
    name = Field()
    brand = Field()
    link = Field()
    type = Field()
    price = Field()
    previous_price = Field()
    stock_img = Field()
    alt_img = Field()
    size = Field()

class SchuhSpider(scrapy.Spider):
    name = "SchuhSpider"
    payload = {"hash": "g=3|Mens,&c2=340|Mens Trainers&page=1&imp=1&o=new&",
               "url": "/imperfects/", "type": "pageLoad", "NonSecureUrl": "http://www.schuh.co.uk"}
    url = "http://schuhservice.schuh.co.uk/SearchService/GetResults"
    headers = {'Content-Type': 'application/json; charset=UTF-8'}
    finalLinks = []

    def start_requests(self):
        dispatcher.connect(self.quit, signals.spider_closed)
        yield scrapy.Request(url=self.url, callback=self.parse, method="POST", body=json.dumps(self.payload), headers=self.headers)

    def parse(self, response):
        ...  # do stuff

    def quit(self, spider):
        print(spider.name + " is about to die, here are your trainers..")

process = CrawlerProcess()
process.crawl(SchuhSpider)
process.start()
print("We Are Done.")
I run this spider using:
scrapy crawl SchuhSpider
The problem is I'm getting:
twisted.internet.error.ReactorNotRestartable
This is because the spider is actually running twice: once at the start (I'm getting all my POST requests), after which it says "SchuhSpider is about to die, here are your trainers..".
Then it opens the spider a second time, presumably when it runs the CrawlerProcess code at the bottom of the file.
My question is: How do I get the spider to stop automatically running when the script runs?
Even when I run:
scrapy list
It runs the entire spider (all my POST requests come through). I fear I'm missing something obvious but I can't see what.
You're mixing two ways of running a spider. One way is what you're doing now, i.e. using the
scrapy crawl SchuhSpider
command. When you run it this way, you don't need (and shouldn't) include the code
process = CrawlerProcess()
process.crawl(SchuhSpider)
process.start()
print("We Are Done.")
as it's intended only for when you want to run a spider from a script (see the documentation).
If you want to retain the possibility to run it either way, just wrap the above code like this
if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(SchuhSpider)
    process.start()
    print("We Are Done.")
so that it doesn't run when the module is merely imported, which is exactly what happens when you run it via scrapy crawl or scrapy list (both commands import the module to discover spiders, which is why the crawl was also starting on scrapy list).
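The guard works because Python sets `__name__` to `'__main__'` only when a file is executed directly; when Scrapy imports the module, `__name__` is the module's dotted path, so the guarded block is skipped. A minimal Scrapy-free illustration of the mechanism:

```python
def start_crawl():
    # stands in for process.crawl(...) / process.start() in the real script
    return "crawl started"

if __name__ == '__main__':
    # runs only under `python thisfile.py`, never when the module is imported
    print(start_crawl())
```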
I have two python crawlers who can run independently.
crawler1.py
crawler2.py
They are part of an analysis that I want to run, and I would like to import them both into a common script:
from crawler1 import *
from crawler2 import *
a bit lower in my script I have something like this
if <condition1>:
    # run crawler1
    runCrawler('crawlerName', '/dir1/dir2/')
if <condition2>:
    # run crawler2
    runCrawler('crawlerName', '/dir1/dir2/')
runCrawler is:
def runCrawler(crawlerName, crawlerFileName):
    print('Running crawler for ' + crawlerName)
    process = CP(
        settings={
            'FEED_URI': crawlerFileName,
            'FEED_FORMAT': 'csv'
        }
    )
    process.crawl(globals()[crawlerName])
    process.start()
I get the following error:
Exception has occurred: ReactorAlreadyInstalledError
reactor already installed
The first crawler runs OK; the second one fails with this error.
Any ideas?
I'm running the above through the Visual Studio debugger.
The best way to do this is to use a single CrawlerRunner with one reactor, instead of one CrawlerProcess per crawler. Your code should be:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
# your code
settings = {
    'FEED_FORMAT': 'csv'
}
process = CrawlerRunner(settings)

if condition1:
    process.crawl(spider1, crawlerFileName=crawlerFileName)
if condition2:
    process.crawl(spider2, crawlerFileName=crawlerFileName)

d = process.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # runs both crawlers; code after this line executes once both finish
Your spiders should look like this:
class spider1(scrapy.Spider):
    name = "spider1"
    # custom_settings is read at class level, before __init__ runs,
    # so it must be a literal here (it cannot use crawlerFileName)
    custom_settings = {'FEED_URI': 'spider1.csv'}

    def start_requests(self):
        yield scrapy.Request('https://scrapy.org/')

    def parse(self, response):
        pass
I have written a script in Python using Scrapy. The script is meant to fetch all the pages that exist. It works fine on the first page load when Scrapy starts and, per the script logic, gets us page 2. But after loading page 2 I am unable to get an XPath match on the newly loaded page, so I can't move ahead and collect the remaining page numbers.
Sharing the code snippet.
import scrapy
from scrapy import Spider

class PostsSpider(Spider):
    name = "posts"
    start_urls = [
        'https://www.boston.com/category/news/'
    ]

    def parse(self, response):
        print("first time")
        print(response)
        results = response.xpath("//*[contains(@id, 'load-more')]/@data-next-page").extract_first()
        print(results)
        if results is not None:
            for result in results:
                page_number = 'page/' + result
                new_url = self.start_urls[0] + page_number
                print(new_url)
                yield scrapy.Request(url=new_url, callback=self.parse)
        else:
            print("last page")
It is because the page doesn't issue new GET requests when it loads the next page; it makes an AJAX call to an API that returns JSON.
I made some adjustments to your code so it should work properly now. I am assuming that there is something other than the next page number you are trying to extract from each page, so I wrapped the HTML string in a scrapy Selector so you can use XPath and such on it. This script will crawl a lot of pages really fast, so you might want to adjust your settings to slow it down too.
import scrapy
from scrapy import Spider
from scrapy.selector import Selector

class PostsSpider(Spider):
    name = "posts"
    ajaxurl = "https://www.boston.com/wp-json/boston/v1/load-more?taxonomy=category&term_id=779&search_query=&author=&orderby=&page=%s&_wpnonce=f43ab1aae4&ad_count=4&redundant_ids=25129871,25130264,25129873,25129799,25128140,25126233,25122755,25121853,25124456,25129584,25128656,25123311,25128423,25128100,25127934,25127250,25126228,25126222"
    start_urls = [
        'https://www.boston.com/category/news/'
    ]

    def parse(self, response):
        new_url = None
        try:
            json_result = response.json()
            html = json_result['data']['html']
            selector = Selector(text=html, type="html")
            # ... do some XPath stuff with selector.xpath(...) ...
            new_url = self.ajaxurl % json_result["data"]["nextPage"]
        except Exception:
            # the first response is the HTML page, not JSON
            result = response.xpath("//*[contains(@id, 'load-more')]/@data-next-page").extract_first()
            if result is not None:
                new_url = self.ajaxurl % result
        if new_url:
            print(new_url)
            yield scrapy.Request(url=new_url, callback=self.parse)
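As for slowing the crawl down, that is a matter of standard throttling options in the project's settings.py; for example (the values here are illustrative, adjust to taste):

```python
# settings.py additions to slow the crawl down
DOWNLOAD_DELAY = 1.0                # wait ~1 second between requests to the same domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server latency
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # limit parallel requests per domain
```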
So I'm making a script to test my spiders, but I don't know how to capture the returned data in the script that runs the spider.
I have this return [self.p_name, self.price, self.currency] at the end of the spider.
In the spider tester I have this script:
#!/usr/bin/python3
# -*- coding: utf-8 -*-

# Import external libraries
import scrapy
from scrapy.crawler import CrawlerProcess
from Ruby.spiders.furla import Furla

# Import internal libraries

# Variables
t = CrawlerProcess()

def test_furla():
    x = t.crawl(Furla, url='https://www.furla.com/pt/pt/eshop/furla-sleek-BAHMW64BW000ZN98.html?dwvar_BAHMW64BW000ZN98_color=N98&cgid=SS20-Main-Collection')
    return x

test_furla()
t.start()
It runs properly; the only problem is that I don't know how to catch that return value on the tester side. The output from the spider is ['FURLA SLEEK', '250.00', '€'].
If you need to access items yielded from the spider, I would probably use signals for the job, specifically the item_scraped signal. Adapting your code, it would look something like this:
from scrapy import signals
# other imports and stuff

t = CrawlerProcess()

def item_scraped(item, response, spider):
    # do something with the item
    ...

def test_furla():
    # we need a Crawler instance to access the signals
    crawler = t.create_crawler(Furla)
    crawler.signals.connect(item_scraped, signal=signals.item_scraped)
    x = t.crawl(crawler, url='https://www.furla.com/pt/pt/eshop/furla-sleek-BAHMW64BW000ZN98.html?dwvar_BAHMW64BW000ZN98_color=N98&cgid=SS20-Main-Collection')
    return x

test_furla()
t.start()
Additional info can be found in the CrawlerProcess documentation. If, on the other hand, you need to work with the spider's whole output (all the items), you'll need to accumulate them using the above mechanism and process them once the crawl finishes.
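That accumulation can be as simple as appending to a list from the signal handler; once t.start() returns, the crawl is over and the list holds everything the spider yielded. A sketch of just that part (the wiring is exactly as above):

```python
items = []

def item_scraped(item, response, spider):
    # called once per item the spider yields; collect them for later use
    items.append(item)

# wire it up as before:
#   crawler = t.create_crawler(Furla)
#   crawler.signals.connect(item_scraped, signal=signals.item_scraped)
# after t.start() returns, `items` holds the whole output,
# e.g. [['FURLA SLEEK', '250.00', '€']]
```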
I am trying to run a script with these requirements:
After running the demo10.py script, AmazonfeedSpider will crawl the product information using the generated urls saved in Purl and save the output into the dataset2.json file.
After successfully crawling and saving data into the dataset2.json file, ProductfeedSpider will run and grab the 5 urls returned by the Final_Product() method of the CompareString class.
Finally, after grabbing the final product_url list from the Comparestring4 class, ProductfeedSpider will scrape data from the returned url list and save the result into the Fproduct.json file.
Here is the demo10.py file:
import scrapy
from scrapy.crawler import CrawlerProcess
from AmazonScrap.spiders.Amazonfeed2 import AmazonfeedSpider
from scrapy.utils.project import get_project_settings
from AmazonScrap.spiders.Productfeed import ProductfeedSpider
import time

# from multiprocessing import Process
# def CrawlAmazon():

def main():
    process1 = CrawlerProcess(settings=get_project_settings())
    process1.crawl(AmazonfeedSpider)
    process1.start()
    process1.join()

    # time.sleep(20)
    process2 = CrawlerProcess(settings=get_project_settings())
    process2.crawl(ProductfeedSpider)
    process2.start()
    process2.join()

if __name__ == "__main__":
    main()
After running the file, it raises an exception saying that the dataset.json file doesn't exist. Do I need to use multiprocessing to create a delay between the spiders? If so, how can I implement it?
I am looking forward to hearing from experts.
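No multiprocessing is needed: as in the CrawlerRunner answer earlier on this page, two spiders can share one reactor and run strictly one after the other, which also avoids the ReactorNotRestartable problem of starting two CrawlerProcess objects in one script. A sketch, assuming the same project imports as demo10.py (untested against that project):

```python
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from AmazonScrap.spiders.Amazonfeed2 import AmazonfeedSpider
from AmazonScrap.spiders.Productfeed import ProductfeedSpider

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # the second crawl is scheduled only after the first finishes,
    # so dataset2.json exists before ProductfeedSpider starts
    yield runner.crawl(AmazonfeedSpider)
    yield runner.crawl(ProductfeedSpider)
    reactor.stop()

crawl()
reactor.run()
```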
I created a crawl spider to download files. However, the spider downloaded only the urls of the files and not the files themselves. I uploaded a question here: Scrapy crawl spider does not download files? . While the basic yield spider kindly suggested in the answers works perfectly, the spider does not work when I attempt to download files with items or item loaders! The original question does not include the items.py, so here it is:
ITEMS
import scrapy
from scrapy.item import Item, Field

class DepositsusaItem(Item):
    # main fields
    name = Field()
    file_urls = Field()
    files = Field()
    # housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
EDIT: added original code
EDIT: further corrections
SPIDER
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import datetime
import socket
from us_deposits.items import DepositsusaItem
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from urllib.parse import urljoin

class DepositsSpider(CrawlSpider):
    name = 'deposits'
    allowed_domains = ['doi.org']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
             callback='parse_x'),
    )

    def parse_x(self, response):
        i = ItemLoader(item=DepositsusaItem(), response=response)
        i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
        i.add_xpath('file_urls', '//span[starts-with(@data-url, "/catalog/file/get/")]/@data-url',
                    MapCompose(lambda i: urljoin(response.url, i)))
        i.add_value('url', response.url)
        i.add_value('project', self.settings.get('BOT_NAME'))
        i.add_value('spider', self.name)
        i.add_value('server', socket.gethostname())
        i.add_value('date', datetime.datetime.now())
        return i.load_item()
SETTINGS
BOT_NAME = 'us_deposits'
SPIDER_MODULES = ['us_deposits.spiders']
NEWSPIDER_MODULE = 'us_deposits.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'us_deposits.pipelines.UsDepositsPipeline': 1,
    'us_deposits.pipelines.FilesPipeline': 2
}
FILES_STORE = 'C:/Users/User/Documents/Python WebCrawling Learning Projects'
PIPELINES
class UsDepositsPipeline(object):
    def process_item(self, item, spider):
        return item

class FilesPipeline(object):
    def process_item(self, item, spider):
        return item
It seems to me that using items and/or item loaders has nothing to do with your problem.
The only problems I see are in your settings file:
Scrapy's built-in FilesPipeline is never activated; the us_deposits.pipelines.FilesPipeline you do activate is your own pass-through class, which only returns the item without downloading anything
FILES_STORE should be a string, not a set (an exception is raised when you activate the files pipeline)
ROBOTSTXT_OBEY = True will prevent the files from being downloaded
If I correct all of those issues, the file download works as expected.
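Putting those three points together, a corrected settings fragment might look like this (note it activates Scrapy's built-in `scrapy.pipelines.files.FilesPipeline`, not the pass-through class of the same name in us_deposits.pipelines):

```python
ROBOTSTXT_OBEY = False  # robots.txt rules would otherwise block the file requests

ITEM_PIPELINES = {
    'us_deposits.pipelines.UsDepositsPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 2,   # Scrapy's real files pipeline
}
# a plain string, not a set
FILES_STORE = 'C:/Users/User/Documents/Python WebCrawling Learning Projects'
```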