Downloading files with ItemLoaders() in Scrapy - python-3.x

I created a crawl spider to download files. However, the spider downloaded only the URLs of the files and not the files themselves. I uploaded a question here: Scrapy crawl spider does not download files?. While the basic yield spider kindly suggested in the answers works perfectly, when I attempt to download the files with items or item loaders the spider does not work. The original question does not include items.py, so here it is:
ITEMS
import scrapy
from scrapy.item import Item, Field


class DepositsusaItem(Item):
    # main fields
    name = Field()
    file_urls = Field()
    files = Field()
    # Housekeeping Fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
EDIT: added original code
EDIT: further corrections
SPIDER
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import datetime
import socket
from us_deposits.items import DepositsusaItem
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from urllib.parse import urljoin


class DepositsSpider(CrawlSpider):
    name = 'deposits'
    allowed_domains = ['doi.org']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
             callback='parse_x'),
    )

    def parse_x(self, response):
        i = ItemLoader(item=DepositsusaItem(), response=response)
        i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
        i.add_xpath('file_urls', '//span[starts-with(@data-url, "/catalog/file/get/")]/@data-url',
                    MapCompose(lambda i: urljoin(response.url, i)))
        i.add_value('url', response.url)
        i.add_value('project', self.settings.get('BOT_NAME'))
        i.add_value('spider', self.name)
        i.add_value('server', socket.gethostname())
        i.add_value('date', datetime.datetime.now())

        return i.load_item()
SETTINGS
BOT_NAME = 'us_deposits'
SPIDER_MODULES = ['us_deposits.spiders']
NEWSPIDER_MODULE = 'us_deposits.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'us_deposits.pipelines.UsDepositsPipeline': 1,
    'us_deposits.pipelines.FilesPipeline': 2,
}
FILES_STORE = 'C:/Users/User/Documents/Python WebCrawling Learning Projects'
PIPELINES
class UsDepositsPipeline(object):
    def process_item(self, item, spider):
        return item


class FilesPipeline(object):
    def process_item(self, item, spider):
        return item

It seems to me that using items and/or item loaders has nothing to do with your problem.
The only problems I see are in your settings file:
Scrapy's built-in FilesPipeline is not activated: the us_deposits.pipelines.FilesPipeline you register is your own empty class, not scrapy.pipelines.files.FilesPipeline, so nothing ever downloads the files
FILES_STORE should be a string, not a set (an exception is raised when you activate the files pipeline)
ROBOTSTXT_OBEY = True will prevent the downloading of files
If I correct all of those issues, the file download works as expected.
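For reference, a minimal sketch of settings that make the download work, using Scrapy's built-in files pipeline instead of the empty custom one (the FILES_STORE path is simply the one from the question):

ITEM_PIPELINES = {
    'us_deposits.pipelines.UsDepositsPipeline': 1,
    # the built-in pipeline that actually downloads whatever is in item['file_urls']
    'scrapy.pipelines.files.FilesPipeline': 2,
}
FILES_STORE = 'C:/Users/User/Documents/Python WebCrawling Learning Projects'  # a plain string, not a set
ROBOTSTXT_OBEY = False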

Related

Scrapy script not scrapping items

I'm not sure why my script isn't scraping any items. It is the same script that I'm using for another website, so maybe the CSS classes I'm using are wrong.
import scrapy
import os
from scrapy.crawler import CrawlerProcess
from datetime import datetime

date = datetime.now().strftime("%d_%m_%Y")


class stiendaSpider(scrapy.Spider):
    name = 'stienda'
    start_urls = ['https://stienda.uy/tv']

    def parse(self, response):
        for products in response.css('.grp778'):
            price = products.css('.precioSantander::text').get()
            name = products.css('#catalogoProductos .tit::text').get()
            if price and name:
                yield {'name': name.strip(),
                       'price': price.strip()}


os.chdir('C:\\Users\\cabre\\Desktop\\scraping\\stienda\\data\\raw')
process = CrawlerProcess(
    # settings={"FEEDS": {"items.csv": {"format": "csv"}}}
    settings={"FEEDS": {f"stienda_{date}.csv": {"format": "csv"}}}
)
process.crawl(stiendaSpider)
process.start()
I tried several things but I don't understand why it is not working.
I was able to get the name field, but the price element is rendered empty and filled in later by an AJAX call. That is why it's not being extracted.
import scrapy
import os
from scrapy.crawler import CrawlerProcess
from datetime import datetime

date = datetime.now().strftime("%d_%m_%Y")


class stiendaSpider(scrapy.Spider):
    name = 'stienda'
    start_urls = ['https://stienda.uy/tv']

    def parse(self, response):
        for products in response.xpath('//div[@data-disp="1"]'):
            name = products.css('.tit::text').get()
            if name:
                yield {'name': name.strip()}
You can see it if you look at the page source... all of the elements with the class 'precioSantander' are empty.
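If you want to check this quickly, a scrapy shell session along these lines should show it (a sketch; the selectors are the ones from the question):

# $ scrapy shell https://stienda.uy/tv
# the names are present in the downloaded HTML:
response.css('#catalogoProductos .tit::text').getall()
# the prices are filled in later by JavaScript, so this comes back empty:
response.css('.precioSantander::text').getall()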

How to use css selector in object from HtmlResponse

I'm currently developing an application using Scrapy.
I want to get some values using a CSS selector outside of def parse, so I created an HtmlResponse object first and tried to get the values using css(), but I can't get any value...
Within def parse, I can get the value in the same way.
What should I do if it is outside of def parse?
Here is the code:
import scrapy
from scrapy.http import HtmlResponse


class SampleSpider(scrapy.Spider):
    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = ['https://sample.com/search']

    my_response = HtmlResponse(url=start_urls[0])
    print('HtmlResponse')
    print(my_response)
    h3s = my_response.css('h3')
    print(str(len(h3s)))
    print('----------')

    def parse(self, response, **kwargs):
        print('def parse')
        print(response)
        h3s = response.css('h3')
        print(str(len(h3s)))
Console display:
HtmlResponse
<200 https://sample.com/search>
0 # <- I want to show '3' here
----------
def parse
<200 https://sample.com/search>
3
Update
The program I ultimately want to create is the following code:
[Note: the code below does not work; it is shown only for reference]
import scrapy
from scrapy.http import HtmlResponse


class SampleSpider(scrapy.Spider):
    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = []

    response_url = 'https://sample.com/search'
    my_response = HtmlResponse(url=response_url)
    categories = my_response.css('.categories a::attr(href)').getall()
    for category in categories:
        start_urls.append(category)

    def parse(self, response, **kwargs):
        pages = response.css('h3')
        for page in pages:
            print(page.css('::text').get())
Python 3.8.5
Scrapy 2.5.0
I know what you mean: your start URL is the base domain, but you also want to fetch every category page and extract the h3 elements from them. An HtmlResponse you construct yourself only has a URL; Scrapy never downloads anything for it, so its body is empty and css() finds nothing.
In Scrapy you can extract data and follow new links in the same parse method. Here is an example:
import scrapy


class SampleSpider(scrapy.Spider):
    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = ['https://sample.com/search']

    def parse(self, response, **kwargs):
        print('def parse')
        print(response)

        # extract data here
        pages = response.css('h3')
        for page in pages:
            print(page.css('::text').get())
            # yield an item (a dict); a bare string is not a valid item
            yield {'text': page.css('::text').get()}

        # follow new links here
        categories = response.css('.categories a::attr(href)').getall()
        for category in categories:
            yield scrapy.Request(category, callback=self.parse)
You can read the Scrapy documentation for more information.
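If the category hrefs turn out to be relative, response.follow() is a convenient alternative (a sketch; it resolves the href against the current page before yielding the request):

# inside parse(), instead of building scrapy.Request yourself
for category in response.css('.categories a::attr(href)').getall():
    yield response.follow(category, callback=self.parse)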

Getting an error with an imported module in Scrapy in Python

I am trying to implement a spider in Scrapy, and I am getting an error when I run it. I have tried several things but couldn't resolve it. The error is as follows:
runspider: error: Unable to load 'articleSpider.py': No module named 'wikiSpider.wikiSpider'
I am still learning Python as well as the Scrapy package, but I think this has to do with importing a module from a different directory, so I have included an image of the directory tree of my virtual environment (created in PyCharm) below.
Also note that I am using Python 3.9 as the interpreter for my virtual environment.
The code I am using for the spider is as follows:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wikiSpider.wikiSpider.items import Article


class ArticleSpider(CrawlSpider):
    name = 'articleItems'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Benevolent'
                  '_dictator_for_life']
    rules = [Rule(LinkExtractor(allow='(/wiki/)((?!:).)*$'),
                  callback='parse_items', follow=True)]

    def parse_items(self, response):
        article = Article()
        article['url'] = response.url
        article['title'] = response.css('h1::text').extract_first()
        article['text'] = response.xpath('//div[@id='
                                         '"mw-content-text"]//text()').extract()
        lastUpdated = response.css('li#footer-info-lastmod::text').extract_first()
        article['lastUpdated'] = lastUpdated.replace('This page was last edited on ', '')
        return article
and this is the code in the file the failing import points at (items.py):
import scrapy


class Article(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()
    lastUpdated = scrapy.Field()
from "wikiSpider".wikiSpider.items import Article
change this folder name.
and then edit: from wikiSpider.items import Article
Solved.
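For reference, a minimal sketch of the layout this assumes (the standard structure produced by scrapy startproject wikiSpider; only items.py and the spider file matter here):

wikiSpider/                    # outer project folder (the one to rename if it causes confusion)
    scrapy.cfg
    wikiSpider/                # the importable Python package
        __init__.py
        items.py               # defines Article
        spiders/
            articleSpider.py   # imports: from wikiSpider.items import Article

With this layout the importable package is the inner wikiSpider, so wikiSpider.wikiSpider.items cannot be found; from wikiSpider.items import Article is the import that resolves.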

Scrapy Rules: Exclude certain urls with process links

I am very happy to have discovered the Scrapy CrawlSpider class with its Rule objects. However, when I try to filter out URLs which contain the word "login" with process_links, it doesn't work. The solution I implemented comes from here: Example code for Scrapy process_links and process_request, but it doesn't exclude the pages I want it to.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from accenture.items import AccentureItem


class AccentureSpiderSpider(CrawlSpider):
    name = 'accenture_spider'
    start_urls = ['https://www.accenture.com/us-en/internet-of-things-index']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[contains(@href, "insight")]'),
             callback='parse_item', process_links='process_links', follow=True),
    )

    def process_links(self, links):
        for link in links:
            if 'login' in link.text:
                continue  # skip all links that have "login" in their text
            yield link

    def parse_item(self, response):
        loader = ItemLoader(item=AccentureItem(), response=response)
        url = response.url
        loader.add_value('url', url)
        yield loader.load_item()
My mistake was to use link.text.
When using link.url it works fine :)
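For clarity, a sketch of the corrected method (the only change is the attribute; each link is a scrapy.link.Link object and link.url holds the extracted URL):

def process_links(self, links):
    for link in links:
        if 'login' in link.url:
            continue  # skip all links whose URL contains "login"
        yield link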

Scrapy runs Spider before CrawlerProcess()

I have generated a new project and have a single Python file containing my spider.
The layout is:
import scrapy
from scrapy.http import *
import json
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
import unicodedata
from scrapy import signals
from pydispatch import dispatcher
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field


class TrainerItem(Item):
    name = Field()
    brand = Field()
    link = Field()
    type = Field()
    price = Field()
    previous_price = Field()
    stock_img = Field()
    alt_img = Field()
    size = Field()


class SchuhSpider(scrapy.Spider):
    name = "SchuhSpider"
    payload = {"hash": "g=3|Mens,&c2=340|Mens Trainers&page=1&imp=1&o=new&",
               "url": "/imperfects/", "type": "pageLoad", "NonSecureUrl": "http://www.schuh.co.uk"}
    url = "http://schuhservice.schuh.co.uk/SearchService/GetResults"
    headers = {'Content-Type': 'application/json; charset=UTF-8'}
    finalLinks = []

    def start_requests(self):
        dispatcher.connect(self.quit, signals.spider_closed)
        yield scrapy.Request(url=self.url, callback=self.parse, method="POST", body=json.dumps(self.payload), headers=self.headers)

    def parse(self, response):
        ...  # do stuff

    def quit(self, spider):
        print(spider.name + " is about to die, here are your trainers..")


process = CrawlerProcess()
process.crawl(SchuhSpider)
process.start()
print("We Are Done.")
I run this spider using:
scrapy crawl SchuhSpider
The problem is I'm getting:
twisted.internet.error.ReactorNotRestartable
This is because the spider is actually running twice: once at the start (I'm getting all my POST requests), and then it says "SchuhSpider is about to die, here are your trainers..".
Then it opens the spider a second time, presumably when it does the process stuff.
My question is: How do I get the spider to stop automatically running when the script runs?
Even when I run:
scrapy list
It runs the entire spider (all my POST requests come through). I fear I'm missing something obvious but I can't see what.
You are mixing two ways of running a spider. One way is what you are doing now, i.e. using the
scrapy crawl SchuhSpider
command. In that case, don't (or rather, you don't have to) include the code
process = CrawlerProcess()
process.crawl(SchuhSpider)
process.start()
print("We Are Done.")
as it's intended only for when you want to run the spider from a script (see the documentation).
If you want to retain the possibility to run it either way, just wrap the above code like this
if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(SchuhSpider)
    process.start()
    print("We Are Done.")
so that it doesn't run when the module is just loaded (the case when you run it using scrapy crawl).
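In other words (a sketch of the two invocations; the script file name is hypothetical):

# via the Scrapy CLI -- the guarded CrawlerProcess block is skipped:
#   scrapy crawl SchuhSpider
# as a plain script -- the guarded block starts the crawl:
#   python schuh_spider.py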
