Scrapy - filtering yielded items - python-3.x

I am trying to scrape some items, as shown below:
def parse(self, response):
    item = GameItem()
    item['game_commentary'] = response.css('tr td:nth-child(2)[style*=vertical-align]::text').extract()
    item['game_movement'] = response.xpath("//tr/td[1][contains(@style,'vertical-align: top')]/text()").extract()
    yield item
My problem is that I don't want to yield every value that the response.css / response.xpath selectors above extract.
Is there a way, before assigning the extracted values to item['game_commentary'] and item['game_movement'], to apply a regex or some other filter so that unwanted values are not yielded?

I would look into Item Loaders to accomplish this.
You'll have to rewrite your parsing as follows:
def parse(self, response):
    loader = GameItemLoader(item=GameItem(), response=response)
    loader.add_css('game_commentary', 'tr td:nth-child(2)[style*=vertical-align]::text')
    loader.add_xpath('game_movement', "//tr/td[1][contains(@style,'vertical-align: top')]/text()")
    item = loader.load_item()
    yield item
Your items.py will look something like this:
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

class GameItemLoader(ItemLoader):
    # default input & output processors
    # will be executed for each item loaded,
    # except if a specific input or output processor is specified
    default_output_processor = TakeFirst()

    # you can specify input & output processors per field
    game_commentary_in = '...'
    game_commentary_out = '...'

class GameItem(Item):
    game_commentary = Field()
    game_movement = Field()
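To actually filter values, the '...' placeholders above could be replaced with real processors. As a minimal sketch (the keep_wanted filter and its pattern below are assumptions, not something from the original question), an input processor built with MapCompose drops a value whenever the function returns None:

import re
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

def keep_wanted(value):
    # MapCompose ignores None results, so returning None drops the value
    value = value.strip()
    return value if re.search(r'\S', value) else None  # placeholder: drop empty/whitespace-only strings

class GameItemLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # run keep_wanted() over every extracted value before it is stored on the item
    game_commentary_in = MapCompose(keep_wanted)
    game_movement_in = MapCompose(keep_wanted)

Swap the placeholder pattern for whatever regex actually distinguishes the wanted values; with this in place, loader.load_item() only ever sees values that survive keep_wanted().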

Related

Where are the mistakes in my Scrapy code?

I want to crawl a website with Scrapy, but my code raises an error.
I have tried to use XPath, but it seems I cannot select the div class on the site.
The following code raises an error on ("h2 ::text").extract().
import scrapy
from scrapy.selector import Selector
from artistlist.items import ArtistlistItem

class MySpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["baltictriennial13.org"]
    start_urls = ["https://www.baltictriennial13.org/artist/caroline-achaintre/"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='artist']")
        items = []
        for titles in titles:
            item = ArtistlistItem()
            item["artist"] = titles.select("h2 ::text").extract()
            item["biograpy"] = titles.select("p::text").extract()
            items.append(item)
        return items
I want to crawl the web site and store the data in a .csv file.
The main issue with your code is the use of .select instead of .css. Here is what you need, though I'm not sure about the titles part (maybe you need it on other pages):
def parse(self, response):
    titles = response.xpath("//div[@class='artist']")
    # items = []
    for title in titles:
        item = ArtistlistItem()
        item["artist"] = title.css("h2::text").get()
        item["biograpy"] = title.css("p::text").get()
        # items.append(item)
        yield item
Try removing the space in h2 ::text --> h2::text. If that doesn't work, try h2/text().
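Regarding the .csv part of the question: Scrapy's feed exports can write whatever the spider yields, so no extra code is strictly needed (scrapy crawl artistlist -o artists.csv). As a rough sketch only, on Scrapy 2.1+ the output file can also be declared on the spider itself via the FEEDS setting (the artists.csv file name here is just an example):

import scrapy
from artistlist.items import ArtistlistItem

class MySpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["baltictriennial13.org"]
    start_urls = ["https://www.baltictriennial13.org/artist/caroline-achaintre/"]
    # Equivalent to running: scrapy crawl artistlist -o artists.csv
    custom_settings = {"FEEDS": {"artists.csv": {"format": "csv"}}}

    def parse(self, response):
        for title in response.xpath("//div[@class='artist']"):
            item = ArtistlistItem()
            item["artist"] = title.css("h2::text").get()
            item["biograpy"] = title.css("p::text").get()
            yield item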

How to follow a link from inside a for loop?

I am using Scrapy to scrape a website. I am in a loop where every item has a link, and I want to follow that link on every iteration of the loop.
import scrapy

class MyDomainSpider(scrapy.Spider):
    name = 'My_Domain'
    allowed_domains = ['MyDomain.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        Colums = response.xpath('//*[@id="tab-5"]/ul/li')
        for colom in Colums:
            title = colom.xpath('//*[@class="lng_cont_name"]/text()').extract_first()
            address = colom.xpath('//*[@class="adWidth cont_sw_addr"]/text()').extract_first()
            con_address = address[9:-9]
            url = colom.xpath('//*[@id="tab-5"]/ul/li/@data-href').extract_first()
            print(url)
            print('*********************')
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        print('000000000000000000')
        a = response.xpath('//*[@class="fn"]/text()').extract_first()
        print(a)
I have tried something like this, but the zeros print only once while the stars print 10 times. I want the second function to run every time the loop runs.
You are probably doing something that is not intended. With
url = colom.xpath('//*[#id="tab-5"]/ul/li/#data-href').extract_first()
inside the loop, url always evaluates to the same value. By default, Scrapy filters duplicate requests (see here). If you really want to scrape the same URL multiple times, you can disable the filtering at the request level with the dont_filter=True argument to the scrapy.Request constructor. However, I think what you really want is something like this (only the relevant part of the code is shown):
def parse(self, response):
    Colums = response.xpath('//*[@id="tab-5"]/ul/li')
    for colom in Colums:
        url = colom.xpath('./@data-href').extract_first()
        yield scrapy.Request(url, callback=self.parse_dir_contents)
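For completeness, a rough, untested sketch of how the other fields could be handled the same way: every selector inside the loop is made relative to the current li with a leading ./ or .//, and the data already extracted is passed along to the second callback (cb_kwargs needs Scrapy 1.7+; on older versions, meta= works instead):

def parse(self, response):
    for colom in response.xpath('//*[@id="tab-5"]/ul/li'):
        title = colom.xpath('.//*[@class="lng_cont_name"]/text()').extract_first()
        address = colom.xpath('.//*[@class="adWidth cont_sw_addr"]/text()').extract_first()
        url = colom.xpath('./@data-href').extract_first()
        if url:
            # response.follow also resolves relative URLs for you
            yield response.follow(url, callback=self.parse_dir_contents,
                                  cb_kwargs={'title': title, 'address': address})

def parse_dir_contents(self, response, title=None, address=None):
    name = response.xpath('//*[@class="fn"]/text()').extract_first()
    yield {'title': title, 'address': address, 'name': name}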

How to get/import the Scrapy item list from items.py to pipelines.py?

In my items.py:
class NewAdsItem(Item):
    AdId = Field()
    DateR = Field()
    AdURL = Field()
In my pipelines.py:
import sqlite3
from scrapy.conf import settings

con = None

class DbPipeline(object):
    def __init__(self):
        self.setupDBCon()
        self.createTables()

    def setupDBCon(self):
        # This is NOT OK!
        # I want to get the items already HERE!
        dbfile = settings.get('SQLITE_FILE')
        self.con = sqlite3.connect(dbfile)
        self.cur = self.con.cursor()

    def createTables(self):
        # OR optionally HERE.
        self.createDbTable()
        ...

    def process_item(self, item, spider):
        self.storeInDb(item)
        return item

    def storeInDb(self, item):
        # This is OK, I CAN get the items in here, using:
        # item.keys() and/or item.values()
        sql = "INSERT INTO {0} ({1}) VALUES ({2})".format(self.dbtable, ','.join(item.keys()), ','.join(['?'] * len(item.keys())))
        ...
How can I get the item field names (like "AdId", etc.) from items.py before process_item() (in pipelines.py) is executed?
I use scrapy runspider myspider.py for execution.
I already tried adding "item" and/or "spider" as arguments, like def setupDBCon(self, item), but that didn't work and resulted in:
TypeError: setupDBCon() missing 1 required positional argument: 'item'
UPDATE: 2018-10-08
Result (A):
Partially following the solution from @granitosaurus, I found that I can get the item keys as a list by:
Adding (a): from adbot.items import NewAdsItem to my main spider code.
Adding (b): ikeys = NewAdsItem.fields.keys() within the spider class.
I could then access the keys from my pipelines.py via:
def open_spider(self, spider):
    self.ikeys = list(spider.ikeys)
    print("Keys in pipelines: \t%s" % ",".join(self.ikeys))
    #self.createDbTable(ikeys)
However, there were 2 problems with this method:
I was not able to get the ikeys list into createDbTable(). (I kept getting errors about missing arguments here and there.)
The ikeys list (as retrieved) was re-arranged and did not keep the order of the items as they appear in items.py, which partially defeated the purpose. I still don't understand why they are out of order, when all the docs say that Python 3 should keep the order of dicts and lists, etc. At the same time, when using process_item() and getting the items via item.keys(), their order remains intact.
Result (B):
At the end of the day, it turned out too laborious and complicated to fix (A), so I just imported the relevant items.py class into my pipelines.py and use the item list as a global variable, like this:
def createDbTable(self):
    self.ikeys = NewAdsItem.fields.keys()
    print("Keys in creatDbTable: \t%s" % ",".join(self.ikeys))
    ...
In this case I just decided to accept that the retrieved list seems to be alphabetically sorted, and worked around the issue by changing the key names. (Cheating!)
This is disappointing, because the code is ugly and contorted.
Any better suggestions would be much appreciated.
Scrapy pipelines have 3 connected methods:
process_item(self, item, spider)
This method is called for every item pipeline component.
process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred or raise DropItem exception. Dropped items are no longer processed by further pipeline components.
open_spider(self, spider)
This method is called when the spider is opened.
close_spider(self, spider)
This method is called when the spider is closed.
https://doc.scrapy.org/en/latest/topics/item-pipeline.html
So you can only get access to the item in the process_item method.
If you want to get the item class, however, you can attach it to the spider class:
class MySpider(Spider):
    item_cls = MyItem

class MyPipeline:
    def open_spider(self, spider):
        fields = spider.item_cls.fields
        # fields is a dictionary of key: default value
        self.setup_table(fields)
Alternatively, you can lazy-load it during the process_item method itself:
class MyPipeline:
    item = None

    def process_item(self, item, spider):
        if not self.item:
            self.item = item
            self.setup_table(item)
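Building on the spider-attribute approach, here is a minimal sketch of what the original createDbTable() goal could look like. The database file name, table name, and TEXT column type are placeholders, and it assumes the spider defines item_cls = NewAdsItem as above:

import sqlite3

class DbPipeline(object):
    def open_spider(self, spider):
        self.con = sqlite3.connect("ads.db")           # placeholder file name
        self.cur = self.con.cursor()
        columns = list(spider.item_cls.fields.keys())  # field names straight from the Item class
        self.cur.execute("CREATE TABLE IF NOT EXISTS ads ({})".format(
            ", ".join("{} TEXT".format(c) for c in columns)))
        self.con.commit()

    def process_item(self, item, spider):
        self.cur.execute(
            "INSERT INTO ads ({}) VALUES ({})".format(
                ",".join(item.keys()), ",".join("?" * len(item))),
            list(item.values()))
        self.con.commit()
        return item

    def close_spider(self, spider):
        self.con.close()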

Scrapy - xpath - extract returns null

My goal is to build a scraper that extracts data from a table on this site.
Initially I followed the Scrapy tutorial, where I succeeded in extracting data from the test site. When I try to replicate it for Bitinfocharts, the first issue is that I need to use XPath, which the tutorial doesn't cover in detail (they use CSS only). I have been able to scrape the specific data I want through the shell.
My current issue is understanding how I can scrape it all from my code and at the same time write the results to a .csv / .json file.
I'm probably missing something completely obvious. If you can have a look at my code and let me know what I'm doing wrong, I would deeply appreciate it.
Thanks!
First attempt:
import scrapy

class RichlistTestItem(scrapy.Item):
    # overview details
    wallet = scrapy.Field()
    balance = scrapy.Field()
    percentage_of_coins = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domain = ['https://bitinfocharts.com/']
    start_urls = [
        'https://bitinfocharts.com/top-100-richest-vertcoin-addresses.html'
    ]

    def parse(self, response):
        for sel in response.xpath("//*[@id='tblOne']/tbody/tr/"):
            scrapy.Item in RichlistTestItem()
            scrapy.Item['wallet'] = sel.xpath('td[2]/a/text()').extract()[0]
            scrapy.Item['balance'] = sel.xpath('td[3]/a/text').extract()[0]
            scrapy.Item['percentage_of_coins'] = sel.xpath('/td[4]/a/text').extract()[0]
            yield('wallet', 'balance', 'percentage_of_coins')
Second attempt: (probably closer to 50th attempt)
import scrapy

class RichlistTestItem(scrapy.Item):
    # overview details
    wallet = scrapy.Field()
    balance = scrapy.Field()
    percentage_of_coins = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domain = ['https://bitinfocharts.com/']
    start_urls = [
        'https://bitinfocharts.com/top-100-richest-vertcoin-addresses.html'
    ]

    def parse(self, response):
        for sel in response.xpath("//*[@id='tblOne']/tbody/tr/"):
            wallet = sel.xpath('td[2]/a/text()').extract()
            balance = sel.xpath('td[3]/a/text').extract()
            percentage_of_coins = sel.xpath('/td[4]/a/text').extract()
            print(wallet, balance, percentage_of_coins)
I have fixed your second attempt; specifically, the code snippet below:
for sel in response.xpath('//*[@id="tblOne"]/tbody/tr'):
    wallet = sel.xpath('td[2]/a/text()').extract()
    balance = sel.xpath('td[3]/text()').extract()
    percentage_of_coins = sel.xpath('td[4]/text()').extract()
The problems I found are:
- there was a trailing "/" in the table row selector;
- for balance, the value was inside the td, not inside a link inside the td;
- for percentage_of_coins, again, the value was inside the td.
Also, there is a data-val attribute on each td. Scraping those might be a little easier than getting the value from inside the td.
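As for getting the results into a .csv / .json file, which the question also asks about: the parse method just needs to yield items, and Scrapy's feed exports take care of the file (run with scrapy crawl quotes -o richlist.csv or -o richlist.json). A minimal sketch building on the corrected selectors above:

def parse(self, response):
    for sel in response.xpath('//*[@id="tblOne"]/tbody/tr'):
        item = RichlistTestItem()
        item['wallet'] = sel.xpath('td[2]/a/text()').extract_first()
        item['balance'] = sel.xpath('td[3]/text()').extract_first()
        item['percentage_of_coins'] = sel.xpath('td[4]/text()').extract_first()
        yield item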

yield scrapy.Request does not return the title

I am new to Scrapy and am trying to use it to practice crawling websites. However, even though I followed the code provided by the tutorial, it does not return any results. It looks like yield scrapy.Request does not work. My code is as follows:
import scrapy
from bs4 import BeautifulSoup
from apple.items import AppleItem

class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        appleitem = AppleItem()
        appleitem['title'] = res.select('h1')[0].text
        appleitem['content'] = res.select('.trans')[0].text
        appleitem['time'] = res.select('.gggs time')[0].text
        return appleitem
It shows that the spider was opened and closed, but it returns nothing. The Python version is 3.6. Can anyone please help? Thanks.
EDIT I
The crawl log can be reached here.
EDIT II
Maybe changing the code as below will make the issue clearer:
import scrapy
from bs4 import BeautifulSoup

class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        print(res.select('#h1')[0].text)
The code should print out the URL and the title separately, but it does not return anything.
Your log states:
2017-07-10 19:12:47 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.appledaily.com.tw': http://www.appledaily.com.tw/realtimenews/article/life/20170710/1158177/oBike%E7%A6%81%E5%81%9C%E6%A9%9F%E8%BB%8A%E6%A0%BC%E3%80%80%E6%96%B0%E5%8C%97%E7%81%AB%E9%80%9F%E5%86%8D%E5%85%AC%E5%91%8A6%E5%8D%80%E7%A6%81%E5%81%9C>
Your spider is set to:
allowed_domains = ['appledaily.com']
So it should probably be:
allowed_domains = ['appledaily.com.tw']
It seems like the content you are interested in in your parse method (i.e. the list items with class rtddt) is generated dynamically -- it can be inspected, for example, using Chrome, but is not present in the HTML source (which is what Scrapy obtains as a response).
You will have to use something to render the page for Scrapy first. I would recommend Splash together with the scrapy-splash package.
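A rough, untested sketch of what the Splash route could look like; the wait value is arbitrary, and the project settings still need the scrapy-splash middleware configuration described in that package's README (SPLASH_URL, downloader/spider middlewares, dupefilter):

import scrapy
from scrapy_splash import SplashRequest

class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com.tw']  # note the corrected domain
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def start_requests(self):
        for url in self.start_urls:
            # args={'wait': 2} gives the JavaScript time to render before Splash returns the HTML
            yield SplashRequest(url, callback=self.parse, args={'wait': 2})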

Resources