I have recently returned to some Scrapy code I wrote a few months ago.
The objective of the code was to scrape some Amazon products for data. It worked like this:
Let's take this page as an example:
https://www.amazon.com/s?k=mac+makeup&crid=2JQQNTWC87ZPV&sprefix=MAC+mak%2Caps%2C312&ref=nb_sb_ss_i_1_7
What the code does is enter every product on that page and get data from it; after it finishes scraping all the data from that page, it moves on to the next one (page 2 in this case).
That last part stopped working.
I have something like this in the rules (I had to rewrite some of the XPaths because they were outdated):
import scrapy
import re
import string
import random
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapyJuan.items import GenericItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request


class GenericScraperSpider(CrawlSpider):

    name = "generic_spider"

    # Allowed domain
    allowed_domain = ['www.amazon.com']

    search_url = 'https://www.amazon.com/s?field-keywords={}'

    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'GenericProducts.csv'
    }

    rules = {
        # Next button
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//li[@class="a-last"]/a/@href'))),
        # Every element of the page
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[contains(@class, "a-link-normal") and contains(@class, "a-text-normal")]')),
             callback='parse_item', follow=False),
    }

    def start_requests(self):
        txtfile = open('productosGenericosABuscar.txt', 'r')
        keywords = txtfile.readlines()
        txtfile.close()
        for keyword in keywords:
            yield Request(self.search_url.format(keyword))

    def parse_item(self, response):
This worked like a month ago, but I can't make it work now.
Any ideas on what's wrong?
Amazon has an antibot mechanism that asks for a captcha after a number of requests. You can confirm this by checking the returned HTTP status code: if Amazon is waiting for a captcha, you should receive something like 503 Service Unavailable. I don't see anything wrong in your code snippet (apart from {} instead of () for rules, which isn't actually affecting the results, since you can still iterate over it).
Furthermore, make sure your spider inherits from CrawlSpider and not from plain scrapy.Spider.
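A hedged sketch of one way to surface those captcha responses instead of silently dropping them (the body check for "captcha" is an assumption about Amazon's markup, and the settings are standard Scrapy options, not taken from the original code):

from scrapy.spiders import CrawlSpider

class CaptchaAwareSpider(CrawlSpider):
    name = 'captcha_aware'
    handle_httpstatus_list = [503]        # let 503 responses reach the callback instead of being filtered out
    custom_settings = {
        'DOWNLOAD_DELAY': 2,              # crawl more slowly to trip the antibot check less often
        'AUTOTHROTTLE_ENABLED': True,
    }

    def parse_item(self, response):
        if response.status == 503 or b'captcha' in response.body.lower():
            self.logger.warning('Amazon served a captcha/antibot page for %s', response.url)
            return                        # skip this page rather than parse a captcha form
        # ... normal item parsing would go here ...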
I want to scrape articles from a web page (example article). My code should scrape all of the article text. I'm doing it with XPath. After pasting the following XPath in Dev Tools (1. Ctrl+Shift+I, 2. Ctrl+F):
//div[@class="item-page clearfix"]/*[self::p/text() or self::strong/text() or self::ol/text() or self::blockquote/text()]
it seems like it works and is able to find all the text; the web page shows me that the XPath is working properly. But Python and Scrapy think otherwise: the code below returns only the first paragraph of the article in the JSON output. I can't understand why. Why does it work on the web page but not in Python? What did I miss?
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags


class LubaczowSpider(CrawlSpider):
    name = 'Lubaczow'
    allowed_domains = ['zlubaczowa.pl']
    start_urls = ['http://zlubaczowa.pl/index.php/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//p[@class='readmore']/a"), callback='parse', follow=True),
    )

    def parse(self, response):
        yield {
            "Text": response.xpath('normalize-space(//div[@class="item-page clearfix"]/*[self::p/text() or self::strong/text() or self::ol/text() or self::blockquote/text()])').getall(),
            "Url": response.url
        }
Thank you in advance for your suggestions and help!
When you use normalize-space in XPath 1.0 (which I believe is what Scrapy uses), the node-set you pass in is first converted to a string, and that conversion only uses the first node; leading and trailing whitespace is then stripped and internal runs of whitespace are collapsed (see the normalize-space documentation on MDN). The effect is that only the first matched node's text is returned, hence you only get the first paragraph back.
You can instead obtain all the text data from the child nodes and then join it into one string. See the sample code below:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags


class LubaczowSpider(CrawlSpider):
    name = 'Lubaczow'
    allowed_domains = ['zlubaczowa.pl']
    start_urls = ['http://zlubaczowa.pl/index.php/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//p[@class='readmore']/a"), callback='parse', follow=True),
    )

    def parse(self, response):
        all_text = response.xpath("//div[@class='item-page clearfix']//child::text()").getall()
        text = ''.join([r.strip() for r in all_text])  # remove surrounding spaces and combine into one string
        yield {
            "Text": text,
            "Url": response.url
        }
Two days ago I was trying to parse the data between two identical classes, and Keyur helped me a lot; after that, other problems came up. :D
Page link
Now I want to get the links under a specific class. Here is my code, and here are the errors.
from bs4 import BeautifulSoup
import urllib.request
import datetime

headers = {}  # Headers give information about you, like your operating system, your browser, etc.
headers['User-Agent'] = 'Mozilla/5.0'  # I defined a user agent because HLTV perceives my connection as a bot.

hltv = urllib.request.Request('https://www.hltv.org/matches', headers=headers)  # Basically connecting to the website
session = urllib.request.urlopen(hltv)
sauce = session.read()  # Getting the source of the website

soup = BeautifulSoup(sauce, 'lxml')

a = 0
b = 1

# Getting the match pages' links.
for x in soup.find('span', text=datetime.date.today()).parent:
    print(x.find('a'))
Error:
Actually there isn't any error, but the output looks like this:
None
None
None
-1
None
None
-1
Then I researched and found that if there isn't any data to return, the find function gives you nothing, which is None.
Then I tried to use find_all.
Code:
print(x.find_all('a'))
Output:
AttributeError: 'NavigableString' object has no attribute 'find_all'
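As an aside (not part of the original post): iterating over .parent yields both Tag objects and bare text nodes (NavigableString). NavigableString subclasses str, so x.find('a') falls back to the string method (which likely explains the -1 values above), and it has no find_all at all (hence the AttributeError). A hedged sketch of skipping those text nodes, with the date converted to a string for the text match (an assumption):

from bs4 import NavigableString

headline = soup.find('span', text=str(datetime.date.today()))
if headline is not None:                      # find() returns None when nothing matches
    for x in headline.parent:
        if isinstance(x, NavigableString):    # skip bare text nodes; they have no find_all()
            continue
        for a in x.find_all('a'):
            print(a.get('href'))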
This is the class name:
<div class="standard-headline">2018-05-01</div>
I don't want to post all the code here, so here is the link, hltv.org/matches/, so you can check the classes more easily.
I'm not quite sure I understand which links the OP really wants to grab; however, I took a guess. The links are within the compound class a-reset block upcoming-match standard-box, and if you can spot the right class, then one individual class will suffice to fetch the data, just as selectors do. Give it a shot.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.parse import urljoin
import datetime

url = 'https://www.hltv.org/matches'

req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
res = urlopen(req).read()
soup = BeautifulSoup(res, 'lxml')
for links in soup.find(class_="standard-headline", text=(datetime.date.today())).find_parent().find_all(class_="upcoming-match")[:-2]:
    print(urljoin(url, links.get('href')))
Output:
https://www.hltv.org/matches/2322508/yeah-vs-sharks-ggbet-ascenso
https://www.hltv.org/matches/2322633/team-australia-vs-team-uk-showmatch-csgo
https://www.hltv.org/matches/2322638/sydney-saints-vs-control-fe-lil-suzi-winner-esl-womens-sydney-open-finals
https://www.hltv.org/matches/2322426/faze-vs-astralis-iem-sydney-2018
https://www.hltv.org/matches/2322601/max-vs-fierce-tiger-starseries-i-league-season-5-asian-qualifier
and so on ------
I am new to Scrapy and am trying to use it to practice crawling websites. However, even though I followed the code provided by the tutorial, it does not return any results. It looks like yield scrapy.Request is not working. My code is as follows:
import scrapy
from bs4 import BeautifulSoup
from apple.items import AppleItem


class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        appleitem = AppleItem()
        appleitem['title'] = res.select('h1')[0].text
        appleitem['content'] = res.select('.trans')[0].text
        appleitem['time'] = res.select('.gggs time')[0].text
        return appleitem
It shows that the spider was opened and closed, but it returns nothing. The Python version is 3.6. Can anyone please help? Thanks.
EDIT I
The crawl log can be reached here.
EDIT II
Maybe changing the code as below will make the issue clearer:
import scrapy
from bs4 import BeautifulSoup


class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        print(res.select('#h1')[0].text)
The code should print out the URL and the title separately, but it does not return anything.
Your log states:
2017-07-10 19:12:47 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.appledaily.com.tw': http://www.appledaily.com.tw/realtimenews/article/life/20170710/1158177/oBike%E7%A6%81%E5%81%9C%E6%A9%9F%E8%BB%8A%E6%A0%BC%E3%80%80%E6%96%B0%E5%8C%97%E7%81%AB%E9%80%9F%E5%86%8D%E5%85%AC%E5%91%8A6%E5%8D%80%E7%A6%81%E5%81%9C>
Your spider is set to:
allowed_domains = ['appledaily.com']
So it should probably be:
allowed_domains = ['appledaily.com.tw']
It seems like the content your parse method is interested in (i.e. the list items with class rtddt) is generated dynamically: it can be inspected, for example, in Chrome, but it is not present in the HTML source (which is what Scrapy obtains as the response).
You will have to use something to render the page for Scrapy first. I would recommend Splash together with the scrapy-splash package.
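A minimal sketch of that setup, assuming a Splash instance running locally on the default port 8050; the settings block is the standard scrapy-splash boilerplate, and the spider name and selectors are illustrative rather than taken from the original post:

# settings.py (standard scrapy-splash configuration)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# spider: render the listing page in Splash before parsing it
import scrapy
from scrapy_splash import SplashRequest

class AppleSplashSpider(scrapy.Spider):
    name = 'apple_splash'
    allowed_domains = ['appledaily.com.tw']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 1.0})

    def parse(self, response):
        # the rendered HTML now contains the dynamically generated .rtddt items
        for href in response.css('.rtddt a::attr(href)').getall():
            yield SplashRequest(response.urljoin(href), self.parse_detail)

    def parse_detail(self, response):
        yield {'title': response.css('h1::text').get(), 'url': response.url}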
I tried to scrape flipkart.com (I randomly opened a category that displayed 60 products).
However, when I used BeautifulSoup to search for all the links, I didn't get the links pointing to each product. I obtained 37 links, none of which pointed to a product description page. Help!
import requests
from bs4 import BeautifulSoup

# a random product listing page
url = 'https://www.flipkart.com/search?q=mobile&sid=tyy/4io&as=on&as-show=on&otracker=start&as-pos=1_1_ic_mobile'
r = requests.get(url)
soup = BeautifulSoup(r.text, from_encoding="utf-8")
links = soup.find_all('a')
It gave all the links except the links to the product description pages.
As I understand it (warning, I'm a noob): when you open the page in question with a normal browser, there is JavaScript in the page that, when processed, creates additional HTML that your browser adds to the document it shows you. When you use the requests module to get the page HTML, that JavaScript is not processed, so you never get this extra content. The info you want is contained in this missing content. So:
Based on code from this thread: Web-scraping JavaScript page with Python
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from bs4 import BeautifulSoup


# Take this class for granted. Just use the result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()


url = 'https://www.flipkart.com/search?q=mobile&sid=tyy/4io&as=on&as-show=on&otracker=start&as-pos=1_1_ic_mobile'
r = Render(url)
result = r.frame.toHtml()

soup = BeautifulSoup(result, 'lxml')
links = soup.find_all('div', {'class': 'col col-7-12'})
target_links = [link.parent.parent.parent for link in links]
for link in target_links:
    try:
        print(link.find('a')['href'])
    except TypeError:  # we caught unwanted links in the find_all
        pass
I'm sure the way I steered to the links could be improved.
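One possible simplification, continuing from the soup object built above (this relies on an assumption about Flipkart's URL scheme that I have not verified against the original page: product description URLs contain a /p/ segment), which filters the rendered HTML for those links directly instead of walking up through parent tags:

# filter the rendered page for product-description links directly
product_links = soup.find_all('a', href=lambda h: h and '/p/' in h)
for link in product_links:
    print('https://www.flipkart.com' + link['href'])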
I'm trying to get started crawling and scraping a website to disk, but I'm having trouble getting the callback function to work as I would like.
The code below will visit the start_url and find all the a tags on the site. For each one of them it will make a callback, which saves the text response to disk and uses crawlerItem to store some metadata about the page.
I was hoping someone could help me figure out how to:
1. pass a unique id to each callback so it can be used as the filename when saving the file;
2. pass the URL of the originating page so it can be added to the metadata via the items;
3. follow the links on the child pages to go another level deeper into the site.
Below is my code so far.
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from mycrawler.items import crawlerItem


class CrawlSpider(scrapy.Spider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]

    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True)
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()
        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = response.url
        yield item
In settings.py:
ITEM_PIPELINES = [
    'librarycrawler.files.FilesPipeline',
]
FILES_STORE = 'C:\Documents\Spider\crawler\ExtractedText'
In items.py:
import scrapy


class LibrarycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    Files = scrapy.Field()
I'm not 100% sure, but I think you can't rename the files Scrapy's files pipeline downloads however you want; Scrapy chooses those names itself.
What you want to do looks like a job for CrawlSpider instead of Spider.
CrawlSpider by itself follows every link it finds on every page recursively, and you can set rules on which pages you want to scrape. Here are the docs.
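A minimal sketch of that CrawlSpider approach, reusing names from the question where possible; the XPath and item fields are illustrative, not a drop-in replacement:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class LibraryCrawler(CrawlSpider):
    name = 'librarycrawler'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # rules must be an iterable (note the trailing comma); follow=True keeps descending into child pages
    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True),
    )

    def scrape_page(self, response):
        yield {
            'title': response.xpath('//title/text()').get(),
            'url': response.url,   # the originating page's url, which can double as a unique identifier
        }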
If you would rather keep Spider, you can use the meta attribute on requests to pass the items and save links in them:
for link in soup.find_all("a"):
    item = crawlerItem()
    item['url'] = response.urljoin(link.get('href'))
    request = scrapy.Request(item['url'], callback=self.scrape_page)
    request.meta['item'] = item
    yield request
To get the item back, just look it up in the response:
def scrape_page(self, response):
    item = response.meta['item']
In this specific example, passing item['url'] is redundant, as you can get the current URL with response.url.
Also, it's a bad idea to use BeautifulSoup inside Scrapy, as it just slows you down; Scrapy's own selectors are well developed enough that you don't need anything else to extract data.
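For example, the title extraction in the question could use Scrapy's selectors instead of BeautifulSoup (a sketch, not the original author's code; note that FilesPipeline expects file_urls to be a list, and that the field must be declared on the item):

def scrape_page(self, response):
    item = LibrarycrawlerItem()
    item['title'] = response.xpath('//title/text()').get()
    # file_urls must be declared as a scrapy.Field() in LibrarycrawlerItem,
    # and FilesPipeline expects a list of urls here
    item['file_urls'] = [response.url]
    yield item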