Using scrapy to find specific text from multiple websites

I would like to crawl/check multiple websites (on the same domain) for a specific keyword. I have found this script, but I can't find how to add the specific keyword to be searched for. What the script needs to do is find the keyword and report the link on which it was found. Could anyone point me to where I could read more about this?
I have been reading Scrapy's documentation, but I can't seem to find this.
Thank you.
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        **parsing data from the webpage**

You'll need to use a parser or a regex to find the text you are looking for inside the response body.
Every Scrapy callback receives the response body inside the response object, which you can check with response.body (for example inside the parse method). From there you can use a regex or, better, XPath or CSS selectors to navigate to your text, based on the HTML structure of the page you crawled.
Scrapy lets you use the response object as a Selector, so you can get the title of the page with response.xpath('//head/title/text()'), for example.
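For example, here is a minimal sketch of that idea (the keyword, URL pattern and page range below are placeholders, not taken from the original script) that yields the URL of every crawled page containing the keyword:

import scrapy

class KeywordSpider(scrapy.Spider):
    name = 'keyword'
    allowed_domains = ['example.com']
    # hypothetical URL pattern and page range, mirroring the script above
    start_urls = ['http://example.com/page/%d' % i for i in range(1000, 500, -1)]
    keyword = 'discount'  # placeholder keyword to search for

    def parse(self, response):
        # response.text is the decoded body; a plain substring check is the
        # simplest test, but a regex or an XPath/CSS selector works as well
        if self.keyword in response.text:
            yield {'url': response.url, 'keyword': self.keyword}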
Hope it helped.

Related

I am trying to join a URL in Scrapy but am unable to do so

I am trying to fetch (id and name), i.e. the name, from one website and want to append that variable to another link. For example, in the name variable I get /in/en/books/1446502-An-Exciting-Day (there are many records), and then I want to append the name variable to 'https://www.storytel.com' to fetch data specific to the book. I also want to put a condition on a_name: if response.css('span.expandAuthorName::text') is not available, put '-', else fetch the name.
import scrapy

class BrickSetSpider(scrapy.Spider):
    name = 'brickset-spider'
    start_urls = ['https://www.storytel.com/in/en/categories/1-Children?pageNumber=100']

    def parse(self, response):
        # for quote in response.css('div.gridBookTitle'):
        #     item = {
        #         'name': quote.css('a::attr(href)').extract_first()
        #     }
        #     yield item
        urls = response.css('div.gridBookTitle > a::attr(href)').extract()
        for url in urls:
            url = ['https://www.storytel.com'].urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            'a_name': response.css('span.expandAuthorName::text').extract_first()
        }
I am trying to append with "https://www.storytel.com".urljoin(url) but I am getting an error for the same. Being new to Scrapy I tried many things but was unable to resolve the issue. The error I get is: in line 15, list object has no attribute urljoin. Any leads on how to overcome this? Thanks in advance.
Check with this solution:
for url in urls:
    url = 'https://www.storytel.com' + url
    yield scrapy.Request(url=url, callback=self.parse_details)
Let me know if it helps.
url = ['https://www.storytel.com'].urljoin(url)
Here you are trying to call urljoin on a list of strings, not on a string. If you want to append a given url (which is a string) to the base string (https://etc...), you can do it with plain concatenation:
full_url = "https://www.storytel.com" + url
Note that str.join would not help here: "https://www.storytel.com".join(url) would insert the base string between every character of url. You can check the docs about strings (specifically 'join') here: https://docs.python.org/3.8/library/stdtypes.html#str.join
EDIT: urljoin does exist, but as urllib.parse.urljoin (and as the response.urljoin shortcut in Scrapy), not as a method of a list.
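As a sketch of the response.urljoin route (reusing the selectors from the question, which I am assuming are correct for the site):

def parse(self, response):
    for href in response.css('div.gridBookTitle > a::attr(href)').extract():
        # response.urljoin resolves a relative href against the page URL, so
        # '/in/en/books/...' becomes 'https://www.storytel.com/in/en/books/...'
        yield scrapy.Request(url=response.urljoin(href), callback=self.parse_details)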

Scrape multiple articles from one page, each article with a separate href

I am new to Scrapy and writing my first spider, for a website similar to https://blogs.webmd.com/diabetes/default.htm.
I want to scrape the headlines and then navigate to each article and scrape its text content.
I have tried using rules and LinkExtractor but it's not able to navigate to the next page and extract. I get: ERROR: Spider error processing https://blogs.webmd.com/diabetes/default.htm> (referer: None)
Below is my code
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class MedicalSpider(scrapy.Spider):
    name = 'medical'
    allowed_domains = ['https://blogs.webmd.com/diabetes/default.htm']
    start_urls = ['https://blogs.webmd.com/diabetes/default.htm']
    Rules = (Rule(LinkExtractor(allow=(), restrict_css=('.posts-list-post-content a ::attr(href)')), callback="parse", follow=True),)

    def parse(self, response):
        headline = response.css('.posts-list-post-content::text').extract()
        body = response.css('.posts-list-post-desc::text').extract()
        print("%s : %s" % (headline, body))
        next_page = response.css('.posts-list-post-content a ::attr(href)').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = next_href
            request = scrapy.Request(url=next_page_url)
            yield request
Please guide a newbie in scrapy to get this spider right for multiple articles on each page.
Usually when using Scrapy, each response is parsed by a callback; the main parse method is the callback for the initial responses obtained for each of the start_urls.
The goal of that parse function should then be to identify the article links and issue requests for each of them. Those responses would then be parsed by another callback, say parse_article, which would extract all the contents from that particular article.
You don't even need that LinkExtractor. Consider:
import scrapy

class MedicalSpider(scrapy.Spider):
    name = 'medical'
    allowed_domains = ['blogs.webmd.com']  # Only the domain, not the URL
    start_urls = ['https://blogs.webmd.com/diabetes/default.htm']

    def parse(self, response):
        article_links = response.css('.posts-list-post-content a ::attr(href)')
        for link in article_links:
            url = link.get()
            if url:
                yield response.follow(url=url, callback=self.parse_article)

    def parse_article(self, response):
        headline = 'some-css-selector-to-get-the-headline-from-the-article-page'
        # The body is trickier, since it's spread through several tags on this particular site
        body = 'loop-over-some-selector-to-get-the-article-text'
        yield {
            'headline': headline,
            'body': body
        }
I've not pasted the full code because I believe you still want some of the excitement of learning how to do this, but you can find what I came up with on this gist.
Note that the parse_article method is yielding dictionaries. Scrapy handles these just like items, so you can get a neat JSON output by running your code with: scrapy runspider headlines/spiders/medical.py -o out.json

Scrapy spider returns no items data

My Scrapy script seems not to follow links, and so it ends up not extracting data from each of them (to pass some content as Scrapy items).
I am trying to scrape a lot of data from a news website. I managed to copy/write a spider that, as I assumed, should read links from a file (I've generated it with another script), put them in the start_urls list and start following these links to extract some data, then pass the data as items, and also write each item's data to a separate file (that last part is actually for another question).
After running scrapy crawl PNS, the script goes through all the links from start_urls but does nothing more -- it follows the links read from the start_urls list (I'm getting a "GET link" message in bash), but seems not to enter them and read further links to follow and extract data from.
import scrapy
import re
from ProjectName.items import ProjectNameArticle

class ProjectNameSpider(scrapy.Spider):
    name = 'PNS'
    allowed_domains = ['www.project-domain.com']
    start_urls = []
    with open('start_urls.txt', 'r') as file:
        for line in file:
            start_urls.append(line.strip())

    def parse(self, response):
        for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('@href').extract():
            # extracted links look like this: "/document.html"
            link = "https://project-domain.com" + link
            yield scrapy.Request(link, callback=self.parse_news)

    def parse_news(self, response):
        data_dic = ProjectNameArticle()
        data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
        data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
        data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
        news_text = response.css('div.article__text').extract_first()
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic
Expected result:
Script opens the start_urls.txt file, reads its lines (every line contains a single link), and puts these links into the start_urls list,
For each link opened, the spider extracts deeper links to be followed (that's about 50-200 links for each start_urls link),
The followed links are the main target from which I want to extract specific data: article title, date, time, text.
For now, never mind writing each Scrapy item to a distinct .txt file.
Actual result:
Running my spider triggers a GET for each start_urls link (it goes through around 150,000 of them), but it doesn't create a list of deeper links, nor enter them to extract any data.
Dude, I have been coding in Python Scrapy for a long time and I hate using start_urls.
You can simply use start_requests, which is very easy to read and also very easy to learn for beginners:
class ProjectNameSpider(scrapy.Spider):
    name = 'PNS'
    allowed_domains = ['www.project-domain.com']

    def start_requests(self):
        with open('start_urls.txt', 'r') as file:
            for line in file:
                yield scrapy.Request(line.strip(),
                                     callback=self.my_callback_func)

    def my_callback_func(self, response):
        for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('@href').extract():
            # extracted links look like this: "/document.html"
            link = "https://project-domain.com" + link
            yield scrapy.Request(link, callback=self.parse_news)

    def parse_news(self, response):
        data_dic = ProjectNameArticle()
        data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
        data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
        data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
        news_text = response.css('div.article__text').extract_first()
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic
I have also never used the Item class and don't find it necessary here.
You can simply have data_dic = {} instead of data_dic = ProjectNameArticle().
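For instance, a plain-dict version of parse_news could look like the sketch below (same selectors as above, which I am assuming are correct for the site):

def parse_news(self, response):
    # a plain dict is exported by Scrapy's feed exports just like an Item
    yield {
        'article_date': response.css('div.article__date::text').extract_first(default='').strip(),
        'article_time': response.css('span.article__time::text').extract_first(default='').strip(),
        'article_title': response.css('h3.article__title::text').extract_first(default='').strip(),
    }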

Using multiple parsers while scraping a page

I have searched some of the questions regarding this topic but I couldn't find a solution to my problem.
I'm currently trying to use multiple parsers on a site depending on the product I want to search. After trying some methods, I ended up with this:
With this start request:
def start_requests(self):
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        yield Request(self.search_url.format(keyword))
That gets into my normal parse_item.
What I want to do with this parse_item is branch depending on the item's category (laptop, tablet, etc.):
def parse_item(self, response):
    # I get the item's category for the if/else
    category = re.sub('Back to search results for |"', '', response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
    # Get the product link, for example (https://www.amazon.com/Lenovo-T430s-Performance-Professional-Refurbished/dp/B07L4FR92R/ref=sr_1_7?s=pc&ie=UTF8&qid=1545829464&sr=1-7&keywords=laptop)
    urlProducto = response.request.url
    # This can be done in a nicer way, just trying out if it works atm
    if category == 'Laptop':
        yield response.follow(urlProducto, callback = parse_laptop)
With:
def parse_laptop(self, response):
    # Parse things
Any suggestions? The error I get when running this code is that 'parse_laptop' is not defined. I have already tried putting parse_laptop above parse_item and I still get the same error.
You need to refer to a method and not a function, so just change it like this:
yield response.follow(urlProducto, callback = self.parse_laptop)
yield response.follow(urlProducto, callback = parse_laptop)
This is the request, and here is your function: def parse_laptop(self, response):. You have probably noticed that your parse_laptop function requires the self object.
So please modify your request to:
yield response.follow(urlProducto, callback = self.parse_laptop)
This should do the work.
Thanks.

Pass values into scrapy callback

I'm trying to get started crawling and scraping a website to disk, but I'm having trouble getting the callback function working as I would like.
The code below will visit the start_url and find all the "a" tags on the site. For each one of them it will make a callback, which is to save the text response to disk and use the crawlerItem to store some metadata about the page.
I was hoping someone could help me figure out how to:
Pass a unique id to each callback so it can be used as the filename when saving the file
Pass the url of the originating page so it can be added to the metadata via the Items
Follow the links on the child pages to go another level deeper into the site
Below is my code thus far
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from mycrawler.items import crawlerItem

class CrawlSpider(scrapy.Spider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]
    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True)
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()
        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = response.url
        yield item
In Settings.py
ITEM_PIPELINES = [
    'librarycrawler.files.FilesPipeline',
]
FILES_STORE = 'C:\Documents\Spider\crawler\ExtractedText'
In items.py
import scrapy

class LibrarycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    Files = scrapy.Field()
I'm not 100% sure, but I think you can't rename the files that Scrapy's files pipeline downloads however you want; Scrapy decides that.
What you want to do looks like a job for CrawlSpider instead of Spider.
CrawlSpider by itself follows every link it finds on every page recursively, and you can set rules on which pages you want to scrape. Here are the docs.
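For example, a minimal CrawlSpider sketch (domain and field names are placeholders, not taken from your project):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class LibraryCrawlSpider(CrawlSpider):
    name = 'librarycrawler'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # follow every link found and hand each downloaded page to scrape_page
    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True),
    )

    def scrape_page(self, response):
        # the response itself works as a selector, no BeautifulSoup needed
        yield {
            'title': response.xpath('//head/title/text()').extract_first(),
            'url': response.url,
        }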
If you are stubborn enough to keep Spider, you can use the meta attribute on requests to pass the items and save links in them:
for link in soup.find_all("a"):
    item = crawlerItem()
    item['url'] = response.urljoin(link.get('href'))
    request = scrapy.Request(item['url'], callback=self.scrape_page)
    request.meta['item'] = item
    yield request
To get the item just go look for it in the response:
def scrape_page(self, response):
    item = response.meta['item']
In this specific example, the item['url'] value being passed is redundant, as you can get the current url with response.url.
Also,
It's a bad idea to use BeautifulSoup inside Scrapy, as it just slows you down; Scrapy's own selectors are well developed to the extent that you don't need anything else to extract data!
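To cover the first two points of the question (a unique id per page and the originating URL), here is a sketch building on the same meta approach; the uuid-based file id is my own addition, not something Scrapy generates for you:

import uuid
import scrapy

class LibraryCrawlerSpider(scrapy.Spider):
    name = 'librarycrawler'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            item = {
                'file_id': str(uuid.uuid4()),  # unique id, usable as the filename later
                'parent_url': response.url,    # the page the link was found on
            }
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.scrape_page,
                                 meta={'item': item})

    def scrape_page(self, response):
        item = response.meta['item']
        item['title'] = response.xpath('//head/title/text()').extract_first()
        item['url'] = response.url
        yield item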
