How to get the URLs that returned an error status - python-3.x

I am writing a spider with Scrapy in Python 3 and have only just started using Scrapy. I was collecting data from a website, and after a few minutes the site sometimes sends me a 302 status and redirects to another URL to verify me. So I want to save the original URL to a file.
For example, https://www.test.com/article?id=123 is what I want to request, but the response is a 302 that redirects to https://www.test.com/vrcode.
I want to save https://www.test.com/article?id=123 to a file. How should I do that?
class CatchData(scrapy.Spider):
    name = 'test'
    allowed_domains = ['test.com']
    start_urls = ['test.com/article?id=1',
                  'test.com/article?id=2',
                  # ...
                  ]

    def parse(self, response):
        item = LocationItem()
        item['article'] = response.xpath('...')
        yield item
I found an answer in How to get the scrapy failure URLs?, but it is from six years ago; I would like to know whether there is a simpler way to do this.

with open(file_name, 'w', encoding="utf-8") as f:
    f.write(str(item))
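One fairly simple approach (a sketch, not tested against your site) is to tell Scrapy not to follow the 302, so the original article URL is still available in the callback; the file name redirected_urls.txt is just an example, and LocationItem is assumed to come from your project's items.py:

import scrapy

class CatchData(scrapy.Spider):
    name = 'test'
    allowed_domains = ['test.com']
    # Hand 302 responses to the callback instead of following the redirect
    handle_httpstatus_list = [302]
    start_urls = ['https://www.test.com/article?id=1',
                  'https://www.test.com/article?id=2']

    def parse(self, response):
        if response.status == 302:
            # The redirect was not followed, so response.url is still the article URL
            with open('redirected_urls.txt', 'a', encoding='utf-8') as f:
                f.write(response.url + '\n')
            return
        item = LocationItem()
        item['article'] = response.xpath('...')
        yield item

Alternatively, if you do let the redirect happen, the redirect middleware records the chain of original URLs in response.request.meta['redirect_urls'], so you can write those out whenever response.url ends up at /vrcode.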

Related

I am trying to join a URL in Scrapy but am unable to do so

I am trying to fetch the name (i.e. the id and name) from one website and want to append that variable to another link. For example, the name variable contains values like /in/en/books/1446502-An-Exciting-Day (there are many records), and I want to append it to 'https://www.storytel.com' to fetch data specific to that book. I also want a condition for a_name: if response.css('span.expandAuthorName::text') is not available, put '-', otherwise fetch the name.
import scrapy

class BrickSetSpider(scrapy.Spider):
    name = 'brickset-spider'
    start_urls = ['https://www.storytel.com/in/en/categories/1-Children?pageNumber=100']

    def parse(self, response):
        # for quote in response.css('div.gridBookTitle'):
        #     item = {
        #         'name': quote.css('a::attr(href)').extract_first()
        #     }
        #     yield item
        urls = response.css('div.gridBookTitle > a::attr(href)').extract()
        for url in urls:
            url = ['https://www.storytel.com'].urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            'a_name': response.css('span.expandAuthorName::text').extract_first()
        }
I am trying to append "https://www.storytel.com".urljoin(url), but I am getting an error. Being new to Scrapy I have tried many things but am unable to resolve the issue. The error, on line 15, is: list object has no attribute 'urljoin'. Any leads on how to overcome this? Thanks in advance.
Check with this solution:
    for url in urls:
        url = 'https://www.storytel.com' + url
        yield scrapy.Request(url=url, callback=self.parse_details)
Let me know if it helps.
url = ['https://www.storytel.com'].urljoin(url)
Here you are calling urljoin on a list of strings, which is why you get list object has no attribute 'urljoin'. If you want to prepend the base string (https://etc...) to a given url (which is a string), plain concatenation does the job:
    full_url = "https://www.storytel.com" + url
Note that str.join does something different (it uses the string as a separator between the elements of an iterable), so "https://www.storytel.com".join(url) would not build the URL you expect; see the docs: https://docs.python.org/3.8/library/stdtypes.html#str.join
EDIT: urljoin does exist, but as urllib.parse.urljoin (and as response.urljoin in Scrapy), not as a method of a list or a plain string.
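Building on that last note, here is a rough sketch using response.urljoin, with the '-' fallback for a_name that the question asks about (selectors are copied from the question and not re-verified against the site):

import scrapy

class BrickSetSpider(scrapy.Spider):
    name = 'brickset-spider'
    start_urls = ['https://www.storytel.com/in/en/categories/1-Children?pageNumber=100']

    def parse(self, response):
        for href in response.css('div.gridBookTitle > a::attr(href)').extract():
            # response.urljoin resolves the relative href against the current page URL
            yield scrapy.Request(url=response.urljoin(href), callback=self.parse_details)

    def parse_details(self, response):
        # Fall back to '-' when the author name span is missing
        a_name = response.css('span.expandAuthorName::text').extract_first()
        yield {'a_name': a_name if a_name else '-'}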

Unable to perform login for scraping with Scrapy

This is my first time asking a question here, so please bear with me if I'm not providing everything that is needed.
I'm trying to build a spider that goes to this website (https://newslink.sg/user/Login.action), logs in (I have a valid set of username and password) and then scrape some pages.
I'm unable to get past the login stage.
I suspect it has to do with the formdata and what I enter inside, as there are "login.x" and "login.y" fields when I check the form data. The login.x and login.y fields seem to change whenever I log in again.
This question and answer seem to provide a hint of how I can fix things, but I don't know how to go about extracting the correct values.
Python scrapy - Login Authenication Issue
Below is my code with some modifications.
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request

class BtscrapeSpider(scrapy.Spider):
    name = "btscrape"
    # allowed_domains = [""]
    start_urls = [
        "https://newslink.sg/user/Login.action"
    ]

    def start_requests(self):
        return [scrapy.FormRequest("https://newslink.sg/user/Login.action",
                                   formdata={'IDToken1': 'myusername',
                                             'IDToken2': 'mypassword',
                                             'login.x': 'what do I do here?',
                                             'login.y': 'what do I do here?'
                                             },
                                   callback=self.after_login)]

    def after_login(self, response):
        return Request(
            url="webpage I want to scrape after login",
            callback=self.parse_bt
        )

    def parse_bt(self, response):  # Define parse() function.
        items = []  # Element for storing scraped information.
        hxs = Selector(response)  # Selector allows us to grab HTML from the response (target website).
        item = BtscrapeItem()
        item['headline'] = hxs.xpath("/html/body/h2").extract()  # headline
        item['section'] = hxs.xpath("/html/body/table/tbody/tr[1]/td[2]").extract()  # section of the newspaper the story appeared in
        item['date'] = hxs.xpath("/html/body/table/tbody/tr[2]/td[2]/text()").extract()  # date of publication
        item['page'] = hxs.xpath("/html/body/table/tbody/tr[3]/td[2]/text()").extract()  # page the story appeared on
        item['word_num'] = hxs.xpath("/html/body/table/tbody/tr[4]/td[2]").extract()  # number of words in the story
        item['text'] = hxs.xpath("/html/body/div[@id='bodytext']/text()").extract()  # text of the story
        items.append(item)
        return items
If I run the code without the login.x and login.y lines, I get blank scrapes.
Thanks for your help!
Two possible reasons:
You don't send a goto: https://newslink.sg/secure/redirect2.jsp?dest=https://newslink.sg/user/Login.action?login= form parameter
You need cookies for the auth part.
So I recommend rewriting it this way:
start_urls = [
    "https://newslink.sg/user/Login.action"
]

def parse(self, response):
    yield scrapy.FormRequest.from_response(
        response,
        formnumber=1,
        formdata={
            'IDToken1': 'myusername',
            'IDToken2': 'mypassword',
            'login.x': '2',
            'login.y': '6',
        },
        callback=self.after_login,
    )
Scrapy will send goto automatically for you. login.x and login.y are just the cursor coordinates of the click on the Login button.
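It can also help to verify in after_login that the login actually succeeded before requesting the pages you want to scrape. A minimal sketch; the "Logout" marker is an assumption about what the logged-in page contains, so adjust it to whatever the real site shows:

    def after_login(self, response):
        # "Logout" is an assumed marker of a logged-in session on this site
        if b"Logout" not in response.body:
            self.logger.error("Login appears to have failed")
            return
        yield scrapy.Request(
            url="webpage I want to scrape after login",
            callback=self.parse_bt,
        )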

Scrape multiple articles from one page, where each article has a separate href

I am new to Scrapy and writing my first spider, for a website similar to https://blogs.webmd.com/diabetes/default.htm.
I want to scrape the headlines and then navigate to each article and scrape its text content.
I have tried using rules and LinkExtractor, but it is not able to navigate to the next page and extract. I get ERROR: Spider error processing https://blogs.webmd.com/diabetes/default.htm> (referer: None).
Below is my code
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class MedicalSpider(scrapy.Spider):
    name = 'medical'
    allowed_domains = ['https://blogs.webmd.com/diabetes/default.htm']
    start_urls = ['https://blogs.webmd.com/diabetes/default.htm']

    Rules = (Rule(LinkExtractor(allow=(), restrict_css=('.posts-list-post-content a ::attr(href)')), callback="parse", follow=True),)

    def parse(self, response):
        headline = response.css('.posts-list-post-content::text').extract()
        body = response.css('.posts-list-post-desc::text').extract()
        print("%s : %s" % (headline, body))

        next_page = response.css('.posts-list-post-content a ::attr(href)').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = next_href
            request = scrapy.Request(url=next_page_url)
            yield request
Please guide a Scrapy newbie on how to get this spider right for multiple articles on each page.
Usually when using Scrapy, each response is parsed by a callback. The main parse method is the callback for the initial responses obtained from each of the start_urls.
The goal of that parse function should then be to identify the article links and issue a request for each of them. Those responses are then parsed by another callback, say parse_article, which extracts all the content from that particular article.
You don't even need that LinkExtractor. Consider:
import scrapy

class MedicalSpider(scrapy.Spider):
    name = 'medical'
    allowed_domains = ['blogs.webmd.com']  # Only the domain, not the URL
    start_urls = ['https://blogs.webmd.com/diabetes/default.htm']

    def parse(self, response):
        article_links = response.css('.posts-list-post-content a ::attr(href)')
        for link in article_links:
            url = link.get()
            if url:
                yield response.follow(url=url, callback=self.parse_article)

    def parse_article(self, response):
        headline = 'some-css-selector-to-get-the-headline-from-the-article-page'
        # The body is trickier, since it's spread across several tags on this particular site
        body = 'loop-over-some-selector-to-get-the-article-text'
        yield {
            'headline': headline,
            'body': body
        }
I haven't pasted the full code because I believe you still want some of the excitement of learning how to do this, but you can find what I came up with in this gist.
Note that the parse_article method returns dictionaries. These are handled by Scrapy's item pipelines just like regular items. You can get a neat JSON output by running your spider with: scrapy runspider headlines/spiders/medical.py -o out.json

Scrapy won't call scraper function

I have written a scraper using Scrapy and I have a weird yet simple problem.
I log in using a username/password to scrape the data, but sometimes the site redirects me to the /login.asp page with a redir= query parameter containing the URL I was about to scrape. So I added a re_login_if_needed() function and call it as the first statement of the parse() callback. The idea is to check whether response.url is such a redirect URL, so I can re-login to the site and continue scraping with parse() as before.
The problem is that the re_login_if_needed() function is somehow never executed. Any debug print statement I put in there is never printed.
How could that be?
In my class I have:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.re_login_if_needed(response)
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

    def re_login_if_needed(self, response):
        # check if response.url contains redirect code, i.e: "/login.asp?redir="
        # and relogin ...
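For what it's worth, the check described in the question could look roughly like the sketch below. The "/login.asp?redir=" marker comes from the question, while the form field names 'username' and 'password' are placeholders. Note that re_login_if_needed() cannot yield a request on behalf of its caller, so parse() has to yield whatever it returns:

    def parse(self, response):
        login_request = self.re_login_if_needed(response)
        if login_request:
            # A Request returned by a helper must be yielded here,
            # otherwise Scrapy never schedules it
            yield login_request
            return
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

    def re_login_if_needed(self, response):
        # "/login.asp?redir=" is the redirect marker described in the question;
        # 'username' and 'password' are placeholder field names
        if "/login.asp?redir=" in response.url:
            return scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'myuser', 'password': 'mypass'},
                callback=self.parse,
            )
        return None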

Pass values into scrapy callback

I'm trying to get started crawling and scraping a website to disk, but I'm having trouble getting the callback function working as I would like.
The code below will visit the start_url and find all the "a" tags on the site. For each one of them it will make a callback, which is to save the text response to disk and use the crawlerItem to store some metadata about the page.
I was hoping someone could help me figure out how to:
pass a unique id to each callback so it can be used as the filename when saving the file;
pass the url of the originating page so it can be added to the metadata via the items;
follow the links on the child pages to go another level deeper into the site.
Below is my code thus far
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from mycrawler.items import crawlerItem

class CrawlSpider(scrapy.Spider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]

    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True)
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()
        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = response.url
        yield item
In settings.py:
ITEM_PIPELINES = [
    'librarycrawler.files.FilesPipeline',
]
FILES_STORE = 'C:\Documents\Spider\crawler\ExtractedText'
In items.py:
import scrapy

class LibrarycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    Files = scrapy.Field()
I'm not 100% sure, but I think you can't rename the files downloaded by the FilesPipeline however you want; Scrapy chooses the names itself.
What you want to do looks like a job for CrawlSpider instead of Spider.
CrawlSpider by itself follows every link it finds on every page recursively, and you can set rules on which pages you want to scrape. Here are the docs.
If you are stubborn enough to keep Spider, you can use the request meta to pass the items and save links in them:
for link in soup.find_all("a"):
    item = crawlerItem()
    item['url'] = response.urljoin(link.get('href'))
    request = scrapy.Request(item['url'], callback=self.scrape_page)
    request.meta['item'] = item
    yield request
To get the item back, just look for it in the response:
    def scrape_page(self, response):
        item = response.meta['item']
In this specific example, passing item['url'] is redundant, since you can get the current url with response.url.
Also, it's a bad idea to use BeautifulSoup in Scrapy, as it just slows you down; Scrapy's selector library is well developed enough that you don't need anything else to extract data!
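For the first two points of the question (a unique id to use as the file name and the URL of the originating page), one option is to attach them to the request via meta. A sketch under a few assumptions: the id is simply a hash of the target URL (an arbitrary choice), the class is renamed so it does not shadow Scrapy's own CrawlSpider, and plain Scrapy selectors replace BeautifulSoup as suggested above:

import hashlib
import scrapy

class LibraryCrawlerSpider(scrapy.Spider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        for href in response.css("a::attr(href)").extract():
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.scrape_page)
            # A hash of the URL gives a stable, unique file name
            request.meta['page_id'] = hashlib.sha1(url.encode('utf-8')).hexdigest()
            # The page the link was found on, for the item metadata
            request.meta['referer'] = response.url
            yield request

    def scrape_page(self, response):
        page_id = response.meta['page_id']
        with open(page_id + '.html', 'wb') as f:
            f.write(response.body)
        yield {
            'title': response.css('title::text').extract_first(),
            'url': response.url,
            'referer': response.meta['referer'],
        }
        # To go one level deeper, repeat the same link loop here
        # (or switch to CrawlSpider, as suggested above)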
