Unable to perform login for scraping with Scrapy - python-3.x

My first time asking a question here, so please bear with me if I'm not providing everything that is needed.
I'm trying to build a spider that goes to this website (https://newslink.sg/user/Login.action), logs in (I have a valid username and password) and then scrapes some pages.
I'm unable to get past the login stage.
I suspect it has to do with the formdata and what I enter inside, as there are "login.x" and "login.y" fields when I check the form data. The login.x and login.y values seem to change whenever I log in again.
This question and answer seem to provide a hint of how I can fix things, but I don't know how to go about extracting the correct values:
Python scrapy - Login Authenication Issue
Below is my code with some modifications.
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request


class BtscrapeSpider(scrapy.Spider):
    name = "btscrape"
    # allowed_domains = [""]
    start_urls = [
        "https://newslink.sg/user/Login.action"
    ]

    def start_requests(self):
        return [scrapy.FormRequest("https://newslink.sg/user/Login.action",
                                   formdata={'IDToken1': 'myusername',
                                             'IDToken2': 'mypassword',
                                             'login.x': 'what do I do here?',
                                             'login.y': 'what do I do here?'
                                             },
                                   callback=self.after_login)]

    def after_login(self, response):
        return Request(
            url="webpage I want to scrape after login",
            callback=self.parse_bt
        )

    def parse_bt(self, response):  # Define parse() function.
        items = []  # Element for storing scraped information.
        hxs = Selector(response)  # Selector allows us to grab HTML from the response (target website).
        item = BtscrapeItem()
        item['headline'] = hxs.xpath("/html/body/h2").extract()  # headline.
        item['section'] = hxs.xpath("/html/body/table/tbody/tr[1]/td[2]").extract()  # section of newspaper that story appeared.
        item['date'] = hxs.xpath("/html/body/table/tbody/tr[2]/td[2]/text()").extract()  # date of publication
        item['page'] = hxs.xpath("/html/body/table/tbody/tr[3]/td[2]/text()").extract()  # page that story appeared.
        item['word_num'] = hxs.xpath("/html/body/table/tbody/tr[4]/td[2]").extract()  # number of words in story.
        item['text'] = hxs.xpath("/html/body/div[@id='bodytext']/text()").extract()  # text of story.
        items.append(item)
        return items
If I run the code without the login.x and login.y lines, I get blank scrapes.
Thanks for your help!

Two possible reasons:
You don't send the goto form parameter: https://newslink.sg/secure/redirect2.jsp?dest=https://newslink.sg/user/Login.action?login=
You need cookies for the auth part
So I recommend rewriting it this way:
start_urls = [
    "https://newslink.sg/user/Login.action"
]

def parse(self, response):
    yield scrapy.FormRequest.from_response(
        response,
        formnumber=1,
        formdata={
            'IDToken1': 'myusername',
            'IDToken2': 'mypassword',
            'login.x': '2',
            'login.y': '6',
        },
        callback=self.after_login,
    )
Scrapy will send goto automatically for you, because from_response() copies the hidden inputs of the form. login.x and login.y are just the cursor coordinates from when you click the Login button.
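One extra safeguard worth adding in the callback (a sketch; it assumes the IDToken1 field only appears on the login form, so seeing it again means the login was rejected):

def after_login(self, response):
    # if the login form is served again, the credentials were rejected
    if b"IDToken1" in response.body:
        self.logger.error("Login failed; still on the login page")
        return
    yield scrapy.Request(
        url="webpage I want to scrape after login",  # placeholder from the question
        callback=self.parse_bt,
    )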

Related

I am trying to join a URL in scrapy but unable to do so

I am trying to fetch the name (id and name) from one website and want to append that variable to another link. For example, in the name variable I get /in/en/books/1446502-An-Exciting-Day (there are many records), and I then want to append the name variable to 'https://www.storytel.com' to fetch data specific to that book. I also want to put a condition on a_name: if response.css('span.expandAuthorName::text') is not available, put '-', else fetch the name.
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = 'brickset-spider'
    start_urls = ['https://www.storytel.com/in/en/categories/1-Children?pageNumber=100']

    def parse(self, response):
        # for quote in response.css('div.gridBookTitle'):
        #     item = {
        #         'name': quote.css('a::attr(href)').extract_first()
        #     }
        #     yield item
        urls = response.css('div.gridBookTitle > a::attr(href)').extract()
        for url in urls:
            url = ['https://www.storytel.com'].urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            'a_name': response.css('span.expandAuthorName::text').extract_first()
        }
I am trying to append with "https://www.storytel.com".urljoin(url), but I am getting an error. Being new to Scrapy, I have tried many things but am unable to resolve the issue. The error is: line 15: list object has no attribute urljoin. Any leads on how to overcome this? Thanks in advance.
Check with this solution:
for url in urls:
    url = 'https://www.storytel.com' + url
    yield scrapy.Request(url=url, callback=self.parse_details)
Let me know if it helps.
url = ['https://www.storytel.com'].urljoin(url)
Here you are calling urljoin on a list of strings, and lists have no such method. If you want to append a given url (which is a string) to the base string (https://etc...), plain concatenation works:
full_url = "https://www.storytel.com" + url
Note that "https://www.storytel.com".join(url) would not do what you want: str.join treats the base string as a separator to insert between the characters of url. You can check the docs about str.join here: https://docs.python.org/3.8/library/stdtypes.html#str.join
EDIT: urljoin does exist, but as urllib.parse.urljoin (and as response.urljoin in Scrapy), not as a method on lists or plain strings.
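Since the hrefs in the question are root-relative (e.g. /in/en/books/...), the most idiomatic fix in Scrapy is response.urljoin; a minimal sketch of the loop, assuming the same selector as in the question:

def parse(self, response):
    for href in response.css('div.gridBookTitle > a::attr(href)').extract():
        # response.urljoin() resolves the relative href against the page's own URL
        yield scrapy.Request(url=response.urljoin(href), callback=self.parse_details)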

Scrape Multiple articles from one page with each article with separate href

I am new to Scrapy and am writing my first spider, for a website similar to https://blogs.webmd.com/diabetes/default.htm
I want to scrape the headlines and then navigate to each article and scrape its text content.
I have tried using rules and LinkExtractor, but it's not able to navigate to the next page and extract. I get: ERROR: Spider error processing <GET https://blogs.webmd.com/diabetes/default.htm> (referer: None)
Below is my code
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor


class MedicalSpider(scrapy.Spider):
    name = 'medical'
    allowed_domains = ['https://blogs.webmd.com/diabetes/default.htm']
    start_urls = ['https://blogs.webmd.com/diabetes/default.htm']

    Rules = (Rule(LinkExtractor(allow=(), restrict_css=('.posts-list-post-content a ::attr(href)')), callback="parse", follow=True),)

    def parse(self, response):
        headline = response.css('.posts-list-post-content::text').extract()
        body = response.css('.posts-list-post-desc::text').extract()
        print("%s : %s" % (headline, body))
        next_page = response.css('.posts-list-post-content a ::attr(href)').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = next_href
            request = scrapy.Request(url=next_page_url)
            yield request
Please guide a newbie in scrapy to get this spider right for multiple articles on each page.
Usually, when using Scrapy, each response is parsed by a callback. The main parse method is the callback for the initial responses obtained for each of the start_urls.
The goal for that parse function should then be to "identify article links" and issue requests for each of them. Those responses would then be parsed by another callback, say parse_article, that would then extract all the contents from that particular article.
You don't even need that LinkExtractor. Consider:
import scrapy


class MedicalSpider(scrapy.Spider):
    name = 'medical'
    allowed_domains = ['blogs.webmd.com']  # Only the domain, not the URL
    start_urls = ['https://blogs.webmd.com/diabetes/default.htm']

    def parse(self, response):
        article_links = response.css('.posts-list-post-content a ::attr(href)')
        for link in article_links:
            url = link.get()
            if url:
                yield response.follow(url=url, callback=self.parse_article)

    def parse_article(self, response):
        headline = 'some-css-selector-to-get-the-headline-from-the-article-page'
        # The body is trickier, since it's spread through several tags on this particular site
        body = 'loop-over-some-selector-to-get-the-article-text'
        yield {
            'headline': headline,
            'body': body
        }
I've not pasted the full code because I believe you still want some excitement learning how to do this, but you can find what I came up with on this gist
Note that the parse_article method yields plain dictionaries; Scrapy treats these as items, so you can get a neat JSON output by running your spider with: scrapy runspider headlines/spiders/medical.py -o out.json
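If you'd rather configure the export once instead of passing -o on every run, newer Scrapy versions (2.1+) also support the FEEDS setting; a minimal sketch for settings.py:

# settings.py -- equivalent to running with `-o out.json`
FEEDS = {
    "out.json": {
        "format": "json",
        "encoding": "utf8",
    },
}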

Scrapy - xpath - extract returns null

My goal is to build a scraper that extracts data from a table on this site.
Initially I followed the Scrapy tutorial, where I succeeded in extracting data from the test site. When I try to replicate it for Bitinfocharts, the first issue is that I need to use XPath, which the tutorial doesn't cover in detail (it uses CSS only). I have been able to scrape the specific data I want through the shell.
My current issue is understanding how I can scrape it all from my code and at the same time write the results to a .csv / .json file.
I'm probably missing something completely obvious. If you could have a look at my code and let me know what I'm doing wrong, I would deeply appreciate it.
Thanks!
First attempt:
import scrapy


class RichlistTestItem(scrapy.Item):
    # overview details
    wallet = scrapy.Field()
    balance = scrapy.Field()
    percentage_of_coins = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domain = ['https://bitinfocharts.com/']
    start_urls = [
        'https://bitinfocharts.com/top-100-richest-vertcoin-addresses.html'
    ]

    def parse(self, response):
        for sel in response.xpath("//*[@id='tblOne']/tbody/tr/"):
            scrapy.Item in RichlistTestItem()
            scrapy.Item['wallet'] = sel.xpath('td[2]/a/text()').extract()[0]
            scrapy.Item['balance'] = sel.xpath('td[3]/a/text').extract()[0]
            scrapy.Item['percentage_of_coins'] = sel.xpath('/td[4]/a/text').extract()[0]
            yield('wallet', 'balance', 'percentage_of_coins')
Second attempt: (probably closer to 50th attempt)
import scrapy


class RichlistTestItem(scrapy.Item):
    # overview details
    wallet = scrapy.Field()
    balance = scrapy.Field()
    percentage_of_coins = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domain = ['https://bitinfocharts.com/']
    start_urls = [
        'https://bitinfocharts.com/top-100-richest-vertcoin-addresses.html'
    ]

    def parse(self, response):
        for sel in response.xpath("//*[@id='tblOne']/tbody/tr/"):
            wallet = sel.xpath('td[2]/a/text()').extract()
            balance = sel.xpath('td[3]/a/text').extract()
            percentage_of_coins = sel.xpath('/td[4]/a/text').extract()
            print(wallet, balance, percentage_of_coins)
I have fixed your second attempt, specifically the code snippet below:
for sel in response.xpath("//*[@id=\"tblOne\"]/tbody/tr"):
    wallet = sel.xpath('td[2]/a/text()').extract()
    balance = sel.xpath('td[3]/text()').extract()
    percentage_of_coins = sel.xpath('td[4]/text()').extract()
The problems I found are:
there was a trailing "/" on the table row selector
for balance, the value was inside the td, not inside a link inside the td
for percentage_of_coins, again the value was inside the td
Also, there is a data-val attribute on each of the td elements. Scraping those might be a little easier than getting the value from inside the td.
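For example, a sketch of the same loop pulling data-val instead of the cell text (it assumes the attribute is present on those cells, as described above):

for sel in response.xpath('//*[@id="tblOne"]/tbody/tr'):
    wallet = sel.xpath('td[2]/a/text()').extract_first()
    # the raw numeric value is stored in the data-val attribute of the cell
    balance = sel.xpath('td[3]/@data-val').extract_first()
    percentage_of_coins = sel.xpath('td[4]/@data-val').extract_first()
    yield {'wallet': wallet, 'balance': balance, 'percentage_of_coins': percentage_of_coins}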

Crawler skipping content of the first page

I've created a crawler which is parsing certain content from a website.
Firstly, it scrapes links to the categories from the left-side bar.
Secondly, it harvests all the links spread through the pagination that lead to the profile pages.
And finally, going to each profile page, it scrapes the name, phone and web address.
So far, it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first page. I suppose there might be some way to get around this. Here is the complete code I am trying:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def category_links(mainurl):
    req = requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)  # links to the category from left-sided bar

def next_pagelink(process_links):
    req = requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)  # the whole links spread through pagination connected to the profile page

def profile_pagelink(procured_links):
    req = requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)  # profile page of each link

def target_pagelink(main_links):
    req = requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

category_links(url)
The problem with the first page is that it doesn't have a 'pagination' class, so this expression: tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets executed.
As a quick fix you can handle this case separately in the category_links function:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
Also, I noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases if you add a condition in the for loop:
for titles in tree.xpath("//div[@class='container']"):  # use class='profile-cover' if you get duplicates
    name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
    if name + phone + web:
        print(name, phone, web)
Finally, requests.Session is mostly used for persisting cookies and other headers across requests, which is not necessary for your script. You can just use requests.get and get the same results.

Pass values into scrapy callback

I'm trying to get started crawling and scraping a website to disk, but I'm having trouble getting the callback function working as I would like.
The code below will visit the start_url and find all the "a" tags on the site. For each one of them it will make a callback, which is to save the text response to disk and use the crawlerItem to store some metadata about the page.
I was hoping someone could help me figure out how to:
pass a unique id to each callback so it can be used as the filename when saving the file
pass the url of the originating page so it can be added to the metadata via the Items
follow the links on the child pages to go another level deeper into the site
Below is my code thus far
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from mycrawler.items import crawlerItem


class CrawlSpider(scrapy.Spider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]
    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True)
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()
        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = response.url
        yield item
In Settings.py
ITEM_PIPELINES = [
    'librarycrawler.files.FilesPipeline',
]
FILES_STORE = 'C:\Documents\Spider\crawler\ExtractedText'
In items.py
import scrapy


class LibrarycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    Files = scrapy.Field()
I'm not 100% sure, but I think you can't rename the files Scrapy downloads however you want; Scrapy handles that naming itself.
What you want to do looks like a job for CrawlSpider instead of Spider.
CrawlSpider by itself follows every link it finds on every page recursively, and you can set rules on which pages you want to scrape. Here are the docs.
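A minimal sketch of that CrawlSpider route (domain, callback name and item fields carried over from the question; the empty LinkExtractor() follows every link, so you will likely want to narrow it with allow= or restrict_css=):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LibraryCrawlSpider(CrawlSpider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    # with CrawlSpider the callback must not be called parse,
    # and rules (lowercase) must be a tuple of Rule objects
    rules = (
        Rule(LinkExtractor(), callback="scrape_page", follow=True),
    )

    def scrape_page(self, response):
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
        }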
If you are set on keeping Spider, you can use the meta attribute on requests to pass the items along and save the links in them:
for link in soup.find_all("a"):
    item = crawlerItem()
    item['url'] = response.urljoin(link.get('href'))
    request = scrapy.Request(item['url'], callback=self.scrape_page)
    request.meta['item'] = item
    yield request
To get the item just go look for it in the response:
def scrape_page(self, response):
    item = response.meta['item']
In this specific example, passing item['url'] along is redundant, as you can get the current URL with response.url.
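Putting those pieces together (a sketch; note that the FilesPipeline configured in settings.py expects file_urls to be a list of URLs, and the field has to be declared on the item):

def scrape_page(self, response):
    item = response.meta['item']
    # FilesPipeline reads file_urls and expects a list, not a single string
    item['file_urls'] = [response.url]
    yield item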
Also, it's a bad idea to use BeautifulSoup inside Scrapy, as it just slows you down; the Scrapy library is well developed to the extent that you don't need anything else to extract data!
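For instance, the title grab from the question can be done with Scrapy's built-in selectors (a sketch):

# BeautifulSoup version from the question:
#     page_soup = BeautifulSoup(response.body, "html.parser")
#     ScrapedPageTitle = page_soup.title.get_text()
# Scrapy selector equivalent:
ScrapedPageTitle = response.css("title::text").get()
# or with XPath:
ScrapedPageTitle = response.xpath("//title/text()").get()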
