Scrapy: How to use init_request and start_requests together? - python-3.x

I need to make an initial call to a service before I start my scraper (the initial call gives me some cookies and headers), so I decided to use InitSpider and override the init_request method to achieve this. However, I also need to use start_requests to build my links and add some meta values (like proxies and so on) to that specific spider, and I'm facing a problem: whenever I override start_requests, my crawler no longer calls init_request, so I can't do the initialization. The only way I can get init_request working is to not override start_requests, which is impossible in my case. Any suggestions or possible solutions for my code:
import random

import scrapy
from scrapy.spiders.init import InitSpider

class SomethingSpider(InitSpider):
    name = 'something'
    allowed_domains = ['something.something']
    aod_url = "https://something?="
    start_urls = ["id1", "id2", "id3"]
    custom_settings = {
        'DOWNLOAD_FAIL_ON_DATALOSS': False,
        'CONCURRENT_ITEMS': 20,
        'DOWNLOAD_TIMEOUT': 10,
        'CONCURRENT_REQUESTS': 3,
        'COOKIES_ENABLED': True,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 20
    }

    def init_request(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            return self.initialized()
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def start_requests(self):
        print("H4")
        proxies = ["xyz:0000", "abc:1111"]
        for url in self.start_urls:
            yield scrapy.Request(url=self.aod_url + url, callback=self.parse, meta={'proxy': random.choice(proxies)})

    def parse(self, response):
        try:
            # some processing happens
            yield {
                # some data
            }
        except Exception as err:
            print("Connecting to...")

The Spiders page (generic spiders section) of the official Scrapy docs doesn't mention the InitSpider you are trying to use.
The InitSpider class comes from https://github.com/scrapy/scrapy/blob/2.5.0/scrapy/spiders/init.py and was written roughly ten years ago (in those ancient versions of Scrapy, the start_requests method worked completely differently).
From this perspective I recommend that you not use the undocumented and probably outdated InitSpider.
On current versions of Scrapy the required functionality can be implemented using the regular Spider class:
import scrapy

class SomethingSpider(scrapy.Spider):
    ...

    def start_requests(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            ...
            # Schedule next requests here:
            for url in self.start_urls:
                yield scrapy.Request(url=self.aod_url + url, callback=self.parse, ...)
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        ...

If you are looking specifically at incorporating logging in, then I would recommend you look at "Using FormRequest.from_response() to simulate a user login" in the Scrapy docs.
Here is the spider example they give:
import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
Finally, you can have a look at how to add proxies via your Scrapy middleware as per this example from Zyte (the company behind Scrapy): "How to set up a custom proxy in Scrapy?"
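As a rough illustration of that approach, here is a minimal sketch of a custom downloader middleware (not Zyte's exact example; the PROXY_URL setting and the middleware name are placeholders invented here):

# middlewares.py -- minimal sketch of a custom proxy downloader middleware.
# PROXY_URL is an illustrative setting name, not a built-in Scrapy setting.
class CustomProxyMiddleware:
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # e.g. PROXY_URL = "http://someproxy:1111" in settings.py
        return cls(crawler.settings.get("PROXY_URL"))

    def process_request(self, request, spider):
        # only attach the proxy if the request does not already carry one
        if self.proxy_url and "proxy" not in request.meta:
            request.meta["proxy"] = self.proxy_url

It would then be enabled in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.CustomProxyMiddleware': 350} (module path and priority are illustrative).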

Related

Attempting login with Scrapy-Splash

Since I am not able to log in to https://www.duif.nl/login, I tried many different methods, like Selenium, with which I successfully logged in but didn't manage to start crawling.
Now I've tried my luck with scrapy-splash, but I can't log in :(
If I render the login page with Splash, I see the following picture:
Well, there should be a login form, with username and password, but Scrapy can't see it?
I've been sitting in front of that login form for about a week and I'm losing my will to live..
My last question didn't even get one answer, so now I'm trying again.
Here is the HTML code of the login form:
When I log in manually, I get redirected to "/login?returnUrl=", where I only have these form_data:
My Code
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.spiders import CrawlSpider, Rule
from ..items import ScrapysplashItem
from scrapy.http import FormRequest, Request
import csv

class DuifSplash(CrawlSpider):
    name = "duifsplash"
    allowed_domains = ['duif.nl']
    login_page = 'https://www.duif.nl/login'
    with open('duifonlylinks.csv', 'r') as f:
        reader = csv.DictReader(f)
        start_urls = [items['Link'] for items in reader]

    def start_requests(self):
        yield SplashRequest(
            url=self.login_page,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formdata={
                'username': 'not real',
                'password': 'login data',
            }, callback=self.after_login)

    def after_login(self, response):
        accview = response.xpath('//div[@class="c-accountbox clearfix js-match-height"]/h3')
        if accview:
            print('success')
        else:
            print(':(')

        for url in self.start_urls:
            yield response.follow(url=url, callback=self.parse_page)

    def parse_page(self, response):
        productpage = response.xpath('//div[@class="product-details col-md-12"]')
        if not productpage:
            print('No productlink', response.url)

        for a in productpage:
            items = ScrapysplashItem()
            items['SKU'] = response.xpath('//p[@class="desc"]/text()').get()
            items['Title'] = response.xpath('//h1[@class="product-title"]/text()').get()
            items['Link'] = response.url
            items['Images'] = response.xpath('//div[@class="inner"]/img/@src').getall()
            items['Stock'] = response.xpath('//div[@class="desc"]/ul/li/em/text()').getall()
            items['Desc'] = response.xpath('//div[@class="item"]/p/text()').getall()
            items['Title_small'] = response.xpath('//div[@class="left"]/p/text()').get()
            items['Price'] = response.xpath('//div[@class="price"]/span/text()').get()
            yield items
In my "prework", i crawled every internal link and saved it to a .csv-File, where i analyse which of the links are product links and which are not.
Now i wonder, if i open a link of my csv, it opens an authenticated session or not?
I cant find no cookies, this is also strange to me
UPDATE
I managed to login successfully :-) now i only need to know where the cookies are stored
Lua Script
LUA_SCRIPT = """
function main(splash, args)
splash:init_cookies(splash.args.cookies),
splash:go("https://www.duif.nl/login"),
splash:wait(0.5),
local title = splash.evaljs("document.title"),
return {
title=title,
cookies = splash:get_cookies(),
},
end
"""
I don't think using Splash here is the way to go, as even with a normal Request the form is there: response.xpath('//form[@id="login-form"]')
There are multiple forms available on the page, so you have to specify which form you want to base yourself on to make a FormRequest.from_response. It's best to specify the clickdata as well (so it goes to 'Login', not to 'forgot password'). In summary it would look something like this:
req = FormRequest.from_response(
    response,
    formid='login-form',
    formdata={
        'username': 'not real',
        'password': 'login data'},
    clickdata={'type': 'submit'}
)
If you don't use Splash, you don't have to worry about passing cookies - this is taken care of by Scrapy. Just make sure you don't put COOKIES_ENABLED=False in your settings.py
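For reference, the relevant line in settings.py would simply be the following (cookies are enabled by default, so it only matters if the setting was switched off at some point):

# settings.py
# Cookies are on by default; just make sure this was not set to False earlier.
COOKIES_ENABLED = True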

Logging in to Steam using Scrapy, login and 2FA

I would like to log into Steam to try my hand at some data collection, but I don't really know how to go about logging in and getting past 2FA. My current code tries to log in and is supposed to save the result into an html file so I can see what was achieved. It currently returns a blank html file.
import scrapy

def authentication_failed(response):
    pass

class LoginSpider(scrapy.Spider):
    name = 'loginSpider'
    start_urls = ['https://steamcommunity.com/login/home/?goto=']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pwd'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        html = open("test.html", "w")
        html.write(response.body.decode("utf-8"))
        html.close()
To spare me asking another question, would getting through the Steam Guard 2FA system be as simple as asking the user to type the code in and then sending another FormRequest?
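For what such a flow might look like in general (not verified against Steam; the "Steam Guard" marker text and the authcode field name are placeholders, and Steam's JavaScript-driven login may not expose a plain HTML form to FormRequest.from_response at all), here is a loose sketch of prompting for a code and submitting it with a second FormRequest:

    # Hypothetical continuation of after_login above; "Steam Guard" and
    # "authcode" are placeholder assumptions, not real Steam form details.
    def after_login(self, response):
        if b"Steam Guard" in response.body:
            # a 2FA challenge appears to be present: ask the operator for the code
            code = input("Enter the Steam Guard code: ")
            # submit the code against the challenge form on the current page
            return scrapy.FormRequest.from_response(
                response,
                formdata={"authcode": code},
                callback=self.after_2fa,
            )
        # no challenge detected: save the page as before
        html = open("test.html", "w")
        html.write(response.body.decode("utf-8"))
        html.close()

    def after_2fa(self, response):
        # if the code was accepted, the session cookies now carry the
        # authenticated state and normal crawling can continue here
        ...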

How to dynamically change page number in FormRequest on POST request when crawling multiple page using Scrapy

The website that I want to crawl uses a POST method to get data, instead of navigating to a paginated URL. Getting the first page works great now, using this method:
def start_requests(self):
    formdata = {
        ...
        'PageIndex': '0',
        ...
    }
    return [
        FormRequest('my-url', formdata=formdata, callback=self.parse)
    ]
I checked the next page and tried to yield it with the following code:
current_page = 0
....

def parse(self, response):
    next_page = Selector(response).css('a.viewmore').extract_first()
    if next_page is not None:
        self.current_page = self.current_page + 1
        formdata = {
            ...
            'PageIndex': self.current_page,
            ...
        }
        yield FormRequest('my-url', formdata=formdata, callback=self.parse)
This is where it breaks. I got the error log here, and I can only assume that the way I assign self.current_page causes the broken result.
I am using macOS, Python 3 (version 3.8.1) and Scrapy 1.8.0. Could anyone guide me on this and help me assign a dynamic page number on a POST request like this? Thanks in advance!
Update: I have figured out that the value should be cast to a string, i.e. 'PageIndex': str(self.current_page). This problem is solved!
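Concretely, the corrected callback would look something like this (formdata values must be strings, which is why the bare integer failed; the other form fields elided in the question are omitted here too):

    def parse(self, response):
        next_page = response.css('a.viewmore').extract_first()
        if next_page is not None:
            self.current_page += 1
            formdata = {
                # ... other form fields as in the question ...
                'PageIndex': str(self.current_page),  # cast to str: the fix from the update
            }
            yield FormRequest('my-url', formdata=formdata, callback=self.parse)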

How to crawl dynamically generated data on google's webstore search results

I want to crawl a web page which shows the results of a search in Google's web store, and the link is static for that particular keyword.
I want to find the ranking of an extension periodically.
Here is the URL
The problem is that I can't render the dynamic data generated by JavaScript code in the response from the server.
I tried using Scrapy and Scrapy-Splash to render the desired page, but I was still getting the same response. I used Docker to run an instance of the scrapinghub/splash container on port 8050. I even visited the web page http://localhost:8050 and entered my URL manually, but it couldn't render the data although the message showed success.
Here's the code I wrote for the crawler. It actually does nothing; its only job is to fetch the HTML contents of the desired page.
import scrapy
from scrapy_splash import SplashRequest

class WebstoreSpider(scrapy.Spider):
    name = 'webstore'

    def start_requests(self):
        yield SplashRequest(
            url='https://chrome.google.com/webstore/search/netflix%20vpn?utm_source=chrome-ntp-icon&_category=extensions',
            callback=self.parse,
            args={
                "wait": 3,
            },
        )

    def parse(self, response):
        print(response.text)
and the contents of the settings.py of my Scrapy project:
BOT_NAME = 'webstore_cralwer'
SPIDER_MODULES = ['webstore_cralwer.spiders']
NEWSPIDER_MODULE = 'webstore_cralwer.spiders'
ROBOTSTXT_OBEY = False
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
And as a result I always get nothing.
Any help is appreciated.
Works for me with a small custom lua script:
lua_source = """
function main(splash, args)
assert(splash:go(args.url))
assert(splash:wait(5.0))
return {
html = splash:html(),
}
end
"""
You can then change your start_requests as follows:
def start_requests(self):
    yield SplashRequest(
        url='https://chrome.google.com/webstore/search/netflix%20vpn?utm_source=chrome-ntp-icon&_category=extensions',
        callback=self.parse,
        args={'lua_source': self.lua_source},
    )
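Wired together, a sketch of the whole spider might look as follows; it assumes lua_source is stored as a class attribute (so self.lua_source resolves) and adds the Splash 'execute' endpoint so the script is actually run:

import scrapy
from scrapy_splash import SplashRequest

class WebstoreSpider(scrapy.Spider):
    name = 'webstore'

    # the answer's Lua script, kept as a class attribute so that
    # self.lua_source is available inside start_requests
    lua_source = """
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(5.0))
        return {
            html = splash:html(),
        }
    end
    """

    def start_requests(self):
        yield SplashRequest(
            url='https://chrome.google.com/webstore/search/netflix%20vpn?utm_source=chrome-ntp-icon&_category=extensions',
            callback=self.parse,
            endpoint='execute',  # tells Splash to run lua_source (added here as an assumption)
            args={'lua_source': self.lua_source},
        )

    def parse(self, response):
        print(response.text)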

Authenticated spider pagination. 302 redirect. reqvalidation.asps - page not found

I have a Scrapy spider that can log into ancestry.com successfully. I then use that authenticated session to request a new link and can scrape the first page of that link successfully. The issue happens when I try to go to the second page: I get a 302 redirect debug message and this URL: https://secure.ancestry.com/error/reqvalidation.aspx?aspxerrorpath=http%3a%2f%2fsearch.ancestry.com%2ferror%2fPageNotFound&msg=&ti=0.
I followed the documentation and have followed some recommendations here to get me this far. Do I need a session token for each page? If so, how do I go about doing that?
import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import FormRequest
from scrapy.loader import ItemLoader
from ..items import AncItem

class AncestrySpider(CrawlSpider):
    name = 'ancestry'

    def start_requests(self):
        return [
            FormRequest(
                'https://www.ancestry.com/account/signin?returnUrl=https%3A%2F%2Fwww.ancestry.com',
                formdata={"username": "foo", "password": "bar"},
                callback=self.after_login
            )
        ]

    def after_login(self, response):
        if "authentication failed".encode() in response.body:
            self.logger.error("Login failed")
            return
        else:
            return Request(url='https://www.ancestry.com/search/collections/nypl/?name=_Wang&count=50&name_x=_1',
                           callback=self.parse)

    def parse(self, response):
        all_products = response.xpath("//tr[@class='tblrow record']")
        for product in all_products:
            loader = ItemLoader(item=AncItem(), selector=product, response=response)
            loader.add_css('Name', '.srchHit')
            loader.add_css('Arrival_Date', 'td:nth-child(3)')
            loader.add_css('Birth_Year', 'td:nth-child(4)')
            loader.add_css('Port_of_Departure', 'td:nth-child(5)')
            loader.add_css('Ethnicity_Nationality', 'td:nth-child(6)')
            loader.add_css('Ship_Name', 'td:nth-child(7)')
            yield loader.load_item()

        next_page = response.xpath('//a[@class="ancBtn sml green icon iconArrowRight"]').extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
I tried adding some request header information. I tried adding the cookie information to the request header, but that did not work. I've tried using only the user agents that are listed in the POST packages.
Right now I only get 50 results. I should be getting hundreds after crawling all the pages.
Found the solution. It had nothing to do with the authentication to the website. I needed a different way to approach pagination, and resorted to using the page URL for pagination instead of following the "next page" button link.
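A sketch of that URL-based pagination (the page query parameter and the upper bound are illustrative assumptions, not the real Ancestry parameters; check the URL of page 2 in a browser and mirror whatever the site actually uses):

    search_url = 'https://www.ancestry.com/search/collections/nypl/?name=_Wang&count=50&name_x=_1'

    def after_login(self, response):
        if "authentication failed".encode() in response.body:
            self.logger.error("Login failed")
            return
        # request each results page directly by URL instead of following
        # the "next page" button
        for page in range(1, 11):  # adjust the upper bound to the real page count
            yield Request(url=f"{self.search_url}&page={page}", callback=self.parse)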
