Logging in to Steam using Scrapy, login and 2FA - python-3.x

I would like to log in to Steam to try my hand at some data collection, but I don't really know how to go about logging in and getting past 2FA. My current code attempts the login and is supposed to save the result to an HTML file so I can see what was achieved. It currently produces a blank HTML file.
import scrapy

def authentication_failed(response):
    pass

class LoginSpider(scrapy.Spider):
    name = 'loginSpider'
    start_urls = ['https://steamcommunity.com/login/home/?goto=']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pwd'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return
        html = open("test.html", "w")
        html.write(response.body.decode("utf-8"))
        html.close()
To spare me asking another question, would getting through the Steam Guard 2FA system be as simple as asking the user to type the code in and then sending another FormRequest?
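Something like this is what I have in mind, purely as a sketch (the field name 'twofactorcode' and the follow-up flow are my guesses):

def after_login(self, response):
    # If Steam Guard asks for a code, prompt for it on the console and
    # submit it as a second FormRequest. Note that input() blocks the crawl.
    code = input("Enter the Steam Guard code: ")
    return scrapy.FormRequest.from_response(
        response,
        formdata={'twofactorcode': code},  # guessed field name
        callback=self.after_2fa
    )

def after_2fa(self, response):
    with open("test.html", "w") as html:
        html.write(response.body.decode("utf-8"))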

Related

How to scrape a website with multiple pages with the same url address using scrapy-playwright

I am trying to scrape a website that serves multiple pages under the same URL using scrapy-playwright.
The following script returned only the data of the second page and did not continue to the rest of the pages.
Can anyone suggest how I can fix it?
import scrapy
from scrapy_playwright.page import PageMethod
from scrapy.crawler import CrawlerProcess

class AwesomeSpideree(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request(
            url="https://www.cia.gov/the-world-factbook/countries/",
            callback=self.parse,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods={
                    "click": PageMethod('click', selector='xpath=//div[@class="pagination-controls col-lg-6"]//span[@class="pagination__arrow-right"]'),
                    "screenshot": PageMethod("screenshot", path="step1.png", full_page=True)
                },
            )
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        print("-" * 80)
        CountryLst = response.xpath("//div[@class='col-lg-9']")
        for Country in CountryLst:
            yield {
                "country_link": Country.xpath(".//a/@href").get()
            }
I see you are trying to fetch the country URLs from the above-mentioned page.
If you inspect the Network tab, you can see a request to a JSON data API; you can fetch all of the country URLs from that endpoint.
After that, if you still want to scrape more data from the scraped URLs, you can do so easily, because that data is static, so there is no need to use Playwright at all.
Have a good day :)
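As a rough sketch of that approach (the endpoint URL and the JSON shape below are placeholders; copy the real API URL from the browser's Network tab):

import json
import scrapy

class CountriesJsonSpider(scrapy.Spider):
    name = "countries_json"

    def start_requests(self):
        # Placeholder endpoint: replace with the JSON API URL seen in the
        # Network tab while the countries page loads.
        yield scrapy.Request(
            url="https://www.cia.gov/path/to/countries.json",
            callback=self.parse_json,
        )

    def parse_json(self, response):
        data = json.loads(response.text)
        # The key structure here is an assumption; adjust to the real payload.
        for country in data.get("countries", []):
            yield {"country_link": country.get("url")}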

Scrapy: How to use init_request and start_requests together?

I need to make an initial call to a service before I start my scraper (the initial call gives me some cookies and headers), so I decided to use InitSpider and override the init_request method to achieve this. However, I also need to use start_requests to build my links and add some meta values like proxies to that specific spider, and here I am facing a problem: whenever I override start_requests, my crawler no longer calls init_request, so I cannot do the initialization. The only way to get init_request working is to not override the start_requests method, which is impossible in my case. Any suggestions or possible solutions for my code:
import random
import scrapy
from scrapy.spiders.init import InitSpider

class SomethingSpider(InitSpider):
    name = 'something'
    allowed_domains = ['something.something']
    aod_url = "https://something?="
    start_urls = ["id1", "id2", "id3"]
    custom_settings = {
        'DOWNLOAD_FAIL_ON_DATALOSS': False,
        'CONCURRENT_ITEMS': 20,
        'DOWNLOAD_TIMEOUT': 10,
        'CONCURRENT_REQUESTS': 3,
        'COOKIES_ENABLED': True,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 20
    }

    def init_request(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            return self.initialized()
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def start_requests(self):
        print("H4")
        proxies = ["xyz:0000", "abc:1111"]
        for url in self.start_urls:
            yield scrapy.Request(url=self.aod_url + url, callback=self.parse, meta={'proxy': random.choice(proxies)})

    def parse(self, response):
        try:
            # some processing happens
            yield {
                # some data
            }
        except Exception as err:
            print("Connecting to...")
The Spiders page (generic spiders section) of the official Scrapy docs makes no mention of the InitSpider you are trying to use.
The InitSpider class from https://github.com/scrapy/scrapy/blob/2.5.0/scrapy/spiders/init.py was written roughly 10 years ago (in those ancient versions of Scrapy, the start_requests method worked completely differently).
From this perspective I recommend that you not use the undocumented and probably outdated InitSpider.
On current versions of Scrapy, the required functionality can be implemented using the regular Spider class:
import scrapy

class SomethingSpider(scrapy.Spider):
    ...

    def start_requests(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            ...
            # Schedule the next requests here:
            for url in self.start_urls:
                yield scrapy.Request(url=self.aod_url + url, callback=self.parse, ...)
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        ...
If you are looking specifically at incorporating logging in, then I would recommend you look at "Using FormRequest.from_response() to simulate a user login" in the Scrapy docs.
Here is the spider example they give:
import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return
        # continue scraping with authenticated session...
Finally, you can have a look at how to add proxies to your Scrapy middleware, as per this example from Zyte (the guys who wrote Scrapy): "How to set up a custom proxy in Scrapy?"
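For illustration, a minimal sketch of such a middleware (the module path and proxy address are placeholders):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomProxyMiddleware': 350,
}

# middlewares.py
class CustomProxyMiddleware:
    def process_request(self, request, spider):
        # Attach the proxy to every outgoing request.
        request.meta['proxy'] = 'http://someproxy:1111'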

Attempting login with Scrapy-Splash

Since I am not able to log in to https://www.duif.nl/login, I have tried many different methods. With Selenium I successfully logged in, but didn't manage to start crawling.
Now I have tried my luck with scrapy-splash, but I can't log in :(
If I render the login page with Splash, I see the following picture:
Well, there should be a login form with username and password fields, but Scrapy can't see it?
I've been sitting in front of that login form for about a week and am losing my will to live..
My last question didn't even get one answer, so now I'm trying again.
Here is the HTML code of the login form:
When I log in manually, I get redirected to "/login?returnUrl=", where I only have these form_data:
My Code
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.spiders import CrawlSpider, Rule
from ..items import ScrapysplashItem
from scrapy.http import FormRequest, Request
import csv

class DuifSplash(CrawlSpider):
    name = "duifsplash"
    allowed_domains = ['duif.nl']
    login_page = 'https://www.duif.nl/login'
    with open('duifonlylinks.csv', 'r') as f:
        reader = csv.DictReader(f)
        start_urls = [items['Link'] for items in reader]

    def start_requests(self):
        yield SplashRequest(
            url=self.login_page,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formdata={
                'username': 'not real',
                'password': 'login data',
            }, callback=self.after_login)

    def after_login(self, response):
        accview = response.xpath('//div[@class="c-accountbox clearfix js-match-height"]/h3')
        if accview:
            print('success')
        else:
            print(':(')
        for url in self.start_urls:
            yield response.follow(url=url, callback=self.parse_page)

    def parse_page(self, response):
        productpage = response.xpath('//div[@class="product-details col-md-12"]')
        if not productpage:
            print('No productlink', response.url)
        for a in productpage:
            items = ScrapysplashItem()
            items['SKU'] = response.xpath('//p[@class="desc"]/text()').get()
            items['Title'] = response.xpath('//h1[@class="product-title"]/text()').get()
            items['Link'] = response.url
            items['Images'] = response.xpath('//div[@class="inner"]/img/@src').getall()
            items['Stock'] = response.xpath('//div[@class="desc"]/ul/li/em/text()').getall()
            items['Desc'] = response.xpath('//div[@class="item"]/p/text()').getall()
            items['Title_small'] = response.xpath('//div[@class="left"]/p/text()').get()
            items['Price'] = response.xpath('//div[@class="price"]/span/text()').get()
            yield items
In my "prework", i crawled every internal link and saved it to a .csv-File, where i analyse which of the links are product links and which are not.
Now i wonder, if i open a link of my csv, it opens an authenticated session or not?
I cant find no cookies, this is also strange to me
UPDATE
I managed to login successfully :-) now i only need to know where the cookies are stored
Lua Script
LUA_SCRIPT = """
function main(splash, args)
splash:init_cookies(splash.args.cookies),
splash:go("https://www.duif.nl/login"),
splash:wait(0.5),
local title = splash.evaljs("document.title"),
return {
title=title,
cookies = splash:get_cookies(),
},
end
"""
I don't think using Splash here is the way to go, as even with a normal Request the form is there: response.xpath('//form[@id="login-form"]')
There are multiple forms available on the page, so you have to specify which form you want to base your FormRequest.from_response on. It's best to specify the clickdata as well (so it goes to 'Login', not to 'forgot password'). In summary it would look something like this:
req = FormRequest.from_response(
    response,
    formid='login-form',
    formdata={
        'username': 'not real',
        'password': 'login data'},
    clickdata={'type': 'submit'}
)
If you don't use Splash, you don't have to worry about passing cookies - this is taken care of by Scrapy. Just make sure you don't put COOKIES_ENABLED=False in your settings.py
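To tie that into a spider, you would yield the request with your after_login callback, along these lines (a sketch reusing the placeholder credentials above):

def parse(self, response):
    # Scrapy carries the session cookies between requests automatically.
    yield FormRequest.from_response(
        response,
        formid='login-form',
        formdata={'username': 'not real', 'password': 'login data'},
        clickdata={'type': 'submit'},
        callback=self.after_login,
    )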

Python authorization

I need to write a script that logs into the personal account page of my Internet provider and retrieves information about the current balance.
At the moment I am stuck on the authorization step. I found and adapted this script for myself:
import requests
url = 'https://bill.tomtel.ru/login.html'
USERNAME, PASSWORD = 'mylogin', 'mypass'
resp = requests.get(url, auth=(USERNAME, PASSWORD))
r = requests.post(url)
print(r.content)
But this does not get me past the authorization...
I can open this link in a browser and get to a page of this form:
https://bill.tomtel.ru/fastcom/!w3_p_main.showform?FORMNAME=QFRAME&CONFIG=CONTRACT&SID=BLABLABLA&NLS=WR
I can complete the authorization in a browser through both links, so why can't I do it from a script?
Please help with this.
Your browser probably has a session token/cookie stored, and that is why you can access it through the browser. There are a couple of issues here:
It looks like you need to log in to the site first -- through a POST method, not a GET. The GET is what loads the page, but once you submit the form it's going to do a POST request.
Actually, using requests to log in to a site is not as easy as it looks. Usually you have to find the URL it's posting to (examine the developer toolbar to see it), and you often have to pass information in addition to your username/password, such as a CSRF token, a cookie, or something else.
I would suggest using a browser automator for this, perhaps something like Selenium WebDriver. It makes logging into a site much simpler than crafting a raw HTTP request, as it emulates a browser. I would suggest this -- it's much simpler and faster!
Another thing to note: auth=(USERNAME, PASSWORD) is not quite the username/password in the form (it's something else), but I don't think understanding that is too relevant to what you're trying to do.
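For illustration, a minimal Selenium sketch (the field names and the submit-button locator are assumptions; inspect the real login form):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://bill.tomtel.ru/login.html')
# Field names are guesses; check the form's HTML for the real ones.
driver.find_element(By.NAME, 'USERNAME').send_keys('mylogin')
driver.find_element(By.NAME, 'PASSWORD').send_keys('mypass')
driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()
print(driver.page_source)  # should now be the logged-in page
driver.quit()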
Here is the url and required form data to log in:
I think you should try this:
import requests

url = 'https://bill.tomtel.ru/signin.php'
USERNAME = input('Enter your username: ')
PASSWORD = input('Enter your password: ')
d = {
    'USERNAME': USERNAME,
    'PASSWORD': PASSWORD,
    'FORMNAME': 'QFRAME'}
session = requests.Session()
resp = session.post(url, data=d).text
if '<TITLE>' not in resp:
    print('Incorrect username or password!')
    quit()
print('Logging in ... ')
red = None  # will hold the redirect URL extracted from the response
for line in resp.split('\n'):
    if 'location' in line:
        red = 'https://bill.tomtel.ru/fastcom/!w3_p_main.showform%s' % line.replace(' if (P>0) self.location.replace("', '').replace('");', '')
if not red:
    print('An error has occurred')
    quit()
print('Redirecting to %s' % red)
page = session.get(red).text
print('')
print(' MAIN PAGE')
print(page)

How to logout in Python Bottle?

I have the following test snippet:
from bottle import route, redirect, template, auth_basic

def check(username, password):
    if username == "b" and password == "password":
        return True
    return False

@route('/logout')
@route('/logout', method="POST")
def logout():
    # template with a logout button
    # this does redirect successfully, but this shouldn't happen
    redirect('/after-login')

@route('/after-login')
@auth_basic(check)
def after_login():
    return "hello"

@route('/login')
@route('/login', method="POST")
def login():
    return template("views/login/login_page")
    # note: the two lines below are unreachable after the return above;
    # post_get is presumably a small helper around request.POST.get
    username = post_get('username')
    password = post_get('password')
I'm attempting to log out of the system, but I haven't been able to find any resources on how to do this. Basically, I tried dir(response) and dir(request) and haven't found any function that appears to end the session (I mostly tried resetting cookies), short of closing the browser.
I had the same problem. The solution I found in the docs and used is response.delete_cookie('<cookiename>').
So every time I enter a page that sets cookies, I first delete every cookie that might have changed.
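A minimal sketch of that approach (the cookie name is an assumption; use whatever name your app actually sets):

from bottle import route, response, redirect

@route('/logout')
def logout():
    # delete_cookie must match the path/domain the cookie was set with.
    response.delete_cookie('session_id', path='/')
    redirect('/login')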
You want to log out of HTTP Basic Auth, which is not really what it was designed for. But there does seem to be a way: return an HTTP 401.
from bottle import abort, route

@route('/logout', method=["GET", "POST"])
def logout():
    abort(401, "You're no longer logged in")
Does this work?
