Fill out online form - python-3.x

I'm trying to fill out the form located at https://idp.ncedcloud.org/idp/AuthnEngine#/authn with a username and password. I want to know whether it went through successfully or not. I tried it with python2, but I couldn't get it to work.
#!/usr/bin/env python
import urllib
import urllib2

name = "username"
name2 = "password"
data = {
    "description": name,
    "ember501": name2
}
encoded_data = urllib.urlencode(data)
content = urllib2.urlopen("https://idp.ncedcloud.org/idp/AuthnEngine#/authn",
                          encoded_data)
print(content)
error:
It prints the content of the same webpage, and I want it to print the new webpage content.
desired behavior:
A python3 solution that goes to the next webpage, or an explanation of why my code isn't working.

Here is a basic example of posting data to a webpage like this using the requests library. I would suggest using a Session, so that your login info is saved for any following requests:
import requests

url = "http://example.com"
name = "username"
name2 = "password"
data = {
    "description": name,
    "ember501": name2
}

s = requests.Session()
# If it supports basic auth you could just set
# s.auth = (username, password) here before a request
req = s.post(url, data=data)

print(req.text)
You should then be able to make any following requests with the s session object.
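As a hedged sketch of what such a follow-up request could look like (the /account path and the "Logout" check are illustrative assumptions, not anything from the site in the question):
import requests

s = requests.Session()
# Log in once; the session keeps any cookies the server sets.
s.post("http://example.com/login", data={"description": "username", "ember501": "password"})

# Later requests through the same session are sent with those cookies.
profile = s.get("http://example.com/account")  # hypothetical protected page
if profile.ok and "Logout" in profile.text:    # crude logged-in check
    print("login worked")
else:
    print("still not logged in")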

Related

Attempting login with Scrapy-Splash

Since I am not able to log in to https://www.duif.nl/login, I tried many different methods, like Selenium, with which I successfully logged in, but I didn't manage to start crawling.
Now I tried my luck with scrapy-splash, but I can't log in :(
If I render the login page with Splash, I see the following picture:
Well, there should be a login form, with username and password fields, but Scrapy can't see it?
I've been sitting in front of that login form for like a week now and I'm losing my will to live...
My last question didn't even get one answer, so now I'm trying again.
Here is the HTML code of the login form:
When I log in manually, I get redirected to "/login?returnUrl=", where I only have this form_data:
My Code
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.spiders import CrawlSpider, Rule
from ..items import ScrapysplashItem
from scrapy.http import FormRequest, Request
import csv


class DuifSplash(CrawlSpider):
    name = "duifsplash"
    allowed_domains = ['duif.nl']
    login_page = 'https://www.duif.nl/login'

    with open('duifonlylinks.csv', 'r') as f:
        reader = csv.DictReader(f)
        start_urls = [items['Link'] for items in reader]

    def start_requests(self):
        yield SplashRequest(
            url=self.login_page,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formdata={
                'username': 'not real',
                'password': 'login data',
            }, callback=self.after_login)

    def after_login(self, response):
        accview = response.xpath('//div[@class="c-accountbox clearfix js-match-height"]/h3')
        if accview:
            print('success')
        else:
            print(':(')
        for url in self.start_urls:
            yield response.follow(url=url, callback=self.parse_page)

    def parse_page(self, response):
        productpage = response.xpath('//div[@class="product-details col-md-12"]')
        if not productpage:
            print('No productlink', response.url)
        for a in productpage:
            items = ScrapysplashItem()
            items['SKU'] = response.xpath('//p[@class="desc"]/text()').get()
            items['Title'] = response.xpath('//h1[@class="product-title"]/text()').get()
            items['Link'] = response.url
            items['Images'] = response.xpath('//div[@class="inner"]/img/@src').getall()
            items['Stock'] = response.xpath('//div[@class="desc"]/ul/li/em/text()').getall()
            items['Desc'] = response.xpath('//div[@class="item"]/p/text()').getall()
            items['Title_small'] = response.xpath('//div[@class="left"]/p/text()').get()
            items['Price'] = response.xpath('//div[@class="price"]/span/text()').get()
            yield items
In my "prework", i crawled every internal link and saved it to a .csv-File, where i analyse which of the links are product links and which are not.
Now i wonder, if i open a link of my csv, it opens an authenticated session or not?
I cant find no cookies, this is also strange to me
UPDATE
I managed to login successfully :-) now i only need to know where the cookies are stored
Lua Script
LUA_SCRIPT = """
function main(splash, args)
splash:init_cookies(splash.args.cookies),
splash:go("https://www.duif.nl/login"),
splash:wait(0.5),
local title = splash.evaljs("document.title"),
return {
title=title,
cookies = splash:get_cookies(),
},
end
"""
I don't think using Splash here is the way to go, as even with a normal Request the form is there: response.xpath('//form[@id="login-form"]')
There are multiple forms available on the page, so you have to specify which form you want to base yourself on to make a FormRequest.from_response. It's best to specify the clickdata as well (so it goes to 'Login', not to 'forgot password'). In summary it would look something like this:
req = FormRequest.from_response(
    response,
    formid='login-form',
    formdata={
        'username': 'not real',
        'password': 'login data'},
    clickdata={'type': 'submit'}
)
If you don't use Splash, you don't have to worry about passing cookies - this is taken care of by Scrapy. Just make sure you don't put COOKIES_ENABLED=False in your settings.py
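Putting that together with the spider from the question, a hedged sketch of a Splash-free version could look roughly like this (method and field names carried over from the question; treat it as an outline rather than a tested spider):
import csv
import scrapy
from scrapy.http import FormRequest


class DuifLogin(scrapy.Spider):
    # Hypothetical Splash-free variant of the spider in the question.
    name = "duiflogin"
    allowed_domains = ['duif.nl']
    login_page = 'https://www.duif.nl/login'

    with open('duifonlylinks.csv', 'r') as f:
        start_urls = [row['Link'] for row in csv.DictReader(f)]

    def start_requests(self):
        # A plain Request is enough; the login form is present without JavaScript.
        yield scrapy.Request(self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        yield FormRequest.from_response(
            response,
            formid='login-form',
            formdata={'username': 'not real', 'password': 'login data'},
            clickdata={'type': 'submit'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy's cookie middleware keeps the session for these requests.
        for url in self.start_urls:
            yield response.follow(url, callback=self.parse_page)  # parse_page as in the question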

Trying to scrape a website that requires login

So I'm new to this and have been at it for almost a week now, trying to scrape a website I use to collect analytics data (think of it like Google Analytics).
I tried playing around with XPath to figure out what this script is able to pull, but all I get is "[]" as output after running it.
Please help me find what I'm missing.
import requests
from lxml import html

# credentials
payload = {
    'username': '<my username>',
    'password': '<my password>',
    'csrf-token': '<auth token>'
}

# open a session with login
session_requests = requests.session()
login_url = '<my website>'
result = session_requests.get(login_url)

# passing the auth token
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='form_token']/@value")))[0]
result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)

# scrape the analytics dashboard from this event link
url = '<my analytics webpage url>'
result = session_requests.get(
    url,
    headers=dict(referer=url)
)

# print output using xpath to find and load what i need
trees = html.fromstring(result.content)
bucket_names = trees.xpath("//*[@id='statistics_dashboard']/div[1]/text()")
print(bucket_names)
print(result.ok)
print(result.status_code)
This is what I get as a result:
[]
True
200
Process finished with exit code 0
Which is a big step for me, because I've been getting so many errors just to get to this point.
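One thing worth noting, in a hedged way: the script above extracts authenticity_token from the form_token input but never sends it, while the payload carries a hard-coded 'csrf-token' entry. A minimal sketch of wiring the freshly scraped token into the login POST, continuing from the code above (the exact field name the site expects is an assumption here):
# Re-use the token scraped from the login page instead of a hard-coded value.
payload['form_token'] = authenticity_token   # field name assumed from the xpath above

result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)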

keeping a session open with urllib in python 3.6

I'm trying to log in to scrape Tumblr, but when you log into the website normally through a browser it is a two-step process (you enter your email first, it checks whether there is an account associated with that email, and then you can enter your password if the email is correct). Unfortunately, this flags up some problems when trying to automate this login without using the requests module (I'm trying to do it using urllib.request and urllib.parse, which are already available in Python 3.6), as there is no explicit way to start a session, so you can't keep the same session for the email verification and then entering the password.
Do I need to use cookies to do this or will I have to install the requests module? My code so far looks a bit like this:
import urllib.request
import urllib.parse
from html.parser import HTMLParser

input_tags = []


class myHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "input":
            for i in range(len(attrs)):
                if attrs[i][0] == "name" and attrs[i][1] == "form_key":
                    input_tags.append(attrs[i + 1][1])


parser = myHTMLParser()
form_key = ""


def get_form_key():
    global form_key
    global input_tags
    url = "https://www.tumblr.com/login"
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
    resp = resp.read()
    parser.feed(str(resp))
    print(input_tags)
    form_key = input_tags
    print("form key is : ", form_key)
    if len(form_key) > 1:
        form_key = form_key[:1]
        print("\nform key should be one value long now: ", form_key)


get_form_key()

headers = {}
headers["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"

url = "https://www.tumblr.com/login"
login_data = {
    "determine_email": "my.email@email.com",
    "user[email]": "my.email@email.com",
    "user[password]": "secretpassword",
    "tumblrlog[name]": "",
    "user[age]": "",
    "http_referer": "https://www.tumblr.com/logout",
    "form_key": form_key
}
encoded_data = urllib.parse.urlencode(login_data)
encoded_data = encoded_data.encode("utf-8")

request = urllib.request.Request(url, headers=headers, data=encoded_data)
response = urllib.request.urlopen(request)
response_url = response.geturl()
print(response_url)
This prints out the form key twice (not that important, that's just me bug-checking) and then it returns the url:
https://www.tumblr.com/login
which indicates that the login was not successful.
Any idea how to fix this?
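You don't strictly need the requests module to keep a session: the standard library can carry cookies between requests if you build an opener around a CookieJar. A minimal sketch of that idea, re-using the login_data dict from the code above (untested against Tumblr's current login flow):
import urllib.request
import urllib.parse
from http.cookiejar import CookieJar

# One opener shared by all requests acts as one "session" that keeps its cookies.
cookie_jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# First request: fetch the login page (sets session cookies, lets you scrape form_key).
login_page = opener.open("https://www.tumblr.com/login").read()

# A later request through the same opener reuses those cookies automatically.
encoded = urllib.parse.urlencode(login_data).encode("utf-8")  # login_data as built above
response = opener.open("https://www.tumblr.com/login", data=encoded)
print(response.geturl())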

Create Google Shortened URLs, Update My CSV File

I've got a list of ~3,000 URLs I'm trying to create Google shortened links for; the idea is that this CSV has a list of links and I want my code to output the shortened links in the column next to the original URLs.
I've been trying to modify the code found on this site here, but I'm not skilled enough to get it to work.
Here's my code (I would not normally post an API key, but the original person who asked this already posted it publicly on this site):
import json
import httplib2
import pandas as pd

df = pd.read_csv('Links_Test.csv')

def shorternUrl(my_URL):
    API_KEY = "AIzaSyCvhcU63u5OTnUsdYaCFtDkcutNm6lIEpw"
    apiUrl = 'https://www.googleapis.com/urlshortener/v1/url'
    longUrl = my_URL
    headers = {"Content-type": "application/json"}
    data = {"longUrl": longUrl}
    h = httplib2.Http('.cache')
    headers, response = h.request(apiUrl, "POST", json.dumps(data), headers)
    return response

for url in df['URL']:
    x = shorternUrl(url)
    # Then I want it to write x into the column next to the original URL
But I only get errors at this point, before I even started figuring out how to write the new URLs to the CSV file.
Here's some sample data:
URL
www.apple.com
www.google.com
www.microsoft.com
www.linux.org
Thank you for your help,
Me
I think the issue is that you did not include the API key in the request. By the way, the certifi package allows you to secure a connection to a link. You can get it using pip install certifi or pip install urllib3[secure].
Here I use my own API key, so you might want to replace it with yours.
from urllib3 import PoolManager
import json
import certifi

sampleURL = 'http://www.apple.com'
APIkey = 'AIzaSyD8F41CL3nJBpEf0avqdQELKO2n962VXpA'
APIurl = 'https://www.googleapis.com/urlshortener/v1/url?key=' + APIkey
http = PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

def shortenURL(url):
    data = {'key': APIkey, 'longUrl': url}
    response = http.request("POST", APIurl, body=json.dumps(data),
                            headers={'Content-Type': 'application/json'}).data.decode('utf-8')
    r = json.loads(response)
    return (r['id'])
The decoding part converts the response object into a string so that we can parse it as JSON and retrieve the data.
From there on, you can store the data into another column and so on.
For the sampleURL, I got back https://goo.gl/nujb from the function.
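To tie this back to the original question (writing the short links next to the original URLs), a minimal sketch using pandas, assuming the CSV has the 'URL' column shown in the sample data and re-using shortenURL() from above:
import pandas as pd

df = pd.read_csv('Links_Test.csv')

# Apply the shortener row by row and keep the result in a new column.
df['Short_URL'] = df['URL'].apply(shortenURL)

# Write the original and shortened links out side by side.
df.to_csv('Links_Test_shortened.csv', index=False)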
I found a solution here:
https://pypi.python.org/pypi/pyshorteners
Example copied from linked page:
from pyshorteners import Shortener
url = 'http://www.google.com'
api_key = 'YOUR_API_KEY'
shortener = Shortener('Google', api_key=api_key)
print "My short url is {}".format(shortener.short(url))

python requests (form filling)

This is the link to the page for filling forms:
https://anotepad.com/notes/2yrwpi
I have to enter the content in the text area (say, "hello world") and then press Save, but all this is to be done with the python requests module (get, post, etc.)
and without the use of the selenium and beautifulsoup modules.
I tried something like:
url="https://anotepad.com/notes/2yrwpi"
txt = "Hello World"
#construct the POST request
form_data = {'btnSaveNote':'Save', 'notecontent' : txt}
post = requests.post(url,data=form_data)
But that doesn't seem to be working
Please help!
You need to log in and post to the save URL; you also need to pass the note number in the form data:
import requests

save = "https://anotepad.com/note/save"
txt = "Hello World"
login = "https://anotepad.com/create_account"

data = {"action": "login",
        "email": "you@whatever.com",
        "password": "xxxxxx",
        "submit": ""}

# construct the POST request
with requests.session() as s:  # Use a Session object.
    s.post(login, data)  # Login.
    form_data = {"number": "2yrwpi",
                 "notetype": "PlainText",
                 "noteaccess": "2",
                 "notequickedit": "false",
                 "notetitle": "whatever",
                 "notecontent": txt}
    r = s.post(save, data=form_data)  # Save note.
r.json() will give you {"message":"Saved"} on success. Also if you want to see what notes you have, after logging in, run s.post("https://anotepad.com/note/list").text.
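As a quick, hedged way to double-check the save from the same session (just a sketch; it assumes the note page at the URL from the question renders the saved text):
    # Still inside the `with requests.session() as s:` block above.
    print(r.json())  # expect {"message": "Saved"} on success

    # Fetch the note page with the same session and confirm the text is there.
    check = s.get("https://anotepad.com/notes/2yrwpi")
    print(txt in check.text)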
