Python 3.6.3 - Send MozillaCookieJar File and read HTML source code - python-3.x

I'm very new to Python (I have been learning it for about one day).
I need to send cookies (exported from my Google Chrome browser to a text file), be redirected to my account page after login, and then read the HTML source of that page so I can process it. After a lot of searching on the internet, I have this piece of code:
import os
import time
import urllib.request
import http.cookiejar
while 1:
    cj = http.cookiejar.MozillaCookieJar('cookies.txt')
    cj.load()
    print(len(cj)) # output: 9
    print(cj) # output: <MozillaCookieJar[<Cookie .../>, <Cookie .../>, ... , <Cookie .../>]>
    for cookie in cj:
        cookie.expires = time.time() + 14 * 24 * 3600
    cookieProcessor = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(cookieProcessor)
    request = urllib.request.Request(url='https://.../')
    response = opener.open(request, timeout=100)
    s = str(response.read(), 'utf-8')
    print(s)
    if 'class' in s:
        os.startfile('test.mp3')
    time.sleep(5)
With this code I believe (I hope I'm not mistaken) that I am sending the cookies correctly. My main question is: how can I wait for, and then read, the HTML source after the server redirects my login to my personal page? I can't issue the request again with the same URL.
Thank you in advance.

Related

I want to download files with Python using wget (FTP), but an error occurred

I want to download the "*_ice.nc" files from an FTP server, so I wrote the following.
Libraries:
import wget
import math
import re
from urllib import request
Address and file list:
url = "ftp://ftp.hycom.org/datasets/GLBy0.08/expt_93.0/data/hindcasts/2021/" #url
html = request.urlopen(url) #open url
html_contents = str(html.read().decode("cp949"))
url_list = re.findall(r"(ftp)(.+)(_ice.nc)", html_contents)
Download loop:
for url in url_list: #loop
    url_full="".join(url) #tuple to string
    file_name=url_full.split("/")[-1]
    print('\nDownloading ' + file_name)
    wget.download(url_full) #down with wget
But an error message like this occurred:
(ValueError: unknown url type: 'ftp%20%20%20%20%20%20ftp%20%20%20%20%20%20382663848%20Jan%2002%20%202021%20hycom_GLBy0.08_930_2021010112_t000_ice.nc')
Could I get some help?
After decoding
ftp%20%20%20%20%20%20ftp%20%20%20%20%20%20382663848%20Jan%2002%20%202021%20hycom_GLBy0.08_930_2021010112_t000_ice.nc
is
ftp ftp 382663848 Jan 02 2021 hycom_GLBy0.08_930_2021010112_t000_ice.nc
which is clearly not a valid FTP address. You need to alter your code so that it produces
ftp://ftp.hycom.org/datasets/GLBy0.08/expt_93.0/data/hindcasts/2021/hycom_GLBy0.08_930_2021010112_t000_ice.nc
I suggest temporarily replacing wget.download(url_full) with print(url_full), adjusting the code until it prints the desired output, and then reverting to wget.download(url_full).
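One way to get there, as a minimal sketch: take only the file name from each listing line and prepend the directory URL from the question. The helper name ice_file_urls is made up for illustration, and the sketch assumes that file names contain no spaces and appear as whitespace-separated tokens in the listing. The second sample line is invented purely to show that other files are skipped.

```python
import re

# Directory URL taken from the question.
BASE_URL = "ftp://ftp.hycom.org/datasets/GLBy0.08/expt_93.0/data/hindcasts/2021/"


def ice_file_urls(listing, base_url=BASE_URL):
    """Return full URLs for every '*_ice.nc' file named in a directory listing.

    Assumption: file names contain no whitespace, so the name is the
    whitespace-free token that ends in '_ice.nc'.
    """
    names = re.findall(r"\S+_ice\.nc\b", listing)
    return [base_url + name for name in names]


# First line: the listing entry decoded from the error message above.
# Second line: a hypothetical entry, only here to show that it is ignored.
sample_listing = (
    "ftp      ftp      382663848 Jan 02  2021 hycom_GLBy0.08_930_2021010112_t000_ice.nc\n"
    "ftp      ftp      1 Jan 02  2021 some_other_file.txt\n"
)

urls = ice_file_urls(sample_listing)
for url_full in urls:
    print(url_full)  # swap for wget.download(url_full) once the output looks right
```

In the original script, html_contents would take the place of sample_listing.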

How do you find a URL from an input button (web scraping)

I'm scraping an ASP.NET website, and there is an input button that links to a page I need. I'm wondering how I can get the URL of that page without using automation like Selenium.
Note: I don't need to scrape the actual page; the URL contains all the info I need.
This is the code I used to get to the website, but I don't know where to start with scraping the button URL:
select_session_url = 'http://alisondb.legislature.state.al.us/Alison/SelectSession.aspx'
session = requests.Session()
session_payload = {"__EVENTTARGET":"ctl00$ContentPlaceHolder1$gvSessions", "__EVENTARGUMENT": "$3"}
session.post(select_session_url, session_payload, headers)
senate_payload = {"__EVENTTARGET":"ctl00$ContentPlaceHolder1$btnSenate", "__EVENTARGUMENT": "Senate"}
session.post('http://alisondb.legislature.state.al.us/Alison/SessPrefiledBills.aspx', senate_payload, headers)
page = session.get('http://alisondb.legislature.state.al.us/Alison/SESSBillsList.aspx?SELECTEDDAY=1:2019-03-05&BODY=1753&READINGTYPE=R1&READINGCODE=B&PREFILED=Y')
member_soup = BeautifulSoup(page.text, 'lxml')
member = member_soup.find_all('input', value='Jones')
The html for the button is below:
<input type="button" value="Jones" onclick="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvBills','SponsorName$47')" style="background-color:Transparent;border-color:Silver;border-style:Outset;font-size:Small;height:30px;width:100px;">
How do I find the input's onclick value?
You were close, but you should replace your line with:
member_soup.find('input', {"value" : "Jones"})['onclick']
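If it helps, the onclick string can then be split into the two arguments of __doPostBack with a regular expression. This is only a sketch: the helper name postback_payload is made up, and the sample string is the button HTML from the question. Note that on ASP.NET pages a real postback usually also needs the page's hidden form fields (such as __VIEWSTATE), which this sketch does not collect.

```python
import re

# Matches __doPostBack('<target>','<argument>') and captures both arguments.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def postback_payload(onclick):
    """Turn an onclick value into the two form fields a postback would send.

    Returns None when the string contains no __doPostBack call.
    """
    match = POSTBACK_RE.search(onclick)
    if match is None:
        return None
    target, argument = match.groups()
    return {"__EVENTTARGET": target, "__EVENTARGUMENT": argument}


# onclick value copied from the button HTML in the question.
onclick = "javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvBills','SponsorName$47')"
payload = postback_payload(onclick)
print(payload)
```

The resulting dictionary has the same shape as session_payload and senate_payload in the question's code.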
Example
import requests
from bs4 import BeautifulSoup
# Assumed placeholder: the question does not show how `headers` is defined.
headers = {'User-Agent': 'Mozilla/5.0'}
select_session_url = 'http://alisondb.legislature.state.al.us/Alison/SelectSession.aspx'
session = requests.Session()
session_payload = {"__EVENTTARGET":"ctl00$ContentPlaceHolder1$gvSessions", "__EVENTARGUMENT": "$3"}
# Pass headers by keyword: the third positional argument of post() is `json`, not `headers`.
session.post(select_session_url, session_payload, headers=headers)
senate_payload = {"__EVENTTARGET":"ctl00$ContentPlaceHolder1$btnSenate", "__EVENTARGUMENT": "Senate"}
session.post('http://alisondb.legislature.state.al.us/Alison/SessPrefiledBills.aspx', senate_payload, headers=headers)
page = session.get('http://alisondb.legislature.state.al.us/Alison/SESSBillsList.aspx?SELECTEDDAY=1:2019-03-05&BODY=1753&READINGTYPE=R1&READINGCODE=B&PREFILED=Y')
member_soup = BeautifulSoup(page.text, 'lxml')
# Look up the button by its value and read its onclick attribute.
member = member_soup.find('input', {"value" : "Jones"})['onclick']
print(member)
Output
"javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvBills','SponsorName$39')"
Edit
You may be interested in how to get started with Selenium:
from selenium import webdriver
from time import sleep
browser = webdriver.Chrome(r'C:\Program Files\ChromeDriver\chromedriver.exe')  # raw string so the backslashes are not treated as escapes
browser.get('http://alisondb.legislature.state.al.us/Alison/SelectSession.aspx')
sleep(0.9)
browser.find_element_by_link_text('Regular Session 2019').click()
sleep(0.9)
browser.find_element_by_link_text('Prefiled Bills').click()
sleep(2)
browser.find_element_by_css_selector('input[value="Senate"]').click()
sleep(2)
browser.find_element_by_css_selector('input[value="Jones"]').click()
sleep(2)
print(browser.current_url)
browser.close()
Output
http://alisondb.legislature.state.al.us/Alison/Member.aspx?SPONSOR=Jones&BODY=1753&SPONSOR_OID=100453

Selenium to submit recaptcha using 2captcha Python

I am trying to submit a reCAPTCHA on a search form using Python 3, Selenium, and 2captcha.
Everything works fine except submitting the reCAPTCHA after sending the Google token to the reCAPTCHA text area.
Please tell me what I am missing.
When I look at my Selenium WebDriver window, the reCAPTCHA text area is filled with the Google token, but I am not able to submit it to continue to the search results.
Thank you.
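from selenium import webdriver
from time import sleep
from random import randint  # needed for sleep(randint(5,10)) below
from datetime import datetime
from twocaptcha import TwoCaptcha
import requests
import pandas as pd  # needed for pd.read_excel below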
## Launching webdriver
driverop = webdriver.ChromeOptions()
driverop.add_argument("--start-maximized")
driver = webdriver.Chrome("chromedriver/chromedriver",options=driverop)
url = "https://app.skipgenie.com/Account/Login"
sleep(randint(5,10))
email = "..."
password = ".."
input_data = pd.read_excel("input_data.xlsx")
user_Data = []
driver.get(url)
driver.find_element_by_id("Email").send_keys(email)
driver.find_element_by_id("Password").send_keys(password)
driver.find_element_by_class_name("btn-lg").click()
driver.find_element_by_id("firstName").send_keys(input_data.iloc[0][0])
driver.find_element_by_id("lastName").send_keys(input_data.iloc[0][1])
driver.find_element_by_id("street").send_keys(input_data.iloc[0][2])
driver.find_element_by_id("city").send_keys(input_data.iloc[0][3])
driver.find_element_by_id("state").send_keys(input_data.iloc[0][4])
driver.find_element_by_id("zip").send_keys(int(input_data.iloc[0][5]))
# 2Captcha service
service_key = 'ec.....' # 2captcha service key
google_site_key = '6LcxZtQZAAAAAA7gY9-aUIEkFTnRdPRob0Dl1k8a'
pageurl = 'https://app.skipgenie.com/Search/Search'
url = "http://2captcha.com/in.php?key=" + service_key + "&method=userrecaptcha&googlekey=" + google_site_key + "&pageurl=" + pageurl
resp = requests.get(url)
if resp.text[0:2] != 'OK':
    quit('Service error. Error code:' + resp.text)
captcha_id = resp.text[3:]
fetch_url = "http://2captcha.com/res.php?key="+ service_key + "&action=get&id=" + captcha_id
for i in range(1, 10):
    sleep(5) # wait 5 sec.
    resp = requests.get(fetch_url)
    if resp.text[0:2] == 'OK':
        break
driver.execute_script('var element=document.getElementById("g-recaptcha-response"); element.style.display="";')
driver.execute_script("""
document.getElementById("g-recaptcha-response").innerHTML = arguments[0]
""", resp.text[3:])
I am answering my own question so that people who run into a similar situation can get help from it.
What I was missing: after you get the Google token, you need to display the reCAPTCHA text area, send the token to it, and then hide the text area again.
To display the reCAPTCHA text area:
driver.execute_script('var element=document.getElementById("g-recaptcha-response"); element.style.display="";')
After that, send the Google token like this:
driver.execute_script("""
document.getElementById("g-recaptcha-response").innerHTML = arguments[0]
""", resp.text[3:])
Then set the text area's display back to none so that the search button near the reCAPTCHA is clickable:
driver.execute_script('var element=document.getElementById("g-recaptcha-response"); element.style.display="none";')
Finally, click the search button to get the search results.
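A side note on the polling loop in the question: if none of the attempts returns a reply starting with OK, the script still injects resp.text[3:] as if it were a token. A minimal sketch of a stricter version follows. The helper names are made up, the 'OK|<token>' reply shape is inferred from the slicing in the question's code, and the fake replies in the demo are stand-ins rather than real 2captcha responses.

```python
import time


def parse_reply(text):
    """Return the token from an 'OK|<token>' reply, or None for anything else."""
    status, _, value = text.partition("|")
    return value if status == "OK" and value else None


def wait_for_token(fetch, attempts=9, delay=5.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, pausing `delay` seconds before each call.

    `fetch` is any zero-argument callable that returns the reply text,
    for example: lambda: requests.get(fetch_url).text
    Returns the token, or None if no attempt produced one.
    """
    for _ in range(attempts):
        sleep(delay)
        token = parse_reply(fetch())
        if token is not None:
            return token
    return None


# Demo with stand-in replies so the sketch runs without the network.
fake_replies = iter(["NOT_READY_PLACEHOLDER", "NOT_READY_PLACEHOLDER", "OK|demo-token"])
token = wait_for_token(lambda: next(fake_replies), delay=0, sleep=lambda seconds: None)
print(token)
```

With this, the script can stop early when the result is None instead of sending an error string to the g-recaptcha-response text area.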

Login to a website then open it in browser

I am trying to write Python 3 code that logs in to a website and then opens it in a web browser so that I can take a screenshot of it.
Looking online, I found that I could do webbrowser.open('example.com')
This opens the website but cannot log in.
Then I found that it is possible to log in to a website using the requests library or urllib.
The problem with both is that they do not seem to provide a way to open a web page.
So how can I log in to a web page and then display it, so that a screenshot of that page can be taken?
Thanks
Have you considered Selenium? It drives a browser natively as a user would, and its Python client is pretty easy to use.
Here is one of my latest works with Selenium. It is a script to scrape multiple pages from a certain website and save their data into a csv file:
import os
import time
import csv
from selenium import webdriver
cols = [
    'ies', 'campus', 'curso', 'grau_turno', 'modalidade',
    'classificacao', 'nome', 'inscricao', 'nota'
]
codigos = [
    96518, 96519, 96520, 96521, 96522, 96523, 96524, 96525, 96527, 96528
]
if not os.path.exists('arquivos_csv'):
    os.makedirs('arquivos_csv')
options = webdriver.ChromeOptions()
prefs = {
    'profile.default_content_setting_values.automatic_downloads': 1,
    'profile.managed_default_content_settings.images': 2
}
options.add_experimental_option('prefs', prefs)
# Here you choose a webdriver ("the browser")
browser = webdriver.Chrome('chromedriver', chrome_options=options)
for codigo in codigos:
    time.sleep(0.1)
    # Here is where I set the URL
    browser.get(f'http://www.sisu.mec.gov.br/selecionados?co_oferta={codigo}')
    with open(f'arquivos_csv/sisu_resultados_usp_final.csv', 'a') as file:
        dw = csv.DictWriter(file, fieldnames=cols, lineterminator='\n')
        dw.writeheader()
        # XPath attribute tests use @class
        ies = browser.find_element_by_xpath('//div[@class ="nome_ies_p"]').text.strip()
        campus = browser.find_element_by_xpath('//div[@class ="nome_campus_p"]').text.strip()
        curso = browser.find_element_by_xpath('//div[@class ="nome_curso_p"]').text.strip()
        grau_turno = browser.find_element_by_xpath('//div[@class = "grau_turno_p"]').text.strip()
        tabelas = browser.find_elements_by_xpath('//table[@class = "resultado_selecionados"]')
        for t in tabelas:
            modalidade = t.find_element_by_xpath('tbody//tr//th[@colspan = "4"]').text.strip()
            aprovados = t.find_elements_by_xpath('tbody//tr')
            for a in aprovados[2:]:
                linha = a.find_elements_by_class_name('no_candidato')
                classificacao = linha[0].text.strip()
                nome = linha[1].text.strip()
                inscricao = linha[2].text.strip()
                nota = linha[3].text.strip().replace(',', '.')
                dw.writerow({
                    'ies': ies, 'campus': campus, 'curso': curso,
                    'grau_turno': grau_turno, 'modalidade': modalidade,
                    'classificacao': classificacao, 'nome': nome,
                    'inscricao': inscricao, 'nota': nota
                })
browser.quit()
In short, you set preferences, choose a webdriver (I recommend Chrome), point it to the URL and that's it. The browser opens automatically and starts executing your instructions.
I have tested using it to log in and it works fine, but I have never tried to take a screenshot. In theory, it should work.
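For the screenshot itself, Selenium's WebDriver has a save_screenshot(filename) method. Here is a rough sketch of how the pieces could fit together. The function open_and_capture and the FakeDriver class are made up for illustration; the fake only exists so that the snippet runs without a browser, and the URL and file name are placeholders.

```python
def open_and_capture(driver, url, out_path, prepare=None):
    """Open `url`, optionally run `prepare(driver)` (e.g. fill in a login form),
    then save a screenshot of the current page to `out_path`.

    `driver` is expected to offer get() and save_screenshot(), as a Selenium
    WebDriver does.
    """
    driver.get(url)
    if prepare is not None:
        prepare(driver)
    return driver.save_screenshot(out_path)


class FakeDriver:
    """Stand-in that records calls, so the sketch runs without a real browser."""

    def __init__(self):
        self.calls = []

    def get(self, url):
        self.calls.append(("get", url))

    def save_screenshot(self, path):
        self.calls.append(("save_screenshot", path))
        return True


fake = FakeDriver()
saved = open_and_capture(fake, "https://example.com/login", "page.png")
print(saved, fake.calls)
```

With a real browser you would pass the Selenium driver instead of FakeDriver, and a prepare function that types the credentials and clicks the login button.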

Scrape website with login/pass (with static url?)

I am doing a personal project to keep learning and practicing Python 3. I have done some other scraping projects using BS4 and Selenium, but in this project I would like to use only BS4.
In this project, I want to scrape some data from this site. The first problem I am facing is that I need to be logged in to get the data. For this test I am using a user and password provided by the website, so you can use the same credentials. You must also select a "race" from the form (I chose Manilla - Calbayog).
Using the inspector, I found the fields I need to pass to the post function:
<input name="boat" type="text" />
<input name="key" type="password" />
<select name="race">
<option value="1159">Manilla - Calbayog</option> 'This is the one I want to check for the test
And this is my code:
from bs4 import BeautifulSoup
import requests
login_data = {'boat':'sol','key':'sol','race':'1159'}
s = requests.session()
post = s.post('http://sol.brainaid.de/sailonline/toolbox/', login_data)
r = requests.get('http://sol.brainaid.de/sailonline/toolbox/')
page = r.content
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
When I check the printed output, I can see that I am still on the login page.
Assuming I could log in correctly, there is a second problem. When you are logged in, a new menu appears as a set of buttons. The data I need to scrape is under "Navigation". The thing is that when you press the button, the new info appears in the browser but the URL does not change; no matter where you click, the URL is always the same. So how do I get there?
And the final problem: assuming I am in the "Navigation" section (without using a URL), I need to refresh that info at least every 30 seconds. How can I do that if there is no URL to request?
Is there any way to do this without using Selenium?
This page loads its data dynamically through Ajax. The URL that returns the boat's XML data is http://sol.brainaid.de/sailonline/toolbox/getBoatData.php, which you can check in the Firefox/Chrome network inspector. All you need is the token, which is stored in the cookies upon login:
from bs4 import BeautifulSoup
import requests
login_data = {'boat':'sol','key':'sol','race':'1159'}
login_url = 'http://sol.brainaid.de/sailonline/toolbox/login.php'
boat_data_url = 'http://sol.brainaid.de/sailonline/toolbox/getBoatData.php'
with requests.session() as s:
    post = s.post(login_url, login_data)
    data = {'boat': 'sol', 'race': '1159', 'token': s.cookies.get_dict()['sailonline[1159][sol]']}
    boat_data = BeautifulSoup(s.post(boat_data_url, data=data).text, 'xml')
    print(boat_data.prettify())
This will print:
<?xml version="1.0" encoding="utf-8"?>
<BOAT>
<LAT>
N 14°35.4000'
</LAT>
<LON>
E 120°57.0000'
</LON>
<DTG>
381.84
</DTG>
<DBL>
107.68
</DBL>
<TWD>
220.48
</TWD>
<TWS>
4.76
</TWS>
<WPT>
0
</WPT>
<RANK>
-
</RANK>
<lCOG>
COG
</lCOG>
<lTWA>
<u>TWA</u>
</lTWA>
<COG>
220.48
</COG>
<TWA>
000.00
</TWA>
<SOG>
0.00
</SOG>
<PERF>
100.00
</PERF>
<VMG>
0.00
</VMG>
<DATE>
2018-07-25
</DATE>
<TIME>
12:47:11
</TIME>
</BOAT>
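If you want the individual values rather than the prettified XML, a small sketch using the standard library's ElementTree could look like this. The helper name boat_fields is made up, and the sample string is a shortened copy of the output above rather than a live response.

```python
import xml.etree.ElementTree as ET


def boat_fields(xml_text):
    """Map each direct child tag of <BOAT> to its stripped text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


# Shortened copy of the output shown above.
sample = (
    "<BOAT>"
    "<LAT>N 14°35.4000'</LAT>"
    "<LON>E 120°57.0000'</LON>"
    "<DTG>381.84</DTG>"
    "<TWD>220.48</TWD>"
    "<TWS>4.76</TWS>"
    "</BOAT>"
)

fields = boat_fields(sample)
print(fields["DTG"], fields["TWS"])
```

For the 30-second refresh asked about in the question, the s.post(boat_data_url, data=data) call can simply be repeated inside a loop with time.sleep(30) between iterations, while the session is still open.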

Resources