Watir & Nokogiri not loading content in frame - watir

I'm using Watir and Nokogiri to parse the content in my bank account pages. The line browser.div(:id => 'main_layout_v2_1_cell_1:0').wait_until_present tells Watir to wait until the div that is loaded by JS appears. (I disabled JavaScript in Chrome to check whether the content is loaded by JavaScript, and indeed it is.)
Nonetheless, when Nokogiri parses browser.html, the output shows all the content except the section loaded by JS.
require 'rubygems'
require 'watir'
require 'watir-webdriver'
require "watir-webdriver/wait"
require 'nokogiri'
browser = Watir::Browser.new
browser.goto 'https://particulares.gruposantander.es/SUPFPA_ENS/BtoChannelDriver.ssobto?dse_operationName=NavLoginSupernet&dse_parentContextName=&dse_processorState=initial&dse_nextEventName=start'
#Login
browser.select_list(:name => 'tipoDocumento').select 'NIF'
browser.text_field(:name => 'numeroDocumento').set 'xxx'
browser.text_field(:name => 'password').set 'xxx'
browser.button(:value => 'Entrar').click
#Select account
browser.link(:title => 'Cuentas').when_present.click
browser.div(:id => 'main_layout_v2_1_cell_1:0').wait_until_present
#Parse what you see, Noko
page = Nokogiri::HTML.parse(browser.html)
puts page
Things I've tried:
If I'm parsing the whole HTML through Nokogiri, it's because I first tried to find the links I want to click by ID, title, and text. None of those worked because, as Nokogiri's output shows, that part of the code is not present.
Extending the timeout and rescuing the error, to give the browser more time and make sure the code is there.
Code here:
require 'timeout'

begin
  Timeout::timeout(40) do
    # Parse what you see, Noko
    page = Nokogiri::HTML.parse(browser.html)
    puts page
  end
  puts 'done'
rescue Timeout::Error => e
  puts 'not done :/'
end
Using wait_until on a div inside the JS-loaded content, browser.wait_until { browser.div(:id => 'main_layout_v2_1_cell_1:0').exist? }, which ends in a Timeout error.
Notes: the content I'm trying to get is wrapped in a body tag with this structure <body scroll="auto" bgcolor="F4F6F7" onload="main.onload();">
The output Nokogiri parses only contains the content that is not loaded by JS. How can I load that content too?

The html method does not include the contents of frames and iframes. As a result, if the desired content is within a frame, you need to explicitly tell Watir to return the frame's HTML.
Assuming there is only one iframe on the page, you would do:
page = Nokogiri::HTML.parse(browser.iframe.html)

Related

How can I retrieve data from a web page with a loading screen?

I am using the requests library to retrieve data from nitrotype.com/racer/insert_name_here about a user's progress using the following code:
import requests
base_url = 'https://www.nitrotype.com/racer/'
name = 'test'
url = base_url + name
page = requests.get(url)
print(page.text)
However, this retrieves the data from the loading screen; I want the data that appears after the loading screen.
Is it possible to do this, and how?
This is most likely because of dynamic loading, and it can be worked around by using Selenium or Pyppeteer.
In my example, I have used Pyppeteer to spawn a browser and run the JavaScript so that I can obtain the required information.
Example:
import asyncio

import pyppeteer

async def main():
    # launches a Chromium browser; you can use Chrome instead of Chromium as well
    browser = await pyppeteer.launch(headless=False)
    # creates a blank page
    page = await browser.newPage()
    # navigates to the requested page and runs the site's dynamic code
    await page.goto('https://www.nitrotype.com/racer/tupac')
    # grabs the HTML content of the fully rendered page
    cont = await page.content()
    # closes the browser so the script can exit cleanly
    await browser.close()
    return cont

# prints the HTML of the user profile: tupac
print(asyncio.get_event_loop().run_until_complete(main()))
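Once the rendered HTML comes back, it can be mined rather than just printed. A small stdlib-only sketch (not part of the original answer) that pulls the `<title>` out of the returned markup:

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # record only the first title's text
        if self._in_title and self.title is None:
            self.title = data.strip()

def page_title(html):
    parser = TitleGrabber()
    parser.feed(html)
    return parser.title
```

For example, `page_title(cont)` on the string returned by `main()` above would give the page's title; the same pattern extends to any other element you need.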

How to get download link to the file, in the firefox downloads via selenium and python3

I can't get the link in the webpage; it's generated automatically using JS. But I can get the Firefox download window after clicking on the href (it's a JS script that returns an href).
How can I get the link in this window using Selenium? If I can't do this, is there any other way to get the link (there is no explicit link in the HTML DOM)?
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # 2 means custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp') # location is tmp
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
browser = webdriver.Firefox(profile)
browser.get("yourwebsite")
element = browser.find_element_by_id('yourLocator')
href = element.get_attribute("href")
Now you have the URL in href.
Use the code below to navigate to it (navigate().to() is the Java API; in Python it is simply get()):
browser.get(href)
You can go for the following approach:
Get WebElement's href attribute using WebElement.get_attribute() function
href = your_element.get_attribute("href")
Use WebDriver.execute_script() function to evaluate the JavaScript and return the real URL
url = driver.execute_script("return " + href + ";")
Now you should be able to use the urllib or requests library to download the file. If your website requires authentication, don't forget to obtain the cookies from the browser instance and add the relevant Cookie header to the request that downloads the file.
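That last step can be sketched like this (hypothetical helper names, not from the answer; assumes `driver` is an already authenticated Selenium instance and that `requests` is installed):

```python
def cookie_header(cookies):
    # Selenium's driver.get_cookies() returns a list of dicts with
    # "name" and "value" keys; join them into one Cookie header value.
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)

def download_file(driver, url, path):
    # Hypothetical helper: reuse the browser session's cookies so an
    # authenticated download also works outside the browser.
    import requests
    headers = {"Cookie": cookie_header(driver.get_cookies())}
    response = requests.get(url, headers=headers)
    with open(path, "wb") as f:
        f.write(response.content)
```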

How to set window.alert when redirecting

I'm crawling some web pages for my research.
I want to inject the JavaScript code below when redirecting to another page:
window.alert = function() {};
I tried to inject it using WebDriverWait, so that Selenium would execute the code as soon as the driver redirects to the new page, but it doesn't work:
while some_condition:
    try:
        WebDriverWait(browser, 5).until(
            lambda driver: original_url != browser.current_url)
        browser.execute_script("window.alert = function() {};")
    except:
        pass  # do something
    original_url = browser.current_url
It seems that the driver executes the JavaScript only after the page has loaded, because the alert created in the redirected page still shows up.
Chrome 14+ blocks alerts inside onunload (https://stackoverflow.com/a/7080331/3368011)
But, I think the following questions may help you:
JavaScript before leaving the page
How to call a function before leaving page with Javascript
I solved my problem another way.
I tried again and again with browser.switch_to_alert, but it didn't work; it turned out the method is deprecated. Instead, I check for the alert and dismiss it every second with the following code:
from selenium.webdriver.common.alert import Alert
import time

while some_condition:
    try:
        Alert(browser).dismiss()
    except:
        print("no alert")
    time.sleep(1)  # poll once per second
This works fine on Windows 10 with Python 3.7.4.
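An alternative to polling (not from this thread; assumes Chrome and Selenium 4's execute_cdp_cmd) is the DevTools command Page.addScriptToEvaluateOnNewDocument, which registers a script that runs before each new page's own code, so the override is in place before any onload alert fires:

```python
def alert_suppressor():
    # JS source injected before every new document starts executing
    return "window.alert = function () {};"

def make_quiet_driver():
    # Assumes selenium 4+ with chromedriver on PATH; execute_cdp_cmd
    # sends a raw Chrome DevTools Protocol command to the browser.
    from selenium import webdriver
    driver = webdriver.Chrome()
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": alert_suppressor()},
    )
    return driver
```

Every page the returned driver subsequently navigates to would then get the no-op alert installed before its own scripts run.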

Selenium firefox webdriver: set download.dir for a pdf

I tried several solutions; nothing really works, or they are simply outdated.
Here is my webdriver profile:
let firefox = require('selenium-webdriver/firefox');
let profile = new firefox.Profile();
profile.setPreference("pdfjs.disabled", true);
profile.setPreference("browser.download.dir", 'C:\\MYDIR');
// setPreference takes a single value, so the MIME types must be one comma-separated string
profile.setPreference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf,application/acrobat,applications/vnd.pdf,text/pdf,text/x-pdf,application/vnd.cups-pdf");
I simply want to download a file and set the destination path. It looks like browser.download.dir is ignored.
That's the way I download the file:
function _getDoc(i){
  driver.sleep(1000)
    .then(function(){
      driver.get(`http://mysite/pdf_showcase/${i}`);
      driver.wait(until.titleIs('here my pdf'), 5000);
    });
}

for(let i = 1; i < 5; i++){
  _getDoc(i);
}
The page contains an iframe with a PDF. I can gather its src attribute, but with the iframe and pdfjs.disabled=true, simply visiting the page with driver.get() triggers the download (so I'm OK with that).
The only problem is the download dir is ignored and the file is saved in the default download firefox dir.
Side question: if I wrap _getDoc() in a for loop over that parameter i, how can I be sure I won't flood the server? If I use the same driver instance (as everyone usually does), are the requests sequential?
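No answer is recorded here, but one commonly overlooked detail, stated as an assumption and sketched in Python's Selenium bindings rather than Node for consistency with the other answers, is that browser.download.dir is only honored when browser.download.folderList is set to 2:

```python
def firefox_download_prefs(download_dir):
    # folderList: 0 = desktop, 1 = default Downloads folder, 2 = custom dir.
    # Without folderList = 2, browser.download.dir is silently ignored.
    return {
        "browser.download.folderList": 2,
        "browser.download.dir": download_dir,
        "browser.helperApps.neverAsk.saveToDisk": "application/pdf",
        "pdfjs.disabled": True,
    }

def make_firefox(download_dir):
    # Hypothetical setup using selenium 4+; set_preference on
    # FirefoxOptions applies each profile preference.
    from selenium import webdriver
    options = webdriver.FirefoxOptions()
    for key, value in firefox_download_prefs(download_dir).items():
        options.set_preference(key, value)
    return webdriver.Firefox(options=options)
```

The same folderList preference should apply in the Node snippet above via profile.setPreference("browser.download.folderList", 2).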

Loading chrome with extension using watir-webdriver gives Timeout error

I have HTML Tidy extension installed in chrome.
When I do:
b = Watir::Browser.new :chrome
The Chrome browser opens up but without extension.
So I used following:
b = Watir::Browser.new(:chrome, :switches => ['--load-extension=C:\Users\.....\AppData\Local\Google\Chrome\UserData\Default\Extensions\gljdonhfjnfdklljmfaabfpjlonflfnm'])
This opens Chrome with the extension, but after a few seconds it gives me this error:
[0402/141412:ERROR:proxy_launcher.cc(551)] Failed to wait for testing
channel presence. test\automation\proxy_launcher.cc(477): error: Value
of: automation_proxy_.get()
Actual: false Expected: true Timeout::Error: Timeout::Error
I did a search and it looks like a chromedriver bug.
I am using Chromedriver version : 26.0.1383.0
Has anyone come across this issue? Can someone please suggest a workaround if one is available?
The Ruby library Nokogiri can check for well-formed markup; then you are not dependent on browser quirks. You can capture the HTML from watir-webdriver, or you can capture it from net/http.
require "net/http"
require "uri"
require "nokogiri"

# define the helper before it is called
def parse_body(response)
  begin
    return Nokogiri::XML(response) { |config| config.strict }
  rescue Nokogiri::XML::SyntaxError => e
    return "caught exception: #{e}"
  end
end

uri = URI.parse("http://www.google.com")
response = Net::HTTP.get_response(uri)
puts parse_body(response.body)
