Download page with JavaScript executed - python-3.x

I want to download a page with JavaScript executed, using Python. Qt is one solution, and here is the code:
import threading

# assuming PyQt4 with QtWebKit; adjust imports for PySide accordingly
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView

class Downloader(QApplication):
    __event = threading.Event()

    def __init__(self):
        QApplication.__init__(self, [])
        self.webView = QWebView()
        self.webView.loadFinished.connect(self.loadFinished)

    def load(self, url):
        self.__event.clear()
        self.webView.load(QUrl(url))
        while not self.__event.wait(.05):
            self.processEvents()
        return self.webView.page().mainFrame().documentElement() if self.__ok else None

    def loadFinished(self, ok):
        self.__ok = ok
        self.__event.set()

downloader = Downloader()
page = downloader.load(url)
downloader = Downloader()
page = downloader.load(url)
The problem is that sometimes downloader.load() returns a page without the JavaScript executed. Downloader.loadStarted() and Downloader.loadFinished() are called only once.
What is the proper way to wait for a complete page download?
EDIT
If I add self.webView.page().networkAccessManager().finished.connect(request_ended) to __init__() and define

def request_ended(reply):
    print(reply.error(), reply.url().toString())

then it turns out that sometimes reply.error() == QNetworkReply.UnknownNetworkError. This behaviour occurs when an unreliable proxy is used that fails to download some of the resources (part of which are JS files), hence some of the JS not being executed. When no proxy is used (i.e. the connection is stable), every reply.error() == QNetworkReply.NoError.
So, the updated question is:
Is it possible to retry the failed reply.request() and apply the result to self.webView?

JavaScript requires a runtime to execute in (Python alone won't do); a popular one these days is PhantomJS.
Unfortunately, PhantomJS no longer has Python support, so you could resort to e.g. Ghost.py to do this job for you, which allows you to selectively execute the JS you want.
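For illustration, a rough sketch of the same task with Ghost.py; the API has changed between Ghost.py versions, so treat session.open and session.evaluate as assumptions to check against your installed version:

from ghost import Ghost

ghost = Ghost()
with ghost.start() as session:
    # open() blocks until the page and its resources have loaded
    page, resources = session.open("http://example.com")
    # selectively execute whatever JS you need
    result, extra_resources = session.evaluate("document.title")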

You should use Selenium.
It provides different WebDrivers, for example PhantomJS, or other common browsers like Firefox.
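A minimal sketch of that approach, assuming the selenium package and a PhantomJS binary on your PATH (recent Selenium releases dropped PhantomJS support, in which case a headless Firefox or Chrome driver is the equivalent):

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://example.com")
html = driver.page_source  # page source after JavaScript has run
driver.quit()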

Related

Python Playwright Download only certain files from a page

I'm attempting to download files from a page that's constructed almost entirely in JS. Here's the setup of the situation and what I've managed to accomplish.
The page itself takes upwards of 5 minutes to load. Once loaded, there are 45,135 links (JS buttons). I need a subset of 377 of those. Then, one at a time (or asynchronously), I need to click those buttons to initiate the download, rename the download, and save it to a place that will keep the file even after the code has completed.
Here's the code I have and what it manages to do:
import asyncio
from playwright.async_api import async_playwright
from pathlib import Path

path = Path().home() / 'Downloads'
timeout = 300000  # 5 minute timeout

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://my-fun-page.com", timeout=timeout)
        await page.wait_for_selector('ul.ant-list-items', timeout=timeout)  # completely load the page
        downloads = page.locator("button", has=page.locator("span", has_text="_Insurer_"))  # this is the list of 377 buttons I care about
        # texts = await downloads.all_text_contents()  # just making sure I got what I thought I got
        count = await downloads.count()  # count = 377
        # Starting here is where I can't follow the API
        for i in range(count):
            print(f"Starting download {i}")
            await downloads.nth(i).click(timeout=0)
            page.on("download", lambda download: download.save_as(path / download.suggested_filename))
            print("\tDownload acquired...")
        await browser.close()

asyncio.run(main())
UPDATE: 2022/07/15 15:45 CST - Updated code above to reflect something that's closer to working than previously but still not doing what I'm asking.
The code above is actually iterating over the locator object elements and performing the downloads. However, the page.on("download") step isn't working. The files are not showing up in my Downloads folder after the task is completed. Thoughts on where I'm missing the mark?
Python 3.10.5
Current public version of playwright
First of all, download.save_as returns a coroutine which you need to await. Since there is no such thing as an "async lambda", and coroutines can only be awaited inside async functions, you cannot use lambda here. You need to create a separate async function and await download.save_as.
Secondly, you do not need to repeatedly call page.on. After registering it once, the callable will be called automatically for all "download" events.
Thirdly, you need to call page.on before the download actually happens (or before the event fires, in general). It's often best to shift these calls right after you create the page using .new_page().
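Putting those three points together, a sketch of what the corrected registration could look like (save_download is an illustrative name; path is the pathlib.Path from the question):

async def save_download(download):
    await download.save_as(path / download.suggested_filename)

page = await context.new_page()
page.on("download", save_download)  # register once, before any clicks happen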
A Better Solution
These were the obvious mistakes in your approach, and fixing them should make it work. However, since you know exactly when the downloads are going to take place (right after you click downloads.nth(i)), I would suggest using expect_download instead. This will make sure that the file is completely downloaded before your main program continues (callables registered with events using page.on are not awaited). Your code will become something like this:
for i in range(count):
    print(f"Starting download {i}")
    async with page.expect_download() as download_handler:
        await downloads.nth(i).click(timeout=0)
    download = await download_handler.value
    # path is a pathlib.Path, so join with / rather than string concatenation
    await download.save_as(path / download.suggested_filename)
    print("\tDownload acquired...")

Instead of "autoload_server" I want to use "server_document"

I want to make an embedded Bokeh web app.
Resources said "autoload_server" works, but it does not:

session = pull_session(url=url, app_path="/random_generator")
bokeh_script = autoload_server(None, app_path="/random_generator", session_id=session.id, url=url)

I think autoload_server cannot be used anymore, so instead of this I want to use server_document.
I wrote this code, but it still does not work.
How should I write it?

session = pull_session(url=url, app_path="/random_generator")
bokeh_script = server_document("/random_generator")
server_document is for creating and embedding new sessions from a Bokeh server. It is not useful for interacting with already existing sessions, i.e. it is not useful together with pull_session. For that, you want to use server_session, as described in the documentation. For example, in a Flask app you would have something like:
@app.route('/', methods=['GET'])
def bkapp_page():
    with pull_session(url="http://localhost:5006/sliders") as session:
        # update or customize that session
        session.document.roots[0].children[1].title.text = "Special Sliders!"
        # generate a script to load the customized session
        script = server_session(session_id=session.id,
                                url='http://localhost:5006/sliders')
        # use the script in the rendered page
        return render_template("embed.html", script=script, template="Flask")

Trouble getting the current URL on Selenium

I want to get the current URL when I am running Selenium.
I looked at this Stack Overflow page: How do I get current URL in Selenium Webdriver 2 Python?
and tried the things posted, but it's not working. I am attaching my code below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# launch Firefox
driver = webdriver.Firefox()

url1 = 'https://poshmark.com/search?'
# search in a window
driver.get(url1)

xpath = '//input[@id="user-search-box"]'
searchBox = driver.find_element_by_xpath(xpath)

brand = "freepeople"
style = "top"
searchBox.send_keys(' '.join([brand, "sequin", style]))

# equivalent of hitting the Enter key
searchBox.send_keys(Keys.ENTER)
print(driver.current_url)
My code prints https://poshmark.com/search?, but it should print https://poshmark.com/search?query=freepeople+sequin+top&type=listings&department=Women, because that is where Selenium goes.
The issue is that there is no lag between your searchBox.send_keys(Keys.ENTER) and print(driver.current_url).
There should be some time lag, so that the statement can pick up the URL change. If your code fires before the URL has actually changed, it gives you the old URL.
A workaround would be to add time.sleep(1) to wait for 1 second. A hard-coded sleep is not a good option though. You should do one of the following:
Keep polling the URL and wait for the change to happen
Wait for an object that you know will appear when the new page comes (see the sketch below)
Instead of using Keys.ENTER, simulate the operation using .click() on the search button if it is available
Usually when you use the click method in Selenium, it takes care of the page changes, so you don't see such issues. Here you press a key using Selenium, which doesn't do any kind of waiting for page load. That is why you see the issue in the first place.
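A minimal sketch of the second option, using Selenium's explicit wait (the CSS selector here is hypothetical; pick any element that only exists on the results page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

searchBox.send_keys(Keys.ENTER)
# block (for up to 10 seconds) until an element unique to the results page appears
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.tiles_container"))
)
print(driver.current_url)  # now reflects the results page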
I had the same issue, and I came up with a solution that uses a standard explicit wait (see how explicit waits work in the documentation).
Here is my solution
from contextlib import contextmanager

from selenium.webdriver.support.ui import WebDriverWait

class UrlHasChanged:
    def __init__(self, old_url):
        self.old_url = old_url

    def __call__(self, driver):
        return driver.current_url != self.old_url

@contextmanager
def url_change(driver):
    current_url = driver.current_url
    yield
    WebDriverWait(driver, 10).until(UrlHasChanged(current_url))
Explanation:
First, I created my own wait condition (see here) that takes old_url as a parameter (the URL from before the action was made) and checks whether the old URL is still the same as current_url after some action. It returns False when both URLs are the same and True otherwise.
Then, I created a context manager to wrap the action that I wanted to make. It saves the URL from before the action, and afterwards it uses WebDriverWait with the wait condition created above.
Thanks to that solution, I can now reuse this context manager with any action that changes the URL, and wait for the change like this:

with url_change(driver):
    login_panel.login_user(normal_user['username'], new_password)

assert driver.current_url == dashboard.url
It is safe because WebDriverWait(driver, 10).until(UrlHasChanged(current_url)) waits until the current URL changes, and after 10 seconds it stops waiting and throws an exception.
What do you think about this?
I fixed this problem by taking the button's href and then calling driver.get(hreflink). click() was not working for me!

Is there a way to slow down execution of Watir Webdriver under Cucumber?

Is there any way we can slow down the execution of Watir WebDriver under Cucumber?
I would like to visually track the actions performed by Watir. At the moment, it goes too fast for my eyes.
While Watir itself does not have an API for slowing down the execution, you could use the underlying Selenium-WebDriver's AbstractEventListener to add pauses before/after certain types of actions.
Given you want to see the result of actions, you probably want to pause after changing values and clicking elements. This would be done by creating the following AbstractEventListener and passing it in when creating the browser:
class ActionListener < Selenium::WebDriver::Support::AbstractEventListener
  def after_change_value_of(element, driver)
    sleep(5)
  end

  def after_click(element, driver)
    sleep(5)
  end
end

browser = Watir::Browser.new :firefox, :listener => ActionListener.new
For a full list of events that you can listen for, see the Selenium::WebDriver::Support::AbstractEventListener documentation.
Not universally. You could monkey patch the element_call method to add a sleep after every interaction with a Selenium element. Load this code after requiring watir-webdriver:
module Watir
  class Element
    alias_method :watir_element_call, :element_call

    def element_call &block
      watir_element_call &block
      sleep 1
    end
  end
end
Also note that monkey patching is generally a bad idea, and when I change the implementation (which I plan to), this code will break.

How can I ignore tests when under a particular browser?

My suite of cucumbers gets run on both Firefox and Chrome. Some of them require a browser resize, which is horrible to deal with in Chrome. Since the behaviors that need the resize don't require cross browser testing, I'd like some way to ignore them when the detected browser is Chrome. Is there a way to do this? Perhaps with hooks or in the steps? I'm currently doing the resizing in Before and After hooks.
I don't know which webdriver you are using, but for watir-webdriver you can do the following:
You can determine which browser is running, in the steps that you want to skip, using the code at the URL below:
http://watirwebdriver.com/determining-which-browser/
Once you determine that it is Chrome, you can just skip that particular step.
In your test helper, you can add these methods:
def use_chrome_driver
  Capybara.register_driver :selenium_chrome do |app|
    Capybara::Selenium::Driver.new(app, :browser => :chrome)
  end
  Capybara.current_driver = :selenium_chrome
end

def setup
  Capybara.current_driver = :selenium
end
All your tests will use the default Selenium webdriver; when you need Chrome, just call use_chrome_driver at the beginning of your test, like this:
def test_login_with_chrome
  use_chrome_driver
  ...
end
You may also add to your helper a Firefox driver with the correct browser size you need, and make it the default Selenium browser.
