I'm attempting to download files from a page that's constructed almost entirely in JS. Here's the setup of the situation and what I've managed to accomplish.
The page itself takes upward of 5 minutes to load. Once loaded, there are 45,135 links (JS buttons). I need a subset of 377 of those. Then, one at a time (or asynchronously), I need to click those buttons to initiate the download, rename the download, and save it to a place that will persist after the code has completed.
Here's the code I have and what it manages to do:
import asyncio
from playwright.async_api import async_playwright
from pathlib import Path

path = Path().home() / 'Downloads'
timeout = 300000  # 5 minute timeout

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://my-fun-page.com", timeout=timeout)
        await page.wait_for_selector('ul.ant-list-items', timeout=timeout)  # completely load the page
        downloads = page.locator("button", has=page.locator("span", has_text="_Insurer_"))  # this is the list of 377 buttons I care about
        # texts = await downloads.all_text_contents()  # just making sure I got what I thought I got
        count = await downloads.count()  # count = 377.
        # Starting here is where I can't follow the API
        for i in range(count):
            print(f"Starting download {i}")
            await downloads.nth(i).click(timeout=0)
            page.on("download", lambda download: download.save_as(path / download.suggested_filename))
            print("\tDownload acquired...")
        await browser.close()

asyncio.run(main())
UPDATE: 2022/07/15 15:45 CST - Updated code above to reflect something that's closer to working than previously but still not doing what I'm asking.
The code above is actually iterating over the locator object elements and performing the downloads. However, the page.on("download") step isn't working. The files are not showing up in my Downloads folder after the task is completed. Thoughts on where I'm missing the mark?
Python 3.10.5
Current public version of playwright
First of all, download.save_as returns a coroutine which you need to await. Since there is no such thing as an "async lambda", and coroutines can only be awaited inside async functions, you cannot use a lambda here. You need to create a separate async function and await download.save_as inside it.
Secondly, you do not need to repeatedly call page.on. After registering it once, the callable will be called automatically for all "download" events.
Thirdly, you need to call page.on before the download actually happens (or before the event fires, in general). It's often best to place these calls right after you create the page with .new_page().
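Putting those points together, a minimal sketch of the event-based approach (the handler name save_download is my own):

async def save_download(download):
    # save_as returns a coroutine, so the handler itself must be an async function
    await download.save_as(path / download.suggested_filename)

page = await context.new_page()
page.on("download", save_download)  # registered once, before any download is triggered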
A Better Solution
These were the obvious mistakes in your approach; fixing them should make it work. However, since you know exactly when the downloads are going to take place (after you click downloads.nth(i)), I would suggest using expect_download instead. This will make sure that the file is completely downloaded before your main program continues (callables registered with events using page.on are not awaited). Your code would then look something like this:
for i in range(count):
    print(f"Starting download {i}")
    async with page.expect_download() as download_handler:
        await downloads.nth(i).click(timeout=0)
    download = await download_handler.value
    await download.save_as(path / download.suggested_filename)  # path is a pathlib.Path, so use / rather than string concatenation
    print("\tDownload acquired...")
Related
I am a bit new to JavaScript web dev, and so am still getting my head around the flow of asynchronous functions, which can be a bit unexpected to the uninitiated. In my particular use case, I want to execute a routine on the list of available databases before moving into the main code. Specifically, to ensure that a test environment is always properly initialized, I am dropping a database if it already exists and then building it from configuration files.
The basic flow I have looks like this:
let dbAdmin = client.db("admin").admin();
dbAdmin.listDatabases(function(err, dbs){/*Loop through DBs and drop relevant one if present.*/});
return await buildRelevantDB();
By peppering some console.log() calls throughout, I have determined that the listDatabases() call basically puts the callback into a queue of sorts: I actually enter buildRelevantDB() before entering the callback passed to listDatabases(). In this particular example it seems to work anyway, I think because the call that reads the configuration file is also asynchronous and so puts items into the same queue, only later, but I find this brittle and sloppy. There must be some way to ensure that the listDatabases() portion resolves before moving forward.
The closest solution I found is here, but I still don't know how to get the callback I pass to listDatabases to be like a then as in that solution.
Mixing callbacks and promises is a somewhat advanced technique, so if you are new to JavaScript, try to avoid it. In fact, try to avoid it even once you have learned everything and become a JS ninja.
The documentation for listDatabases says it is async, so you can just await it without messing with callbacks:
const dbs = await dbAdmin.listDatabases();
/*Loop through DBs and drop relevant one if present.*/
Next, there is no need to await before return. If you can await within a function, it is async and returns a promise anyway, so just return the promise from buildRelevantDB:
return buildRelevantDB();
Finally, you can drop the database directly. There is no need to iterate over all databases to pick the one you want to drop:
await client.db(<db name to drop>).dropDatabase();
I have a storage bucket with a lot of large files (500mb each). At times I need to load multiple files, referenced by name. I have been using the blob.download_as_string() function to download the files one-by-one, but it's extremely slow so I would like to try and download them in parallel instead.
I found the gcloud-aio-storage package, however the documentation is a bit sparse, especially for the download function.
I would prefer to download and hold the files in memory rather than downloading them to the local machine and then reading them back into the script.
This is what I've pieced together, though I can't seem to get this to work. I keep getting a timeout error. What am I doing wrong?
Note: Using python 3.7, and latest of all other packages.
test_download.py
from gcloud.aio.storage import Storage
import aiohttp
import asyncio

async def gcs_download(session, bucket_name, file, storage):
    async with session:
        bucket = storage.get_bucket(bucket_name)
        blob = await bucket.get_blob(file)
        return await blob.download()

async def get_gcsfiles_async(bucket_name, gcs_files):
    async with aiohttp.ClientSession() as session:
        storage = Storage(session=session)
        coros = (gcs_download(session, bucket_name, file, storage) for file in gcs_files)
        return await asyncio.gather(*coros)
Then the way I'm calling / passing in values are as follows:
import test_download as test
import asyncio
bucket_name = 'my_bucket_name'
project_name = 'my_project_name' ### Where do I reference this???
gcs_files = ['bucket_folder/some-file-2020-10-06.txt',
             'bucket_folder/some-file-2020-10-07.txt',
             'bucket_folder/some-file-2020-10-08.txt']
result = asyncio.run(test.get_gcsfiles_async(bucket_name, gcs_files))
Any help would be appreciated!
Here is a related question (Google Storage python api download in parallel), although there are two things to note:
- When I run the code from the approved answer, it ends up getting stuck and never downloads.
- It's from before the gcloud-aio-storage package was released and might not be leveraging the "best" current methods.
It looks like the documentation for that library is lacking, but I could get something running, and it is working in my tests. Something I found out by looking at the code is that you don't need to use blob.download(), since it calls storage.download() anyway. I based the script below on the usage section, which deals with uploads but can be adapted for downloads. Note that storage.download() does not write to a file; that is done by storage.download_to_filename(). You can check the available download methods here.
async_download.py
import asyncio
from gcloud.aio.auth import Token
from gcloud.aio.storage import Storage

# Used a token from a service account for authentication
sa_token = Token(service_file="../resources/gcs-test-service-account.json", scopes=["https://www.googleapis.com/auth/devstorage.full_control"])

async def async_download(bucket, obj_names):
    async with Storage(token=sa_token) as client:
        # The built-in download method, with its required args
        tasks = (client.download(bucket, file) for file in obj_names)
        res = await asyncio.gather(*tasks)
    await sa_token.close()
    return res
main.py
import async_download as dl_test
import asyncio
bucket_name = "my-bucket-name"
obj_names = [
    "text1.txt",
    "text2.txt",
    "text3.txt"
]
res = asyncio.run(dl_test.async_download(bucket_name, obj_names))
print(res)
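If you would rather write each object to disk than hold the bytes in memory, the download_to_filename method mentioned above can replace client.download; a one-line sketch, with the argument order assumed from the other download methods and a made-up target path:

await client.download_to_filename(bucket, "text1.txt", "/tmp/text1.txt")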
If you want to use a Service Account Token instead, you can follow this guide and use the relevant auth scopes. Since Service Accounts are project-wide, specifying a project is not needed, and I did not see any project name references for a Session either. While the GCP Python library for GCS does not yet support parallel downloads, there is a feature request open for this. There is no ETA for a release yet.
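One caveat with many 500 MB objects: asyncio.gather starts every transfer at once, which can exhaust memory or trip timeouts. A minimal sketch that bounds concurrency with asyncio.Semaphore (the limit of 5 is arbitrary, and sa_token is the token from async_download.py above):

import asyncio
from gcloud.aio.storage import Storage

async def bounded_download(bucket, obj_names, limit=5):
    sem = asyncio.Semaphore(limit)  # at most `limit` transfers in flight at a time
    async with Storage(token=sa_token) as client:
        async def download_one(name):
            async with sem:
                return await client.download(bucket, name)
        return await asyncio.gather(*(download_one(n) for n in obj_names))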
So, I am trying to download the contents of a directory via SFTP using Node.js, and so far I am getting stuck on an error.
I am using the ssh2-sftp-client npm package, and for the most part it works pretty well, as I am able to connect to the server and list the files in a particular remote directory.
Using the fastGet method to download a file also works without any hassle, and since all the methods are promise-based, I assumed I could easily download all the files in the directory simply enough, by doing something like:
let main = async () => {
    await sftp.connect(config.sftp);
    let data = await sftp.list(config.remote_dir);
    if (data.length) data.map(async x => {
        await sftp.fastGet(`${config.remote_dir}/${x.name}`, config.base_path + x.name);
    });
}
So it turns out the code above successfully downloads the first file, but then crashes with the following error message:
Error: Failed to get sandbox/demo2.txt: The requested operation cannot be performed because there is a file transfer in progress.
This seems to indicate that the promise from fastGet is resolving too early, as the file transfer is supposed to be over before the next element of the file list is processed.
I tried to use the more traditional get() instead, but it uses streams and fails with a different error. After researching, it seems there was a breaking change regarding streams in Node 10.x. In my case, calling get() simply fails (without even downloading the first file).
Does anyone know a workaround to this? or else, another package that can download several files by sftp?
Thanks!
I figured out that, since the issue was concurrent download attempts over one client connection, I could try to manage it with one client per file download. I ended up with the following recursive function.
let getFromFtp = async (arr) => {
    if (arr.length == 0) return processFiles();
    let x = arr.shift();
    conns.push(new Client());
    let idx = conns.length - 1;
    await conns[idx].connect(config.sftp.auth);
    await conns[idx]
        .fastGet(`${config.sftp.remote_dir}/${x.name}`, `${config.dl_dir}${x.name}`);
    await conns[idx].end();
    getFromFtp(arr);
};
Notes about this function:
- The array parameter is a list of files to download, presumably fetched using list() beforehand.
- conns was declared as an empty array and is used to hold our clients.
- array.prototype.shift() is used to gradually deplete the array as we go through the file list.
- The processFiles() method is fired once all the files have been downloaded.
- This is just the POC version; of course, error management still needs to be added.
How do I make the bot upload multiple images, each with a content description? I can add multiple lines and it works, but the bot posts slowly when it has more than 5 await bot.send_file lines. I need to add a few images, so how can I do this, if possible, in a single call?
@bot.command(pass_context=True)
async def ping(ctx):
    await bot.send_file(ctx.message.channel, "Image1.png", content="Image1")
You want to run multiple asynchronous tasks in one await.
You should use asyncio.wait:
import asyncio

@bot.command(pass_context=True)
async def ping(ctx):
    files = ...  # Set the 5 files (or more?) you want to upload here
    await asyncio.wait([bot.send_file(ctx.message.channel, f['filename'], content=f['content']) for f in files])
(See Combine awaitables like Promise.all)
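For illustration, files could be a list of dicts shaped like the lookups above (filenames are made up), and on modern Python (3.11+, where asyncio.wait no longer accepts bare coroutines) asyncio.gather is the safer equivalent:

files = [
    {'filename': 'Image1.png', 'content': 'Image1'},
    {'filename': 'Image2.png', 'content': 'Image2'},
]
await asyncio.gather(*(bot.send_file(ctx.message.channel, f['filename'], content=f['content']) for f in files))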
I want to download a page with JavaScript executed, using Python. Qt is one solution, and here is the code:
import threading
from PyQt4.QtCore import QUrl  # PyQt4-style imports assumed from the QWebView usage
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView

class Downloader(QApplication):
    __event = threading.Event()

    def __init__(self):
        QApplication.__init__(self, [])
        self.webView = QWebView()
        self.webView.loadFinished.connect(self.loadFinished)

    def load(self, url):
        self.__event.clear()
        self.webView.load(QUrl(url))
        while not self.__event.wait(.05):
            self.processEvents()
        return self.webView.page().mainFrame().documentElement() if self.__ok else None

    def loadFinished(self, ok):
        self.__ok = ok
        self.__event.set()

downloader = Downloader()
page = downloader.load(url)
The problem is that sometimes downloader.load() returns a page without the JavaScript executed. Downloader.loadStarted() and Downloader.loadFinished() are each called only once.
What is the proper way to wait for a complete page download?
EDIT
If I add self.webView.page().networkAccessManager().finished.connect(request_ended) to __init__() and define
def request_ended(reply):
    print(reply.error(), reply.url().toString())
then it turns out that sometimes reply.error() == QNetworkReply.UnknownNetworkError. This behaviour appears when an unreliable proxy is used that fails to download some of the resources (part of which are JS files), hence some of the JS not being executed. When no proxy is used (i.e. the connection is stable), every reply.error() == QNetworkReply.NoError.
So, the updated question is:
Is it possible to retry getting reply.request() and apply it to the self.webView?
JavaScript requires a runtime to be executed with (Python alone won't do); a popular one these days is PhantomJS.
Unfortunately, PhantomJS no longer has Python support, so you could resort to e.g. Ghost.py to do this job for you, which allows you to selectively execute the JS you want.
You should use Selenium.
It provides different WebDrivers, for example PhantomJS, or other common browsers like Firefox.
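A minimal sketch of that approach (the PhantomJS driver shown here has since been deprecated in favour of headless Firefox/Chrome; the URL is a placeholder):

from selenium import webdriver

driver = webdriver.PhantomJS()  # or webdriver.Firefox()
driver.get("https://example.com/js-heavy-page")  # placeholder URL
html = driver.page_source  # the DOM after JavaScript has run
driver.quit()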