Download multiple GCS files in parallel (into memory) using Python - python-3.x

I have a storage bucket with a lot of large files (500 MB each). At times I need to load multiple files, referenced by name. I have been using the blob.download_as_string() function to download the files one by one, but it's extremely slow, so I would like to try downloading them in parallel instead.
I found the gcloud-aio-storage package, however the documentation is a bit sparse, especially for the download function.
I would prefer to download and hold the files in memory rather than downloading them to the local machine and then reading them back into the script.
This is what I've pieced together, though I can't seem to get this to work. I keep getting a timeout error. What am I doing wrong?
Note: Using python 3.7, and latest of all other packages.
test_download.py
from gcloud.aio.storage import Storage
import aiohttp
import asyncio


async def gcs_download(session, bucket_name, file, storage):
    async with session:
        bucket = storage.get_bucket(bucket_name)
        blob = await bucket.get_blob(file)
        return await blob.download()


async def get_gcsfiles_async(bucket_name, gcs_files):
    async with aiohttp.ClientSession() as session:
        storage = Storage(session=session)
        coros = (gcs_download(session, bucket_name, file, storage) for file in gcs_files)
        return await asyncio.gather(*coros)
Then the way I'm calling it and passing in values is as follows:
import test_download as test
import asyncio

bucket_name = 'my_bucket_name'
project_name = 'my_project_name'  ### Where do I reference this???
gcs_files = ['bucket_folder/some-file-2020-10-06.txt',
             'bucket_folder/some-file-2020-10-07.txt',
             'bucket_folder/some-file-2020-10-08.txt']

result = asyncio.run(test.get_gcsfiles_async(bucket_name, gcs_files))
Any help would be appreciated!
Here is a related question, although there are two things to note: Google Storage python api download in parallel
When I run the code from the approved answer it ends up getting stuck and never downloads anything.
It's from before the gcloud-aio-storage package was released and might not be leveraging the "best" current methods.

It looks like the documentation for that library is lacking, but I could get something running, and it is working in my tests. Something I found out by looking at the code is that you don't need to use blob.download(), since it calls storage.download() anyway. I based the script below on the usage section, which deals with uploads, but it can be adapted for downloads. Note that storage.download() does not write to a file; that is done by storage.download_to_filename(). You can check the available download methods here.
async_download.py
import asyncio

from gcloud.aio.auth import Token
from gcloud.aio.storage import Storage

# Used a token from a service account for authentication
sa_token = Token(service_file="../resources/gcs-test-service-account.json",
                 scopes=["https://www.googleapis.com/auth/devstorage.full_control"])


async def async_download(bucket, obj_names):
    async with Storage(token=sa_token) as client:
        # Used the built-in download method, with its required args
        tasks = (client.download(bucket, file) for file in obj_names)
        res = await asyncio.gather(*tasks)
        await sa_token.close()
        return res
main.py
import async_download as dl_test
import asyncio

bucket_name = "my-bucket-name"
obj_names = [
    "text1.txt",
    "text2.txt",
    "text3.txt"
]

res = asyncio.run(dl_test.async_download(bucket_name, obj_names))
print(res)
If you want to use a Service Account Token instead, you can follow this guide and use the relevant auth scopes. Since Service Accounts are tied to a project, specifying a project is not needed; I did not see any project name references for a Session either. While the GCP Python library for GCS does not yet support parallel downloads, there is a feature request open for this, with no ETA for a release yet.
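One more consideration, not from the original answer: with files around 500 MB each, firing every download at once through asyncio.gather can strain memory and connections. Below is a minimal sketch of capping concurrency with a semaphore, assuming the same client.download() call as above and default application credentials; the limit value and function name are placeholders.

import asyncio

from gcloud.aio.storage import Storage


async def bounded_download(bucket, obj_names, limit=8):
    # Cap the number of downloads running at the same time
    sem = asyncio.Semaphore(limit)

    async with Storage() as client:
        async def fetch(name):
            async with sem:
                return await client.download(bucket, name)

        # Results come back in the same order as obj_names
        return await asyncio.gather(*(fetch(name) for name in obj_names))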

Related

Python Playwright Download only certain files from a page

I'm attempting to download files from a page that's constructed almost entirely in JS. Here's the setup of the situation and what I've managed to accomplish.
The page itself takes upward of 5 minutes to load. Once loaded, there are 45,135 links (JS buttons), and I need a subset of 377 of those. Then, one at a time (or asynchronously), I need to click those buttons to initiate the downloads, rename each download, and save it somewhere that will keep the file even after the code has completed.
Here's the code I have and what it manages to do:
import asyncio
from playwright.async_api import async_playwright
from pathlib import Path

path = Path().home() / 'Downloads'
timeout = 300000  # 5 minute timeout


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://my-fun-page.com", timeout=timeout)
        await page.wait_for_selector('ul.ant-list-items', timeout=timeout)  # completely load the page
        # this is the list of 377 buttons I care about
        downloads = page.locator("button", has=page.locator("span", has_text="_Insurer_"))
        # texts = await downloads.all_text_contents()  # just making sure I got what I thought I got
        count = await downloads.count()  # count = 377
        # Starting here is where I can't follow the API
        for i in range(count):
            print(f"Starting download {i}")
            await downloads.nth(i).click(timeout=0)
            page.on("download", lambda download: download.save_as(path / download.suggested_filename))
            print("\tDownload acquired...")
        await browser.close()


asyncio.run(main())
UPDATE: 2022/07/15 15:45 CST - Updated code above to reflect something that's closer to working than previously but still not doing what I'm asking.
The code above is actually iterating over the locator object elements and performing the downloads. However, the page.on("download") step isn't working. The files are not showing up in my Downloads folder after the task is completed. Thoughts on where I'm missing the mark?
Python 3.10.5
Current public version of playwright
First of all, download.save_as returns a coroutine which you need to await. Since there is no such thing as an "async lambda", and coroutines can only be awaited inside async functions, you cannot use a lambda here. You need to create a separate async function and await download.save_as inside it.
Secondly, you do not need to repeatedly call page.on. After registering it once, the callable will be called automatically for all "download" events.
Thirdly, you need to call page.on before the download actually happens (or before the event fires, in general). It's often best to place these calls right after you create the page using .new_page().
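Putting those three points together, here is a sketch of what the event-handler route could look like; the handler name is mine, this assumes the async API accepts a coroutine function as the listener, and path, page, downloads and count come from the question's code.

async def save_download(download):
    # Await the coroutine returned by save_as
    await download.save_as(path / download.suggested_filename)


# Register the handler once, right after creating the page and before any clicks
page.on("download", save_download)

for i in range(count):
    await downloads.nth(i).click(timeout=0)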
A Better Solution
These were the obvious mistakes in your approach, and fixing them should make it work. However, since you know exactly when the downloads are going to take place (after you click downloads.nth(i)), I would suggest using expect_download instead. This will make sure that the file is completely downloaded before your main program continues (callables registered with page.on are not awaited). Your code would then look something like this:
for i in range(count):
    print(f"Starting download {i}")
    async with page.expect_download() as download_handler:
        await downloads.nth(i).click(timeout=0)
    download = await download_handler.value
    # path is a pathlib.Path, so join with / rather than string concatenation
    await download.save_as(path / download.suggested_filename)
    print("\tDownload acquired...")

TypeScript Ghostscript import changed to gs_

Okay, so a bit of a weird one. I have a cloud function running in the Google Cloud environment, written on my local machine in TypeScript, which uses a Ghostscript reference I've put in the package.json as follows:
"gs": "https://github.com/sina-masnadi/node-gs/tarball/master",
Imported into my functions file like:
import gs from "gs";
My TypeScript function looks like this:
await new Promise((res, rej) => {
  gs()
    .batch()
    .nopause()
    // ...follows on from this
However, when I deploy my functions with firebase deploy --only functions, the compiler then puts together the .js file and uploads it (as you'd expect). Only in the .js function it's uploading, it changes the reference to the Ghostscript import to the following:
const gs_1 = require("gs");
And then the function code to:
yield new Promise((res, rej) => {
  gs_1.default()
    .batch()
    .nopause()
    // ...follows on from this
When this function is then run in the cloud, I get the following error printed:
TypeError: gs_1.default is not a function
Does anyone know why it's changing the name over? Is it possibly a reserved name?
UPDATE & FIX
Changing the import for the ghostscript library to:
const gs = require("gs")
has fixed the issue, thanks to help from @Nicholas Tower
I am creating a Community Wiki answer to have an answer available for future reference to similar issues. According to the comments by Nicholas Tower, the solution to this issue was to refactor the import of the gs library from:
import gs from "gs";
To:
const gs = require("gs")
For further information, there is a documentation page dealing with Firebase functions dependency handling.

Boto3 client in multiprocessing pool fails with "botocore.exceptions.NoCredentialsError: Unable to locate credentials"

I'm using boto3 to connect to s3, download objects and do some processing. I'm using a multiprocessing pool to do the above.
Following is a synopsis of the code I'm using:
import json
import multiprocessing as mp

import boto3

session = None


def set_global_session():
    global session
    if not session:
        session = boto3.Session(region_name='us-east-1')


def function_to_be_sent_to_mp_pool():
    s3 = session.client('s3', region_name='us-east-1')
    list_of_b_n_o = list_of_buckets_and_objects
    for bucket, key in list_of_b_n_o:
        content = s3.get_object(Bucket=bucket, Key=key)
        data = json.loads(content['Body'].read().decode('utf-8'))
        write_processed_data_to_a_location()


def main():
    pool = mp.Pool(initializer=set_global_session, processes=40)
    pool.starmap(function_to_be_sent_to_mp_pool, list_of_b_n_o_i)
Now, when processes=40, everything works fine. When processes=64, still good.
However, when I increase to processes=128, I get the following error:
botocore.exceptions.NoCredentialsError: Unable to locate credentials
Our machine has the required IAM roles for accessing S3. Moreover, the weird thing that happens is that for some processes, it works fine, whereas for some others, it throws the credentials error. Why is this happening, and how to resolve this?
Another weird thing is that I'm able to trigger two jobs in two separate terminal tabs (each of which has a separate ssh login shell to the machine). Each job spawns 64 processes, and that works fine as well, which means there are 128 processes running simultaneously. But 80 processes in one login shell fail.
Follow up:
In one approach I tried creating separate sessions for separate processes. In the other, I created the s3 client directly using boto3.client. However, both of them throw the same error with 80 processes.
I also created separate clients with the following extra config:
Config(retries=dict(max_attempts=40), max_pool_connections=800)
This allowed me to use 80 processes at once, but anything > 80 fails with the same error.
Post follow up:
Can someone confirm if they've been able to use boto3 in multiprocessing with 128 processes?
This is actually a race condition on fetching the credentials. I'm not sure how fetching credentials works under the hood, but I saw this question on Stack Overflow and this ticket on GitHub.
I was able to resolve this by keeping a random wait time for each of the processes. The following is the updated code which works for me:
client_config = Config(retries=dict(max_attempts=400), max_pool_connections=800)
time.sleep(random.randint(0, num_processes*10)/1000) # random sleep time in milliseconds
s3 = boto3.client('s3', region_name='us-east-1', config=client_config)
I tried keeping the range for the sleep time smaller than num_processes*10, but that failed again with the same issue.
@DenisDmitriev, since you are getting the credentials and storing them explicitly, I think that resolves the race condition and hence the issue.
PS: the values for max_attempts and max_pool_connections aren't based on any particular logic; I plugged in several values until the race condition was figured out.
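For context, here is a sketch of where a snippet like that could sit in the question's multiprocessing setup, with the staggered sleep running once per worker in the pool initializer; the initializer name and the num_processes value are mine, not from the answer.

import random
import time
import multiprocessing as mp

import boto3
from botocore.config import Config

s3 = None
num_processes = 80


def init_worker():
    global s3
    client_config = Config(retries=dict(max_attempts=400), max_pool_connections=800)
    # Stagger client creation so workers don't all hit the credentials endpoint at once
    time.sleep(random.randint(0, num_processes * 10) / 1000)  # milliseconds
    s3 = boto3.client('s3', region_name='us-east-1', config=client_config)


# pool = mp.Pool(initializer=init_worker, processes=num_processes)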
I suspect that AWS recently reduced throttling limits for metadata requests because I suddenly started running into the same issue. The solution that appears to work is to query credentials once before creating the pool and have the processes in the pool use them explicitly instead of making them query credentials again.
I am using fsspec with s3fs, and here's what my code for this looks like:
def get_aws_credentials():
    '''
    Retrieve current AWS credentials.
    '''
    import asyncio, random, time, s3fs
    fs = s3fs.S3FileSystem()
    # Try getting credentials
    num_attempts = 5
    for attempt in range(num_attempts):
        credentials = asyncio.run(fs.session.get_credentials())
        if credentials is not None:
            if attempt > 0:
                # log is a module-level logger defined elsewhere
                log.info('received credentials on attempt %s', 1 + attempt)
            return asyncio.run(credentials.get_frozen_credentials())
        time.sleep(15 * (random.random() + 0.5))
    raise RuntimeError('failed to request AWS credentials '
                       'after %d attempts' % num_attempts)


def process_parallel(fn_d, max_processes):
    # [...]
    c = get_aws_credentials()
    # Cache credentials
    import fsspec.config
    prev_s3_cfg = fsspec.config.conf.get('s3', {})
    try:
        fsspec.config.conf['s3'] = dict(prev_s3_cfg,
                                        key=c.access_key,
                                        secret=c.secret_key)
        num_processes = min(len(fn_d), max_processes)
        from concurrent.futures import ProcessPoolExecutor
        with ProcessPoolExecutor(max_workers=num_processes) as pool:
            for data in pool.map(process_file, fn_d, chunksize=10):
                yield data
    finally:
        fsspec.config.conf['s3'] = prev_s3_cfg
Raw boto3 code will look essentially the same, except instead of the whole fs.session and asyncio.run() song and dance, you'll work with boto3.Session itself and call its get_credentials() and get_frozen_credentials() methods directly.
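As a rough illustration of that boto3 variant (a sketch under my own naming, not the answer's code; note that role-based credentials eventually expire, so long-running workers may need to refresh them):

import boto3


def get_frozen_aws_credentials():
    # Fetch credentials once, in the parent process, before spawning workers
    session = boto3.Session()
    return session.get_credentials().get_frozen_credentials()


frozen = get_frozen_aws_credentials()

# Each worker can then build its client from the explicit keys instead of
# querying the credentials provider again:
s3 = boto3.client(
    's3',
    aws_access_key_id=frozen.access_key,
    aws_secret_access_key=frozen.secret_key,
    aws_session_token=frozen.token,
)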
I ran into the same problem in a multiprocessing situation. I suspect there is a client-initialization problem when you use multiple processes, so I suggest using a getter function to obtain the s3 client. It works for me.
g_s3_cli = None


def get_s3_client(refresh=False):
    global g_s3_cli
    if not g_s3_cli or refresh:
        g_s3_cli = boto3.client('s3')
    return g_s3_cli
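For example, each worker process can call the getter lazily, so the client gets created at most once per process. A small usage sketch assuming the snippet above (with boto3 imported); process_key, the bucket and the keys are hypothetical:

import multiprocessing as mp


def process_key(bucket, key):
    s3 = get_s3_client()  # created on first use inside each worker process
    obj = s3.get_object(Bucket=bucket, Key=key)
    return len(obj['Body'].read())


if __name__ == '__main__':
    work = [('my-bucket', 'a.json'), ('my-bucket', 'b.json')]
    with mp.Pool(processes=8) as pool:
        print(pool.starmap(process_key, work))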

Uploading a JSON file with Pyrebase to Firebase Storage from Google App Engine

I have a pretty simple Flask web app running in GAE that downloads a JSON file from Firebase Storage and replaces it with an updated one when necessary. Everything works OK, but GAE throws an IOError exception whenever I try to create a new file. I'm using Firebase Storage because I know it isn't possible to read/write files in a GAE environment, but then how am I supposed to use the Pyrebase storage.child('foo.json').put('foo.json') function? What am I doing wrong? Please check my code below.
firebase_config = {my_firebase_config_dict}
pyrebase_app = pyrebase.initialize_app(firebase_config)
storage = pyrebase_app.storage()


@app.route('/')
def check_for_updates():
    try:
        json_feeds = json.loads(requests.get('http://my-firebase-storage-url/example.json').text)
        # Here I check if I need to update example.json
        # ...
        with open("example.json", "w") as file:
            json.dump(info, file)
        storage.child('example.json').put('example.json')
        return 'finished successfully!'
    except IOError:
        return "example.json doesn't exist"
If I understand correctly, you just need this file temporarily in GAE and want to put it in Cloud Storage afterwards. According to this doc you can do it as in a normal OS, but in the /tmp folder:
"if your app only needs to write temporary files, you can use standard Python 3.7 methods to write files to a directory named /tmp"
I hope it will help!
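A minimal sketch of that approach, assuming the same pyrebase storage object and info dict from the question; only the write path changes to /tmp:

import json


def save_and_upload(info):
    # App Engine's standard environment only allows writes under /tmp
    tmp_path = '/tmp/example.json'
    with open(tmp_path, 'w') as f:
        json.dump(info, f)
    # Upload the temporary file to Firebase Storage under the original object name
    storage.child('example.json').put(tmp_path)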
I finally did it like this, but I don't know if this is better, worse or simply equivalent to @vitooh's solution. Please, let me know:
firebase_config = {my_firebase_config_dict}
pyrebase_app = pyrebase.initialize_app(firebase_config)
storage = pyrebase_app.storage()


@app.route('/')
def check_for_updates():
    try:
        blob = bucket.blob('example.json')  # bucket is a google-cloud-storage bucket object
        example = json.loads(blob.download_as_string())
        # Here I check if I need to update example.json
        # ...
        if something_changed:
            blob.upload_from_string(json.dumps(example), content_type='application/json')
        return 'finished successfully!'
    except IOError:
        return "example.json doesn't exist"

How to download a directory's content via ftp using nodejs?

So, I am trying to download the contents of a directory via SFTP using Node.js, and so far I am getting stuck with an error.
I am using the ssh2-sftp-client npm package, and for the most part it works pretty well: I am able to connect to the server and list the files in a particular remote directory.
Using the fastGet method to download a file also works without any hassle, and since all the methods are promise-based I assumed I could download all the files in the directory simply enough, by doing something like:
let main = async () => {
    await sftp.connect(config.sftp);
    let data = await sftp.list(config.remote_dir);
    if (data.length) data.map(async x => {
        await sftp.fastGet(`${config.remote_dir}/${x.name}`, config.base_path + x.name);
    });
}
So it turns out the code above successfully downloads the first file, but then crashes with the following error message:
Error: Failed to get sandbox/demo2.txt: The requested operation cannot be performed because there is a file transfer in progress.
This seems to indicate that the promise from fastGet is resolving too early as the file transfer is supposed to be over when the next element of the file list is processed.
I tried using the more traditional get() instead, but it uses streams and fails with a different error. After researching, it seems there's been a breaking change regarding streams in Node 10.x; in my case calling get simply fails (it doesn't even download the first file).
Does anyone know a workaround to this? or else, another package that can download several files by sftp?
Thanks!
I figured out that, since the issue was concurrent download attempts on one client connection, I could manage it by using one client per file download. I ended up with the following recursive function.
let getFromFtp = async (arr) => {
    if (arr.length == 0) return (processFiles());
    let x = arr.shift();
    conns.push(new Client());
    let idx = conns.length - 1;
    await conns[idx].connect(config.sftp.auth);
    await conns[idx]
        .fastGet(`${config.sftp.remote_dir}/${x.name}`, `${config.dl_dir}${x.name}`);
    await conns[idx].end();
    getFromFtp(arr);
};
Notes about this function:
The array parameter is a list of files to download, presumably fetched using list() beforehand
conns was declared as an empty array and is used to contain our clients.
Array.prototype.shift() is used to gradually deplete the array as we go through the file list.
The processFiles() method is fired once all the files have been downloaded.
This is just the POC version; of course, error handling still needs to be added.
