How can I extract a tar.gz file in a Google Cloud Storage bucket from a Colab Notebook? - python-3.x

As the question states, I'm trying to figure out how I can extract a .tar.gz file that is stored in a GCS Bucket from a Google Colab notebook.
I am able to connect to my bucket via:
from google.colab import auth
auth.authenticate_user()
project_id = 'my-project'
!gcloud config set project {project_id}
However, when I try running a command such as:
!gsutil tar xvzf my-bucket/compressed-files.tar.gz
I get an error. I know that gsutil probably has limited functionality and maybe isn't meant to do what I'm trying to do, so is there a different way to do it?
Thanks!

Google Cloud Storage (GCS) does not natively support unpacking a tar archive. You will have to do this yourself, either on your local machine or from a Compute Engine VM, for instance.
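Since the question is about Colab specifically, one way to "do it yourself" is to copy the archive onto the Colab VM's local disk, extract it there with Python's standard tarfile module, and optionally copy the results back to the bucket. A minimal sketch, assuming the auth cell from the question has already run and using the bucket and archive names from the question as placeholders:
!gsutil cp gs://my-bucket/compressed-files.tar.gz /content/
import tarfile
with tarfile.open('/content/compressed-files.tar.gz', 'r:gz') as archive:
    archive.extractall('/content/extracted')  # unpack onto the Colab VM's local disk
!gsutil -m cp -r /content/extracted gs://my-bucket/extracted  # optional: push the extracted files back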

You can create a Dataflow job from a template to decompress a file in your bucket.
The template is called Bulk Decompress Cloud Storage Files.
You have to specify the input file location, the output location, a failure log file, and a temp location.
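A rough sketch of launching that template from the gcloud CLI; the job name, region, and bucket paths are placeholders, and the parameter names (inputFilePattern, outputDirectory, outputFailureFile) follow the template's documentation:
gcloud dataflow jobs run decompress-files \
    --region us-central1 \
    --gcs-location gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
    --parameters inputFilePattern=gs://my-bucket/compressed-files.tar.gz,outputDirectory=gs://my-bucket/decompressed,outputFailureFile=gs://my-bucket/decompress_failures.csv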

This worked for me. I'm new to Colab and Python itself, so I'm not certain this is the solution.
!sudo tar -xvf my-bucket/compressed-files.tar.gz

Related

Upload youtube-dl transcript into Google Cloud storage

I am using youtube-dl to download transcripts. It works fine on my machine (local server), where I provide __dirname in the options params to save the files. But I want to use Google Cloud Functions, so how can I substitute __dirname with Cloud Storage?
Thank you!
Uploading directly from youtube-dl is not possible. Uploading files into Google Cloud Storage is only possible once the file already exists on a disk.
You will need to download the file with whichever program you mention (as mentioned in the comments, you can download it to a temporary folder), upload the file to GCS, and then delete it from that temporary folder.
What you can actually do, for example, is run a script inside a Google Cloud instance with a gsutil command to upload the files into a bucket.
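To illustrate that download-then-upload pattern in Python, here is a minimal sketch using the google-cloud-storage client; the function name, bucket name, and file paths are hypothetical:
import os
from google.cloud import storage  # pip install google-cloud-storage

def upload_transcript(local_path, bucket_name, destination_name):
    # In a Cloud Function, /tmp is the only writable directory, so the
    # transcript should first be downloaded there (e.g. by youtube-dl).
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(destination_name).upload_from_filename(local_path)
    os.remove(local_path)  # clean up the temporary file afterwards

# upload_transcript('/tmp/video.en.vtt', 'my-transcripts-bucket', 'video.en.vtt')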

How can we save or upload .py file on dbfs/filestore

We have a few .py files on my local machine that need to be stored/saved on the FileStore path on DBFS. How can I achieve this?
I tried the dbutils.fs module's copy actions.
I tried the code below, but it did not work; I know something is not right with my source path. Or is there a better way of doing this? Please advise.
'''
dbUtils.fs.cp ("c:\\file.py", "dbfs/filestore/file.py")
'''
It sounds like you want to copy a file from your local machine to the DBFS path on the Azure Databricks servers. However, because the Azure Databricks notebook is a browser-based interactive interface running in the cloud, it cannot directly operate on files on your local machine.
So here are the solutions you can try.
As #Jon said in the comment, you can follow the official document Databricks CLI to install the Databricks CLI via pip install databricks-cli on your local machine and then copy a file to DBFS (see the sketch after this list).
Follow the official document Accessing Data to import data via Drop files into or browse to files in the Import & Explore Data box on the landing page, but using the CLI is still recommended.
Upload your files to Azure Blob Storage, then follow the official document Data sources / Azure Blob Storage to perform operations such as dbutils.fs.cp.
Hope it helps.
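As a rough sketch of the CLI option, assuming the Databricks CLI has been installed and configured with a workspace URL and a personal access token; the Windows source path is the one from the question:
pip install databricks-cli
databricks configure --token  # prompts for the workspace URL and an access token
databricks fs cp "C:\file.py" dbfs:/FileStore/file.py
databricks fs ls dbfs:/FileStore  # verify the file arrived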

How to download a folder to my local PC from Google Cloud console

I have a folder I want to download from Google Cloud Console using the Linux Ubuntu command terminal. I have logged in to my SSH console and so far I can only list the contents of my files as follows.
cd /var/www/html/staging
Now I want to download all the files from that staging folder.
Sorry if I'm missing the point. Anyway, I came here seeking a way to download files from the Google Cloud console. I didn't have the ability to create an additional bucket as suggested in the other answer, but I accidentally noticed that there is a button for exactly what I needed.
Look for the kebab-style (three-dot) menu button. In the dropdown that appears you should find a Download button.
If you mean Cloud Shell, then I typically use the GCP storage tools (gsutil).
In summary, I transfer from Cloud Shell to Cloud Storage, then from Cloud Storage to my workstation.
First, have the Google Cloud SDK installed on your local system.
Make a bucket to transfer into: gsutil mb gs://MySweetBucket
From within Cloud Shell, move the file to the bucket: gsutil cp /path/to/file gs://MySweetBucket/
On your local system, pull the file down: gsutil cp gs://MySweetBucket/filename .
Done!
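Since the question is about a whole folder rather than a single file, the same pattern works with a recursive copy; a sketch, reusing MySweetBucket and the staging path from the question and the steps above:
gsutil mb gs://MySweetBucket
gsutil -m cp -r /var/www/html/staging gs://MySweetBucket/  # run on the VM or in Cloud Shell
gsutil -m cp -r gs://MySweetBucket/staging .  # run on your local workstation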

What is the meaning of each part of this luminoth command?

I am trying to train on a dataset using Luminoth. However, as my computer has a poor GPU, I am planning to use Google Cloud. It seems that Luminoth has gcloud integration according to the docs (https://media.readthedocs.org/pdf/luminoth/latest/luminoth.pdf).
Here is what I have done.
Create a Google Cloud project.
Install Google Cloud SDK on your machine.
gcloud auth login
Enable the following APIs:
• Compute Engine
• Cloud Machine Learning Engine
• Google Cloud Storage
I did it through the web console.
Now here is where I am stuck.
5. Upload your dataset’s TFRecord files to a Cloud Storage bucket:
The command for this is:
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp -r /path/to/dataset/tfrecords gs://your_bucket/path
I have the TFRecord files on my local drive, along with the data I need to train on. However, I am not sure what each part of the gsutil command means. For /path/to/dataset/, do I simply input the directory my data is in? And I have uploaded the files to a bucket; do I simply provide the path to it?
Additionally, I am currently getting
does not have permission to access project (or it may not exist)
Apologies if this may be a stupid question.
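For what it is worth, the upload command from the manual breaks down roughly as follows; your_bucket and the dataset path are placeholders to replace with your own values:
# -o GSUtil:parallel_composite_upload_threshold=150M  ->  upload files larger than 150 MB as parallel composite chunks
# cp -r                                               ->  copy recursively
# /path/to/dataset/tfrecords                          ->  the local directory holding your TFRecord files
# gs://your_bucket/path                               ->  the destination bucket and prefix
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp -r /path/to/dataset/tfrecords gs://your_bucket/path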

How to download all the files from S3 bucket irrespective of file key using python

I am working on an automation piece where I need to download all files from a folder inside an S3 bucket, irrespective of the file name. I understand that using boto3 in Python I can download a file like:
s3BucketObj = boto3.client('s3', region_name=awsRegion, aws_access_key_id=s3AccessKey, aws_secret_access_key=s3SecretKey)
s3BucketObj.download_file(bucketName, "abc.json", "/tmp/abc.json")
but I was then trying to download all files, irrespective of file name, by specifying it this way:
s3BucketObj.download_file(bucketName, "test/*.json", "/test/")
I know the syntax above could be totally wrong, but is there a simple way to do that?
I did find a thread that helps here, but it seems a bit complex: Boto3 to download all files from a S3 Bucket
There is no API call to Amazon S3 that can download multiple files.
The easiest way is to use the AWS Command-Line Interface (CLI), which has aws s3 cp --recursive and aws s3 sync commands. It will do everything for you.
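For example, either of the following would pull down the test/ prefix from the question (my-bucket is a placeholder for your bucket name):
aws s3 sync s3://my-bucket/test /test/
aws s3 cp s3://my-bucket/test /test/ --recursive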
If you choose to program it yourself, then Boto3 to download all files from a S3 Bucket is a good way to do it. This is because you need to do several things:
Loop through every object (there is no S3 API to copy multiple files)
Create a local directory if it doesn't exist
Download the object to the appropriate local directory
The task can be made simpler if you do not wish to reproduce the directory structure (eg if all objects are in the same path). In that case, you can simply loop through the objects and download each of them to the same directory.
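To make that loop concrete, here is a minimal boto3 sketch, assuming credentials are already configured; download_prefix and the bucket/prefix names in the usage comment are illustrative, not part of the original answer:
import os
import boto3

s3 = boto3.client('s3')  # region and credentials are picked up from the environment or config

def download_prefix(bucket, prefix, dest):
    # Loop through every object under the prefix (S3 has no multi-file download call)
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith('/'):  # skip "folder" placeholder objects
                continue
            target = os.path.join(dest, os.path.relpath(key, prefix))
            os.makedirs(os.path.dirname(target) or '.', exist_ok=True)  # create local directories as needed
            s3.download_file(bucket, key, target)  # download into the matching local path

# download_prefix('my-bucket', 'test/', '/test')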
