Cannot read .json from a google cloud bucket - python-3.x

I have a folder structure within a bucket of google cloud storage
bucket_name = 'logs'
json_location = '/logs/files/2018/file.json'
I try to read this json file in jupyter notebook using this code
import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "logs/files/2018/file.json"

def download_blob(source_blob_name, bucket_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print('Blob {} downloaded to {}.'.format(
        source_blob_name,
        destination_file_name))
Then calling the function
download_blob('file.json', 'logs', 'file.json')
And I get this error
DefaultCredentialsError: File /logs/files/2018/file.json was not found.
I have looked at all the similar questions asked on Stack Overflow and cannot find a solution.
The json file is present and can be opened or downloaded at the json_location on Google Cloud Storage.

There are two different perspectives regarding the json file you refer to:
1) The json file used for authenticating to GCP.
2) The json you want to download from a bucket to your local machine.
For the first one, if you are accessing your Jupyter server remotely, the json most probably doesn't exist on that remote machine, only on your local machine. If this is your scenario, upload the json to the Jupyter server. Executing ls -l /logs/files/2018/file.json on the remote machine can help verify it is there. Then, os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "JSON_PATH_ON_JUPYTER_SERVER" should work.
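For example, a minimal sketch assuming the service-account key has been uploaded to the Jupyter server (the key path is a placeholder, and I am assuming the object lives at files/2018/file.json inside the logs bucket -- note there is no bucket name or leading slash in the blob name):
import os
from google.cloud import storage

# Placeholder path: wherever the service-account key file lives on the Jupyter server.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/jupyter/keys/service-account.json"

client = storage.Client()
bucket = client.get_bucket("logs")
blob = bucket.blob("files/2018/file.json")  # object name relative to the bucket
blob.download_to_filename("file.json")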
On the other hand, I executed your code and got:
>>> download_blob('static/upload_files_CS.png', 'bucketrsantiago', 'file2.json')
Blob static/upload_files_CS.png downloaded to file2.json.
The file gs://bucketrsantiago/static/upload_files_CS.png was downloaded to my local machine with the name file2.json. This helps to clarify that the only problem is regarding the authentication json file.

GOOGLE_APPLICATION_CREDENTIALS is supposed to point to a file on the local disk where you are running jupyter. You need the credentials in order to call GCS, so you can't fetch them from GCS.
In fact, you are best off not messing around with credentials at all in your program, and leaving that to the client library. Don't touch GOOGLE_APPLICATION_CREDENTIALS in your application. Instead:
If you are running on GCE, just make sure your GCE instance has a service account with the right scopes and permissions. Applications running on that instance will automatically have the permissions of that service account.
If you are running locally, install google cloud SDK and run gcloud auth application-default login. Your program will then automatically use whichever account you log in as.
Complete instructions here
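As a minimal sketch of that approach (the project ID is a placeholder; the bucket and object names are taken from the question), once gcloud auth application-default login has been run the code never needs to touch GOOGLE_APPLICATION_CREDENTIALS:
from google.cloud import storage

# Credentials are picked up automatically from Application Default Credentials.
client = storage.Client(project="your-project-id")  # placeholder project ID
client.bucket("logs").blob("files/2018/file.json").download_to_filename("file.json")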

Related

MainThread: Vaex: Error while Opening Azure Data Lake Parquet file

I tried to open a parquet file on Azure Data Lake Gen2 storage using a SAS URL (with the datetime limit and token embedded in the URL) with vaex, by doing:
vaex.open(sas_url)
and I got the error
ERROR:MainThread:vaex:error opening 'the path which was also the sas_url(can't post it for security reasons)'
ValueError: Do not know how to open (can't publicize the sas url) , no handler for https is known
How do I get vaex to read the file or is there another azure storage that works better with vaex?
I finally found a solution! Vaex can read files in Azure blob storage with this:
import vaex
import adlfs
storage_account = "..."
account_key = "..."
container = "..."
object_path = "..."
fs = adlfs.AzureBlobFileSystem(account_name=storage_account, account_key=account_key)
df = vaex.open(f"abfs://{container}/{object_path}", fs=fs)
For more details, I found the solution in https://github.com/vaexio/vaex/issues/1272
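Since the question specifically uses a SAS URL, it should also be possible to hand adlfs a SAS token instead of an account key; a hedged sketch (the sas_token parameter is based on the adlfs documentation, not on the linked issue, so treat it as an assumption):
import vaex
import adlfs

# Assumption: adlfs.AzureBlobFileSystem accepts a sas_token in place of account_key.
fs = adlfs.AzureBlobFileSystem(
    account_name="<storage-account>",
    sas_token="<sas-token>",
)
df = vaex.open("abfs://<container>/<path/to/file.parquet>", fs=fs)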
Vaex is not capable of reading data from an https source; that's the reason you are getting the error "no handler for https is known".
Also, as per the documentation, vaex supports data input from Amazon S3 buckets and Google Cloud Storage.
Cloud support:
Amazon Web Services S3
Google Cloud Storage
Other cloud storage options
They mention that other cloud storage options are also supported, but there is no documentation anywhere with an example of fetching data from an Azure storage account, let alone using a SAS URL.
Also, please visit the API documentation for the vaex library for more info.

Delete images from a folder created on a Google Cloud run API

I have a Flask API that is running on Google Cloud Run. For the sake of the question let it be called https://objdetect-bbzurq6giq-as.a.run.app/objdetect.
Using this API, a person uploads an image, the API highlights objects in the image, and then stores the new image in a folder called static. The location of that folder is https://objdetect-bbzurq6giq-as.a.run.app/static/.
Now that I am testing the API on tons of images, the capacity of the server is running out. I want to delete all the images from the static folder.
I tried the Python script below but it didn't work for me; maybe that's not the right solution:
from google.cloud import storage
import os
os.environ["GCLOUD_PROJECT"] = "my-project-1234"
bucket_name = 'https://objdetect-bbzurq6giq-as.a.run.app/objdetect'
directory_name = 'https://objdetect-bbzurq6giq-as.a.run.app/static/'
client = storage.Client()
bucket = client.get_bucket(bucket_name)
# list all objects in the directory
blobs = bucket.list_blobs(prefix=directory_name)
for blob in blobs:
    blob.delete()
Is there a way to achieve this using a python script?
Cloud Run is not Cloud Storage. Use the Linux file system APIs to delete files stored in Cloud Run.
Use the function os.unlink():
import os

path = '/static'
with os.scandir(path) as it:
    for entry in it:
        if entry.is_file():
            os.unlink(os.path.join(path, entry.name))
Cloud Run and Cloud Storage are two different services.
Cloud Run runs within a container, essentially a stateless machine/VM. If the images are created within the container, they will be deleted once the container shuts down.
Cloud Storage is a file storage service, and files within GCS persist until explicitly deleted or removed by a lifecycle rule.
So files created within Cloud Run are not stored in Cloud Storage, but on the server running in the container. If you want to delete the files inside Cloud Run, you need to delete them with code (Python in your case).
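If you do want to clear that folder from inside the running service, one option is an extra endpoint in the existing Flask app; a minimal sketch (the /cleanup route and the static path are assumptions, adjust them to your app):
import os

from flask import Flask, jsonify

app = Flask(__name__)
# Assumption: the static folder sits next to this module inside the container.
STATIC_DIR = os.path.join(os.path.dirname(__file__), "static")

@app.route("/cleanup", methods=["POST"])
def cleanup():
    """Delete every file in this container instance's static folder."""
    removed = 0
    for entry in os.scandir(STATIC_DIR):
        if entry.is_file():
            os.unlink(entry.path)
            removed += 1
    return jsonify({"deleted": removed})
Note that each Cloud Run instance has its own filesystem, so this only clears the instance that happens to serve the request.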

How to ship Airflow logs to Azure Blob Store

I'm having trouble following this guide section 3.6.5.3 "Writing Logs to Azure Blob Storage"
The documentation states you need an active hook to Azure Blob storage. I'm not sure how to create this. Some sources say you need to create the hook in the UI, and some say you can use an environment variable. Either way, none of my logs are getting written to blob store and I'm at my wits end.
An Azure Blob Store hook (or any hook, for that matter) tells Airflow how to write into Azure Blob Store. This is already included in recent versions of Airflow as wasb_hook.
You will need to make sure that the hook is able to write to Azure Blob Store, and that the REMOTE_BASE_LOG_FOLDER is named like wasb-xxx. Once you take care of these two things, the instructions work without a hitch.
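To sanity-check that the hook can actually write to your storage account, something like this can be run on the Airflow machine (a sketch assuming Airflow 1.10's contrib import path; the container and blob names are placeholders):
from airflow.contrib.hooks.wasb_hook import WasbHook

# "log_sync" must match the remote_log_conn_id configured in airflow.cfg.
hook = WasbHook(wasb_conn_id="log_sync")
# Placeholder container/blob names; this only proves that write access works.
hook.load_string("hello from airflow", container_name="airflow-logs", blob_name="connection-test.txt")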
I achieved writing logs to blob storage using the steps below:
Create a folder named config inside the airflow folder
Create empty __init__.py and log_config.py files inside the config folder
Search for airflow_local_settings.py on your machine
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.pyc
run
cp /home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py config/log_config.py
Edit the airflow.cfg [core] section:
remote_logging = True
remote_log_conn_id = log_sync
remote_base_log_folder=wasb://airflow-logs#storage-account.blob.core.windows.net/logs/
logging_config_class =log_config.DEFAULT_LOGGING_CONFIG
Add a log_sync connection object as below:
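The connection can be added in the Airflow UI (Admin -> Connections) or programmatically; a minimal sketch of the latter, assuming the wasb hook reads the storage account name from the connection login and the access key from the password:
from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id="log_sync",
    conn_type="wasb",
    login="<storage-account-name>",  # assumption: account name goes in login
    password="<access-key>",         # assumption: account key goes in password
)
session = settings.Session()
session.add(conn)
session.commit()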
Install the airflow azure dependency:
pip install apache-airflow[azure]
Restart webserver and scheduler

How can I read public files from google cloud storage python remotely?

I need to read some CSV files that were shared with me and are in Google Cloud Storage. My script will run from another server outside of Google Cloud.
I am using this code:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('/stats/installs')
blob = storage.Blob('installs_overview.csv', bucket)
content = blob.download_as_string()
print(content)
I am getting this error. Apparently I haven't specified the project, but I don't have one:
OSError: Project was not passed and could not be determined from the environment.
There are some wrong assumptions in the previous answers in this topic.
If it is a public bucket you do not have to worry about what project it is connected to. It is well documented how you, for example, can use a bucket to host a public website that browsers can access. Obviously the browser does not have to worry about what project it belongs to.
The code samples are a bit lacking on using public buckets and files -- in all the examples you supply a project and credentials, which will:
1) Bill bucket egress to the project you supply instead of the project the bucket is connected to
2) Assume that you need to authenticate and authorise
For a public file or bucket, however, all you have to worry about is the bucket name and file location.
You can do the following:
from google.cloud import storage
source="path/to/file/in/bucket.txt"
target="/your/local/file.txt"
client = storage.Client.create_anonymous_client()
# you need to set user_project to None for anonymous access
# If not it will attempt to put egress bill on the project you specify,
# and then you need to be authenticated to that project.
bucket = client.bucket(bucket_name="your-bucket", user_project=None)
blob = storage.Blob(source, bucket)
blob.download_to_filename(filename=target, client=client)
It is important that your file in the bucket grants read access to "AllUsers".
First of all, I think there might be some confusion regarding Cloud Storage and how to access it. Cloud Storage is a Google Cloud Platform product, and therefore, to use it, a GCP Project must exist. You can find the project number and project ID for your project in the Home page of the Console, as explained in this documentation page.
That being said, let me refer you to the documentation page about the Python Cloud Storage Client Library. When you create the client to use the service, you can optionally specify the project ID and/or the credentials files to use:
client = storage.Client(project="PROJECT_ID",credentials="OAUTH2_CREDS")
If you do not specify the Project ID, it will be inferred from the environment.
Also, take into account that you must set up authentication in order to use the service. If you were running the application inside another GCP service (Compute Engine, App Engine, etc.), the recommended approach would be using the Application Default Credentials. However, given that that is not your case, you should instead follow this guide to set up authentication, downloading the key for the Service Account having permission to work with Cloud Storage and pointing to it in the environment variable GOOGLE_APPLICATION_CREDENTIALS.
Also, it looks like the configuration in your code is not correct, given that the bucket name you are using ('/stats/installs') is not valid:
Bucket names must be between 3 and 63 characters. A bucket name can contain lowercase alphanumeric characters, hyphens, and underscores. It can contain dots (.) if it forms a valid domain name with a top-level domain (such as .com). Bucket names must start and end with an alphanumeric character.
Note that you can see that the given bucket does not exist by working with exceptions, specifically google.cloud.exceptions.NotFound. Also, given that the files you are trying to access are public, I would not recommend to share the bucket and file names, you can just obfuscate them with a code such as <BUCKET_NAME>, <FILE_NAME>.
So, as a summary, the course of action should be:
Identify the project to which the bucket you want to work with belongs.
Obtain the right credentials to work with GCS in that project.
Add the project and credentials to the code.
Fix the code you shared with the correct bucket and file name. Note that if the file is inside a folder (even though in GCS the concept of directories itself does not exist, as I explained in this other question), the file name in storage.Blob() should include the complete path like path/to/file/file.csv.
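Putting those steps together, a minimal sketch (the project ID, key path, bucket and file names are placeholders):
from google.cloud import storage
from google.oauth2 import service_account

# Placeholders: substitute your own project, key file, bucket and object names.
creds = service_account.Credentials.from_service_account_file("/path/to/key.json")
client = storage.Client(project="<PROJECT_ID>", credentials=creds)

bucket = client.get_bucket("<BUCKET_NAME>")
blob = bucket.blob("path/to/file/file.csv")  # full path inside the bucket, no leading slash
print(blob.download_as_string())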
I am not a google-cloud expert, but as some of the commentators have said, I think the problem will be that you haven't explicitly told the storage client which Project you are talking about. The error message implies that the storage client tries to figure out for itself which project you are referring to, and if it can't figure it out, it gives that error message. When I use the storage Client I normally just provide the project name as an argument and it seems to do the trick, e.g.:
client = storage.Client(project='my-uber-project')
Also, I just saw your comment that your bucket "doesn't have a project" - I don't understand how this is possible. If you log in to the google cloud console area and go to storage, surely your bucket is listed there and you can see your project name at the top of the page?
As #Mangu said, the bucket name in your code is presumably just to hide the real bucket name, as forward-slashes are not allowed in bucket names (but are allowed in blob names and can be used to represent 'folders').

Azure Drive addressing using local emulated blob store

I am unable to get a simple tech demo working for Azure Drive using a locally hosted service running the storage/compute emulator. This is not my first azure project, only my first use of the Azure Drive feature.
The code:
var localCache = RoleEnvironment.GetLocalResource("MyAzureDriveCache");
CloudDrive.InitializeCache(localCache.RootPath, localCache.MaximumSizeInMegabytes);
var creds = new StorageCredentialsAccountAndKey("devstoreaccount1", "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==");
drive = new CloudDrive(new Uri("http://127.0.0.1:10000/devstoreaccount1/drive"), creds);
drive.CreateIfNotExist(16);
drive.Mount(0, DriveMountOptions.None);
With local resource configuration:
<LocalStorage name="MyAzureDriveCache" cleanOnRoleRecycle="false" sizeInMB="220000" />
The exception:
Uri http://127.0.0.1:10000/devstoreaccount1/drive is Invalid
Information on how to address local storage can be found here: https://azure.microsoft.com/en-us/documentation/articles/storage-use-emulator/
I have used the storage emulator UI to create the C:\Users...\AppData\Local\dftmp\wadd\devstoreaccount1 folder which I would expect to act as the container in this case.
However, I am following those guidelines (as far as I can tell) and yet still I receive the exception. Is anyone able to identify what I am doing wrong in this case? I had hoped to be able to resolve this easily using a working sample where someone else is using CloudDrive with 127.0.0.1 or localhost but was unable to find such on Google.
I think you have skipped several required steps before mounting.
You have to initialize the local cache for the drive, and define the URI of the page blob containing the Cloud Drive, before mounting it.
Initializing the cache:
// Initialize the local cache for the Azure drive
LocalResource cache = RoleEnvironment.GetLocalResource("LocalDriveCache");
CloudDrive.InitializeCache(cache.RootPath + "cache", cache.MaximumSizeInMegabytes);
Defining the URI of the page blob, usually done in the configuration file:
// Retrieve URI for the page blob that contains the cloud drive from configuration settings
string imageStoreBlobUri = RoleEnvironment.GetConfigurationSettingValue("< Configuration name>");
