Delete images from a folder created on a Google Cloud run API - python-3.x

I have a Flask API running on Google Cloud Run. For the sake of the question, let it be called https://objdetect-bbzurq6giq-as.a.run.app/objdetect.
Using this API, a person uploads an image, the API highlights objects in the image, and then stores the new image in a folder called static. The location of that folder is https://objdetect-bbzurq6giq-as.a.run.app/static/.
Now that I am testing the API on tons of images, the capacity of the server is running out. I want to delete all the images from the static folder.
I tried the Python script below, but it didn't work for me; maybe that's not the right solution:
from google.cloud import storage
import os

os.environ["GCLOUD_PROJECT"] = "my-project-1234"

bucket_name = 'https://objdetect-bbzurq6giq-as.a.run.app/objdetect'
directory_name = 'https://objdetect-bbzurq6giq-as.a.run.app/static/'

client = storage.Client()
bucket = client.get_bucket(bucket_name)

# list all objects in the directory
blobs = bucket.list_blobs(prefix=directory_name)
for blob in blobs:
    blob.delete()
Is there a way to achieve this using a python script?

Cloud Run is not Cloud Storage. Use the Linux file system APIs to delete files stored inside the Cloud Run container.
Use the function os.unlink():
import os

path = '/static'
with os.scandir(path) as it:
    for entry in it:
        if entry.is_file():
            os.unlink(os.path.join(path, entry.name))
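Since you can't open a shell on a running Cloud Run instance in the usual way, one option is to expose the cleanup as an extra route in the existing Flask app and call it over HTTP. A minimal sketch; the /cleanup route name and the static path are assumptions, not part of the original API:
import os
from flask import Flask, jsonify

app = Flask(__name__)
STATIC_DIR = os.path.join(os.path.dirname(__file__), "static")  # assumed location

@app.route("/cleanup", methods=["POST"])
def cleanup():
    # Delete every regular file in this container instance's static folder.
    deleted = 0
    with os.scandir(STATIC_DIR) as it:
        for entry in it:
            if entry.is_file():
                os.unlink(entry.path)
                deleted += 1
    return jsonify({"deleted": deleted})
Note that this only cleans the instance that happens to serve the request; each Cloud Run instance has its own (in-memory) file system.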

Cloud Run and Cloud Storage are two different services.
Cloud Run runs within a container, i.e. within a stateless machine/VM. If the images are created within the container, they are deleted once the container shuts down.
Cloud Storage is an object store, and the files within GCS persist until explicitly deleted or removed by a lifecycle rule.
So files created within Cloud Run are not stored in Cloud Storage but on the local file system of the container. If you want to delete the files inside Cloud Run, you need to delete them from your code (Python in your case).
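If the processed images need to outlive a container instance, an alternative is to write them to a Cloud Storage bucket instead of the local static folder. A rough sketch with the google-cloud-storage client library; the bucket name objdetect-output is an assumption, not an existing resource:
from google.cloud import storage

def upload_result(local_path, destination_name):
    """Upload a processed image to GCS and return its gs:// URI."""
    client = storage.Client()
    bucket = client.bucket("objdetect-output")  # assumed bucket name
    blob = bucket.blob("results/" + destination_name)
    blob.upload_from_filename(local_path)
    return "gs://{}/{}".format(bucket.name, blob.name)
Objects in the bucket can then be removed with blob.delete() or a lifecycle rule, without touching the running service.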

Related

Combining google.cloud.storage.blob.Blob and google.appengine.api.images.get_serving_url

I've got images stored as Blobs in Google Cloud Storage, and I can see the amazing capabilities offered by get_serving_url(), which requires a blob_key as its first parameter. But I cannot see any way to get a key from the Blob - am I mixing up multiple things called Blobs?
I'm using Python 3 on GAE.
Try the code below, but you'll need to first enable the Bundled Services API:
from google.appengine.ext import blobstore
# Create the cloud storage file name
blobstore_filename = f"/gs/{<your_file_name>}"
# Now create the blob_key from the cloud storage file name
blob_key = blobstore.create_gs_key(blobstore_filename)
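From there, the key can be passed to get_serving_url. A minimal follow-up sketch, assuming the bundled Images API is available alongside blobstore and using an example object path:
from google.appengine.api import images
from google.appengine.ext import blobstore

# Build the blob key for an object that already exists in Cloud Storage.
blobstore_filename = "/gs/your-bucket/path/to/image.png"  # assumed example path
blob_key = blobstore.create_gs_key(blobstore_filename)

# Ask the Images service for a serving URL (resizing/cropping can be done via size= and crop=).
serving_url = images.get_serving_url(blob_key, secure_url=True)
print(serving_url)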

Cloud build avoid billing by changing eu.artifacts.<project>.appspot.com bucket to single-region

Using the App Engine standard environment for Python 3.7.
When running the app deploy command, container images are uploaded to Google Cloud Storage in the bucket eu.artifacts.<project>.appspot.com.
This message is printed during app deploy:
Beginning deployment of service [default]...
#============================================================#
#= Uploading 827 files to Google Cloud Storage =#
#============================================================#
File upload done.
Updating service [default]...
The files are uploaded to a multi-region (eu); how do I change this so they are uploaded to a single region?
My guess is that a configuration file should be added to the repository to instruct App Engine, Cloud Build or Cloud Storage that the files should be uploaded to a single region.
Is the eu.artifacts.<project>.appspot.com bucket required, or could all files be ignored using the .gcloudignore file?
The issue is similar to this one: How can I specify a region for the Cloud Storage buckets used by Cloud Build for a Cloud Run deployment?, but for App Engine.
I'm triggering the Cloud Build using a service account.
I tried to implement the changes from the solution in the link above, but wasn't able to get rid of the multi-region bucket.
substitutions:
  _BUCKET: unused
steps:
  - name: 'gcr.io/cloud-builders/gcloud'
    args: ['app', 'deploy', '--promote', '--stop-previous-version']
artifacts:
  objects:
    location: 'gs://${_BUCKET}/artifacts'
    paths: ['*']
Command:
gcloud builds submit --gcs-log-dir="gs://$BUCKET/logs" --gcs-source-staging-dir="gs://$BUCKET/source" --substitutions=_BUCKET="$BUCKET"
I delete the whole bucket after deploying, which prevents billing:
gsutil -m rm -r gs://us.artifacts.<project-id>.appspot.com
-m - multi-threading/multi-processing (instead of deleting objects one by one, this flag deletes them in parallel)
rm - command to remove objects
-r - recursive
https://cloud.google.com/storage/docs/gsutil/commands/rm
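If you prefer to do the same from Python (the rest of this thread is Python anyway), a rough equivalent with the google-cloud-storage client library is sketched below; the bucket name is the same placeholder as in the gsutil command, and this assumes the account running it is allowed to delete the bucket:
from google.cloud import storage

def empty_artifacts_bucket(bucket_name):
    """Delete every object in the artifacts bucket, then the bucket itself."""
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    for blob in bucket.list_blobs():
        blob.delete()
    bucket.delete()  # the bucket is empty now, mirroring gsutil rm -r

empty_artifacts_bucket("us.artifacts.<project-id>.appspot.com")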
After investigating a little bit more, I want to mention that this kind of bucket is created by the Container Registry product when you deploy a new container (i.e. when you deploy your App Engine application): when you push an image to a registry with a new hostname, Container Registry creates a storage bucket in the specified multi-regional location. This bucket is the underlying storage for the registry. Within a project, all registries with the same hostname share one storage bucket.
Based on this, it is not accessible by default and contains the container images written when you deploy a new container. It's not recommended to modify it, because the artifacts bucket is meant to contain deployment images, and changing it may affect your app.
Finally, something curious that I found: when you create a default bucket (as is the case with the aforementioned bucket), you also get a staging bucket with the same name, prefixed with staging. You can use this staging bucket for temporary files used for staging and test purposes; it has a 5 GB limit, but it is automatically emptied on a weekly basis.
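In the same spirit as the weekly auto-emptying of the staging bucket, a lifecycle rule can be attached to the artifacts bucket so that old objects are removed automatically instead of by hand after every deploy. A sketch with the google-cloud-storage client library; the 7-day age is an arbitrary assumption, and the caveat above about modifying this bucket still applies:
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("eu.artifacts.<project>.appspot.com")  # placeholder name

# Add a rule that deletes objects older than 7 days (assumed retention).
bucket.add_lifecycle_delete_rule(age=7)
bucket.patch()  # persist the updated lifecycle configuration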

How to ship Airflow logs to Azure Blob Store

I'm having trouble following this guide, section 3.6.5.3, "Writing Logs to Azure Blob Storage".
The documentation states you need an active hook to Azure Blob storage. I'm not sure how to create this. Some sources say you need to create the hook in the UI, and some say you can use an environment variable. Either way, none of my logs are getting written to blob store and I'm at my wits end.
An Azure Blob Storage hook (or any hook, for that matter) tells Airflow how to write to Azure Blob Storage. It is already included in recent versions of Airflow as wasb_hook.
You will need to make sure that the hook is able to write to Azure Blob Storage, and that the REMOTE_BASE_LOG_FOLDER bucket is named like wasb-xxx. Once you take care of these two things, the instructions work without a hitch.
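As a quick sanity check that the hook really can write, something along these lines can be run from the Airflow environment; this is a sketch assuming Airflow 1.10's contrib WasbHook and a connection ID of log_sync (the same one used in the steps below):
from airflow.contrib.hooks.wasb_hook import WasbHook

# Uses the Airflow connection named 'log_sync' (assumed; see the airflow.cfg settings below).
hook = WasbHook(wasb_conn_id="log_sync")

# Write a tiny test blob into the logs container and confirm it exists.
hook.load_string("hello from airflow", container_name="airflow-logs",
                 blob_name="connectivity-check.txt")
print(hook.check_for_blob("airflow-logs", "connectivity-check.txt"))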
I achieved writing logs to blob storage using the steps below:
Create a folder named config inside the airflow folder.
Create empty __init__.py and log_config.py files inside the config folder.
Search for airflow_local_settings.py on your machine:
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.pyc
Run:
cp /home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py config/log_config.py
Edit the airflow.cfg [core] section:
remote_logging = True
remote_log_conn_id = log_sync
remote_base_log_folder = wasb://airflow-logs#storage-account.blob.core.windows.net/logs/
logging_config_class = log_config.DEFAULT_LOGGING_CONFIG
Add a log_sync connection object in the Airflow UI.
Install the Airflow Azure dependency:
pip install apache-airflow[azure]
Restart the webserver and scheduler.

Docker issue with python script in Azure logic app not connecting to current Azure blob storage

Every day, an Excel file is automatically uploaded to my Azure blob storage account. I have a Python script that reads the Excel file, extracts the necessary information, and saves the output as a new blob in the Azure storage account. I set up a Docker container that runs this Python script. It works correctly when run locally.
I pushed the Docker image to the Azure container registry and tried to set up an Azure logic app that starts a container with this Docker image every day at the same time. It runs; however, it does not seem to work against the current contents of my Azure storage account.
For example, I pushed an updated version of the Docker image last night. A new Excel file was added to the Azure storage account this morning and the logic app ran one hour later. The container with the Docker image, however, only found the files that were present in Azure storage account yesterday (so it was missing the most recent file, which is the one I needed analyzed).
I confirmed that the issue is not with the logic app as I added a step in the logic app to list the files in the Azure storage account, and this list included the most recent file.
UPDATE: I have confirmed that I am accessing the correct version of the environment variables. The issue remains: the Docker container seems to access Azure blob storage as it was at the time I most recently pushed the Docker image to the container registry. My current workaround is to push the same image to the registry every day, but this is annoying.
ANOTHER UPDATE: Here is the code to get the most recent blob (an Excel file). The date is always contained in the name of the blob. In theory, it finds the blob with the most recent date:
# blob_service, backup_csv and date_list are defined earlier in the script
blobs = blob_service.list_blobs(container_name=os.environ.get("CONTAINERNAME"))
blobstring = blob_service.get_blob_to_text(os.environ.get("CONTAINERNAME"),
                                           backup_csv).content
current_df = pd.read_csv(StringIO(blobstring))

add_n = 1
blob_string = re.compile("sales.xls")
for b in blobs:
    if blob_string.search(b.name):
        dt = b.name[14:24]
        dt = datetime.strptime(dt, "%Y-%m-%d").date()
        date_list.append(dt)

today = max(date_list)
print(today)
However, the blobs don't seem to update. It returns the most recent blob as of the date that I last pushed the image to the registry.
I also checked print(date.today()) in the same script and this works as expected (it prints the current date).
I figured out that I just needed to take all of the variables in my .env file and add them as environment variables with appropriate values in the container's environment section of the logic app's container step. This https://learn.microsoft.com/en-us/azure/container-instances/container-instances-environment-variables was a helpful resource.
ALSO, the container group needs to be deleted as the last action in the logic app. I had named the wrong container group, so when the logic app ran each day, it used the cached version of the container.
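A small pattern that makes this kind of setup easier to debug is to read every required setting from the container group's environment at startup and fail loudly if one is missing. A sketch; apart from CONTAINERNAME, the variable names are assumptions:
import os

# Assumed names, except CONTAINERNAME, which appears in the script above.
REQUIRED_VARS = ["CONTAINERNAME", "STORAGE_ACCOUNT_NAME", "STORAGE_ACCOUNT_KEY"]

def load_settings():
    """Read required settings from the container group's environment variables."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in REQUIRED_VARS}

settings = load_settings()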

Cannot read .json from a google cloud bucket

I have the following folder structure within a Google Cloud Storage bucket:
bucket_name = 'logs'
json_location = '/logs/files/2018/file.json'
I try to read this JSON file in a Jupyter notebook using this code:
import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "logs/files/2018/file.json"

def download_blob(source_blob_name, bucket_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print('Blob {} downloaded to {}.'.format(
        source_blob_name,
        destination_file_name))
Then calling the function
download_blob('file.json', 'logs', 'file.json')
And I get this error
DefaultCredentialsError: File /logs/files/2018/file.json was not found.
I have looked at all the similar questions asked on Stack Overflow and cannot find a solution.
The JSON file is present and can be opened or downloaded at the json_location on Google Cloud Storage.
There are two different perspectives regarding the JSON file you refer to:
1) The JSON file used for authenticating to GCP.
2) The JSON file you want to download from a bucket to your local machine.
For the first one, if you are accessing your Jupyter server remotely, the JSON most probably doesn't exist on that remote machine but on your local machine. If this is your scenario, try uploading the JSON to the Jupyter server. Executing ls -l /logs/files/2018/file.json on the remote machine can help verify this. Then, os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "JSON_PATH_ON_JUPYTER_SERVER" should work.
On the other hand, I executed your code and got:
>>> download_blob('static/upload_files_CS.png', 'bucketrsantiago', 'file2.json')
Blob static/upload_files_CS.png downloaded to file2.json.
The file gs://bucketrsantiago/static/upload_files_CS.png was downloaded to my local machine with the name file2.json. This helps to clarify that the only problem is regarding the authentication json file.
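Putting both points together, a corrected version of the call could look like the sketch below, assuming a service-account key stored locally at /home/user/keys/service-account.json (an assumed path) and the object living at files/2018/file.json inside the logs bucket:
import os
from google.cloud import storage

# The credentials must be a local service-account key file, not an object in GCS.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/user/keys/service-account.json"

def download_blob(source_blob_name, bucket_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

# The blob name is the path inside the bucket, without the bucket name or a leading slash.
download_blob('files/2018/file.json', 'logs', 'file.json')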
GOOGLE_APPLICATION_CREDENTIALS is supposed to point to a file on the local disk where you are running Jupyter. You need the credentials in order to call GCS, so you can't fetch them from GCS.
In fact, you are best off not messing around with credentials at all in your program, and leaving that to the client library. Don't touch GOOGLE_APPLICATION_CREDENTIALS in your application. Instead:
If you are running on GCE, just make sure your GCE instance has a service account with the right scopes and permissions. Applications running on that instance will automatically have the permissions of that service account.
If you are running locally, install the Google Cloud SDK and run gcloud auth application-default login. Your program will then automatically use whichever account you logged in as.
Complete instructions here.
