Docker issue with python script in Azure logic app not connecting to current Azure blob storage - python-3.x

Every day, an Excel file is automatically uploaded to my Azure blob storage account. I have a Python script that reads the Excel file, extracts the necessary information, and saves the output as a new blob in the Azure storage account. I set up a Docker container that runs this Python script. It works correctly when run locally.
I pushed the Docker image to the Azure container registry and tried to set up an Azure logic app that starts a container with this Docker image every day at the same time. It runs; however, it does not seem to be working with the most up-to-date version of my Azure storage account.
For example, I pushed an updated version of the Docker image last night. A new Excel file was added to the Azure storage account this morning, and the logic app ran one hour later. The container with the Docker image, however, only found the files that were present in the Azure storage account yesterday (so it was missing the most recent file, which is the one I needed analyzed).
I confirmed that the issue is not with the logic app as I added a step in the logic app to list the files in the Azure storage account, and this list included the most recent file.
UPDATE: I have confirmed that I am accessing the correct version of the environment variables. The issue remains: the Docker container seems to access Azure blob storage as it was at the time I most recently pushed the Docker image to the container registry. My current workaround is to push the same image to the registry every day, but this is annoying.
ANOTHER UPDATE: Here is the code to get the most recent blob (an Excel file). The date is always contained in the name of the blob. In theory, it finds the blob with the most recent date:
import os
import re
from datetime import datetime
from io import StringIO

import pandas as pd

# blob_service (a legacy azure-storage blob client) and backup_csv (the name of
# a backup CSV blob) are created earlier in the script.
blobs = blob_service.list_blobs(container_name=os.environ.get("CONTAINERNAME"))
blobstring = blob_service.get_blob_to_text(os.environ.get("CONTAINERNAME"),
                                           backup_csv).content
current_df = pd.read_csv(StringIO(blobstring))

add_n = 1
blob_string = re.compile("sales.xls")
date_list = []
for b in blobs:
    if blob_string.search(b.name):
        # the date (YYYY-MM-DD) is embedded in the blob name
        dt = b.name[14:24]
        dt = datetime.strptime(dt, "%Y-%m-%d").date()
        date_list.append(dt)
today = max(date_list)
print(today)
However, the blobs don't seem to update. It returns the most recent blob as of the date that I last pushed the image to the registry.
I also checked print(date.today()) in the same script and this works as expected (it prints the current date).
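For reference while debugging, a minimal sketch (assuming the same legacy azure-storage client as above) that prints each blob's name and last-modified timestamp can confirm whether the listing itself is stale on a given run:
# debugging sketch: list every blob with its last-modified time to check
# whether the container listing is fresh on this run
for b in blob_service.list_blobs(os.environ.get("CONTAINERNAME")):
    print(b.name, b.properties.last_modified)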

Figured out that I just needed to take all of the variables in my .env file and add them as environment variables with appropriate values in the 'containers environment' section where the logic app creates the container group. This https://learn.microsoft.com/en-us/azure/container-instances/container-instances-environment-variables was a helpful resource.
ALSO, the container group needs to be deleted as the last action in the logic app. I had named the wrong container group, so when the logic app ran each day, it used the cached version of the container.
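For anyone scripting the same flow outside the Logic App designer, here is a rough sketch of the create-with-environment-variables-then-delete pattern using the azure-mgmt-containerinstance SDK. The resource group, registry, variable names, and sizes are placeholders, and older SDK versions use create_or_update/delete instead of the begin_* methods:
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerinstance import ContainerInstanceManagementClient
from azure.mgmt.containerinstance.models import (
    Container, ContainerGroup, EnvironmentVariable,
    ResourceRequests, ResourceRequirements)

client = ContainerInstanceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# pass every value from the .env file as an environment variable on the container
env_vars = [
    EnvironmentVariable(name="CONTAINERNAME", value="sales-container"),
    EnvironmentVariable(name="STORAGE_KEY", secure_value="<storage-account-key>"),
]

container = Container(
    name="sales-report",
    image="myregistry.azurecr.io/sales-report:latest",  # private ACR images also need image_registry_credentials
    resources=ResourceRequirements(requests=ResourceRequests(cpu=1.0, memory_in_gb=1.5)),
    environment_variables=env_vars,
)
group = ContainerGroup(location="eastus", containers=[container],
                       os_type="Linux", restart_policy="Never")

# create (or recreate) the container group for today's run ...
client.container_groups.begin_create_or_update("my-resource-group", "sales-report-group", group)

# ... and delete it as the last step so the next run does not reuse a cached container
client.container_groups.begin_delete("my-resource-group", "sales-report-group")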

Related

"Azure Blob Source 400 Bad Request" when using Azure Blob Source in SSIS to pull a file from Azure Storage container

My package is very simple. It loads data from a csv file that I have stored in an Azure storage container and inserts that data into an Azure SQL database. The issue stems from the connection to my Azure storage container. Here is an image of the output:
Making this even more odd, while the data flow task is failing, the individual components within the data flow task all indicate success:
Setting up the package, it seemed that the connection to the container was fine (after all, it was able to extract all the column names from the desired file and map them to their destinations). Here is an image showing the connection is fine:
So the issue is only realized upon execution.
I will also note that I found this post describing the exact same issue that I am experiencing now. As the top response there instructed, I added the new registry keys, but no cigar.
Any thoughts would be helpful.
First, make sure your blob can be accessed publicly:
And if you don't have a requirement to restrict networking, make sure public network access is allowed:
Then set the container access level:
And make sure the container is correct.
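If you prefer to verify this from code rather than the portal, here is a small sketch using the current azure-storage-blob package; the connection string and container name are placeholders:
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("my-container")

# show the current public access level (None means private)
print(container.get_container_properties().public_access)

# allow anonymous read access to the blobs in this container
container.set_container_access_policy(signed_identifiers={}, public_access="blob")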

Delete images from a folder created on a Google Cloud run API

I have a Flask API that is running on Google Cloud Run. For the sake of the question let it be called https://objdetect-bbzurq6giq-as.a.run.app/objdetect.
Using this API, a person uploads an image, the API highlights objects in the image and then stores the new image in a folder called static. The location of that folder is https://objdetect-bbzurq6giq-as.a.run.app/static/.
Now that I am testing the API on tons of images, the capacity of the server is running out. I want to delete all the images from the static folder.
I tried the Python script below, but it didn't work for me; maybe that's not the right solution:
from google.cloud import storage
import os

os.environ["GCLOUD_PROJECT"] = "my-project-1234"
bucket_name = 'https://objdetect-bbzurq6giq-as.a.run.app/objdetect'
directory_name = 'https://objdetect-bbzurq6giq-as.a.run.app/static/'
client = storage.Client()
bucket = client.get_bucket(bucket_name)
# list all objects in the directory
blobs = bucket.list_blobs(prefix=directory_name)
for blob in blobs:
    blob.delete()
Is there a way to achieve this using a python script?
Cloud Run is not Cloud Storage. Use the Linux file system APIs to delete files stored in Cloud Run.
Use the function os.unlink():
import os

path = '/static'
with os.scandir(path) as it:
    for entry in it:
        if entry.is_file():
            os.unlink(os.path.join(path, entry.name))
Cloud Run and Cloud Storage are two different services.
Cloud Run runs within a container, essentially a stateless machine/VM. If the images are created within the container, they are gone once the container shuts down.
Cloud Storage is object/file storage, and files within GCS persist until explicitly deleted or removed by a lifecycle rule.
So files created within Cloud Run are not stored in Cloud Storage but on the local file system of the container the service runs in. If you want to delete the files inside Cloud Run, you need to delete them from your code (Python in your case).
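As a concrete, hypothetical example of doing that cleanup from the code itself, the Flask app could expose a small route that empties its own static folder; the route name and folder location are assumptions:
import os

from flask import Flask, jsonify

app = Flask(__name__)
STATIC_DIR = os.path.join(os.path.dirname(__file__), "static")  # assumed location

@app.route("/cleanup", methods=["POST"])
def cleanup():
    # delete every regular file in the container's local static folder
    removed = 0
    with os.scandir(STATIC_DIR) as entries:
        for entry in entries:
            if entry.is_file():
                os.unlink(entry.path)
                removed += 1
    return jsonify({"deleted": removed})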

Is it possible to append only the newly created/arrived blob in Azure blob storage to the Azure SQL DB?

Here is my problem in steps:
Uploading specific csv file via PowerShell with Az module to a specific Azure blob container. [Done, works fine]
There is a trigger against this container which fires when a new file appears. [Done, works fine]
There is a pipeline connected with this trigger, which appends the fresh csv to the specific SQL table. [Done, but not good]
I have a problem with step 3. I don't want to append all the csvs within the container (which is how it works now); I just want to append the csv that has just arrived - the newest one in the container.
Okay, the solution is:
there is a built-in attribute in the pipeline called @triggerBody().fileName
Since I have the name of the file which fired the trigger, I can pass it to the pipeline.
I think you can use an event trigger and check the Blob created option.
Here is the official documentation about it; you can refer to this.

How to ship Airflow logs to Azure Blob Store

I'm having trouble following this guide, section 3.6.5.3 "Writing Logs to Azure Blob Storage".
The documentation states you need an active hook to Azure Blob Storage. I'm not sure how to create this. Some sources say you need to create the hook in the UI, and some say you can use an environment variable. Either way, none of my logs are getting written to blob store and I'm at my wit's end.
An Azure Blob Storage hook (or any hook, for that matter) tells Airflow how to write into Azure Blob Storage. This is already included in recent versions of Airflow as wasb_hook.
You will need to make sure that the hook is able to write to Azure Blob Storage, and that the REMOTE_BASE_LOG_FOLDER container is named like wasb-xxx. Once you take care of these two things, the instructions work without a hitch.
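To sanity-check that the hook can actually write to your storage account before wiring up remote logging, a quick hedged test along these lines can help (the connection id, container, and blob names are placeholders, and the import path varies by Airflow version):
# older Airflow versions: from airflow.contrib.hooks.wasb_hook import WasbHook
from airflow.providers.microsoft.azure.hooks.wasb import WasbHook

hook = WasbHook(wasb_conn_id="log_sync")  # placeholder connection id
hook.load_string("hello from airflow", container_name="airflow-logs",
                 blob_name="connection-test.txt")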
I achieved writing logs to blob storage using the steps below:
Create a folder named config inside the airflow folder
Create empty __init__.py and log_config.py files inside the config folder
Find airflow_local_settings.py on your machine:
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.pyc
Run:
cp /home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py config/log_config.py
Edit the airflow.cfg [core] section:
remote_logging = True
remote_log_conn_id = log_sync
remote_base_log_folder = wasb://airflow-logs@storage-account.blob.core.windows.net/logs/
logging_config_class = log_config.DEFAULT_LOGGING_CONFIG
Add a log_sync connection object (a sketch of doing this programmatically is shown after these steps)
Install the Airflow Azure dependency:
pip install apache-airflow[azure]
Restart the webserver and scheduler
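The original screenshot of the log_sync connection isn't included, so here is one hedged way to create the same connection programmatically against the Airflow metadata database (the connection type and credential fields are assumptions; the same thing can also be done through the UI under Admin > Connections):
from airflow import settings
from airflow.models import Connection

# assumed values: storage account name as login, its access key as password
conn = Connection(
    conn_id="log_sync",
    conn_type="wasb",
    login="<storage-account-name>",
    password="<storage-account-key>",
)

session = settings.Session()
# only add the connection if it does not already exist
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
session.close()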

access a file from a directory in azure blob storage through Azure Logic App

I am using a Logic App to import a set of files which are inside the directory (/devcontainer/sample1/abc.csv).
The problem here is that I cannot even locate the Azure file from my Logic App; I am getting the following error:
verify that the path exists and does not contain the blob name. List Folder is not allowed on blobs.
Screenshots for reference
The problem here is that I could not even locate the Azure file from my Logic App
The file explorer will show all the containers and blobs when you choose the blob path, and it caches the data for a period of time to keep the operation smooth. If a blob was added to the container recently, it may not be visible or selectable from the file explorer. The workaround is to click the "change connection" link and use a new connection to retrieve the data.
Is your blob connection pointing to the correct storage account? One thing you can try is, instead of providing the path directly, browse the path so that you can see which containers and blobs are present in the storage account you are trying to access.
