Duplicate Blob Created Events When Writing to Azure Blob Storage from Azure Databricks

We are using an Azure Storage Account (Blob, StorageV2) with a single container in it. We are also using Azure Data Factory to trigger data copy pipelines from blobs (.tar.gz) created in the container. The trigger works fine when creating the blobs from an Azure App Service or by manually uploading via the Azure Storage Explorer. But when creating the blob from a Notebook on Azure Databricks, we get two (2) events for every blob created (same parameters for both events). The code for creating the blob from the notebook resembles:
dbutils.fs.cp(
    "/mnt/data/tmp/file.tar.gz",
    "/mnt/data/out/file.tar.gz"
)
The tmp folder is only used to assemble the package, and the event trigger is attached to the out folder. We also tried dbutils.fs.mv, but got the same result. The trigger rules in Azure Data Factory are:
Blob path begins with: out/
Blob path ends with: .tar.gz
The container name is data.
We did find some similar posts about zero-length files, but at least we cannot see any such files anywhere (in case they are some by-product of dbutils).
As mentioned, just manually uploading file.tar.gz works fine - a single event is triggered.

We had to revert to uploading the files from Databricks to the Blob Storage using the azure-storage-blob library. Kind of a bummer, but it works now as expected. Just in case anyone else runs into this.
More information:
https://learn.microsoft.com/en-gb/azure/storage/blobs/storage-quickstart-blobs-python
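A minimal sketch of that workaround, assuming the azure-storage-blob package is installed on the cluster; the connection string is a placeholder, and /dbfs/... is the local FUSE path Databricks exposes for the DBFS mount used above:

# Upload the assembled package with azure-storage-blob instead of dbutils.fs.cp.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob_client = service.get_blob_client(container="data", blob="out/file.tar.gz")

# Read the package from the DBFS mount via the local /dbfs path and upload it.
with open("/dbfs/mnt/data/tmp/file.tar.gz", "rb") as f:
    blob_client.upload_blob(f, overwrite=True)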

Related

Map Azure Blob Storage files to a custom activity in Azure Data Factory

I have a container with 100 binary files. Is there a way to run a custom activity (can be a .net program or, ideally, a container) for each one of these files using Azure Data Factory?
I created a Batch account in the Azure portal and a pool in that Batch account. I then created a pipeline in ADF and added a Custom activity with the following details: the Azure Batch linked service is AzureBatch2, and AzureBlobStorage1 is the blob storage linked service pointing at the binaryfile storage account where the bin files are stored. In the Custom1 activity settings I set the command to
cmd
I started a debug run and it ran successfully. The Custom activity created an adfjobs folder in the binaryfile storage account. It is working fine from my end, kindly check from your end.
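If it helps, the program launched by the Custom activity's command can itself enumerate the binary files and handle them one by one. Below is a rough Python sketch of such a worker script, assuming the azure-storage-blob package; the connection string, container name and process_file body are placeholders rather than anything from the setup above.

# Hypothetical worker script run by the ADF Custom activity on the Batch pool.
import os
from azure.storage.blob import ContainerClient

def process_file(name, data):
    # Placeholder for whatever the .NET program or container would do per file.
    print(f"processing {name}: {len(data)} bytes")

def main():
    container = ContainerClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"],  # passed in via the activity/pool configuration
        container_name="binaryfiles",
    )
    for blob in container.list_blobs():
        if blob.name.endswith(".bin"):
            data = container.download_blob(blob.name).readall()
            process_file(blob.name, data)

if __name__ == "__main__":
    main()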

Trigger Azure data factory pipeline - Blob upload ADLS Gen2 (programmatically)

We are uploading files into Azure Data Lake Storage using the Azure SDK for Java. After a file is uploaded, an Azure Data Factory pipeline needs to be triggered; a Blob Created trigger is added to the pipeline.
The main problem is that after each file upload the pipeline gets triggered twice.
To upload a file into ADLS Gen2, Azure provides a different SDK than the one for conventional Blob storage.
The SDK uses the package azure-storage-file-datalake.
DataLakeFileSystemClient - to get container
DataLakeDirectoryClient.createFile - to create a file. //this call may be raising blob created event
DataLakeFileClient.uploadFromFile - to upload file //this call may also be raising blob created event
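For comparison, the same two-step flow with the Python azure-storage-file-datalake package looks roughly like the sketch below; the account, credential and paths are placeholders.

# Create-then-upload flow, mirroring the Java calls listed above.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)
file_system = service.get_file_system_client("data")         # get the container
file_client = file_system.get_file_client("out/file.tar.gz")

file_client.create_file()                                     # this call may be raising a Blob Created event
with open("/tmp/file.tar.gz", "rb") as f:
    file_client.upload_data(f, overwrite=True)                # this call may also be raising one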
I think the ADF trigger has not been updated to capture Blob Created events from ADLS Gen2 appropriately.
Is there any option to achieve this? There are restrictions in my org against using Azure Functions; otherwise an Azure Function could be triggered by a Storage Queue or Service Bus message and the ADF pipeline could be started using the Data Factory REST API.
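For completeness, starting a pipeline through the Data Factory REST API does not require an Azure Function; any small client can call it. A minimal sketch using the azure-mgmt-datafactory Python package, with subscription, resource group, factory and pipeline names as placeholders:

# Start an ADF pipeline run through the Data Factory management API.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
run = adf_client.pipelines.create_run(
    resource_group_name="my-rg",
    factory_name="my-data-factory",
    pipeline_name="CopyOnUpload",
    parameters={"sourcePath": "out/file.tar.gz"},
)
print(run.run_id)  # keep the run id if you want to poll the run status later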
You could try Azure Logic Apps with a blob trigger and a data factory action:
Trigger: When a blob is added or modified (properties only):
This operation triggers a flow when one or more blobs are added or modified in a container. This trigger will only fetch the file metadata. To get the file content, you can use the "Get file content" operation. The trigger does not fire if a file is added/updated in a subfolder. If it is required to trigger on subfolders, multiple triggers should be created.
Action: Get a pipeline run
Get a particular pipeline run execution
Hope this helps.

Is there a way to continuously pipe data from Azure Blob into BigQuery?

I have a bunch of files in Azure Blob storage and it's constantly getting new ones. I was wondering if there is a way for me to first take all the data I have in Blob and move it over to BigQuery and then keep a script or some job running so that all new data in there gets sent over to BigQuery?
BigQuery supports querying data directly from these external data sources: Google Cloud Bigtable, Google Cloud Storage and Google Drive. Azure Blob storage is not included. As Adam Lydick mentioned, as a workaround you could copy the data/files from Azure Blob storage to Google Cloud Storage (or another BigQuery-supported external data source).
To copy data from Azure Blob storage to Google Cloud Storage, you can run a WebJob (or an Azure Function): a blob-triggered WebJob fires a function when a blob is created or updated, and inside that function you can read the blob content and write/upload it to Google Cloud Storage.
Note: you can install the Google.Cloud.Storage library to perform common operations in client code, and there are blog posts explaining how to use the Google.Cloud.Storage SDK in Azure Functions.
I'm not aware of anything out-of-the-box (on Google's infrastructure) that can accomplish this.
I'd probably set up a tiny VM to:
Scan your Azure blob storage looking for new content.
Copy new content into GCS (or local disk).
Kick off a LOAD job periodically to add the new data to BigQuery.
If you used GCS instead of Azure Blob Storage, you could eliminate the VM and just have a Cloud Function that is triggered on new items being added to your GCS bucket (assuming your blob is in a form that BigQuery knows how to read). I presume this is part of an existing solution that you'd prefer not to modify though.
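A rough sketch of that tiny-VM loop, assuming the azure-storage-blob, google-cloud-storage and google-cloud-bigquery packages; the container, bucket and table names are placeholders, and keeping track of which blobs are actually new is left out.

# Copy blobs from Azure Blob storage to GCS, then load each object into BigQuery.
from azure.storage.blob import ContainerClient
from google.cloud import bigquery, storage

azure_container = ContainerClient.from_connection_string(
    "<azure-connection-string>", container_name="incoming")
gcs_bucket = storage.Client().bucket("my-transfer-bucket")
bq = bigquery.Client()

for blob in azure_container.list_blobs():
    # Copy the blob content over to GCS.
    data = azure_container.download_blob(blob.name).readall()
    gcs_bucket.blob(blob.name).upload_from_string(data)

    # Kick off a LOAD job for the object we just copied (assumes newline-delimited JSON).
    load_job = bq.load_table_from_uri(
        f"gs://my-transfer-bucket/{blob.name}",
        "my_dataset.my_table",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON),
    )
    load_job.result()  # wait for the load to finish before moving on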

How to achieve incremental deployment of blob storage files to different environments of Windows Azure storage?

We are new to Windows Azure and are developing a web application. At the beginning of the project we deployed the complete code to the different environments, which published the complete code and uploaded the blob objects to Azure Storage, since we configured Sitefinity to hold its blob objects in Azure Storage. Now that we are in the middle of development, we only need to upload any newly created blob files, which can be quite few in number (1 or 2, or maybe a handful). I would like to know the best process to sync these blob files to the different Azure Storage environments, one for each cloud service. Ideally we would like to update the staging cloud service and staging storage first, test there, and once no bugs are found, update the UAT and production storage accounts with the changed or new blob objects.
Please help.
You can use the Azure Storage Explorer to manually upload/download blobs from storage accounts very easily. For one or two blobs, this would be an easy solution, otherwise you will need to write a tool that connects to the blob storage via an API and does the copying for you.
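Such a tool can stay very small; below is a minimal sketch using the azure-storage-blob package, where the connection strings, container name and the list of changed blobs are placeholders.

# Copy a handful of new/changed blobs from the staging storage account to another environment.
from azure.storage.blob import BlobServiceClient

source = BlobServiceClient.from_connection_string("<staging-connection-string>")
target = BlobServiceClient.from_connection_string("<uat-connection-string>")

src_container = source.get_container_client("sitefinity-assets")
dst_container = target.get_container_client("sitefinity-assets")

changed_blobs = ["images/new-logo.png", "docs/release-notes.pdf"]  # the 1-2 new or changed files

for name in changed_blobs:
    # Server-side copy: the target account pulls the blob from the source URL,
    # so the source blob must be reachable by the target (e.g. via a SAS URL).
    src_url = src_container.get_blob_client(name).url
    dst_container.get_blob_client(name).start_copy_from_url(src_url)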

Azure - Check if a new blob is uploaded to a container

Are there ways to check if a container in Azure has a new blob (doesn't matter which blob it is)? LastModifiedUtc does not seem to change if a blob is dropped into the container
You should use a BlobTrigger function in an App Service resource; see the Azure Functions blob trigger documentation.
Windows Azure Blob Storage does not provide this functionality out of the box. You would need to handle this on your end. A few things come to mind (just thinking out loud):
If the blobs are uploaded by your application (and not through 3rd-party tools), then after each upload you could update the container properties (for example, add/update a metadata entry with information about the last blob uploaded). You could also make an entry in Azure Table Storage and keep updating it with information about the last blob uploaded. As noted, this method only works if all blobs are uploaded through your application.
You could periodically iterate through the blobs in the container and sort them by last-modified date. This works fine for a container with a small number of blobs; if there are many blobs (say tens of thousands), you will end up fetching a very long list, because blob storage only sorts blobs by name.
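Both ideas are straightforward with the current azure-storage-blob Python package (which post-dates this answer); a minimal sketch with placeholder names:

# 1. Record the last uploaded blob on the container itself when your app uploads.
# 2. Periodically scan the container and sort by last-modified time.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("uploads")

def upload_and_record(name, data):
    container.upload_blob(name, data, overwrite=True)
    container.set_container_metadata({"last_uploaded_blob": name})

def newest_blob():
    blobs = list(container.list_blobs())
    return max(blobs, key=lambda b: b.last_modified) if blobs else None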
