Is there a way to continuously pipe data from Azure Blob into BigQuery? - azure

I have a bunch of files in Azure Blob storage and it's constantly getting new ones. I was wondering if there is a way for me to first take all the data I have in Blob and move it over to BigQuery and then keep a script or some job running so that all new data in there gets sent over to BigQuery?

BigQuery offers support for querying data directly from these external data sources: Google Cloud Bigtable, Google Cloud Storage, Google Drive. Not include Azure Blob storage. As Adam Lydick mentioned, as a workaround, you could copy data/files from Azure Blob storage to Google Cloud Storage (or other BigQuery-support external data sources).
To copy data from Azure Blob storage to Google Cloud Storage, you can run WebJobs (or Azure Functions), and BlobTriggerred WebJob can trigger a function when a blob is created or updated, in WebJob function you can access the blob content and write/upload it to Google Cloud Storage.
Note: we can install this library: Google.Cloud.Storage to make common operations in client code. And this blog explained how to use Google.Cloud.Storage sdk in Azure Functions.

I'm not aware of anything out-of-the-box (on Google's infrastructure) that can accomplish this.
I'd probably set up a tiny VM to:
Scan your Azure blob storage looking for new content.
Copy new content into GCS (or local disk).
Kick off a LOAD job periodically to add the new data to BigQuery.
If you used GCS instead of Azure Blob Storage, you could eliminate the VM and just have a Cloud Function that is triggered on new items being added to your GCS bucket (assuming your blob is in a form that BigQuery knows how to read). I presume this is part of an existing solution that you'd prefer not to modify though.

Related

Blob storage compatibility with Azure Functions

I have some e-mail attachments being saved to Azure Blob.
I am now trying to write a Azure Functions App that would connect to that blob storage, run some scripts and re-save the file.
However, when selecting a storage account for the function, I couldn't select my blob storage account.
I went on the website and it said this:
When creating a function app, you must create or link to a general-purpose Azure Storage account that supports Blob, Queue, and Table storage. Some storage accounts don't support queues and tables. These accounts include blob-only storage accounts and Azure Premium Storage.
I'm wondering, is there any workaround this? and if not, perhaps any other suggestions? I'm becoming a little lost in all the options, and which one to actually choose.
Thanks!
EDIT: Might I add I writing the function Python
I think you are overlooking the fact that you can have multiple storage accounts. In order for an Azure Function to work you need a storage account. That storage account is used to store runtime information of the Azure Function for internal purposes like state management. This storage account is subject to restrictions as you already found out. There is no workaround for that.
However, if the function you are writing needs to access another storage account it is free to do so. You just have to provide details to connect to that specific storage account. In that case you also have a clear seperation between the storage account that is used by the azure function for its internal operations and the storage account your application needs to connect and which you have total control about withouth having to worry that you break things by deleting internal used blobs/tables/queues.
You can have a blob triggered function that gets triggered when changes occur on your specific blob storage. That doesn't need to be the storage account that the azure function internally uses, which is created/selected when creating the azure function.
Here is a sample that shows how to add a blob triggered azure function in Python. MyStorageAccountAppSetting refers to an app setting that holds the connection string to the storage account that you use for storage.
The snippet from the website you are quoting is for storing the function app code itself and any related modules. It does not pertain to what your function can access when the code of your function executes.
When your function executes it will need to use the Azure Blob Storage SDK/modules to connect to your blob storage account and read the email attachments. Here's a quickstart guide for using Azure Storage with Python: Quickstart with Azure Storage Blobs SDK for Python
General-purpose v2 storage accounts support the latest Azure Storage features and incorporate all of the functionality of general-purpose v1 and Blob storage accounts here
There are more integration options with GPv2 accounts including Azure Function Triggers. See: Azure Blob storage bindings for Azure Functions
Further refer: Types of storage accounts
If Blob, based on your need, you can choose an access tier based on the frequency of access for the data (e-mail attachments)Access tiers for Azure Blob Storage - hot, cool, and archive. If General purpose storage account, its standard performance tier.

Which Azure Storage method is best for a temporary file transfer?

I want to automate the transfer of files from a website not hosted in Azure to my client’s premises.
I am considering having an API on the website send the files to Azure Blob Storage , and then having another API running at the client site, download them.
Both would make use of the Azure storage API, which I like because it is easy to implement.
The files do not need to stay in Azure and can be deleted from storage once they are downloaded.
However I am wondering if there is a faster way.
Should I be using Hot Blob Storage or File Storage perhaps?
I looked at https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers but am still unclear as to the fastest method for my use case.
I suggest you can use File share, which can be mapped to local as a mapped drive and can be easily and faster operation like read / delete.
If you choose code only, from the comparison of blob and file, they can be up to Up to 60 MiB/s, I cannot see which is faster. There is a Azure Storage Data Movement Library , which is designed for high-performance uploading, downloading and copying Azure Storage Blob and File, you can use it for your purpose.
I would recommend blob storage for this application. Logic apps can also be used to automate this pipeline based on timer triggers or some other trigger.

Logic Apps - Get Blob Content Using Path

I have a event driven logic app (blob event) which reads a block blob using the path and uploads the content to Azure Data Lake. I noticed the logic app is failing with 413 (RequestEntityTooLarge) reading a large file (~6 GB). I understand that logic apps has the limitation of 1024 MB - https://learn.microsoft.com/en-us/connectors/azureblob/ but is there any work around to handle this type of situation? The alternative solution I am working on is moving this step to Azure Function and get the content from the blob. Thanks for your suggestions!
If you want to use an Azure function, I would suggest you to have a look at this at this article:
Copy data from Azure Storage Blobs to Data Lake Store
There is a standalone version of the AdlCopy tool that you can deploy to your Azure function.
So your logic app will call this function that will run a command to copy the file from blob storage to your data lake factory. I would suggest you to use a powershell function.
Another option would be to use Azure Data Factory to copy file to Data Lake:
Copy data to or from Azure Data Lake Store by using Azure Data Factory
You can create a job that copy file from blob storage:
Copy data to or from Azure Blob storage by using Azure Data Factory
There is a connector to trigger a data factory run from logic app so you may not need azure function but it seems that there is still some limitations:
Trigger Azure Data Factory Pipeline from Logic App w/ Parameter
You should consider using Azure Files connector:https://learn.microsoft.com/en-us/connectors/azurefile/
It is currently in preview, the advantage it has over Blob is that it doesn't have a size limit. The above link includes more information about it.
For the benefit of others who might be looking for a solution of this sort.
I ended up creating an Azure Function in C# as the my design dynamically parses the Blob Name and creates the ADL structure based on the blob name. I have used chunked memory streaming for reading the blob and writing it to ADL with multi threading for adderssing the Azure Functions time out of 10 minutes.

Query blobs in Blob storage

I have serialized text data that is stored in a blob inside Azure blob storage. The text is basically key/value data. I am wondering if there is a way to easily query the blob without exploding the data into another table/database or pulling the blob into memory?
Azure Blob storage has no API to query data within the blob - it's just dumb storage. See here for the Blob Storage API. You're essentially stuck reading, deserializing and grabbing your value(s).
Perhaps Azure table storage would be a better fit for this application? That at least keeps things in the realm of an Azure storage account rather than needing to pull in a SQL Server instance.
One option you could consider is to use Data Lake Analytics, as it supports Azure Blobs as data source.
Depending on what your preferred way of accessing the data is, you can use PowerShell, .NET SDK etc. to query the data...

How to achieve Incremental deployment of Blob Files storage files to different environments of windows azure storage?

We are new to Windows azure and are developing a web application. In the beginning of the project , we have deployed complete code to different environments which actually publish complete code and uploaded blob objects to azure storage as we linked sitefinity to hold blob objects in azure storage . But now as we are in the middle of development , we are just required to upload any new blob files created which can be quite less in numbers (1 or 2 or maybe few).Now I would like to know best process to sync these blob files to different azure storage environments which is for each cloud service. So ideally we would like to update staging cloud service and staging storage first and then test there and then once no bugs are found, then will be required to update UAT and production storages as well with the changed or new blob objects.
Please help.
You can use the Azure Storage Explorer to manually upload/download blobs from storage accounts very easily. For one or two blobs, this would be an easy solution, otherwise you will need to write a tool that connects to the blob storage via an API and does the copying for you.

Resources