Query blobs in Blob storage - azure

I have serialized text data that is stored in a blob inside Azure blob storage. The text is basically key/value data. I am wondering if there is a way to easily query the blob without exploding the data into another table/database or pulling the blob into memory?

Azure Blob storage has no API to query data within the blob - it's just dumb storage. See here for the Blob Storage API. You're essentially stuck reading, deserializing and grabbing your value(s).
Perhaps Azure table storage would be a better fit for this application? That at least keeps things in the realm of an Azure storage account rather than needing to pull in a SQL Server instance.

One option you could consider is to use Data Lake Analytics, as it supports Azure Blobs as data source.
Depending on what your preferred way of accessing the data is, you can use PowerShell, .NET SDK etc. to query the data...

Related

Logic Apps - Get Blob Content Using Path

I have a event driven logic app (blob event) which reads a block blob using the path and uploads the content to Azure Data Lake. I noticed the logic app is failing with 413 (RequestEntityTooLarge) reading a large file (~6 GB). I understand that logic apps has the limitation of 1024 MB - https://learn.microsoft.com/en-us/connectors/azureblob/ but is there any work around to handle this type of situation? The alternative solution I am working on is moving this step to Azure Function and get the content from the blob. Thanks for your suggestions!
If you want to use an Azure function, I would suggest you to have a look at this at this article:
Copy data from Azure Storage Blobs to Data Lake Store
There is a standalone version of the AdlCopy tool that you can deploy to your Azure function.
So your logic app will call this function that will run a command to copy the file from blob storage to your data lake factory. I would suggest you to use a powershell function.
Another option would be to use Azure Data Factory to copy file to Data Lake:
Copy data to or from Azure Data Lake Store by using Azure Data Factory
You can create a job that copy file from blob storage:
Copy data to or from Azure Blob storage by using Azure Data Factory
There is a connector to trigger a data factory run from logic app so you may not need azure function but it seems that there is still some limitations:
Trigger Azure Data Factory Pipeline from Logic App w/ Parameter
You should consider using Azure Files connector:https://learn.microsoft.com/en-us/connectors/azurefile/
It is currently in preview, the advantage it has over Blob is that it doesn't have a size limit. The above link includes more information about it.
For the benefit of others who might be looking for a solution of this sort.
I ended up creating an Azure Function in C# as the my design dynamically parses the Blob Name and creates the ADL structure based on the blob name. I have used chunked memory streaming for reading the blob and writing it to ADL with multi threading for adderssing the Azure Functions time out of 10 minutes.

Is there a way to continuously pipe data from Azure Blob into BigQuery?

I have a bunch of files in Azure Blob storage and it's constantly getting new ones. I was wondering if there is a way for me to first take all the data I have in Blob and move it over to BigQuery and then keep a script or some job running so that all new data in there gets sent over to BigQuery?
BigQuery offers support for querying data directly from these external data sources: Google Cloud Bigtable, Google Cloud Storage, Google Drive. Not include Azure Blob storage. As Adam Lydick mentioned, as a workaround, you could copy data/files from Azure Blob storage to Google Cloud Storage (or other BigQuery-support external data sources).
To copy data from Azure Blob storage to Google Cloud Storage, you can run WebJobs (or Azure Functions), and BlobTriggerred WebJob can trigger a function when a blob is created or updated, in WebJob function you can access the blob content and write/upload it to Google Cloud Storage.
Note: we can install this library: Google.Cloud.Storage to make common operations in client code. And this blog explained how to use Google.Cloud.Storage sdk in Azure Functions.
I'm not aware of anything out-of-the-box (on Google's infrastructure) that can accomplish this.
I'd probably set up a tiny VM to:
Scan your Azure blob storage looking for new content.
Copy new content into GCS (or local disk).
Kick off a LOAD job periodically to add the new data to BigQuery.
If you used GCS instead of Azure Blob Storage, you could eliminate the VM and just have a Cloud Function that is triggered on new items being added to your GCS bucket (assuming your blob is in a form that BigQuery knows how to read). I presume this is part of an existing solution that you'd prefer not to modify though.

Is Azure Blob storage the right place to store many (small) communication logs?

I am working with a program which connects to multiple APIs, the logs for each operation (HTML/XML/Json) need to be stored for possible later review. Is it feasible to store each request/reply in an Azure blob? There can be hundreds of requests per second (all of which need storing) which vary in size and have an average size of 100kB.
Because the logs need to be searchable (by metadata) my plan is to store it in Azure Blob and put metadata (with blob locations, custom application-related request and content identifiers, etc) in an easily-searchable database.
You can store logs in the Azure table storage or Blob storage but Microsoft itself recommends using Blob storage. Azure Storage Analytics stores log data in Blob storage.
This 'Azure Storage Table Design Guide' points out several draw backs of using table storage for logs and also provides details on how to use the blob storage to store logs. Read the 'Log data anti-pattern' section in particular for this use case.

How are BLOBs stored in Azure?

I did some research to compare file storage in SharePoint and file storage in Azure.
As far as I know, SP uses a SQL database to store everything. So in fact, when I put a BLOB into SP, it ends up in a SQL, as it is mentioned here. So there are some disadvantages of storing BLOBs in SP:
Write operations are particularly problematic because the BLOB is written twice—first to the transaction log for transactional consistency then written to the appropriate table in the SQL content database.
Boiling down a lot of data, it’s pretty clear that files greater than 1MB perform better (reads and writes) when the BLOB is externalized,
Now I wonder if there are the same disadvantages with Azure BLOB Storage: Are they also end up inside a database? Do I have the same disadvantages?
In short the answer is no. The BLOB is not stored in a SQL database in Azure Storage.
The paper below gives more insights to the internals of Azure Storage. Do read:
http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf

Connect Azure Event Hubs with Data Lake Store

What is the best way to send data from Event Hubs to Data Lake Store?
I am assuming you want to ingest data from EventHubs to Data Lake Store on a regular basis. Like Nava said, you can use Azure Stream Analytics to get data from EventHub into Azure Storage Blobs. Thereafter you can use Azure Data Factory (ADF) to copy data on a scheduled basis from Blobs to Azure Data Lake Store. More details on using ADF are available here: https://azure.microsoft.com/en-us/documentation/articles/data-factory-azure-datalake-connector/. Hope this helps.
==
March 17, 2016 update.
Support for Azure Data Lake Store as an output for Azure Stream Analytics is now available. https://blogs.msdn.microsoft.com/streamanalytics/2016/03/14/integration-with-azure-data-lake-store/ . This will be the best option for your scenario.
Sachin Sheth
Program Manager, Azure Data Lake
In addition to Nava's reply: you can query data in a Windows Azure Blob Storage container with ADLA/U-SQL as well. Or you can use the Blob Store to ADL Storage copy service (see https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-copy-data-azure-storage-blob/).
One way would be to write a process to read messages from the event hub event hub API and writes them into a Data Lake Store. Data Lake SDK.
Another alternative would be to use Steam Analytics to get data from Event Hub into a Blob, and Azure Automation to run a powershell that would read the data from the blob and write into a data lake store.
Not taking credit for this, but sharing with the community:
It is also possible to archive the Events (look into properties\archive), this leaves an Avro blob.
Then using the AvroExtractor you can convert the records into Json as described in Anthony's blob:
http://anthonychu.ca/post/event-hubs-archive-azure-data-lake-analytics-usql/
One of the ways would be to connect your EventHub to Data Lake using EventHub capture functionality (Data Lake and Blob Storage is currently supported). Event Hub would write to Data Lake every N mins interval or once data size threshold is reached. It is used to optimize storage "write" operations as they are expensive on a high scale.
The data is stored in Avro format, so if you want to query it using USQL you'd have to use an Extractor class. Uri gave a good reference to it https://anthonychu.ca/post/event-hubs-archive-azure-data-lake-analytics-usql/.

Resources