Azure solution to save stream to blob files as parquet

I have read about a few different Azure services - Event Hubs Capture, Azure Data Factory, Event Hubs, and more. I am trying to find ways, using Azure services, to:
Write data to some "endpoint" or place from my application (preferably an Azure service)
Have the data batched and saved as files in Blob storage
End up with Parquet as the format of those blob files
My questions are:
I read that Event Hubs Capture only saves files as Avro, so I might also consider a second pipeline that copies from the original Avro blobs to destination Parquet blobs. Is there an Azure service that can listen to my blob container, convert all the files to Parquet, and save them again (I'm not sure from the documentation whether Data Factory can do this)?
What other alternatives would you consider (apart from Kafka, which I already know about) for saving a stream of data as batches of Parquet files in Blob storage?
Thank you!

For the least amount of effort you can look into a combination of an Event Hub as your endpoint with Azure Stream Analytics connected to it. Stream Analytics can natively write Parquet to blob storage: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-define-outputs#blob-storage-and-azure-data-lake-gen2
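
For illustration, here is a minimal sketch of the application ("endpoint") side, assuming the Python azure-eventhub SDK; the connection string, hub name and event fields are placeholders, and the Stream Analytics job that turns the stream into Parquet is configured separately:

    # Send JSON events to an Event Hub; a Stream Analytics job reading from it
    # can then write the stream out as Parquet to Blob / ADLS Gen2.
    # Requires: pip install azure-eventhub
    import json
    from azure.eventhub import EventHubProducerClient, EventData

    producer = EventHubProducerClient.from_connection_string(
        "<event-hubs-namespace-connection-string>",
        eventhub_name="telemetry",                       # hypothetical hub name
    )

    events = [{"deviceId": "dev-1", "temperature": 21.5},
              {"deviceId": "dev-2", "temperature": 19.8}]

    with producer:
        batch = producer.create_batch()
        for e in events:
            batch.add(EventData(json.dumps(e)))          # serialize each event as JSON
        producer.send_batch(batch)                       # one network call per batch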

Related

Azure-Databricks media files (video) ingestion possible?

Is it possible to take video files that are in an Azure storage container and ingest them into Databricks DBFS the way you would ingest structured data files?
Specifically, is it possible to partition very large video files and ingest them into Databricks?
The goal would be to ingest into DBFS, or to use the video from external Azure storage, in order to extract metadata and then go from there.

Changing the format of BLOB storage in Azure

How can I change the storage format of the blob? For me it is always storing Avro files; I want to change it to JSON format.
I am also seeing the following error message in Blob storage:
avro may not render correctly as it contains an unrecognized extension. (Please refer to the error in the attached message.)
I am also seeing an encrypted message, and because of all this Data Explorer could not pull the data.
I am not able to pull the data into Data Explorer because of these format issues.
This is likely because you might have enabled capturing of events streaming through Azure Event Hubs.
Azure Event Hubs Capture enables you to automatically deliver the streaming data in Event Hubs to an Azure Blob storage or Azure Data Lake Storage Gen1 or Gen 2 account of your choice.
Captured data is written in Apache Avro format: a compact, fast, binary format that provides rich data structures with inline schema. This format is widely used in the Hadoop ecosystem, Stream Analytics, and Azure Data Factory.
More information about working with Avro files is available in this article: Exploring the captured files and working with Avro
#mmking is right in that you cannot change/convert the file formats within Azure Blob Storage, although you can use Avro Tools to convert the file to JSON format and perform other processing.
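
As a rough illustration of that processing step, here is a small Python sketch that uses fastavro (rather than the Java Avro Tools mentioned above) to dump the records of a capture blob as JSON lines, assuming the event bodies are UTF-8 text such as JSON; the connection string, container and blob path are placeholders:

    # Download an Event Hubs Capture .avro blob and write its event bodies as lines of text.
    # Requires: pip install azure-storage-blob fastavro
    import io
    from azure.storage.blob import BlobClient
    from fastavro import reader

    blob = BlobClient.from_connection_string(
        "<storage-account-connection-string>",
        container_name="eventhub-capture",                  # hypothetical container
        blob_name="ns/hub/0/2024/01/01/00/00/00.avro",      # hypothetical capture path
    )
    data = blob.download_blob().readall()                   # fetch the Avro blob

    with open("capture.json", "w") as out:
        for record in reader(io.BytesIO(data)):
            body = record["Body"]                           # original event payload
            if isinstance(body, bytes):
                body = body.decode("utf-8")
            out.write(body + "\n")                          # one event per line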

Azure stream analytics how to create a single parquet file

I have a few IoT devices sending telemetry data to Azure Event Hubs. I want to write the data to a Parquet file in Azure Data Lake so that I can query it using Azure Synapse.
I have an Azure Function triggered by the Event Hub, but I did not find a way to directly write the data received from a device to Azure Data Lake in Parquet format.
So what I am doing instead is running a Stream Analytics job, which has input from the Event Hub and output to Azure Data Lake in Parquet format.
I have configured the Stream Analytics output path pattern in different ways, but it creates multiple small files within the following folders:
device-data/{unitNumber}/
device-data/{unitNumber}/{datetime:MM}/{datetime:dd}
I want to have a single Parquet file per device. Can someone help with this?
I have tried configuring the maximum time window, but the data does not get written to the Parquet file until that time has elapsed, which I don't want either.
I want simple functionality: as soon as data is received from a device by the Event Hub, it should be appended to a Parquet file in Azure Data Lake.

Stream Analytics: Dynamic output path based on message payload

I am working on an IoT analytics solution which consumes Avro formatted messages fired at an Azure IoT Hub and (hopefully) uses Stream Analytics to store messages in Data Lake and blob storage. A key requirement is the Avro containers must appear exactly the same in storage as they did when presented to the IoT Hub, for the benefit of downstream consumers.
I am running into a limitation in Stream Analytics with granular control over individual file creation. When setting up a new output stream path, I can only provide date/day and hour in the path prefix, resulting in one file for every hour instead of one file for every message received. The customer requires separate blob containers for each device and separate blobs for each event. Similarly, the Data Lake requirement dictates at least a sane naming convention that is delineated by device, with separate files for each event ingested.
Has anyone successfully configured Stream Analytics to create a new file every time it pops a message off of the input? Is this a hard product limitation?
Stream Analytics is indeed oriented toward efficient processing of large streams.
For your use case, you need an additional component to implement your custom logic.
Stream Analytics can output to Blob, Event Hub, Table Store or Service Bus. Another option is to use the new Iot Hub Routes to route directly to an Event Hub or a Service Bus Queue or Topic.
From there you can write an Azure Function (or, from Blob or Table Storage, a custom Data Factory activity) and use the Data Lake Store SDK to write files with the logic that you need.
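
A hedged sketch of that custom logic, assuming an Azure Function with an Event Hub trigger in Python and the ADLS Gen2 SDK (azure-storage-file-datalake) rather than the older Data Lake Store Gen1 SDK; the filesystem name, app setting and payload shape are placeholders:

    # Event Hub-triggered function that writes one file per event, grouped by device.
    # Requires: azure-functions and azure-storage-file-datalake in requirements.txt
    import json
    import os
    import uuid

    import azure.functions as func
    from azure.storage.filedatalake import DataLakeServiceClient


    def main(event: func.EventHubEvent):
        payload = event.get_body().decode("utf-8")           # raw event as it was sent
        device_id = json.loads(payload).get("deviceId", "unknown")

        service = DataLakeServiceClient.from_connection_string(
            os.environ["ADLS_CONNECTION_STRING"]             # hypothetical app setting
        )
        fs = service.get_file_system_client("telemetry")     # hypothetical filesystem

        path = f"{device_id}/{uuid.uuid4()}.json"            # one file per event, per device
        fs.get_file_client(path).upload_data(payload, overwrite=True)

Writing one file per event gives the per-device, per-message layout asked for here, but keep in mind that many tiny writes can become expensive at high message rates.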

Connect Azure Event Hubs with Data Lake Store

What is the best way to send data from Event Hubs to Data Lake Store?
I am assuming you want to ingest data from EventHubs to Data Lake Store on a regular basis. Like Nava said, you can use Azure Stream Analytics to get data from EventHub into Azure Storage Blobs. Thereafter you can use Azure Data Factory (ADF) to copy data on a scheduled basis from Blobs to Azure Data Lake Store. More details on using ADF are available here: https://azure.microsoft.com/en-us/documentation/articles/data-factory-azure-datalake-connector/. Hope this helps.
==
March 17, 2016 update.
Support for Azure Data Lake Store as an output for Azure Stream Analytics is now available. https://blogs.msdn.microsoft.com/streamanalytics/2016/03/14/integration-with-azure-data-lake-store/ . This will be the best option for your scenario.
Sachin Sheth
Program Manager, Azure Data Lake
In addition to Nava's reply: you can query data in a Windows Azure Blob Storage container with ADLA/U-SQL as well. Or you can use the Blob Store to ADL Storage copy service (see https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-copy-data-azure-storage-blob/).
One way would be to write a process that reads messages from the Event Hubs API and writes them into Data Lake Store using the Data Lake SDK.
Another alternative would be to use Stream Analytics to get the data from the Event Hub into a blob, and Azure Automation to run a PowerShell script that reads the data from the blob and writes it into Data Lake Store.
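
As a rough sketch of the first of those alternatives, assuming the current Python SDKs and ADLS Gen2 rather than the original Data Lake Store SDK, a standalone consumer process could look like this; the connection strings, names and batch size are placeholders:

    # Read events from the Event Hub and flush them to the lake in small batches.
    # Requires: pip install azure-eventhub azure-storage-file-datalake
    import time
    from azure.eventhub import EventHubConsumerClient
    from azure.storage.filedatalake import DataLakeServiceClient

    consumer = EventHubConsumerClient.from_connection_string(
        "<event-hubs-namespace-connection-string>",
        consumer_group="$Default",
        eventhub_name="telemetry",                          # hypothetical hub
    )
    lake = DataLakeServiceClient.from_connection_string(
        "<storage-account-connection-string>"
    ).get_file_system_client("raw")                         # hypothetical filesystem

    buffer = []

    def on_event(partition_context, event):
        buffer.append(event.body_as_str())
        if len(buffer) >= 100:                              # arbitrary batch size
            path = f"eventhub/{int(time.time())}.jsonl"     # one file per flushed batch
            lake.get_file_client(path).upload_data("\n".join(buffer), overwrite=True)
            buffer.clear()
            partition_context.update_checkpoint(event)      # record progress

    with consumer:
        consumer.receive(on_event=on_event, starting_position="-1")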
Not taking credit for this, but sharing with the community:
It is also possible to archive the events (look into properties\archive); this leaves an Avro blob.
Then, using the AvroExtractor, you can convert the records into JSON as described in Anthony's blog:
http://anthonychu.ca/post/event-hubs-archive-azure-data-lake-analytics-usql/
One of the ways would be to connect your Event Hub to Data Lake using the Event Hubs Capture functionality (Data Lake and Blob Storage are currently supported). Event Hubs would write to Data Lake every N minutes or once a data size threshold is reached. This is done to optimize storage "write" operations, as they are expensive at high scale.
The data is stored in Avro format, so if you want to query it using U-SQL you'd have to use an Extractor class. Uri gave a good reference for this: https://anthonychu.ca/post/event-hubs-archive-azure-data-lake-analytics-usql/.
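
If U-SQL is not a requirement, the same conversion can also be sketched in Python, which covers the Avro-to-Parquet second pipeline from the original question; this assumes the capture blobs have been downloaded locally and that each event body is a flat JSON object (the folder and file names are placeholders):

    # Convert downloaded Event Hubs Capture .avro files into a single Parquet file.
    # Requires: pip install fastavro pandas pyarrow
    import glob
    import json

    import pandas as pd
    from fastavro import reader

    rows = []
    for path in glob.glob("capture/*.avro"):                # hypothetical local folder
        with open(path, "rb") as f:
            for record in reader(f):
                rows.append(json.loads(record["Body"]))     # original event payload

    pd.DataFrame(rows).to_parquet("events.parquet", index=False)  # pandas writes via pyarrow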
