Changing the format of BLOB storage in Azure

How do I change the storage format of the blob? For me it is always stored as an Avro file, and I want it to be in JSON format instead.
I am also seeing the following error message in Blob storage:
"avro may not render correctly as it contains an unrecognized extension." (please refer to the error in the attached message)
I am also seeing an encrypted-looking message, and because of all this Data Explorer could not pull the data.
I am not able to pull the data into Data Explorer because of these format issues.

This is likely because you have enabled capturing of events streaming through Azure Event Hubs.
Azure Event Hubs Capture enables you to automatically deliver the streaming data in Event Hubs to an Azure Blob storage or Azure Data Lake Storage Gen1 or Gen2 account of your choice.
Captured data is written in Apache Avro format: a compact, fast, binary format that provides rich data structures with inline schema. This format is widely used in the Hadoop ecosystem, Stream Analytics, and Azure Data Factory.
More information about working with Avro files is available in this article: Exploring the captured files and working with Avro
@mmking is right in that you cannot change/convert the file format within Azure Blob Storage itself, although you can use Avro Tools to convert the files to JSON format and perform other processing.
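As an alternative to Avro Tools, here is a minimal sketch of the same Avro-to-JSON conversion in Python, assuming the fastavro package is installed and the captured file has already been downloaded locally (file names are placeholders):

# Minimal sketch: convert an Event Hubs Capture .avro file to JSON lines.
# Assumes: pip install fastavro; "capture.avro" is a local copy of the blob.
import json
from fastavro import reader

with open("capture.avro", "rb") as avro_file, open("capture.json", "w") as json_file:
    for record in reader(avro_file):
        # Capture records carry the event payload in the "Body" field as bytes;
        # decode it if your events are UTF-8 JSON.
        if isinstance(record.get("Body"), bytes):
            record["Body"] = record["Body"].decode("utf-8")
        json_file.write(json.dumps(record, default=str) + "\n")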

Related

Azure Stream Analytics: how to create a single parquet file

I have a few IoT devices sending telemetry data to Azure Event Hubs. I want to write the data to a Parquet file in Azure Data Lake so that I can query it using Azure Synapse.
I have an Azure Function triggered by the Event Hub, but I did not find a way to write the data received from a device directly to Azure Data Lake in Parquet format.
So what I am doing is this: I have a Stream Analytics job which takes input from the Event Hub and outputs to Azure Data Lake in Parquet format.
I have configured the Stream Analytics output path pattern in different ways, but it creates multiple small files within the following folders:
device-data/{unitNumber}/
device-data/{unitNumber}/{datetime:MM}/{datetime:dd}
I want to have a single Parquet file per device. Can someone help with this?
I have tried configuring the Maximum time setting, but then the data does not get written to the Parquet file until that time has elapsed. I don't want that either.
I want simple functionality: as soon as data is received from the device in the Event Hub, it should be appended to the Parquet file in Azure Data Lake.

Azure solution to save stream to blob files as parquet

I read about a few different Azure services: Event Hubs Capture, Azure Data Factory, Event Hubs, and more. I am trying to find ways, using Azure services, to do the following:
Write data to some "endpoint" or place from my application (preferably an Azure service)
The data would be batched and saved in files to Blob storage
Eventually, the format of the files in Blob storage should be Parquet
My questions are:
I read that Event Hubs Capture only saves files as Avro, so I might also consider a second pipeline that copies from the original Avro blobs to destination Parquet blobs. Is there a service in Azure that can listen to my blob container, convert all the files to Parquet and save them again (I'm not sure from the documentation whether Data Factory can do this)?
What other alternatives would you consider (apart from Kafka, which I already know about) to save a stream of data as batches of Parquet files in Blob storage?
Thank you!
For the least amount of effort you can look into a combination of an Event Hub as your endpoint and then connect Azure Stream Analytics to that. It can natively write parquet to blob: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-define-outputs#blob-storage-and-azure-data-lake-gen2
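To illustrate the "endpoint" half of that suggestion, here is a rough sketch of sending events from application code with the azure-eventhub Python package (the connection string and hub name are placeholders); Stream Analytics would then read from this hub and write Parquet batches to Blob storage or ADLS Gen2:

import json
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders: use your own namespace connection string and hub name.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    eventhub_name="telemetry",
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"deviceId": "dev-01", "temperature": 21.5})))
    producer.send_batch(batch)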

I have about 20 files of type Excel/PDF which can be downloaded from an HTTP server. I need to load these files into Azure Storage using Data Factory

I have 20 files of type Excel/PDF located on different HTTPS servers. I need to validate these files and load them into Azure Storage using Data Factory, then apply some business logic on the data and load it into an Azure SQL Database. I need to know whether I have to create a pipeline that stores this data in Azure Blob storage first and then loads it into the Azure SQL Database.
I have tried creating a Copy Data activity in Data Factory.
My ideas are as below:
No.1
Step 1: Use a Copy Activity to transfer data from the HTTP connector source into a Blob storage connector sink.
Step 2: Meanwhile, configure a Blob storage trigger to execute your logic code so that the blob data is processed as soon as it is collected into Blob storage (see the sketch below).
Step 3: Use a Copy Activity to transfer data from the Blob storage connector source into a SQL Database connector sink.
No.2:
Step 1: Use a Copy Activity to transfer data from the HTTP connector source into a SQL Database connector sink.
Step 2: Meanwhile, you could configure a stored procedure to add your logic steps. The logic will be executed before the data is inserted into the table.
I think both methods are feasible. With No.1, the business logic is freer and more flexible; with No.2, it is more convenient, but you are limited by the syntax of stored procedures. You could pick whichever solution you want.
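As a hypothetical sketch of the blob-trigger piece of No.1, assuming the Azure Functions Python v2 programming model, with pandas/openpyxl used for the Excel parsing; the container name, connection setting and validation step are illustrative assumptions, not part of the original answer:

import io
import logging
import azure.functions as func
import pandas as pd  # requires pandas + openpyxl for .xlsx files

app = func.FunctionApp()

# Fires whenever a new file lands in the "incoming-files" container.
# "AzureWebJobsStorage" is the app setting holding the storage connection string.
@app.blob_trigger(arg_name="blob", path="incoming-files/{name}",
                  connection="AzureWebJobsStorage")
def process_file(blob: func.InputStream):
    logging.info("Processing %s", blob.name)
    if blob.name.lower().endswith(".xlsx"):
        df = pd.read_excel(io.BytesIO(blob.read()), engine="openpyxl")
        # ... apply your validation / business logic on df here ...
        # then write the cleaned rows to Azure SQL (e.g. via pyodbc), or back
        # to a "processed" container for the second Copy Activity to pick up.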
Excel and PDF are not supported yet. Based on the linked documentation, only a limited set of formats are supported by ADF directly.
I tested with a csv file and got random characters back.
You could refer to this case to read Excel files in ADF: How to read files with .xlsx and .xls extension in Azure data factory?

How to process telemetry JSON messages in Azure Data Lake Gen2?

I have simulated devices sending messages to IoT Hub blob storage, and from there I am copying the data (encoded in JSON format) to Azure Data Lake Gen2 with a pipeline created in Azure Data Factory.
How do I convert these JSON output files to CSV files so they can be processed by the data lake engine? Can't I process all the incoming JSON telemetry directly in Azure Data Lake?
There are 3 official built-in extractors that allow you to analyze data contained in CSV, TSV or text files.
But Microsoft also released some additional sample extractors on their Azure GitHub repo that deal with XML, JSON and Avro files. I have used the JSON extractor in production, as it is really stable and useful.
The JSON extractor treats the entire input file as a single JSON document. If you have a JSON document per line, see the next section. The columns that you try to extract will be extracted from the document; in this case, I'm extracting the _id and Revision properties. Note that one of these may itself be a nested object, in which case you can use the JSON UDFs for subsequent processing.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

// Define the schema of the file; all extracted columns must be mapped
@myRecords =
    EXTRACT
        _id string,
        Revision string
    FROM @"sampledata/json/{*}.json"
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

// Write the extracted columns out as CSV (illustrative output path)
OUTPUT @myRecords
TO "/output/records.csv"
USING Outputters.Csv(outputHeader : true);

Connect Azure Event Hubs with Data Lake Store

What is the best way to send data from Event Hubs to Data Lake Store?
I am assuming you want to ingest data from EventHubs to Data Lake Store on a regular basis. Like Nava said, you can use Azure Stream Analytics to get data from EventHub into Azure Storage Blobs. Thereafter you can use Azure Data Factory (ADF) to copy data on a scheduled basis from Blobs to Azure Data Lake Store. More details on using ADF are available here: https://azure.microsoft.com/en-us/documentation/articles/data-factory-azure-datalake-connector/. Hope this helps.
==
March 17, 2016 update.
Support for Azure Data Lake Store as an output for Azure Stream Analytics is now available. https://blogs.msdn.microsoft.com/streamanalytics/2016/03/14/integration-with-azure-data-lake-store/ . This will be the best option for your scenario.
Sachin Sheth
Program Manager, Azure Data Lake
In addition to Nava's reply: you can query data in a Windows Azure Blob Storage container with ADLA/U-SQL as well. Or you can use the Blob Store to ADL Storage copy service (see https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-copy-data-azure-storage-blob/).
One way would be to write a process that reads messages from the Event Hub using the Event Hubs API and writes them into a Data Lake Store using the Data Lake SDK.
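As a rough illustration of that approach using today's Python SDKs (azure-eventhub plus azure-storage-file-datalake, which targets ADLS Gen2 rather than the Gen1 store this answer originally referred to; all names and connection strings are placeholders):

from azure.eventhub import EventHubConsumerClient
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: supply your own connection strings, hub name and filesystem.
lake = DataLakeServiceClient.from_connection_string("<storage-connection-string>")
filesystem = lake.get_file_system_client("telemetry")

def on_event(partition_context, event):
    # Write each event body to its own file, keyed by partition and sequence number.
    path = f"raw/{partition_context.partition_id}/{event.sequence_number}.json"
    filesystem.get_file_client(path).upload_data(event.body_as_str(), overwrite=True)

consumer = EventHubConsumerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    consumer_group="$Default",
    eventhub_name="telemetry",
)
with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")  # read from the beginning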
Another alternative would be to use Stream Analytics to get the data from the Event Hub into a blob, and Azure Automation to run a PowerShell script that reads the data from the blob and writes it into a Data Lake Store.
Not taking credit for this, but sharing with the community:
It is also possible to archive the events (look into properties\archive); this leaves an Avro blob.
Then, using the AvroExtractor, you can convert the records into JSON as described in Anthony's blog:
http://anthonychu.ca/post/event-hubs-archive-azure-data-lake-analytics-usql/
One way would be to connect your Event Hub to Data Lake using the Event Hubs Capture functionality (Data Lake and Blob Storage are currently supported). Event Hubs will write to Data Lake every N minutes or once a data size threshold is reached. This is used to optimize storage "write" operations, as they are expensive at high scale.
The data is stored in Avro format, so if you want to query it using U-SQL you'd have to use an extractor class. Uri gave a good reference for it: https://anthonychu.ca/post/event-hubs-archive-azure-data-lake-analytics-usql/.
