Azure-Databricks media files (video) ingestion possible? - apache-spark

Is it possible to take video files that sit in an Azure storage container and ingest them into Databricks DBFS the way you would ingest structured data files?
Specifically, is it possible to partition very large video files and ingest them into Databricks?
The goal is to ingest the videos into DBFS, or to work with them directly from external Azure storage, extract metadata, and go from there.
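As a rough illustration (not a definitive answer), Spark's built-in binaryFile data source can list and load files from a mounted container without first copying them into DBFS; note that it reads each file as a single row, so it does not split very large videos across partitions. The mount point and glob pattern below are placeholders.

    # Minimal sketch: load video files from a mounted Azure storage container
    # into a Spark DataFrame using the built-in "binaryFile" source.
    # "/mnt/videos" is a hypothetical mount point.
    df = (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "*.mp4")   # only pick up MP4 files
        .load("/mnt/videos")
    )

    # Each row carries the file path, modification time, size, and raw bytes,
    # which is enough to extract basic metadata without copying data into DBFS.
    df.select("path", "modificationTime", "length").show(truncate=False)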

Related

Moving data from Teradata to Snowflake

I'm trying to move data from Teradata to Snowflake. I have created a process that runs TPT scripts to generate files for each table.
The files are also split to achieve concurrency while running COPY INTO in Snowflake.
I need to understand the best way to move those files from an on-prem Linux machine to Azure ADLS, considering the files are terabytes in size.
Does Azure provide any mechanism to move these files, or can we create files on ADLS directly from Teradata?
If you have Azure Blob Storage or ADLS Gen2, the best approach is to load the data into Snowflake via an external stage or external table: land the files in blob storage, create the external stage/table over them, and then load the data into Snowflake.
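As a rough sketch of that route (using the snowflake-connector-python package; every account name, container URL, credential, and table name below is a placeholder), the stage-plus-COPY INTO flow could look like this:

    # Hedged sketch: point an external stage at the blob container holding the
    # TPT export files, then load them in parallel with COPY INTO.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",        # placeholder
        user="my_user",              # placeholder
        password="...",              # placeholder
        warehouse="my_wh",
        database="my_db",
        schema="my_schema",
    )
    cur = conn.cursor()

    # External stage over the ADLS Gen2 / Blob container with the exported files.
    cur.execute("""
        CREATE OR REPLACE STAGE teradata_stage
          URL = 'azure://myaccount.blob.core.windows.net/teradata-extracts'
          CREDENTIALS = (AZURE_SAS_TOKEN = '...')
          FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|')
    """)

    # The split files load concurrently; PATTERN selects one table's files.
    cur.execute("COPY INTO my_table FROM @teradata_stage PATTERN = '.*my_table.*'")

    cur.close()
    conn.close()

For the on-prem-to-ADLS hop itself, AzCopy and Azure Data Factory's self-hosted integration runtime are commonly used options for terabyte-scale transfers.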

Azure stream analytics how to create a single parquet file

I have a few IoT devices sending telemetry data to Azure Event Hubs. I want to write the data to Parquet files in Azure Data Lake so that I can query it using Azure Synapse.
I have an Azure Function triggered by the Event Hub, but I did not find a direct way to write the data received from a device to Azure Data Lake in Parquet format.
So what I am doing instead is running a Stream Analytics job whose input is the Event Hub and whose output is Azure Data Lake in Parquet format.
I have configured the Stream Analytics output path pattern in different ways, but it creates multiple small files within folders such as:
device-data/{unitNumber}/
device-data/{unitNumber}/{datetime:MM}/{datetime:dd}
I want to have a single Parquet file per device. Can someone help with this?
I have tried configuring the Maximum time batching window, but then the data does not get written to the Parquet file until that time has elapsed, which I also don't want.
I want simple functionality: as soon as data arrives from a device in the Event Hub, it should be appended to a Parquet file in Azure Data Lake.
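For context on why this is hard: Parquet files cannot be appended to in place once written, so a write-on-every-event, single-file-per-device layout is not something the format (or Stream Analytics) supports directly. One common workaround is to let Stream Analytics write its small files and periodically compact them per device with a Spark job; a minimal sketch (paths and the unit number are placeholders) follows.

    # Hedged sketch: compact the small Parquet files for one device into a
    # single file in a separate "compacted" folder. Paths are placeholders.
    unit_number = "1234"   # placeholder device id
    src = f"abfss://container@account.dfs.core.windows.net/device-data/{unit_number}/"
    dst = f"abfss://container@account.dfs.core.windows.net/device-data-compacted/{unit_number}/"

    (
        spark.read.parquet(src)      # read all the small files for one device
        .coalesce(1)                 # force a single output file
        .write.mode("overwrite")
        .parquet(dst)
    )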

Azure solution to save stream to blob files as parquet

I have read about a few different Azure services: Event Hubs Capture, Azure Data Factory, Event Hubs, and more. I am trying to find ways, using Azure services, to:
Write data to some "endpoint" or place from my application (preferably an Azure service)
Have the data batched and saved in files to Blob storage
Eventually have the Blob files in Parquet format
My questions are:
I read that Event Hubs Capture only saves files as Avro, so I might also need a second pipeline that copies the original Avro blobs to destination Parquet blobs. Is there an Azure service that can listen to my blob container, convert all the files to Parquet, and save them again (I'm not sure from the documentation whether Data Factory can do this)?
What other alternatives would you consider (apart from Kafka, which I know about) to save a stream of data as batches of Parquet files in Blob storage?
Thank you!
For the least amount of effort, look into a combination of an Event Hub as your endpoint with Azure Stream Analytics connected to it. Stream Analytics can natively write Parquet to blob storage: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-define-outputs#blob-storage-and-azure-data-lake-gen2
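On the application side, the "endpoint" piece can be as small as a producer that pushes events to the Event Hub; Stream Analytics (configured separately) then batches them out as Parquet. A minimal sketch using the azure-eventhub Python package (connection string and hub name are placeholders):

    # Hedged sketch: send one JSON event to an Event Hub; Stream Analytics
    # handles the batching and Parquet output downstream.
    import json
    from azure.eventhub import EventHubProducerClient, EventData

    producer = EventHubProducerClient.from_connection_string(
        conn_str="Endpoint=sb://...",   # placeholder connection string
        eventhub_name="telemetry",      # placeholder hub name
    )

    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps({"deviceId": "dev-1", "value": 42})))
        producer.send_batch(batch)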

What is the easiest and best method to unzip files in Azure Data Lake Gen1 without moving them to the Azure Databricks file system?

What is the best method to unzip files in Azure Data Lake Gen1 without moving them to the Azure Databricks file system? Currently we are using Azure Databricks for compute and ADLS for storage, and we are restricted from moving the data into DBFS.
We have already mounted ADLS in DBFS and are not sure how to proceed.
Unfortunately, zip files are not supported in Databricks; the reason is that Hadoop has no support for zip as a compression codec. While text files in GZip, BZip2, and other supported compression formats are automatically decompressed by Spark as long as they have the right file extension, you must perform additional steps to read zip files. The sample in the Databricks documentation does the unzip on the driver node, using unzip at the OS level (Ubuntu).
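A minimal sketch of that driver-side pattern over the existing ADLS mount, using Python's zipfile module instead of the OS-level unzip tool (the paths are placeholders; only the mount point lives under DBFS, the data itself stays in ADLS):

    # Hedged sketch: unzip an archive on the driver node via the /dbfs FUSE
    # path of the ADLS mount point. Paths are placeholders.
    import zipfile

    src_zip = "/dbfs/mnt/adls/raw/archive.zip"   # placeholder path on the mount
    dst_dir = "/dbfs/mnt/adls/raw/unzipped/"     # placeholder output folder

    with zipfile.ZipFile(src_zip) as zf:
        zf.extractall(dst_dir)                   # runs on the driver only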
If your data source can't provide the data in a compression codec supported by Spark, the best method is to use an Azure Data Factory copy activity. Azure Data Factory supports more compression codecs, including zip.
The type property definition for the source would look like this:
"typeProperties": {
"compression": {
"type": "ZipDeflate",
"level": "Optimal"
},
You can also use Azure Data Factory to orchestrate your Databricks pipelines with the Databricks activities.

How can I continuously read the contents of a file in Azure Data Lake with Flink streaming?

I am using Flink streaming to read data from a file in Azure Data Lake Store. Is there any connector available that continuously reads data from a file stored in Azure Data Lake as the file is updated? How can this be done?
Azure Data Lake Store (ADLS) supports a REST API interface that is compatible with HDFS and is documented here: https://learn.microsoft.com/en-us/rest/api/datalakestore/webhdfs-filesystem-apis.
Currently there are no APIs or connectors available that poll ADLS and notify you or read data as files/folders are updated. This is something you could implement in a custom connector using the APIs above. Your connector would need to poll the ADLS account/folder on a recurring basis to identify changes.
Thanks,
Sachin Sheth
Program Manager
Azure Data Lake
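As a rough illustration of that polling approach (not an official connector), the sketch below uses the azure-datalake-store Python package; the store name, credentials, folder, and listing field names are assumptions, and in a real Flink job this loop would live inside a custom source.

    # Hedged sketch: poll an ADLS Gen1 folder and detect new or updated files
    # by their modification time. All identifiers are placeholders.
    import time
    from azure.datalake.store import core, lib

    token = lib.auth(tenant_id="...", client_id="...", client_secret="...")
    adls = core.AzureDLFileSystem(token, store_name="mydatalake")

    seen = {}   # path -> last seen modification time
    while True:
        for info in adls.ls("/telemetry", detail=True):
            path, mtime = info["name"], info["modificationTime"]
            if seen.get(path) != mtime:
                seen[path] = mtime
                # The file is new or has changed since the last poll:
                # read the fresh content here and emit it downstream.
        time.sleep(10)   # poll interval in seconds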
