Looking for an alternative solution for processing tens of thousands of JSONs from Azure Blob to Azure SQL DB

I currently have pipelines that use Azure Data Factory for orchestration and Azure Databricks for its compute to perform the following actions: I receive tens of thousands of single-record JSON files into Azure Blob in real time, and every 15 minutes I check the folders for any new files. Once found, I load them into a dataframe using Databricks and write them to a single table in SQL DB, after which other ADF jobs trigger stored procedures that transform the data into the final SQL tables. We are looking to move away from Databricks because we are not using its true capabilities but are of course paying the Databricks costs. I'm looking for ideas on other solutions to load tens of thousands of JSONs into SQL DB (with minimal to no transformations) on a periodic (i.e. 15-minute) basis. We are a Microsoft shop, so we're not necessarily looking to move away from Azure tools.

Here are a few ideas:
Use Azure Functions + a Blob Trigger / Event Grid to process the JSON files in real time (every time a new JSON file arrives, it will trigger your function). Then you could insert either into the final table or into a temporary table (see the sketch after these ideas).
Another idea would be to combine Azure Functions + a Blob Trigger / Event Grid to sink the data to a data lake, and then use ADF to sink it into the final SQL tables.
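A minimal sketch of the first idea, using the Azure Functions Python v2 programming model: a blob trigger fires for each new JSON file and lands the record in a staging table. The container name, connection setting, connection string variable and table are illustrative assumptions, not details from the question.

```python
import json
import os

import azure.functions as func
import pyodbc

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="incoming-json/{name}",        # hypothetical container
                  connection="BlobStorageConnection")  # app setting holding the storage connection
def load_json_to_sql(blob: func.InputStream) -> None:
    # Each incoming file holds a single JSON record, per the question.
    record = json.loads(blob.read())

    conn = pyodbc.connect(os.environ["SQL_CONN_STR"], autocommit=True)
    cur = conn.cursor()
    # Land the record in a staging table; the existing stored procedures can
    # still do the final transformation exactly as they do today.
    cur.execute(
        "INSERT INTO staging.raw_json (file_name, payload) VALUES (?, ?);",
        blob.name, json.dumps(record),
    )
    conn.close()
```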

Azure SQL DB is actually pretty capable as far as JSON goes, so you could just use OPENROWSET to import the data directly from blob storage and OPENJSON to shred it. You could then use a Logic App running on a schedule to call the proc, say, every 15 minutes; you wouldn't even need ADF as part of the solution.
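As a rough illustration of that pattern, here is the kind of T-SQL that could sit inside such a proc, wrapped in pyodbc only so the snippet is self-contained (the Logic App would simply call the stored procedure instead). The external data source, blob path, table and columns are assumptions for illustration, not details from the question.

```python
import os
import pyodbc

# OPENROWSET pulls one JSON file straight from blob storage (via an external
# data source backed by a SAS credential) and OPENJSON shreds it into columns.
SHRED_JSON = """
INSERT INTO staging.events (id, event_time, payload)
SELECT j.id, j.eventTime, j.payload
FROM OPENROWSET(
        BULK 'incoming/file0001.json',       -- hypothetical blob path
        DATA_SOURCE = 'MyBlobStorage',       -- hypothetical external data source
        SINGLE_CLOB) AS raw
CROSS APPLY OPENJSON(raw.BulkColumn)
WITH (
    id         INT           '$.id',
    eventTime  DATETIME2     '$.eventTime',
    payload    NVARCHAR(MAX) '$.payload' AS JSON
) AS j;
"""

conn = pyodbc.connect(os.environ["SQL_CONN_STR"], autocommit=True)
conn.cursor().execute(SHRED_JSON)
conn.close()
```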
I've worked up a couple of similar answers previously, e.g. here and here, but let me know if you want to progress further down this route and we can work up something more detailed.

Related

Iterating through the parameters in a CSV file and running the pipeline with each parameter

Hi, I have a scenario where I have a CSV file in Azure Data Lake Storage. While running the Azure pipeline, the parameters from the CSV file have to be picked up one by one in an iterative manner, and for each parameter a Databricks notebook should be run.
Is there any solution for this - how do I iterate through the values in the CSV file?
If you are in Azure, you should consider Azure Data Factory (ADF) or Azure Synapse Analytics, which has pipelines. Both are good for moving data from place to place and for data orchestration. For example, you could have an ADF pipeline with a Lookup activity that reads your .csv and then calls a ForEach activity with a parameterised Databricks Notebook activity inside (sketched after this answer).
Interestingly, the ForEach activity runs in parallel, so it could deal with multiple lines at once, depending on your Databricks cluster size etc.
You could try to do all this within a single Databricks notebook, which I'm sure is possible, but I would say that is a more code-heavy approach and you still have questions around scheduling, doing tasks in parallel, orchestration etc.
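A minimal sketch of the Databricks side of that pattern, as a notebook cell: the ForEach activity passes the current row's value to the notebook as a base parameter, and the notebook reads it with a widget. The parameter name, container and storage account are illustrative assumptions.

```python
# Databricks notebook cell: read the parameter ADF passes via baseParameters.
dbutils.widgets.text("file_param", "")      # default only used when run outside ADF
param = dbutils.widgets.get("file_param")

# Use the parameter to drive the work for this row, e.g. read the matching
# file from the data lake (the path here is a hypothetical example).
df = (spark.read
      .option("header", "true")
      .csv(f"abfss://raw@mydatalake.dfs.core.windows.net/{param}"))
display(df)
```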

Is there a simple way to ETL from Azure Blob Storage to Snowflake EDW?

I have the following ETL requirements for Snowflake on Azure and would like to implement the simplest possible solution because of timeline and technology constraints.
Requirements:
Load CSV data (only a few MB) from Azure Blob Storage into a staging table in the Snowflake warehouse daily.
Transform the loaded data within Snowflake itself, where the transformation is limited to just a few joins and aggregations to obtain a few measures, and finally park this data in our final tables in a data mart within the same Snowflake DB.
Lastly, automate the above pipeline using a schedule OR an event-based trigger (i.e. the steps kick in as soon as a file lands in the blob store).
Constraints:
We cannot use Azure Data Factory to achieve this simplest design.
We cannot use Azure Functions to deploy Python transformation scripts and schedule them either.
Also, I found that transformation using Snowflake SQL at load time is a limited feature: it only allows certain things as part of the COPY INTO command and does not support JOINs and GROUP BY. Furthermore, although the thread linked below suggests that scheduling SQL is possible, it doesn't address my transformation requirement.
Regards,
Roy
Attaching the following Idea diagram for more clarity.
https://community.snowflake.com/s/question/0D50Z00009Z3O7hSAF/how-to-schedule-jobs-from-azure-cloud-for-loading-data-from-blobscheduling-snowflake-scripts-since-dont-have-cost-for-etl-tool-purchase-for-scheduling
https://docs.snowflake.com/en/user-guide/data-load-transform.html#:~:text=Snowflake%20supports%20transforming%20data%20while,columns%20during%20a%20data%20load.
You can create a Snowpipe on your Azure Blob Storage. Once the Snowpipe is created on top of the storage, it will monitor the container and load each file into your stage table as soon as a new file arrives. After the data is copied into the stage table, you can schedule the transformation SQL using a Snowflake task (sketched below).
You can refer to the Snowpipe creation steps for Azure Blob Storage at the link below:
Snowpipe on microsoft Azure blob storage
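For a rough idea of what that looks like end to end, here is a sketch using the Snowflake Python connector to create the pipe and the scheduled task. The stage, notification integration, warehouse, table and column names are all illustrative assumptions, not details from the question.

```python
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PWD"],
    warehouse="ETL_WH",
    database="EDW",
    schema="STAGING",
)
cur = conn.cursor()

# Snowpipe: auto-ingest new CSV files from the Azure stage into the staging table.
cur.execute("""
    CREATE PIPE IF NOT EXISTS load_csv_pipe
      AUTO_INGEST = TRUE
      INTEGRATION = 'AZURE_EVENT_INT'
    AS
      COPY INTO staging.raw_csv
      FROM @csv_stage
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# Task: run the join/aggregate transformation inside Snowflake on a schedule.
cur.execute("""
    CREATE TASK IF NOT EXISTS transform_task
      WAREHOUSE = ETL_WH
      SCHEDULE = 'USING CRON 0 6 * * * UTC'
    AS
      INSERT INTO datamart.daily_measures
      SELECT s.key_col, SUM(s.amount) AS total_amount
      FROM staging.raw_csv s
      JOIN staging.dim_lookup d ON d.key_col = s.key_col
      GROUP BY s.key_col
""")
cur.execute("ALTER TASK transform_task RESUME")
conn.close()
```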

Is it possible to download a million files in parallel from a REST API endpoint into Blob using Azure Data Factory?

I am fairly new to Azure and I have a task at hand to make use of any Azure service (or a group of Azure services working together, for that matter) to download a million files in parallel from a third-party REST API endpoint, which returns one file at a time, into Blob Storage using Azure Data Factory.
WHAT I RESEARCHED:
From what I researched, my task had these requirements in a nutshell:
Parallel runs in the millions - for this I deduced Azure Batch would be a good option, as it lets you run a large number of tasks in parallel on VMs (it uses that concept for graphics rendering or machine learning workloads).
Save the response from the REST API to Blob Storage - I found that Azure Data Factory is able to handle this kind of ETL operation in a source/sink style, where I could set the REST API as the source and blob as the sink.
WHAT I HAVE TRIED:
Here are some things to note:
I added the REST API and Blob as linked services.
The API endpoint takes a query string parameter named fileName.
I am passing the whole URL with the query string.
The REST API is protected by a Bearer token, which I am trying to pass using additional headers.
THE MAIN PROBLEM:
I get an error message on publishing the pipeline that the model is not appropriate - just that one line, and it gives no insight into what's wrong.
OTHER QUERIES:
Is it possible to pass the query string values dynamically from a SQL table, such that each file name is picked up as a single row/column item from a single-column result set returned by a stored procedure/inline query?
Is it possible to make this pipeline run in parallel using Azure Batch somehow? How can we integrate this process?
Is it possible to achieve the million parallel downloads without Data Factory, using just Batch?
Hard to help with your main problem - you need to provide more examples of your code.
In relation to your other queries:
You can use a "Lookup activity" to fetch a list of files from a database (with either sproc or inline query). The next step would be a ForEach activity that iterates over the array and copies the file from the REST endpoint to the storage account. You can adjust the parallelism on the ForEach activity to match your requirement but around 20 concurrent executions is what you normally see.
Using Azure Batch to just download a file seems a bit overkill as it should be a fairly quick operation. If you want to see an example of a Azure Batch job written in C# I can recommend this example => `https://github.com/Azure-Samples/batch-dotnet-quickstart/blob/master/BatchDotnetQuickstart. In terms of parallelism I think you will manage to achieve a higher degree on Azure Batch compared to Azure Data Factory.
In you need to actually download 1M files in parallel I don't think you have any other option then Azure Batch to get close to such numbers. But you most have a pretty beefy API if it can handle 1M requests within a second or two.
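For a sense of what the per-task work could look like, here is a sketch of the kind of worker script an Azure Batch task (or any VM) could run: download the files named in a list in parallel and push them into Blob Storage. The endpoint URL, container name, environment variables and file list are illustrative assumptions, not details from the question.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests
from azure.storage.blob import BlobServiceClient

API_URL = "https://example.com/api/files"               # hypothetical endpoint
TOKEN = os.environ["API_BEARER_TOKEN"]
blob_service = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONN_STR"])
container = blob_service.get_container_client("downloads")

def fetch_and_store(file_name: str) -> None:
    # The API takes the file name as a query string parameter, per the question.
    resp = requests.get(
        API_URL,
        params={"fileName": file_name},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=60,
    )
    resp.raise_for_status()
    container.upload_blob(name=file_name, data=resp.content, overwrite=True)

if __name__ == "__main__":
    # Stand-in for the list a Lookup activity or stored procedure would provide.
    file_names = [f"file_{i}.json" for i in range(1000)]
    with ThreadPoolExecutor(max_workers=50) as pool:
        list(pool.map(fetch_and_store, file_names))
```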

Azure IoT data warehouse updates

I am building an Azure IoT solution for my BI project. For now I have an application that, once per set time window, sends a .csv blob to Azure Blob Storage with an incremental number in the name. So after some time my storage will contain files such as 'data1.csv', 'data2.csv', 'data3.csv', etc.
Now I need to load these data into a database, which will be my warehouse, using an Azure Stream Analytics job. The issue is that the .csv files will contain overlapping data: they are sent every 4 hours and contain data for the past 24 hours. I need to always read only the last file (the one with the highest number) and prepare a lookup so the data in the warehouse is updated properly. What would be the best approach to make Stream Analytics read only the latest file and update the records in the DB?
EDIT:
To clarify - I am fully aware that ASA is not capable of being an ETL job. My question is what the best approach would be for my case using IoT tools.
I would suggest one of these two ways:
Use ASA to write to a temporary SQL table, and then use a SQL trigger to update the main table of the DW with the diff.
Or remove duplicates by adding a unique constraint, as described here:
https://blogs.msdn.microsoft.com/streamanalytics/2017/01/13/how-to-achieve-exactly-once-delivery-for-sql-output/
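A minimal sketch of the second option, run once against the target database: a unique index with IGNORE_DUP_KEY silently drops the duplicate rows that come from the overlapping 24-hour windows. The table and key columns are hypothetical, and the T-SQL is wrapped in pyodbc only to keep the snippet self-contained.

```python
import os
import pyodbc

conn = pyodbc.connect(os.environ["DW_CONN_STR"], autocommit=True)
cur = conn.cursor()

# Rows whose (device_id, reading_time) already exist are ignored on insert,
# so re-sending the same 24-hour window from a newer CSV causes no duplicates.
cur.execute("""
    CREATE UNIQUE INDEX ux_telemetry_dedupe
    ON dbo.telemetry_staging (device_id, reading_time)
    WITH (IGNORE_DUP_KEY = ON);
""")
conn.close()
```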
Thanks,
JS - Azure Stream Analytics

How to handle Incremental & Full Upload in Azure Data Factory

We have an Azure Storage Account with 2 blob stores: a Full and an Inc.
In the Full store we place the full-upload CSV files whenever a full upload is needed; in the Inc store we just place the small day-by-day incremental CSV files.
We load all our data first into a staging area, then into the ODS and finally into an EDW (Enterprise DW).
A full upload is only needed when there are structural changes to the tables.
Basically the only difference between the two uploads is that the full one also clears all data in the ODS and the EDW, but it runs the same stored procedures in the pipelines, ...
Does anybody have tips on how to handle such a situation in Azure Data Factory?
I would prefer not to double the data factories, but due to the different availability/frequency of the output datasets I can't use the same staging logical (in the data factory) table as the output dataset ....
So any hints are appreciated ...
First of all, to be clear, ADF is just there to invoke other Azure services; it doesn't do any of the work itself. So the question really is: which services in Azure could you call from ADF to do this work and manage this situation?
To answer that...
Option 1: I would suggest you look at Azure Data Lake. I've written simple procedures for what you've described above in U-SQL, where parameters can be passed to the U-SQL procedures from ADF for different types of behaviour.
The code you create can live in an Azure Data Lake Analytics database, similar to T-SQL objects. Then maybe start using Azure Data Lake Storage as well, instead of normal blobs.
Option 2: Break out the C# and create yourself an Azure Data Factory custom activity with a set of classes to do exactly what you require. Again, with params passed in by ADF, or include logic in the methods to check the 'full' table contents. This will however involve a lot more development work and requires an Azure Batch service for the compute.
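As a language-agnostic sketch of the parameterised logic both options describe (one routine, with a single flag deciding whether the ODS/EDW get cleared first, and the same stored procedures running in both modes), something along these lines could sit behind either the U-SQL procedures or the custom activity. The connection string, table and procedure names are illustrative assumptions, not details from the question.

```python
import os
import sys

import pyodbc

def run_load(full: bool) -> None:
    conn = pyodbc.connect(os.environ["EDW_CONN_STR"], autocommit=True)
    cur = conn.cursor()
    if full:
        # Full upload: clear the downstream layers before reloading.
        cur.execute("TRUNCATE TABLE ods.sales;")
        cur.execute("TRUNCATE TABLE edw.fact_sales;")
    # Both modes run the same staging -> ODS -> EDW procedures.
    cur.execute("EXEC staging.usp_load_from_blob @mode = ?;", "full" if full else "inc")
    cur.execute("EXEC ods.usp_load_from_staging;")
    cur.execute("EXEC edw.usp_load_from_ods;")
    conn.close()

if __name__ == "__main__":
    # e.g. the pipeline (or a Batch custom activity) passes the mode as an argument.
    run_load(full=(len(sys.argv) > 1 and sys.argv[1].lower() == "full"))
```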
