Iterating through the parameters in csv file and run pipeline with each parameter - azure

Hi i have a scenario where i have an csv file in azure datalake storage. while running azure pipeline, the parameters from the excel has to be picked up one by one in an iterative manner. Based on each parameter the databricks notebook should be run.
Is there any solution for this - how to iterate through the values in csv file?

If you are in Azure, you should consider Azure Data Factory (ADF) or Azure Synapse Analytics which has pipelines. Both are good for moving data from place to place and data orchestration. For example you could have an ADF pipeline with a Lookup activity which reads your .csv, then calls a For Each activity with a parameterised Databricks notebook inside:
Interestingly, the For Each activity runs in parallel so could deal with multiple lines at once, depending on your Databricks cluster size etc.
You could try and do all this within a single Databricks notebook which I'm sure is possible, but I would say that is a more code heavy approach and you still have questions around the scheduling, doing tasks in parallel, orchestration etc

Related

Is there a simple way to ETL from Azure Blob Storage to Snowflake EDW?

I have the following ETL requirements for Snowflake on Azure and would like to implement the simplest possible solution because of timeline and technology constraints.
Requirements :
Load CSV data (only a few MBs) from Azure Blob Storage into Snowflake Warehouse daily into a staging table.
Transform the loaded data above within Snowflake itself where transformation is limited to just a few joins and aggregations to obtain a few measures. And finally, park this data into our final tables in a Datamart within the same Snowflake DB.
Lastly, automate the above pipeline using a schedule OR using an event based trigger (i.e. steps to kick in as soon as file lands in Blob Store).
Constraints :
We cannot use use Azure Data Factory to achieve this simplest design.
We cannot use Azure Functions to deploy Python Transformation scripts and schedule them either.
And, I found that Transformation using Snowflake SQL is a limited feature where it only allows certain things as part of COPY INTO command but does not support JOINS and GROUP BY. Furthermore, although the following THREAD suggests that scheduling SQL is possible, but that doesn't address my Transformation requirement.
Regards,
Roy
Attaching the following Idea diagram for more clarity.
https://community.snowflake.com/s/question/0D50Z00009Z3O7hSAF/how-to-schedule-jobs-from-azure-cloud-for-loading-data-from-blobscheduling-snowflake-scripts-since-dont-have-cost-for-etl-tool-purchase-for-scheduling
https://docs.snowflake.com/en/user-guide/data-load-transform.html#:~:text=Snowflake%20supports%20transforming%20data%20while,columns%20during%20a%20data%20load.
You can create snowpipe on Azure blob storage, Once snowpipe created on top of your azure blob storage, It will monitor bucket and file will be loaded into your stage table as soon as new file comes in. After copied the data into stage table you can schedule transformation SQL using snowflake task.
You can refer snowpipe creation step for azure blob storage in below link:
Snowpipe on microsoft Azure blob storage

Looking for an alternative solution to processing tens of thousands of JSONs from Azure Blob to Azure SQL DB

I currently have pipelines developed that leverage Azure Data Factory for orchestration and Azure DataBricks for it's compute to perform the following actions... I receive tens of thousands of single record json files into Azure Blob in a real-time basis and on a 15 minute basis i check the folders for any new files and once found I load them into a dataframe using Databricks and load these into a single file in SQL DB before having other ADF jobs trigger stored procedures which then transform my data into final SQL tables.... We are looking to move away from Databricks as we are not using it for it's true capabilities but are of course paying the Databricks costs. Looking for ideas on other solutions to load tens of thousands of jsons into SQL DB (with minimal to no transformations) on a periodic (i.e. 15 minute) basis. We are a microsoft shop so not looking to necessarily move away from Azure tools.
Here a few ideas:
use Azure Functions + Blob Trigger / Event Grid to process the JSON files in real time (every time a new JSON file arrives, it will trigger your function). Then you could either insert into the final table or on a temporary table.
another idea would be combine Azure Functions + Blob Trigger / Event Grid to sink the data to a data lake. You can use ADF to sink it to SQL final tables.
Azure SQL DB is actually pretty capable as far as JSON goes so you could just use OPENROWSET to import the data directly from blob store and OPENJSON to shred it. You could then use a Logic App running on a schedule to call the proc say every 15 minutes, you wouldn't even need ADF as part of the solution.
I've worked up a couple of similar answers previously, eg here and here, but let me know if you want to progress more down this route and we can work up something more detailed.

Custom Script in Azure Data Factory & Azure Databricks

I have a requirement to parse a lot of small files and load them into a database in a flattened structure. I prefer to use ADF V2 and SQL Database to accomplish it. The file parsing logic is already available using Python script and I wanted to orchestrate it in ADF. I could see an option of using Python Notebook connector to Azure Databricks in ADF v2. May I ask if I will be able to just run a plain Python script in Azure Databricks through ADF? If I do so, will I just run the script in Databricks cluster's driver only and might not utilize the cluster's full capacity. I am also thinking of calling Azure functions as well. Please advise which one is more appropriate in this case.
Just provide some ideas for your reference.
Firstly, you are talking about Notebook and Databricks which means ADF's own copy activity and Data Flow can't meet your needs, since as i know, ADF could meet just simple flatten feature! If you miss that,please try that first.
Secondly,if you do have more requirements beyond ADF features, why not just leave it?Because Notebook and Databricks don't have to be used with ADF,why you want to pay more cost then? For Notebook, you have to install packages by yourself,such as pysql or pyodbc. For Azure Databricks,you could mount azure blob storage and access those files as File System.In addition,i suppose you don't need many workers for cluster,so just configure it as 2 for max.
Databricks is more suitable for managing as a job i think.
Azure Function also could be an option.You could create a blob trigger and load the files into one container. Surely,you have to learn the basic of azure function if you are not familiar with it.However,Azure Function could be more economical.

Do you have to use Azure Data Factory or can you just Databricks as your ETL tool from your multiple sources?

...Or do i need to add the data into a data lake using data factory first and then use databricks as an ELT?
Depends.
Databricks can connect to datasources and ingest data. However Azure Data Factory(ADF) have more connectors than databricks. So it depends on what you need. If using ADF, you need to land the data somewhere (i.e. Azure storage) so that databricks can pick it up.
Moreover, another main feature of ADF is to orchestrate data movement or activity. Databricks do have Job feature to schedule notebooks or JAR, however it is limited within databricks. If you want to schedule anything outside of databricks (e.g. drop file to SFTP or email on completion or terminate databricks cluster etc...) then ADF is the way to go.
Indeed it depends to the scenario I think. If you have a wide variety of datascources you need to connect to then adf is probably the better option.
If your sources are datafiles (in any format) you could consider using databricks for etl.
I use databricks as a pure etl tool (without adf) by mounting a notebook to a storage container in a blobstorage, take huge xml data from there and write the data to a dataframe in databricks. Then I parse the shape of the dataframe and then writing the data into an azure sql database. Fair to say I’m not really using it for the “e” in etl, as the data has already been extracted from the real source system.
Big advantage is the power you have at your disposal to parse the files.
Best regards.

How to handle Incremental & Full Upload in a Azure Data Factory

We have a Azure Storage Account with 2 blob stores. A Full and a Inc.
In the Full we place the full upload CSV files whenever a Full Upload is needed, in the Inc we just place day by day small incremental CSV Files.
We load all our data first in a staging, then to the ODS en finally to a Edw (Enterprise DW).
A full upload is only needed when there are structural changes to the tables.
Basically the only difference between the two uploads is that the full also cleares all data in the ODS and the EDW, but runs the sames stored procedures in the pipelines, ...
Anybody has tips on how to handle such a situation in a Azure Data Factory.
I would prefer not to double the data-factories, but due to the different avalability/frequency of the output datasets I can't use the same staging logical (in the data-factory) table as output dataset ....
So any hint(s) are appreciated ...
First of all to be clear ADF is just there to invoke other Azure services, it doesn't do any of the work itself. So the question really is; what services in Azure could you call from ADF to do this work and manage this situation?
To answer that...
Option 1: I would suggest you look at Azure Data Lake. I've written simply procedures to what you've described above in USQL where parameters can be passed to the USQL procedures from ADF for different types of behaviour.
The code you create can live in an Azure Data Lake Analytics database, similar to TSQL objects. Then maybe start using Azure Data Lake Storage as well, instead of normal blobs.
Option 2: Break out the C# and create yourself an Azure Data Factory custom activity and create a set of classes to do exactly what you require. Again with params passed by ADF or include logic in the methods to check the 'full' table contents. This will however involve a lot more development work and require an Azure Batch Service for the compute.

Resources