Is there a simple way to ETL from Azure Blob Storage to Snowflake EDW?

I have the following ETL requirements for Snowflake on Azure and would like to implement the simplest possible solution because of timeline and technology constraints.
Requirements:
Load CSV data (only a few MB) from Azure Blob Storage into a staging table in the Snowflake warehouse daily.
Transform the loaded data within Snowflake itself; the transformation is limited to a few joins and aggregations to derive a handful of measures. Finally, park this data in the final tables of a data mart within the same Snowflake database.
Automate the above pipeline using a schedule OR an event-based trigger (i.e. the steps kick in as soon as a file lands in the blob store).
Constraints:
We cannot use Azure Data Factory to achieve this.
We cannot use Azure Functions to deploy Python transformation scripts and schedule them either.
Also, I found that transforming data with Snowflake SQL during a load is a limited feature: it only allows certain operations as part of the COPY INTO command and does not support JOINs or GROUP BY (second link below). Furthermore, although the thread linked below suggests that scheduling SQL is possible, that doesn't address my transformation requirement.
Attaching the following idea diagram and reference links for more clarity.
https://community.snowflake.com/s/question/0D50Z00009Z3O7hSAF/how-to-schedule-jobs-from-azure-cloud-for-loading-data-from-blobscheduling-snowflake-scripts-since-dont-have-cost-for-etl-tool-purchase-for-scheduling
https://docs.snowflake.com/en/user-guide/data-load-transform.html#:~:text=Snowflake%20supports%20transforming%20data%20while,columns%20during%20a%20data%20load.
Regards,
Roy

You can create a Snowpipe on your Azure Blob Storage container. Once the Snowpipe is created on top of the container, it will monitor it and load each new file into your staging table as soon as it arrives. After the data has been copied into the staging table, you can schedule the transformation SQL using a Snowflake task; a rough sketch of the whole setup is shown below.
You can refer to the Snowpipe creation steps for Azure Blob Storage in the link below:
Snowpipe on Microsoft Azure Blob Storage
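For illustration only, here is a minimal Python sketch (using the snowflake-connector-python package) of the stage + Snowpipe + task setup described above. All object names, the SAS token and the notification integration are placeholders, and the Event Grid notification integration is assumed to be set up separately as per the linked guide.

    # Minimal sketch, not a drop-in script: replace every <placeholder> and the
    # object names with your own; the notification integration must already exist.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse="MY_WH", database="MY_DB", schema="STAGING",
    )
    cur = conn.cursor()

    # External stage pointing at the Azure Blob container holding the daily CSVs.
    cur.execute("""
        CREATE STAGE IF NOT EXISTS csv_stage
          URL = 'azure://<account>.blob.core.windows.net/<container>/'
          CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
          FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)

    # Snowpipe with auto-ingest: Event Grid notifications (delivered through the
    # notification integration) trigger the COPY as soon as a file lands.
    cur.execute("""
        CREATE PIPE IF NOT EXISTS csv_pipe
          AUTO_INGEST = TRUE
          INTEGRATION = 'MY_NOTIFICATION_INT'
          AS COPY INTO staging_table FROM @csv_stage
    """)

    # Scheduled task that runs the join/aggregation and parks the result in the
    # data mart table (JOIN and GROUP BY are fine here, unlike in COPY INTO).
    cur.execute("""
        CREATE TASK IF NOT EXISTS transform_task
          WAREHOUSE = MY_WH
          SCHEDULE = 'USING CRON 0 6 * * * UTC'
          AS
          INSERT INTO datamart.daily_measures
          SELECT s.key_col, d.dim_col, SUM(s.amount) AS total_amount
          FROM staging_table s
          JOIN dim_table d ON d.key_col = s.key_col
          GROUP BY s.key_col, d.dim_col
    """)
    cur.execute("ALTER TASK transform_task RESUME")  # tasks are created suspended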

Related

Moving Data from Azure Blob Storage to Azure Synapse (SQL dedicated pools)

I have a requirement to move Azure Blob Storage data to Azure Synapse (SQL dedicated pools).
The Azure Blob Storage container has around 850 GB of data (in the form of multiple JSON files). I have created a Synapse pipeline and used PolyBase to move the data from Blob Storage to the SQL dedicated pool. PolyBase requires a staging environment, for which I have used a staging blob container.
Azure Blob Storage -> staging container -> SQL dedicated pool (Azure Synapse)
I have not put any restrictions on DIUs or parallel copies, so it uses 32 DIUs and the number of parallel copies goes up to 120-130.
The first stage completed in 5 hours, moving 850 GB of data to the staging container, but the second stage has now been running for 15 hours and is still not complete; the DIU count I can see is 2 and the parallel copy count is 1.
Do I need to explicitly specify the DIUs and parallel copies?
Is there any better way to do this other than PolyBase?
There are 3 key points missing from your question:
Why are you using a staging table?
Do you need to perform any transformation on the data?
Why is your staging table in Blob Storage?
As per best practice:
It is best practice to load data into a staging table. Staging tables allow you to handle errors without interfering with the production tables. A staging table also gives you the opportunity to use SQL pool built-in distributed query processing capabilities for data transformations before inserting the data into production tables.
Since you haven't mentioned whether your approach is Extract, Transform and Load (ETL) or Extract, Load and Transform (ELT), the reason for the staging table isn't clear. If your approach is ETL, staging is fine; otherwise it is not required.
Secondly, staging tables should be in the SQL pool, not in Blob Storage.
Before loading the data into the staging table, you need to define external tables in your data warehouse. PolyBase uses external tables to define and access the data in Azure Storage. An external table is similar to a database view: it contains the table schema and points to data that is stored outside the data warehouse (see the sketch below).
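To make the external-table step concrete, here is a rough Python sketch (via pyodbc) of the kind of T-SQL involved. Every name (SourceBlob, TextFmt, ext.SourceData, stg.SourceData), the connection details and the column list are placeholder assumptions, and a database scoped credential for the storage account (BlobCred) is assumed to exist already.

    # Rough sketch only: placeholders everywhere, no error handling.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<workspace>.sql.azuresynapse.net;DATABASE=<dedicated-pool>;"
        "UID=<user>;PWD=<password>"
    )
    conn.autocommit = True  # the DDL below should not run inside a transaction
    cur = conn.cursor()

    # External data source + file format pointing at the blob container.
    cur.execute("""
        CREATE EXTERNAL DATA SOURCE SourceBlob
        WITH (TYPE = HADOOP,
              LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
              CREDENTIAL = BlobCred)
    """)
    cur.execute("""
        CREATE EXTERNAL FILE FORMAT TextFmt
        WITH (FORMAT_TYPE = DELIMITEDTEXT,
              FORMAT_OPTIONS (FIELD_TERMINATOR = '|'))
    """)

    # External table = schema + pointer to the files; no data is moved yet.
    # Note: PolyBase reads delimited text / Parquet / ORC, not JSON natively,
    # so JSON usually has to be landed as text and parsed inside the pool,
    # or converted to Parquet upstream.
    cur.execute("""
        CREATE EXTERNAL TABLE ext.SourceData (Id INT, Payload NVARCHAR(4000))
        WITH (LOCATION = '/data/', DATA_SOURCE = SourceBlob, FILE_FORMAT = TextFmt)
    """)

    # CTAS pulls the data into a staging table *inside* the SQL pool, using its
    # distributed processing instead of a staging blob container.
    cur.execute("""
        CREATE TABLE stg.SourceData
        WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
        AS SELECT * FROM ext.SourceData
    """)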
You also need to take care of the resource class and distribution method. Refer to this third-party article to learn more about these two workload management concepts.
As there are many parameters to consider before you can find out what exactly the issue is, I suggest you first go through the important official documents and then make the appropriate changes to your architecture.
Helpful links:
Design a PolyBase data loading strategy for dedicated SQL pool in Azure Synapse Analytics
Workload management with resource classes in Azure Synapse Analytics
Best practices for loading data into a dedicated SQL pool in Azure Synapse Analytics

When to use Data Factory (Copy) over a direct pull in Synapse SQL

I am going through some Microsoft documentation and doing hands-on work for data engineering related things.
I have a couple of queries about a scenario: "copy CSV file(s) from Blob Storage to Synapse Analytics (stage table(s))":
I read that we can do a direct data pull in Synapse by creating external tables (https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/load-data-wideworldimportersdw).
If the above is possible, in what cases do we use the Azure Data Factory Copy or data flow method?
While working with Azure Data Factory, is it a good idea to use PolyBase, given that it will use Blob Storage again for staging in this scenario (i.e. I am copying the file from Blob Storage and again using a blob container for staging)?
I have searched for answers to these queries but haven't found a satisfactory answer yet.
If you're just straight loading data from CSV into the DW, use the Copy activity. PolyBase is recommended, but not always needed for small files.
If you need to transform that data or perform updates, then use data flows.

Spark: How to access AzureFileSystemInstrumentation when using Azure Blob Storage with a Spark cluster?

I am working on a Spark project where the storage sink is Azure Blob Storage. I write data in Parquet format. I need some metrics around storage, e.g. numberOfFilesCreated, writtenBytes, etc. While searching online I came across a metrics source in the hadoop-azure package called AzureFileSystemInstrumentation. I am not sure how to access it from Spark and can't find any resources on it. How would one access this instrumentation for a given Spark job?
Based on my experience, there are three solutions that can be used in your current scenario, as below.
Directly use the Hadoop filesystem API to get filesystem metrics data in Spark, because hadoop-azure simply implements the Hadoop FileSystem API on top of Azure Blob Storage. Please see the official Hadoop documentation on Metrics to find the particular metrics you want, such as CreateFileOps or FilesCreated, to get numberOfFilesCreated. There is also a similar SO thread, "How do I get HDFS bytes read and write for Spark applications?", which you can refer to. A sketch of this approach is shown after this list.
Directly use the Azure Storage SDK for Java (or whichever language you use) to write a program that computes the statistics for the files stored in Azure Blob Storage as blobs, ordered by creation timestamp or other attributes; please refer to the official document "Quickstart: Azure Blob storage client library v8 for Java" to learn how to use the SDK.
Use an Azure Function with a Blob Trigger to monitor the blob-created events in Azure Blob Storage, and then write the statistics code for every blob-created event; please refer to the official document "Create a function triggered by Azure Blob storage" to learn how to use the Blob Trigger. You can even send the metrics you want to Azure Table Storage, Azure SQL Database or another service from inside the Blob Trigger function for later analysis.
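As a rough illustration of the first option, here is a PySpark sketch that dumps Hadoop's per-scheme storage statistics through Spark's JVM gateway. Whether the wasb/abfs counters appear (and under which names) depends on the Hadoop and hadoop-azure versions on the cluster, so treat it as a starting point rather than a guaranteed API.

    # Print per-scheme Hadoop storage statistics (files created, bytes written, ...)
    # after a write; spark._jvm is an internal py4j handle, used here for brevity.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-stats").getOrCreate()

    # ... run your job here, e.g.
    # df.write.parquet("wasbs://<container>@<account>.blob.core.windows.net/out")

    stats_iter = spark._jvm.org.apache.hadoop.fs.FileSystem \
        .getGlobalStorageStatistics().iterator()
    while stats_iter.hasNext():
        stats = stats_iter.next()             # StorageStatistics for one scheme (wasb, abfs, hdfs, ...)
        counters = stats.getLongStatistics()  # iterator of named long counters
        while counters.hasNext():
            entry = counters.next()
            print(stats.getScheme(), entry.getName(), entry.getValue())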

Do you have to use Azure Data Factory, or can you just use Databricks as your ETL tool for your multiple sources?

...Or do I need to load the data into a data lake using Data Factory first and then use Databricks for ELT?
Depends.
Databricks can connect to data sources and ingest data. However, Azure Data Factory (ADF) has more connectors than Databricks, so it depends on what you need. If using ADF, you need to land the data somewhere (e.g. Azure Storage) so that Databricks can pick it up.
Moreover, another main feature of ADF is orchestrating data movement and activities. Databricks does have a Jobs feature to schedule notebooks or JARs, but it is limited to Databricks itself. If you want to schedule anything outside of Databricks (e.g. drop a file to SFTP, send an email on completion, terminate a Databricks cluster, etc.), then ADF is the way to go.
Indeed, it depends on the scenario, I think. If you have a wide variety of data sources you need to connect to, then ADF is probably the better option.
If your sources are data files (in any format), you could consider using Databricks for ETL.
I use Databricks as a pure ETL tool (without ADF) by mounting a Blob Storage container in a notebook, reading huge XML data from there into a DataFrame in Databricks, parsing and reshaping the DataFrame, and then writing the data into an Azure SQL database (a rough sketch is shown below). Fair to say I'm not really using it for the "E" in ETL, as the data has already been extracted from the real source system.
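For illustration only, a minimal sketch of that mount/read/write flow as it might look in a Databricks notebook (where spark and dbutils are predefined). The mount source, secret scope, rowTag and JDBC details are placeholder assumptions, and the spark-xml library plus a SQL Server JDBC driver are assumed to be installed on the cluster.

    # Mount the container, read XML into a DataFrame, write to Azure SQL over JDBC.
    dbutils.fs.mount(
        source="wasbs://<container>@<account>.blob.core.windows.net",
        mount_point="/mnt/raw",
        extra_configs={
            "fs.azure.account.key.<account>.blob.core.windows.net":
                dbutils.secrets.get(scope="<scope>", key="<storage-key>")
        },
    )

    # Read the XML files (requires the spark-xml library; rowTag depends on the
    # shape of your documents).
    df = (spark.read.format("xml")
          .option("rowTag", "record")
          .load("/mnt/raw/xml/*.xml"))

    # ... any parsing / flattening of the nested structure goes here ...

    # Write the result to an Azure SQL Database table.
    (df.write.format("jdbc")
       .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
       .option("dbtable", "dbo.ParsedXml")
       .option("user", "<user>")
       .option("password", "<password>")
       .mode("append")
       .save())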
Big advantage is the power you have at your disposal to parse the files.
Best regards.

How to handle Incremental & Full Upload in an Azure Data Factory

We have an Azure Storage account with 2 blob stores: a Full and an Inc.
In Full we place the full-upload CSV files whenever a full upload is needed; in Inc we just place the small day-by-day incremental CSV files.
We load all our data first into a staging area, then into the ODS, and finally into an EDW (Enterprise DW).
A full upload is only needed when there are structural changes to the tables.
Basically the only difference between the two uploads is that the full upload also clears all data in the ODS and the EDW, but it runs the same stored procedures in the pipelines.
Does anybody have tips on how to handle such a situation in Azure Data Factory?
I would prefer not to duplicate the data factories, but due to the different availability/frequency of the output datasets I can't use the same logical staging table (in the data factory) as the output dataset.
So any hints are appreciated.
First of all, to be clear, ADF is just there to invoke other Azure services; it doesn't do any of the work itself. So the question really is: what services in Azure could you call from ADF to do this work and manage this situation?
To answer that...
Option 1: I would suggest you look at Azure Data Lake. I've written simple procedures for exactly what you've described above in U-SQL, where parameters can be passed to the U-SQL procedures from ADF for different types of behaviour.
The code you create can live in an Azure Data Lake Analytics database, similar to T-SQL objects. Then maybe start using Azure Data Lake Storage as well, instead of normal blobs.
Option 2: Break out the C# and create yourself an Azure Data Factory custom activity with a set of classes that do exactly what you require, again with parameters passed by ADF, or include logic in the methods to check the 'full' table contents. This will, however, involve a lot more development work and requires an Azure Batch service for the compute.
