How to structure the ETL project in Azure Databricks?

In the past, the following project structure has been useful for ETL development with PySpark (on-premises infrastructure).
Since Azure Databricks uses the concept of notebooks, I'm not certain how this structure needs to be modified to deploy ETL pipelines with Azure Databricks.
Can someone please share what an equivalent representation of this structure would look like in an Azure Databricks workspace?

Related

Can I use Azure Synapse functionality outside the Azure environment?

Forum,
I am currently looking into Azure Synapse as an option for migrating our on-prem data architecture. I am excited by the functionality it offers - SQL pools, Spark pools, and the accompanying notebooks. I get that Synapse can function as an all-in-one data platform, where my data scientists and data analysts can use its functionality to deliver insights at will. However, a large part of the work my team does is creating data products.
We currently have a Kubernetes cluster with several stand-alone APIs that perform data-science operations in the larger whole of our software. They can be thought of as microservices. Most of the ETL is done in our SQL Server, and the microservices in our K8s cluster (usually Python + some Python packages + FastAPI) typically get the required data from our SQL Server through some SQL query with an ODBC connector.
Now my question is: how suitable is Synapse for such an architecture? Can I call upon the SQL pool or Spark pool to do the heavy data lifting from outside the Azure environment, say from a Kubernetes pod?
Unfortunately, you can't integrate Azure Synapse Analytics with Kubernetes services.
While Synapse SQL lets you run SQL queries, Apache Spark executes batch/stream processing on big data. A SQL pool is used to work with data stored in a dedicated SQL pool, while Spark SQL can be integrated with existing data preparation or data science projects that you may hold in Azure Databricks or Azure Machine Learning.
Also, as per this third-party document, Azure Synapse Analytics can't integrate with Kubernetes services.
As a workaround, you can copy/move your data from Kubernetes to Azure services such as a dedicated SQL pool, Azure Blob Storage, or Azure Data Lake Storage, and then process it with a Synapse pipeline or Spark pool.
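As a rough illustration of that workaround, once the data has been copied into ADLS Gen2 it can be read from a Synapse Spark pool notebook with plain Spark APIs; the storage account, container, and path below are placeholders, not anything from the original question.

```python
# Minimal sketch: read data that was copied from the Kubernetes side into
# ADLS Gen2, then work on it inside a Synapse Spark pool notebook.
# <storage-account>, <container>, and the path are placeholder names.
path = "abfss://<container>@<storage-account>.dfs.core.windows.net/exports/orders/"

df = spark.read.parquet(path)          # 'spark' is provided by the Synapse notebook session
df.createOrReplaceTempView("orders")   # expose it to Spark SQL for further processing

spark.sql("SELECT COUNT(*) AS n FROM orders").show()
```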

Customizing nodes of an Azure Synapse Workspace Spark Cluster

When creating a Spark cluster within an Azure Synapse workspace, is there a means to install arbitrary files and directories onto its cluster nodes and/or onto the nodes' underlying distributed filesystem?
By arbitrary files and directories, I literally mean arbitrary files and directories; not just extra Python libraries as demonstrated here.
Databricks smartly provides a means to do this on its cluster nodes (described in this document). Now I'm trying to see if there's a means to do the same on an Azure Synapse workspace Spark cluster.
Thank you.
Unfortunately, Azure Synapse Analytics doesn't support arbitrary binary installs or writing to Spark local storage.
I would suggest you provide feedback on this here:
https://feedback.azure.com/forums/307516-azure-synapse-analytics
All of the feedback you share in these forums will be monitored and reviewed by the Microsoft engineering teams responsible for building Azure.

Local instance of Databricks for development

I am currently working on a small team that is developing a Databricks based solution. For now we are small enough to work off of cloud instances of Databricks. As the group grows this will not really be practical.
Is there a "local" install of Databricks that can be installed for development purposes (it doesn't need to be a scalable version but does need to be essentially fully featured)? In other words, is there a way each developer can create their own development instance of Databricks on their local machine?
Is there another way to provide a dedicated Databricks environment for each developer?
Databricks, as a cloud-deployed platform, leverages many cloud technologies in its deployment. For example, Auto Loader incrementally ingests new data files as they arrive, using EventBridge, SNS and S3 on AWS, and Event Hubs, Notification Hubs and ADLS on Azure. Databricks aims to create a seamless look and feel across AWS, Azure and GCP, but can do this only in the cloud.
For local deployment, you may be able to use Apache Spark and MLflow to create a similar experience, but the notebook experience isn't open source. The Databricks workflow is proprietary, though Databricks has open-sourced many of its technologies, like Delta Lake. A local Spark-plus-MLflow setup may suffice for some teams, with the cloud used sparingly, but the seamless workflow Databricks offers is challenging to replicate outside of the leading cloud vendors.
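For what it's worth, a minimal local setup along those lines could look like the sketch below; it assumes the pyspark, delta-spark and mlflow pip packages are installed, and the paths are placeholders.

```python
# Minimal sketch of a "local Databricks-like" dev environment: local Spark
# with Delta Lake plus a file-based MLflow tracking store. Assumes the
# pyspark, delta-spark and mlflow pip packages are installed.
import mlflow
import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("local-dev")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

mlflow.set_tracking_uri("file:./mlruns")  # local experiment tracking

# Delta tables work locally much as they do in the cloud workspace.
spark.range(10).write.format("delta").mode("overwrite").save("/tmp/demo_delta")
print(spark.read.format("delta").load("/tmp/demo_delta").count())
```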

Custom Script in Azure Data Factory & Azure Databricks

I have a requirement to parse a lot of small files and load them into a database in a flattened structure. I prefer to use ADF V2 and SQL Database to accomplish this. The file parsing logic is already available as a Python script, and I want to orchestrate it in ADF. I can see an option of using the Python notebook connector to Azure Databricks in ADF V2. Will I be able to run a plain Python script in Azure Databricks through ADF? If I do so, will the script run only on the Databricks cluster's driver and therefore not utilize the cluster's full capacity? I am also thinking of calling Azure Functions. Please advise which option is more appropriate in this case.
Here are some ideas for your reference.
Firstly, you mention notebooks and Databricks, which suggests that ADF's own Copy activity and Data Flow can't meet your needs; as far as I know, ADF only supports fairly simple flattening. If you haven't tried that yet, please try it first.
Secondly, if you do have requirements beyond ADF's features, why not leave ADF out? Notebooks and Databricks don't have to be used with ADF, so why pay the extra cost? For a notebook, you have to install packages yourself, such as pymssql or pyodbc. For Azure Databricks, you can mount Azure Blob Storage and access the files as a file system (see the sketch below). In addition, I suppose you don't need many workers for the cluster, so just configure it with a maximum of 2.
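A minimal sketch of that mount, assuming a Databricks notebook; the storage account, container, and secret scope/key names are placeholders you would replace with your own.

```python
# Mount an Azure Blob Storage container in a Databricks notebook so the small
# source files can be read as ordinary paths. <storage-account>, <container>,
# <secret-scope> and <storage-key> are placeholder names.
storage_account = "<storage-account>"
container = "<container>"

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point=f"/mnt/{container}",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="<secret-scope>", key="<storage-key>")
    },
)

# The files are then visible under the mount point, e.g.:
display(dbutils.fs.ls(f"/mnt/{container}/incoming"))
```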
Databricks is more suitable for managing this as a job, I think.
Azure Functions could also be an option. You could create a blob trigger and load the files into one container. You would have to learn the basics of Azure Functions if you are not familiar with them; however, Azure Functions could be more economical.
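If the Azure Functions route appeals, a blob-triggered function in the Python v2 programming model might look roughly like this; the container name, connection setting, and parsing step are placeholders rather than anything from the original question.

```python
# Rough sketch of a blob-triggered Azure Function (Python v2 programming model)
# that parses each small file as it lands. "incoming" and "AzureWebJobsStorage"
# are placeholder names for the container and the connection setting.
import logging
import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="incoming/{name}",
                  connection="AzureWebJobsStorage")
def parse_small_file(blob: func.InputStream):
    logging.info("Processing %s (%d bytes)", blob.name, blob.length)
    payload = blob.read()
    # ... reuse the existing Python parsing logic here and write the
    # flattened rows to Azure SQL Database (e.g. via pyodbc) ...
```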

Do you have to use Azure Data Factory or can you just use Databricks as your ETL tool for your multiple sources?

...Or do I need to add the data into a data lake using Data Factory first and then use Databricks for the ELT?
Depends.
Databricks can connect to data sources and ingest data. However, Azure Data Factory (ADF) has more connectors than Databricks, so it depends on what you need. If using ADF, you need to land the data somewhere (i.e. Azure Storage) so that Databricks can pick it up.
Moreover, another main feature of ADF is orchestrating data movement and activities. Databricks does have a Jobs feature to schedule notebooks or JARs, but it is limited to Databricks itself. If you want to schedule anything outside of Databricks (e.g. drop a file to SFTP, send an email on completion, terminate the Databricks cluster, etc.), then ADF is the way to go.
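For context on the Jobs feature mentioned above, a scheduled notebook job can be created through the Databricks Jobs API 2.1; the workspace URL, token, notebook path, and cluster settings below are placeholders, shown only as a sketch of what Databricks-side scheduling looks like.

```python
# Sketch: create a scheduled notebook job via the Databricks Jobs API (2.1).
# Workspace URL, token, notebook path and cluster settings are placeholders.
import requests

host = "https://<databricks-instance>"
token = "<personal-access-token>"

payload = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "run_etl",
            "notebook_task": {"notebook_path": "/Repos/etl/main"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=payload)
resp.raise_for_status()
print(resp.json())  # returns the new job_id
```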
Indeed, it depends on the scenario, I think. If you have a wide variety of data sources you need to connect to, then ADF is probably the better option.
If your sources are data files (in any format), you could consider using Databricks for ETL.
I use Databricks as a pure ETL tool (without ADF) by mounting a Blob Storage container in a notebook, reading huge XML data from there into a DataFrame in Databricks, parsing the shape of the DataFrame, and then writing the data into an Azure SQL database. Fair to say I'm not really using it for the "E" in ETL, as the data has already been extracted from the real source system.
The big advantage is the power you have at your disposal to parse the files.
Best regards.
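A rough sketch of that pattern, assuming the spark-xml library is installed on the cluster and the container is already mounted; the row tag, JDBC URL, table, and secret names are placeholders.

```python
# Sketch of the XML-to-Azure-SQL flow described above. Assumes the container is
# mounted at /mnt/raw and the spark-xml library is installed on the cluster;
# row tag, JDBC URL, table and secret names are placeholders.
df = (spark.read.format("xml")
      .option("rowTag", "record")
      .load("/mnt/raw/xml/*.xml"))

df.printSchema()  # inspect the inferred shape before flattening

jdbc_url = ("jdbc:sqlserver://<server>.database.windows.net:1433;"
            "database=<database>")

(df.write.format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.ParsedRecords")
   .option("user", dbutils.secrets.get(scope="<scope>", key="sql-user"))
   .option("password", dbutils.secrets.get(scope="<scope>", key="sql-password"))
   .mode("append")
   .save())
```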
