Custom Script in Azure Data Factory & Azure Databricks - azure

I need to parse a large number of small files and load them into a database in a flattened structure. I would prefer to use ADF v2 and SQL Database for this. The file-parsing logic already exists as a Python script, and I want to orchestrate it from ADF. I can see an option in ADF v2 to use a Python or Notebook activity against Azure Databricks. Can I run a plain Python script in Azure Databricks through ADF? If I do, will the script run only on the Databricks cluster's driver and therefore not use the cluster's full capacity? I am also considering Azure Functions. Please advise which option is more appropriate in this case.

Here are some ideas for your reference.
First, since you mention Notebook and Databricks, I assume ADF's own Copy activity and Data Flow can't meet your needs; as far as I know, ADF supports only simple flattening. If you haven't tried that yet, please try it first.
Second, if your requirements do go beyond ADF's features, why not leave ADF out? Notebook and Databricks don't have to be used with ADF, so why pay the extra cost? With a notebook, you have to install packages yourself, such as pysql or pyodbc. With Azure Databricks, you can mount Azure Blob Storage and access the files as a file system. In addition, I suppose you don't need many workers for the cluster, so just set the maximum to 2.
I think Databricks is better suited to being managed as a job.
Azure Functions could also be an option. You could create a blob trigger and load the files into one container. You will have to learn the basics of Azure Functions if you are not familiar with them, but Azure Functions could be more economical.
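Whichever host ends up running the script (a Databricks job, a notebook, or an Azure Function), the flattening step itself can stay plain Python. The following is only a minimal sketch, assuming the small files contain JSON; the `flatten` helper, the `_` separator, and the sample record are hypothetical and not taken from the question.

```python
import json
from typing import Any, Dict


def flatten(record: Any, parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts and lists into a single-level dict of column -> value."""
    if isinstance(record, dict):
        items = list(record.items())
    elif isinstance(record, list):
        # List elements are addressed by their position.
        items = [(str(index), value) for index, value in enumerate(record)]
    else:
        # A scalar was passed in directly: keep it under the parent key.
        return {parent_key: record}

    flat: Dict[str, Any] = {}
    for key, value in items:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, (dict, list)):
            # Recurse into nested containers; empty containers yield no columns.
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Hypothetical sample record standing in for the content of one small file.
sample = json.loads('{"id": 7, "owner": {"name": "a", "tags": ["x", "y"]}}')
row = flatten(sample)
print(row)
```

Each resulting dict maps directly onto one database row. Note that a script like this runs on a single process, so on Databricks it would use only the driver unless the work is distributed through Spark.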

Related

Azure Data Factory: how to get output from a Scala (jar) job?

We have an Azure Data Factory pipeline in which one step is a jar job that should return output used in the next steps.
It is possible to get output from a notebook with dbutils.notebook.exit(....)
I need a similar feature to retrieve output from the main class of the jar.
Thanks!
Actually, as far as I know, there is no built-in feature to execute a jar job directly. However, you could implement it easily with the Azure Databricks service.
There are two ways in the Azure Databricks workspace:
If your jar is an executable jar, just use Set JAR, which lets you set the main class and parameters.
Alternatively, you could use a Notebook and call dbutils.notebook.exit(....) or something similar.
Back in ADF, there is a Databricks activity, and you can use its output in the next steps. If you have any concerns, please let me know.
Updates:
As far as I know, the Jar activity has no feature similar to dbutils.notebook.exit(....). For now I can only offer a workaround: during the jar execution, store the parameters in a specific file in (for example) blob storage. Then use a Lookup activity after the jar activity to read the parameters for the next steps.
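The file-based workaround can be sketched as follows. The jar in the question is Scala; the sketch below uses Python only to illustrate the pattern, and the same logic would live in the jar's main class. The function name `write_job_output`, the file name `result.json`, and the output values are all hypothetical, and a temporary directory stands in for a path on mounted blob storage.

```python
import json
import os
import tempfile


def write_job_output(values: dict, output_path: str) -> str:
    """Write the job's result values as one JSON object to output_path."""
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(values, handle)
    return output_path


# Assumption: a temp directory stands in for a location on mounted blob storage.
out_dir = tempfile.mkdtemp()
path = write_job_output(
    {"rowCount": 42, "status": "ok"},
    os.path.join(out_dir, "jar_output", "result.json"),
)

# Read the file back, which is roughly what the Lookup activity would do.
with open(path, encoding="utf-8") as handle:
    result = json.load(handle)
print(result)
```

A Lookup activity pointed at a JSON dataset for that file could then expose the values to later activities through an expression such as `@activity('LookupJarOutput').output.firstRow.rowCount`, where `LookupJarOutput` is a hypothetical activity name.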
Updates at 1.21.2020
Got some updates from MSFT in the github link: https://github.com/MicrosoftDocs/azure-docs/issues/46347
Sending output is a feature that only notebooks support for notebook
workflows and not jar or python executions in databricks. This should
be a feature ask for databricks and only then ADF can support it.
I would recommend you to submit this as a product feedback on Azure
Databricks feedback forum.
It seems that output from a jar execution is not supported by Azure Databricks, and ADF naturally supports only the features that Azure Databricks provides. You could push this forward by contacting the Azure Databricks team. That is everything I know on this topic.

Do you have to use Azure Data Factory, or can you just use Databricks as your ETL tool for your multiple sources?

...Or do I need to add the data into a data lake using Data Factory first and then use Databricks for ELT?
Depends.
Databricks can connect to data sources and ingest data. However, Azure Data Factory (ADF) has more connectors than Databricks, so it depends on what you need. If you use ADF, you need to land the data somewhere (e.g. Azure Storage) so that Databricks can pick it up.
Moreover, another main feature of ADF is orchestrating data movement and activities. Databricks does have a Job feature to schedule notebooks or JARs, but it is limited to Databricks itself. If you want to schedule anything outside of Databricks (e.g. dropping a file to SFTP, sending an email on completion, or terminating a Databricks cluster), then ADF is the way to go.
Indeed, I think it depends on the scenario. If you need to connect to a wide variety of data sources, then ADF is probably the better option.
If your sources are data files (in any format), you could consider using Databricks for ETL.
I use Databricks as a pure ETL tool (without ADF): in a notebook I mount a container from Blob Storage, read huge XML data from there, and load it into a dataframe in Databricks. Then I parse and reshape the dataframe and write the data into an Azure SQL database. Fair to say I'm not really using it for the "E" in ETL, as the data has already been extracted from the real source system.
The big advantage is the power you have at your disposal to parse the files.
Best regards.

Generating and storing JSON files from the run-time parameters passed to Azure Data Factory v2 pipeline?

Can we create a file (preferably JSON) and store it in one of the supported storage sinks (such as Blob or Azure Data Lake Service) using the parameters that are passed to an Azure Data Factory v2 pipeline at run time? I suppose it can be done via Azure Batch, but that seems like overkill for such a trivial task. Is there a better way to do that?
Looking at all the transformation activities ADFv2 currently provides, I'm afraid there isn't a direct way to create a file in ADFv2. You could use a Custom activity to achieve this by running your own code on an Azure Batch pool of virtual machines. Hope this helps a little.
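The code run by such a Custom activity could be quite small. The following is a minimal sketch under two assumptions that should be checked against the Custom activity documentation: that ADF places an `activity.json` file in the task's working directory, and that the pipeline parameters were mapped into the activity's `extendedProperties`. The function name, file names, and parameter values are hypothetical, and the upload of the resulting file to Blob storage is left out.

```python
import json
import os
import tempfile


def parameters_to_json_file(activity_json_path: str, output_path: str) -> dict:
    """Copy the activity's extendedProperties into a standalone JSON file."""
    with open(activity_json_path, encoding="utf-8") as handle:
        activity = json.load(handle)
    # Assumption: pipeline parameters were passed in as extendedProperties.
    params = activity.get("typeProperties", {}).get("extendedProperties", {})
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(params, handle, indent=2)
    return params


# Simulate the working directory with a hand-written activity.json.
work_dir = tempfile.mkdtemp()
fake_activity = {
    "typeProperties": {
        "extendedProperties": {"runDate": "2020-01-21", "source": "sales"}
    }
}
activity_path = os.path.join(work_dir, "activity.json")
with open(activity_path, "w", encoding="utf-8") as handle:
    json.dump(fake_activity, handle)

output_file = os.path.join(work_dir, "parameters.json")
written = parameters_to_json_file(activity_path, output_file)
print(written)
```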

What are the Azure ML output formats?

Does Azure ML only provide output through its web services?
Is it possible to feed the output to an Azure SQL database?
Is it possible to feed the output to a Redshift database?
Essentially, I would like to know whether I can integrate Azure ML Studio with our existing Redshift analytics database.
Yes, you can write to a SQL DB in Azure.
You can also use a Python module to make REST calls, so in theory you can write to Redshift.
Writing to a SQL DB is possible in Azure ML, and so is writing directly to Azure Blob Storage.
However, unlike @Hai, I do not believe you can write to a Redshift DB, since Microsoft's "Python Module" documentation clearly states that the Python execution is sandboxed and therefore cannot access resources outside the virtual machine it runs on (i.e. Internet resources, on-premises resources, ...).

Azure Data Factory - moving data from On-Premise SQL to Azure SQL

A simple question: can this be achieved directly? I mean without Azure Blob storage in between (as shown in all the examples). Can someone provide a code example, please?
Yes, you can do this directly. In fact, you can do direct copies between any of our supported sources and sinks; you don't have to pass through Blob. To go from on-prem SQL Server to Azure SQL, you will need to set up a Data Management Gateway (DMG) on your on-prem server. Then you use a linked service of type AzureSqlDatabase and an output dataset of type AzureSqlTable, instead of the AzureBlob output dataset shown in the example. The exact steps to set up the DMG and the JSON code for the linked services, datasets, and pipelines can be found in our documentation. We are also improving our UI in the near future to make these kinds of copy setups an easy, code-free experience.
https://azure.microsoft.com/en-us/documentation/articles/data-factory-sqlserver-connector/
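As a rough illustration of the shape of such a direct copy, the sketch below builds the relevant definitions as Python dicts and prints them as JSON. The type names (`OnPremisesSqlServer`, `AzureSqlDatabase`, `SqlSource`, `SqlSink`) are given as I recall them from the ADF v1 connector documentation and should be verified against the link above; every resource name, the gateway name, and the connection strings are hypothetical placeholders.

```python
import json

# Linked service for the on-premises SQL Server, reached through the gateway.
source_linked_service = {
    "name": "OnPremSqlLinkedService",  # hypothetical name
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "<on-prem connection string>",
            "gatewayName": "<gateway name>",
        },
    },
}

# Linked service for the Azure SQL sink; no Blob storage is involved.
sink_linked_service = {
    "name": "AzureSqlLinkedService",  # hypothetical name
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {"connectionString": "<azure sql connection string>"},
    },
}

# Copy activity that reads from the SQL source and writes to the SQL sink.
copy_activity = {
    "name": "OnPremSqlToAzureSql",  # hypothetical name
    "type": "Copy",
    "inputs": [{"name": "OnPremSqlTableDataset"}],   # hypothetical dataset
    "outputs": [{"name": "AzureSqlTableDataset"}],   # hypothetical dataset
    "typeProperties": {
        "source": {"type": "SqlSource"},
        "sink": {"type": "SqlSink"},
    },
}

definitions = [source_linked_service, sink_linked_service, copy_activity]
print(json.dumps(definitions, indent=2))
```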

Resources