I want to pass the logged-in user name from Azure Data Factory to an Azure Databricks notebook.
I tried the dbutils functionality but had no luck.
    x = str(dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user'))
    print(x)
I tried the above code in the notebook. When I run the notebook directly it gives the expected result, but when it is invoked from ADF it does not work.
In short, you can't.
In Data Factory, when you create the linked service to Databricks, you give ADF a personal access token. This means all requests made from ADF are executed as the owner of that token, but no user is actually logged in to the Databricks platform at that moment, so no user context is initialized.
In ADF there isn't even a user variable, because pipelines are designed to be executed by schedule triggers, event triggers, or external systems, not manually by users.
Here is the list of available system variables
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
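If some value still has to reach the notebook, the only route is to supply it yourself as a base parameter on the Databricks Notebook activity (there is no logged-in user for ADF to discover, so whatever starts the pipeline would have to provide the value) and read it with a widget. A minimal sketch of the notebook side; the parameter name triggering_user is just an example you would define in the activity's Base parameters:

    # Read a value passed from ADF as a base parameter. "triggering_user" is a
    # hypothetical parameter name defined on the Databricks Notebook activity.
    dbutils.widgets.text("triggering_user", "")   # default used for interactive runs
    user = dbutils.widgets.get("triggering_user")
    print(user)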
Related
I would like to create a Synapse pipeline for batch inferencing: ingest the data into the data lake, use it as input when calling a batch endpoint that has already been created (through the Azure Machine Learning Execute Pipeline activity), then capture the output into the data lake (appended to a table) to continue with the next steps...
The documentation from Microsoft for setting up such a scenario is very poor, and everything I have tried has failed.
Below is the Azure Machine Learning Execute Pipeline configuration. I need to pass a value for dataset_param, using the data asset instance that is already available, as shown below.
But it complains that dataset_param is not provided. I am not sure how to pass this value...
Here is the original experiment / pipeline / endpoint created by the DevOps pipeline. I just call this endpoint from the Synapse pipeline above.
We have a series of CSV files landing every day (a daily delta) that need to be loaded into an Azure database using Azure Data Factory (ADF). We have created an ADF pipeline that moves data straight from an on-premises folder to an Azure DB table, and it is working.
Now we need this pipeline to be executed based on an event rather than a scheduled time: specifically, on the creation of a specific file in the same local folder. This file is created when the landing of the daily delta files is complete. Let's call it SRManifest.csv.
The question is: how do I create a trigger that starts the pipeline when SRManifest.csv is created? I have looked into Azure Event Grid, but it seems it doesn't work with on-premises folders.
You're right that you cannot configure an Event Grid trigger to watch local files, since you're not writing to Azure Storage. You'd need to generate your own signal after writing your local file content.
Aside from timer-based triggers, event-based triggers are tied to Azure Storage, so the only way to use them would be to drop some type of "signal" file in a well-known storage location, after your files are written locally, to trigger your ADF pipeline to run.
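As a minimal sketch of that signal-file idea, the on-premises process could upload the manifest to the watched container with the azure-storage-blob package; the container name, connection string, and local path below are placeholders:

    from azure.storage.blob import BlobServiceClient

    # Upload the manifest as the "signal" blob that the storage event trigger watches.
    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    signal = service.get_blob_client(container="triggers", blob="SRManifest.csv")
    with open(r"C:\landing\SRManifest.csv", "rb") as f:   # placeholder local path
        signal.upload_blob(f, overwrite=True)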
Alternatively, you can trigger an ADF pipeline programmatically (.NET and Python SDKs support this; maybe other ones do as well, plus there's a REST API). Again, you'd have to build this, and run your trigger program after your local content has been created. If you don't want to write a program, you can use PowerShell (via Invoke-AzDataFactoryV2Pipeline).
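For the programmatic route, a hedged sketch with the Python SDK (azure-identity plus azure-mgmt-datafactory); the pipeline name and parameters below are placeholders and only apply if your pipeline declares them:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    credential = DefaultAzureCredential()
    adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

    # Kick off the pipeline run once the local landing (and SRManifest.csv) is complete.
    run = adf_client.pipelines.create_run(
        resource_group_name="<resource-group>",
        factory_name="<factory-name>",
        pipeline_name="LoadDailyDelta",              # hypothetical pipeline name
        parameters={"sourceFolder": "daily-delta"},  # only if the pipeline defines this parameter
    )
    print(run.run_id)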
There are other tools/services that integrate with Data Factory as well; I wasn't attempting to provide an exhaustive list.
Have a look at the triggers of the File System connector for Azure Logic Apps; a Logic App can watch the on-premises folder (through the on-premises data gateway) and then start your ADF pipeline.
I have a requirement to parse a lot of small files and load them into a database in a flattened structure. I would prefer to use ADF V2 and SQL Database to accomplish this. The file-parsing logic already exists as a Python script, and I want to orchestrate it in ADF. I can see the option of using the Python Notebook connector to Azure Databricks in ADF V2. May I ask whether I will be able to run a plain Python script in Azure Databricks through ADF? If I do so, will the script run only on the Databricks cluster's driver and therefore not utilize the cluster's full capacity? I am also thinking of calling Azure Functions. Please advise which one is more appropriate in this case.
Just to provide some ideas for your reference.
Firstly, you are talking about notebooks and Databricks, which suggests that ADF's own Copy activity and Data Flow can't meet your needs; as far as I know, ADF only offers a simple flatten feature. If you haven't tried that yet, please try it first.
Secondly, if you do have requirements beyond ADF's features, why not leave ADF out entirely? Notebooks and Databricks don't have to be used with ADF, so why pay the extra cost? For a notebook, you have to install packages yourself, such as pymssql or pyodbc. In Azure Databricks, you can mount Azure Blob Storage and access those files as a file system, as in the sketch below. In addition, I suppose you don't need many workers for the cluster, so just configure a maximum of 2.
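Here is a minimal sketch of that mount from a Databricks notebook; the storage account, container, and secret scope names are placeholders:

    # Mount a blob container so the small files can be read like a local file system.
    dbutils.fs.mount(
        source="wasbs://<container>@<storage-account>.blob.core.windows.net",
        mount_point="/mnt/smallfiles",
        extra_configs={
            "fs.azure.account.key.<storage-account>.blob.core.windows.net":
                dbutils.secrets.get(scope="<secret-scope>", key="<storage-key-name>")
        },
    )
    print(dbutils.fs.ls("/mnt/smallfiles"))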
I think Databricks is better suited to being managed as a job.
An Azure Function could also be an option: you could create a blob trigger and load the files into one container. Of course, you have to learn the basics of Azure Functions if you are not familiar with them; however, an Azure Function could be more economical.
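A hedged sketch of such a blob-triggered function with the Python programming model; the container name "incoming" and the connection setting name are assumptions, and the body is where the existing parsing script would go:

    import azure.functions as func

    app = func.FunctionApp()

    # Fires whenever a new file lands in the "incoming" container.
    @app.blob_trigger(arg_name="blob", path="incoming/{name}", connection="AzureWebJobsStorage")
    def parse_small_file(blob: func.InputStream):
        raw = blob.read().decode("utf-8")
        # ... reuse the existing Python parsing logic here, then write the
        # flattened rows to the SQL Database (e.g. with pyodbc).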
We have an Azure Data Factory pipeline, and one step is a JAR job that should return output to be used in the next steps.
It is possible to get output from a notebook with dbutils.notebook.exit(....).
I need a similar feature to retrieve output from the main class of a JAR.
Thanks!
Image of my pipeline
Actually, as far as I know there is no built-in feature to execute a JAR job directly. However, you could implement it easily with the Azure Databricks service.
There are two ways in the Azure Databricks workspace:
If your JAR is an executable JAR, just use Set JAR, which lets you set the main class and parameters:
Alternatively, you could use a notebook to execute dbutils.notebook.exit(....) or something else.
Back in ADF, ADF has a Databricks activity, and you can get its output for the next steps. If you have any concerns, please let me know.
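For the notebook route, a minimal sketch of handing a small result back; the JSON payload and the activity name used afterwards are just examples:

    import json

    # Keep the payload small; the exit value is returned to ADF as a string.
    result = {"rowCount": 42, "status": "ok"}
    dbutils.notebook.exit(json.dumps(result))

In the ADF pipeline, later activities can then read it with the expression @activity('MyNotebookActivity').output.runOutput.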
Updates:
As far as I know there is no equivalent of dbutils.notebook.exit(....) for the Jar activity. So far I can only offer a workaround: inside the JAR execution, store the parameters in a specific file that resides in (for example) blob storage, then use a Lookup activity after the Jar activity to get the parameters for the next steps.
Update, 1.21.2020:
I got a response from Microsoft in this GitHub issue: https://github.com/MicrosoftDocs/azure-docs/issues/46347
"Sending output is a feature that only notebooks support for notebook workflows and not jar or python executions in databricks. This should be a feature ask for databricks and only then ADF can support it. I would recommend you to submit this as a product feedback on Azure Databricks feedback forum."
So it seems that output from a JAR execution is simply not supported by Azure Databricks, and ADF naturally only supports the features Azure Databricks exposes. You could push for progress here by contacting the Azure Databricks team; I have shared everything I know on this.
Use case
We have an on-premises Hadoop setup and we use Power BI as our BI visualization tool. What we currently do to get data into Power BI is as follows.
1. Copy data from on-premises to Azure Blob (our on-premises schedule does this once the data is ready in Hive).
2. Data from Azure Blob is then copied to Azure Data Warehouse / Azure SQL.
3. The cube refreshes on Azure AAS; AAS pulls data from Azure Data Warehouse / SQL.
To do steps 2 and 3 we currently run a web server on Azure, and its endpoints are configured to take a few parameters such as the table name, the Azure file location, cube information, and so on.
Sample HTTP request:
http://azure-web-server-scheduler/copydata?from=blob&to=datawarehouse&fromloc=myblob/data/today.csv&totable=mydb.mytable
Here the web server extracts the values of the variables (from, fromloc, to, totable) and then performs the copy. We did this because we have a lot of tables and they can all reuse the same function.
Now we have use cases piling up (retries, control flow, email alerts, monitoring), and we are looking for a cloud alternative to do the scheduling for us. We would still like to hit an HTTP endpoint like the one above.
One of the alternatives we have checked so far is Azure Data Factory, where we would create pipelines to achieve the steps above and trigger ADF via HTTP endpoints.
Problems
How can we take parameters from the HTTP POST call and make them available as custom variables [1]? This is needed inside the pipeline so that we can still write one function for each of steps 2 and 3 and have it take these parameters; we don't want to create a separate ADF pipeline for each table.
How can we detect failures in ADF steps and send email alerts when they occur?
What are the other options, apart from ADF, for doing this in Azure?
[1] https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
You could trigger the copy job from Blob to SQL DW via a Get Metadata activity. It can be used in the following scenarios:
- Validate the metadata information of any data
- Trigger a pipeline when data is ready/ available
For email notification you can use a Web activity that calls a Logic App. See the following tutorial on how to send an email notification.
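Regarding the first question (taking parameters from the HTTP call): ADF pipelines can declare their own parameters, and the createRun REST endpoint accepts their values in the request body, which keeps your existing HTTP-endpoint pattern. A hedged sketch, assuming a parameterized pipeline named CopyBlobToDw and azure-identity for the bearer token:

    import requests
    from azure.identity import DefaultAzureCredential

    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

    url = (
        "https://management.azure.com/subscriptions/<subscription-id>"
        "/resourceGroups/<resource-group>/providers/Microsoft.DataFactory"
        "/factories/<factory-name>/pipelines/CopyBlobToDw/createRun"
        "?api-version=2018-06-01"
    )
    # Values for the pipeline's own parameters, mirroring the old query string.
    body = {"fromloc": "myblob/data/today.csv", "totable": "mydb.mytable"}

    resp = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"})
    print(resp.json()["runId"])

Inside the pipeline those values are referenced as @pipeline().parameters.fromloc and @pipeline().parameters.totable, so one parameterized pipeline can serve all tables.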