How to automate a PySpark script in Microsoft Azure

Hope you are doing well.
I am new to Spark as well as Microsoft Azure. As per our project requirement, we have developed a PySpark script through the Jupyter notebook installed on our HDInsight cluster. So far we have run the code from Jupyter itself, but now we need to automate the script. I tried to use Azure Data Factory but could not find a way to run the PySpark script from there. I also tried to use Oozie but could not figure out how to use it.
Could you please help me with how I can automate/schedule a PySpark script in Azure?
Thanks,
Shamik.

Azure Data Factory today doesn't have first-class support for Spark; we are working to add that integration in the future. Until then, we have published a sample on GitHub that uses the ADF MapReduce Activity to submit a jar that invokes spark-submit.
Please take a look here:
https://github.com/Azure/Azure-DataFactory/tree/master/Samples/Spark
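In the meantime, one way to script the submission yourself is to call the HDInsight cluster's Livy endpoint from whatever scheduler you already have. The sketch below is a minimal, hedged example rather than an official sample; the cluster name, credentials, and the wasbs:// path to the script are placeholders you would replace with your own.
# Minimal sketch: submit a PySpark script to an HDInsight cluster via Livy.
# The cluster name, credentials, and script path are placeholders (assumptions).
import requests

LIVY_URL = "https://<your-cluster>.azurehdinsight.net/livy/batches"
AUTH = ("admin", "<cluster-login-password>")  # HDInsight cluster login credentials

payload = {
    # Path to the PySpark script uploaded to the cluster's storage account
    "file": "wasbs://<container>@<storageaccount>.blob.core.windows.net/scripts/my_job.py",
    "args": [],
}

resp = requests.post(
    LIVY_URL,
    json=payload,
    auth=AUTH,
    headers={"X-Requested-By": "admin"},  # required by Livy's CSRF protection
)
resp.raise_for_status()
print("Submitted Livy batch:", resp.json().get("id"))
Any scheduler that can run a Python script (cron, an ADF custom activity, etc.) can then invoke this submission step.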

Related

spark-monitoring library not writing to Azure Log Analytics Workspace

I have installed the new version of the spark-monitoring library, which is supposed to support Databricks Runtime 11.0. See here: spark-monitoring-library. I have successfully attached the init script to my cluster. However, when I run jobs on this cluster, I do not see any logs of the Databricks jobs in Log Analytics. Does anyone have the same problem, and have you resolved it?

How to modify the config file of a Spark job in the Airflow UI?

I'm using Airflow to schedule a Spark job, and the job uses a conf.properties file.
I want to change this file from the Airflow UI, not from the server CLI.
How can I do this?
The Airflow webserver doesn't support editing files in its UI. However, it allows you to add your own plugins and customize the UI by adding flask_appbuilder views (here is the doc).
You can also use an unofficial open-source plugin to do that (e.g. airflow_code_editor).
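As a rough illustration of the flask_appbuilder approach, here is a minimal sketch of an Airflow plugin that renders the properties file in a custom view. The view class, route, template name, and file path are all illustrative assumptions, and a real editor would also need a POST handler to save changes back to disk.
# Minimal sketch of an Airflow plugin exposing a flask_appbuilder view.
# ConfigPropertiesView, the template, and the file path are assumptions.
from airflow.plugins_manager import AirflowPlugin
from flask_appbuilder import BaseView, expose

class ConfigPropertiesView(BaseView):
    default_view = "show"

    @expose("/")
    def show(self):
        # Read the properties file so it can be displayed in the UI.
        # A real editor would add a POST endpoint to write changes back.
        with open("/opt/airflow/conf/conf.properties") as f:  # path is an assumption
            content = f.read()
        return self.render_template("config_properties.html", content=content)

class ConfigPropertiesPlugin(AirflowPlugin):
    name = "config_properties_plugin"
    appbuilder_views = [
        {"name": "conf.properties", "category": "Admin", "view": ConfigPropertiesView()}
    ]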

How to use Airflow-API-Plugin?

I want to List and Trigger DAGs using this https://github.com/airflow-plugins/airflow_api_plugin github repo. How and where should I place this plugin in my airflow folder so that I can call the endpoints?
Is there anything that I need to change in the airflow.cfg file?
The repository you listed has not been updated in a while. Why not just use the experimental REST APIs included in Airflow? You can find them here: https://airflow.apache.org/docs/stable/api.html.
Use:
GET /api/experimental/dags/<DAG_ID>/dag_runs
to get a list of DAG runs, and
POST /api/experimental/dags/<DAG_ID>/dag_runs
to trigger a new DAG run.
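For example, a small sketch of calling the experimental API with requests (the base URL and DAG id are placeholders, and the experimental endpoints must be enabled in your Airflow deployment):
# Sketch: list and trigger DAG runs through Airflow's experimental REST API.
# The base URL and DAG id are placeholders (assumptions).
import requests

BASE_URL = "http://localhost:8080/api/experimental"
DAG_ID = "my_dag"

# List existing DAG runs
runs = requests.get(f"{BASE_URL}/dags/{DAG_ID}/dag_runs")
print(runs.json())

# Trigger a new DAG run, optionally passing a conf dict
trigger = requests.post(
    f"{BASE_URL}/dags/{DAG_ID}/dag_runs",
    json={"conf": {"param": "value"}},
)
print(trigger.json())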

What is a good Databricks workflow

I'm using Azure Databricks for data processing, with notebooks and pipeline.
I'm not satisfied with my current workflow:
The notebook used in production can't be modified without breaking the production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks; if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing?
Git integration is very simple, but this is not my main concern.
Great question. Definitely don't modify your production code in place.
One recommended pattern is to keep separate folders in your workspace for dev, staging, and prod. Do your dev work, then run tests in staging before finally promoting to production.
You can use the Databricks CLI to pull and push a notebook from one folder to another without breaking existing code. Going one step further, you can incorporate this pattern with git to sync with version control. In either case, the CLI gives you programmatic access to the workspace, which should make it easier to update code for production jobs.
Regarding your second point about IDEs - Databricks offers Databricks Connect, which lets you use your IDE while running commands on a cluster. Based on your pain points I think this is a great solution for you, as it will give you more visibility into the functions you have defined and so on. You can also write and run your unit tests this way.
Once you have your scripts ready to go, you can always import them into the workspace as notebooks and run them as jobs. Also note that you can run .py scripts as jobs using the REST API.
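As a rough sketch of what "promoting a notebook between folders" can look like programmatically, the snippet below uses the Databricks Workspace REST API; the workspace URL, token, and notebook paths are placeholders, and the Databricks CLI's workspace export/import commands achieve the same thing.
# Sketch: copy a notebook from a dev folder to a prod folder via the Workspace API.
# The workspace URL, token, and notebook paths are placeholders (assumptions).
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Export the dev notebook as base64-encoded source
export = requests.get(
    f"{HOST}/api/2.0/workspace/export",
    headers=HEADERS,
    params={"path": "/dev/my_notebook", "format": "SOURCE"},
)
export.raise_for_status()
content = export.json()["content"]

# Import (overwrite) it into the prod folder
imp = requests.post(
    f"{HOST}/api/2.0/workspace/import",
    headers=HEADERS,
    json={
        "path": "/prod/my_notebook",
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
)
imp.raise_for_status()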
I personally prefer to package my code, and copy the *.whl package to DBFS, where I can install the tested package and import it.
Edit: To be more explicit.
The notebook used in production can't be modified without breaking the production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
This can be solved by either having separate environments DEV/TST/PRD. Or having versioned packages that can be modified in isolation. I'll clarify later on.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks; if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing?
Yes, using the versioned packages method I mentioned in combination with databricks-connect, you are totally able to use your IDE, implement tests, have proper git integration.
Git integration is very simple, but this is not my main concern.
Built-in git integration is actually quite poor when working in bigger teams. You can't develop in the same notebook simultaneously, because changes accumulate in a flat, linear history shared with your colleagues. Besides that, linking and unlinking repositories is prone to human error: notebooks end up synchronized to the wrong folders, and runs break because notebooks can't be imported. I advise you to also use my packaging solution.
The packaging solution works as follows:
On your desktop, install pyspark
Download some anonymized data to work with
Develop your code with small bits of data, writing unit tests
When ready to test on big data, uninstall pyspark, install databricks-connect
When performance and integration is sufficient, push code to your remote repo
Create a build pipeline that runs automated tests, and builds the versioned package
Create a release pipeline that copies the versioned package to DBFS
In a "runner notebook" accept "process_date" and "data folder/filepath" as arguments, and import modules from your versioned package
Pass the arguments to your module to run your tested code
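A minimal sketch of such a runner notebook, assuming the versioned package is called my_pipeline, exposes a run() function, and has been installed on the cluster from DBFS (the package name, function, and widget names are illustrative):
# Sketch of a Databricks "runner notebook" cell; dbutils and spark are provided by the runtime.
# The my_pipeline package, its run() function, and the widget names are assumptions.
dbutils.widgets.text("process_date", "2023-01-01")
dbutils.widgets.text("data_path", "/mnt/raw/some_dataset")

process_date = dbutils.widgets.get("process_date")
data_path = dbutils.widgets.get("data_path")

# Import from the versioned, tested package installed on the cluster
from my_pipeline.jobs import run

run(spark, process_date=process_date, data_path=data_path)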
The way we are doing it:
- Integrate the dev notebooks with Azure DevOps.
- Create custom build and deployment tasks for notebook, job, package and cluster deployments. This is fairly easy to do with the Databricks REST API (a rough sketch follows below):
https://docs.databricks.com/dev-tools/api/latest/index.html
- Create a release pipeline for Test, Staging and Production deployments.
- Deploy to Test and test.
- Deploy to Staging and test.
- Deploy to Production.
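For instance, a deployment task that registers a job pointing at the deployed notebook might look roughly like this sketch against the Jobs API; the workspace URL, token, cluster settings, schedule, and notebook path are placeholders:
# Sketch of a deployment step: create a Databricks job for the deployed notebook.
# The workspace URL, token, cluster settings, and notebook path are assumptions.
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

job_spec = {
    "name": "nightly-etl",
    "new_cluster": {
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/prod/my_notebook"},
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(f"{HOST}/api/2.0/jobs/create", headers=HEADERS, json=job_spec)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])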
Hope this can help.

How to delete an experiment from an Azure Machine Learning workspace

I create experiments in my workspace using the Python SDK (azureml-sdk). I now have a lot of 'test' experiments littering our workspace. How can I delete individual experiments, either through the API or on the portal? I know I can delete the whole workspace, but there are some good experiments we don't want to delete.
https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-export-delete-data#delete-visual-interface-assets suggests it is possible, but my workspace view does not look anything like what is shown there.
Experiment deletion is a common request and we in the Azure ML team are working on it. Unfortunately it's not supported quite yet.
Starting from the 2021-08-24 Azure ML Workspace release you can delete an experiment - but only by clicking in the UI (select the experiment in the Experiments view -> 'Delete').
Watch out - deleting the experiment will delete all the underlying runs - and deleting a run will delete the child runs, run metrics, metadata, outputs, logs and working directories!
Only for experiments without any underlying runs can you use the Python SDK (azureml-core==1.34.0) - the Experiment class's delete static method, for example:
from azureml.core import Workspace, Experiment
# Connect to the workspace and look up the experiment's id
aml_workspace = Workspace.from_config()
experiment_id = Experiment(aml_workspace, '<experiment_name>').id
# Delete the (empty) experiment by id
Experiment.delete(aml_workspace, experiment_id)
If an experiment has runs you will get an error:
CloudError: Azure Error: UserError
Message: Only empty Experiments can be deleted. This experiment contains run(s)
I hope the Azure ML team brings this functionality to the Python SDK soon!
Also, on a sad note - it would be great if the deletion were optimized; for now it seems to be an extremely slow, synchronous call (an async variant would be welcome as well).
You can delete your experiment with the following code:
# Declare your experiment
from azureml.core import Experiment
experiment = Experiment(workspace=ws, name="<your_experiment>")
# Delete the experiment
experiment.archive()
# Now check the list of experiments on your AML workspace and see that it was deleted
This issue is still open at the moment. What I have figured out to avoid littering the workspace with many experiments is to run locally using the Python SDK and then upload the output files to the run's outputs folder when the run completes.
You can do that as follows:
run.upload_file(name='outputs/sample.csv', path_or_stream='./sample.csv')
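In context, a minimal sketch of that local-run pattern might look like this (the experiment name, metric, and file names are placeholders):
# Sketch: run locally and attach outputs to a single run instead of many test experiments.
# The experiment name, metric, and file names are placeholders (assumptions).
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="local-dev")

run = experiment.start_logging()          # interactive/local run
# ... do your local processing and write ./sample.csv ...
run.log("rows_processed", 1000)           # example metric
run.upload_file(name='outputs/sample.csv', path_or_stream='./sample.csv')
run.complete()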
Follow these two steps:
1. Delete the experiment's child jobs in Azure ML Studio.
2. Delete the (now empty) experiment with the Python API:
from azureml.core import Workspace, Experiment
# choose the workspace and experiment
ws = Workspace.from_config()
exp_name = 'digits_recognition'
# ... first delete the experiment's child jobs in Azure ML Studio
exp = Experiment(ws, exp_name)
Experiment.delete(ws, exp.id)
Note: for more fine-grained control over deletions, use the Azure CLI.
