Is there a way to run (or convert) .ipynb files on a Databricks cluster without using the import UI of Databricks? Basically I want to be able to develop in Jupyter but also be able to run this file on Databricks, where it's pulled through Git.
It's possible to import a Jupyter notebook into the Databricks workspace as a Databricks notebook and then execute it. You can use:
Workspace Import REST API
databricks workspace import command of databricks-cli.
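As a concrete illustration of the REST API route, here is a minimal Python sketch; the workspace URL, token, local file name and target path are placeholders:

# Hedged sketch: import a local .ipynb as a Databricks notebook via the
# Workspace Import REST API; host, token and paths are placeholders.
import base64
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

with open("notebook.ipynb", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Users/me@example.com/notebook",  # target location in the workspace
        "format": "JUPYTER",                       # convert the .ipynb into a Databricks notebook
        "language": "PYTHON",
        "overwrite": True,
        "content": content,
    },
)
resp.raise_for_status()

The "format": "JUPYTER" field is what converts the .ipynb into a Databricks notebook; the databricks workspace import CLI command accepts a corresponding format flag.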
P.S. Unfortunately you can't open it just by committing it into a Repo; it will be treated as JSON. So you need to import it to convert it into a Databricks notebook.
Related
I am using Azure Synapse in combination with Jupyter notebooks.
Many of my Jupyter notebooks import custom Python scripts, such as a util import:
However, there is no option to hold *.py files in Azure Synapse. Whenever I use the import functionality, the *.py file is transformed into a notebook (on my laptop it was util.py; after the Synapse import it is a notebook):
How can custom *.py files be used in Azure Synapse notebooks without being transformed from *.py into notebooks?
There are several options for adding Python to Synapse. You can manage packages at the workspace, pool, or session level. The method I've had the most success with is loading from PyPI onto a Spark pool, but you can also upload wheel or JAR files to your workspace and reference those in your notebooks. One caveat on using PyPI: if the package has a dependency on C library packages, Synapse won't load it into your pool.
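As a hedged sketch of the wheel route: package util.py with a minimal setup.py, build a .whl locally, and upload it as a workspace or Spark pool package in Synapse Studio; the package name below is just an example:

# setup.py - minimal packaging of util.py into a wheel (example name "myutils").
# Build locally with `python setup.py bdist_wheel` (or `python -m build`) and
# upload the resulting .whl as a workspace/pool package in Synapse Studio.
from setuptools import setup

setup(
    name="myutils",        # example package name
    version="0.1.0",
    py_modules=["util"],   # ships util.py as an importable module
)

Once the wheel is attached to the Spark pool, the notebooks can simply import util again.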
We are using Databricks to generate ETL scripts. One step requires us to upload small CSVs into a Repos folder. I can do this manually using the import window in the Repos GUI. However, I would like to do this programmatically using the Databricks CLI. Is this possible? I have tried using the Workspace API, but this only works for source code files.
Unfortunately that's not possible right now, because there is no API for it that could be used by databricks-cli. But you can add and commit the files to the Git repository, and then use databricks repos update to pull them into the workspace.
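If you want to script that last pull step, databricks repos update is a thin wrapper around the Repos REST API; a hedged sketch using the API directly, where host, token, repo id and branch are placeholders:

# Pull the latest commit of a branch into a workspace repo via the Repos API.
# host, token, repo_id and branch are placeholders.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"
repo_id = "<repo-id>"  # e.g. from `databricks repos list` or the Repos UI

resp = requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"branch": "main"},  # checks out this branch and pulls its latest commit
)
resp.raise_for_status()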
I am trying to import some data from a public repo on GitHub so that I can use it from my Databricks notebooks.
So far I have tried to connect my Databricks account with my GitHub account as described here, without success though, since it seems that GitHub support requires non-Community licensing. I get the following message when I try to set the GitHub token that is required for the GitHub integration:
The same question has been asked before on the official Databricks forum.
What is the best way to import and store a GitHub repo on Databricks Community Edition?
I managed to solve this using shell commands from the notebook itself. To retrieve the repository for the first time I did a git clone via HTTPS:
%sh git clone https://github.com/SomeDataRepo/TheData.git --depth 1 --branch=master /dbfs/FileStore/TheData/
Why not SSH? SSH requires setting up SSH keys, which was not necessary in my case.
Finally, every time I need a fresh version of the data I run a git pull before executing my program:
%sh git -C /dbfs/FileStore/TheData/ pull
Assuming you have Python installed on your desktop: install the Databricks CLI, clone the Git repo to your local machine, then use the workspace CLI to import the entire repo as a directory.
https://docs.databricks.com/dev-tools/cli/workspace-cli.html
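A minimal sketch of that flow driven from Python; the repo URL, local path and target workspace folder are placeholders, and it assumes the Databricks CLI is installed and already configured with a token:

# Clone the repo locally, then push the whole directory into the workspace
# with `databricks workspace import_dir`. Placeholders throughout.
import subprocess

repo_url = "https://github.com/SomeDataRepo/TheData.git"
local_dir = "./TheData"
target_dir = "/Users/you@example.com/TheData"

subprocess.run(["git", "clone", "--depth", "1", repo_url, local_dir], check=True)
subprocess.run(["databricks", "workspace", "import_dir", local_dir, target_dir], check=True)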
The simplest way is to just import the .dbc file directly into your user workspace on Community Edition, as explained by Databricks here:
Import GitHub repo into Community Edition Workspace
1. In GitHub, in the pane to the right, under Releases, click on the Latest link.
2. Under Assets, look for the link to the DBC file.
3. Right-click the DBC file's link and copy the link location (there is no need to download this file).
4. Back in Databricks, click on the Workspace icon in the navigational pane to the left.
5. In the Workspace swimlane, click the Home button to open your home folder. It should open the folder /Users/your-email-address, as in /Users/student@example.com.
6. In the swimlane for your email address, click on the down chevron and select Import.
7. In the Import Notebooks dialog, select URL.
8. Paste in the URL copied in step #3 above.
9. Click Import.
10. Once the import is done, select the new folder for this course to view this course's notebooks.
Which notebook you should start with depends on your courseware and/or instructor.
I create experiments in my workspace using the Python SDK (azureml-sdk). I now have a lot of 'test' experiments littering our workspace. How can I delete individual experiments, either through the API or in the portal? I know I can delete the whole workspace, but there are some good experiments we don't want to delete.
https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-export-delete-data#delete-visual-interface-assets suggests it is possible but my workspace view does not look anything like what is shown there
Experiment deletion is a common request and we in the Azure ML team are working on it. Unfortunately it's not supported quite yet.
Starting from the 2021-08-24 Azure ML Workspace release you can delete an experiment, but only by clicking in the UI (select the experiment in the Experiments view -> 'Delete').
Watch out: deleting the experiment will delete all the underlying runs, and deleting a run will delete the child runs, run metrics, metadata, outputs, logs and working directories!
Only for experiments without any underlying runs can you use the Python SDK (azureml-core==1.34.0): the Experiment class's delete static method, for example:
from azureml.core import Workspace, Experiment

aml_workspace = Workspace.from_config()
# look up the experiment's id, then delete it (only works if the experiment has no runs)
experiment_id = Experiment(aml_workspace, '<experiment_name>').id
Experiment.delete(aml_workspace, experiment_id)
If an experiment has runs you will get an error:
CloudError: Azure Error: UserError
Message: Only empty Experiments can be deleted. This experiment contains run(s)
I hope Azure ML team gets this functionality to Python SDK soon!
Also, on a sad note: it would be great if the deletion were optimized. For now it appears to be an extremely slow, synchronous call (an asynchronous variant would be welcome as well).
You can delete your experiment with the following code:
# Declare your experiment
from azureml.core import Experiment
experiment = Experiment(workspace=ws, name="<your_experiment>")
# Delete the experiment
experiment.archive()
# Now check the list of experiments in your AML workspace and see that it was deleted
This issue is still open at the moment. What I have figured out, to avoid having many experiments in the workspace, is to run locally with the Python SDK and afterwards upload the output files to the run's outputs folder when the run completes.
You can define it as:
run.upload_file(name='outputs/sample.csv', path_or_stream='./sample.csv')
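Putting that together, a minimal sketch; the experiment name and file paths are just examples:

# Run locally/interactively and attach the output file to the run, so a single
# experiment collects the results instead of many test experiments piling up.
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
exp = Experiment(ws, "local_debug")  # example experiment name

run = exp.start_logging()            # lightweight interactive run
# ... do the actual work locally, writing ./sample.csv ...
run.upload_file(name='outputs/sample.csv', path_or_stream='./sample.csv')
run.complete()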
Follow the two steps:
1. Delete the experiment's child jobs in Azure ML Studio.
2. Delete the (now empty) experiment with the Python API:
from azureml.core import Workspace, Experiment

# choose the workspace and experiment
ws = Workspace.from_config()
exp_name = 'digits_recognition'

# ... first delete the experiment's child jobs in Azure ML Studio ...
exp = Experiment(ws, exp_name)
Experiment.delete(ws, exp.id)
Note: for more fine-grained control over deletions, use the Azure CLI.
Hope you are doing well.
I am new to Spark as well as Microsoft Azure. As per our project requirements, we have developed a PySpark script through the Jupyter notebook installed on our HDInsight cluster. To date we have run the code from Jupyter itself, but now we need to automate the script. I tried to use Azure Data Factory but could not find a way to run the PySpark script from there. I also tried Oozie but could not figure out how to use it.
Could you please help me with how I can automate/schedule a PySpark script in Azure?
Thanks,
Shamik.
Azure Data Factory today doesn't have first-class support for Spark. We are working to add that integration in the future. Until then, we have published a sample on GitHub that uses the ADF MapReduce activity to submit a JAR that invokes spark-submit.
Please take a look here:
https://github.com/Azure/Azure-DataFactory/tree/master/Samples/Spark