Execute databricks magic command from PyCharm IDE - python-3.x

With databricks-connect we can successfully run code written for Databricks, or in a Databricks notebook, from many IDEs. Databricks also provides magic commands, such as %sql or %md, to support multi-language cells. One issue I am currently facing when I try to execute Databricks notebooks from PyCharm is:
How do I execute a Databricks-specific magic command from PyCharm?
E.g.
Importing a script or notebook is done in Databricks using this command:
%run
'./FILE_TO_IMPORT'
whereas in an IDE, from FILE_TO_IMPORT import XYZ works.
Moreover, every time I download a Databricks notebook, the magic commands are commented out, which makes the file impossible to use anywhere outside the Databricks environment.
It's really inefficient to convert all the Databricks magic commands every time I want to do any development.
Is there any configuration I could set which automatically detects Databricks-specific magic commands?
Any solution to this will be helpful. Thanks in advance!
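For reference, a notebook exported from Databricks as a .py file marks each magic line with a "# MAGIC" prefix. Below is a rough sketch of detecting those commented-out magics automatically; the helper name and file name are made up for illustration, and this is not an official tool:

    import re

    # Matches lines like "# MAGIC %run './FILE_TO_IMPORT'" in a source export.
    MAGIC_LINE = re.compile(r"^#\s*MAGIC\s+(%\w+.*)$")

    def find_magic_commands(path):
        """Return (line_number, command) pairs for every commented-out magic."""
        hits = []
        with open(path) as f:
            for number, line in enumerate(f, start=1):
                match = MAGIC_LINE.match(line.rstrip("\n"))
                if match:
                    hits.append((number, match.group(1)))
        return hits

    # Example: list every %run/%sql/%md in an exported notebook.
    for number, command in find_magic_commands("my_notebook.py"):
        print(number, command)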

Unfortunately, as per databricks-connect version 6.2.0:
"We cannot use magic commands outside the Databricks environment directly. This would require creating custom functions, but those would only work for Jupyter, not PyCharm."
Again, since importing .py files requires the %run magic command, this also becomes a major issue. One solution is to convert the set of files to be imported into a Python package, add it to the cluster via the Databricks UI, and then import and use it in PyCharm. But this is a very tedious process.
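For that packaging workaround, a minimal sketch of what the package could look like (the "shared_code" name is a placeholder, not anything Databricks-specific):

    # setup.py for a hypothetical "shared_code" package holding the .py files
    # you would otherwise pull in with %run.
    from setuptools import setup, find_packages

    setup(
        name="shared_code",
        version="0.1.0",
        packages=find_packages(),
    )

Build a wheel with python setup.py bdist_wheel, upload it through the cluster's Libraries UI, and from shared_code import XYZ then works the same way in PyCharm and on the cluster.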

Related

Using databricks-connect debugging a notebook that runs another notebook

I am able to connect to the Azure Databricks cluster from my Linux CentOS VM, using Visual Studio Code.
The code below even works without any issue:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("Cluster access test - ", spark.range(100).count())

setting = spark.conf.get("spark.master")  # returns local[*] under databricks-connect
if "local" in setting:
    # running through databricks-connect, so build dbutils explicitly
    from pyspark.dbutils import DBUtils
    dbutils = DBUtils(spark)
else:
    print("Do nothing - dbutils should be available already")

out = dbutils.fs.ls('/FileStore/')
print(out)
I have a notebook on my local machine which runs another notebook using %run path/anothernotebook.
Since the %run line is commented out with #, Python does not execute it.
So I tried dbutils.notebook.run('pathofnotebook') instead, but it errors out:
Exception has occurred: AttributeError
'SparkServiceClientDBUtils' object has no attribute 'notebook'
Is it possible to locally debug a notebook that invokes another notebook?
It's impossible - the dbutils implementation included in Databricks Connect supports only the 'fs' and 'secrets' subcommands (see the docs).
Databricks Connect is designed to work with code developed locally, not with notebooks. If you can package the content of that notebook as a Python package, then you'll be able to debug it.
P.S. Please take into account that dbutils.notebook.run executes the notebook as a separate job, in contrast with %run, which runs it inline in the caller's context.
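To make that concrete, here is a minimal sketch of the refactoring suggested above - moving the child notebook's logic into an importable module. All names are made up for illustration:

    # anothernotebook.py - the logic that used to live in the child notebook
    def main(spark):
        return spark.range(100).count()

    # driver.py - what you actually run and debug in VS Code
    from pyspark.sql import SparkSession
    from anothernotebook import main

    spark = SparkSession.builder.getOrCreate()
    print(main(spark))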

Can I have more than one connection in databricks-connect?

I have set up on my PC a miniconda Python environment where I have installed the databricks-connect package and configured the tool with databricks-connect configure to connect to the Databricks instance I want to use when developing code in the US.
I need to connect to a different Databricks instance for developing code in the EU, and I thought I could do this by setting up a second miniconda environment, installing databricks-connect in it, and pointing that environment's configuration at the new Databricks instance.
Alas, this did not work. When I look at databricks-connect configure in either miniconda environment, I see the same configuration in both, namely whichever one I configured last.
My question therefore is: Is there a way to have multiple databricks-connect connections at the same time and toggle between the two without having to reconfigure each time?
Thank you for your time.
Right now, databricks-connect relies on a central configuration file, and this causes problems. There are two approaches to work around that:
Use environment variables as described in the documentation - but they have to be set somehow, and you still need separate Python environments for different versions of databricks-connect (see the sketch after this answer)
Specify the parameters as Spark configuration (see the same documentation)
For each DB cluster, do the following:
create a separate Python environment with name <name> and activate it
install databricks-connect into it
configure databricks-connect
move ~/.databricks-connect to ~/.databricks-connect-<name>
write a wrapper script that activates the Python environment and symlinks ~/.databricks-connect-<name> to ~/.databricks-connect (I have such a script for Zsh; it would be too long to paste here.)
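For the environment-variable approach, a minimal sketch of selecting a workspace in code before the Spark session is created. The variable names follow the databricks-connect documentation, but the hosts, tokens, and cluster IDs below are placeholders:

    import os

    # Hypothetical connection profiles - substitute your real workspace URLs,
    # tokens, and cluster IDs.
    PROFILES = {
        "us": {
            "DATABRICKS_ADDRESS": "https://us-workspace.cloud.databricks.com",
            "DATABRICKS_API_TOKEN": "<us-token>",
            "DATABRICKS_CLUSTER_ID": "<us-cluster-id>",
        },
        "eu": {
            "DATABRICKS_ADDRESS": "https://eu-workspace.cloud.databricks.com",
            "DATABRICKS_API_TOKEN": "<eu-token>",
            "DATABRICKS_CLUSTER_ID": "<eu-cluster-id>",
        },
    }

    # Export the chosen profile before the session is built, so that
    # databricks-connect reads these variables instead of ~/.databricks-connect.
    os.environ.update(PROFILES["eu"])

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()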

How to list Databricks scopes using Python when working with its Secrets API

I can create a scope. However, I want to be sure to create the scope only when it does not already exist. Also, I want to do the checking in Python. Is that doable?
What I have found out is that I can create the scope multiple times and not get an error message -- is this the right way to handle this? The document https://docs.databricks.com/security/secrets/secret-scopes.html#secret-scopes points out using
databricks secrets list-scopes
to list the scopes. However, I created a cell and ran
%sh
databricks secrets list-scopes
I got an error message saying "/bin/bash: databricks: command not found".
Thanks!
This will list all the scopes.
dbutils.secrets.listScopes()
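Building on that, a minimal sketch of the check the question asks for - proceed with creation only when the scope is missing. "my-scope" is a placeholder, and this assumes it runs in a Databricks notebook where dbutils is available:

    scope_name = "my-scope"
    existing = [scope.name for scope in dbutils.secrets.listScopes()]
    if scope_name in existing:
        print(f"Scope {scope_name} already exists - skipping creation")
    else:
        print(f"Scope {scope_name} not found - safe to create it")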
You can't run CLI commands from your Databricks cluster (through a notebook). The CLI needs to be installed and configured on your own workstation, and then you can run these commands on your workstation after configuring the connection to a Databricks workspace using the generated token.
Still, you can run Databricks CLI commands in a notebook by doing the same kind of databricks-cli setup at the cluster level and running them as bash commands. Install the CLI with pip install databricks-cli.

Azure ML Workbench File from Blob

When trying to reference/load a .dsource or .dprep file generated from a blob storage data source, I receive the error "No files for given path(s)".
Tested with .py and .ipynb files. Here's the code:
# Use the Azure Machine Learning data source package
from azureml.dataprep import datasource
df = datasource.load_datasource('POS.dsource') #Error generated here
# Remove this line and add code that uses the DataFrame
df.head(10)
Please let me know what other information would be helpful. Thanks!
Encountered the same issue and it took some research to figure out!
Currently, data source files from blob storage are only supported for two cluster types: Azure HDInsight PySpark and Docker (Linux VM) PySpark
In order to get this to work, it's necessary to follow the instructions in Configuring Azure Machine Learning Experimentation Service.
I also ran az ml experiment prepare -c <compute_name> to install all dependencies on the cluster before submitting the first command, since that deployment takes quite a bit of time (at least 10 minutes for my D12 v2 cluster).
I got the .py files to run with the HDInsight PySpark compute cluster (for data stored in Azure blobs), but .ipynb files are still not working on my local Jupyter server - the cells never finish.
I'm from the Azure Machine Learning team - sorry you are having issues with the Jupyter notebook. Have you tried running the notebook from the CLI? If you run it from the CLI you should see the stderr/stdout. The IFrame in WB swallows the actual error messages. This might help you troubleshoot.

When trying to register a UDF using Python, I get an error about building Spark with Hive

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o54))
This happens whenever I create a UDF on a second notebook in Jupyter on IBM Bluemix Spark as a Service.
If you are using IBM Bluemix Spark as a Service, execute the following command in a cell of the Python notebook:
!rm -rf /gpfs/global_fs01/sym_shared/YPProdSpark/user/spark_tenant_id/notebook/notebooks/metastore_db/*.lck
Replace spark_tenant_id with the actual one. You can find the tenant id using the following command in a cell of the notebook:
!whoami
I've run into these errors as well. Only the first notebook you launch will have access to the Hive context. From here:
By default Hive(Context) is using embedded Derby as a metastore. It is intended mostly for testing and supports only one active user.
