Switching between Databricks Connect and local Spark environment - apache-spark

I am looking to use Databricks Connect to develop a PySpark pipeline. DBConnect is really awesome because I am able to run my code on the cluster where the actual data resides, so it's perfect for integration testing. But during development and unit testing (pytest with pytest-spark), I also want to be able to simply use a local Spark environment.
Is there any way to configure DBConnect so that for one use case I simply use a local Spark environment, while for another it uses DBConnect?

My 2 cents, since I've been doing this type of development for some months now:
Work with two Python environments: one with databricks-connect (and thus no pyspark installed), and another with only pyspark installed. When you want to execute the tests, just activate the "local" virtual environment and run pytest as usual. Make sure, as some commenters pointed out, that you initialize the PySpark session using SparkConf().setMaster("local").
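For the unit-test side, this is roughly what the local setup can look like as a pytest fixture; a minimal sketch, assuming you manage the session yourself rather than relying on pytest-spark's built-in spark_session fixture (the fixture and app names here are illustrative):
# conftest.py
import pytest
from pyspark import SparkConf
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Plain local Spark, no databricks-connect involved ("local[*]" or just "local")
    conf = SparkConf().setMaster("local[*]").setAppName("unit-tests")
    session = SparkSession.builder.config(conf=conf).getOrCreate()
    yield session
    session.stop()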
PyCharm helps immensely with switching between environments during development. I am always on the "local" venv by default, but whenever I want to execute something using databricks-connect, I just create a new Run configuration from the menu. Easy peasy.
Also, be aware of some of databricks-connect's limitations:
It is not officially supported anymore, and Databricks recommends moving towards dbx whenever possible.
UDFs just won't work in databricks-connect.
MLflow integration is not reliable. In my use case, I am able to download and use models, but unable to log a new experiment or track models using the Databricks tracking URI. This might depend on your Databricks Runtime, MLflow and local Python versions.

Related

Can I use Spyder IDE with Azure Machine Learning?

I am learning to use Azure Machine Learning. It has its own Notebooks (which are OK!) and it also allows me to use Jupyter Notebook and VSCode.
However, I am wondering if there is a way to efficiently use Spyder with Azure Machine Learning.
E.g., I was able to install RStudio as a custom application using a Docker image, following the steps provided in this Stack Overflow link.
Spyder supports connecting to a remote Python kernel; it does, however, require SSH.
You can enable SSH on your Compute Instance, but only when you first set it up. Also, many companies have policies against enabling SSH, so this might not work for you. If it doesn't, I can highly recommend VSCode.

Non-interactive configuration of databricks-connect

I am setting up a development environment as a Docker container image. This will allow me and my colleagues to get up and running quickly using it as an interpreter environment. Our intended workflow is to develop code locally and execute it on an Azure Databricks cluster that's connected to various data sources. For this I'm looking into using databricks-connect.
I am running into the issue that configuring databricks-connect is apparently an interactive-only procedure. This means having to run databricks-connect configure and supply various configuration values each time the Docker container is run, which is likely to become a nuisance.
Is there a way to configure databricks-connect non-interactively? That would allow me to include the configuration procedure in the development environment's Dockerfile, so that a developer only has to supply configuration values when (re)building their local development environment.
Yes, it's possible; there are a few ways to do it:
Use shell multi-line input, like this (taken from here); you just need to define the correct environment variables:
echo "y
$databricks_host
$databricks_token
$cluster_id
$org_id
15001" | databricks-connect configure
Generate the config file directly; it's just JSON that you need to fill with the necessary parameters. Generate it once, look into ~/.databricks-connect, and reuse it.
But really, you may not need any configuration at all: Databricks Connect can take the information either from environment variables (like DATABRICKS_ADDRESS) or from the Spark configuration (like spark.databricks.service.address); just refer to the official documentation.
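As an illustration of that configuration-free route, the Spark-configuration variant could look roughly like this in code; a hedged sketch based on the legacy databricks-connect property names (spark.databricks.service.*), with the environment variable names assumed to be whatever your container injects:
import os
from pyspark.sql import SparkSession

# Connection details come from the container's environment; no
# ~/.databricks-connect file or interactive `databricks-connect configure` needed.
spark = (
    SparkSession.builder
    .config("spark.databricks.service.address", os.environ["DATABRICKS_ADDRESS"])
    .config("spark.databricks.service.token", os.environ["DATABRICKS_API_TOKEN"])
    .config("spark.databricks.service.clusterId", os.environ["DATABRICKS_CLUSTER_ID"])
    .getOrCreate()
)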
The above didn't work for me; this, however, did:
import json
import os
from pyspark.sql import SparkSession

with open(os.path.expanduser("~/.databricks-connect"), "w") as f:
    json.dump(db_connect_config, f)
spark = SparkSession.builder.getOrCreate()
Where db_connect_config is a dictionary with the credentials.
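For reference, the dictionary mirrors the fields that databricks-connect configure would normally write to ~/.databricks-connect; the values below are placeholders, and the exact key set may vary with your databricks-connect version:
db_connect_config = {
    "host": "https://<your-workspace>.cloud.databricks.com",  # placeholder workspace URL
    "token": "<personal-access-token>",                       # placeholder PAT
    "cluster_id": "<cluster-id>",
    "org_id": "<org-id>",   # mainly relevant on Azure Databricks
    "port": "15001",
}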

Use a local IDE such as Microsoft Visual Studio Code to use additional compute

I am trying to figure out the best way to use a local IDE such as Microsoft Visual Studio Code with distributed computing power. Currently, we are bringing data down locally, but that doesn't seem like a sustainable solution for reasons such as future data growth, cloud data security, etc. One workaround we thought of is to tunnel into EC2 instances, but I would like to hear the best way to solve this in a machine learning/data science environment (we are using Databricks and AWS services).
I'm not sure why you are connecting the IDE to the compute directly. I have used VS Code for running scripts against an HDInsight cluster. Before I fire my scripts, I configure the cluster against which they are going to run. The same is true for Databricks.

How to copy local MLflow run to remote tracking server?

I am currently tracking my MLflow runs to a local file path URI. I would also like to set up a remote tracking server to share with my collaborators. One thing I would like to avoid is to log everything to the server, as it might soon be flooded with failed runs.
Ideally, I'd like to keep my local tracker, and then be able to send only the promising runs to the server.
What is the recommended way of copying a run from a local tracker to a remote server?
To publish your trained model to a remote MLflow server you should use the register_model API. For example, if you are using the spaCy flavor of MLflow, you can do it as below, where nlp is the trained model:
mlflow.spacy.log_model(spacy_model=nlp, artifact_path='mlflow_sample')
model_uri = "runs:/{run_id}/{artifact_path}".format(
    run_id=mlflow.active_run().info.run_id, artifact_path='mlflow_sample'
)
mlflow.register_model(model_uri=model_uri, name='mlflow_sample')
Make sure that the following environment variables are set. In the example below, S3 storage is used:
SET MLFLOW_TRACKING_URI=https://YOUR-REMOTE-MLFLOW-HOST
SET MLFLOW_S3_BUCKET=s3://YOUR-BUCKET-NAME
SET AWS_ACCESS_KEY_ID=YOUR-ACCESS-KEY
SET AWS_SECRET_ACCESS_KEY=YOUR-SECRET-KEY
I have been interested in a related capability of copying runs from one experiment to another, for a similar reason: one area for arbitrary runs, and another into which the results of promising runs are moved. Your scenario with a separate tracking server is just the generalization of mine. Either way, there is apparently no built-in MLflow feature for this currently. However, the mlflow-export-import Python-based tool looks like it may cover both our use cases; it cites usage on both Databricks and the open-source version of MLflow, and it appears current as of this writing. I have not tried the tool myself yet. If/when I do, I'm happy to post a follow-up here saying whether it worked well for this purpose, and anyone else could do the same. Thanks and cheers!
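In the meantime, if all you need to copy are parameters and metrics (not artifacts), a run can be re-logged to the remote server by hand with the MlflowClient API; a minimal sketch, assuming the local store lives under ./mlruns and that the remote URI and experiment name are placeholders:
from mlflow.tracking import MlflowClient

local = MlflowClient(tracking_uri="./mlruns")                     # local file-based store
remote = MlflowClient(tracking_uri="https://my-tracking-server")  # placeholder remote URI

def copy_run(run_id: str, experiment_name: str) -> str:
    """Re-log params and (latest) metrics of a local run to the remote server."""
    src = local.get_run(run_id)
    exp = remote.get_experiment_by_name(experiment_name)
    exp_id = exp.experiment_id if exp else remote.create_experiment(experiment_name)
    dst = remote.create_run(exp_id)
    for k, v in src.data.params.items():
        remote.log_param(dst.info.run_id, k, v)
    for k, v in src.data.metrics.items():
        remote.log_metric(dst.info.run_id, k, v)
    remote.set_terminated(dst.info.run_id)
    return dst.info.run_id
Artifacts would have to be copied separately (e.g., via download_artifacts and log_artifacts), which is where mlflow-export-import becomes the more complete option.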

How do I pass in the google cloud project to the SHC BigTable connector at runtime?

I'm trying to access BigTable from Spark (Dataproc). I tried several different methods and SHC seems to be the cleanest for what I am trying to do and performs well.
https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
However, this approach requires that I put the Google Cloud project ID in hbase-site.xml, which means I need to build separate versions of the fat JAR containing my Spark code for each environment I run on (prod, staging, etc.), which is something I'd like to avoid.
Is there a way for me to pass in the Google Cloud project ID at runtime?
As far as I can tell, the SHC library does not let you pass through HBase configs (looking in here).
The easiest thing would be to run an init action that gets the VM's project ID from VM metadata and sets it in hbase-site.xml; a rough sketch of the idea follows after this answer. We are working on an initialization action that does exactly that and installs the HBase client for Bigtable. Check out the in-progress pull request, which would be a good starting point if you needed to write one immediately. Otherwise, I expect the PR to get merged in the next couple of weeks.
Alternatively, consider adding an option in SHC for passing through properties to the HBaseConfiguration it creates. That would be a valuable feature for the broader community.
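For illustration, the metadata-based idea from the init-action suggestion could look roughly like this in Python; a hedged sketch that only handles the project-id property (a real init action would also need the rest of the Bigtable/HBase client settings, and the config path is assumed to be Dataproc's default):
import urllib.request
import xml.etree.ElementTree as ET

# Ask the GCE metadata server which project this VM belongs to.
req = urllib.request.Request(
    "http://metadata.google.internal/computeMetadata/v1/project/project-id",
    headers={"Metadata-Flavor": "Google"},
)
project_id = urllib.request.urlopen(req).read().decode()

# Append the Bigtable project property to the cluster's hbase-site.xml.
path = "/etc/hbase/conf/hbase-site.xml"  # assumed default location on Dataproc
tree = ET.parse(path)
prop = ET.SubElement(tree.getroot(), "property")
ET.SubElement(prop, "name").text = "google.bigtable.project.id"
ET.SubElement(prop, "value").text = project_id
tree.write(path)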
