I am setting up a development environment as a Docker container image. This will allow me and my colleagues to get up and running quickly using it as an interpreter environment. Our intended workflow is to develop code locally and execute it on an Azure Databricks cluster that's connected to various data sources. For this I'm looking into using databricks-connect.
I am running into the configuration of databricks-connect apparently solely being an interactive procedure. This results in having to run databricks-connect configure and supplying various configuration values each time the Docker container image is run, which is likely to become a nuisance.
Is there a way to configure databricks-connect in a non-interactive way? This would allow me to include the configuration procedure in the development environments Dockerfile and a developer being only required to supply configuration values when (re)building their local development environment.
Yes - it’s possible, there are different ways for that:
use shell multi line input, like this (taken from here) - just need to define correct environment variables:
echo "y
$databricks_host
$databricks_token
$cluster_id
$org_id
15001" | databricks-connect configure
generate config file directly - it’s just JSON that you need to fill with necessary parameters. Generate it once, look into ~/.databricks-connect and reuse.
But really you may not need configuration at all - Databricks connect can take information either from environment variables (like DATABRICKS_ADDRESS) or Spark configuration (like spark.databricks.service.address) - just refer to official documentation.
Above didn't work for me, this however did:
with open(os.path.expanduser("~/.databricks-connect"), "w") as f:
json.dump(db_connect_config, f)
spark = SparkSession.builder.getOrCreate()
Where db_connect_config is a dictionary with the credentials.
Related
I am looking to use Databricks Connect for developing a pyspark pipeline. DBConnect is really awesome because I am able to run my code on the cluster where the actual data resides, so it's perfect for integration testing, but I also want to be able to, during development and unit testing (pytest with pytest-spark), simply using a local Spark environment.
Is there any way to configure DBConnect so for one use-case I simply use a local Spark environment, but for another it uses DBConnect?
My 2 cents, since I've been done this type of development for some months now:
Work with two Python environments: one with databricks-connect (and thus, no pyspark installed), and another one with only pyspark installed. When you want to execute the tests, just activate the "local" virtual environment and run pytest as usual. Make sure, as some commenters pointed out, that you are initializing the pyspark session using SparkConf().setMaster("local").
Pycharm helps immensely to switch between environments during development. I am always on the "local" venv by default, but whenever I want to execute something using databricks-connect, I just create a new Run configuration from the menu. Easy peasy.
Also, be aware of some of databricks-connect's limitations:
It is not officially supported anymore, and Databricks recommend moving towards dbx whenever possible.
UDFs just won't work in databricks-connect.
Mlflow integration is not reliable. In my use case, I am able to download and use models, but unable to log a new experiment or track models using databricks tracking uri. This might depend on your Databricks Runtime, mlflow and local Python version.
I am looking to bind a PCF (Pivotal Cloud Foundry) Service to allow us to set certain api endpoints used by our UI within PCF environment. I want to use the values in this service to overwrite the values in the root directory file, 'config.json'. Are there any examples out there that accomplish this sort of thing?
The primary way to tackle this is to have your application do this parsing. Most (all?) programming languages give you the ability to load environment variables and to parse JSON. Using these capabilities, what you'd want to do is to read the VCAP_SERVICES environment variable and parse the JSON. This is where the platform will insert the information from your bound services. From there you, you have the configuration information so you can configure your app using the values from your bound service.
Manual Ex:
var vcap_services = JSON.parse(process.env.VCAP_SERVICES)
or you can use a library. There's a handy Node.js library called cfenv. You can read more about both of these options in the docs.
https://docs.cloudfoundry.org/buildpacks/node/node-service-bindings.html
If you cannot read the configuration inside of your application, perhaps there's a timing problem and you need the information before your app starts, you can use the platform's pre-runtime hooks.
https://docs.cloudfoundry.org/devguide/deploy-apps/deploy-app.html#profile
The runtime hooks allow your application to include a file called .profile which will execute before your application. The .profile file is a simple bash script which can do anything needed to ready your application to be run. The only catch is that this needs to happen very quickly because it must complete before your application is able to start up and your application has a finite amount of time to start (usually 60s).
In your case, you could use jq to parse you values and insert them info your config file, perhaps using sed to overwrite a template value. Another option would be to run a small Node.js script, since your app is using Node.js it should be available on the path when this script runs, to read the environment variables and generate your config file.
Hope that helps!
I am using Spark 2.3.1 on Mac, in Java.
I have confidential security info stored in an environment variable. However, as it's confidential, I don't want to expose its value through ps -e nor from http://localhost:4040/environment/.
Is there a way within Spark for me to hide the value please? Or anyway by code seal the value, while not affecting other Spark/Java functions.
You shouldn't use environment variables to store confidential information there are tools like Hashicorp Vault
regarding ps -e - if someone has access to the machine they can echo the environment variable anyway.
As for the spark UI - you can secure the access to it see here
We are using chef to deploy all of our stacks.
I need to build a runbook for each environment we deploy.
I have been parsing the environment, node and recipe files but the more information I need to extract, the more complex it becomes because I am converging the attributes in my application.
I would like to use the converged-attributes.json file produced by our chef deployment without deploying any code because we can't deploy production to the build runbooks.
We also plan to build the runbook before the environment exists to provide configuration information to the DevOps team (e.g. memory requirements, ports, etc.).
Is there a way to use any of the chef/knife components or libraries to do the following?
Converge the attributes for each node
Write the converged attributes to a location my application can access on Mac OSX.
Quit before attempting to access any servers
This is not possible in the generic case. Chef is executable code at heart and the only way to fully compute the side effects is to actually execute it. This is what chef-client does, you can't "converge" the node externally so step 3 doesn't really make any sense. You could try to use Why Run mode but we really don't recommend it and are probably going to remove the feature as it does more harm than good most of the time. Roles and environments are static data so you can parse and manipulate those, but cookbooks are code and have to be run in-place to know exactly what they will do.
When launching EC2 using Terraform (or cloud formation), we can configure EC2 by putting some scripts in user_data/remote-exec. Alternatively, we can configure EC2 using Ansible/Chef, etc. What are the difference of configuring EC2 in user_data/remote-exec and do that with Ansible/Chef? when to use the former, when to use the latter (I know Ansible/Chef is idempotent)?
In my case, the EC2 is originally manually launched, then manually configured using a lot of linux commands. and the commands are not configured by me. Now I am the person to automate the whole structure using terraform, and configure EC2s. Using user_data/remote-exec to configure EC2 is straightforward. I just need to put all the existing linux commands they have in some scripts with a little change. And if the configuration result using my script is not successful, at least I can quickly figure out whether I miss some commands by comparing my script and the original linux commands. But if I use ansible/chef, I have to rewrite all the steps using different language. And if the configuration is not what expected, it is hard for me to figure out which steps are not correct, because the syntax of ansible/chef and linux commands are totally different.
My question is, in my case, should I use ansible/chef or user_data/remote-exec for configuration?
User Data is good for initial configuration of the system. If you need longer term maintenance a configuration management software like Ansible/Chef/Salt/Puppet is a great option.
Packer can be used for immutable infrastructure, i.e. doesn't change after creation. You can run all the scripts and installs on the system for it to be ready to just boot, this is also faster because you don't have to wait for user data to run.
A few questions you have to ask as well, how often are you going to patch these? Are you going to just update existing or replace with new. Ansible is great for configuration since it's just yaml files an
Blue/Green deployments generally replace servers with all new ones and gradually move traffic over to the new servers.
Some more things to consider with your Infrastructure as code