I have been working in Databricks notebooks using Python/R. Once a job is done we need to terminate the cluster to save on cost (since we are billed for the machine while it runs).
So we also have to start the cluster whenever we want to work on any notebook. I have seen that it takes a lot of time and installs the packages again in the cluster. Is there any way to avoid the installation every time we start the cluster?
Update: Databricks now allows custom docker containers.
Unfortunately not.
When you terminate a cluster its memory state is lost, so when you start it again it comes up with a clean image. Even if you add the desired packages to an init script, they will have to be installed on each initialization.
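For illustration, a cluster-scoped init script is just a shell script that runs on every node at startup, so it automates the installs but does not avoid them. A minimal sketch (the package names and versions are placeholders, and the pip path assumes a typical Databricks runtime layout):
#!/bin/bash
# Hypothetical init script: installs extra Python packages into the cluster's
# Python environment every time the cluster starts.
/databricks/python/bin/pip install --quiet pandas==1.5.3 requests==2.31.0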
You may ask Databricks support to check if it is possible to create a custom cluster image for you.
I am using a conda env to install the packages. After my first installation, I save the environment as a YAML file in DBFS and use the same YAML file in all the other runs. This way I don't have to install the packages again.
Save the environment as a conda YAML specification.
%conda env export -f /dbfs/filename.yml
Import the file to another notebook using conda env update.
%conda env update -f /dbfs/filename.yml
List the installed packages:
%conda list
I used the apache/airflow:2.3.3 Docker image to set up Airflow. I have pyenv installed on another drive. I bind-mounted the pyenv volume when creating the Airflow containers. I also updated the PATH environment variable in the containers to point to the Python versions available in the mounted pyenv volume. But Airflow still uses the Python (3.7.7) that comes with the apache/airflow:2.3.3 image. Is it possible to make Airflow use a different Python environment? I'm looking for whether there are additional environment variables that need to be updated so that Airflow picks up another Python environment.
Have you taken a look at
PythonVirtualenvOperator
ExternalPythonOperator <-- this might require an upgrade to 2.4.0
Both of these offer the flexibility to call Python callables inside separate Python environments.
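For illustration, a minimal DAG using ExternalPythonOperator might look like this (a sketch assuming Airflow 2.4+; the interpreter path /opt/pyenv/versions/3.10.6/bin/python is a placeholder for whichever interpreter lives in your mounted pyenv volume):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import ExternalPythonOperator


def print_python_version():
    # Runs inside the external interpreter, not the image's Python 3.7.7.
    import sys
    print(sys.version)


with DAG(
    dag_id="external_python_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    ExternalPythonOperator(
        task_id="run_in_pyenv",
        python="/opt/pyenv/versions/3.10.6/bin/python",  # placeholder path
        python_callable=print_python_version,
    )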
This may be a noob question; I am new to using conda environments. I am looking for some advice on the best way to tackle the following workflow.
I have both a work desktop and a desktop at home. I want to be able to, at the end of the day, take my work environment home.
Note: I work in Ubuntu on the Windows Subsystem for Linux.
Say I start a project from scratch. I currently use the following workflow:
I create a conda environment.
conda create --name my_new_project python=3.10
I activate the environment:
conda activate my_new_project
I install python packages I need:
conda install -c conda-forge opencv
etc...
At the end of the day I want to copy that environment and take it to another PC, so I do the following:
conda env export -f my_new_project.yml
Finally on my home PC I do
conda env create --file my_new_project.yml
This works but requires me to make a new environment every time I switch PC. Is there a way to load the differences between the two conda environments and only add the new packages? Or is there another better way to tackle this?
There's no need to create a new environment every time. You only do that once and then update the existing environment, i.e. use the following as step 5:
conda env update -f dependencies.yml
I also suggest putting your code, including the dependencies file, into version control if you aren't doing that already. Then getting up to speed with your project on a different computer will only require two steps, as shown below.
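For example, on the other PC the daily sync then boils down to something like this (assuming the repository already exists there and the YAML file name matches the export above):
git pull                                # fetch the latest code and my_new_project.yml
conda env update -f my_new_project.yml  # install only the packages that changed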
I want to run my Spark processes directly on my cluster using IntelliJ IDEA, so I'm following this documentation: https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html
After configuring everything, I run databricks-connect test, but I don't get the Scala REPL as the documentation says I should.
This is my cluster configuration:
I solved the problem. The problem was the versions of the tools:
Install Java
Download and install Java SE Runtime Version 8.
Download and install Java SE Development Kit 8.
Install Conda
You can either download and install full blown Anaconda or use miniconda.
Download WinUtils
This pesky bugger is part of Hadoop and is required by Spark to work on Windows. Quick install: open PowerShell (as an admin) and run the following (if you are on a corporate network with funky security you may need to download the exe manually):
New-Item -Path "C:\Hadoop\Bin" -ItemType Directory -Force
Invoke-WebRequest -Uri https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe -OutFile "C:\Hadoop\Bin\winutils.exe"
[Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\Hadoop", "Machine")
Create Virtual Environment
We now create a new virtual environment. I recommend creating one environment per project you are working on. This allows us to install different versions of Databricks-Connect per project and upgrade them separately.
From the Start menu find the Anaconda Prompt. When it opens it will have a default prompt of something like:
(base) C:\Users\User
The base part means you are not in a virtual environment but rather in the base install. To create a new environment execute this:
conda create --name dbconnect python=3.5
Where dbconnect is the name of your environment and can be whatever you want. Databricks currently runs Python 3.5 - your Python version must match. Again, this is another good reason for having an environment per project, as this may change in the future.
Now activate the environment:
conda activate dbconnect
Install Databricks-Connect
You are now good to go:
pip install -U databricks-connect==5.3.*
databricks-connect configure
Create a Databricks cluster (in this case I used Amazon Web Services) and add the following to its Spark config:
spark.databricks.service.server.enabled true
spark.databricks.service.port 15001 (Amazon 15001, Azure 8787)
Turn Windows Defender Firewall Off or allow access.
Your problem looks like it is one of the following:
a) You specified the wrong port (it has to be 8787 on Azure)
b) You didn't open up the port in your Databricks cluster
c) You didn't install WinUtils properly (e.g. you forgot to set the environment variable)
If you happen to understand German, this YouTube video might help you (it shows the full installation process for Windows 10):
https://www.youtube.com/watch?v=VTfRh_RFzOs&t=3s
Try running one of the Databricks examples, like:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .getOrCreate()

print("Testing simple count")
# The Spark code will execute on the Databricks cluster.
print(spark.range(100).count())
This worked for me.
Maybe they will fix databricks-connect test eventually.
Your Python version should be 3.5 - as per the link you posted.
Are you behind a proxy or a network that may have a layer 7 firewall?
Everything else you have done looks correct. So I would try on another network.
Have you set:
spark.databricks.service.server.enabled true
spark.databricks.service.port 8787
IMPORTANT: I would rotate your API key - you have published your org ID and key in the post, meaning anyone can access your workspace now.
I have installed pyspark in a miniconda environment on Ubuntu through conda install pyspark. So far everything works fine: I can run jobs through spark-submit and I can inspect running jobs at localhost:4040. But I can't locate start-history-server.sh, which I need to look at jobs that have completed.
It is supposed to be in {spark}/sbin, where {spark} is the installation directory of spark. I'm not sure where that is supposed to be when spark is installed through conda, but I have searched through the entire miniconda directory and I can't seem to locate start-history-server.sh. For what it's worth, this is for both python 3.7 and 2.7 environments.
My question is: is start-history-server.sh included in a conda installation of pyspark?
If yes, where? If no, what's the recommended alternative way of evaluating spark jobs after the fact?
EDIT: I've filed a pull request to add the history server scripts to pyspark. The pull request has been merged, so this should tentatively show up in Spark 3.0.
As @pedvaljim points out in a comment, this is not conda-specific; the sbin directory isn't included in pyspark at all.
The good news is that it's possible to just manually download this folder from GitHub (I'm not sure how to download just one directory, so I just cloned all of Spark) into your pyspark folder. If you're using mini- or anaconda, the pyspark folder is e.g. miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark.
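A rough command-line sketch of that approach (the clone location, the environment name myenv, and the Python version are assumptions; adjust the paths to your install):
git clone --depth 1 https://github.com/apache/spark.git /tmp/spark
cp -r /tmp/spark/sbin ~/miniconda3/envs/myenv/lib/python3.7/site-packages/pyspark/
~/miniconda3/envs/myenv/lib/python3.7/site-packages/pyspark/sbin/start-history-server.sh
Note that the history server only shows jobs that were run with event logging enabled (spark.eventLog.enabled set to true and an event log directory the history server can read).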
Here are the steps I have taken so far:
I installed Anaconda3, and everything included is in the directory $HOME/anaconda3/bin.
I cd'ed into $HOME/anaconda3/bin and ran the command ./conda install -c conda-forge pyspark. It was successful.
I didn't do anything else. More specifically, there are no variables set in my .bashrc
Here are some important details:
I am on a distributed cluster running Hadoop, so there might be other directories outside of my home folder that I have yet to discover but might need. I also don't have admin access.
Jupyter notebook runs just fine.
Here is my goal: to do something along the lines of adding variables or configuring some files so that I can run pyspark in a Jupyter notebook.
What are other steps I need to do after Step 3 in order to achieve this goal?
Since you have installed pyspark with conda, and as you say Jupyter notebook runs fine (presumably for the same Anaconda distribution), there are no further steps required - you should be able to open a new notebook and import pyspark.
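For example, a quick sanity check in a fresh notebook cell could look like this (a minimal local-mode sketch; the app name is arbitrary):
from pyspark.sql import SparkSession

# Start a local Spark session using the conda-installed pyspark package.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("conda-pyspark-check") \
    .getOrCreate()

print(spark.range(10).count())  # should print 10
spark.stop()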
Notice though that installing pyspark that way (i.e. with pip or conda) gives only limited functionality; from the package docs:
The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Installing pyspark with pip or conda is a relatively recent add-on, aimed at the cases described in the docs above. I don't know what limitations you may face (have never tried it) but if you need the full functionality, you should download the full Spark distribution (of which pyspark is an integral part).