Why "databricks-connect test" does not work after configurate Databricks Connect? - apache-spark

I want to run my Spark processes directly on my cluster from IntelliJ IDEA, so I'm following this documentation: https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html
After configuring everything, I run databricks-connect test, but I don't get the Scala REPL that the documentation says should appear.
This is my cluster configuration:

I solved the problem. The issue was the versions of all the tools:
Install Java
Download and install Java SE Runtime Version 8.
Download and install Java SE Development Kit 8.
Install Conda
You can either download and install the full-blown Anaconda or use Miniconda.
Download WinUtils
This pesky bugger is part of Hadoop and is required by Spark to work on Windows. It's a quick install: open PowerShell (as an admin) and run the following (if you are on a corporate network with funky security you may need to download the exe manually):
New-Item -Path "C:\Hadoop\Bin" -ItemType Directory -Force
Invoke-WebRequest -Uri https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe -OutFile "C:\Hadoop\Bin\winutils.exe"
[Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\Hadoop", "Machine")
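As a quick sanity check, here is a minimal sketch (assuming the paths used above; open a new prompt first so the machine-level variable is picked up) that confirms from Python that HADOOP_HOME is set and winutils.exe is in place:
import os

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)
if hadoop_home:
    # Spark looks for winutils.exe under %HADOOP_HOME%\bin on Windows.
    print("winutils.exe found:", os.path.isfile(os.path.join(hadoop_home, "Bin", "winutils.exe")))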
Create Virtual Environment
We now create a new virtual environment. I recommend creating one environment per project you are working on. This allows us to install different versions of Databricks-Connect per project and upgrade them separately.
From the Start menu find the Anaconda Prompt. When it opens it will have a default prompt of something like:
(base) C:\Users\User
The base part means you are not in a virtual environment, but rather in the base install. To create a new environment, execute this:
conda create --name dbconnect python=3.5
Where dbconnect is the name of your environment and can be whatever you want. Databricks currently runs Python 3.5, and your Python version must match. Again, this is another good reason for having an environment per project, as this may change in the future.
Now activate the environment:
conda activate dbconnect
Install Databricks-Connect
You are now good to go:
pip install -U databricks-connect==5.3.*
databricks-connect configure
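After the configure step, recent versions of databricks-connect store your answers in a JSON file at ~/.databricks-connect (treat the exact location as an assumption). A minimal sketch to double-check the host, cluster ID, org ID and port before running the test, with the token redacted:
import json
from pathlib import Path

# Read the configuration written by "databricks-connect configure" and hide the token.
cfg = json.loads((Path.home() / ".databricks-connect").read_text())
cfg["token"] = "<redacted>"
print(json.dumps(cfg, indent=2))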
Create Databricks cluster (in this case I used Amazon Web Services)
spark.databricks.service.server.enabled true
spark.databricks.service.port 15001 (Amazon 15001, Azure 8787)
Turn Windows Defender Firewall Off or allow access.

Your problem looks like it is one of the following:
a) You specified the wrong port (it has to be 8787 on Azure).
b) You didn't open up the port in your Databricks cluster.
c) You didn't install WinUtils properly (e.g. you forgot to set the environment variable).
If you happen to understand German, this YouTube video might help you (it shows the full installation process for Windows 10):
https://www.youtube.com/watch?v=VTfRh_RFzOs&t=3s

Try to run the Databricks examples, like:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .getOrCreate()
print("Testing simple count")
# The Spark code will execute on the Databricks cluster.
print(spark.range(100).count())
This worked for me.
Maybe they will fix databricks-connect test.

Your Python version should be 3.5 - as per the link you posted.
Are you behind a proxy or a network that may have a layer 7 firewall?
Everything else you have done looks correct, so I would try it on another network.
Have you set:
spark.databricks.service.server.enabled true
spark.databricks.service.port 8787
IMPORTANT: I would rotate your API key - you have published your org ID and key in the post, meaning anyone can access your workspace now.

Related

Can't make action calls through Anaconda py35 env in Spark HDInsight

As per the documentation - https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-python-package-installation -
we installed several external Python modules through a new Anaconda environment 'py35_data_prof'. However, as soon as we invoke any RDD action call like rdd.count() or rdd.avg() in our Python code, Spark 2 throws:
Cannot run program "/usr/bin/anaconda/envs/py35_data_prof/bin/python": error=2, No such file or directory
FYI, the Python path shown in the error - '/usr/bin/anaconda/envs/py35_data_prof/bin/python' - is actually a symlink rather than a Python directory.
I have been looking up the HDInsight docs but can't seem to find the fix. Please let us know if there is a way around it.
The error message "Cannot run program "/usr/bin/anaconda/envs/py35_data_prof/bin/python": error=2, No such file or directory" says that Spark is unable to find the Python executable of that environment. Make sure the environment is set up with all of the requirements mentioned below:
• Create a Python virtual environment using conda.
• Install external Python packages in the created virtual environment if needed.
• Change the Spark and Livy configs to point to the created virtual environment.
I would ask you to follow each and every step mentioned here: "Safely install external Python packages".
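As an illustration of the third step, here is a minimal sketch of pointing a PySpark session at the environment's interpreter. The config keys are standard Spark-on-YARN settings, and the interpreter path is simply the environment from the question, so treat the exact values as assumptions for your setup:
from pyspark.sql import SparkSession

env_python = "/usr/bin/anaconda/envs/py35_data_prof/bin/python"  # assumed env path

spark = (SparkSession.builder
         .appName("conda-env-check")
         # Make both the YARN application master and the executors use the env's Python.
         .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", env_python)
         .config("spark.executorEnv.PYSPARK_PYTHON", env_python)
         .getOrCreate())

# An action call like this is what failed before; it should now run with the env's Python.
print(spark.sparkContext.range(100).count())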
Hope this helps.

Databricks cluster installs all the packages every time I start it

I have been working in Databricks notebooks using Python/R. Once a job is done we need to terminate the cluster to save the cost involved (as we are charged while the machine is running).
So we also have to start the cluster whenever we want to work on a notebook. I have seen that it takes a lot of time and installs the packages again in the cluster. Is there any way to avoid the installation every time we start the cluster?
Update: Databricks now allows custom docker containers.
Unfortunately not.
When you terminate a cluster its memory state is lost, so when you start it again it comes up with a clean image. Even if you add the desired packages to an init script, they will have to be installed at each initialization.
You may ask Databricks support to check if it is possible to create a custom cluster image for you.
I am using a conda env to install the packages. After my first installation, I save the environment as a YAML file in DBFS and use the same YAML file in all other runs. This way I don't have to install the packages again.
Save the environment as a conda YAML specification.
%conda env export -f /dbfs/filename.yml
Import the file to another notebook using conda env update.
%conda env update -f /dbfs/filename.yml
List the packages -
%conda list

No start-history-server.sh when pyspark installed through conda

I have installed pyspark in a miniconda environment on Ubuntu through conda install pyspark. So far everything works fine: I can run jobs through spark-submit and I can inspect running jobs at localhost:4040. But I can't locate start-history-server.sh, which I need to look at jobs that have completed.
It is supposed to be in {spark}/sbin, where {spark} is the installation directory of Spark. I'm not sure where that is supposed to be when Spark is installed through conda, but I have searched through the entire miniconda directory and can't seem to locate start-history-server.sh. For what it's worth, this is the case for both Python 3.7 and 2.7 environments.
My question is: is start-history-server.sh included in a conda installation of pyspark?
If yes, where? If no, what's the recommended alternative way of evaluating spark jobs after the fact?
EDIT: I've filed a pull request to add the history server scripts to pyspark. The pull request has been merged, so this should tentatively show up in Spark 3.0.
As @pedvaljim points out in a comment, this is not conda-specific: the sbin directory isn't included in pyspark at all.
The good news is that you can manually copy this folder from the Spark repository on GitHub into your spark folder (I'm not sure how to download just one directory, so I just cloned all of Spark). If you're using Mini- or Anaconda, the spark folder is e.g. miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark.
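Note that the history server only shows jobs that wrote event logs, so make sure event logging is enabled. A minimal sketch (the local log directory is just an assumption; point the history server's spark.history.fs.logDirectory at the same path):
import os
from pyspark.sql import SparkSession

os.makedirs("/tmp/spark-events", exist_ok=True)  # the event log directory must exist

spark = (SparkSession.builder
         .appName("history-server-demo")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "file:///tmp/spark-events")  # assumed log directory
         .getOrCreate())

spark.range(1000).selectExpr("sum(id) as total").show()
spark.stop()  # the completed application should now appear in the history server UI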

Properly configuring PySpark and Anaconda3 on Linux

Here are the steps I have taken so far:
1. I installed Anaconda3 and everything included in the directory $HOME/anaconda3/bin.
2. I cd'ed into $HOME/anaconda3/bin and ran the command ./conda install -c conda-forge pyspark. It was successful.
3. I didn't do anything else. More specifically, there are no variables set in my .bashrc.
Here are some important details:
I am on a distributed cluster running Hadoop, so there might be other directories outside of my home folder that I have yet to discover but I might need. I also don't have admin access.
Jupyter notebook runs just fine.
Here is my goal:
Goal: to do something along the lines of adding variables or configuring some files so that I can run pyspark in a Jupyter notebook.
What other steps do I need to take after Step 3 in order to achieve this goal?
Since you have installed pyspark with conda, and as you say Jupyter notebook runs fine (presumably for the same Anaconda distribution), there are no further steps required - you should be able to open a new notebook and import pyspark.
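For example, a minimal sketch of a cell to try in a new notebook (local mode only; no cluster-specific settings assumed):
from pyspark.sql import SparkSession

# Start a local Spark session using the conda-installed pyspark.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("jupyter-check")
         .getOrCreate())

print(spark.version)
print(spark.range(5).count())
spark.stop()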
Notice though that installing pyspark that way (i.e. with pip or conda) gives only limited functionality; from the package docs:
The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Installing pyspark with pip or conda is a relatively recent add-on, aimed at the cases described in the docs above. I don't know what limitations you may face (I have never tried it), but if you need the full functionality, you should download the full Spark distribution (of which pyspark is an integral part).

Upgrade Python modules at Bluemix, to get out of the error (No trigger by the name “interval” was found)

I am using the IPython notebook of the Apache Spark service on Bluemix. I need to reinstall setuptools, but I can't enter a password for sudo. How can I proceed to make it work? (The actual goal is to resolve the following:)
https://bitbucket.org/agronholm/apscheduler/issues/77/lookuperror-no-trigger-by-the-name
Thanks,
Boris
You do not have root permissions with this service, so you cannot install anything at the system level and you cannot run sudo. If you want to install apscheduler, run pip with the '--user' flag (for example, pip install --user apscheduler) so that it installs locally to your tenant.
Update: IBM has deployed new software levels last week. If you create a new Apache Spark service on Bluemix, your environment won't include the offending version of setuptools anymore.
original answer:
As Randy pointed out, you cannot reinstall setuptools. Until IBM upgrades that package, use the workaround mentioned in the issue you linked:
"In the meantime, you can instantiate the triggers manually"
https://bitbucket.org/agronholm/apscheduler/issues/77/lookuperror-no-trigger-by-the-name#comment-14180022
The author of apscheduler apparently added a check for the version of setuptools. You'll have to use an older version of apscheduler without that check.
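To illustrate the "instantiate the triggers manually" workaround, a minimal sketch (the scheduler type and interval are just examples) that bypasses the string-based 'interval' lookup:
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.interval import IntervalTrigger

def tick():
    print("tick")

scheduler = BlockingScheduler()
# Pass a trigger instance directly instead of the 'interval' string alias,
# which is what runs into the setuptools-related lookup error.
scheduler.add_job(tick, IntervalTrigger(seconds=10))
scheduler.start()  # blocks and prints "tick" every 10 seconds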
