Spark connects from VSCode but doesn't connect from Jupyter - apache-spark

I have installed databricks-connect and I was able to connect to the clusters and launch jobs from VSCode.
I saw that it's also possible to use databricks-connect from a Jupyter notebook, so from the same terminal I use for my code in VSCode I launched Jupyter Notebook in the same environment, but Spark didn't like the idea.
Here are some snapshots of the problem:
This one is from VSCode notebooks (it also works in a .py file):
This one is from Jupyter Notebook:
I have tried findspark, but in my case it's not the solution, since I am using Databricks Connect.
It looks more like Spark is not pointing to the same context.
And I repeat: I launched the notebook from the terminal of the same environment, so logically all the environment variables should be the same.
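A quick way to confirm that the Jupyter kernel really shares the environment is to compare interpreters and then ask Databricks Connect for a session directly. This is a minimal sketch, assuming a recent databricks-connect (13+) where DatabricksSession exists; the legacy package instead patches pyspark's SparkSession, so SparkSession.builder.getOrCreate() would be the equivalent there:

# Run this in both the VSCode notebook and the Jupyter notebook; if the paths
# differ, the Jupyter kernel is not using the environment where
# databricks-connect is installed, regardless of which terminal launched it.
import sys
print(sys.executable)

# Minimal connection check (databricks-connect >= 13 API; an assumption, since
# the version is not stated in the question).
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
print(spark.range(5).count())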

Related

pyspark to open directly in jupyter-lab

I installed the apps below on Windows 10:
Apache Spark 3.1.3
Hadoop 3.3.2
JupyterLab
When I execute pyspark or spark-shell from the command line, I get the output below, which means Apache Spark was installed/configured correctly.
When I execute pyspark from the command line, I want the JupyterLab interface to open automatically.
When I set the environment variables below, Jupyter Notebook opens automatically:
PYSPARK_DRIVER_PYTHON = C:\Users\xxxx\AppData\Local\Programs\Python\Python39\Scripts\jupyter.exe
PYSPARK_DRIVER_PYTHON_OPTS = notebook
I tried the settings below, but no luck:
PYSPARK_DRIVER_PYTHON = C:\Users\xxxx\AppData\Local\Programs\Python\Python39\Scripts\jupyter-lab.exe
PYSPARK_DRIVER_PYTHON_OPTS = lab
Which environment variables do I need to set in order to open JupyterLab directly? How do I specify the kernel among the Jupyter kernels?
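For reference, the variable Spark actually reads for the driver executable is PYSPARK_DRIVER_PYTHON, with PYSPARK_DRIVER_PYTHON_OPTS passed to it as arguments. A minimal sketch for a Windows command prompt, assuming jupyter.exe lives at the Scripts path shown above (not verified on this exact setup):

REM Point the PySpark driver at the jupyter launcher and pass "lab" as its
REM argument, so running "pyspark" effectively runs "jupyter lab" with Spark
REM preconfigured in the session.
set PYSPARK_DRIVER_PYTHON=C:\Users\xxxx\AppData\Local\Programs\Python\Python39\Scripts\jupyter.exe
set PYSPARK_DRIVER_PYTHON_OPTS=lab
pyspark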

E0401: Unable to import 'pyspark' in VSCode on Windows 10

I have installed the following on my Windows 10 machine to use Apache Spark:
Java,
Python 3.6, and
Spark (spark-2.3.1-bin-hadoop2.7)
I am trying to write PySpark-related code in VSCode. It shows a red underline under the 'from ' and displays the error message
E0401: Unable to import 'pyspark'
I have also used Ctrl+Shift+P and selected "Python: Update workspace Pyspark libraries". It shows the notification message
Make sure you have the SPARK_HOME environment variable set to the root path of the local spark installation!
What is wrong?
You will need to install the pyspark Python package using pip install pyspark. Actually, this is the only package you'll need for VSCode, unless you also want to run your Spark application on the same machine.
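As a quick sanity check after pip install pyspark, a local-mode session should start without SPARK_HOME being set; a minimal sketch (the master and app name are just illustrative values):

# Confirms the pip-installed pyspark is importable and can run a trivial job locally.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("vscode-import-check").getOrCreate()
print(spark.range(3).count())  # expected output: 3
spark.stop()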

How to set up pyspark with Zeppelin on Windows 10

I have had difficulties installing Zeppelin 0.7.2.
Using the version of Spark that comes bundled with Zeppelin 0.7.2, I can run Spark code, but I am unable to run %pyspark code, even after modifying the Python environment variables to point to where Python is installed (Python was installed using Anaconda).
%python code works fine.
If anyone can help resolve this issue I would be grateful. (The odd thing is that I have done the same installation on another Windows 10 laptop and pyspark does execute there.)
The error I get is: pyspark is not responding
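For what it's worth, the usual ways to point Zeppelin's %pyspark interpreter at a specific Python on Windows are the zeppelin.pyspark.python property in the Spark interpreter settings, or PYSPARK_PYTHON in conf\zeppelin-env.cmd. A sketch with an illustrative Anaconda path (an assumption, not taken from the question):

REM conf\zeppelin-env.cmd - adjust the path to your Anaconda installation
set PYSPARK_PYTHON=C:\Users\xxxx\Anaconda3\python.exe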

Jupyter notebook kernels do not launch in notebook's directory

I run:
Jupyter 4.2.0
notebook 4.2.3
Linux Mint 18
The notebook application starts correctly and in the correct directory. But when I open a notebook, the Python kernel is launched in ~/user and not in the notebook's directory (as it used to be). This problem seems to have started when I encrypted my home folder.
Could this be a permissions issue?
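A quick way to see where the kernel actually starts is to check its working directory from inside a notebook; this is only a diagnostic, not a fix:

# Run in a notebook cell: the working directory should normally be the folder
# containing the .ipynb file; printing the home directory instead confirms the
# behaviour described above.
import os
print(os.getcwd())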

Connecting SparkR to the spark cluster

I have a Spark cluster running on 10 machines (1 - 10) with the master on machine 1. All of them run CentOS 6.4.
I am trying to connect a JupyterHub installation (which runs inside an Ubuntu Docker container because of issues installing it on CentOS) to the cluster via SparkR and get the Spark context.
The code I am using is
Sys.setenv(SPARK_HOME="/usr/local/spark-1.4.1-bin-hadoop2.4")
library(SparkR)
sc <- sparkR.init(master="spark://<master-ip>:7077")
The output I get is
attaching package: ‘SparkR’
The following object is masked from ‘package:stats’:
filter
The following objects are masked from ‘package:base’:
intersect, sample, table
Launching java with spark-submit command spark-submit sparkr-shell /tmp/Rtmpzo6esw/backend_port29e74b83c7b3
Error in sparkR.init(master = "spark://10.10.5.51:7077"): JVM is not ready after 10 seconds
Error in sparkRSQL.init(sc): object 'sc' not found
I am using Spark 1.4.1. The cluster is also running CDH 5.
The JupyterHub installation can connect to the cluster via pyspark, and I have Python notebooks that use pyspark.
Can someone tell me what I am doing wrong?
I have a similar problem and have been searching all around but found no solutions. Can you please tell me what you mean by "jupyterhub installation (which is running inside a ubuntu docker because of issues with installing on CentOS)"?
We also have 4 clusters on CentOS 6.4. Another problem I have is how to use an IDE like IPython or RStudio to interact with these 4 servers. Do I use my laptop to connect to these servers remotely (if yes, then how?), and if not, what would the alternative be?
Now, to answer your question, I can give it a try. I think you have to use the --yarn-cluster option, as stated here. I hope this helps you solve the problem.
Cheers,
Ashish
