Making pyspark default in Apache Zeppelin? - apache-spark

Can't find out how to set PySpark to be the default interpreter for Zeppelin.
I know I can make Spark the default interpreter by putting it at the top of the list. But having to remember to add %pyspark to the top of each new cell is basically as annoying as adding %spark.pyspark.
I'd just use Jupyter, but I'm working off a DC/OS Cluster and Zeppelin was available as a preconfigured app, while Jupyter looks like a bit of an ordeal to install on the cluster.
So, to summarize: does anyone know how to make pyspark the default interpreter for Apache Zeppelin?
Thanks!

Related

PySpark / Kafka - org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated

So, I'm working on setting up a development environment for working with PySpark and Kafka. I'm getting things set up so I can run this tutorial in a Jupyter notebook as a 'hello world' exercise: https://spark.apache.org/docs/3.1.1/structured-streaming-kafka-integration.html
Unfortunately, I'm currently hitting the following error when I attempt to connect to the Kafka stream:
Py4JJavaError: An error occurred while calling o68.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:583)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:805)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:723)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
...
Now, some digging has told me that the most common cause of this issue is a version mismatch (either in the Spark or Scala versions in use). However, I'm able to confirm that these are aligned properly:
Spark: 3.1.1
Scala: 2.12.10
conf/spark-defaults.conf
...
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
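For reference, here is a quick way to double-check those versions from a live session (a sketch; it assumes a plain pyspark session and nothing else):
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)          # the pip/conda-side pyspark, expect 3.1.1
spark = SparkSession.builder.getOrCreate()
print(spark.version)                # the JVM-side Spark, should also be 3.1.1
print(spark.sparkContext._jvm.scala.util.Properties.versionString())  # expect "version 2.12.x"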
EDIT
So, some additional observations from trying to figure this out:
It looks like this is at least partially a Jupyter notebook issue, as I can now get things working just fine via the pyspark shell.
Looks like my notebook is firing up its own instance of Spark, so maybe there's some difference in how Spark is being run there vs from a terminal window?
At a loss for how they're different though, as both environments should be using mostly default configurations.
Ok - looks like it doesn't work when invoked via the regular Python REPL either, which is leading me to think there's something different about the spark context being created by the pyspark shell and the one I'm creating in my notebook.
Ok - looks like something differs when things are run via Jupyter - hadoop.common.configuration.version has a value of 0.23.0 for the notebook instance, but 3.0.0 for the pyspark shell instance. Not sure why this might be or what it may mean yet.
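For reference, that configuration value can be read from a live session in either environment (a sketch; assumes a SparkSession named spark):
hconf = spark.sparkContext._jsc.hadoopConfiguration()
print(hconf.get("hadoop.common.configuration.version"))  # 3.0.0 from the shell, 0.23.0 from the notebook in my case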
What else should I check to confirm that this is set up correctly?
Ok - so it looks like the difference was that findspark was locating and using a different Spark Home directory (one that came installed with the pyspark installation via pip).
It also looks like Spark 3.1.1 for Hadoop 2.7 has issues with the Kafka client (or maybe needs to be configured differently) but Spark 3.1.1 for Hadoop 3.2 works fine.
Solution was to ensure that I explicitly chose my SPARK_HOME by passing the spark_home path to findspark.init()
findspark.init(spark_home='/path/to/desired/home')
Things to watch out for that got me and might trip you up too:
If you've installed pyspark through pip / mambaforge, this will also install a second SPARK_HOME, which can create dependency / library confusion.
Many of the scripts in bin/ use SPARK_HOME to determine where to execute, so don't assume that because you're running a script from one home, you're running Spark IN that home.
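Putting the fix together, here is a minimal sketch of the setup that ended up working for me (the SPARK_HOME path, brokers, and topic below are placeholders):
import findspark
# Point findspark at the Spark-for-Hadoop-3.2 build you actually want,
# not the copy that pip installed alongside pyspark.
findspark.init(spark_home='/path/to/spark-3.1.1-bin-hadoop3.2')  # placeholder path

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-hello-world")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")
         .getOrCreate())

# Placeholder brokers/topic, following the structured streaming + Kafka guide.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test-topic")
      .load())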

Using Pyspark locally when installed using databricks-connect

I have databricks-connect 6.6.0 installed, which comes with Spark 2.4.6. I have been using the Databricks cluster until now, but I am trying to switch to using a local Spark session for unit testing.
However, every time I run it, it still shows up on the cluster Spark UI as well as the local Spark UI on xxxxxx:4040.
I have tried initializing with SparkConf(), SparkContext(), and SQLContext(), but they all do the same thing. I have also set the right SPARK_HOME, HADOOP_HOME, and JAVA_HOME, downloaded winutils.exe separately, and made sure none of these directories have spaces. I have also tried running it from the console as well as from the terminal using spark-submit.
This is one of the pieces of sample code I tried:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("name").getOrCreate()
inp = spark.createDataFrame([('Person1',12),('Person2',14)],['person','age'])
op = inp.toPandas()
I am using:
Windows 10, databricks-connect 6.6.0, Spark 2.4.6, JDK 1.8.0_265, Python 3.7, PyCharm Community 2020.1.1
Do I have to override the default/global spark session to initiate a local one? How would I do that?
I might be missing something - The code itself runs fine, it's just a matter of local vs. cluster.
TIA
You can't run them side by side. I recommend having two virtual environments using Conda: one for databricks-connect, one for pyspark. Then just switch between the two as needed.
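For what it's worth, once you are in the plain pyspark environment, you can confirm the session really is local (a sketch; the app name and master are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("unit-tests").getOrCreate()
print(spark.version)                # the Spark build this environment resolves to
print(spark.sparkContext.master)    # local[*] for a purely local session
print(spark.sparkContext.uiWebUrl)  # the UI this particular session is serving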

How to enable pyspark HIVE support on Google Dataproc master node

I created a Dataproc cluster and manually installed Conda and Jupyter Notebook. Then I installed pyspark with conda. I can successfully run Spark with:
from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")
However, I cannot enable Hive support. The following code gets stuck and doesn't return anything:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config('spark.driver.memory', '2G')
         .config("spark.kryoserializer.buffer.max", "2000m")
         .enableHiveSupport()
         .getOrCreate())
Python version 2.7.13, Spark version 2.3.4
Any way to enable HIVE support?
I do not recommend manually installing pyspark. When you do this, you get a second Spark/pyspark installation that is different from Dataproc's own, and you do not get Dataproc's configuration, tuning, classpath, etc. This is likely the reason Hive support does not work.
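One way to see whether that is what's happening is to check which installation the notebook actually imports (a sketch; exact paths vary by image):
import os
import pyspark

print(pyspark.__file__)              # a pip/conda install lives under site-packages
print(os.environ.get("SPARK_HOME"))  # Dataproc's own Spark lives under /usr/lib/spark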
To get conda with a properly configured pyspark, I suggest selecting the ANACONDA and JUPYTER optional components on image 1.3 (the default) or later.
Additionally, on 1.4 and later images, Miniconda is the default user Python with pyspark preconfigured. You can pip/conda install Jupyter on your own if you wish.
See https://cloud.google.com/dataproc/docs/tutorials/python-configuration
Also, as @Jayadeep Jayaraman points out, the Jupyter optional component works with Component Gateway, which means you can use it from a link in the Developers Console instead of opening ports to the world or SSH tunneling.
tl;dr: I recommend these flags for your next cluster: --optional-components ANACONDA,JUPYTER --enable-component-gateway
Cloud Dataproc now has the option to install optional components in the dataproc cluster and also has an easy way of accessing them via the Gateway. You can find details of installing Jupyter and Conda here - https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
The details of the component gateway can be found here - https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways. Note that this is Alpha.
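Once you are on the cluster's own Spark (e.g. via the optional components above), a quick sanity check that Hive support is actually wired up might look like this (a sketch):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-check")
         .enableHiveSupport()
         .getOrCreate())

print(spark.conf.get("spark.sql.catalogImplementation"))  # expect 'hive'
spark.sql("SHOW DATABASES").show()                        # should return promptly instead of hanging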

Is there a way to add spark.driver.extraClassPath through PyCharm?

I'm trying to run a Spark job on my local machine to write some data to a postgres db.
You can set this through the spark-defaults.conf configuration file (under $SPARK_HOME/conf).
The same approach applies whether you invoke Spark through PyCharm, pyspark, spark-shell, or any other means.
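For illustration, a hedged sketch of both routes (the PostgreSQL driver jar path is a placeholder; note that spark.driver.extraClassPath must be set before the driver JVM starts, so the in-code variant only works in a fresh process with no existing session):
# In $SPARK_HOME/conf/spark-defaults.conf:
#   spark.driver.extraClassPath  /path/to/postgresql-driver.jar
#
# Or, from the script PyCharm runs, before any SparkSession/SparkContext exists:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("postgres-writer")
         .config("spark.driver.extraClassPath", "/path/to/postgresql-driver.jar")
         .getOrCreate())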

how to add parameters to bluemix pyspark

I am using pyspark in an IPython notebook and accessing a Netezza database. I am trying to do something similar on Bluemix. The problem is that in order to have access to Netezza, I have to add parameters to the pyspark startup. How can I do that on Bluemix? Here is how I start pyspark standalone:
$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path /usr/local/src/netezza/jdbc/lib/nzjdbc3.jar
You cannot change the parameters for starting PySpark on Bluemix.
The %AddJar kernel magic works for Scala notebooks only; it does not work for Python notebooks.
The Netezza driver nzjdbc3.jar has to be provided and supported by the service in order to make use of it. Currently, this cannot be done by the user.
Update:
nzjdbc3.jar is not supported out of the box. You could submit feedback via email and ask for the driver to be supported.
Another way to enable the driver for PySpark is to put the jar into a location that is picked up by the PySpark configuration.
First, find out your USER_ID by using the following command:
!whoami
Then, get nzjdbc3.jar and put it in the following location:
/gpfs/fs01/user/USER_ID/data/libs
One way to put nzjdbc3.jar into the mentioned location is to use wget:
!wget URI_TO_JAR_FILE -P /gpfs/fs01/user/USER_ID/data/libs
After the driver jar has been downloaded to that location, you have to restart the kernel. When the new kernel is created, all files in that location will be picked up for PySpark.
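After the restart, using the driver from the notebook might look like the following sketch (host, port, credentials, and table are placeholders, and it assumes the notebook's preconfigured sqlContext and the standard Netezza JDBC driver class):
df = (sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:netezza://HOST:5480/DATABASE")   # placeholder host/database
      .option("driver", "org.netezza.Driver")
      .option("user", "USER")                               # placeholder credentials
      .option("password", "PASSWORD")
      .option("dbtable", "SCHEMA.TABLE")                    # placeholder table
      .load())
df.show()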
