how to add parameters to bluemix pyspark - apache-spark

I am using pyspark in an IPython notebook and accessing a Netezza database. I am trying to do something similar on Bluemix. The problem is that in order to have access to Netezza, I have to add parameters to the pyspark startup. How can I do that on Bluemix? Here is how I start pyspark standalone:
$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path /usr/local/src/netezza/jdbc/lib/nzjdbc3.jar

You cannot change the parameters for starting PySpark on Bluemix.
The %AddJar kernel magic works for Scala notebooks only; it does not work for Python notebooks.
The Netezza driver nzjdbc3.jar has to be provided and supported on the service in order to make use of it. Currently, this cannot be done by the user.
Update:
nzjdbc3.jar is not supported out of the box. You could submit feedback via email and ask for the driver to be supported.
Another way to enable the driver for PySpark is to put the jar into a location that is picked up by the PySpark configuration.
First, find out your USER_ID by using the following command:
!whoami
Then get nzjdbc3.jar and put it in the following location:
/gpfs/fs01/user/USER_ID/data/libs
One way to put nzjdbc3.jar into the mentioned location is to use wget:
!wget URI_TO_JAR_FILE -P /gpfs/fs01/user/USER_ID/data/libs
After the driver jar has been downloaded to that location, you have to restart the kernel. When the new kernel is created, all files in that location are picked up for PySpark.
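For illustration, here is a minimal sketch of what reading from Netezza could then look like in the notebook. It assumes the sqlContext that the notebook predefines, the org.netezza.Driver class shipped in nzjdbc3.jar, and placeholder host, database, table, and credentials:
# Minimal sketch: read a Netezza table over JDBC from the Python notebook.
# Assumes nzjdbc3.jar sits in /gpfs/fs01/user/USER_ID/data/libs and the kernel
# has been restarted; `sqlContext` is the SQLContext the notebook provides.
# Host, port, database, table, user, and password are placeholders.
df = (sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:netezza://nz-host.example.com:5480/MYDB")
      .option("driver", "org.netezza.Driver")
      .option("dbtable", "MY_SCHEMA.MY_TABLE")
      .option("user", "nz_user")
      .option("password", "nz_password")
      .load())

df.printSchema()
df.show(5)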

Related

PySpark / Kafka - org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated

So, I'm working on setting up a development environment for working with PySpark and Kafka. I'm working through getting things set up so I can run these tutorials in a Jupyter notebook as a 'hello world' exercise: https://spark.apache.org/docs/3.1.1/structured-streaming-kafka-integration.html
Unfortunately, I'm currently hitting the following error when I attempt to connect to the Kafka stream:
Py4JJavaError: An error occurred while calling o68.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:583)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:805)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:723)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
...
Now, some digging has told me that the most common cause of this issue is version mismatches (either for the Spark or Scala versions in use). However, I'm able to confirm that these are aligned properly:
Spark: 3.1.1
Scala: 2.12.10
conf/spark-defaults.conf
...
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
EDIT
So, some additional observations from trying to figure this out:
It looks like this is at least partially a Jupyter notebook issue, as I can now get things working just fine via the pyspark shell.
Looks like my notebook is firing up its own instance of Spark, so maybe there's some difference in how Spark is being run there vs from a terminal window?
At a loss for how they're different though, as both environments should be using mostly default configurations.
Ok - looks like it doesn't work when invoked via the regular Python REPL either, which is leading me to think there's something different about the spark context being created by the pyspark shell and the one I'm creating in my notebook.
Ok - looks like something differs when things are run via Jupyter - hadoop.common.configuration.version has a value of 0.23.0 for the notebook instance, but 3.0.0 for the pyspark shell instance. Not sure why this might be or what it may mean yet.
What else should I check to confirm that this is set up correctly?
Ok - so it looks like the difference was that findspark was locating and using a different Spark Home directory (one that came installed with the pyspark installation via pip).
It also looks like Spark 3.1.1 for Hadoop 2.7 has issues with the Kafka client (or maybe needs to be configured differently) but Spark 3.1.1 for Hadoop 3.2 works fine.
Solution was to ensure that I explicitly chose my SPARK_HOME by passing the spark_home path to findspark.init()
findspark.init(spark_home='/path/to/desired/home')
Things to watch out for that got me and might trip you up too:
If you've installed pyspark through pip / mambaforge this will also deploy a second SPARK_HOME - this can create dependency / library confusion.
Many of the scripts in bin/ use SPARK_HOME to determine where to execute, so don't assume that just because you're running a script from one home that you're running spark IN that home.
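For illustration, a minimal sketch of the working setup under placeholder paths and Kafka settings: pin SPARK_HOME with findspark before anything from pyspark is imported, then pull the Kafka connector in via spark.jars.packages.
# Sketch: point findspark at the Spark 3.1.1 (Hadoop 3.2) install you actually
# want, before importing pyspark, so the notebook does not fall back to the
# pip-installed SPARK_HOME. Path, broker, and topic below are placeholders.
import findspark
findspark.init(spark_home="/opt/spark-3.1.1-bin-hadoop3.2")

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-hello-world")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")
         .getOrCreate())

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test-topic")
      .load())

query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())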

How to enable pyspark HIVE support on Google Dataproc master node

I created a Dataproc cluster and manually installed conda and Jupyter notebook. Then I installed pyspark with conda. I can successfully run Spark with
from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")
However, I cannot enable HIVE support. The following code gets stuck and never returns:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.config('spark.driver.memory', '2G')
.config("spark.kryoserializer.buffer.max", "2000m")
.enableHiveSupport()
.getOrCreate())
Python version 2.7.13, Spark version 2.3.4
Any way to enable HIVE support?
I do not recommend manually installing pyspark. When you do this, you get a new Spark/pyspark installation that is different from Dataproc's own, and you do not get its configuration/tuning/classpath/etc. This is likely the reason Hive support does not work.
To get conda with properly configured pyspark I suggest selecting ANACONDA and JUPYTER optional components on image 1.3 (the default) or later.
Additionally, on 1.4 and later images Miniconda is the default user Python with pyspark preconfigured. You can pip/conda install Jupyter on your own if you wish.
See https://cloud.google.com/dataproc/docs/tutorials/python-configuration
Also, as @Jayadeep Jayaraman points out, the Jupyter optional component works with Component Gateway, which means you can use it from a link in the Developers Console instead of opening ports to the world or SSH tunneling.
tl;dr: I recommend these flags for your next cluster: --optional-components ANACONDA,JUPYTER --enable-component-gateway
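As a quick sanity check on such a cluster, the Dataproc-provided PySpark (not a conda- or pip-installed copy) should be able to build a Hive-enabled session; a minimal sketch, with the query purely illustrative:
# Sketch: verify Hive support using the Spark that ships with the Dataproc image
# (do not pip/conda-install a separate pyspark on top of it).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-check")
         .enableHiveSupport()
         .getOrCreate())

# If the session is wired to the cluster's Hive metastore, this returns
# promptly instead of hanging.
spark.sql("SHOW DATABASES").show()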
Cloud Dataproc now has the option to install optional components in the dataproc cluster and also has an easy way of accessing them via the Gateway. You can find details of installing Jupyter and Conda here - https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
The details of the component gateway can be found here - https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways. Note that this is Alpha.

Corrupt file getting generated when launching jupyter-pyspark kernel

We have installed and set up Jupyter notebook on two of our Linux Hadoop servers with pyspark kernels. Both servers have the same kernel.json configuration, with the same Spark and Python versions.
On one server, the Jupyter notebook UI --> pyspark kernel works fine. On the other server, launching the pyspark kernel generates a file (with name ??????????) in the user's home directory. We can execute queries in the opened pyspark kernel session, but when a new Jupyter notebook UI is launched with the corresponding pyspark kernel, we are unable to execute queries. We can execute them again only after removing the generated ??????? file and relaunching the Jupyter notebook.
We see this behaviour for all users, and it happens only on one server. Can someone please help resolve this issue?
versions:
Python 2.7.12
Spark 2.1.1
Steps performed:
-Verified the pyspark kernel configs against the Jupyter instance running on the other server, which has no issues.
-Restarted the Spark client on the server.
-Tried rebooting the server, which did not resolve the issue.
Looks like it might be an issue with the server hardware.

How to install a postgresql JDBC driver in pyspark

I use pyspark with Spark 2.2.0 on Lubuntu 16.04 and I want to write a DataFrame to my PostgreSQL database. As far as I understand it, I have to install a JDBC driver on the Spark master for that. I downloaded the PostgreSQL JDBC driver from their website and tried to follow this post. I added spark.jars.packages /path/to/driver/postgresql-42.2.1.jar to spark-defaults.conf, with the only result that pyspark no longer launches.
I'm kinda lost in Java land. For one, I don't know whether this is the right format. The documentation tells me I should add a list, but I don't know what a path list is supposed to look like. Then, I don't know whether I also have to specify spark.jars and/or spark.driver.extraClassPath, or whether spark.jars.packages is enough. And if I have to add them, what format do they take?
spark.jars.packages is for dependencies that can be pulled from Maven (think of it as pip for Java, although the analogy is probably a bit loose).
You can submit your job with the option --jars /path/to/driver/postgresql-42.2.1.jar, so that the submission also ships the library, which the cluster manager will distribute to all worker nodes on your behalf.
If you want to set this as a configuration instead, you can use the spark.jars key rather than spark.jars.packages. The latter requires Maven coordinates rather than a path (which is probably the reason why your job is failing).
You can read more about these configuration keys in the official documentation.
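To make the distinction concrete, here is a minimal sketch of both options set from PySpark itself; the jar path, connection URL, table, and credentials are placeholders:
# Sketch: two ways to make the PostgreSQL JDBC driver available to PySpark.
# The jar path, database URL, table, and credentials below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("postgres-write")
         # Option 1: comma-separated local path(s) to jar files.
         .config("spark.jars", "/path/to/driver/postgresql-42.2.1.jar")
         # Option 2 (alternative): Maven coordinates, resolved automatically.
         # .config("spark.jars.packages", "org.postgresql:postgresql:42.2.1")
         .getOrCreate())

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

(df.write
 .format("jdbc")
 .option("url", "jdbc:postgresql://localhost:5432/mydb")
 .option("dbtable", "public.my_table")
 .option("user", "myuser")
 .option("password", "mypassword")
 .option("driver", "org.postgresql.Driver")
 .mode("append")
 .save())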

Making pyspark default in Apache Zeppelin?

Can't find out how to set PySpark to be the default interpreter for Zeppelin.
I know I can make Spark the default interpreter by putting it at the top of the list, but having to remember to add %pyspark to the top of each new cell is basically as annoying as adding %spark.pyspark.
I'd just use Jupyter, but I'm working off a DC/OS Cluster and Zeppelin was available as a preconfigured app, while Jupyter looks like a bit of an ordeal to install on the cluster.
So, to summarize: does anyone know how to make pyspark the default interpreter for Apache Zeppelin?
Thanks!
