Using PySpark locally when installed using databricks-connect - python-3.x

I have databricks-connect 6.6.0 installed, which has a Spark version 2.4.6. I have been using the databricks cluster till now, but I am trying to switch to using a local spark session for unit testing.
However, every time I run it, it still shows up on the cluster Spark UI as well as the local Spark UI on xxxxxx:4040.
I have tried initializing it using SparkConf(), SparkContext(), and SQLContext(), but they all do the same thing. I have also set the right SPARK_HOME, HADOOP_HOME, and JAVA_HOME, downloaded winutils.exe separately, and made sure none of these directories contain spaces. I have also tried running it from the console as well as from the terminal using spark-submit.
This is one of the pieces of sample code I tried:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("name").getOrCreate()
inp = spark.createDataFrame([('Person1',12),('Person2',14)],['person','age'])
op = inp.toPandas()
I am using:
Windows 10, databricks-connect 6.6.0, Spark 2.4.6, JDK 1.8.0_265, Python 3.7, PyCharm Community 2020.1.1
Do I have to override the default/global spark session to initiate a local one? How would I do that?
I might be missing something - The code itself runs fine, it's just a matter of local vs. cluster.
TIA

You can't run them side by side. I recommend having two virtual environments using Conda: one for databricks-connect and one for pyspark. Then just switch between the two as needed.
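For what it's worth, after activating one of the environments you can sanity-check which pyspark installation you are importing and which master a session actually resolves to. A minimal sketch (the app name and master here are just placeholders):
import pyspark
from pyspark.sql import SparkSession

# Which pyspark package (and therefore which environment) is actually being imported
print(pyspark.__file__)
print(pyspark.__version__)

# Build a purely local session and confirm the master it resolved to
spark = SparkSession.builder.master("local[*]").appName("local-test").getOrCreate()
print(spark.sparkContext.master)  # in the plain pyspark environment this should report 'local[*]'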

Related

PySpark / Kafka - org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated

So, I'm working on setting up a development environment for working with PySpark and Kafka. I'm working through getting things set up so I can run this tutorial in a Jupyter notebook as a 'hello world' exercise: https://spark.apache.org/docs/3.1.1/structured-streaming-kafka-integration.html
Unfortunately, I'm currently hitting the following error when I attempt to connect to the Kafka stream:
Py4JJavaError: An error occurred while calling o68.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:583)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:805)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:723)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
...
Now, some digging has told me that the most common cause of this issue is a version mismatch (either for the Spark or Scala versions in use). However, I'm able to confirm that these are aligned properly:
Spark: 3.1.1
Scala: 2.12.10
conf/spark-defaults.conf
...
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
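For reference, the read the tutorial describes can also be written with the package set programmatically when the session is created. A rough sketch, assuming a broker at localhost:9092 and a topic named topic1 (both placeholders):
from pyspark.sql import SparkSession

# The Kafka connector must be on the classpath before the session starts
spark = (SparkSession.builder
         .appName("kafka-hello-world")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")
         .getOrCreate())

# Subscribe to one topic, mirroring the structured-streaming-kafka tutorial
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1")
      .load())

df.printSchema()  # key, value, topic, partition, offset, timestamp, timestampType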
EDIT
So, some additional observations from trying to figure this out:
It looks like this is at least partially a Jupyter notebook issue, as I can now get things working just fine via the pyspark shell.
Looks like my notebook is firing up its own instance of Spark, so maybe there's some difference in how Spark is being run there vs from a terminal window?
At a loss for how they're different though, as both environments should be using mostly default configurations.
Ok - looks like it doesn't work when invoked via the regular Python REPL either, which is leading me to think there's something different about the spark context being created by the pyspark shell and the one I'm creating in my notebook.
Ok - looks like something differs when things are run via Jupyter - hadoop.common.configuration.version has a value of 0.23.0 for the notebook instance, but 3.0.0 for the pyspark shell instance. Not sure why this might be or what it may mean yet.
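For what it's worth, that property can be read from inside a running session through the JVM gateway. A sketch, assuming an active spark session; note that _jsc is an internal handle rather than a public API:
# Inspect the Hadoop configuration the current session is actually using
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("hadoop.common.configuration.version"))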
What else should I check to confirm that this is set up correctly?
Ok - so it looks like the difference was that findspark was locating and using a different Spark Home directory (one that came installed with the pyspark installation via pip).
It also looks like Spark 3.1.1 for Hadoop 2.7 has issues with the Kafka client (or maybe needs to be configured differently) but Spark 3.1.1 for Hadoop 3.2 works fine.
The solution was to ensure that I explicitly chose my SPARK_HOME by passing the spark_home path to findspark.init():
findspark.init(spark_home='/path/to/desired/home')
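Put together, the working setup looked roughly like this (a sketch; the path is a placeholder for wherever the Spark 3.1.1 for Hadoop 3.2 distribution lives, and the app name is arbitrary):
import findspark

# Point findspark at the desired Spark distribution explicitly,
# instead of letting it pick up the pip-installed pyspark's home
findspark.init(spark_home='/path/to/desired/home')

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("kafka-check").getOrCreate()
print(spark.version)  # confirm this is the expected 3.1.1 build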
Things to watch out for that got me and might trip you up too:
If you've installed pyspark through pip/mambaforge, this also deploys a second SPARK_HOME, which can create dependency/library confusion.
Many of the scripts in bin/ use SPARK_HOME to determine where to execute, so don't assume that because you're running a script from one home, you're actually running Spark in that home.

SparkSessionExtensions injectFunction in Databricks environment

SparkSessionExtensions injectFunction works locally, but I can't get it working in the Databricks environment.
The itachi project defines Catalyst expressions, like age that I can successfully use locally via spark-sql:
bin/spark-sql --packages com.github.yaooqinn:itachi_2.12:0.1.0 --conf spark.sql.extensions=org.apache.spark.sql.extra.PostgreSQLExtensions
spark-sql> select age(timestamp '2000', timestamp'1990');
10 years
I'm having trouble getting this working in the Databricks environment.
I started up a Databricks community cluster with the spark.sql.extensions=org.apache.spark.sql.extra.PostgreSQLExtensions configuration option set.
Then I attached the library.
The array_append function that's defined in itachi isn't accessible the way I expected it to be, even though the configuration option appears to be properly set.
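One way to double-check that from a notebook cell (a quick sketch, assuming the notebook's predefined spark session):
# Should print org.apache.spark.sql.extra.PostgreSQLExtensions if the option took effect
print(spark.conf.get("spark.sql.extensions"))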
spark-alchemy has another approach that works in the Databricks environment. Do we need to mess around with Spark internals to get this working in the Databricks environment? Or is there a way to get injectFunction working in Databricks?
The spark.sql.extensions option works just fine on full Databricks (unless it goes too deep into the internals of Spark - sometimes there are incompatibilities), but not on Community Edition. The problem is that spark.sql.extensions is applied during session initialization, while a library specified in the UI is installed afterwards, so the install happens after (or in parallel with) initialization. On full Databricks that is worked around by using an init script to install the library before the cluster starts, but this functionality is not available on Community Edition.
The workaround would be to register functions explicitly, like this:
%scala
import org.apache.spark.sql.catalyst.expressions.postgresql.{Age, ArrayAppend, ArrayLength, IntervalJustifyLike, Scale, SplitPart, StringToArray, UnNest}
import org.apache.spark.sql.extra.FunctionAliases
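// Each *.fd value (and each FunctionAliases member) appears to be a
// (FunctionIdentifier, ExpressionInfo, FunctionBuilder) triple,
// hence the three positional arguments to registerFunction below.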
spark.sessionState.functionRegistry.registerFunction(Age.fd._1, Age.fd._2, Age.fd._3)
spark.sessionState.functionRegistry.registerFunction(FunctionAliases.array_cat._1, FunctionAliases.array_cat._2, FunctionAliases.array_cat._3)
spark.sessionState.functionRegistry.registerFunction(ArrayAppend.fd._1, ArrayAppend.fd._2, ArrayAppend.fd._3)
spark.sessionState.functionRegistry.registerFunction(ArrayLength.fd._1, ArrayLength.fd._2, ArrayLength.fd._3)
spark.sessionState.functionRegistry.registerFunction(IntervalJustifyLike.justifyDays._1, IntervalJustifyLike.justifyDays._2, IntervalJustifyLike.justifyDays._3)
spark.sessionState.functionRegistry.registerFunction(IntervalJustifyLike.justifyHours._1, IntervalJustifyLike.justifyHours._2, IntervalJustifyLike.justifyHours._3)
spark.sessionState.functionRegistry.registerFunction(IntervalJustifyLike.justifyInterval._1, IntervalJustifyLike.justifyInterval._2, IntervalJustifyLike.justifyInterval._3)
spark.sessionState.functionRegistry.registerFunction(Scale.fd._1, Scale.fd._2, Scale.fd._3)
spark.sessionState.functionRegistry.registerFunction(SplitPart.fd._1, SplitPart.fd._2, SplitPart.fd._3)
spark.sessionState.functionRegistry.registerFunction(StringToArray.fd._1, StringToArray.fd._2, StringToArray.fd._3)
spark.sessionState.functionRegistry.registerFunction(UnNest.fd._1, UnNest.fd._2, UnNest.fd._3)
After that it works:
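For example, from a Python cell in the same session (a sketch, assuming the %scala registration cell above has already been run):
spark.sql("select age(timestamp '2000', timestamp '1990')").show()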
It's not as handy as extensions, but that's a limitation of CE.

How to enable pyspark HIVE support on Google Dataproc master node

I created a Dataproc cluster and manually installed Conda and Jupyter notebook. Then I installed pyspark with conda. I can successfully run Spark with:
from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")
However, I cannot enable HIVE support. The following code gets stuck and doesn't return anything.
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config('spark.driver.memory', '2G')
         .config("spark.kryoserializer.buffer.max", "2000m")
         .enableHiveSupport()
         .getOrCreate())
Python version 2.7.13, Spark version 2.3.4
Any way to enable HIVE support?
I do not recommend manually installing pyspark. When you do this, you get a new Spark/pyspark installation that is different from Dataproc's own, and you do not get its configuration/tuning/classpath/etc. This is likely the reason Hive support does not work.
To get conda with properly configured pyspark I suggest selecting ANACONDA and JUPYTER optional components on image 1.3 (the default) or later.
Additionally, on 1.4 and later images, Miniconda is the default user Python with pyspark preconfigured. You can pip/conda install Jupyter on your own if you wish.
See https://cloud.google.com/dataproc/docs/tutorials/python-configuration
Also, as #Jayadeep Jayaraman points out, the Jupyter optional component works with Component Gateway, which means you can use it via a link in the Developers Console instead of opening ports to the world or SSH tunneling.
tl;dr: I recommend these flags for your next cluster: --optional-components ANACONDA,JUPYTER --enable-component-gateway
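Once a cluster is created with those components, the Dataproc-provided pyspark already has Hive wired in, so a session like the one in the question should come up without hanging. A quick hedged check from a notebook (the app name is arbitrary):
from pyspark.sql import SparkSession

# On a Dataproc image with the preconfigured pyspark, Hive support
# should work out of the box
spark = (SparkSession.builder
         .appName("hive-check")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()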
Cloud Dataproc now has the option to install optional components in the dataproc cluster and also has an easy way of accessing them via the Gateway. You can find details of installing Jupyter and Conda here - https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
The details of the component gateway can be found here - https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways. Note that this is Alpha.

Working locally with remote spark context on EMR

I am trying to figure out if it is possible to work locally in Python with the Spark context of a remote EMR cluster (AWS). I've set up the cluster, but a locally defined SparkContext with a remote master doesn't seem to work. Does anybody have experience with that? Working in a remote notebook is limited because you cannot create Python modules and files, and working locally is limited due to computing resources. There is the option to SSH into the master node, but then I cannot use a graphical IDE such as PyCharm.

Apache Zeppelin & Spark Streaming: Twitter Example only works local

I just added the example project to my Zeppelin notebook from http://zeppelin-project.org/docs/tutorial/tutorial.html (section "Tutorial with Streaming Data"). The problem I now have is that the application only seems to work locally. If I change the Spark interpreter setting "master" from "local[*]" to "spark://master:7077", the application no longer returns any results when I run the same SQL statement. Am I doing anything wrong? I already restarted the Zeppelin interpreter, the whole Zeppelin daemon, and the Spark cluster; nothing solved the issue. Can someone help?
I use the following installation:
Spark 1.5.1 (prebuilt for Hadoop 2.6+), Master + 2x Slaves
Zeppelin 0.5.5 (installed on Spark's master node)
EDIT
Also the following installation won't work for me:
Spark 1.5.0 (prebuilt for Hadoop 2.6+), Master + 2x Slaves
Zeppelin 0.5.5 (installed on Spark's master node)
Screenshot: local setting (works!)
Screenshot: cluster setting (won't work!)
The job seems to run correctly in cluster mode.
I got it after two days of trying!
The difference between the local Zeppelin Spark interpreter and the Spark cluster seems to be that the local one includes the Twitter utilities (spark-streaming-twitter) needed to run the Twitter streaming example, while the Spark cluster doesn't ship this library by default.
Therefore you have to add the dependency manually in the Zeppelin notebook before starting the application with the Spark cluster as master. So the first paragraph of the notebook must be:
%dep
z.reset
z.load("org.apache.spark:spark-streaming-twitter_2.10:1.5.1")
If an error occurs when running this paragraph, just try restarting the Zeppelin server via ./bin/zeppelin-daemon.sh stop (& start)!
