I created a Dataproc cluster and manually installed conda and Jupyter notebook. Then I installed pyspark via conda. I can successfully run Spark with
from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")
However, I cannot enable Hive support. The following code gets stuck and never returns anything.
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config('spark.driver.memory', '2G')
         .config("spark.kryoserializer.buffer.max", "2000m")
         .enableHiveSupport()
         .getOrCreate())
Python version 2.7.13, Spark version 2.3.4
Is there any way to enable Hive support?
I do not recommend manually installing pyspark. When you do this, you get a new Spark/pyspark installation that is different from Dataproc's own, and you miss out on its configuration, tuning, classpath, etc. This is likely the reason Hive support does not work.
To get conda with a properly configured pyspark, I suggest selecting the ANACONDA and JUPYTER optional components on image 1.3 (the default) or later.
Additionally, on 1.4 and later images, Miniconda is the default user Python with pyspark preconfigured. You can pip/conda install Jupyter on your own if you wish.
See https://cloud.google.com/dataproc/docs/tutorials/python-configuration
Also, as Jayadeep Jayaraman points out below, the Jupyter optional component works with Component Gateway, which means you can use it from a link in the Developers Console instead of opening ports to the world or SSH tunneling.
tl;dr: I recommend these flags for your next cluster: --optional-components ANACONDA,JUPYTER --enable-component-gateway
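On a cluster created with those components, Hive support should come up out of the box. A minimal sanity check (the app name is just illustrative):
from pyspark.sql import SparkSession

# On a properly configured Dataproc cluster this returns promptly
spark = (SparkSession.builder
         .appName("hive-check")
         .enableHiveSupport()
         .getOrCreate())

# Lists the Hive databases (at least "default")
spark.sql("SHOW DATABASES").show()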
Cloud Dataproc now has the option to install optional components on the cluster and also provides an easy way of accessing them via the Component Gateway. You can find details on installing Jupyter and Conda here - https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
The details of the Component Gateway can be found here - https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways. Note that it is currently in Alpha.
I'm having trouble understanding how to connect Kafka and PySpark.
I have a Kafka installation on Windows 10 with a topic nicely streaming data.
I've installed pyspark, which runs properly - I'm able to create a test DataFrame without a problem.
But when I try to connect to the Kafka stream it gives me an error:
AnalysisException: Failed to find data source: kafka. Please deploy
the application as per the deployment section of "Structured Streaming-
Kafka Integration Guide".
Spark documentation is not really helpful - it says:
...
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.2.0
...
For Python applications, you need to add this above library and its dependencies when deploying your application. See the Deploying subsection below.
And then when you go to the Deploying section, it says:
As with any Spark applications, spark-submit is used to launch your application. spark-sql-kafka-0-10_2.12 and its dependencies can be directly added to spark-submit using --packages, such as,
./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 ...
I'm developing an app; I don't want to deploy it.
Where and how do I add these dependencies if I'm developing a pyspark app?
I've tried several tutorials and ended up more confused.
I saw an answer saying that
"You need to add the kafka-clients JAR to your --packages". so-answer
A few more steps would be useful, because for someone who is new this is unclear.
versions:
kafka 2.13-2.8.1
spark 3.1.2
java 11.0.12
All environmental variables and paths are correctly set.
EDIT
I've loaded:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.apache.kafka:kafka-clients:2.8.1'
as suggested, but I'm still getting the same error.
I've triple-checked the Kafka, Scala, and Spark versions and tried various combinations, but it didn't work; I'm still getting the same error:
AnalysisException: Failed to find data source: kafka. Please deploy
the application as per the deployment section of "Structured Streaming-Kafka Integration Guide".
EDIT 2
I installed the latest Spark 3.2.0 and Hadoop 3.3.1 and Kafka version kafka_2.12-2.8.1. I changed all environment variables and tested Spark and Kafka - both work properly.
My environment variable looks like this now:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,org.apache.kafka:kafka-clients:2.8.1'
Still no luck; I get the same error :(
Spark documentation is not really helpful - it says ... artifactId = spark-sql-kafka-0-10_2.12 version = 3.2.0 ...
Yes, that is correct... but for the latest version of Spark
versions:
spark 3.1.2
Have you tried looking at the version specific docs?
In other words, you want the matching spark-sql-kafka version of 3.1.2.
bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
Or in Python,
from pyspark.sql import SparkSession

scala_version = '2.12'
spark_version = '3.1.2'
# TODO: ensure the values above match your installed Spark and Scala versions
packages = [
    f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
    'org.apache.kafka:kafka-clients:3.2.1'
]
spark = SparkSession.builder \
    .master("local") \
    .appName("kafka-example") \
    .config("spark.jars.packages", ",".join(packages)) \
    .getOrCreate()
Or with an environment variable:
import os
spark_version = '3.1.2'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:{}'.format(spark_version)
# init spark here
need to add this above library and its dependencies
As you found in my previous answer, also append the kafka-clients package using a comma-separated list.
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.apache.kafka:kafka-clients:2.8.1
I'm developing an app; I don't want to deploy it.
"Deploy" is Spark terminology. Running locally is still a "deployment".
I have databricks-connect 6.6.0 installed, which comes with Spark 2.4.6. I have been using the Databricks cluster until now, but I am trying to switch to using a local Spark session for unit testing.
However, every time I run it, it still shows up on the cluster Spark UI as well as the local Spark UI on xxxxxx:4040.
I have tried initiating using SparkConf(), SparkContext(), and SQLContext() but they all do the same thing. I have also set the right SPARK_HOME, HADOOP_HOME, and JAVA_HOME, and downloaded winutils.exe separately, and none of these directories have spaces. I have also tried running it from console as well as from terminal using spark-submit.
This is one of the pieces of sample code I tried:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("name").getOrCreate()
inp = spark.createDataFrame([('Person1',12),('Person2',14)],['person','age'])
op = inp.toPandas()
I am using:
Windows 10, databricks-connect 6.6.0, Spark 2.4.6, JDK 1.8.0_265, Python 3.7, PyCharm Community 2020.1.1
Do I have to override the default/global spark session to initiate a local one? How would I do that?
I might be missing something - the code itself runs fine; it's just a matter of local vs. cluster.
TIA
You can't run them side by side. I recommend having two virtual environments using Conda: one for databricks-connect and one for pyspark. Then just switch between the two as needed.
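In the pyspark-only environment you can then confirm what kind of session you actually got - a small sketch (the app name is just illustrative):
# Run this in the environment with plain pyspark (no databricks-connect) installed
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("unit-tests").getOrCreate()

print(spark.sparkContext.master)    # expect "local[*]" for a purely local session
print(spark.sparkContext.uiWebUrl)  # expect a http://localhost:4040-style URL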
I can't figure out how to set PySpark as the default interpreter for Zeppelin.
I know I can make Spark the default interpreter by putting it at the top of the list. But having to remember to add %pyspark to the top of each new cell is basically as annoying as adding %spark.pyspark.
I'd just use Jupyter, but I'm working off a DC/OS cluster and Zeppelin was available as a preconfigured app, while Jupyter looks like a bit of an ordeal to install on the cluster.
So, to sum up: does anyone know how to make pyspark the default interpreter for Apache Zeppelin?
Thanks!
I am using pyspark in an IPython notebook and accessing a Netezza database. I am trying to do something similar on Bluemix. The problem is that in order to have access to Netezza, I have to add parameters to the pyspark startup. How can I do that on Bluemix? Here is how I start pyspark standalone:
$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path /usr/local/src/netezza/jdbc/lib/nzjdbc3.jar
You cannot change the parameters for starting PySpark on Bluemix.
The %AddJar kernel magic works for the Scala notebooks, only.
It does not work for Python notebooks.
The driver for Netezza, nzjdbc3.jar, has to be provided and supported on the service in order to make use of it. Currently, this cannot be done by the user.
Update:
nzjdbc3.jar is not supported out of the box. You could submit feedback via email and ask for the driver to be supported.
Another way to enable the driver for PySpark is to put the jar into a location that is picked up by the PySpark configuration.
First, find out your USER_ID by using the following command:
!whoami
Then, get nzjdbc3.jar and put it in the following location:
/gpfs/fs01/user/USER_ID/data/libs
One way to put nzjdbc3.jar into the mentioned location is to use wget:
!wget URI_TO_JAR_FILE -P /gpfs/fs01/user/USER_ID/data/libs
After the driver jar has been downloaded to that location, you have to restart the kernel. When the new kernel is created, all files in that location will be picked up for PySpark.
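Once the kernel has picked up the jar, a JDBC read against Netezza might look roughly like this - a sketch only: the host, port, credentials, and table are placeholders, and the driver class name is an assumption based on the standard Netezza JDBC driver:
# Sketch of a JDBC read (Spark 1.6-style, using the notebook's sqlContext).
# All connection details below are placeholders; the driver class is assumed.
df = (sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:netezza://HOST:5480/DATABASE")  # placeholder host/port/database
      .option("user", "USER")                              # placeholder credentials
      .option("password", "PASSWORD")
      .option("dbtable", "SCHEMA.TABLE")                   # placeholder table
      .option("driver", "org.netezza.Driver")              # assumed driver class from nzjdbc3.jar
      .load())
df.show(5)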
I just added the example project to my Zeppelin notebook from http://zeppelin-project.org/docs/tutorial/tutorial.html (section "Tutorial with Streaming Data"). The problem I now have is that the application only seems to work locally. If I change the Spark interpreter setting "master" from "local[*]" to "spark://master:7077", the application no longer returns any results when I run the same SQL statement. Am I doing anything wrong? I have already restarted the Zeppelin interpreter, the whole Zeppelin daemon, and the Spark cluster; nothing solved the issue. Can someone help?
I use the following installation:
Spark 1.5.1 (prebuild for Hadoop 2.6+), Master + 2x Slaves
Zeppelin 0.5.5 (installed on Spark's master node)
EDIT
Also the following installation won't work for me:
Spark 1.5.0 (prebuild for Hadoop 2.6+), Master + 2x Slaves
Zeppelin 0.5.5 (installed on Spark's master node)
Screenshot: local setting (works!)
Screenshot: cluster setting (won't work!)
The job seems to run correctly in cluster mode:
I figured it out after two days of trying!
The difference between the local Zeppelin Spark interpreter and the Spark cluster seems to be that the local one includes the Twitter utilities needed to run the Twitter streaming example, while the Spark cluster doesn't have this library by default.
Therefore, you have to add the dependency manually in the Zeppelin notebook before starting the application with the Spark cluster as master. So the first paragraph of the notebook must be:
%dep
z.reset
z.load("org.apache.spark:spark-streaming-twitter_2.10:1.5.1")
If an error occurs when running this paragraph, just try restarting the Zeppelin server via ./bin/zeppelin-daemon.sh stop (and then start)!