Use Apache Zeppelin with existing Spark Cluster - apache-spark

I want to install Zeppelin to use my existing Spark cluster. I set it up the following way:
Spark Master (Spark 1.5.0 for Hadoop 2.4) with Zeppelin 0.5.5
Spark Slave
I downloaded Zeppelin v0.5.5 and built it via:
mvn clean package -Pspark-1.5 -Dspark.version=1.5.0 -Dhadoop.version=2.4.0 -Phadoop-2.4 -DskipTests
I noticed that the local[*] master setting also works without my Spark cluster (the notebook is still runnable when the Spark cluster is shut down).
My problem: When I want to use my Spark cluster for a streaming application, it does not seem to work correctly. My SQL table stays empty when I use spark://my_server:7077 as master - in local mode everything works fine!
See also my other question which describes the problem: Apache Zeppelin & Spark Streaming: Twitter Example only works local
Did I do something wrong
with the installation via "mvn clean package"?
with the master URL setting?
with the Spark and/or Hadoop versions (are there any limitations)?
Do I have to set something special in the zeppelin-env.sh file (it is currently still at its defaults)?
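For reference, pointing Zeppelin at an existing cluster usually only needs two settings in conf/zeppelin-env.sh; a minimal sketch, assuming Spark lives under /usr/local/spark on the Zeppelin host (the path is a placeholder, not from the original post):
# conf/zeppelin-env.sh
export SPARK_HOME=/usr/local/spark      # use the cluster's Spark build instead of the embedded one
export MASTER=spark://my_server:7077    # same master URL as used in the notebook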

The problem was caused by a missing library dependency! So before searching around too long, first check the dependencies and whether one is missing!
%dep
z.reset
z.load("org.apache.spark:spark-streaming-twitter_2.10:1.5.1")

Related

where is local hadoop folder in pyspark (mac)

I have installed pyspark on my local Mac using Homebrew. I am able to see Spark under /usr/local/Cellar/apache-spark/3.2.1/
but I am not able to see a hadoop folder. If I run pyspark in a terminal it starts the Spark shell.
Where can I see its path?
I am trying to connect S3 to pyspark and I have the dependency JARs.
You do not need to know the location of Hadoop to do this.
You should use a command like spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 app.py instead, which will pull in all necessary dependencies automatically rather than requiring you to download all the JARs (with their dependencies) locally.
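Once hadoop-aws is pulled in this way, the S3A credentials can be passed on the same command line; a sketch with placeholder keys and script name, not taken from the original post:
# spark.hadoop.* settings are copied into the Hadoop configuration that S3A reads
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.3.1 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  app.py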

Spark on kubernetes with zeppelin

I am following this guide to bring up a zeppelin container in a local kubernetes cluster set up using minikube.
https://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html
I am able to set up zeppelin and run some sample code there. I have downloaded spark 2.4.5 & 2.4.0 source code and built it for kubernetes support with the following command:
./build/mvn -Pkubernetes -DskipTests clean package
Once spark is built I created a docker container as explained in the article:
bin/docker-image-tool.sh -m -t 2.4.X build
I configured zeppelin to use the spark image which was built with kubernetes support. The article above explains that the spark interpreter will auto configure spark on kubernetes to run in client mode and run the job.
But whenever I try to run any paragraph with spark I receive the following error:
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
I tried setting the spark configuration spark.jars.ivy in zeppelin to point to a temp directory but that does not work either.
I found a similar issue here:
basedir must be absolute: ?/.ivy2/local
But I can't seem to get Spark to run with the spark.jars.ivy=/tmp/.ivy config. I tried adding it to spark-defaults.conf when building Spark, but that does not seem to work either.
I am quite stumped by this problem; any guidance on how to solve it would be appreciated.
Thanks!
I have also run into this problem, but the work-around I used for setting spark.jars.ivy=/tmp/.ivy is to set it as an environment variable instead.
In your spark interpreter settings, add the following property: SPARK_SUBMIT_OPTIONS and set its value to --conf spark.jars.ivy=/tmp/.ivy.
This should pass additional options to spark submit and your job should continue.
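Concretely, the same workaround can also be exported in conf/zeppelin-env.sh; a sketch of the property described above:
# conf/zeppelin-env.sh
export SPARK_SUBMIT_OPTIONS="--conf spark.jars.ivy=/tmp/.ivy"   # extra flags appended to spark-submit by the Spark interpreter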

Is there a way to use PySpark with Hadoop 2.8+?

I would like to run a PySpark job locally, using a specific version of Hadoop (let's say hadoop-aws 2.8.5) because of some features.
PySpark versions seem to be aligned with Spark versions.
Here I use PySpark 2.4.5 which seems to wrap a Spark 2.4.5.
When submitting my PySpark job locally using spark-submit --master local[4] ..., with the option --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5, I encounter the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql
With the following java exceptions:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
Or:
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
I suppose that the PySpark job's Hadoop version is not aligned with the one I pass to the spark-submit option spark.jars.packages.
But I have no idea how I could make it work. :)
The default Spark distro has Hadoop libraries included. Spark uses its own (system) libraries first. So you should either set --conf spark.driver.userClassPathFirst=true (and for a cluster also add --conf spark.executor.userClassPathFirst=true), or download a Spark distro without Hadoop. You will probably have to put your Hadoop distro into the Spark distro's jars directory.
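Put together, the first option looks roughly like this for the hadoop-aws case; a sketch, assuming the flags are passed straight to spark-submit (drop the executor flag when running purely locally):
# prefer the user-supplied Hadoop jars over Spark's bundled ones on driver and executors
spark-submit \
  --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5 \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  app.py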
Ok, I found a solution:
1 - Install Hadoop in the expected version (2.8.5 for me)
2 - Install a Hadoop Free version of Spark (2.4.4 for me)
3 - Set the SPARK_DIST_CLASSPATH environment variable, to make Spark use the custom version of Hadoop (see the sketch below the list).
(cf. https://spark.apache.org/docs/2.4.4/hadoop-provided.html)
4 - Add the PySpark directories to PYTHONPATH environment variable, like the following:
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
(Note that the py4j version may differ.)
That's it.
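For step 3, the linked page describes deriving SPARK_DIST_CLASSPATH from the hadoop command; a sketch, assuming the Hadoop 2.8.5 binaries are on the PATH:
# tell the "Hadoop free" Spark build where to find the Hadoop jars
export SPARK_DIST_CLASSPATH=$(hadoop classpath)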

Needed spark-assembly-1.5.2-hadoop2.7.jar for Spark-Oozie workflow on HDP 2.3.2

I am trying to schedule a Spark 1.5.2 job on Oozie 4.2.0 (HDP 2.3.x). Spark 1.5.2 has been installed externally; I am not using the default Spark version provided by Hortonworks. I am referring to the post below to set this up.
https://community.hortonworks.com/questions/7014/oozie-sparkaction-throwing-javalangnosuchmethoderr.html
I am struggling to find the jars below.
-spark-assembly-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar
-spark-examples-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar
If you can help me with some pointers to find/download the above jars, it would be a great help to get started.
Have you checked in the Spark lib path
/usr/hdp/current/spark-client/lib
There you can find:
[ram@IP lib]$ ls
datanucleus-api-jdo-3.2.6.jar datanucleus-rdbms-3.2.9.jar spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar spark-hdp-assembly.jar
datanucleus-core-3.2.10.jar spark-1.6.1.2.4.2.0-258-yarn-shuffle.jar spark-examples-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar
[ram@IP lib]$
Then copy the needed libs to your workflow's lib directory,
eg: hadoop fs -put /usr/hdp/current/spark-client/lib/* YOUR_WORKFLOW/lib
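If the jars still are not picked up, the usual Oozie knobs live in job.properties; a sketch based on general Oozie practice rather than on this post, with placeholder paths:
# job.properties (${nameNode} is assumed to be defined earlier in the file)
oozie.use.system.libpath=true
oozie.libpath=${nameNode}/user/YOUR_USER/YOUR_WORKFLOW/lib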

Copying the Apache Spark installation folder to another system will work properly?

I am using Apache Spark. It is working properly in a cluster with 3 machines. Now I want to install Spark on another 3 machines.
What I did: I tried to just copy the Spark folder which I am currently using.
Problem: ./bin/spark-shell and all other spark commands are not working and throw the error 'No Such Command'.
Questions:
1. Why is it not working?
2. Is it possible to build the Spark installation on one machine and then distribute it to the other machines from there?
I am using Ubuntu.
We looked into the problem and found that the copied Spark installation folder contained the .sh files, but they were not executable. We just made the files executable, and now Spark is running.
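For example; a sketch, assuming the copied folder is referenced by SPARK_HOME:
# restore the execute bit on the launcher scripts lost during the copy
chmod +x $SPARK_HOME/bin/* $SPARK_HOME/sbin/*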
Yes, it will work, but you should ensure that you have set all the environment variables required for Spark to work,
like SPARK_HOME, WEBUI_PORT, etc.
Also use a Hadoop-integrated Spark build, which comes with the supported versions of Hadoop.
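A minimal sketch of the environment setup on each new machine, assuming the folder was copied to /opt/spark (the path is a placeholder):
# e.g. in ~/.bashrc on every node
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH   # so spark-shell, spark-submit etc. resolve from anywhere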
