How can I deploy an extra Spark on an existing Ambari cluster? - apache-spark

I have an existing Ambari cluster with Spark 2.3.0, which has problems executing the program I developed with PySpark 3, so I'm considering installing another Spark 3 on one of the servers and running it only in YARN mode.
Could someone tell me what I should do?
I tried extracting the Spark 3 package on a server and added HADOOP_CONF_DIR, YARN_CONF_DIR, and SCALA_HOME in spark-env.sh (a sketch of those additions follows the error). After trying spark-submit, the following error popped up:
"Failed to find Spark jars directory (/usr/localSpark/spark-3.0.0/assembly/target/scala-2.12/jars). You need to build Spark with the target "package" before running this program.
"
Thanks!

Related

Spark on kubernetes with zeppelin

I am following this guide to run a Zeppelin container in a local Kubernetes cluster set up using minikube.
https://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html
I am able to set up Zeppelin and run some sample code there. I have downloaded the Spark 2.4.5 and 2.4.0 source code and built it for Kubernetes support with the following command:
./build/mvn -Pkubernetes -DskipTests clean package
Once Spark was built, I created a Docker image as explained in the article:
bin/docker-image-tool.sh -m -t 2.4.X build
I configured Zeppelin to use the Spark image that was built with Kubernetes support. The article above explains that the Spark interpreter will auto-configure Spark on Kubernetes to run in client mode and run the job.
But whenever I try to run any paragraph with Spark, I receive the following error:
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
I tried setting the spark configuration spark.jars.ivy in zeppelin to point to a temp directory but that does not work either.
I found a similar issue here:
basedir must be absolute: ?/.ivy2/local
But I can't seem to configure Spark to run with the spark.jars.ivy=/tmp/.ivy setting. I also tried baking it into spark-defaults.conf when building Spark, but that does not seem to work either.
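For reference, the spark-defaults.conf form of that setting is a single key/value line (a sketch using the /tmp/.ivy value mentioned above):

spark.jars.ivy /tmp/.ivy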
I'm quite stumped by this problem; any guidance on how to solve it would be appreciated.
Thanks!
I have also run into this problem; a workaround I used for setting spark.jars.ivy=/tmp/.ivy is to set it as an environment variable instead.
In your Spark interpreter settings, add the property SPARK_SUBMIT_OPTIONS and set its value to --conf spark.jars.ivy=/tmp/.ivy.
This should pass additional options to spark-submit, and your job should continue.
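As a rough sketch, the same option can also be set in conf/zeppelin-env.sh instead of through the interpreter settings UI (assuming a standard Zeppelin install):

export SPARK_SUBMIT_OPTIONS="--conf spark.jars.ivy=/tmp/.ivy"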

Run spark from source code on Windows - no such file or directory error

I would like to run Spark from source code on my Windows machine. I did the following steps:
git clone https://github.com/apache/spark
Added the SPARK_HOME variable into the user variables.
Added %SPARK_HOME%\bin to the PATH variable.
./build/mvn -DskipTests clean package
./bin/spark-shell
The last command returns the following error:
What should I do to fix the error?
First, refer to the link below for the solution. The top-voted answer gave me the working script for this problem.
Failed to start master for Spark in Windows
The reason is that the Spark launch scripts do not support Windows. The Spark documentation (https://spark.apache.org/docs/1.2.0/spark-standalone.html) instructs you to start the master and workers manually if you are a Windows user. So you need to first start the master and then run spark-shell.
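A rough sketch of that sequence, with each command run in its own command prompt from the Spark build directory (the host and port are placeholders; 7077 is the standalone master's usual default):

bin\spark-class org.apache.spark.deploy.master.Master
bin\spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077
bin\spark-shell --master spark://localhost:7077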

What / Where is the default Spark on Yarn Working directory?

When working in standalone mode, the working directory is basically $SPARK_HOME/work.
However, I have no idea how to find it when working in YARN mode. Can someone help me find the working directory for Spark, or for an application running on YARN?
The default value is always $SPARK_HOME/work.
If you want a specific working directory, configure the SPARK_WORKER_DIR environment variable, for example in conf/spark-env.sh.
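A minimal sketch of that change (the path is just an example):

# conf/spark-env.sh
export SPARK_WORKER_DIR=/data/spark-work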
When Spark runs on YARN, the work dir is located at {yourYarnLocalDir}/usercache/{yourUserName}/appcache/{yourApplicationId}

"Cannot find hadoop installation : $HADOOP_HOME .. " getting this error while trying to run hive on spark.

I have followed this https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-Configurationpropertydetails
Have executed:
set spark.home=/location/to/sparkHome;
set hive.execution.engine=spark;
set spark.master= Spark-Master-URL
However, on running ./hive I am getting the following error:
Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path
I do not have Hadoop installed, and I want to run Hive on top of Spark running in standalone mode.
Is it mandatory to have Hadoop set up in order to run Hive over Spark?
IMHO Hive cannot run without Hadoop. There may be VMs which have everything pre-installed. Hive runs on top of Hadoop, so first you need to install Hadoop and then you can try Hive.
Please refer to https://stackoverflow.com/a/21339399/5756149.
Anyone correct me if I am wrong.
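If you do install Hadoop, the error message itself points at the fix; a hedged sketch, assuming Hadoop is unpacked at /opt/hadoop (a placeholder path):

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin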

Will copying the Apache Spark installation folder to another system work properly?

I am using Apache Spark. It is working properly in a cluster with 3 machines. Now I want to install Spark on another 3 machines.
What I did: I tried to just copy the Spark folder that I am currently using.
Problem: ./bin/spark-shell and all other Spark commands are not working, throwing the error 'No Such Command'.
Questions: 1. Why is it not working?
2. Is it possible to build the Spark installation on one machine and then distribute it to the other machines?
I am using Ubuntu.
We looked into the problem and found that the copied Spark installation folder had the .sh files, but they were not executable. We just made the files executable, and now Spark is running.
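A minimal sketch of that fix, assuming the copied folder is pointed to by $SPARK_HOME:

chmod +x $SPARK_HOME/bin/* $SPARK_HOME/sbin/*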
Yes, it would work, but you should ensure that you have set all the environment variables required for Spark to work, like SPARK_HOME, WEBUI_PORT, etc.
Also use a Hadoop-integrated Spark build, which comes with the supported versions of Hadoop.
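A rough sketch of those environment variables, e.g. in ~/.bashrc on each new machine (the path and port are placeholders; the standalone master's web UI port variable is SPARK_MASTER_WEBUI_PORT):

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
export SPARK_MASTER_WEBUI_PORT=8080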
