What / Where is the default Spark on Yarn Working directory? - apache-spark

When working on standalone, the working directory is basically $SPARK_HOME/work.
However, I have no idea how to find it when running in YARN mode. Can someone help me find the working directory for Spark, or for an application running on YARN?

In standalone mode, the default value is $SPARK_HOME/work.
If you want a specific working directory, set the SPARK_WORKER_DIR environment variable, for example in conf/spark-env.sh.

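When Spark runs on YARN, the working directory is located at {yourYarnLocalDir}/usercache/{yourUserName}/appcache/{yourApplicationId}, where {yourYarnLocalDir} is one of the directories listed in yarn.nodemanager.local-dirs.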
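As a rough illustration of the path layout in the answer above, the small shell function below assembles the directory from its three parts. The function name yarn_app_dir and the example values (local directory, user name, application ID) are made up for this sketch; substitute the values from your own cluster.

```shell
#!/usr/bin/env bash
# Sketch: compose the per-application directory described above.
# All three arguments are placeholders you must supply yourself.
yarn_app_dir() {
  local local_dir="$1"   # one NodeManager local directory
  local user="$2"        # user that submitted the application
  local app_id="$3"      # YARN application ID
  # ${local_dir%/} drops a single trailing slash so the result has no "//".
  printf '%s/usercache/%s/appcache/%s\n' "${local_dir%/}" "$user" "$app_id"
}

# Example call with hypothetical values:
yarn_app_dir /hadoop/yarn/local alice application_1700000000000_0001
```

On a NodeManager host you could then list that directory while the application is still running, for example with ls "$(yarn_app_dir ...)".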

Related

How can I deploy an extra Spark on an existing Ambari cluster?

I have an existing Ambari cluster with Spark 2.3.0, which has problems executing the program I developed with pyspark3, so I'm considering installing a separate Spark 3 on one of the servers and running it only in YARN mode.
Could someone tell me what I should do?
I extracted the Spark 3 package on a server and added HADOOP_CONF_DIR, YARN_CONF_DIR and SCALA_HOME to spark-env.sh. After trying spark-submit, the error below popped up:
"Failed to find Spark jars directory (/usr/localSpark/spark-3.0.0/assembly/target/scala-2.12/jars). You need to build Spark with the target "package" before running this program."
Thanks!

Can't find Spark Submit when using Spark shell

I installed spark and am trying to run a file 'train.py' in the directory, '/home/xxx/Desktop/BD_Project', in shell using the following command:
$SPARK_HOME/bin/spark-submit /home/xxx/Desktop/BD_Project/train.py > output.txt
My teammates who used the same page that I did for spark installations have no problem when running this. However, it throws up the following error for me:
bash: /bin/spark-submit: No such file or directory
The error shows that SPARK_HOME is empty, so $SPARK_HOME/bin/spark-submit expands to /bin/spark-submit. You need to set SPARK_HOME to the directory where Spark is installed, typically something like /usr/local/spark, so that spark-submit is found at /usr/local/spark/bin/spark-submit.
Before you set it, confirm where Spark is installed by going to that directory.
You can set it like this before running your command:
export SPARK_HOME=/usr/local/spark
If you are a Homebrew user, setting your SPARK_HOME to
/opt/homebrew/Cellar/apache-spark/3.3.1/libexec
should solve it. Sorry for the late response. Hopefully this helps someone with this odd error.
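One way to confirm that a candidate SPARK_HOME is usable before calling spark-submit is sketched below. The function name check_spark_home is invented for this example, and the check only looks for an executable bin/spark-submit; it does not validate anything else about the installation.

```shell
#!/usr/bin/env bash
# Sketch: report whether a candidate SPARK_HOME looks usable.
check_spark_home() {
  local home="$1"
  if [ -z "$home" ]; then
    echo "SPARK_HOME is empty"
    return 1
  fi
  # spark-submit must live directly under <home>/bin and be executable.
  if [ ! -x "$home/bin/spark-submit" ]; then
    echo "no executable bin/spark-submit under $home"
    return 1
  fi
  echo "ok: $home"
}

# Check the current environment; "|| true" keeps the script going on failure.
check_spark_home "${SPARK_HOME:-}" || true
```

If the function reports that SPARK_HOME is empty, the variable was never exported in the current shell, which matches the "bash: /bin/spark-submit" symptom in the question.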

Will copying the Apache Spark installation folder to another system work properly?

I am using Apache Spark. It is working properly in a cluster of 3 machines. Now I want to install Spark on another 3 machines.
What I did: I just copied the Spark folder that I am currently using.
Problem: ./bin/spark-shell and all other Spark commands do not work and throw the error 'No Such Command'.
Questions: 1. Why is it not working?
2. Is it possible to build a Spark installation on one machine and then distribute that installation to other machines?
I am using Ubuntu.
We looked into the problem and found that the .sh files in the copied Spark installation folder were not executable. We made the files executable and now Spark is running.
Yes, it would work, but you should ensure that you have set all the environment variables Spark requires,
like SPARK_HOME, WEBUI_PORT, etc.
Also, use a Hadoop-integrated Spark build, which comes with the supported versions of Hadoop.
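The permission fix described in the first answer could look roughly like the sketch below. The function name fix_spark_permissions and the example path are hypothetical, and the sketch assumes the launcher scripts sit directly under bin and sbin of the copied folder.

```shell
#!/usr/bin/env bash
# Sketch: restore the executable bit on launcher scripts after a copy.
fix_spark_permissions() {
  local home="$1"
  local dir
  for dir in "$home/bin" "$home/sbin"; do
    # Skip directories that are absent in this particular layout.
    [ -d "$dir" ] || continue
    # Mark every regular file directly under the directory as executable.
    find "$dir" -maxdepth 1 -type f -exec chmod +x {} +
  done
}

# Example (path is hypothetical): fix_spark_permissions /opt/spark
```

Copying with a tool that preserves file modes, such as cp -p or rsync -a, should avoid the problem in the first place.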

Unable to start beeline client

I installed spark-1.5.1-bin-without-hadoop and am trying to start beeline using the following command from the Spark install directory.
./bin/beeline
I get "Error: Could not find or load main class org.apache.hive.beeline.BeeLine".
Not sure why the classpath is not working. I ran into the same issue and ended up running java with the jars under the lib_managed directory. Note that the verbose option is used because no errors are shown in some NoClassDef cases.
java -cp lib_managed/jars/hive-exec-1.2.1.spark.jar:lib_managed/jars/hive-metastore-1.2.1.spark.jar:lib_managed/jars/httpcore-4.3.1.jar:lib_managed/jars/httpclient-4.3.2.jar:lib_managed/jars/libthrift-0.9.2.jar:lib_managed/jars/hive-beeline-1.2.1.spark.jar:lib_managed/jars/jline-2.12.jar:lib_managed/jars/commons-cli-1.2.jar:lib_managed/jars/super-csv-2.2.0.jar:lib_managed/jars/commons-logging-1.1.3.jar:lib_managed/jars/hive-jdbc-1.2.1.spark.jar:lib_managed/jars/hive-cli-1.2.1.spark.jar:lib_managed/jars/hive-service-1.2.1.spark.jar:assembly/target/scala-2.10/spark-assembly-1.5.3-SNAPSHOT-hadoop2.2.0.jar org.apache.hive.beeline.BeeLine -u jdbc:hive2://<thrift server public address>:10000/default --verbose=true
I had exactly the same problem. For me, setting the SPARK_HOME environment variable did it!
export SPARK_HOME=/Users/../Downloads/spark-2.1.1-bin-hadoop2.7
This is because, if you open the "bin/beeline" script file, you'll find this line:
# Figure out if SPARK_HOME is set
So, after setting SPARK_HOME to the proper location, beeline started working fine.

Why do spark-submit and spark-shell fail with "Failed to find Spark assembly JAR. You need to build Spark before running this program."?

I was trying to run spark-submit and I get
"Failed to find Spark assembly JAR.
You need to build Spark before running this program."
When I try to run spark-shell I get the same error.
What do I have to do in this situation?
On Windows, I found that if it is installed in a directory that has a space in the path (C:\Program Files\Spark) the installation will fail. Move it to the root or another directory with no spaces.
Your Spark package doesn't include compiled Spark code. That's why you got the error message from the spark-submit and spark-shell scripts.
You have to download one of the pre-built versions listed under "Choose a package type" on the Spark download page.
Try running mvn -DskipTests clean package first to build Spark.
If your Spark binaries are in a folder whose name contains spaces (for example, "Program Files (x86)"), it does not work. I changed it to "Program_Files", and then the spark-shell command worked in cmd.
In my case, I installed Spark with pip3 install pyspark on macOS, and the error was caused by an incorrect SPARK_HOME variable. It works when I run a command like the one below:
PYSPARK_PYTHON=python3 SPARK_HOME=/usr/local/lib/python3.7/site-packages/pyspark python3 wordcount.py a.txt
1. Go to SPARK_HOME. Note that your SPARK_HOME variable should not include /bin at the end. Append /bin only when you add it to PATH, like this: export PATH=$SPARK_HOME/bin:$PATH
2. Run export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g" to allot more memory to Maven.
3. Run ./build/mvn -DskipTests clean package and be patient. It took my system 1 hour and 17 minutes to finish this.
4. Run ./dev/make-distribution.sh --name custom-spark --pip. This is just for Python/PySpark. You can add more flags for Hive, Kubernetes, etc.
5. Running pyspark or spark-shell will now start PySpark and Spark respectively.
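The note in the steps above about SPARK_HOME not ending in /bin can be enforced with a small helper. This is only a sketch: strip_bin_suffix is a made-up name, and it relies on plain shell parameter expansion rather than anything Spark provides.

```shell
#!/usr/bin/env bash
# Sketch: remove a trailing /bin (and one trailing slash) from a path.
strip_bin_suffix() {
  local p="${1%/}"            # drop one trailing slash, if present
  printf '%s\n' "${p%/bin}"   # drop a trailing /bin, if present
}

strip_bin_suffix /usr/local/spark/bin    # prints /usr/local/spark
strip_bin_suffix /usr/local/spark        # prints /usr/local/spark
```

A possible use is export SPARK_HOME="$(strip_bin_suffix "$SPARK_HOME")" followed by export PATH="$SPARK_HOME/bin:$PATH".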
If you have downloaded the binary and are getting this exception,
then check whether your SPARK_HOME path contains spaces, like "apache spark"/bin.
Removing the spaces will make it work.
Just to add to @jurban1997's answer:
If you are running Windows, make sure that the SPARK_HOME and SCALA_HOME environment variables are set up correctly. SPARK_HOME should point to the installation root, so that {SPARK_HOME}\bin\spark-shell.cmd exists.
For a Windows machine with the pre-built version as of today (21.01.2022):
To verify all the edge cases you may have and avoid tedious guesswork about what exactly is not configured properly:
1. Find spark-class2.cmd and open it with a text editor.
2. Inspect the arguments of commands starting with call or if exists by typing the arguments in Command Prompt like this:
   - Open Command Prompt. (For PowerShell you need to print the variable another way.)
   - Copy-paste %SPARK_HOME%\bin\ as is and press Enter.
   - If you see something like bin\bin in the path displayed, then you have appended /bin to your %SPARK_HOME% environment variable.
3. Now add the path to spark/bin to your PATH variable, or it will not find the spark-submit command.
4. Try out and correct every path variable that the script in this file uses and you should be good to go.
5. After that, enter spark-submit ... You may now encounter the missing Hadoop winutils.exe problem, for which you can get the tool and paste it where spark-submit.cmd is located.
Spark installation:
For a Windows machine:
Download spark-2.1.1-bin-hadoop2.7.tgz from https://spark.apache.org/downloads.html
Unzip it, place the Spark folder in the C:\ drive, and set the environment variable.
If you don’t have Hadoop,
you need to create a Hadoop folder with a bin folder inside it, and then copy the winutils.exe file into it.
Download the winutils file from https://codeload.github.com/gvreddy1210/64bit/zip/master
and paste the winutils.exe file into the Hadoop\bin folder, then set the environment variable for C:\hadoop\bin.
Create a tmp\hive folder in the C:\ drive and give full permissions to this folder, like:
C:\Windows\system32>C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive
Open Command Prompt, first run winutils.exe from C:\hadoop\bin>, and then navigate to C:\spark\bin>
Run spark-shell.
