Installing Apache Spark on Windows 7 | spark-shell not working - apache-spark

I tried installing Apache Spark on my 64-bit Windows 7 machine.
I used the guides -
Installing Spark on Windows 10
How to run Apache Spark on Windows 7
Installing Apache Spark on Windows 7 environment
This is what I did -
Install Scala
Set environment variable SCALA_HOME and add %SCALA_HOME%\bin to Path
Result: scala command works on command prompt
Unpack pre-built Spark
Set environment variable SPARK_HOME and add %SPARK_HOME%\bin to Path
Download winutils.exe
Place winutils.exe under C:/hadoop/bin
Set environment variable HADOOP_HOME and add %HADOOP_HOME%\bin to Path
I already have JDK 8 installed.
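For reference, a hedged cmd sketch of how those variables might be set (C:\spark-2.1.1-bin-hadoop2.7 and C:\hadoop match the paths in this question; the Scala folder is an assumption):
rem run once in a cmd window, then open a new prompt so the values are picked up
setx SCALA_HOME "C:\scala"
setx SPARK_HOME "C:\spark-2.1.1-bin-hadoop2.7"
setx HADOOP_HOME "C:\hadoop"
rem setx truncates long values, so append %SCALA_HOME%\bin, %SPARK_HOME%\bin and %HADOOP_HOME%\bin to Path via System Properties instead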
Now, the problem is, when I run spark-shell from C:/spark-2.1.1-bin-hadoop2.7/bin, I get this -
"C:\Program Files\Java\jdk1.8.0_131\bin\java" -cp "C:\spark-2.1.1-bin-hadoop2.7\conf\;C:\spark-2.1.1-bin-hadoop2.7\jars\*" "-Dscala.usejavacp=true" -Xmx1g org spark.repl.Main --name "Spark shell" spark-shell
Is it an error? Am I doing something wrong?
Thanks!

I had the same issue when trying to install Spark locally on Windows 7. Please make sure the paths below are correct and I am sure it will work for you.
Create JAVA_HOME variable: C:\Program Files\Java\jdk1.8.0_181\bin
Add the following part to your path: ;%JAVA_HOME%\bin
Create SPARK_HOME variable: C:\spark-2.3.0-bin-hadoop2.7\bin
Add the following part to your path: ;%SPARK_HOME%\bin
The most important part: the Hadoop path should include the bin folder that contains winutils.exe, i.e. C:\Hadoop\bin. Make sure winutils.exe is located inside this path.
Create HADOOP_HOME Variable: C:\Hadoop
Add the following part to your path: ;%HADOOP_HOME%\bin
Now you can open cmd, run spark-shell, and it will work.
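A quick sanity check in cmd before launching spark-shell (echo and where are standard Windows commands; the output should match the variables described above):
echo %JAVA_HOME%
echo %SPARK_HOME%
echo %HADOOP_HOME%
where winutils.exe
where spark-shell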

Related

Wrong JAVA_HOME in hadoop for spark-shell

I needed to install Hadoop in order to have Spark running on my WSL2 Ubuntu for school projects. I installed Hadoop 3.3.1 and Spark 3.2.1 following these two tutorials:
Hadoop Tutorial on Kontext.tech
Spark Tutorial on Kontext.tech
I correctly set up env variables in my .bashrc:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
export PATH=$PATH:$JAVA_HOME
export HADOOP_HOME=~/hadoop/hadoop-3.3.1
export SPARK_HOME=~/hadoop/spark-3.2.1
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:/usr/local/hadoop/bin/
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$SPARK_HOME/bin:$PATH
# Configure Spark to use Hadoop classpath
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
As well as in ~/hadoop/spark-3.2.1/conf/spark-env.sh.template:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/
However, when I launch spark-shell, I get this error:
/home/adrien/hadoop/spark-3.2.1/bin/spark-class: line 71: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin//bin/java: No such file or directory
/home/adrien/hadoop/spark-3.2.1/bin/spark-class: line 96: CMD: bad array subscript
There seems to be a mix-up in a redefinition of the $PATH variable, but I can't figure out where it is. Can you help me solve it, please? I don't know Hadoop or Spark well, and I have never had to install them before.
First, certain Spark packages come with Hadoop, so you don't need to download them separately. More specifically, Spark is built against Hadoop 3.2 for now, so using the latest Hadoop version might cause its own problems.
For your problem, JAVA_HOME should not end in /bin or /bin/java. Check the linked post again...
If you used apt install for java, you shouldn't really need to set JAVA_HOME or the PATH for Java, either, as the package manager will do this for you. Or you can use https://sdkman.io
Note: Java 11 is preferred
You also need to remove .template from any config files for them to actually be used... However, JAVA_HOME is automatically detected by spark-submit, so it's completely optional in spark-env.sh
Same applies for hadoop-env.sh
Also remove /usr/local/hadoop/bin/ from your PATH since it doesn't appear you've put anything in that location
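Putting that advice together, a minimal sketch of the corrected .bashrc lines, assuming the same install locations as in the question (the JDK root below is the standard directory for Ubuntu's openjdk-8 package):
# JAVA_HOME must be the JDK directory itself, not .../bin or .../bin/java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=~/hadoop/hadoop-3.3.1
export SPARK_HOME=~/hadoop/spark-3.2.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
# Let Spark reuse the Hadoop jars already installed (hadoop must be on PATH by this point)
export SPARK_DIST_CLASSPATH=$(hadoop classpath)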

Unable to launch spark using spark-shell

I am trying to set up SPARK2 on my cloudera cluster. For that, I have JDK 1.8 installed.
I have installed Scala 2.11.8 using the rpm file.
I have downloaded and extracted Spark version 2.2.0 in my home directory: /home/cloudera.
I made changes to the PATH variable in .bashrc.
But when I try to execute spark-shell from the home directory: /home/cloudera, it says no such file or directory which can be seen below:
[cloudera@quickstart ~]$ spark-shell
/home/cloudera/spark/bin/spark-class: line 71: /usr/java/jdk1.7.0_67-cloudera/bin/java: No such file or directory
[cloudera@quickstart ~]$
Could anyone let me know how I can fix the problem and configure it properly?
Java/JVM applications (and spark-shell in particular) use the java binary to launch themselves. Therefore they need to know where it is located, which is usually done via the JAVA_HOME environment variable.
In your case it is not set explicitly, so the value from Cloudera's default Java distribution is used (even though it points to an empty location).
You need to set JAVA_HOME to point at the correct Java installation directory for the user under which you want to launch spark-shell and other applications.
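For example, a hedged .bashrc sketch for that user (the JDK directory below is a hypothetical path; point it at whichever JDK 1.8 directory actually exists on the machine):
# use the directory that actually contains bin/java (hypothetical path shown)
export JAVA_HOME=/usr/java/jdk1.8.0_131
export PATH=$JAVA_HOME/bin:$PATH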

install SPARK 2.1.1 on Windows 10

I want to install Spark 2.1.1 on windows 10, I used a step by step guide mentioned in http://www.eaiesb.com/blogs/?p=334
I did all the steps, but when I come to the last part where I should run the spark-shell and I get the following:
C:\>"spark\spark-2.1.1-bin-hadoop2.7\bin\spark-shell"
I keep getting this
The system cannot find the path specified.
I am running this on Windows 10 machine on a virtual box.
I didn’t have another partition (D is used in the site) so I set it to C:\spark where everything is there (i.e. Hadoop, Spark, and the tmp folders).
UPDATE: I reinstalled Java and selected another folder with no spaces in its name. The message that I am getting now is "The system cannot find the path specified".
The environment variables are :
JAVA_HOME ——> C:\Java8\jdk1.8.0_131
HADOOP_HOME —–> C:\spark\Hadoop
SPARK_HOME —-> C:\spark\spark-2.1.1-bin-hadoop2.7
You have to set the JDK bin directory path, like this:
JAVA_HOME ——> C:\Program Files\Java\jdk1.8.0_131\bin
NOTE: If you set another path in the same env variable, then you have to separate the entries using a semicolon ;
Example:
JAVA_HOME ——> C:\Program Files\Java\jdk1.8.0_131\bin;C:\Program Files\nodejs
Well, before anything, thanks to @RïshïKêsh Kümar for your comments; I really appreciate that you tried to help me out.
The problem was with the Java path: "location_of_java_folder" didn't work for me, so I just created a folder named java8 and reinstalled Java in that location.
Oh, and don't forget to restart your machine for the changes to take effect.
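In other words, the working setup amounts to something like this in cmd (the values follow the variables listed above; the exact JDK folder under C:\Java8 is whatever the installer created):
setx JAVA_HOME "C:\Java8\jdk1.8.0_131"
setx HADOOP_HOME "C:\spark\Hadoop"
setx SPARK_HOME "C:\spark\spark-2.1.1-bin-hadoop2.7"
rem open a new command prompt (or reboot, as noted above) so the new values take effect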

Spark on windows 10 not working

I'm trying to get Spark working on Win10. When I try to run spark-shell I get this error:
'Spark\spark-2.0.0-bin-hadoop2.7\bin..\jars""\ is not recognized as an internal or external command, operable program or batch file.
Failed to find Spark jars directory. You need to build Spark before running this program.
I am using a pre-built Spark for Hadoop 2.7 or later. I have installed Java 8, Eclipse Neon, Python 2.7, Scala 2.11, and gotten winutils for Hadoop 2.7.1, and I still get this error.
When I downloaded Spark it came as a .tgz; when extracted there was another archive inside, so I extracted that too and then got all the bin folders and so on. I need to access spark-shell. Can anyone help?
EDIT:
Solution I ended up using:
1) Virtual box
2) Linux mint
I got the same error while building Spark. You can move the extracted folder to C:\
Refer this:
http://techgobi.blogspot.in/2016/08/configure-spark-on-windows-some-error.html
You are probably giving the wrong folder path to Spark bin.
Just open the command prompt and change directory to the bin inside the spark folder.
Type spark-shell to check.
Refer: Spark on win 10
"On Windows, I found that if it is installed in a directory that has a space in the path (C:\Program Files\Spark) the installation will fail. Move it to the root or another directory with no spaces."
OR
If you have installed Spark under “C:\Program Files (x86)..” replace 'Program Files (x86)' with Progra~2 in the PATH env variable and SPARK_HOME user variable.
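A hedged cmd sketch of both suggestions (the Spark folder name is taken from the error message in the question; adjust the drive and version to your setup):
rem option 1: run spark-shell from inside Spark's bin folder
cd C:\Spark\spark-2.0.0-bin-hadoop2.7\bin
spark-shell
rem option 2: if Spark lives under "Program Files (x86)", use the 8.3 short name instead
setx SPARK_HOME "C:\Progra~2\spark-2.0.0-bin-hadoop2.7"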

Why does spark-submit and spark-shell fail with "Failed to find Spark assembly JAR. You need to build Spark before running this program."?

I was trying to run spark-submit and I get
"Failed to find Spark assembly JAR.
You need to build Spark before running this program."
When I try to run spark-shell I get the same error.
What do I have to do in this situation?
On Windows, I found that if it is installed in a directory that has a space in the path (C:\Program Files\Spark) the installation will fail. Move it to the root or another directory with no spaces.
Your Spark package doesn't include compiled Spark code. That's why you got the error message from these scripts, spark-submit and spark-shell.
You have to download one of the pre-built versions from the "Choose a package type" section of the Spark download page.
Try running mvn -DskipTests clean package first to build Spark.
If your Spark binaries are in a folder whose name has spaces (for example, "Program Files (x86)"), it won't work. I changed it to "Program_Files", and then the spark-shell command works in cmd.
In my case, I installed Spark via pip3 install pyspark on macOS, and the error was caused by an incorrect SPARK_HOME variable. It works when I run a command like this:
PYSPARK_PYTHON=python3 SPARK_HOME=/usr/local/lib/python3.7/site-packages/pyspark python3 wordcount.py a.txt
Go to SPARK_HOME. Note that your SPARK_HOME variable should not include /bin at the end. Mind that when you're adding it to PATH, like this: export PATH=$SPARK_HOME/bin:$PATH
Run export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g" to allot more memory to maven.
Run ./build/mvn -DskipTests clean package and be patient. It took my system 1 hour and 17 minutes to finish this.
Run ./dev/make-distribution.sh --name custom-spark --pip. This is just for python/pyspark. You can add more flags for Hive, Kubernetes, etc.
Running pyspark or spark-shell will now start pyspark and spark respectively.
If you have downloaded the binary and are getting this exception,
then please check whether your SPARK_HOME path contains spaces, like "apache spark"/bin.
Just removing the spaces will make it work.
Just to add to @jurban1997's answer.
If you are running Windows, then make sure that the SPARK_HOME and SCALA_HOME environment variables are set up correctly. SPARK_HOME should be set so that {SPARK_HOME}\bin\spark-shell.cmd exists.
For Windows machine with the pre-build version as of today (21.01.2022):
In order to verify all the edge cases you may have and avoid tedious guesswork about what exactly is not configured properly:
Find spark-class2.cmd and open it with a text editor.
Inspect the arguments of the commands starting with call or if exists by typing the arguments in Command Prompt, like this:
Open Command Prompt. (For PowerShell you need to print the variable another way.)
Copy-paste %SPARK_HOME%\bin\ as is and press Enter (see the one-line check after these steps).
If you see something like bin\bin in the path displayed, then you have appended \bin to your %SPARK_HOME% environment variable.
Now you have to add the path to spark/bin to your PATH variable, or it will not find the spark-submit command.
Try out and correct every path variable that the script in this file uses, and you should be good to go.
After that, enter spark-submit ... You may now run into the missing Hadoop winutils.exe; for that problem, go get the tool and paste it where spark-submit.cmd is located.
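For instance, a one-line check in Command Prompt (in PowerShell, print $env:SPARK_HOME instead):
echo %SPARK_HOME%\bin\
rem if the output ends in \bin\bin\ then SPARK_HOME itself already ends in \bin and should be trimmed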
Spark Installation:
For a Windows machine:
Download spark-2.1.1-bin-hadoop2.7.tgz from this site https://spark.apache.org/downloads.html
Unzip it, put the Spark folder in the C:\ drive, and set the environment variables.
If you don't have Hadoop,
you need to create a Hadoop folder, create a bin folder inside it, and then copy and paste the winutils.exe file into it.
Download the winutils file from https://codeload.github.com/gvreddy1210/64bit/zip/master
and paste the winutils.exe file into the Hadoop\bin folder, then set the environment variable for C:\hadoop\bin;
create a tmp\hive folder in the C:\ drive and give full permission to this folder, like:
C:\Windows\system32>C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive
Open a command prompt, first run C:\hadoop\bin>winutils.exe, and then navigate to C:\spark\bin>
run spark-shell
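Condensed into cmd commands, the steps above look roughly like this (folder locations are the ones this answer assumes; the /tmp/hive path resolves to \tmp\hive on the current drive):
setx SPARK_HOME "C:\spark"
setx HADOOP_HOME "C:\hadoop"
rem winutils.exe must already be in C:\hadoop\bin
mkdir C:\tmp\hive
C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive
cd C:\spark\bin
spark-shell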
