spark-submit won't work anywhere in cmd - python-3.x

I am trying to run the spark-submit command from the drive/folder where my python script and dataset are, H:\spark_material. It just won't work!
But if I copy my python script into the folder C:\spark\bin, then it works.
I believe it has something to do with environment variables.
Here is my Path = %JAVA_HOME%\bin; %SPARK_HOME%\bin
Here are my variables:
HADOOP_HOME = C:\winutils
JAVA_HOME = C:\jdk
SPARK_HOME = C:\spark
Java is properly installed as I have tried typing "java -version" anywhere in CMD and it works!!

Open your cmd, type path, and check whether the Apache Spark path is specified all the way down to the bin folder.
If not, please fix your Path variable. A quick way to check this from Python is shown below.
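If you'd rather verify from inside Python, here is a minimal diagnostic sketch; it only assumes a standard Python 3 install and makes no assumptions about your Spark layout:
import os
import shutil

# shutil.which resolves a command the way the shell does;
# None means spark\bin is not actually on the PATH
print(shutil.which("spark-submit"))

# list every PATH entry mentioning spark, to spot typos or stray leading spaces
for entry in os.environ["PATH"].split(os.pathsep):
    if "spark" in entry.lower():
        print(repr(entry))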

It was/is a mystery - I re-installed everything on my machine one by one, except the operating system, and in my opinion it was an issue with the Python distribution. When I reinstalled Canopy (Enthought), the spark-submit command started to work. I still don't know why it happened, as even my previous version of Canopy (Python) had been working properly.
Thank you everyone for your response and contribution. Learnt a lot from you guys.

Related

Can't find Spark Submit when using Spark shell

I installed spark and am trying to run a file 'train.py' in the directory, '/home/xxx/Desktop/BD_Project', in shell using the following command:
$SPARK_HOME/bin/spark-submit /home/xxx/Desktop/BD_Project/train.py > output.txt
My teammates who used the same page that I did for spark installations have no problem when running this. However, it throws up the following error for me:
bash: /bin/spark-submit: No such file or directory
You need to set SPARK_HOME to the directory where Spark is installed, typically something like /usr/local/spark; spark-submit then lives at $SPARK_HOME/bin/spark-submit. Note that SPARK_HOME should point to the installation root, not to the binary itself.
Before you set it, make sure where spark is installed by going to the directory.
You can set it like this before running your command:
export SPARK_HOME=/usr/local/spark
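To double-check the value before running the command, a small Python sketch like this can confirm it; the only assumption is that a correct SPARK_HOME contains bin/spark-submit:
import os

spark_home = os.environ.get("SPARK_HOME")
print(spark_home)  # None means the variable was never exported to this shell's environment
if spark_home:
    # a correctly set SPARK_HOME contains the bin/spark-submit script
    print(os.path.exists(os.path.join(spark_home, "bin", "spark-submit")))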
If you are a homebrew user, setting your SPARK_HOME to
/opt/homebrew/Cellar/apache-spark/3.3.1/libexec
would solve it. Sorry for responding so late. Hoping this helps someone with this odd error.

Can run pyspark.cmd but not pyspark from command prompt

I am trying to get pyspark set up for windows. I have java, python, Hadoop, and spark all set up, and the environment variables are, I believe, set up as I've been instructed elsewhere. In fact, I am able to run this from the command prompt:
pyspark.cmd
And it will load up the pyspark interpreter. However, I should be able to run pyspark unqualified (without the .cmd extension), and python importing won't work otherwise. It does not matter whether I navigate directly to spark\bin or not, because I already have spark\bin added to the PATH.
.cmd is listed in my PATHEXT variable, so I don't get why the pyspark command by itself doesn't work.
Thanks for any help.
While I still don't know exactly why, I think the issue somehow stemmed from how I unzipped the spark tar file. Within the spark\bin folder, I was unable to run any .cmd programs without including the .cmd extension, yet I could do that in basically any other folder. I redid the unzip and the problem no longer existed.
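A quick way to see how Windows resolves the bare command is to ask Python's shutil.which, which honours both PATH and PATHEXT; this is only a diagnostic sketch, not a fix:
import os
import shutil

print(os.environ.get("PATHEXT"))    # should include .CMD
print(shutil.which("pyspark"))      # None here reproduces the problem
print(shutil.which("pyspark.cmd"))  # should print the full path either way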

The SPARK_HOME env variable is set but Jupyter Notebook doesn't see it. (Windows)

I'm on Windows 10. I was trying to get Spark up and running in a Jupyter Notebook alongside Python 3.5. I installed a pre-built version of Spark and set the SPARK_HOME environment variable. I installed findspark and ran the code:
import findspark
findspark.init()
I receive a Value error:
ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).
However, the SPARK_HOME variable is set; I verified it in the list of environment variables on my system.
Has anyone encountered this issue or would know how to fix this? I only found an old discussion in which someone had set SPARK_HOME to the wrong folder but I don't think it's my case.
I had the same problem and wasted a lot of time. I found two solutions:
Copy the downloaded spark folder somewhere into the C: directory and pass the path as below:
import findspark
findspark.init('C:/spark')
Or use findspark's function to find the spark folder automatically:
import findspark
findspark.find()
The environment variables get updated only after a system reboot, so it works after restarting your system.
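Combining the two ideas above, a short sketch (assuming a standard findspark install) that first locates Spark and then initializes it:
import findspark

location = findspark.find()  # returns the Spark home findspark detected
print(location)
findspark.init(location)     # same effect as passing an explicit path like 'C:/spark'

import pyspark               # should now import cleanly
print(pyspark.__version__)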
I had the same problem and solved it by installing "vagrant" and "virtual box". (Note, though, that I use Mac OS and Python 2.7.11.)
Take a look at this tutorial, which is for the Harvard CS109 course :
https://github.com/cs109/2015lab8/blob/master/installing_vagrant.pdf
After "vagrant reload" on the terminal , I am able to run my codes without errors.
I had the same problem when installing spark using pip install pyspark findspark in a conda environment.
The solution was to do this:
export SPARK_HOME=/Users/pete/miniconda3/envs/cenv3/lib/python3.6/site-packages/pyspark/
jupyter notebook
You'll have to substitute the name of your conda environment for cenv3 in the command above.
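The same fix can be applied from inside the notebook, avoiding the shell export entirely; this sketch assumes the pip-installed pyspark layout from the answer above, with cenv3 again standing in for your own environment name:
import os

# example path from the answer above; substitute your own conda environment name
os.environ["SPARK_HOME"] = "/Users/pete/miniconda3/envs/cenv3/lib/python3.6/site-packages/pyspark"

import findspark
findspark.init()  # now finds Spark via the variable set above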
Restarting the system after setting up the environmental variables worked for me.
I had the same problem and solved it by closing cmd and then opening it again. I forgot that after editing an env variable on Windows you have to restart cmd.
I got the same error. Initially, I had stored my Spark folder in the Documents directory. Later, when I moved it to the Desktop, it suddenly started recognizing all the system variables and it ran findspark.init() without any error.
Try it out once.
This error may occur if you don't set the environment variables in the .bashrc file. Set your python environment variables as follows:
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$PATH
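Note that the py4j version in that zip name (0.10.8.1 here) differs between Spark releases, so a hardcoded path can silently break after an upgrade. A small sketch to discover the right file instead of hardcoding it:
import glob
import os

spark_home = os.environ["SPARK_HOME"]
# the py4j zip shipped with Spark carries its version in the file name; glob for it
matches = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
print(matches)  # use this path in PYTHONPATH instead of a hardcoded version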
The simplest way I found to use spark with jupyter notebook is:
1- download spark
2- unzip it to the desired location
3- open jupyter notebook in the usual way, nothing special
4- now run the below code:
import findspark
findspark.init("location of spark folder")
# in my case it is like
import findspark
findspark.init("C:\\Users\\raj24\\OneDrive\\Desktop\\spark-3.0.1-bin-hadoop2.7")

Can't get spark to start

I have successfully installed and run apache spark in the past on my machine. Today I returned to it and tried to run it using bin/spark-shell in the spark directory (the bin folder exists in this directory), but I am getting:
bin is not recognized as an internal or external command,
operable program or batch file.
It's running in the Windows 10 cmd shell, in case this is helpful. What could cause this?
I believe we need more info to be able to answer your question.
Using './' specifies a path starting in the root of your working directory (Bash or PowerShell).
Are you running this in the cmd shell/powershell/bash shell?
What directory are you working in when trying to execute your command?
Is there a bin folder in your current directory? (ls command or dir command)
JAVA_HOME was outdated... I had updated java without updating the path! That was the problem.
Check the version of java installed and the location the environment variable JAVA_HOME is pointing to.
In my case JAVA_HOME = C:\Program Files\Java\jdk1.7.0_79 (an old version).
The cause of this issue was that I installed a new version of the JDK and removed the previous installation, but JAVA_HOME was still pointing to the old, now-missing location.
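A small Python check along those lines, to confirm JAVA_HOME still points at a real JDK; this is a diagnostic sketch only:
import os
import subprocess

java_home = os.environ.get("JAVA_HOME")
print(java_home)
if java_home:
    java = os.path.join(java_home, "bin", "java.exe" if os.name == "nt" else "java")
    print(os.path.exists(java))         # False means JAVA_HOME points at a removed JDK
    subprocess.run([java, "-version"])  # prints the version actually being used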

Why does spark-submit and spark-shell fail with "Failed to find Spark assembly JAR. You need to build Spark before running this program."?

I was trying to run spark-submit and I get
"Failed to find Spark assembly JAR.
You need to build Spark before running this program."
When I try to run spark-shell I get the same error.
What do I have to do in this situation?
On Windows, I found that if it is installed in a directory that has a space in the path (C:\Program Files\Spark) the installation will fail. Move it to the root or another directory with no spaces.
Your Spark package doesn't include compiled Spark code. That's why you got the error message from these scripts spark-submit and spark-shell.
You have to download one of the pre-built versions from the "Choose a package type" section on the Spark download page.
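To tell the two cases apart, this sketch checks for the jars directory that pre-built Spark 2.x+ distributions ship with, which a plain source download lacks until built:
import os

spark_home = os.environ.get("SPARK_HOME", "")
# pre-built distributions ship compiled jars here; a source checkout does not until built
print(os.path.isdir(os.path.join(spark_home, "jars")))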
Try running mvn -DskipTests clean package first to build Spark.
If your spark binaries are in a folder whose name contains spaces (for example, "Program Files (x86)"), it won't work. I changed it to "Program_Files", and then the spark-shell command worked in cmd.
In my case, I installed spark via pip3 install pyspark on a macOS system, and the error was caused by an incorrect SPARK_HOME variable. It works when I run a command like the one below:
PYSPARK_PYTHON=python3 SPARK_HOME=/usr/local/lib/python3.7/site-packages/pyspark python3 wordcount.py a.txt
Go to SPARK_HOME. Note that your SPARK_HOME variable should not include /bin at the end. Mention it when you're adding it to PATH like this: export PATH=$SPARK_HOME/bin:$PATH
Run export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g" to allot more memory to maven.
Run ./build/mvn -DskipTests clean package and be patient. It took my system 1 hour and 17 minutes to finish this.
Run ./dev/make-distribution.sh --name custom-spark --pip. This is just for python/pyspark. You can add more flags for Hive, Kubernetes, etc.
Running pyspark or spark-shell will now start pyspark and spark respectively.
If you have downloaded the binary and are getting this exception,
then please check whether your Spark_home path contains spaces, like "apache spark"/bin.
Just removing the spaces will make it work.
Just to add to #jurban1997's answer.
If you are running windows then make sure that the SPARK_HOME and SCALA_HOME environment variables are set up right. SPARK_HOME should point to the installation root, so that {SPARK_HOME}\bin\spark-shell.cmd exists.
For a Windows machine with the pre-built version as of today (21.01.2022):
In order to verify all the edge cases you may have and avoid tedious guesswork about what exactly is not configured properly:
Find spark-class2.cmd and open it with a text editor.
Inspect the arguments of commands starting with call or if exists by typing the arguments in Command Prompt like this:
Open Command Prompt. (For PowerShell you need to print the variable another way.)
Copy-paste %SPARK_HOME%\bin\ as is and press enter.
If you see something like bin\bin in the path displayed, then you have appended \bin in your environment variable %SPARK_HOME%.
Now you have to add the path to spark\bin to your PATH variable, or it will not find the spark-submit command.
Try out and correct every path variable that the script in this file uses and you should be good to go.
After that, enter spark-submit ... You may now encounter the missing hadoop winutils.exe, for which you can go get the tool and paste it where spark-submit.cmd is located.
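The bin\bin symptom described above can also be caught programmatically; a small sketch assuming only that SPARK_HOME is set:
import os

spark_home = os.environ.get("SPARK_HOME", "")
# SPARK_HOME must not end in \bin, or the scripts build paths like ...\bin\bin
print(spark_home.rstrip("\\/").lower().endswith("bin"))
print(os.path.isfile(os.path.join(spark_home, "bin", "spark-submit.cmd")))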
Spark Installation:
For a Windows machine:
Download spark-2.1.1-bin-hadoop2.7.tgz from this site: https://spark.apache.org/downloads.html
Unzip and paste your spark folder into the C:\ drive and set the environment variable.
If you don't have Hadoop,
you need to create a Hadoop folder, create a Bin folder inside it, and then copy and paste the winutils.exe file into it.
Download the winutils file from https://codeload.github.com/gvreddy1210/64bit/zip/master
and paste the winutils.exe file into the Hadoop\bin folder, then set the environment variable for C:\hadoop\bin.
Create a temp\hive folder in the C:\ drive and give full permission to this folder, like:
C:\Windows\system32>C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive
Open a command prompt, first run C:\hadoop\bin>winutils.exe, then navigate to C:\spark\bin and
run spark-shell
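After those steps, a short end-to-end test (a sketch that assumes the C:\spark layout above) exercises the shell, winutils, and environment variables together:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
# a tiny job that runs through the whole stack; prints 5 on a healthy install
print(spark.range(5).count())
spark.stop()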
