PySpark is not starting from Windows Command Prompt

I am trying to start pyspark from the Windows command prompt, but so far no luck; I am getting an error message. I have gone through almost every corner of Stack Overflow and web search results but have not been able to fix it.
So far I have followed the steps below:
Set JAVA_HOME, SPARK_HOME and HADOOP_HOME in the system variables.
Updated the PATH variable accordingly (typical values are sketched at the end of this question).
I have resolved all the issues related to spaces in paths. Despite all this, I am still not able to start spark-shell or pyspark from the command prompt.
I am using Windows 10 Home edition.
Am I missing out something?
Note: I have installed Java, Scala and Python, and they all run fine from the command prompt.
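For reference, a sketch of what those variables typically look like in cmd (the paths below are illustrative assumptions, not the actual values from this setup):

set JAVA_HOME=C:\Java\jdk1.8.0_202
set SPARK_HOME=C:\spark
set HADOOP_HOME=C:\hadoop
set PATH=%PATH%;%JAVA_HOME%\bin;%SPARK_HOME%\bin;%HADOOP_HOME%\bin

set only affects the current session; use the System Properties dialog or setx for persistent variables.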

Did you enable access to the default scratch directory for Hive? Make sure the directory C:\tmp\hive exists; if it doesn't, create it.
Next, you need to grant permissions on it using winutils.exe. Navigate to where you put that .exe file, then run the permission command:
cd c:\hadoop\bin
winutils.exe chmod -R 777 C:\tmp\hive
Once you have completed this, try to launch PySpark again!
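To sanity-check the result, you can list the permissions with winutils as well (the ls subcommand is available in common winutils builds; verify on yours):

c:\hadoop\bin\winutils.exe ls C:\tmp\hive

You should see rwxrwxrwx-style permissions in the output if the chmod took effect.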

Related

When I run chmod with C:\hadoop\bin\winutils.exe, it says "The application was unable to start correctly"

I'm trying to run the command below:
C:\hadoop\bin\winutils.exe chmod -R 777 C:\SparkProject
But it gives me an error saying:
The application was unable to start correctly (0xc000007b)
I have kept winutils.exe in \hadoop\bin and have also set up the HADOOP_HOME environment variable.
If I run a Spark program from IntelliJ IDEA that writes to the local file system, it fails, but I can see that zero-byte files are created in that folder (a .crc file).
I'm using Windows 10.
I tried all the winutils builds available, as I was not sure which version I needed. (The 0xc000007b error typically indicates a 32-bit/64-bit mismatch, so a winutils.exe built for the wrong architecture will fail this way.)
Finally I downloaded the latest one from GitHub, for hadoop-3.3.0.
Link: https://github.com/kontext-tech/winutils/blob/master/hadoop-3.3.0/bin/winutils.exe
And it's working now. I'm able to grant permissions via winutils.exe as well as write to the local file system.
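As a quick sanity check before wiring a downloaded winutils.exe into Spark, you can run it directly with no arguments; a binary that matches your architecture should print its usage text instead of the 0xc000007b dialog (behavior observed on common builds, worth verifying on yours):

C:\hadoop\bin\winutils.exe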

What is the use of winutils.exe?

I am running Apache Spark on Windows (locally) using IntelliJ.
I chose enableHiveSupport while creating the Spark session object.
I converted a DataFrame into a temp view and ran some queries.
Initially I got an error that tmp/hive does not exist, so I created one on the C: drive.
Then I got an error that tmp/hive is not writable.
So I changed the permissions in the file properties, but I still got the same error.
After researching, I found the solution: use winutils.exe to change the permissions.
So what exactly is winutils.exe? Where does Spark use it? Also, tmp/hive/username was empty after I ran the application.
Thank you
I advise you to run on Linux, but if you are using Windows, Spark accesses a pseudo-Hadoop layer there; running cmd> winutils.exe chmod -R 777 D:\tmp\hive allows you to read from and write to this pseudo-Hadoop.
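To make the moving parts concrete, here is a minimal PySpark sketch of where winutils fits, assuming pyspark is installed and winutils.exe sits in C:\hadoop\bin (both paths are assumptions for illustration):

import os

# On Windows, Hadoop's shell layer (which Spark uses for local file
# permissions, including the Hive scratch directory) locates
# %HADOOP_HOME%\bin\winutils.exe, so set this before the JVM starts.
os.environ["HADOOP_HOME"] = r"C:\hadoop"

from pyspark.sql import SparkSession

# enableHiveSupport() is what brings the tmp/hive scratch directory,
# and therefore the winutils-managed permissions, into play.
spark = (SparkSession.builder
         .appName("winutils-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.range(5).createOrReplaceTempView("t")
spark.sql("SELECT COUNT(*) FROM t").show()

If HADOOP_HOME is missing or tmp/hive is not writable, the session either warns about winutils or fails on the scratch directory, which is exactly the sequence of errors described above.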

Why am I getting "permission denied" while executing the following command? I am trying to run this on HDP

I installed Jupyter on Hortonworks Sandbox and wanted to run jupyter using port forwarding.
I followed these steps: https://hortonworks.com/hadoop-tutorial/using-ipython-notebook-with-apache-spark/
I am not able to start Jupyter using
./start_ipython_notebook.sh
I tried /root/start_ipython_notebook.sh but I am getting permission denied.
Run ls -la in that directory; what does it output?
You will probably see that the file you are trying to run doesn't allow your amar user to execute it.
To solve this, either install Jupyter in another directory where you have the correct user rights, or use chown and chmod to give yourself the correct rights over those files in order to run them (see the example below).
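A hedged example of the second route, using the file name from the question (adjust the path and ownership to your actual user):

ls -la /root/start_ipython_notebook.sh
chmod +x /root/start_ipython_notebook.sh

Also note that /root itself is normally not accessible to non-root users, so even an executable script inside it can fail with "permission denied"; moving the script into your own home directory is often the simpler fix.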

Node.js Windows command prompt: change the C:\ path to the test directory instead

I'm trying to figure out if there is a way to change the Node.js command prompt's default path from C:\users...> (the default when the prompt is launched), or C:\Windows\System (if launched with administrator privileges), to the location of the folder where I'm working.
Normally I have been doing C:\users..> cd C:\xampp\htdocs..... to navigate to the test folder and run tests, although once the command prompt is closed it reverts back to C:\users...>.
To achieve what I want, I came across using Z:>C:\xampp\htdocs\projects...., but this returns Access Denied with or without administrator privileges. Even if I try C:>C:\xampp\htdocs\projects.... I still get Access Denied for some unknown reason. To be honest, I don't know what Z:> or C:> should result in.
Is it possible to change the default prompt path to the path of the directory I am working in, so that every time the command prompt is launched it goes to that directory? In this case, C:\xampp\htdocs\projects.... instead of C:\users...>
This seems like a general Windows CMD question. Simply change the start-up directory for CMD; see this SO post.
Once you're in that directory, you should be able to run the node command as normal.
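One concrete way to do that is a shortcut whose target launches cmd and changes directory immediately (the path is the one from the question):

cmd.exe /k "cd /d C:\xampp\htdocs\projects"

The /k flag keeps the prompt open after the command runs, and cd /d switches the drive as well as the directory.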
I had the same question today, 4/11/22, and DuckDuckGo provided this page as the number one result for my query. Since the question appears to be under-answered, I will try for those who show up later.
Look inside your default Node.js installation folder for a file called nodevars.bat. Here is my path:
C:\Program Files\nodejs\nodevars.bat
Open this and look towards the bottom; the line I needed was on the very bottom. Here is the line from the git master:
if "%CD%\"=="%~dp0" cd /d "%HOMEDRIVE%%HOMEPATH%"
I changed mine to
if "%CD%\"=="%~dp0" cd /d "C:\Users\David\Desktop\work\J\math"
And now I am happier.

Why does spark-submit and spark-shell fail with "Failed to find Spark assembly JAR. You need to build Spark before running this program."?

I was trying to run spark-submit and I got:
"Failed to find Spark assembly JAR.
You need to build Spark before running this program."
When I try to run spark-shell I get the same error.
What do I have to do in this situation?
On Windows, I found that if Spark is installed in a directory that has a space in the path (C:\Program Files\Spark), it will fail. Move it to the root or another directory with no spaces.
Your Spark package doesn't include compiled Spark code; that's why you get this error message from the spark-submit and spark-shell scripts.
You have to download one of the pre-built versions from the "Choose a package type" section of the Spark download page.
Alternatively, try running mvn -DskipTests clean package first to build Spark.
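A quick way to tell which kind of package you have (the directory name below is an illustrative 2.x-era pre-built download; on Windows use dir instead of ls):

ls spark-2.4.8-bin-hadoop2.7/jars

A pre-built package ships a jars/ directory full of compiled jars (very old 1.x releases used lib/spark-assembly-*.jar instead); a source package has neither until you build it.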
If your Spark binaries are in a folder whose name contains spaces (for example, "Program Files (x86)"), it won't work. I changed mine to "Program_Files", and then the spark-shell command worked in cmd.
In my case, I installed Spark via pip3 install pyspark on macOS, and the error was caused by an incorrect SPARK_HOME variable. It works when I run a command like the one below:
PYSPARK_PYTHON=python3 SPARK_HOME=/usr/local/lib/python3.7/site-packages/pyspark python3 wordcount.py a.txt
Go to SPARK_HOME. Note that your SPARK_HOME variable should not include /bin at the end. Keep that in mind when you're adding it to PATH, like this: export PATH=$SPARK_HOME/bin:$PATH
Run export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g" to allot more memory to Maven.
Run ./build/mvn -DskipTests clean package and be patient. It took my system 1 hour and 17 minutes to finish this.
Run ./dev/make-distribution.sh --name custom-spark --pip. This is just for python/pyspark. You can add more flags for Hive, Kubernetes, etc.
Running pyspark or spark-shell will now start PySpark and Spark respectively (a quick version check follows below).
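To confirm that the freshly built distribution is the one being picked up, print its version with the standard flag:

spark-submit --version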
If you have downloaded a binary and are getting this exception, then check whether your SPARK_HOME path contains spaces, like "apache spark"/bin. Just removing the spaces will make it work.
Just to add to @jurban1997's answer:
If you are running Windows, make sure that the SPARK_HOME and SCALA_HOME environment variables are set up correctly. SPARK_HOME should be set so that {SPARK_HOME}\bin\spark-shell.cmd exists.
For a Windows machine with the pre-built version as of today (21.01.2022):
In order to verify all the edge cases you may have, and avoid tedious guesswork about what exactly is not configured properly:
Find spark-class2.cmd and open it in a text editor.
Inspect the arguments of commands starting with call or if exists by typing the arguments into the Command Prompt, like this:
Open Command Prompt. (For PowerShell you need to print the variable another way; see the example after these steps.)
Copy-paste %SPARK_HOME%\bin\ as is and press Enter.
If you now see something like bin\bin in the displayed path, then you have appended /bin to your %SPARK_HOME% environment variable.
Now you have to add the path to spark/bin to your PATH variable, or it will not find the spark-submit command.
Try out and correct every path variable that the script in this file uses, and you should be good to go.
After that, enter spark-submit .... You may now encounter the missing Hadoop winutils.exe problem, for which you can go get the tool and paste it where spark-submit.cmd is located.
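For the variable-printing step mentioned above, the two shells differ:

echo %SPARK_HOME%
echo $env:SPARK_HOME

The first form is for Command Prompt; the second is for PowerShell, which does not expand %VAR% syntax.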
Spark installation:
For a Windows machine:
Download spark-2.1.1-bin-hadoop2.7.tgz from https://spark.apache.org/downloads.html
Unzip it, place the spark folder on the C:\ drive, and set the environment variable (a consolidated example of the variables follows these steps).
If you don't have Hadoop, you need to create a Hadoop folder, create a bin folder inside it, and then copy the winutils.exe file into it:
download the winutils file from https://codeload.github.com/gvreddy1210/64bit/zip/master
and paste winutils.exe into the Hadoop\bin folder, then set the environment variable for c:\hadoop\bin;
create a tmp\hive folder on the C:\ drive and give full permission to this folder, like:
C:\Windows\system32>C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive
open a command prompt, first run C:\hadoop\bin> winutils.exe, and then navigate to C:\spark\bin>
run spark-shell
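A consolidated sketch of the environment variables for the layout described in these steps (C:\spark and C:\hadoop are the locations assumed above; setx stores values persistently and they take effect in newly opened Command Prompt windows):

setx SPARK_HOME C:\spark
setx HADOOP_HOME C:\hadoop
setx PATH "%PATH%;C:\spark\bin;C:\hadoop\bin"

Be careful with the PATH line: setx writes the currently expanded PATH back as your user PATH, so verify the result afterwards with echo %PATH% in a new window.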
