What is the use of winutils.exe? - apache-spark

I am running Apache Spark on Windows (locally) using IntelliJ.
I chose enableHiveSupport while creating the SparkSession object.
I registered a DataFrame as a temp view and ran some queries.
Initially I got an error that tmp/hive does not exist, so I created one on the C: drive.
Then I got an error that tmp/hive is not writable.
So I changed the permissions in the file properties, but I still got the same error.
After researching I found the solution: use winutils.exe to change the permissions.
So what exactly is winutils.exe? Where is it used by Spark? The tmp/hive/username directory was empty after I ran the application.
Thank you

I advise you to run on Linux, but if you are using Windows, here is what is going on: winutils.exe is a Windows binary shipped with Hadoop that implements the POSIX-style file-system operations (chmod, chown, ls, and so on) that Hadoop expects from a Unix environment. Spark's Hive support goes through the Hadoop APIs to check permissions on the Hive scratch directory, which is why changing permissions through Windows file properties does not help; the check only sees the permission bits that winutils.exe reports. Running cmd> winutils.exe chmod -R 777 D:\tmp\hive allows you to read and write to this pseudo-Hadoop scratch directory.
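For context, a minimal setup sequence on Windows might look like the following (a sketch; it assumes winutils.exe is unpacked under C:\hadoop\bin and uses C:\tmp\hive as the scratch directory, so adjust drive letters and paths to your layout):

rem Point Hadoop's native calls at the folder whose bin\ contains winutils.exe
set HADOOP_HOME=C:\hadoop
rem Create the Hive scratch directory if it does not exist yet
mkdir C:\tmp\hive
rem Grant read/write/execute recursively, then verify with winutils' own ls
C:\hadoop\bin\winutils.exe chmod -R 777 C:\tmp\hive
C:\hadoop\bin\winutils.exe ls C:\tmp\hive

If the ls output shows drwxrwxrwx, Spark's Hive permission check should pass.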

Related

Can't find Spark Submit when using Spark shell

I installed Spark and am trying to run the file 'train.py' in the directory '/home/xxx/Desktop/BD_Project' from the shell using the following command:
$SPARK_HOME/bin/spark-submit /home/xxx/Desktop/BD_Project/train.py > output.txt
My teammates, who used the same page that I did for their Spark installations, have no problem running this. However, it throws up the following error for me:
bash: /bin/spark-submit: No such file or directory
The error bash: /bin/spark-submit: No such file or directory means that $SPARK_HOME is unset, so it expands to an empty string and the shell looks for /bin/spark-submit. You need to set SPARK_HOME to the directory where Spark is installed (the installation root, not the spark-submit binary itself), typically something like /usr/local/spark.
Before you set it, check where Spark is actually installed by going to the directory.
You can set it like this before running your command:
export SPARK_HOME=/usr/local/spark
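As a sanity check after setting the variable, something like the following should work (a sketch; /usr/local/spark is an assumed install root, substitute your own):

export SPARK_HOME=/usr/local/spark        # root of the unpacked Spark distribution
export PATH="$SPARK_HOME/bin:$PATH"       # so spark-submit resolves without a full path
spark-submit --version                    # should print the Spark version banner
$SPARK_HOME/bin/spark-submit /home/xxx/Desktop/BD_Project/train.py > output.txt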
If you are a Homebrew user, setting your SPARK_HOME to
/opt/homebrew/Cellar/apache-spark/3.3.1/libexec
should solve it. Sorry for responding so late; hoping this helps someone with this odd error.

When I run chmod with C:\hadoop\bin\winutils.exe, it says "The application was unable to start correctly"

I’m trying to run the below command,
C:\hadoop\bin\winutils.exe chmod -R 777 C:\SparkProject
But it gives me an error saying
The application was unable to start correctly (0xc000007b)
I have placed winutils.exe in \hadoop\bin and also set up the HADOOP_HOME environment variable.
If I run the Spark program that writes to the local file system from IntelliJ IDEA, it fails, but I can see that zero-byte files are created in that folder (a .crc file).
I'm using Windows 10.
I had tried all the winutils builds available, as I was not sure which version I needed.
Finally I downloaded a recent one from GitHub for hadoop-3.3.0.
link: https://github.com/kontext-tech/winutils/blob/master/hadoop-3.3.0/bin/winutils.exe
And it's working now: I'm able to set permissions via winutils.exe as well as write to the local file system. (Error 0xc000007b typically indicates a 32-bit/64-bit mismatch or missing runtime DLLs, so the earlier binaries were most likely incompatible builds.)
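As a quick compatibility check before wiring anything into Spark, you can invoke the binary directly; a working build prints its usage text instead of failing to start (a sketch, assuming the C:\hadoop\bin layout above):

cd C:\hadoop\bin
rem A compatible build prints its usage help; an incompatible one fails with 0xc000007b
winutils.exe
rem Confirm HADOOP_HOME points at the parent of bin\, not at bin\ itself
echo %HADOOP_HOME%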

PySpark is not starting from Windows Command Prompt

I am trying to start pyspark from the Windows command prompt, but so far no luck. I am getting an error message as shown below.
I have gone through almost every corner of Stack Overflow and net searches but could not fix this.
So far I have followed the steps mentioned below:
set JAVA_HOME, SPARK_HOME and HADOOP_HOME in the System Variables.
Updated the PATH variable as shown below.
I have managed all the 'space'-related issues. Despite all this, I am still not able to start spark-shell or pyspark from the command prompt.
I am using Windows 10 Home edition.
Am I missing something?
Note: I have installed Java, Scala and Python, and from the command prompt they are running fine.
Did you enable access to the default scratch directory for Hive? Make sure the directory C:\tmp\hive exists; if it doesn't exist, create it.
Next, you need to grant permissions on it using winutils.exe. Navigate back to where you put this .exe file, then run the permission command:
cd c:\hadoop\bin
winutils.exe chmod -R 777 C:\tmp\hive
Once you have completed this, try again to launch PySpark!
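For reference, the environment setup from a Windows command prompt might look like this (all paths are examples, not your actual locations; installation directories containing spaces, such as C:\Program Files, are a common source of trouble):

set JAVA_HOME=C:\Java\jdk1.8.0_281
set SPARK_HOME=C:\spark\spark-3.3.0-bin-hadoop3
set HADOOP_HOME=C:\hadoop
set PATH=%JAVA_HOME%\bin;%SPARK_HOME%\bin;%HADOOP_HOME%\bin;%PATH%
rem Launch from the same prompt once the variables are in place
pyspark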

Why do I get a "spark-shell: Permission denied" error in Spark setup?

I am new to Apache Spark. I am trying to set up Apache Spark on my MacBook.
I downloaded the file "spark-2.4.0-bin-hadoop2.7" from the Apache Spark official website.
When I try to run ./bin/spark-shell or ./bin/pyspark I get a Permission denied error.
I just want to run Spark on my local machine.
I also tried giving permission to all folders but it does not help.
Why do I get this error?
This should solve your problem:
chmod +x /Users/apple/spark-2.4.0-bin-hadoop2.7/bin/*
Then you can try executing bin/pyspark (the Spark shell in Python) or bin/spark-shell (the Spark shell in Scala).
I solved this issue by adding the /libexec folder to the Spark home path.
Set $SPARK_HOME to
/usr/local/Cellar/apache-spark/<your_spark_version>/libexec
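Putting both answers together, a minimal sequence might be (a sketch; the directory name matches the download above, so adjust it to wherever you unpacked Spark):

cd /Users/apple/spark-2.4.0-bin-hadoop2.7
chmod +x bin/*        # restore the executable bits lost during download/extraction
./bin/spark-shell     # Scala shell
./bin/pyspark         # Python shell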

Will copying the Apache Spark installation folder to another system work properly?

I am using Apache Spark. It is working properly in a cluster with 3 machines. Now I want to install Spark on another 3 machines.
What I did: I tried to just copy the Spark folder which I am currently using.
Problem: ./bin/spark-shell and all other Spark commands are not working and throw the error 'No Such Command'.
Questions:
1. Why is it not working?
2. Is it possible to build a Spark installation on 1 machine and then distribute it from that installation to the other machines?
I am using Ubuntu.
We looked into the problem and found that the Spark installation folder which was copied had the .sh files, but they were not executable. We just made the files executable and now Spark is running.
Yes, it would work, but you should ensure that you have set all the environment variables required for Spark to work,
like SPARK_HOME, WEBUI_PORT, etc.
Also use the Hadoop-integrated Spark build, which comes with the supported versions of Hadoop.
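A sketch of copying an installation while keeping the executable bits intact (hostnames and paths below are placeholders):

# rsync -a preserves permissions, so bin/* and sbin/* stay executable after the copy
rsync -a /opt/spark/ user@new-node:/opt/spark/
# If the files were copied some other way and lost their mode bits, restore them:
ssh user@new-node 'chmod +x /opt/spark/bin/* /opt/spark/sbin/*'
# Each machine also needs its environment set, e.g. in ~/.bashrc:
#   export SPARK_HOME=/opt/spark
#   export PATH="$SPARK_HOME/bin:$PATH"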
