WinUtils & Spark 3.1.1 Failures - apache-spark

I'm trying to create a Windows 10 developer VM with a Conda environment and PySpark, but I'm seeing constant problems getting Spark and winutils to work.
Environment:
Windows 10 19042 (Fully Patched)
MiniConda
PySpark 3.1.1
Java 11.0.11
I have created C:\Hadoop\bin and downloaded winutils from here https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin (I've also tried 3.2.0).
HADOOP_HOME is C:\Hadoop and Path has %HADOOP_HOME%\bin in it. JAVA_HOME is correct.
This code works:
location = 'C:/myfiles/file.csv'
df = spark.read.format("csv").options(header=True).load(location)
This code fails:
location = 'C:/myfiles/'
df = spark.read.format("csv").options(header=True).load(location)
Error message:
An error occurred while calling o35.load.
: java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
Winutils is definitely being picked up, because if I delete it the first example above breaks as well, in the expected way.
It seems as though winutils is incompatible with Spark 3.1.1, and specifically with loading a folder of files? I find that hard to believe.
Bizarrely, though, I have another machine with PySpark 3.1.1 and this version of winutils and it works! Same Java version as well. I'm lost - I've even copied the winutils files from the working machine to this one and it still didn't work.
Can anyone guide me on what that error means at least to help me understand where the issue could be?

In my case, I was able to resolve the issue by adding the hadoop.dll from the same link as provided in the question above (here) to the same location as winutils.exe. Just make sure there is no version mismatch.
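As a quick sanity check (a minimal sketch of my own, assuming the standard %HADOOP_HOME%\bin layout), you can verify from Python that both binaries are in place before starting Spark:
import os

hadoop_home = os.environ.get("HADOOP_HOME", r"C:\Hadoop")  # default taken from the question
for name in ("winutils.exe", "hadoop.dll"):
    path = os.path.join(hadoop_home, "bin", name)
    print(path, "found" if os.path.exists(path) else "MISSING")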

Related

Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z Spark 3.3.1

Trying to create a table in Apache Spark gives the following error.
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
I have tried following the same steps as described here
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z, but with no luck.
I have downloaded Apache Spark 3.3.1 (the spark-3.3.1-bin-hadoop3.tgz version) and the hadoop.dll file as well as the winutils.exe file from the 3.3.1/bin folder of https://github.com/kontext-tech/winutils.
I have added the environment variables. I have added the files to the system32 folder.
I have also tried it with Apache Spark 3.3.0.
But I still get the same error message. Not sure what else to do. Can there be some security setting that is blocking the files from working? I'm using a work computer after all.
EDIT: Some clarification
I downloaded spark 3.3.1 (spark-3.3.1-bin-hadoop3.tgz) from the spark website. https://spark.apache.org/downloads.html
I extracted the files into my C:\Spark folder.
I downloaded the hadoop.dll file as well as the winutils.exe file from the 3.3.1/bin folder of https://github.com/kontext-tech/winutils. I put these files into my C:\Hadoop\bin folder as well as the C:\Windows\system32 folder.
I added the environment variables HADOOP_HOME = C:\Hadoop and SPARK_HOME = C:\Spark as described in STEP 7 in https://phoenixnap.com/kb/install-spark-on-windows-10. I also added these variables to the path, i.e. %SPARK_HOME%\bin and %HADOOP_HOME%\bin.
Restarted my computer.
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.15"
libraryDependencies ++= Seq("org.apache.spark" %% "spark-sql" % "3.3.1", "io.delta" %% "delta-core" % "2.1.1")
Looks like you're doing the correct things! I expect there is something small wrong somewhere, so these are things you can look at:
Verify whether your JVM is a 32-bit or 64-bit JVM. You can do this as explained here. From the link you added it seems like these were pre-compiled for a 64-bit JVM. Maybe you are on a 32-bit one?
If it's not the previous point, I expect the error to be somewhere in your environment variables. Carefully study them, looking for unwanted space characters, wrong delimiters, or something else like that.
Hope this helps!
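For reference, a small diagnostic sketch of my own (assuming java is on PATH and Python 3.7+) that covers both points - it prints the JVM's reported architecture and the relevant environment variables:
import os
import subprocess

# The JVM prints its system properties (including os.arch) to stderr with this flag.
props = subprocess.run(["java", "-XshowSettings:properties", "-version"],
                       capture_output=True, text=True).stderr
for line in props.splitlines():
    if "os.arch" in line:
        print(line.strip())  # e.g. 'os.arch = amd64' for a 64-bit JVM

# Print the variables exactly as the process sees them, so stray spaces show up.
for var in ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME"):
    print(var, "=", repr(os.environ.get(var)))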

Can't make action calls through anaconda py35 env in spark HdInsight

As per the documentation - https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-python-package-installation
we had installed several external Python modules through a new anaconda env 'py35_data_prof'. However, as soon as we invoke any RDD action calls like rdd.count() or rdd.avg() in our Python code, Spark 2 throws -
Cannot run program "/usr/bin/anaconda/envs/py35_data_prof/bin/python": error=2, No such file or directory
FYI, the Python indicated in the error path - '/usr/bin/anaconda/envs/py35_data_prof/bin/python' - is actually a symlink rather than a Python directory.
I have been looking up the HDInsight docs but can't seem to find the fix. Please let us know if there is a way around it.
The error message “Cannot run program "/usr/bin/anaconda/envs/py35_data_prof/bin/python": error=2, No such file or directory” clearly says that it is unable to find/locate the installed package. Make sure the package is installed with all the requirements mentioned below.
• Create Python virtual environment using conda.
• Install external Python packages in the created virtual environment if needed.
• Change Spark and Livy configs and point to the created virtual environment.
I would request you to follow each and every step mentioned here: “Safely install external Python packages”.
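For a single session (as opposed to the cluster-wide Ambari/Livy change the linked doc describes), a rough sketch of pointing Spark at that conda environment could look like this; the interpreter path is the one from the question, and the config keys are standard Spark settings rather than anything HDInsight-specific:
from pyspark.sql import SparkSession

env_python = "/usr/bin/anaconda/envs/py35_data_prof/bin/python"  # path from the question

spark = (SparkSession.builder
         .appName("py35-env-check")
         # Make the driver, the YARN application master and the executors
         # all use the interpreter from the new conda environment.
         .config("spark.pyspark.python", env_python)
         .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", env_python)
         .config("spark.executorEnv.PYSPARK_PYTHON", env_python)
         .getOrCreate())

# A trivial action call to confirm the executors can start the interpreter.
print(spark.sparkContext.parallelize(range(100)).count())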
Hope this helps.

Issue upon Spark Upgrade : key not found: _PYSPARK_DRIVER_CONN_INFO_PATH

Downloaded the latest Spark version because of the fix for
ERROR AsyncEventQueue:70 - Dropping event from queue appStatus.
After setting the environment variables and running the same code in PyCharm, I'm getting this error, which I can't find a solution for.
Exception in thread "main" java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CONN_INFO_PATH
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:64)
at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Any help?
I met this issue too. Here is what I did, hoping it helps you:
1. Find your Spark version; my Spark version is 2.4.3.
2. Find your PySpark version; my PySpark version was 2.2.0.
3. Reinstall PySpark to match your Spark version:
pip install pyspark==2.4.3
Then everything is OK. Hope this helps.
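To confirm the two versions line up, a quick check like this (a small sketch, assuming both the pyspark package and spark-submit are installed on the machine) prints both:
import subprocess
import pyspark

# Version of the pip-installed pyspark package.
print("pyspark package:", pyspark.__version__)

# Version banner of the Spark installation itself (spark-submit writes it to stderr).
banner = subprocess.run(["spark-submit", "--version"],
                        capture_output=True, text=True).stderr
print(banner.strip())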
I am using PySpark 2.3.1 with PyCharm 2018.1.4 and facing a similar issue on my Windows machine.
When I run this Python file using spark-submit, it gets executed successfully.
I have followed the steps below:
Created a new project in PyCharm, let's call it Demo.
Go to Settings -> Project: Demo -> Project Interpreter. Make sure the project interpreter is Python 2.7.
Go to Settings -> Project: Demo -> Project Structure. Add Content Root.
I have added two content roots, one pointing to the directory where the Apache Spark content is present and the other to the location of py4j-0.10.7-src.zip.
In my case these locations are
C:\apache-spark
and
C:\apache-spark\python\lib\py4j-0.10.7-src.zip
Created a new Python file (Demo1.py) and pasted the content below inside it.
from pyspark import SparkContext
sc = SparkContext(master="local", appName="Spark Demo")
rdd = sc.textFile("C:/apache-spark/README.md")
wordsRDD = rdd.flatMap(lambda words: words.split(" "))
wordsRDD = wordsRDD.map(lambda word: (word, 1))
wordsCount = wordsRDD.reduceByKey(lambda x, y: x+y)
print(wordsCount.collect())
Running this Python file in PyCharm gives the error below:
Exception in thread "main" java.util.NoSuchElementException: key not
found: _PYSPARK_DRIVER_CONN_INFO_PATH
Whereas the same program, when executed from the command prompt, yields the correct result.
C:\Users\manish>spark-submit C:\Demo\demo1.py
Any suggestions to solve this problem?
I have had a similar exception. My problem was running Jupyter and Spark with different users. When I run them with the same user, the problem is solved.
Details:
When I updated Spark from v2.2.0 to v2.3.1 and then ran the Jupyter notebook, the error log was as follows:
Exception in thread "main" java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CONN_INFO_PATH
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:64)
at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
While googling, I encountered the following link:
spark-commits mailing list archives
In the code
/core/src/main/scala/org/apache/spark/api/python/PythonGatewayServer.scala
there is a change
+ // Communicate the connection information back to the python process by writing the
+ // information in the requested file. This needs to match the read side in java_gateway.py.
+ val connectionInfoPath = new File(sys.env("_PYSPARK_DRIVER_CONN_INFO_PATH"))
+ val tmpPath = Files.createTempFile(connectionInfoPath.getParentFile().toPath(),
+ "connection", ".info").toFile()
According to this change, a temp directory is created with a file in it. My problem was running Jupyter and Spark with different users; because of this, I think the process could not create the temp file. When I run them with the same user, the problem is solved. I hope it helps.
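To check for this kind of user mismatch, a rough diagnostic sketch of my own (Linux-only, since it uses the pwd module) compares the notebook user with the owner of the Spark installation:
import getpass
import os
import pwd

def owner(path):
    # Name of the user that owns the given path.
    return pwd.getpwuid(os.stat(path).st_uid).pw_name

print("notebook user:", getpass.getuser())
spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    print("SPARK_HOME owner:", owner(spark_home))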
I had this problem too, and it ended up being that the pyspark code I was importing/running from PyCharm was still from the Spark 2.2 install instead of the Spark 2.3 installation that I had updated SPARK_HOME to point to.
Specifically, I added spark-2.2 to my PyCharm project structure and then marked its python folder as "Sources" so PyCharm would recognize all its symbols. So the PyCharm code was importing from there, instead of spark-2.3, and the older code didn't set the _PYSPARK_DRIVER_CONN_INFO_PATH environment variable.
If Vezir's answer didn't fix your case, try tracing into the creation of SparkContext and carefully compare the path that is being read from with the path of your Spark install. Similarly, if you installed pyspark into your Python project via pip, make sure you installed 2.3.1 to match your installed Spark version.
This can happen when you are running spark 2.3.1 jars with an older version of pyspark (eg: 2.3.0)
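To see which installation the interpreter is actually picking up (a small sketch of my own, not part of the answers above), print where pyspark is imported from and compare it with SPARK_HOME:
import os
import pyspark

print("imported from:", pyspark.__file__)   # should live under the Spark install you expect
print("version:", pyspark.__version__)
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))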

Spark on windows 10 not working

I'm trying to get Spark working on Windows 10. When I try to run spark-shell I get this error:
'Spark\spark-2.0.0-bin-hadoop2.7\bin..\jars""\ is not recognized as an internal or external command,operable program or batch file.
Failed to find Spark jars directory. You need to build Spark before running this program.
I am using a pre-built Spark for Hadoop 2.7 or later. I have installed Java 8, Eclipse Neon, Python 2.7, Scala 2.11, and gotten winutils for Hadoop 2.7.1, and I still get this error.
When I downloaded Spark it came as a tgz; when extracted there is another tgz inside, so I extracted that too and then I got all the bin folders and stuff. I need to access spark-shell. Can anyone help?
EDIT:
Solution I ended up using:
1) VirtualBox
2) Linux Mint
I got the same error while building Spark. You can move the extracted folder to C:\
Refer this:
http://techgobi.blogspot.in/2016/08/configure-spark-on-windows-some-error.html
You are probably giving the wrong folder path to Spark bin.
Just open the command prompt and change directory to the bin inside the spark folder.
Type spark-shell to check.
Refer: Spark on win 10
"On Windows, I found that if it is installed in a directory that has a space in the path (C:\Program Files\Spark) the installation will fail. Move it to the root or another directory with no spaces."
OR
If you have installed Spark under “C:\Program Files (x86)..” replace 'Program Files (x86)' with Progra~2 in the PATH env variable and SPARK_HOME user variable.
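A tiny check for the space-in-path problem described above (my own sketch; adjust the variable names if yours differ):
import os

for var in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
    value = os.environ.get(var, "")
    if " " in value:
        print(var, "contains a space, which can break the Windows launch scripts:", value)
    else:
        print(var, "looks fine:", value or "(not set)")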

NoClassDefFoundError com.apache.hadoop.fs.FSDataInputStream when execute spark-shell

I've downloaded the prebuilt version of Spark 1.4.0 without Hadoop (with user-provided Hadoop). When I ran the spark-shell command, I got this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:111)
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:111)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:111)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:97)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:106)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 7 more
I've searched on the Internet; it is said that HADOOP_HOME has not been set in spark-env.cmd. But I cannot find spark-env.cmd in the Spark installation folder.
I've traced the spark-shell command and it seems there is no HADOOP_CONFIG in there. I've tried adding HADOOP_HOME as an environment variable, but it still gives the same exception.
Actually, I'm not really using Hadoop. I downloaded Hadoop as a workaround, as suggested in this question.
I am using windows 8 and scala 2.10.
Any help will be appreciated. Thanks.
The "without Hadoop" in the Spark's build name is misleading: it means the build is not tied to a specific Hadoop distribution, not that it is meant to run without it: the user should indicate where to find Hadoop (see https://spark.apache.org/docs/latest/hadoop-provided.html)
One clean way to fix this issue is to:
Obtain Hadoop Windows binaries. Ideally build them, but this is painful (for some hints see: Hadoop on Windows Building/ Installation Error). Otherwise Google some up, for instance currently you can download 2.6.0 from here: http://www.barik.net/archive/2015/01/19/172716/
Create a spark-env.cmd file looking like this (modify Hadoop path to match your installation):
@echo off
set HADOOP_HOME=D:\Utils\hadoop-2.7.1
set PATH=%HADOOP_HOME%\bin;%PATH%
set SPARK_DIST_CLASSPATH=<paste here the output of %HADOOP_HOME%\bin\hadoop classpath>
Put this spark-env.cmd either in a conf folder located at the same level as your Spark base folder (which may look weird), or in a folder indicated by the SPARK_CONF_DIR environment variable.
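If you'd rather not paste the classpath by hand, a small helper sketch (my own, assuming hadoop.cmd is present under %HADOOP_HOME%\bin) can generate that last line for you:
import os
import subprocess

hadoop = os.path.join(os.environ["HADOOP_HOME"], "bin", "hadoop.cmd")

# 'hadoop classpath' prints the full Hadoop classpath to stdout.
classpath = subprocess.run([hadoop, "classpath"],
                           capture_output=True, text=True).stdout.strip()
print("set SPARK_DIST_CLASSPATH=" + classpath)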
I had the same problem, in fact it's mentioned on the Getting started page of Spark how to handle it:
### in conf/spark-env.sh ###
# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)
# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
If you want to use your own Hadoop, follow one of the 3 options and copy and paste it into the spark-env.sh file:
1 - if you have the hadoop binary on your PATH
2 - if you want to point to the hadoop binary explicitly
3 - if you want to point to a hadoop configuration folder
http://spark.apache.org/docs/latest/hadoop-provided.html
I too had the issue,
export SPARK_DIST_CLASSPATH=`hadoop classpath`
resolved the issue.
I ran into the same error when trying to get familiar with spark. My understanding of the error message is that while spark doesn't need a hadoop cluster to run, it does need some of the hadoop classes. Since I was just playing around with spark and didn't care what version of hadoop libraries are used, I just downloaded a spark binary pre-built with a version of hadoop (2.6) and things started working fine.
linux
ENV SPARK_DIST_CLASSPATH="$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
windows
set SPARK_DIST_CLASSPATH=%HADOOP_HOME%\etc\hadoop\*;%HADOOP_HOME%\share\hadoop\common\lib\*;%HADOOP_HOME%\share\hadoop\common\*;%HADOOP_HOME%\share\hadoop\hdfs\*;%HADOOP_HOME%\share\hadoop\hdfs\lib\*;%HADOOP_HOME%\share\hadoop\hdfs\*;%HADOOP_HOME%\share\hadoop\yarn\lib\*;%HADOOP_HOME%\share\hadoop\yarn\*;%HADOOP_HOME%\share\hadoop\mapreduce\lib\*;%HADOOP_HOME%\share\hadoop\mapreduce\*;%HADOOP_HOME%\share\hadoop\tools\lib\*
Go into SPARK_HOME -> conf.
Copy the spark-env.sh.template file and rename it to spark-env.sh.
Inside this file you can set the parameters for Spark.
Run the command below from your package directory just before running spark-submit -
export SPARK_DIST_CLASSPATH=`hadoop classpath`
I finally found a solution to remove the exception.
In spark-class2.cmd, add:
set HADOOP_CLASS1=%HADOOP_HOME%\share\hadoop\common\*
set HADOOP_CLASS2=%HADOOP_HOME%\share\hadoop\common\lib\*
set HADOOP_CLASS3=%HADOOP_HOME%\share\hadoop\mapreduce\*
set HADOOP_CLASS4=%HADOOP_HOME%\share\hadoop\mapreduce\lib\*
set HADOOP_CLASS5=%HADOOP_HOME%\share\hadoop\yarn\*
set HADOOP_CLASS6=%HADOOP_HOME%\share\hadoop\yarn\lib\*
set HADOOP_CLASS7=%HADOOP_HOME%\share\hadoop\hdfs\*
set HADOOP_CLASS8=%HADOOP_HOME%\share\hadoop\hdfs\lib\*
set CLASSPATH=%HADOOP_CLASS1%;%HADOOP_CLASS2%;%HADOOP_CLASS3%;%HADOOP_CLASS4%;%HADOOP_CLASS5%;%HADOOP_CLASS6%;%HADOOP_CLASS7%;%HADOOP_CLASS8%;%LAUNCH_CLASSPATH%
Then, change:
"%RUNNER%" -cp %CLASSPATH%;%LAUNCH_CLASSPATH% org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT%
to :
"%RUNNER%" -Dhadoop.home.dir=*hadoop-installation-folder* -cp %CLASSPATH% %JAVA_OPTS% %*
It works fine for me, but I'm not sure this is the best solution.
You should add these jars in your code:
common-cli-1.2.jar
hadoop-common-2.7.2.jar
Thank you so much. That worked great, but I had to add the spark jars to the classpath as well:
;c:\spark\lib\*
Also, the last line of the cmd file is missing the word "echo"; so it should say:
echo %SPARK_CMD%
I had the same issue:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:111)...
Then I realized that I had installed the Spark version without Hadoop. I installed the "with Hadoop" version and the problem went away.
In my case:
Running a Spark job locally differs from running it on a cluster. On a cluster you might have a different dependency/context to follow, so essentially in your pom.xml you might have dependencies declared as provided.
When running locally, you don't need these provided dependencies; just uncomment them and rebuild again.
I encountered the same error. I wanted to install Spark on my Windows PC and therefore downloaded the "without Hadoop" version of Spark, but it turns out you need the Hadoop libraries! So download any Hadoop-bundled Spark version and set the environment variables.
I got this error because the file was copied from Windows.
Resolve it using
dos2unix file_name
I think you need the spark-core Maven dependency. It worked fine for me.
I used:
export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
export HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce
It works for me!
I added hadoop-client-runtime-3.3.2.jar to my user library.
