SaveAsTextFile() results in Mkdirs failed - Apache Spark

Hello, I currently have a Spark cluster with 3 worker nodes. I also have an NFS server mounted on /var/nfs with 777 permissions for testing. I'm trying to run the following code to count the words in a text file:
root@master:/home/usuario# MASTER="spark://10.0.0.1:7077" spark-shell
val inputFile = sc.textFile("/var/nfs/texto.txt")
val counts = inputFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.toDebugString
counts.cache()
counts.count()
counts.saveAsTextFile("/home/usuario/output");
But Spark gives me the following error:
Caused by: java.io.IOException: Mkdirs failed to create
file:/var/nfs/output-4/_temporary/0/_temporary/attempt_20170614094558_0007_m_000000_20
(exists=false, cwd=file:/opt/spark/work/app-20170614093824-0005/2)
I have searched many websites but I cannot find a solution for my case. Any help is appreciated.

When you start spark-shell with MASTER set to a valid cluster master URL (and not local[*]), Spark treats the paths as HDFS paths and performs the I/O against the underlying HDFS, not the local filesystem.
You have mounted the locations in the local file system, and those paths do not exist in HDFS.
That's why the error says: exists=false
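Concretely, a minimal sketch of the two usual workarounds (the hdfs:// namenode address below is hypothetical, and the file:// variant assumes /var/nfs is mounted read-write at the same path on every worker):
// Option 1: write into HDFS, which every executor can reach
counts.saveAsTextFile("hdfs://10.0.0.1:9000/user/usuario/output")
// Option 2: keep the NFS share, but be explicit that this is a local-filesystem
// path that must exist on all workers
counts.saveAsTextFile("file:///var/nfs/output")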

I had the same issue. Check the ownership of your directory again:
sudo chown -R owner-user:owner-group directory

Related

Permission error when using sparklyr with Hadoop

I am trying to get sparklyr to work on a cluster with Hadoop. When I run sc <- spark_connect(master = "yarn-client", version = "2.8.5")
I get this error message:
Error in force(code) :
Failed during initialize_connection: org.apache.hadoop.security.AccessControlException: Permission denied: user=rstudio, access=WRITE, inode="/user":hdfs:hadoop:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:189)
...
The user rstudio is the one I created for RStudio Server. How do I fix the permissions to get it to work?
Using the Hadoop superuser (it looks like it's hdfs in your case), you need to create an HDFS home directory (/user/rstudio) for your rstudio user, and change its ownership so that rstudio is the owner. See http://www.hadooplessons.info/2017/12/creating-home-directory-for-user-in-hdfs-hdpca.html?m=1 for details.
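The same steps can also be done programmatically through the Hadoop FileSystem API; a minimal sketch in Scala, run from a spark-shell session started as the HDFS superuser (the rstudio user/group names follow the question):
import org.apache.hadoop.fs.{FileSystem, Path}
// sc.hadoopConfiguration picks up the cluster's core-site.xml / hdfs-site.xml
val fs = FileSystem.get(sc.hadoopConfiguration)
val home = new Path("/user/rstudio")
fs.mkdirs(home)                          // create the home directory
fs.setOwner(home, "rstudio", "rstudio")  // make rstudio the owner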

basedir must be absolute: ?/.ivy2/local

I'm writing here in a state of complete desperation...
I have 2 users:
1 local user, created in Linux: works 100% fine, the word count runs perfectly. Kerberized cluster, valid ticket.
1 Active Directory user: can log in, but the same pyspark instruction (same word count) fails. Same KDC ticket as the one above.
Exception in thread "main" java.lang.IllegalArgumentException: basedir
must be absolute: ?/.ivy2/local
at org.apache.ivy.util.Checks.checkAbsolute(Checks.java:48)
at org.apache.ivy.plugins.repository.file.FileRepository.setBaseDir(FileRepository.java:135)
at org.apache.ivy.plugins.repository.file.FileRepository.<init>(FileRepository.java:44)
at org.apache.spark.deploy.SparkSubmitUtils$.createRepoResolvers(SparkSubmit.scala:943)
at org.apache.spark.deploy.SparkSubmitUtils$.buildIvySettings(SparkSubmit.scala:1035)
at org.apache.spark.deploy.SparkSubmit$$anonfun$2.apply(SparkSubmit.scala:295)
at org.apache.spark.deploy.SparkSubmit$$anonfun$2.apply(SparkSubmit.scala:295)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:294)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The code I'm running is super simple:
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
It fails at the last instruction with the above error (see the exception).
?/.ivy2/local -> this is the problem, but I have no idea what's going on :(.
With the Linux user it works perfectly... but with the AD user, which doesn't exist in the local system but does have /home/userFolder ... I have this problem :(
Please help... I've reached the point of insanity... I've googled every corner of the internet but I haven't found any solution to this problem/mistake :( Stack Overflow is my last resort, heeeeeeeeeelp
Context
Ivy needs a directory called .ivy2, usually located in your home directory. You can also configure where .ivy2 should be by setting a configuration property when Spark starts, or when you execute spark-submit.
Where the problem comes from
In IvySettings.java (line 796 in version 2.2.0 of ant-ivy) there are these lines:
if (getVariable("ivy.home") != null) {
    setDefaultIvyUserDir(Checks.checkAbsolute(getVariable("ivy.home"), "ivy.home"));
    Message.verbose("using ivy.default.ivy.user.dir variable for default ivy user dir: " + defaultUserDir);
} else {
    setDefaultIvyUserDir(new File(System.getProperty("user.home"), ".ivy2"));
    Message.verbose("no default ivy user dir defined: set to " + defaultUserDir);
}
As you can see, if ivy.home is not set, and user.home is also not set, then you will get the error:
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
Solution 1 (spark-shell or spark-submit)
As Rocke Yang has mentioned, you can start spark-shell or spark-submit by setting the configuration property spark.jars.ivy. Example:
spark-shell --conf spark.jars.ivy=/tmp/.ivy
Solution 2 (spark-launcher or yarn-client)
A second solution would be to set the configuration property when submitting programmatically with SparkLauncher:
import org.apache.spark.launcher.SparkLauncher
val sparkLauncher = new SparkLauncher()
sparkLauncher.setSparkHome("/path/to/SPARK_HOME")
  .setAppResource("/path/to/jar/to/be/executed")
  .setMainClass("MainClassName")
  .setMaster("yarn")                      // or "local", "spark://host:7077", etc.
  .setDeployMode("cluster")               // or "client"
  .setConf("spark.executor.cores", "2")
  .setConf("spark.jars.ivy", "/tmp/.ivy")
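To actually submit, the launcher is then started; a minimal sketch using the org.apache.spark.launcher API, blocking on completion for simplicity:
// launch() forks a spark-submit process; waitFor() blocks until it exits
val process = sparkLauncher.launch()
process.waitFor()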
Ticket opened
There is a ticket opened by the Spark community.
I have met a similar issue.
SparkSubmit looks for the Ivy home directory; if it is not found, it reports this error. The property name changes slightly along the way:
class SparkSubmitArguments {
  ivyRepoPath = sparkProperties.get("spark.jars.ivy").orNull
}
We can pass the Ivy home directory like this:
spark-shell --conf spark.jars.ivy=/tmp/.ivy

Spark program on Windows cluster fails with error CreateProcess error=5, Access is denied

I am trying to execute a program on a Spark v2.0.0 cluster from my Windows 10 laptop. There is a master node on port 31080 and a slave node on port 32080. The cluster uses the standalone manager, and I am using JDK 1.8 with a custom work directory for the slave.
When the program is submitted via spark-submit or through Eclipse > Run program, I get the error below, and the executor goes into a loop (a new executor is created and fails continuously). Please guide.
Executor updated: app-20160906203653-0001/0 is now RUNNING
Executor updated: app-20160906203653-0001/0 is now FAILED (java.io.IOException: Cannot run program ""D:\jdk1.8.0_101"\bin\java"
(in directory "D:\spark-work\app-20160906203653-0001\0"):
CreateProcess error=5, Access is denied)
Executor app-20160906203653-0001/0 removed: java.io.IOException: Cannot run program ""D:\jdk1.8.0_101"\bin\java" (in directory
"D:\spark-work\app-20160906203653-0001\0"): CreateProcess error=5,
Access is denied
Removal of executor 0 requested
Got the answer... I was starting my master and slaves through Windows batch scripts. These invoked an env script that set JAVA_HOME, SCALA_HOME and SPARK_HOME, with the paths enclosed in double quotes. Hence the issue: in a batch file, set JAVA_HOME="D:\jdk1.8.0_101" keeps the quotes as part of the value, which is why the worker tries to run ""D:\jdk1.8.0_101"\bin\java". Removing the double quotes fixed the issue... no admin privileges or other changes needed.

spark reads the file on the client instead of on the worker

mymaster:
$ ./sbin/start-master.sh
myworker:
$ ./sbin/start-slave.sh spark://mymaster:7077
myclient:
$ ./bin/spark-shell --master spark://mymaster:7077
At this moment, the log on myworker says the following, indicating that it has accepted the job:
16/06/01 02:22:41 INFO Worker: Asked to launch executor app-20160601022241-0007/0 for Spark shell
myclient:
scala> sc.textFile("mylocalfile.txt").map(_.length).sum
res0: Double = 3264.0
It works if the file mylocalfile.txt is available on myclient. However, according to the docs, the file should be available on myworker, not on myclient:
If using a path on the local filesystem, the file must also be
accessible at the same path on worker nodes. Either copy the file to
all workers or use a network-mounted shared file system.
what am I missing here?

apache shark installation on spark cluster

When running Shark on a Spark cluster with one node, I'm getting the following error. Can anyone please solve it? Thanks in advance.
Error:
Executor updated: app-20140619165031-0000/0 is now FAILED (class java.io.IOException: Cannot run program "/home/trendwise/Hadoop_tools/jdk1.7.0_40/bin/java" (in directory "/home/trendwise/Hadoop_tools/spark/spark-0.9.1-bin-hadoop1/work/app-20140619165031-0000/0"): error=2, No such file or directory)
In my experience, "No such file or directory" is often a symptom of some other exception. Usually it's "no space left on device", and sometimes "too many open files". Mine the logs for other stack traces, and monitor your disk usage and inode usage to confirm.
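For the disk-space part, a quick sanity check can be done from a Scala shell on the affected node; a minimal sketch (the work directory path is the one from the error message, adjust it to your install, and note this reports free bytes only, not inodes):
// Free vs. total space on the filesystem holding the Spark work directory
val workDir = new java.io.File("/home/trendwise/Hadoop_tools/spark/spark-0.9.1-bin-hadoop1/work")
println(f"free: ${workDir.getUsableSpace / 1e9}%.2f GB of ${workDir.getTotalSpace / 1e9}%.2f GB")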
