Local file upload failed in spark application - apache-spark

In my code, I am trying to load a file which is in my local machine into spark application,
sc.textFile("file:///home/testpath/file1“).
When I submit the job on the command line
Scenario 1: spark submit --class … master local
Job ran successfully with out any issues.
Scenario 2 : spark submit --class …. —master yarn —deploy-mode cluster
Job failed by throwing file:///home/testpath/file1 file not found Exception.
But when I tested file1.... File exists on my local.
Scenario 3 : spark submit —class … —master yarn —deploy-mode client
Job failed by throwing file:///home/testpath/file1 file not found Exception.
But when I tested file1,, File exists on my local.
Scenario 4: spark-shell —master=yarn
Val file1 = sc.textFile("file:///home/testpath/file1“).
Job failed by throwing file:///home/testpath/file1 file not found Exception.
In core-site.xml, fs.default.name property set to hdfs://mynamenode:9000
Could you please help how can I load local file in my spark application( Using spark 2.X version)
Any Ideas? Thanks in advance.

When spark execution mode is local, spark executor jobs are scheduled on the same local node and hence, it is able to find the file. But, when in yarn mode, executor jobs are scheduled randomly on any of the cluster nodes. So, you may either move your file to HDFS or maintain a copy of this file on each node

Related

Hadoop copyToLocalFile failing in Yarn cluster mode

I was trying to copy a file to local from HDFS using Hadoop's copyToLocalFile function from my Spark2 application.
val hadoopConf = new Configuration()
val hdfs = FileSystem.get(hadoopConf)
val src = new Path("/user/yxs7634/all.txt")
val dest = new Path("file:///home/yxs7634/all.txt")
hdfs.copyToLocalFile(src, dest)
The above code is working fine when I submit my spark application in Yarn client mode. But, It keeps failing with the below exception in Yarn cluster mode.
18/10/03 12:18:40 ERROR yarn.ApplicationMaster: User class threw exception: java.io.FileNotFoundException: /home/yxs7634/all.txt (Permission denied)
In yarn-cluster mode the driver is also handled by yarn and the selected driver node may not be the one where you're submitting the job. Hence for this job to work in yarn-cluster mode I believe you need to place the local file in all the spark nodes in the cluster.
In yarn mode, the spark job is submitted through YARN.
The driver would be started on a different node.
To tackle this issue, you can use a distributed file system like HDFS to store your file and then giving the absolute path.
eg:
val src = new Path("hdfs://nameservicehost:8020/user/yxs7634/all.txt")
Looks like Spark server running under one user (for ex. "spark"), and file in code stored in other user "yxs7634" directory.
In cluster mode user "spark" does not allows to write in "yxs7634" user dir, and such exception occurs.
Additional permission for Spark user to write in "/home/yxs7634" is required.
In local mode worked fine, because Spark runs under "yxs7634" user.
You have a permission denied error, I mean, the user you are using to submit the job is not able to access the file. The directory should have at least read permission to user "other", something like this: -rw-rw-r--
Can you paste the permissions of the directory and the file? The command is
hdfs dfs -ls /your-directory/

No such file or directory in spark cluster mode

I am writing a spark-streaming application using pyspark which basically process the data.
Inshort packaging overview:
This application contains several modules and some config files which are non .py files (ex:.yaml or .json).
I am packaging this entire application in package.zip file and submitting this package.zip to spark.
Now the problem is when i issue the spark-submit command in yarn cluster mode. I get IOError. Below is stacktrace
Traceback (most recent call last):
File "main/main.py", line 10, in <module>
import logger.logger
File "package.zip/logger/logger.py", line 36, in get_logger
IOError: [Errno 2] No such file or directory: 'logger/config.yaml'
Spark-Command :
spark-submit --master yarn-cluster --py-files package.zip main/main.py
But when I am submitting job in yarn-client mode the application works as expected.
My understanding:
When I submit the job in client mode the spark driver runs in same machine where I have issued the command. And the package is distributed across all nodes.
And when I issue the command in cluster mode the both spark driver and application master runs in single node(which is not client who submitted code.) and still package is distribute to all nodes in cluster.
In both the cases package.zip is available to all nodes then why is that only py files are getting loaded and non py files are failed to load in cluster mode.
Can any one please help me to understand the situation here and resolve the problem?
Updated--
Observations
In Client Mode The zipped package is unzipped in the path where driver script is running.
Where as in Cluster Mode the zip package shared across all node but not unzipped.
Here do I need to unzip package in all nodes ?
Is there any way to tell spark to unzip package in worker node?
You can pass your extra files with --files option.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-submit.html

Using same jar with Spark-submit

I deploy a job on yarn cluster mode by spark-submit with my jar file. The job deployed every time I submitted with 'same jar file', but It upload to hadoop everytime it's submitted. I think it's unnecessary routine to upload same jar every time. Is there any way to upload once and do yarn jobs with the jar?
You can put your spark jar in hdfs and then use --master yarn-cluster mode, this way you could save the time required to upload the jar to hdfs everytime.
Other alternatives is put your jar in spark classpath on every node which has the following drawbacks:
If you have more than 30 nodes it would be very tedious to scp your jar in each node.
If you hadoop cluster upgrades and there is a new installation of spark, you would have to reploy.

Spark streaming on dataproc throws FileNotFoundException

When I try to submit a spark streaming job to google dataproc cluster, I get this exception:
16/12/13 00:44:20 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
...
16/12/13 00:44:20 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector#d7bffbc{HTTP/1.1}{0.0.0.0:4040}
16/12/13 00:44:20 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
16/12/13 00:44:20 ERROR org.apache.spark.util.Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:152)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1360)
...
Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
Full output here.
It seems this error happens when hadoop configuration is not correctly defined in spark-env.sh - link1, link2
Is it configurable somewhere? Any pointers on how to resolve it?
Running the same code in local mode works fine:
sparkConf.setMaster("local[4]")
For additional context: the job was invoked like this:
gcloud dataproc jobs submit spark \
--cluster my-test-cluster \
--class com.company.skyfall.Skyfall \
--jars gs://my-bucket/resources/skyfall-assembly-0.0.1.jar \
--properties spark.ui.showConsoleProgress=false
This is the boilerplate setup code:
lazy val conf = {
val c = new SparkConf().setAppName(this.getClass.getName)
c.set("spark.ui.port", (4040 + scala.util.Random.nextInt(1000)).toString)
if (isLocal) c.setMaster("local[4]")
c.set("spark.streaming.receiver.writeAheadLog.enable", "true")
c.set("spark.streaming.blockInterval", "1s")
}
lazy val ssc = if (checkPointingEnabled) {
StreamingContext.getOrCreate(getCheckPointDirectory, createStreamingContext)
} else {
createStreamingContext()
}
private def getCheckPointDirectory: String = {
if (isLocal) localCheckPointPath else checkPointPath
}
private def createStreamingContext(): StreamingContext = {
val s = new StreamingContext(conf, Seconds(batchDurationSeconds))
s.checkpoint(getCheckPointDirectory)
s
}
Thanks in advance
Is it possible that this wasn't the first time you ran the job with the given checkpoint directory, as in the checkpoint directory already contains a checkpoint?
This happens because the checkpoint hard-codes the exact jarfile arguments used to submit the YARN application, and when running on Dataproc with a --jars flag pointing to GCS, this is actually syntactic sugar for Dataproc automatically staging your jarfile from GCS into a local file path /tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar that's only used temporarily for the duration of a single job-run, since Spark isn't able to invoke the jarfile directly out of GCS without staging it locally.
However, in a subsequent job, the previous tmp jarfile will already be deleted, but the new job tries to refer to that old location hard-coded into the checkpoint data.
There are also additional issues caused by hard-coding in the checkpoint data; for example, Dataproc also uses YARN "tags" to track jobs, and will conflict with YARN if an old Dataproc job's "tag" is reused in a new YARN application. To run your streaming application, you'll need to first clear out your checkpoint directory if possible to start from a clean slate, and then:
You must place the job jarfile somewhere on the master node before starting the job, and then your "--jar" flag must specify "file:///path/on/master/node/to/jarfile.jar".
When you specify a "file:///" path dataproc knows its already on the master node so it doesn't re-stage into a /tmp directory, so in that case it's safe for the checkpoint to point to some fixed local directory on the master.
You can do this either with an init action or you can submit a quick pig job (or just ssh into the master and download that jarfile):
# Use a quick pig job to download the jarfile to a local directory (for example /usr/lib/spark in this case)
gcloud dataproc jobs submit pig --cluster my-test-cluster \
--execute "fs -cp gs://my-bucket/resources/skyfall-assembly-0.0.1.jar file:///usr/lib/spark/skyfall-assembly-0.0.1.jar"
# Submit the first attempt of the job
gcloud dataproc jobs submit spark --cluster my-test-cluster \
--class com.company.skyfall.Skyfall \
--jars file:///usr/lib/spark/skyfall-assembly-0.0.1.jar \
--properties spark.ui.showConsoleProgress=false
Dataproc relies on spark.yarn.tags under the hood to track YARN applications associated with jobs. However, the checkpoint holds a stale spark.yarn.tags which causes Dataproc to get confused with new applications that seem to be associated with old jobs.
For now, it only "cleans up" suspicious YARN applications as long as the recent killed jobid is held in memory, so rebooting the dataproc agent will fix this.
# Kill the job through the UI or something before the next step.
# Now use "pig sh" to restart the dataproc agent
gcloud dataproc jobs submit pig --cluster my-test-cluster \
--execute "sh systemctl restart google-dataproc-agent.service"
# Re-run your job without needing to change anything else,
# it'll be fine now if you ever need to resubmit it and it
# needs to recover from the checkpoint again.
Keep in mind though that by nature of checkpoints this means you won't be able to change the arguments you pass on subsequent runs, because the checkpoint recovery is used to clobber your command-line settings.
You can also run the job in yarn cluster mode to avoid adding jar to your master machine. The potential trade off is the spark driver will run in worker node instead of the master.

Apache Spark Multi Node Clustering

I am currently working on logger analyse by using apache spark. I am new for Apache Spark. I have tried to use apache spark standalone mode. I can run my code by submitting jar with deploy-mode on the client. But I can not run with multi node cluster. I have used worker nodes are different machine.
sh spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster --master spark://rishon.server21:7077 /home/rishon/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"
when i Run this command, it shows following error
15/10/20 18:04:23 ERROR ClientEndpoint: Exception from cluster was: java.io.FileNotFoundException: /home/rishon/loganalyzer.jar (No such file or directory)
java.io.FileNotFoundException: /home/rishon/loganalyzer.jar (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
at org.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)
at org.spark-project.guava.io.Files.copy(Files.java:436)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:514)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:485)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:562)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
As my understanding, The driver program sends the data and application code to worker node. I don't know my understanding is correct or not. So Please help me to run application on a cluster.
I have tried to run jar on cluster and Now there is no exception but why the task is not assigned to worker node?
I have tried without clustering. Its working fine. shown in following figure
Above image shows, Task assigned to worker nodes. But I have one more problem to analyse the log file. Actually, I have log files in master node which is in a folder (ex: '/home/visva/log'). But the worker node searching the file on their own file system.
I met same problem.
My solution was that I uploaded my .jar file on the HDFS.
Enter the command line like this:
spark-submit --class com.example.RunRecommender --master spark://Hadoop-NameNode:7077 --deploy-mode cluster --executor-memory 6g --executor-cores 3 hdfs://Hadoop-NameNode:9000/spark-practise-assembly-1.0.jar
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
If you use the cluster model in spark-submit , you need use the 6066 port(the default port of rest in spark) :
spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster --master spark://rishon.server21:6066 /home/rishon/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"
In my case, i upload the jar of app to every node in cluster because i do not know how does the spark-submit to transfer the app automatically and i don't know how to specify a node as driver node .
Note: The jar path of app is a path that in the any node of cluster.
There are two deploy modes in Spark to run the script.
1.client (default): In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster.(Master node)
2.cluster : If your application is submitted from a machine far from the worker machines, it is common to use cluster mode to minimize network latency between the drivers and the executors.
Reference Spark Documentation For Submitting JAR

Resources