Spark streaming on dataproc throws FileNotFoundException - apache-spark

When I try to submit a Spark Streaming job to a Google Dataproc cluster, I get this exception:
16/12/13 00:44:20 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
...
16/12/13 00:44:20 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@d7bffbc{HTTP/1.1}{0.0.0.0:4040}
16/12/13 00:44:20 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
16/12/13 00:44:20 ERROR org.apache.spark.util.Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:152)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1360)
...
Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
Full output here.
It seems this error happens when the Hadoop configuration is not correctly defined in spark-env.sh (link1, link2).
Is it configurable somewhere? Any pointers on how to resolve it?
Running the same code in local mode works fine:
sparkConf.setMaster("local[4]")
For additional context, the job was invoked like this:
gcloud dataproc jobs submit spark \
--cluster my-test-cluster \
--class com.company.skyfall.Skyfall \
--jars gs://my-bucket/resources/skyfall-assembly-0.0.1.jar \
--properties spark.ui.showConsoleProgress=false
This is the boilerplate setup code:
lazy val conf = {
  val c = new SparkConf().setAppName(this.getClass.getName)
  c.set("spark.ui.port", (4040 + scala.util.Random.nextInt(1000)).toString)
  if (isLocal) c.setMaster("local[4]")
  c.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  c.set("spark.streaming.blockInterval", "1s")
}
lazy val ssc = if (checkPointingEnabled) {
  StreamingContext.getOrCreate(getCheckPointDirectory, createStreamingContext)
} else {
  createStreamingContext()
}
private def getCheckPointDirectory: String = {
  if (isLocal) localCheckPointPath else checkPointPath
}
private def createStreamingContext(): StreamingContext = {
  val s = new StreamingContext(conf, Seconds(batchDurationSeconds))
  s.checkpoint(getCheckPointDirectory)
  s
}
Thanks in advance

Is it possible that this wasn't the first time you ran the job with the given checkpoint directory, as in the checkpoint directory already contains a checkpoint?
This happens because the checkpoint hard-codes the exact jarfile arguments used to submit the YARN application. When you run on Dataproc with a --jars flag pointing to GCS, that is syntactic sugar: Dataproc automatically stages your jarfile from GCS into a local path such as /tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar, because Spark isn't able to invoke the jarfile directly out of GCS without staging it locally. That path is only used temporarily, for the duration of a single job run.
In a subsequent job, the previous tmp jarfile has already been deleted, but the new job still tries to refer to the old location hard-coded into the checkpoint data.
There are also additional issues caused by hard-coding in the checkpoint data; for example, Dataproc also uses YARN "tags" to track jobs, and will conflict with YARN if an old Dataproc job's "tag" is reused in a new YARN application. To run your streaming application, you'll need to first clear out your checkpoint directory if possible to start from a clean slate, and then:
You must place the job jarfile somewhere on the master node before starting the job, and your "--jars" flag must then specify "file:///path/on/master/node/to/jarfile.jar".
When you specify a "file:///" path, Dataproc knows the jar is already on the master node, so it doesn't re-stage it into a /tmp directory. In that case it's safe for the checkpoint to point to a fixed local path on the master.
You can do this with an init action, by submitting a quick Pig job, or simply by SSHing into the master and downloading the jarfile:
# Use a quick pig job to download the jarfile to a local directory (for example /usr/lib/spark in this case)
gcloud dataproc jobs submit pig --cluster my-test-cluster \
--execute "fs -cp gs://my-bucket/resources/skyfall-assembly-0.0.1.jar file:///usr/lib/spark/skyfall-assembly-0.0.1.jar"
# Submit the first attempt of the job
gcloud dataproc jobs submit spark --cluster my-test-cluster \
--class com.company.skyfall.Skyfall \
--jars file:///usr/lib/spark/skyfall-assembly-0.0.1.jar \
--properties spark.ui.showConsoleProgress=false
Dataproc relies on spark.yarn.tags under the hood to track the YARN applications associated with jobs. However, the checkpoint holds a stale spark.yarn.tags value, which confuses Dataproc: new applications appear to be associated with old jobs.
For now, the agent only "cleans up" suspicious YARN applications while the recently killed job ID is held in memory, so restarting the Dataproc agent fixes this.
# Kill the job through the UI or something before the next step.
# Now use "pig sh" to restart the dataproc agent
gcloud dataproc jobs submit pig --cluster my-test-cluster \
--execute "sh systemctl restart google-dataproc-agent.service"
# Re-run your job without needing to change anything else,
# it'll be fine now if you ever need to resubmit it and it
# needs to recover from the checkpoint again.
Keep in mind, though, that by the nature of checkpoints you won't be able to change the arguments you pass on subsequent runs, because checkpoint recovery clobbers your command-line settings.
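The way a checkpoint overrides fresh settings can be pictured with a toy model. The sketch below is not Spark's implementation: get_or_create, the settings.json file and the "jar" key are hypothetical names chosen only for illustration. It mirrors the get-or-create idea: once a checkpoint exists, the stored settings win and the newly supplied ones are never consulted, which is how a stale /tmp jar path can resurface in a later run.

```python
import json
import os
import tempfile


def get_or_create(checkpoint_dir, create_settings):
    """Toy get-or-create: reuse saved settings if a checkpoint exists.

    Hypothetical helper, loosely modelled on the idea behind
    StreamingContext.getOrCreate; it is not Spark code.
    """
    path = os.path.join(checkpoint_dir, "settings.json")
    if os.path.exists(path):
        # Recovery path: create_settings is never called.
        with open(path) as f:
            return json.load(f)
    settings = create_settings()
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(settings, f)
    return settings


def demo():
    """Run two 'jobs' against the same checkpoint directory."""
    with tempfile.TemporaryDirectory() as checkpoint_dir:
        first = get_or_create(checkpoint_dir, lambda: {"jar": "file:/tmp/run-1/app.jar"})
        second = get_or_create(checkpoint_dir, lambda: {"jar": "file:/tmp/run-2/app.jar"})
        return first, second


first_run, second_run = demo()
print(first_run["jar"])
print(second_run["jar"])  # the second run still sees the run-1 path
```

Clearing the checkpoint directory corresponds to deleting settings.json in this model: the next call would then take the creation path again.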

You can also run the job in YARN cluster mode to avoid copying the jar to your master machine. The potential trade-off is that the Spark driver will run on a worker node instead of the master.

Related

How to get basic Spark program running on Kubernetes

I'm trying to get off the ground with Spark and Kubernetes but I'm facing difficulties. I used the helm chart here:
https://github.com/bitnami/charts/tree/main/bitnami/spark
I have 3 workers and they all report running successfully. I'm trying to run the following program remotely:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://<master-ip>:<master-port>").getOrCreate()
df = spark.read.json('people.json')
Here's the part that's not entirely clear. Where should the file people.json actually live? I have it locally where I'm running the python code and I also have it on a PVC that the master and all workers can see at /sparkdata/people.json.
When I run the 3rd line as simply 'people.json' then it starts running but errors out with:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
If I run it as '/sparkdata/people.json' then I get
pyspark.sql.utils.AnalysisException: Path does not exist: file:/sparkdata/people.json
Not sure where I go from here. To be clear I want it to read files from the PVC. It's an NFS share that has the data files on it.
Your people.json file needs to be accessible to your driver + executor pods. This can be achieved in multiple ways:
having some kind of network/cloud drive that each pod can access
mounting volumes on your pods, and then uploading the data to those volumes using --files in your spark-submit.
The latter option might be simpler to set up. This page discusses in more detail how you could do this, but we can get straight to the point. If you add the following arguments to your spark-submit, you should be able to get your people.json onto your driver and executors (you just have to choose sensible values for the $VAR variables):
--files people.json \
--conf spark.kubernetes.file.upload.path=$SOURCE_DIR \
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
--conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
--conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
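To make the $VAR placeholders concrete, here is a small sketch that expands them into the argument list above. The helper name volume_conf_args and the sample values (a hostPath volume called "sparkdata" mounted at /sparkdata, also used as the upload path) are assumptions for illustration; only the configuration keys come from the arguments listed in this answer.

```python
def volume_conf_args(volume_type, volume_name, mount_path, upload_path, files):
    """Build the spark-submit arguments from this answer as a flat list."""
    args = [
        "--files", ",".join(files),
        "--conf", f"spark.kubernetes.file.upload.path={upload_path}",
    ]
    # The same two volume settings are repeated for the driver and the executors.
    for role in ("driver", "executor"):
        prefix = f"spark.kubernetes.{role}.volumes.{volume_type}.{volume_name}"
        args += ["--conf", f"{prefix}.mount.path={mount_path}"]
        args += ["--conf", f"{prefix}.options.path={mount_path}"]
    return args


# Hypothetical values, chosen only to show the expansion.
args = volume_conf_args("hostPath", "sparkdata", "/sparkdata", "/sparkdata", ["people.json"])
for flag, value in zip(args[::2], args[1::2]):
    print(flag, value)
```

Other volume types may expect different option names, so treat the options.path line as specific to the volume type you pick.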
You can always verify the existence of your data by going inside of the pods themselves like so:
kubectl exec -it <driver/executor pod name> -- bash
(now you should be inside of a bash process in the pod)
cd <mount-path-you-chose>
ls -al
That last ls -al command should show you a people.json file in there (after having done your spark-submit of course).
Hope this helps!

Spark job not showing up on standalone cluster GUI

I am playing with running Spark jobs in my lab and have a three-node standalone cluster. When I execute a new job on the master node via the CLI
spark-submit sparktest.py --master spark://myip:7077
the job completes as expected, but it does not show up at all in the cluster GUI. After some investigation, I added --master to the submit command, but to no avail. During job execution as well as after completion, when I navigate to http://mymasternodeip:8080/
none of these jobs appear under Running Jobs or Completed Jobs. Any thoughts as to why the jobs don't show up would be appreciated.
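Specify the --master flag, and any other spark-submit options, before the application file. Everything after the application file is passed to your application as its own arguments, so spark-submit never sees --master and the job runs with a local master.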
spark-submit --master spark://myip:7077 sparktest.py
Make sure that you don't override the master setting in your code when creating the SparkSession object. Either provide the same master URL in code or don't set it there at all.
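The ordering rule can be illustrated with a simplified model of how the command line is split. This is a sketch, not spark-submit's actual parser: split_submit_args and the small VALUE_OPTIONS set are hypothetical, and real spark-submit recognises many more options. The model only captures the point above: option parsing stops at the application file, and the rest goes to the application.

```python
# Options that take a value; a deliberately small, illustrative subset.
VALUE_OPTIONS = {"--master", "--deploy-mode", "--class", "--name", "--conf"}


def split_submit_args(argv):
    """Split argv into (submit_options, app_file, app_args).

    Simplified model: parsing stops at the first argument that is not a
    known option. That argument is the application file, and everything
    after it belongs to the application.
    """
    options = {}
    i = 0
    while i + 1 < len(argv) and argv[i] in VALUE_OPTIONS:
        options[argv[i]] = argv[i + 1]
        i += 2
    if i >= len(argv):
        return options, None, []
    return options, argv[i], argv[i + 1:]


wrong = split_submit_args(["sparktest.py", "--master", "spark://myip:7077"])
right = split_submit_args(["--master", "spark://myip:7077", "sparktest.py"])
print(wrong)  # --master ends up among the application's own arguments
print(right)  # --master is seen as a submit option
```

In the first call the options dictionary is empty, which corresponds to the job falling back to a local master and never registering with the standalone cluster.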

Local file upload failed in spark application

In my code, I am trying to load a file that is on my local machine into a Spark application:
sc.textFile("file:///home/testpath/file1")
When I submit the job on the command line:
Scenario 1: spark-submit --class … --master local
The job ran successfully without any issues.
Scenario 2: spark-submit --class … --master yarn --deploy-mode cluster
The job failed with a "file not found" exception for file:///home/testpath/file1.
But when I checked, file1 exists on my local machine.
Scenario 3: spark-submit --class … --master yarn --deploy-mode client
The job failed with the same "file not found" exception for file:///home/testpath/file1.
Again, when I checked, file1 exists on my local machine.
Scenario 4: spark-shell --master=yarn
val file1 = sc.textFile("file:///home/testpath/file1")
The job failed with the same "file not found" exception for file:///home/testpath/file1.
In core-site.xml, the fs.default.name property is set to hdfs://mynamenode:9000.
Could you please help me load a local file in my Spark application (using Spark 2.x)?
Any ideas? Thanks in advance.
When Spark runs in local mode, the executors run on the same node as the driver, so they can find the file. In YARN mode, however, executors are scheduled on arbitrary cluster nodes, so a file:// path has to exist on every node. You can either move the file to HDFS or keep a copy of it at the same path on each node.
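A small pre-flight check can catch this before the job is submitted. The sketch below uses only Python's standard library; is_node_local and preflight are hypothetical helper names, and the hdfs:// path is an assumed location for the file after it has been copied to HDFS.

```python
from urllib.parse import urlparse


def is_node_local(path):
    """True when the path names the local disk of whichever node opens it."""
    return urlparse(path).scheme == "file"


def preflight(path, master):
    """Return a warning for node-local inputs on a non-local master, else None."""
    if is_node_local(path) and not master.startswith("local"):
        return f"{path} must exist at the same location on every node when master={master}"
    return None


print(preflight("file:///home/testpath/file1", "local"))
print(preflight("file:///home/testpath/file1", "yarn"))
print(preflight("hdfs://mynamenode:9000/home/testpath/file1", "yarn"))
```

Only the second call produces a warning, matching the scenarios in the question: the file:// path is fine with a local master but not on YARN.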

Hadoop copyToLocalFile failing in Yarn cluster mode

I was trying to copy a file to local from HDFS using Hadoop's copyToLocalFile function from my Spark2 application.
val hadoopConf = new Configuration()
val hdfs = FileSystem.get(hadoopConf)
val src = new Path("/user/yxs7634/all.txt")
val dest = new Path("file:///home/yxs7634/all.txt")
hdfs.copyToLocalFile(src, dest)
The above code works fine when I submit my Spark application in YARN client mode, but it keeps failing with the exception below in YARN cluster mode.
18/10/03 12:18:40 ERROR yarn.ApplicationMaster: User class threw exception: java.io.FileNotFoundException: /home/yxs7634/all.txt (Permission denied)
In yarn-cluster mode the driver is also managed by YARN, and the node selected for the driver may not be the one from which you submit the job. Hence, for this job to work in yarn-cluster mode, I believe you need to place the local file on all the Spark nodes in the cluster.
In YARN mode, the Spark job is submitted through YARN.
The driver may be started on a different node.
To tackle this issue, you can store your file in a distributed file system such as HDFS and then give the absolute path.
For example:
val src = new Path("hdfs://nameservicehost:8020/user/yxs7634/all.txt")
It looks like Spark is running as one user (for example "spark"), while the destination path in the code is in the home directory of another user, "yxs7634".
In cluster mode, the "spark" user is not allowed to write to the "yxs7634" user's directory, hence the exception.
The Spark user needs additional permission to write to "/home/yxs7634".
In client mode it worked because the driver runs as the "yxs7634" user.
You have a "permission denied" error, which means the user submitting the job is not able to access the file. The directory should grant at least read permission to "other", something like this: -rw-rw-r--
Can you paste the permissions of the directory and the file? The command is
hdfs dfs -ls /your-directory/
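To read a mode string such as the -rw-rw-r-- mentioned above, Python's standard stat module can decode the permission bits. This is a minimal sketch on a hand-built mode value rather than a real file; others_can_read and others_can_write are hypothetical helper names.

```python
import stat


def others_can_read(mode):
    """True if users outside the owner and group may read."""
    return bool(mode & stat.S_IROTH)


def others_can_write(mode):
    """True if users outside the owner and group may write."""
    return bool(mode & stat.S_IWOTH)


# A regular file with permissions rw-rw-r-- (octal 664).
mode = stat.S_IFREG | 0o664
print(stat.filemode(mode))
print(others_can_read(mode), others_can_write(mode))
```

With this mode, "other" users can read but not write, which is the situation a different service user would be in when it tries to create a file under someone else's home directory.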

How to get the progress bar (with stages and tasks) with yarn-cluster master?

When running a Spark Shell query using something like this:
spark-shell yarn --name myQuery -i ./my-query.scala
My query is a simple Spark SQL job in which I read parquet files, run simple queries, and write out parquet files. When running these queries I get a nice progress bar like this:
[Stage7:===========> (14174 + 5) / 62500]
When I create a jar using the exact same query and run it with the following command-line:
spark-submit \
--master yarn-cluster \
--driver-memory 16G \
--queue default \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 32G \
--name MyQuery \
--class com.data.MyQuery \
target/uber-my-query-0.1-SNAPSHOT.jar
I don't get any such progress bar. The command simply says repeatedly
17/10/20 17:52:25 INFO yarn.Client: Application report for application_1507058523816_0443 (state: RUNNING)
The query works fine and the results are fine, but I need feedback on when the process will finish. I have tried the following.
The web page of RUNNING Hadoop applications does have a progress bar, but it basically never moves. Even in the case of the spark-shell query, that progress bar is useless.
I have tried to get the progress bar through the YARN logs, but they are not aggregated until the job is complete. Even then there is no progress bar in the logs.
Is there a way to launch a Spark query in a jar on a cluster and have a progress bar?
When I create a jar using the exact same query and run it with the following command-line (...) I don't get any such progress bar.
The difference between these two seemingly similar Spark executions is the master URL.
In the former Spark execution with spark-shell yarn, the master is YARN in client deploy mode, i.e. the driver runs on the machine where you start spark-shell from.
In the latter Spark execution with spark-submit --master yarn-cluster, the master is YARN in cluster deploy mode (which is actually equivalent to --master yarn --deploy-mode cluster), i.e. the driver runs on a YARN node.
That said, in cluster deploy mode the nice progress bar (which is actually called ConsoleProgressBar) is printed on the machine where the driver runs, not on your local machine.
A simple solution is to replace yarn-cluster with yarn.
ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr.
The progress includes the stage id, the number of completed, active, and total tasks.
ConsoleProgressBar is created when the spark.ui.showConsoleProgress Spark property is turned on and the logging level of the org.apache.spark.SparkContext logger is WARN or higher (i.e. fewer messages are printed out, so there is "space" for ConsoleProgressBar).
You can find more information in Mastering Apache Spark 2's ConsoleProgressBar.
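To show what the bar encodes, here is a sketch that renders a line in the same style as the one quoted in the question: stage id, then completed, active and total tasks. render_progress and its width parameter are hypothetical; the layout is modelled on the quoted output and is not a copy of Spark's ConsoleProgressBar.

```python
def render_progress(stage_id, completed, active, total, width=60):
    """Render one progress line for a single stage."""
    header = f"[Stage {stage_id}:"
    tailer = f"({completed} + {active}) / {total}]"
    # Space left for the bar between the header and the task counters.
    room = max(width - len(header) - len(tailer), 0)
    filled = room * completed // total if total else 0
    # '=' for finished work, a single '>' at the current position, then blanks.
    bar = "".join(
        "=" if i < filled else ">" if i == filled else " "
        for i in range(room)
    )
    return header + bar + tailer


line = render_progress(7, 14174, 5, 62500)
print(line)
```

The numbers are the ones from the question: 14174 tasks completed, 5 running, 62500 in total for stage 7.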

Resources