Spark Yarn /tmp No such file or directory - apache-spark

I am getting errors in my Spark jobs, and they are usually similar to the one shown below. Each node in the cluster has around 256 GB of memory and around 8 cores; I have specified 4 GB of executor memory plus an extra 4 GB of overhead, and a shuffle memory fraction of 0.5, so this does not look like a memory issue. However, I cannot figure out what the cause could be: the error comes up in one stage or another, and when I rerun the job it appears at different points. You can assume we have an infrastructure of around 200+ nodes with decent configuration.
Job aborted due to stage failure: Task 0 in stage 2.0 failed 12 times, most recent failure: Lost task 0.11 in stage 2.0 (TID 27, lgpbd1107.sgp.ladr.com): java.io.FileNotFoundException: /tmp/hadoop-mapr/nm-local-dir/usercache/names/appcache/application_1485048538020_113554/3577094671485456431296_lock (No such file or directory)
I am unable to figure out whether it's an issue with the application or with the infrastructure. Could someone please help?

It is due to the tmpwatch utility, which runs daily on CentOS systems to clean up files under /tmp that have not been accessed recently. The NodeManager service will not recreate the top-level hadoop.tmp.dir (which defaults to /tmp/hadoop-${user.name}) when it launches a job.
Now you have two options:
Option 1: Edit /etc/cron.daily/tmp-watch and exclude /tmp/hadoop-mapr/nm-local-dir/filecache from the daily cleanup.
Option 2: Point the directories somewhere that is not cleaned up, either in
core-site.xml, by adding/changing the value of the hadoop.tmp.dir property (default is /tmp/hadoop-${user.name}),
or in
yarn-site.xml, by adding/changing the value of the yarn.nodemanager.local-dirs property (default is ${hadoop.tmp.dir}/nm-local-dir).
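For example, a sketch of the yarn-site.xml change (the path below is illustrative; pick a directory on a disk with enough space on every node):
```xml
<!-- yarn-site.xml: move the NodeManager local dirs off /tmp so the daily cleanup never touches them.
     /data/yarn/nm-local-dir is an example path, not a required location. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/yarn/nm-local-dir</value>
</property>
```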

Related

Spark breaks when you need to make a very large shuffle

I'm working with 1 terabyte of data, and at one point I need to join two smaller dataframes. I don't know their exact size, but it is more than 200 GB, and I get the error below.
The failure occurs in the middle of the operation, after about 2 hours.
It looks to me like a memory problem, but that doesn't make sense, because looking at the Ganglia UI for Spark, RAM usage never reaches the limit, as shown in the screenshot below.
Does anyone have any idea how I can solve this without decreasing the amount of data analyzed?
My cluster has:
1 x master node n1-highmem-32
4 x slave node n1-highmem-32
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 482.1 failed 4 times, most recent failure: Lost task 3.3 in stage 482.1 (TID 119785, 10.0.101.141, executor 1): java.io.FileNotFoundException: /tmp/spark-83927f3e-4511-1b/3d/shuffle_248_72_0.data.f3838fbc-3d38-4889-b1e9-298f743800d0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
Caused by: java.io.FileNotFoundException: /tmp/spark-83927f3e-4511-1b/3d/shuffle_248_72_0.data.f3838fbc-3d38-4889-b1e9-298f743800d0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
These types of errors typically occur when there are deeper problems with some tasks, like significant data skew. Since you don't provide enough details (please be sure to read How To Ask and How to create a Minimal, Complete, and Verifiable example) or job statistics, the only approach I can think of is to significantly increase the number of shuffle partitions:
```
sqlContext.setConf("spark.sql.shuffle.partitions", 2048)
```
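If skew is the actual culprit, increasing partitions alone may not be enough. A minimal sketch of how you might first check for a skewed join key (`df` stands for one of the joined dataframes and "joinKey" is a hypothetical column name):
```scala
// Count rows per join key; a handful of keys with vastly larger counts than the rest indicates skew.
// `df` and "joinKey" are placeholders for the actual dataframe and join column.
import org.apache.spark.sql.functions.desc

df.groupBy("joinKey")
  .count()                    // adds a "count" column per key
  .orderBy(desc("count"))
  .show(20)
```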

PySpark: Job aborts due to stage failure, but resetting max size isn't recognized

I'm attempting to display a dataframe in PySpark after reading the files in using a function/subroutine. Reading the files in works fine; it's the display that fails (although, due to lazy evaluation, that may not be strictly true).
I get this error:
SparkException: Job aborted due to stage failure: Total size of serialized results of 29381 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
so I do what was suggested at https://forums.databricks.com/questions/66/how-do-i-work-around-this-error-when-using-rddcoll.html:
sqlContext.setConf("spark.driver.maxResultSize", "8g")
sqlContext.getConf("spark.driver.maxResultSize")
However, the bizarre part is that I get the same error back when I re-run the display(df) command.
It's as if Spark is simply ignoring my commands.
I've tried increasing the number of workers and making both the worker type and driver type larger, but neither of these fixed anything.
How can I get this to work? Or is this a bug in Databricks/Spark?
It all depends on your code and how the data is partitioned relative to the cluster size. Increasing spark.driver.maxResultSize is the quick fix; the permanent solution is to change the code or design so that less data is collected. Above all, avoid collecting large amounts of data to the driver node.
OR
You need to change this parameter in the cluster configuration: setting it with sqlContext.setConf at runtime does not take effect, because the driver reads it when the SparkContext is created. Go into the cluster settings, select Spark under Advanced, and paste spark.driver.maxResultSize 0 (for unlimited) or whatever value suits you. Using 0 is not recommended; you should instead optimize the job by repartitioning.
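For a standalone (non-Databricks) application, a sketch of setting it before the session exists (the application name is hypothetical):
```scala
// spark.driver.maxResultSize is read when the SparkContext is created, so set it
// before the session exists (here via the builder), not with sqlContext.setConf afterwards.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("max-result-size-example")              // hypothetical application name
  .config("spark.driver.maxResultSize", "8g")      // applied because no context exists yet
  .getOrCreate()
```
On Databricks, the equivalent is the cluster's Spark config box described above, since the notebook's session is created for you.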
For more details, refer to "Spark Configurations - Application Properties".
Hope this helps. Do let us know if you have any further queries.

Spark Structured streaming - java.lang.OutOfMemoryError: Java heap space

I am getting the below exception when processing input streams using Spark structured streaming.
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 22 in stage 5.0 failed 1 times, most recent failure: Lost task
22.0 in stage 5.0 (TID 403, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
I have handled the watermark as given below:
.withWatermark("timestamp", "5 seconds")
.groupBy(window($"timestamp", "1 second"), $"column")
What could be the issue? I have tried changing the trigger from the default to a fixed interval, but I am still facing the problem.
I don't believe this issue is related to watermarks or triggers. OutOfMemory errors occur for one of two reasons:
Memory leaks. This programming error causes your application to consume more and more memory. Every time the leaking functionality is used, it leaves some objects behind in the Java heap; over time the leaked objects consume all of the available heap space and trigger the error.
Too much data for the resources designated to it. Your cluster has a designated threshold and can only hold a certain amount of data. When the volume of data exceeds that threshold, the job that functioned normally before the spike ceases to operate and throws java.lang.OutOfMemoryError: Java heap space.
Your error says task 22.0 in stage 5.0, which means that stages 1-4 completed successfully. To me, that suggests there was too much data for the resources designated to it, rather than a leak that would kill the job gradually over multiple runs. Try limiting the amount of data being read in with something like spark.readStream.option("maxFilesPerTrigger", "6"), or increase the memory assigned to the cluster.
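For example, a minimal sketch of that option combined with the watermark/window aggregation from the question (the schema, column names, and input path are assumptions for illustration):
```scala
// Bound each micro-batch with maxFilesPerTrigger so a single trigger cannot pull in
// more data than the executors can hold. Schema and paths are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types.{StructType, TimestampType, StringType}

val spark = SparkSession.builder().appName("bounded-stream").getOrCreate()
import spark.implicits._

val schema = new StructType()
  .add("timestamp", TimestampType)
  .add("column", StringType)

val input = spark.readStream
  .schema(schema)                          // file sources require an explicit schema
  .option("maxFilesPerTrigger", "6")       // read at most 6 new files per micro-batch
  .json("/data/input")                     // hypothetical input directory

val counts = input
  .withWatermark("timestamp", "5 seconds")
  .groupBy(window($"timestamp", "1 second"), $"column")
  .count()

val query = counts.writeStream
  .outputMode("update")                    // update mode works with watermarked aggregations
  .format("console")
  .start()
```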

How does hive on spark determine reducer number?

I enabled Hive on Spark according to the Cloudera documentation (1 and 2). I now find that the reducer number behaves unexpectedly, and I wish someone could provide detailed documentation or an explanation of it.
As far as I know, Hive on MR calculates the reducer number from the data volume and hive.exec.reducers.bytes.per.reducer, i.e. the number of bytes each reducer processes, so job parallelism is adjusted automatically. But Hive on Spark seems to treat this parameter differently. Setting it to a very low number (<1K) does indeed increase the reducer number, but no common rule seems to apply across different jobs.
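For concreteness, a rough sketch of that MR-style estimate (the exact formula and defaults vary by Hive version; all numbers here are illustrative, not taken from any real job):
```scala
// Illustrative reducer estimate: bytes feeding the reducers divided by bytes-per-reducer,
// capped by the configured maximum. Values below are placeholders.
val totalInputBytes = 100L * 1024 * 1024 * 1024          // suppose ~100 GB feed the reducers
val bytesPerReducer = 256L * 1024 * 1024                 // hive.exec.reducers.bytes.per.reducer
val maxReducers     = 1009L                              // hive.exec.reducers.max
val reducers = math.min(
  math.ceil(totalInputBytes.toDouble / bytesPerReducer).toLong,
  maxReducers)                                           // => 400 reducers for this example
```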
Below is a segment from the Cloudera tuning documentation on parallelism.
Adjust hive.exec.reducers.bytes.per.reducer to control how much data each reducer processes, and Hive determines an optimal number of partitions, based on the available executors, executor memory settings, the value you set for the property, and other factors. Experiments show that Spark is less sensitive than MapReduce to the value you specify for hive.exec.reducers.bytes.per.reducer, as long as enough tasks are generated to keep all available executors busy
Also, I understand that Spark RDDs spill data to disk when memory is not sufficient. If that is the case, the following error message from Hive on Spark jobs really confuses me.
Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 146, fuxi-luoge-105, executor 34): ExecutorLostFailure (executor 34 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 6.2 GB of 6.0 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

Severe straggler tasks due to Locality Level being "Any" and a Network Fetch on cached RDD

A cached dataset that has been completely read through - successfully - is being reprocessed. A small number (typically 2/204 tasks - 1%) of the tasks may fail on a subsequent pass over the same (still cached) dataset. We are on spark 1.3.1.
The following screenshot shows that - of 204 tasks - the last two seem to have been 'forgotten' by the scheduler.
Is there any way to get more information about these tasks that are in limbo?
All of the other tasks completed within a similar time frame: in particular, the 75th percentile is still within 50% of the median. It is just these last two stragglers that are killing the overall job completion time. Note also that these are not due to record-count skew.
Update: The two stragglers did finally finish, at over 7 minutes (more than 3x longer than any of the other 202 tasks)!
15/08/15 20:04:54 INFO TaskSetManager: Finished task 201.0 in stage 2.0 (TID 601) in 133583 ms on x125 (202/204)
15/08/15 20:09:53 INFO TaskSetManager: Finished task 189.0 in stage 2.0 (TID 610) in 423230 ms on i386 (203/204)
15/08/15 20:10:05 INFO TaskSetManager: Finished task 190.0 in stage 2.0 (TID 611) in 435459 ms on i386 (204/204)
15/08/15 20:10:05 INFO DAGScheduler: Stage 2 (countByKey at MikeFilters386.scala:76) finished in 599.028 s
Suggestions on what to look for or review are appreciated.
Another update: The TYPE has turned out to be Network for those two tasks. What does that mean?
I had a similar issue to yours. Try increasing spark.locality.wait.
If that works, the following might apply to you:
https://issues.apache.org/jira/browse/SPARK-13718#
** ADDED **
Some extra information that I found helpful.
Spark will always initially assign a task to the executor that contains the respective cached RDD partition.
If the task is not accepted within the locality timeouts defined in the Spark config, it will try NODE_LOCAL, RACK_LOCAL, and ANY, in that sequence.
Regardless of whether the underlying data are available locally (HDFS replicas), Spark will always fetch the cached partition from the node that holds it; it will only re-compute it if that executor crashed and the RDD partition is no longer cached. This can, in many cases, cause a network bottleneck on the original straggler node as well.
Have you tried using Spark speculation (spark.speculation true)? Spark will identify these stragglers and relaunch them on another node.
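For reference, a minimal sketch of those two settings applied to a SparkConf before the context is created (the application name and values are illustrative, not recommendations):
```scala
// Raise the locality wait and enable speculative execution before creating the context.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("straggler-tuning")          // hypothetical application name
  .set("spark.locality.wait", "10000")     // milliseconds in Spark 1.3.x; wait longer for a local slot
  .set("spark.speculation", "true")        // re-launch suspected stragglers on other executors

val sc = new SparkContext(conf)
```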
