Why does treeAggregate always change the number of partitions to 70? - apache-spark

Background:
I have several billion rows that I need to run logistic regression on. I chose L-BFGS, which succeeded on a smaller data set (one hundred million rows) but always fails on the several billion rows.
Reading the logs, I found this error:
Job aborted due to stage failure: Task 54 in stage 37.0 failed 4 times, most recent failure:
Lost task 54.3 in stage 37.0 (TID 79670, 10.215.155.83):
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Which is triggered by:
treeAggregate at StandardScaler.scala:55
I found that treeAggregate seems to change the number of partitions to 70, even though I set it to 5000. That would explain the OOM, but I don't understand why it happens, and I would like to know how to keep the partition count at 5000 to avoid the OOM (the VM's memory limit is 14 GB and cannot be changed).
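For context, a rough sketch of the partition-collapsing arithmetic inside RDD.treeAggregate (mirroring the logic in Spark core; exact details may vary by version) shows how 5000 input partitions with the default depth of 2 end up as 70 intermediate partitions:

```scala
// Sketch of how treeAggregate shrinks the partition count per level, based on the
// depth-based scheme in Spark's RDD.treeAggregate (illustrative, not the exact source).
val depth = 2                 // treeAggregate's default depth
var numPartitions = 5000      // the partition count set on the input RDD
val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2) // ceil(sqrt(5000)) = 71

while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {
  numPartitions /= scale      // 5000 / 71 = 70 -- the partition count reported in the UI
  // ... Spark then runs an intermediate combine step over `numPartitions` partitions ...
}

println(numPartitions)        // 70
```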

Related

Spark breaks when you need to make a very large shuffle

I'm working with 1 terabyte of data, and at one point I need to join two smaller dataframes. I don't know their exact size, but the join involves more than 200 GB, and I get the error below.
The failure occurs in the middle of the operation, after about 2 hours.
It seems to me to be a memory problem, but that doesn't make sense, because looking at the Spark Ganglia UI, the RAM usage doesn't reach the limit, as shown in the screenshot below.
Does anyone have any idea how I can solve this without decreasing the amount of data analyzed?
My cluster has:
1 x master node n1-highmem-32
4 x slave node n1-highmem-32
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 482.1 failed 4 times, most recent failure: Lost task 3.3 in stage 482.1 (TID 119785, 10.0.101.141, executor 1): java.io.FileNotFoundException: /tmp/spark-83927f3e-4511-1b/3d/shuffle_248_72_0.data.f3838fbc-3d38-4889-b1e9-298f743800d0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
Caused by: java.io.FileNotFoundException: /tmp/spark-83927f3e-4511-1b/3d/shuffle_248_72_0.data.f3838fbc-3d38-4889-b1e9-298f743800d0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
This type of error typically occurs when there are deeper problems with some tasks, like significant data skew. Since you don't provide enough details (please be sure to read How To Ask and How to create a Minimal, Complete, and Verifiable example) or job statistics, the only approach I can think of is to significantly increase the number of shuffle partitions:
```
sqlContext.setConf("spark.sql.shuffle.partitions", "2048")
```
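If you're on the Spark 2.x+ SparkSession API instead of SQLContext, the same setting can be applied like this (a minimal sketch; the app name and the value 2048 are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Build a session with a higher shuffle-partition count up front...
val spark = SparkSession.builder()
  .appName("large-join")                          // illustrative name
  .config("spark.sql.shuffle.partitions", "2048")
  .getOrCreate()

// ...or change it at runtime before the join that triggers the large shuffle:
spark.conf.set("spark.sql.shuffle.partitions", "2048")
```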

Spark Structured streaming - java.lang.OutOfMemoryError: Java heap space

I am getting the exception below when processing input streams with Spark Structured Streaming.
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 22 in stage 5.0 failed 1 times, most recent failure: Lost task
22.0 in stage 5.0 (TID 403, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
I have set the watermark as shown below:
.withWatermark("timestamp", "5 seconds")
.groupBy(window($"timestamp", "1 second"), $"column")
What could be the issue? I have tried changing the trigger from the default to a fixed interval, but I am still facing the problem.
I don't believe this issue is related to watermarks or triggers. OutOfMemoryError occurs for one of two reasons:
Memory leaks. This programming error leads your application to consume more and more memory. Every time the leaking functionality of the application is used, it leaves some objects behind in the Java heap space. Over time the leaked objects consume all of the available Java heap space and trigger the error.
Too much data for the resources assigned to it. Your cluster has a designated threshold and can only hold a certain amount of data. When the volume of data exceeds that threshold, the job which functioned normally before the spike ceases to operate and triggers the java.lang.OutOfMemoryError: Java heap space error.
Your error says task 22.0 in stage 5.0, which means that it completed stages 1 through 4 successfully. To me, that signifies there was too much data for the resources assigned to it, since the job did not die over multiple runs as it would with a memory leak. Try limiting the amount of data being read in with something like spark.readStream.option("maxFilesPerTrigger", "6"), or increase the memory assigned to that cluster.
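A minimal sketch of that throttling applied to a query shaped like the one in the question, assuming a file-based JSON source; the schema, input path, and console sink are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder().appName("throttled-stream").getOrCreate()
import spark.implicits._

// Hypothetical schema matching the columns used in the question.
val schema = new StructType()
  .add("timestamp", TimestampType)
  .add("column", StringType)

val events = spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", "6") // read at most 6 new files per micro-batch
  .json("/data/input")               // hypothetical input path

val counts = events
  .withWatermark("timestamp", "5 seconds")
  .groupBy(window($"timestamp", "1 second"), $"column")
  .count()

val query = counts.writeStream
  .outputMode("append")   // append mode works with the watermark on the aggregation
  .format("console")
  .start()

query.awaitTermination()
```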

Databricks Checksum error while writing to a file

I am running a job on 9 nodes.
All of them write some logging information to files, doing simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However, I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling
o106.save. : java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task
1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me like Spark is somehow failing because of the concurrency and generating the checksum errors.
Is there any known scenario that may be causing it?
There are a couple of things going on here, and they should explain why coalesce may not work.
What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can perform coalesce(3), which would consolidate the partitions on each worker.
What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, if you have more than one worker and you need a single output, you have to use repartition(1), since you want the data on a single partition before writing it out.
Why would coalesce not work?
Spark limits the shuffling during coalesce: you cannot perform a full shuffle (across different workers) when using coalesce, whereas you can when using repartition, although it is an expensive operation.
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
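To illustrate the difference, here is a small sketch comparing the two calls (the DataFrame and partition counts are illustrative, not taken from the original job):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()

// An illustrative DataFrame split across 8 partitions.
val df = spark.range(0L, 1000000L, 1L, 8)

// coalesce merges existing partitions without a full shuffle,
// so data largely stays where it already is:
val merged = df.coalesce(2)
println(merged.rdd.getNumPartitions)   // 2

// repartition performs a full shuffle and redistributes rows,
// which is what funnels everything into a single output partition:
val single = df.repartition(1)
println(single.rdd.getNumPartitions)   // 1
```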

Prediction.io - pio train fails with OutOfMemoryError

We are getting the following error after running "pio train". It runs for about 20 minutes and then fails on Stage 26.
[ERROR] [Executor] Exception in task 0.0 in stage 1.0 (TID 3)
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-4,5,main]
[WARN] [TaskSetManager] Lost task 2.0 in stage 1.0 (TID 5, localhost): java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.io.Output.<init>(Output.java:35)
at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:80)
at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:289)
at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:289)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:293)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Our server has about 30 GB of memory, but about 10 GB is taken by HBase and Elasticsearch.
We are trying to process about 20 million records created by the Universal Recommender.
I've tried the following command to increase executor/driver memory, but it didn't help:
pio train -- --driver-memory 6g --executor-memory 8g
What options could we try to fix the issue? Is it possible to process that number of events on a server with that amount of memory?
Vertical scaling can take you only so far, but if you're on AWS you could try increasing the available memory by stopping the instance and restarting with a larger one.
CF looks at a lot of data. Since Spark gets its speed from doing in-memory calculations (by default), you will need enough memory to hold all of your data spread over all Spark workers, and in your case you have only one.
Another thing that comes to mind is that this is a Kryo error, so you might try increasing the Kryo buffer size a little, which is configured in engine.json.
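A minimal sketch of what raising those Kryo settings looks like as Spark properties; the same keys can typically be placed in the sparkConf section of engine.json, and the values here are illustrative:

```scala
import org.apache.spark.SparkConf

// Illustrative values; tune them to your data size and available memory.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64m")       // initial per-task serialization buffer
  .set("spark.kryoserializer.buffer.max", "512m")  // upper bound the buffer may grow to
```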
Also there is a Google Group for community support here: https://groups.google.com/forum/#!forum/actionml-user

Are failed tasks resubmitted in Apache Spark?

Are failed tasks automatically resubmitted in Apache Spark to the same or another executor?
Yes, but there is a parameter that sets the maximum number of failures:
spark.task.maxFailures (default: 4): Number of individual task failures before giving up on the job. Should be greater than or equal to 1. Number of allowed retries = this value - 1.
I believe failed tasks are resubmitted, because I have seen the same failed task submitted multiple times in the Web UI. However, if the same task fails enough times, the whole job fails:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 120 in stage 91.0 failed 4 times, most recent failure: Lost task 120.3 in stage 91.0
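For reference, a minimal sketch of raising that threshold when building the SparkContext (the value 8 is illustrative; the setting applies per task, across executors):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("retry-tolerant-job")    // illustrative app name
  .set("spark.task.maxFailures", "8")  // allow up to 7 retries of a task before aborting the job

val sc = new SparkContext(conf)
```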
