I am currently encountering this exception in Spark 2.3, running on Azure HDInsight 3.6 on an 80-node cluster:
java.lang.UnsupportedOperationException: Can not build a HashedRelation that is larger than 8G
at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:623)
at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:570)
at org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:867)
at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:111)
at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec.org$apache$spark$sql$execution$joins$ShuffledHashJoinExec$$buildHashedRelation(ShuffledHashJoinExec.scala:56)
at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec$$anonfun$doExecute$1.apply(ShuffledHashJoinExec.scala:68)
at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec$$anonfun$doExecute$1.apply(ShuffledHashJoinExec.scala:67)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This has occurred multiple times while performing a union of 6 tables, one of which is several GB. However, it does not always occur and I cannot reproduce it reliably. The same union has run on a much larger dataframe, with the same size and number of executors, without failing. On one particular run it failed on all 5 retries, and after setting "spark.sql.join.preferSortMergeJoin" to true, it ran through. Now, trying to reproduce it on a new cluster with everything else the same, I cannot, and it runs as expected.
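For reference, the workaround mentioned above can be applied roughly like this (a PySpark sketch; the job may equally set it through spark-submit or the cluster configuration):

# Minimal sketch of the workaround, assuming an existing SparkSession named `spark`.
# Preferring sort-merge join avoids building the in-memory HashedRelation that
# hits the 8 GB limit in the stack trace above.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")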
Are there any ideas on what could cause this?
Since I was able to resolve this, here is what I found:
One of the dataframes in the union was not cached, and it was causing this issue. That dataframe had a large number of partitions but 0 rows.
This was difficult to discover because it was unexpected. I don't have a link, but there is a known issue of extreme performance degradation when a dataframe has many partitions but 0 rows.
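As a rough illustration only (the names empty_df and big_df and the exact counts are hypothetical), the fix amounted to caching the problematic dataframe and collapsing its empty partitions before the union:

# Hypothetical illustration: `empty_df` is the uncached dataframe that had many
# partitions but 0 rows; `big_df` stands in for the other tables in the union.
empty_df = empty_df.coalesce(1).cache()   # collapse the empty partitions and cache
empty_df.count()                          # materialize the cache

result = big_df.union(empty_df)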
I am working on a dataset of initially 569 MB, calculating the TF-IDF metric. Although I am getting results in the end, I keep getting the error below:
WARN scheduler.TaskSetManager: Lost task 13.0 in stage 11.0 (TID 84, X.X.X.X, executor 0): FetchFailed(null, shuffleId=4, mapId=-1, reduceId=4, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 4
at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:882)
at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:878)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:878)
at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:691)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:103)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have read related posts and have already changed some Spark properties as below:
spark = SparkSession.builder.appName("part_2_task_2") \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.memoryOverhead', '1g') \
    .config('spark.shuffle.io.maxRetries', 5) \
    .config('spark.shuffle.io.retryWait', '30s') \
    .config('spark.network.timeout', '200s') \
    .getOrCreate()
So currently I have the below cluster details:
spark.executor.cores 2
spark.executor.instances 2
spark.executor.memory 2g
spark.executor.memoryOverhead 1g
Moreover, I checked in more detail where the issue comes from using the UI, and found that the failed stage arises from line 126 of my code, which is the join below:
tfidf = tf.join(idf)
and the two RDDs tf and idf are calculated as:
tf = step1.map(lambda x: (x[0][0], (x[0][1], x[0][2], x[0][3], x[1]/x[0][3])))
idf = step1.map(lambda x: (x[0][0], (x[0][2], x[1], 1))) \
           .map(lambda x: (x[0], x[1][2])) \
           .reduceByKey(lambda x, y: x + y) \
           .map(lambda x: (x[0], (x[1], math.log10(number_of_docs / x[1]))))
The RDDs tf and idf have different .count() values, since tf is per document and word whereas idf is per word only, which is why I am joining them. Could that be an issue, meaning I should check whether they are of equal size before joining by using partitioning commands, even though those are costly? If this is not an issue, what would be the ideal cluster properties for processing data of the size mentioned above?
I gave another 1g to the executor memory out of the 4g my virtual machine provides, so I ended up with the following settings:
spark.executor.memory 3g
spark.executor.memoryOverhead 1g
and the Exception disappeared.
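For completeness, the adjusted session builder looks roughly like this (the same options as before, with only the executor memory changed):

spark = SparkSession.builder.appName("part_2_task_2") \
    .config('spark.executor.memory', '3g') \
    .config('spark.executor.memoryOverhead', '1g') \
    .config('spark.shuffle.io.maxRetries', 5) \
    .config('spark.shuffle.io.retryWait', '30s') \
    .config('spark.network.timeout', '200s') \
    .getOrCreate()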
I am still not sure whether this is the best solution, or whether my code needed fixing to overcome this issue of joining two RDDs of different lengths, and hence different partitioning, which may have caused the problem. Any explanations would be much appreciated, since this is my first attempt at an Apache Spark application.
I want to load all parquet files that are stored in a folder structure in S3 AWS.
The folder structure is as follows: S3/bucket_name/folder_1/folder_2/folder_3/year=2019/month/day
What I want is to read all parquet files at once, so I want PySpark to read all data from 2019 for all months and days that are available and then store it in one dataframe (so you get a concatenated/unioned dataframe with all days in 2019).
I am told that these are partitioned files (though I am not sure of this).
Is this possible in PySpark and if so how?
When I try spark.read.parquet('S3/bucket_name/folder_1/folder_2/folder_3/year=2019')
it works. However, when I want to take a look at the Spark dataframe using spark.read.parquet('S3/bucket_name/folder_1/folder_2/folder_3/year=2019').show()
it says:
An error occurred while calling o934.showString.
: org.apache.spark.SparkException:
Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times,
most recent failure:
Lost task 0.3 in stage 36.0 (TID 718, executor 7):
java.lang.UnsupportedOperationException:
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I want to be able to show the dataframe.
Please refer to the "Partition Discovery" part of the documentation:
https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#partition-discovery
In PySpark, you can do this simply as follows:
from pyspark.sql.functions import col

(
    spark.read
    .parquet('S3/bucket_name/folder_1/folder_2/folder_3')
    .filter(col('year') == 2019)
)
So you point the path at the folder where the data is partitioned into subfolders, and you apply the partition filter, which should read only the data from the given year's subfolder.
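If you instead point the reader directly at the year=2019 subfolder, as in the original attempt, you can keep the inferred year column by also setting the basePath option. A sketch using the same paths as above:

df = spark.read \
    .option('basePath', 'S3/bucket_name/folder_1/folder_2/folder_3') \
    .parquet('S3/bucket_name/folder_1/folder_2/folder_3/year=2019')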
I've generated Parquet files using the append save mode in Spark, but when reading those files back, Spark throws Parquet decoding exceptions.
I'm already using the mergeSchema option, but the problem I'm facing is with files belonging to certain partitions; other partitions do not throw any kind of exception.
df = spark.read.parquet("s3://bucket/folder/date=<>/")
org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details:
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:193)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://bucket/key/date=<>/part-00028-625e653b-c000.snappy.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
... 21 more
Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetStringConverter"
at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:34)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
... 25 more
Reading the mentioned file on its own doesn't throw any exception, but reading all the files under the folder throws an exception pointing at that single file. There are other datasets in similar folders, and those don't throw any exceptions.
I'm not able to understand the real cause of this error; is there any option or way to fix it?
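For reference, the mergeSchema read described above looks roughly like this (a sketch reusing the placeholder path from the question); note that mergeSchema can only merge compatible column types, and the ClassCastException above suggests the same column was written with incompatible types in different files, which schema merging alone cannot reconcile:

df = spark.read \
    .option('mergeSchema', 'true') \
    .parquet("s3://bucket/folder/date=<>/")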
When running Spark structured streaming using the library "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0", we keep getting an error regarding current offset fetching:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, qa2-hdp-4.acuityads.org, executor 2): java.lang.AssertionError: assertion failed: latest offset -9223372036854775808 does not equal -1
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.resolveRange(KafkaMicroBatchReader.scala:371)
at org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.<init>(KafkaMicroBatchReader.scala:329)
at org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartition.createPartitionReader(KafkaMicroBatchReader.scala:314)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For some reason, it looks like fetchLatestOffset returned Long.MIN_VALUE for one of the partitions. I checked the structured streaming checkpoint and it was correct; it was the currentAvailableOffset that was set to Long.MIN_VALUE.
Kafka broker version: 1.1.0.
Lib we used:
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0"
How to reproduce:
Basically we started a structured streaming query and subscribed to a topic with 4 partitions, then produced some messages into the topic; the job crashed and logged the stack trace above.
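A minimal setup along the lines described would be the following (shown as a PySpark sketch for brevity; the actual job used the Scala library above, and the broker address and checkpoint path are placeholders):

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "REVENUEEVENT") \
    .load()

query = df.writeStream \
    .format("console") \
    .option("checkpointLocation", "/tmp/checkpoint") \
    .start()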
Also, the committed offsets seem fine, as we see in the logs:
=== Streaming Query ===
Identifier: [id = c46c67ee-3514-4788-8370-a696837b21b1, runId = 31878627-d473-4ee8-955d-d4d3f3f45eb9]
Current Committed Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: {"REVENUEEVENT":{"0":1}}}
Current Available Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: {"REVENUEEVENT":{"0":-9223372036854775808}}}
So Spark streaming recorded the correct value for partition 0, but the current available offset returned from Kafka shows Long.MIN_VALUE.
Found the issue: it is due to an integer overflow inside the Spark structured streaming library. Details are posted here: https://issues.apache.org/jira/browse/SPARK-26718
We are using Apache Beam executed on the Spark runner. Our case is the following: both of the two use cases below cause an OutOfMemory error.
1) Join - joining 2 big tables using Apache Beam, one table of size 120 GB and the other of 60 GB. This causes an OutOfMemory error when groupByKeyOnly() is called internally in GroupCombineFunctions.java.
2) GroupByKey - We are grouping the dataset based on a key like the following.
PCollection>> costBasisRecords = masterDataResult.apply(GroupByKey.create());
This GroupByKey operation also causes OutOfMemory errors.
Could you please give us suggestions so that we can achieve the result?
From reading online, we saw the reduceByKey method - could you please guide us on how we can implement that functionality for the Spark runner?
Error Message:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.reflect.Array.newInstance(Array.java:75)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1897)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
If possible, I would definitely recommend using Combine.perKey, as Lukasz suggests.
If you are unable to do that or if you still run into OOMs, try to decrease partition size by increasing the number of partitions. You can increase the number of shuffle partitions by manually setting the spark.default.parallelism configuration. This is explicitly used to determine the partitioning scheme for groupByKeyOnly shuffles.
It looks like the way to plumb configurations through is via a manually-constructed SparkContextOptions. There's a test case that shows how to do this. Note that this requires your pipeline program to directly link against Spark. For example:
SparkConf conf = new SparkConf().set("spark.default.parallelism", parallelism);
JavaSparkContext jsc = new JavaSparkContext(conf);
SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc);
Pipeline p = Pipeline.create(options);
// ...
NOTE: Spark has its own limitation that all grouped values for a given key must fit in memory on the machine processing that key. If this does not hold for your datasets (i.e., you have very strong key skew), then you will need to combine rather than group by key.
reduceByKey in Spark is similar to Combine.perKey in Apache Beam; see the Beam Programming Guide for examples.
Note that reduceByKey and Combine.perKey will only help if there is an actual reduction per key; otherwise you're just going to hit the same out-of-memory problem. For example, combining all integers per key into a list will not reduce memory usage, but summing the integers per key will.
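As a rough illustration of that distinction, here is a minimal sketch using the Beam Python SDK (for brevity only; in the Java SDK the equivalent is Combine.perKey with a Sum combine function):

import apache_beam as beam

with beam.Pipeline() as p:
    pairs = p | "Create" >> beam.Create([("a", 1), ("a", 2), ("b", 5)])

    # Combining with an associative function (sum) lets the runner pre-combine
    # values on the map side, so only partial sums are shuffled.
    sums = pairs | "SumPerKey" >> beam.CombinePerKey(sum)

    # By contrast, GroupByKey shuffles and materializes every value for a key
    # on a single worker, which is what runs out of memory above.
    grouped = pairs | "GroupPerKey" >> beam.GroupByKey()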