Dataset explain method runs out of memory - apache-spark

I am using Spark 2.4.4. I have a DataFrame with a big DAG (16 stages, 14 of which are cached). When df.explain(true) runs, I get OOM errors no matter how much driver memory I give it (I stopped testing at 16G, since the actual data is smaller than that). I can't remove the explain call because it is in a library I have to use. The UI has no problem displaying the DAG for the last job that uses the DataFrame.
Could there be a cycle in the DAG? The UI shows no cycles. Or what else could be wrong?
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:142)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:287)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:285)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:248)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
at org.apache.spark.sql.execution.command.ExecutedCommandExec$$anonfun$sideEffectResult$1.apply(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec$$anonfun$sideEffectResult$1.apply(commands.scala:70)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.explain(Dataset.scala:484)
...
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:210)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:545)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:544)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:544)
at org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:416)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:692)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:416)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:692)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:416)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:692)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:416)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:692)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:563)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:563)
at org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:416)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:571)

Related

Why does Spark shuffling not spill to disk?

A simple word count program in Spark doesn’t spill to disk and fails with an OOM error. In short:
The environment:
Spark: 2.3.0, Scala 2.11.8
3 x Executor, each: 1 core + 512 MB RAM
Text file: 341 MB
Other configurations are default (spark.memory.fraction = 0.6)
The code:
import org.apache.spark.SparkContext
object WordCount {
  def main(args: Array[String]): Unit = {
    val inPath = args(0)
    val sc = new SparkContext("spark://master:7077", "Word Count ver3")
    val words = sc.textFile(inPath, minPartitions = 20)
      .map(line => line.toLowerCase())
      .flatMap(text => text.split(' '))
    val wc = words.groupBy(word => word)
      .map({ case (groupName, groupList) => (groupName, groupList.size) })
      .count()
  }
}
The error:
2018-05-04 13:46:36 WARN TaskSetManager:66 - Lost task 1.0 in stage 1.0 (TID 21, 192.168.10.107, executor 0): java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.<init>(String.java:325)
at com.esotericsoftware.kryo.io.Input.readAscii(Input.java:598)
at com.esotericsoftware.kryo.io.Input.readString(Input.java:472)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:195)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:184)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:278)
at org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:156)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:188)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:185)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:153)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:90)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The heapdump:
The problems are:
The heap available for execution should be (512 - 300) * 0.6 = 127 MB (since I don’t use cache). Why is the ExternalAppendOnlyMap larger than 380 MB? The object must live in heap memory, so its size cannot exceed the heap size.
ExternalAppendOnlyMap is a spillable class and should spill its data to disk when memory runs short, but in this case it didn’t, which results in an OOM error.
Heap memory of the program is divided into Spark execution memory and user memory. Looking at the heap dump, which objects are stored in which division of the heap?
I really appreciate your time.
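The arithmetic in the first point can be checked with a small sketch. It assumes Spark's unified memory model: 300 MB of reserved memory, with spark.memory.fraction (0.6 here, as in the question) applied to what remains, and the rest left as user memory. The function names are illustrative only, and the JVM usually reports somewhat less usable heap than the configured 512 MB, so the figures are approximate.

```python
# Rough estimate of how a Spark executor heap is divided.
# Assumption: 300 MB reserved, then spark.memory.fraction splits the rest.
RESERVED_MB = 300


def unified_memory_mb(heap_mb, memory_fraction=0.6):
    """Memory shared by execution and storage, in MB."""
    return (heap_mb - RESERVED_MB) * memory_fraction


def user_memory_mb(heap_mb, memory_fraction=0.6):
    """Memory left for user data structures, in MB."""
    return (heap_mb - RESERVED_MB) * (1 - memory_fraction)


heap_mb = 512  # executor memory from the question
unified = unified_memory_mb(heap_mb)
user = user_memory_mb(heap_mb)
print(f"unified (execution + storage): {unified:.1f} MB")
print(f"user memory: {user:.1f} MB")
```

With a region this small, anything that has to be fully materialised on the heap (for example the per-key groups that groupBy builds) has very little room before the garbage collector starts to struggle.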

Spark com.databricks.spark.csv is not able to load a snappy compressed file using node-snappy

I have some csv files on S3 that are compressed using the snappy compression algorithm (using the node-snappy package). I would like to process those files in Spark using com.databricks.spark.csv, but I consistently get an invalid file input error.
code:
file_df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true', codec='snappy', mode='FAILFAST').load('s3://sample.csv.snappy')
error message:
16/09/24 21:57:25 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-0-32-5.ec2.internal): java.lang.InternalError: Could not decompress data. Input is invalid.
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method)
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239)
at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:208)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:255)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:209)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1305)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1305)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This looks like the same problem that is answered here: basically, python snappy isn't compatible with Hadoop snappy.

Spark. ~100 million rows. Size exceeds Integer.MAX_VALUE?

(This is with Spark 2.0 running on a small three machine Amazon EMR cluster)
I have a PySpark job that loads some large text files into a Spark RDD and calls count(), which successfully returns 158,598,155.
Then the job parses each row into a pyspark.sql.Row instance, builds a DataFrame, and does another count. This second count() on the DataFrame causes an exception in Spark internal code: "Size exceeds Integer.MAX_VALUE". The same job works with smaller volumes of data. Can someone explain why/how this would happen?
org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 1.0 failed 4 times, most recent failure: Lost task 22.3 in stage 1.0 (TID 77, ip-172-31-97-24.us-west-2.compute.internal): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:439)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:604)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
PySpark code:
raw_rdd = spark_context.textFile(full_source_path)
# DEBUG: This call to count() is expensive
# This count succeeds and returns 158,598,155
logger.info("raw_rdd count = %d", raw_rdd.count())
logger.info("completed getting raw_rdd count!!!!!!!")
row_rdd = raw_rdd.map(row_parse_function).filter(bool)
data_frame = spark_sql_context.createDataFrame(row_rdd, MySchemaStructType)
data_frame.cache()
# This will trigger the Spark internal error
logger.info("row count = %d", data_frame.count())
The error comes not from the data_frame.count() itself but rather because parsing the rows via row_parse_function yields some integers which don't fit into the specified integer type in MySchemaStructType.
Try changing the integer types in your schema to pyspark.sql.types.LongType(), or alternatively let Spark infer the types by omitting the schema (this, however, can slow down the evaluation).
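If the cause is the one this answer suggests, a quick range check on the parsed values can show whether a column needs LongType. The sketch below is plain Python and relies only on the fact that IntegerType holds 32-bit signed values and LongType holds 64-bit signed values; the helper name and the sample values are hypothetical.

```python
# Value ranges of the two Spark SQL integer types discussed above.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1  # IntegerType
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1  # LongType


def smallest_integer_type(values):
    """Name the narrowest type that can hold every value, or None if none can."""
    values = list(values)
    if all(INT32_MIN <= v <= INT32_MAX for v in values):
        return "IntegerType"
    if all(INT64_MIN <= v <= INT64_MAX for v in values):
        return "LongType"
    return None


# Hypothetical parsed column values; the second one exceeds the 32-bit range.
sample = [42, 3_000_000_000, -7]
print(smallest_integer_type(sample))
```

Running such a check over a sample of parsed rows before building the DataFrame is cheaper than discovering the mismatch inside a cached stage.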

PySpark Standalone: java.lang.IllegalStateException: unread block data

I am fairly new to using pyspark, and I have been trying to run a script that worked fine in local mode with a 1000-row subset of the data, but is now throwing errors in standalone mode with all of the data, which is 1GB. I figured this would happen as more data = more problems, but I am having trouble understanding what is causing this issue. These are the details for my standalone cluster:
3 executors
20GB of memory each
spark.driver.maxResultSize=1GB (added this because I thought it might be the issue, but it didn't solve it)
The script is throwing the error at the stage where I am converting the spark dataframe to a pandas dataframe to parallelize some operations. I am confused that this would cause issues, because the data is only about 1G, and my executors should have much more memory than that. Here's my code snippet - the error is happening at data = data.toPandas():
def num_cruncher(data, cols=[], target='RETAINED', lvl='univariate'):
    if not cols:
        cols = data.columns
        del cols[data.columns.index(target)]
    data = data.toPandas()
    pop_mean = data.mean()[0]
    if lvl=='univariate':
        cols = sc.parallelize(cols)
        all_df = cols.map(lambda x: calculate([x], data, target)).collect()
    elif lvl=='bivariate':
        cols = sc.parallelize(cols)
        cols = cols.cartesian(cols).filter(lambda x: x[0]<x[1])
        all_df = cols.map(lambda x: calculate(list(x), data, target)).collect()
    elif lvl=='trivariate':
        cols = sc.parallelize(cols)
        cols = cols.cartesian(cols).cartesian(cols).filter(lambda x: x[0][0]<x[0][1] and x[0][0]<x[1] and x[0][1]<x[1]).map(lambda x: (x[0][0],x[0][1],x[1]))
        all_df = cols.map(lambda x: calculate(list(x), data, target)).collect()
    all_df = pd.concat(all_df)
    return all_df, pop_mean
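The cartesian-and-filter steps in the bivariate and trivariate branches keep only strictly ascending pairs and triples of column names. The sketch below reproduces that selection in plain Python, with itertools.product standing in for RDD.cartesian, and compares it with itertools.combinations. The column names are hypothetical and assumed to be unique.

```python
from itertools import combinations, product

# Hypothetical, unique column names standing in for data.columns.
cols = sorted(["age", "income", "region", "tenure"])

# Same predicate as the bivariate branch: keep (a, b) only when a < b.
pairs = [p for p in product(cols, cols) if p[0] < p[1]]

# Same predicate and flattening as the trivariate branch: ((a, b), c) -> (a, b, c).
triples = [
    (x[0][0], x[0][1], x[1])
    for x in product(product(cols, cols), cols)
    if x[0][0] < x[0][1] and x[0][0] < x[1] and x[0][1] < x[1]
]

print(len(pairs), len(triples))
```

Under these assumptions the filtered products contain the same tuples as combinations(cols, 2) and combinations(cols, 3), so the list of column groups can be built on the driver without a cartesian product.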
And here's the error log:
16/07/11 09:49:54 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:258)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:310)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:257)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:256)
at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:588)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:577)
at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:170)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:104)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
So my questions are:
Why is giving the workers 20GB of memory not enough for this 1GB dataset?
In general, is it a good idea to load the data into memory like I am doing here or is there any better way to do this?
For whoever might find this post useful: it seems the fix wasn't to give more memory to the workers, but to give more memory to the driver, as mentioned in the comments by @KartikKannapur. So in order to fix this I set:
spark.driver.maxResultSize 3g
spark.driver.memory 8g
spark.executor.memory 4g
Probably overkill, but it does the job now.
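One way to pass the settings above is as --conf flags on the spark-submit command line, which matters for spark.driver.memory in client mode because the driver JVM has already started by the time application code runs. The helper below is only a sketch that builds such a command line; the function name and the script name my_script.py are hypothetical.

```python
# Settings taken from the answer above.
settings = {
    "spark.driver.maxResultSize": "3g",
    "spark.driver.memory": "8g",
    "spark.executor.memory": "4g",
}


def to_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# my_script.py is a placeholder for the actual PySpark script.
command = ["spark-submit"] + to_submit_args(settings) + ["my_script.py"]
print(" ".join(command))
```

The same key/value pairs could instead go into the cluster's defaults file; the point is that they are set before the driver starts.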

Can't run GraphX ConnectedComponents on spark 1.5.1 with large data (~4TB)

I am running a Spark job with Spark 1.5.1, using the following parameters:
--master "yarn"
--deploy-mode "cluster"
--num-executors 200
--driver-memory 14G
--executor-memory 14G
--executor-cores 1
The job runs GraphX ConnectedComponents on large data (~4TB) using the following commands:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val edges = …
val graph = Graph.fromEdgeTuples(edges,0,edgeStorageLevel = StorageLevel.MEMORY_AND_DISK, vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
val components = graph.connectedComponents().vertices
Some of the tasks complete successfully, and some fail with the following errors:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 2
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:456)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:183)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:47)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
and another error:
org.apache.spark.shuffle.FetchFailedException: Connection from phxaishdc9dn1209.stratus.phx.ebay.com/10.115.60.32:40099 closed
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:321)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:306)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:89)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$4$$anonfun$apply$5.apply(ReplicatedVertexView.scala:117)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$4$$anonfun$apply$5.apply(ReplicatedVertexView.scala:115)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection from phxaishdc9dn1209.stratus.phx.ebay.com/10.115.60.32:40099 closed
at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
Please advise,
Thanks in advance,
Arnon
