Encounter SparkException "Cannot broadcast the table that is larger than 8GB" - apache-spark

I am using Spark 2.2.0 to do data processing. I am using Dataframe.join to join 2 dataframes together, however I encountered this stack trace:
18/03/29 11:27:06 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/03/29 11:27:09 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
...........
Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 10 GB
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:86)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:103)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I searched on Internet for this error, but didn't get any hint or solution how to fix this.
Does Spark automatically broadcast Dataframe as part of the join? I am very surprise with this 8GB limit because I would have thought Dataframe supports "big data" and 8GB is not very big at all.
Thank you very much in advance for your advice on this.
Linh

After some reading, I've tried to disable the auto-broadcast and it seemed to work. Change Spark config with:
'spark.sql.autoBroadcastJoinThreshold': '-1'

Currently it is a hard limit in spark that the broadcast variable size should be less than 8GB. See here.
The 8GB size is generally big enough. If you consider that you re running a job with 100 executors, spark driver needs to send the 8GB data to 100 Nodes resulting 800GB network traffic. This cost will be much less if you don't broadcast and use simple join.

Related

Pyspark sc.binaryFiles() overwhelms the driver node

I have millions of image files saved in GCS. When I attempt to access them with sc.binaryFiles(...) in pyspark the driver node gets overwhelmed: all four CPU cores max out at 100% CPU usage and memory maxes out at 16 GB too. However, the worker nodes are all idle: basically 0% CPU usage and minimal memory impact.
I've been able to boil down the issue to just these lines:
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
# Spark boilerplate
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()
images_rdd = sc.binaryFiles('gs://my_bucket/')
images_rdd.count()
I start pyspark like this. I have 52 machines, with 4 cores each. I've tried the same with 52 executors, and I get the same effect, so I think the issue is related to either GCS or binaryFiles(...).
pyspark --num-executors 104 --executor-cores 4 --py-files ~/dependencies.zip,~/model.h5
Eventually pyspark dies with this message:
: java.io.IOException: Failed to listFileInfo for 'gs://my_bucket/'
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.listFileInfo(GoogleCloudStorageFileSystem.java:1095)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.listStatus(GoogleHadoopFileSystemBase.java:999)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1804)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1849)
at org.apache.hadoop.fs.FileSystem$4.<init>(FileSystem.java:2014)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2013)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1996)
at org.apache.hadoop.mapred.LocatedFileStatusFetcher$ProcessInputDirCallable.call(LocatedFileStatusFetcher.java:226)
at org.apache.hadoop.mapred.LocatedFileStatusFetcher$ProcessInputDirCallable.call(LocatedFileStatusFetcher.java:203)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:552)
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:513)
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.LazyExecutorService$ExecutingFutureImpl$Delegated.get(LazyExecutorService.java:529)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.LazyExecutorService$ExecutingFutureImpl$Created.get(LazyExecutorService.java:420)
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.listFileInfo(GoogleCloudStorageFileSystem.java:1075)
... 12 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Why is Spark making the driver the bottleneck and not distributing any of the work to its worker machines? I'm just doing a simple count, and I'm not pulling anything back to the driver. I don't understand how this could have any performance problems.
I'm using Spark version 2.4.4.
Update
As a test, even though this wouldn't help my code, I tried a different method that reads from GCS, and I see the same problems.
images_rdd = sc.wholeTextFiles('gs://my_bucket/')
Is it possible that Spark's integration with GCS involves sending all of the file names back to the driver node before downloading them to the workers?

Spark context stopped while waiting for backend

I have a long-running EMR step that executes spark-submit on EMR client mode. Between job executions, I manually restart the Spark context before the next execution if any configuration changes, like --executor-memory.
I'm running into the following exception when I try to restart the context with the new configuration with
currentSparkSession.close();
return SparkSession.builder().config(newConfig).getOrCreate();
19/05/23 15:52:35 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Spark context stopped while waiting for backend
at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:689)
at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:186)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:923)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:915)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:915)
.
.
.
19/05/23 15:52:35 INFO SparkContext: SparkContext already stopped.
19/05/23 15:52:35 WARN TransportChannelHandler: Exception in connection from /172.31.0.165:42556
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
I tried making the thread sleep a little in case there needs to be some time between the stop and start like:
currentSparkSession.close();
Thread.sleep(5000); // Sleep 5 seconds
return SparkSession.builder().config(newConfig).getOrCreate();
but that doesn't work either. I looked at the Spark source code and it looks like currentSparkSession.close() won't return until it's actually stopped anyways, so making the Thread sleep doesn't do anything.
I also see this in the container logs:
Error occurred during initialization of VM
Initial heap size set to a larger value than the maximum heap size
End of LogType:stdout
which confuses me because the only configured I changed between executions was --executor-memory, and I actually DECREASED it instead of increasing.
I've found similar questions on this site like Apache Spark running spark-shell on YARN error, but these suggestions look like they're essentially just turning off some resource manager validation checks that don't look very safe to me. Any suggestions?
This is because I tried sending a request with a lower --executor-memory (which happens to set Xmx, max heap size) than Xms (initial heap size), which was configured on the initial spark submit. The exception was thrown since max heap size can never be smaller than initial heap size.

Apache Spark Driver hangs when DataFrame broadcast fails due to OutOfMemoryError

I tried to broadcast a DataFrame which turned out to be larger than spark.sql.autoBroadcastJoinThreshold, and the driver logged
Exception in thread "broadcast-exchange-0" java.lang.OutOfMemoryError Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can...
However, instead of returning to Driver thread and fail, the app just hangs and the driver is stuck at:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:367)
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
...
...
spark.sql.broadcastTimeout is set to a quite high number due to other historic issues we had, and indeed the driver failed on timeout eventually, but still I wonder if this is the expected behavior? I tried to get my head around ThreadUtils.awaitResult but I can't find evidence that this is behavior is (explicitly) expected.
Can anyone confirm this is not a bug?

Elasticsearch write using Spark

I'm creating a document collection in Spark as an RDD and using the Spark read/write library from Elasticsearch. The Cluster that creates the collection is large so when it writes to ES I get the errors below indicating ES is overloaded, which does not surprise me. This does not seem to fail the job. The tasks may being retried and eventually succeed. In the Spark GUI the job is reported as having finishing successfully.
is there a way to somehow throttle the ES writing lib to avoid the retries (I can't change the cluster size)?
Do these errors mean that some data was not written to the index?
Here is one of many reported task failure errors, but again no job failure is reported:
2017-03-20 10:48:27,745 WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2] - Lost task 568.1 in stage 81.0 (TID 18982, ip-172-16-2-76.ec2.internal): org.apache.spark.util.TaskCompletionListenerException: Could not write all entries [41/87360] (maybe ES was overloaded?). Bailing out...
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:112)
at org.apache.spark.scheduler.Task.run(Task.scala:102)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The lib I'm using is
org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.2"
Can you follow this link - https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
In the Spark conf property or in your elastic search property you need to increase the max number of records which can be dumped in a single post, and that should solve your problem.

spark streaming assertion failed: Failed to get records for spark-executor-a-group a-topic 7 244723248 after polling for 4096

Spark Streaming Problem with Kafka DirectStream:
spark streaming assertion failed: Failed to get records for
spark-executor-a-group a-topic 7 244723248 after polling for 4096
Tried:
1) Adjust increasing spark.streaming.kafka.consumer.poll.ms
-- from 512 to 4096, less failed, but even 10s the failed still exists
2) Adjust executor memory from 1G to 2G
-- partly work, much less failed
3) https://issues.apache.org/jira/browse/SPARK-19275
-- still got failed when streaming durations all less than 8s ("session.timeout.ms" -> "30000")
4) Try Spark 2.1
-- problem still there
with Scala 2.11.8, Kafka version : 0.10.0.0, Spark version : 2.0.2
Spark configs
.config("spark.cores.max", "4")
.config("spark.default.parallelism", "2")
.config("spark.streaming.backpressure.enabled", "true")
.config("spark.streaming.receiver.maxRate", "1024")
.config("spark.streaming.kafka.maxRatePerPartition", "256")
.config("spark.streaming.kafka.consumer.poll.ms", "4096")
.config("spark.streaming.concurrentJobs", "2")
using spark-streaming-kafka-0-10-assembly_2.11-2.1.0.jar
Error stacks:
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.foreach(KafkaRDD.scala:194)
...
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:108)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:142)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:108)
...
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Losing 1%+ blocks datum from Kafka with this failure :( pls help!
Current solution:
Increase num.network.threads in kafka/config/server.properties, default is 3
Increase spark.streaming.kafka.consumer.poll.ms value ~! a large one ...
without config spark.streaming.kafka.consumer.poll.ms, it's using the spark.network.timeout, which is 120s -- causing some problem
Optional step: Decrease the "max.poll.records", default is 500
Optional step: use Future{} to run time-cost-task in parallel
A solution that worked for me:
Decrease the Kafka Consumer max.partition.fetch.bytes.
Reasoning: When the consumer polls for records from Kafka, it prefetches data up to this amount. If the network is slow, it may not be able to prefetch the full amount and the poll in the CachedKafakConsumer will timeout.

Resources