Pyspark sc.binaryFiles() overwhelms the driver node - apache-spark

I have millions of image files saved in GCS. When I attempt to access them with sc.binaryFiles(...) in PySpark, the driver node gets overwhelmed: all four CPU cores max out at 100% usage and memory hits its 16 GB limit. Meanwhile the worker nodes are all idle: basically 0% CPU usage and minimal memory impact.
I've been able to boil down the issue to just these lines:
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
# Spark boilerplate
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()
images_rdd = sc.binaryFiles('gs://my_bucket/')
images_rdd.count()
I have 52 machines with 4 cores each. I've tried the same with 52 executors and get the same effect, so I think the issue is related to either GCS or binaryFiles(...). I start pyspark like this:
pyspark --num-executors 104 --executor-cores 4 --py-files ~/dependencies.zip,~/model.h5
Eventually pyspark dies with this message:
: java.io.IOException: Failed to listFileInfo for 'gs://my_bucket/'
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.listFileInfo(GoogleCloudStorageFileSystem.java:1095)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.listStatus(GoogleHadoopFileSystemBase.java:999)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1804)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1849)
at org.apache.hadoop.fs.FileSystem$4.<init>(FileSystem.java:2014)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2013)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1996)
at org.apache.hadoop.mapred.LocatedFileStatusFetcher$ProcessInputDirCallable.call(LocatedFileStatusFetcher.java:226)
at org.apache.hadoop.mapred.LocatedFileStatusFetcher$ProcessInputDirCallable.call(LocatedFileStatusFetcher.java:203)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:552)
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:513)
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.LazyExecutorService$ExecutingFutureImpl$Delegated.get(LazyExecutorService.java:529)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.LazyExecutorService$ExecutingFutureImpl$Created.get(LazyExecutorService.java:420)
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.listFileInfo(GoogleCloudStorageFileSystem.java:1075)
... 12 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Why is Spark making the driver the bottleneck and not distributing any of the work to its worker machines? I'm just doing a simple count, and I'm not pulling anything back to the driver. I don't understand how this could have any performance problems.
I'm using Spark version 2.4.4.
Update
As a test, even though this wouldn't help my code, I tried a different method that reads from GCS, and I see the same problems.
images_rdd = sc.wholeTextFiles('gs://my_bucket/')
Is it possible that Spark's integration with GCS involves sending all of the file names back to the driver node before downloading them to the workers?
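For what it's worth, one hedged workaround sketch (not a verified fix for this cluster) is to keep Spark's Hadoop input machinery out of the listing path entirely: enumerate the object names with the google-cloud-storage Python client and fetch the bytes inside mapPartitions on the executors, reusing the sc from the snippet above. The bucket name is a placeholder and the client library is assumed to be installed on the driver and all workers.
from google.cloud import storage  # assumed installed on driver and workers

def fetch_bytes(paths):
    # Runs on the executors; one GCS client per partition.
    client = storage.Client()
    bucket = client.bucket("my_bucket")  # placeholder bucket name
    for path in paths:
        yield path, bucket.blob(path).download_as_bytes()

# Listing still happens in a single process, but only lightweight name strings
# are kept, not the Hadoop FileStatus objects being built in the stack trace above.
names = [blob.name for blob in storage.Client().list_blobs("my_bucket")]
paths_rdd = sc.parallelize(names, numSlices=1000)  # tune the slice count
images_rdd = paths_rdd.mapPartitions(fetch_bytes)
print(images_rdd.count())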

Related

Apache Spark memory configuration with PySpark

I am working on an Apache Spark application in PySpark.
I have looked through many resources but could not understand a couple of things regarding memory allocation.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.master("local[4]")\
.appName("q1 Tutorial") \
.getOrCreate()
I need to configure the memory as well.
The application will run locally in client deploy mode. Some sources say that in this case I should not set the driver memory, only the executor memory, while others say that in PySpark I should not configure driver memory or executor memory at all.
Could you please explain memory configuration in PySpark or point me to some reliable resources?
Thanks in advance!
Most of the computational work is performed on the Spark executors, but when we run operations like collect() or take(), data is transferred to the Spark driver.
It is always recommended to use collect() and take() sparingly, or only on small amounts of data, so they don't become an overhead on the driver.
But if you have a requirement to return a large amount of data using collect() or take(), then you have to increase the driver memory to avoid an OOM exception.
ref : Spark Driver Memory calculation
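A small illustration of that point (a sketch; the DataFrame is just a stand-in):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)  # stand-in DataFrame

head = df.take(20)    # bounded: only 20 rows are pulled back to the driver
# rows = df.collect() # unbounded: every row lands in driver memory, so
#                     # spark.driver.memory must be large enough to hold them all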
Driver memory can be configured via spark.driver.memory.
Executor memory is configured with a combination of spark.executor.memory, which sets the total amount of memory available to each executor, and spark.memory.fraction, which splits the executor's memory between execution and storage.
Note that 300 MB of executor memory is automatically reserved to safeguard against out-of-memory errors.
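A minimal sketch of those settings applied to the local[4] session from the question (the values are placeholders; size them to your machine):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("q1 Tutorial")
    .config("spark.executor.memory", "4g")   # heap for each executor JVM
    .config("spark.memory.fraction", "0.6")  # share of (heap - 300 MB) split between execution and storage
    .getOrCreate()
)
One caveat: in local or client deploy mode the driver JVM is already running by the time builder configs are applied, so spark.driver.memory is normally passed on the command line (for example pyspark --driver-memory 4g) or set in spark-defaults.conf rather than in code.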

Memory leak in Spark Driver

I was using Spark 2.1.1 and upgraded to the latest version, 2.4.4. I observed in the Spark UI that the driver memory increases continuously, and after a long run I got the following error: java.lang.OutOfMemoryError: GC overhead limit exceeded
In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was extremely low, and after the runs of the ContextCleaner and BlockManager the memory decreased.
I also tested Spark versions 2.3.3 and 2.4.3 and saw the same behavior.
HOW TO REPRODUCE THIS BEHAVIOR:
Create a very simple application (streaming count_file.py) in order to reproduce this behavior. The application reads CSV files from a directory, counts the rows and then removes the processed files.
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

target_dir = "..."
spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

while True:
    for f in os.listdir(target_dir):
        # Build the full path; os.listdir() returns bare file names.
        path = os.path.join(target_dir, f)
        df = spark.read.load(path, format="csv")
        print("Number of records: {0}".format(df.count()))
        os.remove(path)
        print("File {0} removed successfully!".format(f))
Submit code:
spark-submit \
  --master spark://xxx.xxx.xx.xxx \
  --deploy-mode client \
  --executor-memory 4g \
  --executor-cores 3 \
  --queue streaming count_file.py
TESTED CASES WITH THE SAME BEHAVIOUR:
Tested with default settings (spark-defaults.conf)
Added spark.cleaner.periodicGC.interval 1min (or less)
Set spark.cleaner.referenceTracking.blocking=false
Ran the application in cluster mode
Increased/decreased the resources of the executors and driver
Tested with extraJavaOptions on the driver and executors: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
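For reference, the two cleaner settings above can also be set in code rather than in spark-defaults.conf; a minimal sketch (illustrative only, not a verified fix for the leak):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("DataframeCount")
    .config("spark.cleaner.periodicGC.interval", "1min")          # run the ContextCleaner's GC more often
    .config("spark.cleaner.referenceTracking.blocking", "false")  # don't block cleanup calls on the driver
    .getOrCreate()
)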
DEPENDENCIES
Operation system: Ubuntu 16.04.3 LTS
Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
Python: Python 2.7.12
Finally, the memory increase in the Spark UI turned out to be a bug in Spark versions newer than 2.3.3. There is a fix.
It will land in Spark 2.4.5+.
Spark related issues:
Spark UI storage memory increasing overtime: https://issues.apache.org/jira/browse/SPARK-29055
Possible memory leak in Spark: https://issues.apache.org/jira/browse/SPARK-29321?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

Encounter SparkException "Cannot broadcast the table that is larger than 8GB"

I am using Spark 2.2.0 for data processing. I am using Dataframe.join to join two dataframes together, but I encountered this stack trace:
18/03/29 11:27:06 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/03/29 11:27:09 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
...........
Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 10 GB
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:86)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:103)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I searched the Internet for this error but didn't find any hints or solutions for fixing it.
Does Spark automatically broadcast a Dataframe as part of the join? I am very surprised by this 8GB limit, because I would have thought Dataframes support "big data" and 8GB is not very big at all.
Thank you very much in advance for your advice on this.
Linh
After some reading, I tried disabling the auto-broadcast and it seemed to work. Change the Spark config with:
'spark.sql.autoBroadcastJoinThreshold': '-1'
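In PySpark that setting can be applied like this (a minimal sketch; -1 disables automatic broadcast joins entirely, and the app name is a placeholder):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("join-without-auto-broadcast")  # placeholder app name
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")  # never auto-broadcast a join side
    .getOrCreate()
)

# Or on an already-running session:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)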
Currently it is a hard limit in Spark that a broadcast variable must be smaller than 8GB. See here.
The 8GB limit is generally big enough. Consider that when you run a job with 100 executors, the Spark driver needs to send the 8GB of data to 100 nodes, resulting in 800GB of network traffic. That cost is much lower if you don't broadcast and use a plain join instead.

Spark Out of Memory Error For MapOutputTracker serializeMapStatuses

I have a Spark job with hundreds of thousands of tasks (300,000 tasks and more) at stage 0, and then during the shuffle the following exception is thrown on the driver side:
util.Utils: Suppressing exception in finally: null
java.lang.OutOfMemoryError at
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) at
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) at
java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145) at
java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894) at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875) at
java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822) at
java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) at
java.io.ObjectOutputStream.close(ObjectOutputStream.java:740) at
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:618) at
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319) at
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:617) at
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:560) at
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:349) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
I checked the ByteArrayOutputStream code, and it throws an out-of-memory error when the array size would exceed Integer.MAX_VALUE, which is about 2GB. That means the serialized map statuses must be smaller than 2GB.
I also checked the MapOutputTracker code: the size of the map statuses is related to the number of tasks in this stage and in the following stage.
I was wondering if anyone has encountered this issue and how you resolved it. My understanding is that we can only reduce the number of tasks, but my job would then get stuck, because fewer partitions delay the computation.
This is likely caused by a single block that exceeds 2GB of memory during a shuffle. This usually means your operation requires larger parallelism, which will reduce the size of any individual block - hopefully below the 2GB limit (which is extremely high).
No Spark shuffle block can be greater than 2 GB.
Spark uses ByteBuffer as the abstraction for storing blocks:
val buf = ByteBuffer.allocate(length.toInt)
ByteBuffer is limited by Integer.MAX_VALUE (2 GB).
Increasing your Parallelism
1) Repartition your data before invoking the operation that causes this error as follows:
DataFrame.repartition(400)
RDD.repartition(400)
2) Pass the number of partitions into the operation as the last argument (where supported):
import org.apache.spark.rdd.PairRDDFunctions
RDD.groupByKey(numPartitions: Int)
RDD.join(other: RDD, numPartitions: Int)
3) Set the default parallelism (partitions) through the SparkConf as follows (NOT YET SUPPORTED in Databricks Cloud):
// create the SparkConf used to create the SparkContext
val conf = new SparkConf()
// set the parallelism/partitions
conf.set("spark.default.parallelism", 400)
// create the SparkContext with the conf
val sc = new SparkContext(conf)
// check the parallelism/partitions
sc.defaultParallelism
4) Set the SQL partitions through SQL as follows (default is 200):
SET spark.sql.shuffle.partitions=400;
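The same four knobs translated to PySpark, as a hedged sketch (400 is just the illustrative value used above, and the tiny RDDs/DataFrame only stand in for real data):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# 3) spark.default.parallelism has to be set before the SparkContext exists.
conf = SparkConf().set("spark.default.parallelism", "400")
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()

# Stand-in data; replace with your own DataFrame/RDDs.
df = spark.range(1000)
rdd = sc.parallelize([("a", 1), ("b", 2)])
other_rdd = sc.parallelize([("a", 10), ("b", 20)])

# 1) Repartition before the operation that hits the 2 GB block limit.
df = df.repartition(400)
rdd = rdd.repartition(400)

# 2) Or pass the partition count directly into the operation where supported.
grouped = rdd.groupByKey(numPartitions=400)
joined = rdd.join(other_rdd, numPartitions=400)

# 4) Shuffle partitions for Spark SQL / DataFrames (default is 200).
spark.conf.set("spark.sql.shuffle.partitions", 400)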
Why 2GB?
This limit exists because of the limit on Java integers: Integer.MAX_VALUE = 2^31 - 1 = 2,147,483,647 bytes ~= 2GB.
Spark's shuffle mechanism currently uses Java ByteArrays to transport the data across the network.
This may be enhanced in the future, either by expanding Spark's shuffle to use larger byte arrays indexed by Longs, by chaining ByteArrays together, or both.
https://forums.databricks.com/questions/1140/im-seeing-an-outofmemoryerror-requested-array-size.html

Prediction.io - pio train fails with OutOfMemoryError

We are getting the following error after running "pio train". It runs for about 20 minutes and then fails at Stage 26.
[ERROR] [Executor] Exception in task 0.0 in stage 1.0 (TID 3)
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-4,5,main]
[WARN] [TaskSetManager] Lost task 2.0 in stage 1.0 (TID 5, localhost): java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.io.Output.<init>(Output.java:35)
at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:80)
at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:289)
at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:289)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:293)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Our server has about 30GB of memory, but about 10GB is taken by HBase + Elasticsearch.
We are trying to process about 20 million records created by the Universal Recommender.
I've tried the following command to increase executor/driver memory, but it didn't help:
pio train -- --driver-memory 6g --executor-memory 8g
What options could we try to fix the issue? Is it possible to process that number of events on a server with that amount of memory?
Vertical scaling can only take you so far, but if you're on AWS you could try increasing the available memory by stopping the instance and restarting it as a larger one.
CF looks at a lot of data. Since Spark gets its speed from doing in-memory calculations (by default), you will need enough memory to hold all of your data spread across all Spark workers, and in your case you have only one.
Another thing that comes to mind is that this is a Kryo error, so you might try increasing the Kryo buffer size a little, which is configured in engine.json.
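For a plain Spark application the corresponding knob is spark.kryoserializer.buffer.max; a hedged sketch with an illustrative value is below. With PredictionIO it would instead be supplied through engine.json or through the extra arguments passed after -- to pio train, as in the command above.
from pyspark.sql import SparkSession

# Illustrative only: raise the cap on Kryo's serialization output buffer.
spark = (
    SparkSession.builder
    .appName("kryo-buffer-example")  # placeholder app name
    .config("spark.kryoserializer.buffer.max", "256m")
    .getOrCreate()
)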
Also there is a Google Group for community support here: https://groups.google.com/forum/#!forum/actionml-user
