I'm trying to measure the maximum size of a variable I can broadcast using Spark broadcast.
I haven't found any explanation of this.
Has anyone measured it? Does Spark have a configuration setting for the broadcast size?
The limit for broadcasting has now been increased to 8 GB. You can find the details here.
It's currently ~2 GB. Anything you broadcast is converted into a Java byte array during serialization, and since Java arrays have a maximum size of Integer.MAX_VALUE you get this limit. There is currently some effort to increase this limit: SPARK-6235
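For illustration, here is a minimal spark-shell style sketch (the names are made up) of broadcasting a local map; the serialized form of whatever you pass to broadcast has to fit into a single Java byte array, which is where the ~2 GB ceiling comes from:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("broadcast-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // A local lookup table, well under the limit in this example.
    val lookup = (1 to 100000).map(i => i -> s"value-$i").toMap
    val lookupBc = sc.broadcast(lookup)          // serialized once, shipped to every executor

    val rdd = sc.parallelize(1 to 10)
    rdd.map(i => lookupBc.value.getOrElse(i, "missing")).collect().foreach(println)

    spark.stop()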
Related
I've read that the maximum size of the Kryo buffer in Spark can be 2048 MB, and that it should be larger than the largest object my program will serialize (source: https://spark.apache.org/docs/latest/tuning.html). But what should I do if my largest object is larger than 2 GB? Do I have to use the Java serializer in that case? Or does the Java serializer also have this 2 GB limitation?
The main reason Kryo cannot handle things larger than 2 GB is that it uses Java primitives, backing its buffer with a Java byte array, and a Java byte array is limited to 2 GB. That is the main reason Kryo has this limitation. The check done in Spark is there to avoid the error happening at execution time, which would create an even larger issue for you to debug and handle in your code.
For more details please take a look here.
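As a hedged illustration (not part of the original answer), the same ceiling shows up in configuration: spark.kryoserializer.buffer.max can be raised towards, but not past, 2048 MB:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Sketch only: raise the Kryo buffer as far as it will go. Values of 2048m or
    // more are rejected because the buffer is backed by a single Java byte array.
    val conf = new SparkConf()
      .setAppName("kryo-buffer-sketch")
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "2047m")
    val spark = SparkSession.builder().config(conf).getOrCreate()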
I wanted to know what will happen if we broadcast the larger table while joining it to the smaller one. Also, if we have two equally large tables, what will happen when we use a broadcast join in that scenario?
There are a few things to consider:
Spark Upper Limit: Spark supports broadcasting tables of up to 8 GB. If your broadcast object is larger than that, it will fail.
Driver and Executor Memory: the table is copied into the memory of the driver and then to the executors, so as long as you have enough memory on both, it should be broadcast successfully.
Performance: if a table is broadcast, a portion of your memory is reserved for it, so whatever is left is used for further operations, which might make them slow (for example, if executor memory is 8 GB and the broadcast variable is 6 GB).
So, to answer your question: the behaviour of broadcast depends on what you broadcast; it doesn't matter whether the table you join it with is large or small. Broadcast is an independent piece of functionality, and Spark uses this functionality in joins.
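For illustration, a minimal sketch (with made-up table names) of an explicit broadcast join; broadcasting the small side is the intended use, and hinting the large side simply asks Spark to ship that large table to every executor, subject to the limits above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("broadcast-join-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val large = (1 to 1000000).toDF("id")                              // stand-in for a large fact table
    val small = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")  // small dimension table

    // Broadcast the small side; the large side is streamed with no shuffle.
    val joined = large.join(broadcast(small), "id")
    joined.explain()   // the plan should show a BroadcastHashJoin
    joined.show()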
When I create a data stream in Spark for incoming data from Kafka, I get the following warning:
WARN TaskSetManager: Stage 1 contains a task of very large size (1057 KB). The maximum recommended task size is 100 KB.
So I think I need to increase the task size. Can we resolve this issue by increasing the number of partitions of the RDD? And how is a stage divided into tasks, and how can we configure the size of these tasks?
Thanks in advance.
Can we resolve this issue by increasing the number of partitions of the RDD?
Not at all. Task size is the amount of data that is sent to the executor; this includes the function definition and the serialized closure. Modifying the splits won't help you here.
In general this warning is not critical and I wouldn't worry too much, but it is a hint that you should take another look at your code:
Do you reference large objects in actions / transformations? If yes, consider using broadcast variables (see the sketch after this list).
Are you sure you send only the things you expect to send, and not the enclosing scope (including large objects)? If the problem is there, work on the structure of your code.
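As a rough sketch of the first point (the names are illustrative only), here is the same lookup used once via closure capture and once via a broadcast variable:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("task-size-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val bigLookup = (1 to 500000).map(i => i -> s"v$i").toMap
    val rdd = sc.parallelize(1 to 100)

    // Closure capture: bigLookup is serialized into every task, inflating the task size.
    val viaClosure = rdd.map(i => bigLookup.getOrElse(i, "?"))

    // Broadcast variable: bigLookup is shipped once per executor, so tasks stay small.
    val bigLookupBc = sc.broadcast(bigLookup)
    val viaBroadcast = rdd.map(i => bigLookupBc.value.getOrElse(i, "?"))

    println(viaClosure.count() + viaBroadcast.count())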
I have a large dataset of 50 million rows with about 40 columns of floats.
For custom transformation reasons, I am trying to collect all float values per column using the collect_list() function of PySpark, with the following pseudocode:
from pyspark.sql.functions import collect_list

# set_values(...) stores the collected column values in my internal structure (defined elsewhere).
for column in columns:
    set_values(column, df.select(collect_list(column)).first()[0])
For each column, it executes the collect_list() function and sets the values into some other internal structure.
I am running this on the aforementioned standalone cluster with 2 hosts of 8 cores and 64 GB RAM, allocating at most 30 GB and 6 cores for 1 executor per host, and I am getting the following exception during execution, which I suspect has to do with the size of the collected array.
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
I have tried multiple configurations in spark-defaults.conf, including allocating more memory, changing the number of partitions and the parallelism, and even the Java options, but still no luck.
So my assumption is that collect_list() is deeply tied to the executor/driver resources on larger datasets; or does it have nothing to do with these?
Are there any settings I could use to help me eliminate this issue, or do I have to use the collect() function instead?
collect_list is not better than just calling collect in your case. Both are an incredibly bad idea for large datasets and have very few practical applications.
Both require an amount of memory proportional to the number of records, and collect_list just adds the overhead of a shuffle.
In other words, if you don't have a choice and you need a local structure, use select and collect and increase the driver memory. It won't make things any worse:
df.select(column).rdd.map(lambda x: x[0]).collect()
When using the DataFrame broadcast function or the SparkContext broadcast function, what is the maximum object size that can be dispatched to all executors?
broadcast function:
The default is 10 MB, but we have used up to 300 MB; this is controlled by spark.sql.autoBroadcastJoinThreshold.
AFAIK, it all depends on the memory available, so there is no definite answer for this. What I would say is that it should be smaller than the large DataFrame, and you can estimate the size of a large or small DataFrame like below...
import org.apache.spark.util.SizeEstimator
logInfo(SizeEstimator.estimate(yourlargeorsmalldataframehere))
Based on this you can pass a broadcast hint to the framework.
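For illustration, a hedged sketch (the table names are made up) of raising the auto-broadcast threshold and of hinting a specific side explicitly, independent of the threshold:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("broadcast-threshold-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Raise the auto-broadcast threshold from the 10 MB default to 300 MB (value is in bytes).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (300L * 1024 * 1024).toString)

    val fact = (1 to 100000).toDF("id")
    val dim  = Seq((1, "x"), (2, "y")).toDF("id", "name")

    // Explicit hint: this side is broadcast even if its estimated size is above the threshold.
    val joined = fact.join(dim.hint("broadcast"), "id")
    joined.explain()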
Also have a look at the Scala doc from sql/execution/SparkStrategies.scala, which says:
Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold, or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides are below the threshold, broadcast the smaller side. If neither is smaller, BHJ is not used.
Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
Sort merge: if the matching join keys are sortable.
If there are no joining keys, join implementations are chosen with the following precedence:
BroadcastNestedLoopJoin: if one side of the join could be broadcasted
CartesianProduct: for inner join
BroadcastNestedLoopJoin
Also have a look at other-configuration-options
SparkContext.broadcast (TorrentBroadcast):
The broadcast shared variable also has a property, spark.broadcast.blockSize, which defaults to 4 MB.
AFAIK there is no hard limit I have seen for this either...
For further information please see TorrentBroadcast.scala.
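As a small sketch (the settings shown are assumptions for illustration), spark.broadcast.blockSize only controls the chunk size TorrentBroadcast uses to ship a broadcast variable, not the total size; it can be set when building the context:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .setAppName("torrent-broadcast-sketch")
      .setMaster("local[*]")
      .set("spark.broadcast.blockSize", "8m")   // default is 4m; this only changes the piece size
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val sc = spark.sparkContext

    val shared = sc.broadcast(Array.fill(1000000)(1.0))   // arbitrary example payload
    println(sc.parallelize(1 to 4).map(_ => shared.value.length).sum())
    spark.stop()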
EDIT :
However, you can have a look at the 2 GB issue, even though that was not officially declared in the docs (I was not able to see anything of this kind in the docs).
Please look at SPARK-6235, which is in the "IN PROGRESS" state, and SPARK-6235_Design_V0.02.pdf.
As of Spark 2.4, there's an upper limit of 8 GB. Source Code
Update :
The 8 GB limit is still valid for Spark 3.2.1. Source Code
Update:
Still valid for Spark 3.4. Source code
As mentioned above, the upper limit is 8 GB. But when you have several files you want to broadcast, Spark pushes all the data files to the driver. The driver joins those files and pushes them to the executor nodes. In this process, if the driver's available memory is less than the combined broadcast size, you will end up with an out-of-memory error.
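As a rough sanity check (the collection names below are made up), you can estimate the combined size of everything you plan to broadcast with SizeEstimator and compare it against the driver heap before calling broadcast:

    import org.apache.spark.util.SizeEstimator

    // Illustrative local structures standing in for the "several files" loaded on the driver.
    val lookupA: Map[Int, String] = (1 to 100000).map(i => i -> s"a$i").toMap
    val lookupB: Map[Int, String] = (1 to 100000).map(i => i -> s"b$i").toMap

    val combinedBytes   = Seq(lookupA, lookupB).map(o => SizeEstimator.estimate(o)).sum
    val driverHeapBytes = Runtime.getRuntime.maxMemory

    // Rough safety margin (an assumption, not a Spark rule): keep the combined
    // payload well under the driver heap before calling sc.broadcast on each piece.
    require(combinedBytes < driverHeapBytes / 2,
      s"Combined broadcast size $combinedBytes bytes is too close to driver heap $driverHeapBytes bytes")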