Use of countByKeyApprox() for Partial manual broadcast hash join - apache-spark

I read about Partial manual broadcast hash join which can be used while joining Pair RDD in Spark. This is suggested to be useful if one key is so large that it can’t fit on a single partition. In this case you can use countByKeyApprox on the large RDD to get an approximate idea of which keys would most benefit from a broadcast.
You then filter the smaller RDD for only these keys, collecting the result locally in a HashMap. Using sc.broadcast you can broadcast the HashMap so that each worker only has one copy and manually perform the join against the HashMap. Using the same HashMap you can then filter your large RDD down to not include the large number of duplicate keys and perform your standard join, unioning it with the result of your manual join. This approach is quite convoluted but may allow you to handle highly skewed data you couldn’t otherwise process.
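A minimal local sketch of this scheme, using plain Python collections in place of RDDs and sc.broadcast (the data, the skew threshold, and all names are made up for illustration):

```python
from collections import Counter

# "Large" pair RDD with one heavily skewed key ("hot"), and a "small" RDD.
large = [("hot", 1)] * 1000 + [("a", 2), ("b", 3)]
small = [("hot", "H"), ("a", "A"), ("c", "C")]

# Stand-in for countByKeyApprox: an exact per-key count here.
counts = Counter(k for k, _ in large)
skewed = {k for k, c in counts.items() if c > 100}  # keys worth broadcasting

# Filter the small side down to the skewed keys and "broadcast" the HashMap.
broadcast_map = {k: v for k, v in small if k in skewed}

# Manual map-side join for the skewed keys...
skew_joined = [(k, (v, broadcast_map[k])) for k, v in large if k in broadcast_map]

# ...then a standard join for the remaining keys, and union the two results.
small_rest = {k: v for k, v in small if k not in skewed}
rest_joined = [(k, (v, small_rest[k]))
               for k, v in large if k not in skewed and k in small_rest]
result = skew_joined + rest_joined
```

In Spark the two joins would run on distributed data; the point of the sketch is the split into a broadcast-map join for skewed keys and a regular join for the rest.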
The question is about the usage of countByKeyApprox(long timeout). What is the unit of this timeout? If I write countByKeyApprox(10), does that mean it will wait for 10 seconds, 10 ms, or something else?

It's in milliseconds
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/rdd/PairRDDFunctions.html#countByKeyApprox-long-double-
Parameters:
timeout - maximum time to wait for the job, in milliseconds
confidence - the desired statistical confidence in the result

Related

Apache Spark Streaming - reduceByKey, groupByKey, aggregateByKey or combineByKey?

I have an application which generates multiple sessions each containing multiple events (in Avro format) over a 10 minute time period - each event will include a session id which could be used to find all the session data. Once I have gathered all this data I would like to then create a single session object.
My plan is to use a window in Spark Streaming to ensure I have the data available in memory for processing - unless there are any other suggestions which would be a good fit to solve my problem.
After reading the Apache Spark documentation it looks like I could achieve this using various different APIs, but I am struggling to work out which one would be the best fit for my problem - so far I have come across reduceByKey / groupByKey / aggregateByKey / combineByKey.
To give you a bit more detail on the session/event data, I expect there to be anywhere in the region of 1m active sessions, with each session producing 5-10 events in a 10-minute period.
It would be good to get some input into which approach is a good fit for gathering all session events and producing a single session object.
Thanks in advance.
@phillip Thanks for the details. Let's go through each operation:
(1). groupByKey - It can help to rank, sort and even aggregate using any key. Performance-wise it is slower because it does not use a combiner.
groupByKey() just groups your dataset based on a key.
If you are doing any aggregation like sum, count, min or max, then it is not preferable.
(2). reduceByKey - It supports only aggregations like sum, min, max. It uses a combiner, so it is faster than groupByKey, and the data shuffled is much less.
reduceByKey() is something like grouping + aggregation.
reduceByKey can be used when we run on a large dataset.
(3). aggregateByKey - Similar to reduceByKey, it supports only aggregations like sum, min, max. It is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregated result of type y. For example, (1,2),(1,4) as input and (1,"six") as output.
Since you require only grouping and no aggregation, I believe you are left with no choice but to use groupByKey().
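To make the aggregateByKey semantics from (3) concrete, here is a minimal sketch in plain Python (no Spark): a seqOp folds one event into a per-partition accumulator, a combOp merges accumulators across partitions, and the result type (a list per session) differs from the input type (int). All names and data are illustrative:

```python
zero = []                          # zero value for each key
seq_op = lambda acc, v: acc + [v]  # fold one event into the accumulator
comb_op = lambda a, b: a + b       # merge two per-partition accumulators

def agg_partition(partition):
    """Map-side: apply seq_op within one partition."""
    out = {}
    for k, v in partition:
        out[k] = seq_op(out.get(k, list(zero)), v)
    return out

def merge_partitions(maps):
    """Reduce-side: merge per-partition maps with comb_op (the shuffle step)."""
    out = {}
    for m in maps:
        for k, acc in m.items():
            out[k] = comb_op(out[k], acc) if k in out else acc
    return out

# Two simulated partitions of (session_id, event) pairs.
part1 = [("s1", 2), ("s2", 7)]
part2 = [("s1", 4)]
sessions = merge_partitions([agg_partition(part1), agg_partition(part2)])
```

For the session use case, the accumulator could build the session object directly instead of a list, which is exactly what the different-output-type flexibility buys you.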

In spark, how to estimate the number of elements in a dataframe quickly

In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, faster than Dataset.count().
Maybe we could calculate this information from the number of partitions of the Dataset, could we?
You could try to use countApprox on the RDD API. Although this also launches a Spark job, it should be faster, as it just gives you an estimate of the true count for a given time you want to spend (milliseconds) and a confidence interval (i.e. the probability that the true value is within that range):
example usage:
val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
val (lowCnt,highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)
You have to play a bit with the parameters timeout and confidence. The higher the timeout, the more accurate the estimated count.
If you have a truly enormous number of records, you can get an approximate count using something like HyperLogLog and this might be faster than count(). However you won't be able to get any result without kicking off a job.
When using Spark there are two kinds of RDD operations: transformations and actions. Roughly speaking, transformations modify an RDD and return a new RDD. Actions calculate or generate some result. Transformations are lazily evaluated, so they don't kick off a job until an action is called at the end of a sequence of transformations.
Because Spark is a distributed batch programming framework, there is a lot of overhead for running jobs. If you need something that feels more like "real time" whatever that means, either use basic Scala (or Python) if your data is small enough, or move to a streaming approach and do something like update a counter as new records flow through.
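As an illustration of the HyperLogLog idea mentioned above, here is a toy sketch in plain Python (256 registers, 32 bits of an MD5 hash, no small- or large-range corrections); it is not Spark's implementation, just the shape of the technique:

```python
import hashlib

P = 8          # register index bits
M = 1 << P     # 256 registers
registers = [0] * M

def hll_add(item: str) -> None:
    # 32 bits of a deterministic hash of the item.
    h = int.from_bytes(hashlib.md5(item.encode()).digest()[:4], "big")
    idx = h >> (32 - P)              # top P bits pick a register
    w = (h << P) & 0xFFFFFFFF        # remaining 32 - P bits, left-aligned
    # Rank = position of the leftmost 1-bit in the remaining bits.
    rank = (32 - P + 1) if w == 0 else (32 - w.bit_length() + 1)
    registers[idx] = max(registers[idx], rank)

def hll_estimate() -> float:
    alpha = 0.7213 / (1 + 1.079 / M)
    z = sum(2.0 ** -r for r in registers)
    return alpha * M * M / z

for i in range(10000):
    hll_add(f"item-{i}")
# hll_estimate() should land within a few percent of 10000
# (standard error is roughly 1.04 / sqrt(M), about 6.5% here).
```

With only 256 integers of state this can be merged across partitions by taking a register-wise max, which is why HLL-style counters suit distributed counting.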

Understanding shuffle managers in Spark

Let me help to clarify shuffle in depth and how Spark uses shuffle managers. I report some very helpful resources:
https://trongkhoanguyenblog.wordpress.com/
https://0x0fff.com/spark-architecture-shuffle/
https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md
Reading them, I understood that there are different shuffle managers. I want to focus on two of them: the hash manager and the sort manager (which is the default manager).
To frame my question, I want to start from a very common transformation:
val rdd = pairRdd.reduceByKey(_ + _)
This transformation causes map-side aggregation and then shuffle for bringing all the same keys into the same partition.
My questions are:
Is map-side aggregation implemented internally using a mapPartitions transformation (aggregating all the same keys with the combiner function), or is it implemented with an AppendOnlyMap or ExternalAppendOnlyMap?
If AppendOnlyMap or ExternalAppendOnlyMap maps are used for aggregating, are they also used for the reduce-side aggregation that happens in the ResultTask?
What exactly is the purpose of these two kinds of maps (AppendOnlyMap and ExternalAppendOnlyMap)?
Are AppendOnlyMap and ExternalAppendOnlyMap used by all shuffle managers, or just by the sort manager?
I read that once an AppendOnlyMap or ExternalAppendOnlyMap is full, it is spilled to a file. How exactly does this step happen?
Using the sort shuffle manager, we use an AppendOnlyMap for aggregating and combining partition records, right? Then, when execution memory fills up, we start sorting the map, spilling it to disk and then cleaning it up. My question is: what is the difference between spill to disk and shuffle write? Both basically consist of creating files on the local file system, but they are treated differently; shuffle-write records are not put into the AppendOnlyMap.
Can you explain in depth what happens when reduceByKey is executed, covering all the steps involved to accomplish that? For example, all the steps for map-side aggregation, shuffling and so on.
Here is a step-by-step description of reduceByKey:
reduceByKey calls combineByKeyWithClassTag, with an identity createCombiner and the same function for mergeValue and mergeCombiners.
combineByKeyWithClassTag creates an Aggregator and returns a ShuffledRDD. Both "map"- and "reduce"-side aggregations use an internal mechanism and don't utilize mapPartitions.
The Aggregator uses ExternalAppendOnlyMap for both combineValuesByKey ("map-side reduction") and combineCombinersByKey ("reduce-side reduction").
Both methods use the ExternalAppendOnlyMap.insertAll method.
ExternalAppendOnlyMap keeps track of spilled parts and the current in-memory map (SizeTrackingAppendOnlyMap).
The insertAll method updates the in-memory map and checks on insert whether the estimated size of the current map exceeds the threshold. It uses the inherited Spillable.maybeSpill method. If the threshold is exceeded, this method calls spill as a side effect, and insertAll initializes a clean SizeTrackingAppendOnlyMap.
spill calls spillMemoryIteratorToDisk, which gets a DiskBlockObjectWriter from the block manager.
The insertAll steps are applied for both map- and reduce-side aggregations with the corresponding Aggregator functions, with the shuffle stage in between.
As of Spark 2.0 there is only the sort-based manager: SPARK-14667
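The map-side combine described in the steps above can be sketched in plain Python (a dict stands in for the append-only map; size tracking and spilling to disk are omitted, and all names are illustrative):

```python
def combine_values_by_key(records, create_combiner, merge_value):
    """One pass over a partition, updating a per-key combiner in place,
    in the spirit of AppendOnlyMap / combineValuesByKey."""
    combiners = {}
    for k, v in records:
        if k in combiners:
            combiners[k] = merge_value(combiners[k], v)  # key seen before
        else:
            combiners[k] = create_combiner(v)            # first occurrence
    return combiners

# reduceByKey(_ + _): identity create_combiner, the same function for merging.
partition = [("a", 1), ("b", 2), ("a", 3)]
combined = combine_values_by_key(partition, lambda v: v, lambda c, v: c + v)
```

The reduce side runs the same loop over the shuffled, already-combined records, which is why both sides can share one insertAll-style mechanism.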

DataFrame orderBy followed by limit in Spark

I have a program that generates a DataFrame on which it will run something like
Select Col1, Col2...
orderBy(ColX) limit(N)
However, when I collect the data at the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I do just the sort or just the top, this problem does not happen. So it happens only when sort and top are combined.
I am wondering why this could be happening. In particular, what is really going on underneath this combination of transforms? How does Spark evaluate a query with both sorting and limit, and what is the corresponding execution plan underneath?
Also, just curious: does Spark handle sort and top differently between DataFrame and RDD?
EDIT:
Sorry, I didn't mean collect;
what I originally meant is that when I call any action to materialize the data, the problem occurs regardless of whether it is collect (or any action sending data back to the driver) or not (so the problem is definitely not the output size).
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:
When you use limit, it simply puts all data on a single partition, no matter how big n is. So while it doesn't explicitly collect, it is almost as bad.
On top of that, orderBy requires a full shuffle with range partitioning, which can result in different issues when the data distribution is skewed.
Finally, when you collect, the results can be larger than the amount of memory available on the driver.
If you collect anyway, there is not much you can improve here. At the end of the day driver memory will be the limiting factor, but there are still some possible improvements:
First of all, don't use limit.
Replace collect with toLocalIterator.
Use either orderBy |> rdd |> zipWithIndex |> filter, or, if the exact number of values is not a hard requirement, filter data directly based on an approximated distribution as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is a handy approxQuantile method).
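The orderBy |> rdd |> zipWithIndex |> filter idea can be sketched on a local collection (plain Python, no Spark; data is illustrative): sort, attach a global index, and keep only rows whose index is below n, instead of funnelling everything through limit's single partition.

```python
n = 3
values = [42, 7, 19, 3, 88, 5]

ordered = sorted(values)                   # stands in for orderBy
indexed = list(enumerate(ordered))         # stands in for rdd.zipWithIndex
top_n = [v for i, v in indexed if i < n]   # filter: keep the first n indices
```

In Spark the filter runs in parallel on each partition, so no single partition ever has to hold all n rows the way limit's implementation does.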

Spark broadcast to all keys - updateStateByKey

updateStateByKey is useful, but what if I want to perform an operation on all existing keys (not only the ones in this RDD)?
Word count for example - is there a way to decrease all words seen so far by 1?
I was thinking of keeping a static class per node with the count information and issuing a broadcast command to take a certain action, but could not find a broadcast-to-all-nodes functionality.
Spark will apply updateStateByKey to all existing keys anyway, whether or not they have new data in the current batch.
It is also good to note that if the updateStateByKey function returns None (in Scala), the key-value pair will be eliminated.
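A minimal sketch of these semantics on plain Python dicts (no Spark; the decaying word count is a made-up example): the update function is called for every existing key, whether or not the key appears in the current batch, and returning None removes the key from the state.

```python
def update_state_by_key(state, batch, update):
    next_state = {}
    for k in set(state) | set(batch):       # all keys, not just this batch's
        new_value = update(batch.get(k, []), state.get(k))
        if new_value is not None:           # None drops the key
            next_state[k] = new_value
    return next_state

def decaying_count(values, current):
    """Word count where every known word also decays by 1 each batch."""
    total = (current or 0) + sum(values) - 1
    return total if total > 0 else None

s1 = update_state_by_key({}, {"spark": [1, 1, 1]}, decaying_count)
s2 = update_state_by_key(s1, {"scala": [1, 1]}, decaying_count)
```

Because the update runs for every key, the "decrease all words seen so far" behavior needs no broadcast step at all.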
