Apache Spark Streaming - reduceByKey, groupByKey, aggregateByKey or combineByKey?

I have an application which generates multiple sessions each containing multiple events (in Avro format) over a 10 minute time period - each event will include a session id which could be used to find all the session data. Once I have gathered all this data I would like to then create a single session object.
My plan is to use a window in Spark Streaming to ensure I have the data available in memory for processing - unless there are any other suggestions which would be a good fit to solve my problem.
After reading the Apache Spark documentation it looks like I could achieve this using several different APIs, but I am struggling to work out which one would be the best fit for my problem - so far I have come across reduceByKey / groupByKey / aggregateByKey / combineByKey.
To give you a bit more detail on the session / event data, I expect there to be in the region of 1 million active sessions, with each session producing 5 to 10 events in a 10 minute period.
It would be good to get some input into which approach is a good fit for gathering all session events and producing a single session object.
Thanks in advance.

@phillip Thanks for the details. Let's go through each of these operations:
(1). groupByKey - It can help to rank, sort and even aggregate using any key. Performance-wise it is slower because it does not use a combiner, so every value is shuffled.
groupByKey() just groups your dataset based on a key.
If you are doing any aggregation like sum, count, min, max then this is not preferable.
(2). reduceByKey - It supports only aggregations like sum, min, max, where the result has the same type as the values. It uses a combiner, so it is faster than groupByKey and much less data is shuffled.
reduceByKey() is something like grouping + aggregation.
reduceByKey can be used when we run on a large data set.
(3). aggregateByKey - Similar to reduceByKey, it supports aggregations like sum, min, max. It is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregate result of type y. For example, (1,2),(1,4) as input and (1,"six") as output.
If you require only grouping and no aggregation, then I believe you are left with no choice other than to use groupByKey().
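As a rough sketch of how the session objects could be built, assuming a pair RDD keyed by session id. The Event and Session case classes and the SessionBuilder object are hypothetical names for illustration; the real types would come from the Avro schema. Since every event has to end up in the session, both options shuffle roughly the same amount of data here.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical types for illustration; the real ones would be derived from the Avro schema.
case class Event(sessionId: String, timestamp: Long, payload: String)
case class Session(id: String, events: Seq[Event])

object SessionBuilder {
  // seqOp: add one event to the partial list held for a key within a partition.
  def addEvent(acc: List[Event], e: Event): List[Event] = e :: acc

  // combOp: merge partial lists for the same key coming from different partitions.
  def mergeEvents(a: List[Event], b: List[Event]): List[Event] = a ::: b

  // Assemble the final session object, ordering events by timestamp.
  def toSession(id: String, events: Iterable[Event]): Session =
    Session(id, events.toSeq.sortBy(_.timestamp))

  // Option A: aggregateByKey, input type Event, result type List[Event].
  def withAggregateByKey(events: RDD[(String, Event)]): RDD[Session] =
    events
      .aggregateByKey(List.empty[Event])(addEvent, mergeEvents)
      .map { case (id, evs) => toSession(id, evs) }

  // Option B: groupByKey, the direct way to say "give me all values per key".
  def withGroupByKey(events: RDD[(String, Event)]): RDD[Session] =
    events.groupByKey().map { case (id, evs) => toSession(id, evs) }
}
```

On a pair DStream, groupByKeyAndWindow(Minutes(10)) is the windowed counterpart of Option B.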


A question about Spark distributed aggregation

I am reading up on spark from here
At one point the blog says:
consider an app that wants to count the occurrences of each word in a corpus and pull the results into the driver as a map. One approach, which can be accomplished with the aggregate action, is to compute a local map at each partition and then merge the maps at the driver. The alternative approach, which can be accomplished with aggregateByKey, is to perform the count in a fully distributed way, and then simply collectAsMap the results to the driver.
So, as I understand this, the two approaches described are:
Approach 1:
Create a hash map within each executor
Collect key 1 from all the executors on the driver and aggregate
Collect key 2 from all the executors on the driver and aggregate
and so on and so forth
This is where the problem is. I do not think approach 1 ever happens in Spark unless the user is hell-bent on doing it and starts using collect along with filter to get the data key by key on the driver, and then writes code on the driver to merge the results.
Approach 2 (I think this is what usually happens in spark unless you use groupBy wherein the combiner is not run. This is typical reduceBy mechanism):
Compute first level of aggregation on map side
Compute second level of aggregation from all the partially aggregated results from the step 1
Which leads me to believe that I am misunderstanding the approach 1 and what the author is trying to say. Can you please help me understand what the approach 1 in the quoted text is?
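For illustration, the two approaches in the quoted text could be sketched roughly as follows. WordCounts and its helper names are made up for this example. Note that in approach 1 nothing is collected key by key: each partition produces one whole map, and the driver merges those maps.

```scala
import org.apache.spark.rdd.RDD

object WordCounts {
  // Fold one word into a partition-local map.
  def addWord(m: Map[String, Long], w: String): Map[String, Long] =
    m.updated(w, m.getOrElse(w, 0L) + 1L)

  // Merge two maps by summing the counts of shared keys.
  def mergeMaps(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0L) + n) }

  // Approach 1: the aggregate action. Each partition builds one local map;
  // the per-partition maps are sent to the driver, which merges them itself.
  def viaAggregate(words: RDD[String]): Map[String, Long] =
    words.aggregate(Map.empty[String, Long])(addWord, mergeMaps)

  // Approach 2: aggregateByKey. Counts are combined map-side, shuffled by key,
  // and merged on the executors; the driver only receives the final totals.
  def viaAggregateByKey(words: RDD[String]): scala.collection.Map[String, Long] =
    words.map(w => (w, 1L)).aggregateByKey(0L)(_ + _, _ + _).collectAsMap()
}
```

With many partitions and many distinct words, approach 1 makes the driver do all of the merging work, which is the bottleneck the quoted text is warning about.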

Spark Dataset join performance

I receive a Dataset and I am required to join it with another table. The simplest solution that came to my mind was to create a second Dataset for the other table and perform a joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
  val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
  dogs.joinWith(cats, ...)
}
Here my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data as
SELECT * FROM dev_db.cat
and then doing a join at a later stage. Or will the query optimizer directly perform the join without reading the whole table? Is there a better solution?
Here are some suggestions for your case:
a. If you have where, filter, limit, take etc. operations, try to apply them before joining the two datasets. Spark cannot always push these operations below the join, so do it yourself to reduce the number of records on each side as much as possible. Here is an excellent source of information on the Spark optimizer.
b. Try to co-locate the datasets and minimize the shuffled data by using the repartition function. The repartitioning should be based on the keys that participate in the join, for example:
import org.apache.spark.sql.functions.col
val dogsByKey = dogs.repartition(1024, col("key_col1"), col("key_col2"))
dogsByKey.join(cats, Seq("key_col1", "key_col2"), "inner")
c. Try to broadcast the smaller dataset if you are sure that it fits in memory (Spark does this automatically for tables below spark.sql.autoBroadcastJoinThreshold, which you can raise). This can give a noticeable performance boost, since every executor receives a full copy of the smaller dataset and the larger one does not have to be shuffled.
If you can't apply any of the above then Spark doesn't have a way to know which records should be excluded and therefore will scan all the available rows from both datasets.
Run explain on the joined Dataset and check whether predicate pushdown is used. Then you can judge whether your concern is justified.
In general, though, pushdown does take place as long as no complex data types are involved and there are no data type mismatches. You can see this with a simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html
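A minimal sketch of how the plan could be checked, combining an early filter with a broadcast hint. The column names (owner_id, age, dog_name, cat_name), the sample rows and the JoinPlanCheck object are hypothetical, and the example uses in-memory DataFrames rather than the dev_db.cat table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object JoinPlanCheck {
  // Filter first, hint that the (assumed small) cat side should be broadcast,
  // and return the join so the caller can inspect its plan.
  def filteredBroadcastJoin(dogs: DataFrame, cats: DataFrame): DataFrame =
    dogs
      .filter(col("age") > 2) // hypothetical filter column
      .join(broadcast(cats), Seq("owner_id"), "inner") // hypothetical join key

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-plan").getOrCreate()
    import spark.implicits._
    val dogs = Seq((1, "rex", 3), (2, "fido", 1)).toDF("owner_id", "dog_name", "age")
    val cats = Seq((1, "tom")).toDF("owner_id", "cat_name")
    // Prints the parsed, analyzed, optimized and physical plans. Look for
    // BroadcastHashJoin; with a file or table source, the scan node also
    // lists any filters that were pushed down.
    filteredBroadcastJoin(dogs, cats).explain(true)
    spark.stop()
  }
}
```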

Use of countByKeyApprox() for Partial manual broadcast hash join

I read about Partial manual broadcast hash join which can be used while joining Pair RDD in Spark. This is suggested to be useful if one key is so large that it can’t fit on a single partition. In this case you can use countByKeyApprox on the large RDD to get an approximate idea of which keys would most benefit from a broadcast.
You then filter the smaller RDD for only these keys, collecting the result locally in a HashMap. Using sc.broadcast you can broadcast the HashMap so that each worker only has one copy and manually perform the join against the HashMap. Using the same HashMap you can then filter your large RDD down to not include the large number of duplicate keys and perform your standard join, unioning it with the result of your manual join. This approach is quite convoluted but may allow you to handle highly skewed data you couldn’t otherwise process.
The question is about the usage of countByKeyApprox(long timeout). What is the unit of this timeout? If I write countByKeyApprox(10), does that mean it will wait for 10 seconds or 10 ms or something else?
It's in milliseconds.
timeout - maximum time to wait for the job, in milliseconds
confidence - the desired statistical confidence in the result
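The partial manual broadcast hash join described in the question could look roughly like the sketch below. The SkewJoin object, the hotKeys helper and the threshold parameter are assumptions made for illustration, not part of the Spark API; only countByKeyApprox, broadcast, join and union are Spark calls.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object SkewJoin {
  // Pick the keys whose estimated count reaches a threshold (hypothetical helper).
  def hotKeys[K](estimates: scala.collection.Map[K, Double], threshold: Double): Set[K] =
    estimates.collect { case (k, n) if n >= threshold => k }.toSet

  def join[K: ClassTag, V: ClassTag, W: ClassTag](
      large: RDD[(K, V)], small: RDD[(K, W)],
      timeoutMs: Long, threshold: Double): RDD[(K, (V, W))] = {
    // countByKeyApprox returns a PartialResult; the timeout is in milliseconds.
    val estimates = large.countByKeyApprox(timeoutMs).initialValue
      .map { case (k, bounded) => k -> bounded.mean }
    val hot = large.sparkContext.broadcast(hotKeys(estimates, threshold))

    // Collect the small side's rows for the hot keys and broadcast them.
    val smallHot: Map[K, List[W]] = small
      .filter { case (k, _) => hot.value.contains(k) }
      .collect()
      .groupBy(_._1)
      .map { case (k, rows) => k -> rows.map(_._2).toList }
    val lookup = large.sparkContext.broadcast(smallHot)

    // Manual map-side join for the hot keys; the large RDD is not shuffled here.
    val hotJoined = large.flatMap { case (k, v) =>
      lookup.value.getOrElse(k, Nil).map(w => (k, (v, w)))
    }
    // Standard shuffle join for everything else, then union the two results.
    val coldJoined = large
      .filter { case (k, _) => !hot.value.contains(k) }
      .join(small.filter { case (k, _) => !hot.value.contains(k) })
    hotJoined.union(coldJoined)
  }
}
```

If the timeout expires before all tasks finish, some keys may be missing from the estimate; they simply go through the regular join, so the result stays correct and only the skew handling is affected.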

Does joining additional columns in Spark scale horizontally?

I have a dataset with about 2.4M rows, with a unique key for each row. I have performed some complex SQL queries on some other tables, producing a dataset with two columns, a key and the value true. This dataset is about 500 rows. Now I would like to (outer) join this dataset with my original table.
This produces a new table with a very sparse set of values (true in about 500 rows, null elsewhere).
Finally, I would like to do this about 200 times, giving me a final table of about 201 columns (the key, plus the 200 sparse columns).
When I run this, I notice that as it runs it gets considerably slower. The first join takes 2 seconds, then 4s, then 6s, then 10s, then 20s and after about 30 joins the system never recovers. Of course, the actual numbers are irrelevant as that depends on the cluster I'm running, but I'm wondering:
Is this slowdown expected?
I am using parquet as a data storage format (columnar storage) so I was hopeful that adding more columns would scale horizontally, is that a correct assumption?
All the columns I've joined so far are not needed for the Nth join, can they be unloaded from memory?
Are there other things I can do when combining lots of columns in spark?
Calling explain on each join in the loop shows that each join is getting more complex (appears to include all previous joins and it also includes the complex sql queries, even though those have been checkpointed). Is there a way to really checkpoint so each join is just a join? I am actually calling show() after each join, so I assumed the join is actually happening at that point.
Is this slowdown expected?
Yes, to some extent it is. Joins are among the most expensive operations in data-intensive systems (it is no coincidence that products which claim linear scalability usually take joins off the table). Join-like operations in a distributed system typically require data exchange between nodes, which involves high-latency steps such as network transfer and disk I/O.
In Spark SQL there is also the additional cost of computing the execution plan, which grows faster than linearly with the size of the plan.
I am using parquet as a data storage format (columnar storage) so I was hopeful that adding more columns would scale horizontally, is that a correct assumption?
No. Input format doesn't affect join logic at all.
All the columns I've joined so far are not needed for the Nth join, can they be unloaded from memory?
If truly excluded from the final output, they will be pruned from the execution plan. But since you join them for a reason, I assume that is not the case and they are required for the final output.
Is there a way to really checkpoint so each join is just a join? I am actually calling show() after each join, so I assumed the join is actually happening at that point.
show computes only a small subset of data required for the output. It doesn't cache, although shuffle files might be reused.
(appears to include all previous joins and it also includes the complex sql queries, even though those have been checkpointed).
Checkpoints are created only if data is fully computed, and they don't remove stages from the execution plan. If you want to do it explicitly, write the partial result to persistent storage and read it back at the beginning of each iteration (this is probably overkill).
Are there other things I can do when combining lots of columns in spark?
The best thing you can do is to find a way to avoid joins completely. If the key is always the same, then a single shuffle followed by operations on groups / partitions (with byKey methods or window functions) might be a better choice.
However if you
have a dataset with about 2.4M rows
then using a non-distributed system that supports in-place modification might be a much better choice.
In the most naive implementation you can compute each aggregate separately, sort by key and write to disk. Then data can be merged together line by line with negligible memory footprint.
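One way to avoid the repeated joins inside Spark, sketched under the assumption that every small result has a "key" column and a boolean "value" column: stack the small results, pivot them into columns, and join once. This uses pivot rather than the byKey or window approach mentioned above, and SparseColumns and the flags map are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{first, lit}

object SparseColumns {
  // flags: hypothetical map from output column name to a small DataFrame
  // with a "key" column and a boolean "value" column. Must not be empty.
  def addAll(base: DataFrame, flags: Map[String, DataFrame]): DataFrame = {
    // Stack the small results into one long (key, value, name) table.
    val stacked = flags
      .map { case (name, df) => df.select("key", "value").withColumn("name", lit(name)) }
      .reduce(_ union _)
    // Pivot names into columns; passing the names up front avoids an
    // extra job to discover the distinct values.
    val wide = stacked.groupBy("key").pivot("name", flags.keys.toSeq).agg(first("value"))
    // One join instead of one per column.
    base.join(wide, Seq("key"), "left_outer")
  }
}
```

The plan still contains every small query, but as branches of a flat union rather than a chain of roughly 200 nested joins.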

In spark, how to estimate the number of elements in a dataframe quickly

In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, something faster than Dataset.count().
Maybe we could calculate this information from the number of partitions of the Dataset?
You could try countApprox on the RDD API. Although this also launches a Spark job, it should be faster, as it just gives you an estimate of the true count for a given time you are willing to spend (in milliseconds) and a confidence level (i.e. the probability that the true value is within the returned range):
example usage:
val cntInterval = df.rdd.countApprox(timeout = 1000L,confidence = 0.90)
val (lowCnt,highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)
You have to play a bit with the parameters timeout and confidence. The higher the timeout, the more accurate is the estimated count.
If you have a truly enormous number of records, you can get an approximate count using something like HyperLogLog and this might be faster than count(). However you won't be able to get any result without kicking off a job.
When using Spark there are two kinds of RDD operations: transformations and actions. Roughly speaking, transformations modify an RDD and return a new RDD. Actions calculate or generate some result. Transformations are lazily evaluated, so they don't kick off a job until an action is called at the end of a sequence of transformations.
Because Spark is a distributed batch programming framework, there is a lot of overhead for running jobs. If you need something that feels more like "real time", whatever that means, either use basic Scala (or Python) if your data is small enough, or move to a streaming approach and do something like updating a counter as new records flow through.
