A question about spark distributed aggregation - apache-spark

I am reading up on spark from here
At one point the blog says:
consider an app that wants to count the occurrences of each word in a corpus and pull the results into the driver as a map. One approach, which can be accomplished with the aggregate action, is to compute a local map at each partition and then merge the maps at the driver. The alternative approach, which can be accomplished with aggregateByKey, is to perform the count in a fully distributed way, and then simply collectAsMap the results to the driver.
So, as I understand this, the two approaches described are:
Approach 1:
Create a hash map within each executor
Collect key 1 from all the executors on the driver and aggregate
Collect key 2 from all the executors on the driver and aggregate
and so on and so forth
This is where my problem is. I do not think approach 1 ever happens in Spark unless the user is hell-bent on doing it, using collect along with filter to pull the data key by key onto the driver and then writing code on the driver to merge the results.
Approach 2 (I think this is what usually happens in Spark unless you use groupByKey, where no combiner is run; this is the typical reduceByKey mechanism):
Compute first level of aggregation on map side
Shuffle
Compute second level of aggregation from all the partially aggregated results of step 1
Which leads me to believe that I am misunderstanding approach 1 and what the author is trying to say. Can you please help me understand what approach 1 in the quoted text is?
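For reference, here is a minimal sketch of how I read the two quoted approaches, assuming a SparkContext sc and a hypothetical corpus path (both are placeholders, not from the blog):
import scala.collection.mutable
import org.apache.spark.rdd.RDD

val words: RDD[String] = sc.textFile("hdfs:///corpus.txt").flatMap(_.split(" "))

// Approach 1: aggregate builds a local map per partition and the partition maps
// are then merged into a single map at the driver.
val viaAggregate: mutable.Map[String, Long] = words.aggregate(mutable.Map.empty[String, Long])(
  (m, w)   => { m(w) = m.getOrElse(w, 0L) + 1L; m },                                // fold a word into the partition-local map
  (m1, m2) => { m2.foreach { case (w, c) => m1(w) = m1.getOrElse(w, 0L) + c }; m1 } // merge two partition maps
)

// Approach 2: aggregateByKey counts in a fully distributed way (map-side combine,
// shuffle, final combine) and only the small per-word result is collected.
val viaAggregateByKey: collection.Map[String, Long] = words
  .map((_, 1L))
  .aggregateByKey(0L)(_ + _, _ + _)
  .collectAsMap()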

Related

Apache Spark Streaming - reduceByKey, groupByKey, aggregateByKey or combineByKey?

I have an application which generates multiple sessions each containing multiple events (in Avro format) over a 10 minute time period - each event will include a session id which could be used to find all the session data. Once I have gathered all this data I would like to then create a single session object.
My plan is to use a window in Spark Streaming to ensure I have the data available in memory for processing - unless there are any other suggestions which would be a good fit to solve my problem.
After reading the Apache Spark documentation it looks like I could achieve this using various different APIs, but I am struggling to work out which one would be the best fit for my problem - so far I have come across reduceByKey / groupByKey / aggregateByKey / combineByKey.
To give you a bit more detail on the session / event data, I expect there to be anywhere in the region of 1m active sessions, with each session producing 5-10 events in a 10 minute period.
It would be good to get some input into which approach is a good fit for gathering all session events and producing a single session object.
Thanks in advance.
#phillip Thanks for the details. Let's go into the details of each of these operations:
(1). groupByKey - It can help to rank, sort, and even aggregate using any key. Performance-wise it is slower because it does not use a combiner.
groupByKey() simply groups your dataset based on a key.
If you are doing any aggregation such as sum, count, min, or max, it is not preferable.
(2). reduceByKey - It supports only aggregations such as sum, min, and max. It uses a combiner, so it is faster than groupByKey and much less data is shuffled.
reduceByKey() is something like grouping + aggregation.
reduceByKey can be used when running on a large data set.
(3). aggregateByKey - Similar to reduceByKey, it supports only aggregations such as sum, min, and max. It is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregated result of type y. For example, (1,2), (1,4) as input and (1,"six") as output.
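As a minimal sketch of that type change (my own illustration, with made-up data), the input values below are Ints while the aggregated result per key is a Set[Int]:
import org.apache.spark.rdd.RDD

val pairs: RDD[(Int, Int)] = sc.parallelize(Seq((1, 2), (1, 4), (2, 3)))

val valueSets: RDD[(Int, Set[Int])] = pairs.aggregateByKey(Set.empty[Int])(
  (set, v) => set + v,   // seqOp: fold an Int value into the partition-local Set
  (s1, s2) => s1 ++ s2   // combOp: merge Sets coming from different partitions
)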
I believe you require only grouping and no aggregation, so you are left with no choice but to use groupByKey().
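For your session case, a rough sketch (with hypothetical Event and Session types, and an events DStream assumed to be parsed from your windowed Avro records) could look like this:
import org.apache.spark.streaming.dstream.DStream

case class Event(sessionId: String, payload: String)   // placeholder event shape
case class Session(id: String, events: Seq[Event])     // the single session object you want

// assumes `events: DStream[Event]` built from the windowed Avro records
def toSessions(events: DStream[Event]): DStream[Session] =
  events
    .map(e => (e.sessionId, e))   // key every event by its session id
    .groupByKey()                 // pure grouping, no aggregation (hence no combiner)
    .map { case (id, evs) => Session(id, evs.toSeq) }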

What is your approach for querying Cassandra with Spark (in R or Python)?

I am working with about a TB of data stored in Cassandra and trying to query it using Spark and R (could be Python).
My preference for querying the data would be to abstract the Cassandra table I'm querying as a Spark RDD (using sparklyr and the spark-cassandra-connector with spark-sql) and simply do an inner join on the column of interest (it is a partition key column). The company I'm working with says that this approach is a bad idea, as it will translate into an IN clause in CQL and thus cause a big slow-down.
Instead I'm using their preferred method: write a closure that extracts the data for a single id in the partition key using a JDBC connection, and then apply that closure 200k times, once for each id I'm interested in. I use spark_apply to apply that closure in parallel on each executor. I also set spark.executor.cores to 1 so I get a lot of parallelism.
I'm having a lot of trouble with this approach and am wondering what the best practice is. Is it true that Spark SQL does not account for the slowdown associated with pulling multiple ids from a partition key column (IN operator)?
A few points here:
Working with Spark SQL is not always the most performant option; the optimizer might not always do as good a job as code you write yourself.
Check the logs carefully during your work and always check how your high-level queries are translated into CQL queries. In particular, make sure you avoid a full table scan if you can.
If you are joining on the partition key, you should look into leveraging the methods repartitionByCassandraReplica and joinWithCassandraTable (see the sketch after these notes). Have a look at the official doc here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md and Tip 4 of this blog post: https://www.instaclustr.com/cassandra-connector-for-spark-5-tips-for-success/
Final note: it's quite common to have two Cassandra data centers when using Spark. The first one serves regular reads / writes, the second one is used for running Spark. It's a separation-of-concerns best practice (at the cost of an additional DC, of course).
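Here is a rough sketch of that join in the connector's Scala API, under assumed names (keyspace "ks", table "events", single partition key column "id"); it is not your sparklyr code, just an illustration of the two methods:
import com.datastax.spark.connector._

case class IdKey(id: String)                       // field name must match the partition key column

val idList: Seq[String] = Seq("id-1", "id-2")      // placeholder for your ~200k ids
val ids = sc.parallelize(idList.map(IdKey))

val rows = ids
  .repartitionByCassandraReplica("ks", "events")   // co-locate each id with a replica that owns it
  .joinWithCassandraTable("ks", "events")          // one targeted read per partition key, no IN clause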
Hope it helps!

reducer concept in Spark

I'm coming from a Hadoop background and have limited knowledge about Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to mappers and the driver is (somehow) similar to the reducer. As there is only one driver program, there will be one reducer. If so, how can simple programs like word count for very big data sets get done in Spark? The driver could simply run out of memory.
The driver is more of a controller of the work, only pulling data back if the operator calls for it. If the operator you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the points of shuffle by the stage splits, either in the UI or via a toDebugString (where each indentation level is a shuffle).
All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Last, to equate to your word count example:
sc.textFile(path)
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_+_)
In the above, this will be done in one stage, as the data loading (textFile), splitting (flatMap), and mapping (map) can all be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, as it will need to combine all of the data to perform the operation... HOWEVER, this operation has to be associative for a reason. Each node will perform the operation defined in reduceByKey locally, only merging the final data sets afterwards. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data does NOT pull back to the driver, it merely moves to nodes that have the same keys so that it can have its final value merged.
Now, if you use an action such as reduce or, worse yet, collect, then you will NOT get an RDD back, which means the data is pulled back to the driver and you will need room for it.
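As a small sketch of that difference, reusing a hypothetical word-count RDD like the one above:
val counts = sc.textFile(path)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)                             // transformation: the result stays distributed

counts.saveAsTextFile("hdfs:///word-counts")      // still distributed: each partition writes its own part file
val totalWords = counts.map(_._2).reduce(_ + _)   // action: a single Int comes back to the driver
val everything = counts.collect()                 // action: every (word, count) pair is pulled to the driver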
Here is my fuller explanation of reduceByKey if you want more. Or how this breaks down in something like combineByKey

Spark broadcast to all keys - updateStateByKey

updateStateByKey is useful, but what if I want to perform an operation on all existing keys (not only the ones in this RDD)?
Word count for example - is there a way to decrease all words seen so far by 1?
I was thinking of keeping a static class per node with the count information and issuing a broadcast command to take a certain action, but could not find a broadcast-to-all-nodes functionality.
Spark will apply the updateStateByKey function to all existing keys anyway.
It is also good to note that if the updateStateByKey function returns None (in Scala), then the key-value pair will be eliminated.
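So for your word-count example, a minimal sketch (my own, with a made-up decay rule) could decrement every existing key on each batch and drop keys that reach zero by returning None; it assumes a words DStream and that a checkpoint directory has been set, which updateStateByKey requires:
import org.apache.spark.streaming.dstream.DStream

def decayingCounts(words: DStream[String]): DStream[(String, Int)] = {
  val update = (newValues: Seq[Int], state: Option[Int]) => {
    val n = state.getOrElse(0) + newValues.sum - 1   // add new occurrences, then decay by 1 per batch
    if (n <= 0) None else Some(n)                    // returning None removes the key from the state
  }
  words.map((_, 1)).updateStateByKey(update)         // update runs for every key with state, not only keys in this batch
}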

How to find the number of keys created in map part?

I am trying to write a Spark application that would find the number of keys that have been created in the map function. I could find no function that would allow me to do that.
One way I've thought of is using an accumulator, where I'd add 1 to the accumulator variable in the reduce function. My idea is based on the assumption that accumulator variables are shared across nodes as counters.
Please guide.
If you are looking for something like the Hadoop counters in Spark, the closest approximation is an accumulator that you can increase in every task, but it does not give you any information about the amount of data that Spark has processed so far.
If you only want to know how many distinct keys you have in your RDD, you could do something like a count of the distinct mapped keys: rdd.map(t => t._1).distinct.count
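A small sketch of both options, using a made-up pair RDD named mapped and a SparkContext sc (the accumulator API below is the Spark 2.x one):
import org.apache.spark.rdd.RDD

val mapped: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // placeholder map output

// Option 1: a Hadoop-counter-style accumulator, bumped once per record processed.
val recordCounter = sc.longAccumulator("records")
mapped.foreach(_ => recordCounter.add(1L))           // foreach is an action, so the counts are reliable
println(s"records seen: ${recordCounter.value}")

// Option 2: the number of distinct keys produced by the map phase.
val distinctKeys = mapped.map(_._1).distinct().count()
println(s"distinct keys: $distinctKeys")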
Hope this will be useful for you

Resources