Spark broadcast to all keys - updateStateByKey - apache-spark

updateStateByKey is useful, but what if I want to perform an operation on all existing keys, not only the ones present in the current RDD?
Word count, for example - is there a way to decrease the count of every word seen so far by 1?
I was thinking of keeping a static class per node with the count information and issuing a broadcast command to take a certain action, but I could not find a broadcast-to-all-nodes functionality.

Spark will apply the updateStateByKey function to all existing keys anyway, not just the keys present in the current RDD.
It is also good to note that if the updateStateByKey function returns None (in Scala), the key-value pair is eliminated from the state.
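As an illustration, here is a minimal Scala sketch (Spark Streaming) of a word count whose update function decreases every existing key by 1 on each batch and drops keys that reach zero; the socket source, checkpoint path and batch interval are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("decaying-word-count").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/checkpoint") // updateStateByKey requires checkpointing

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))

// Called for EVERY key that has state, even when no new values arrived
// in the current batch, so the decay applies to all words seen so far.
val update: (Seq[Int], Option[Int]) => Option[Int] = (newValues, state) => {
  val next = state.getOrElse(0) + newValues.sum - 1
  if (next <= 0) None // returning None removes the key from the state
  else Some(next)
}

val counts = pairs.updateStateByKey(update)
counts.print()

ssc.start()
ssc.awaitTermination()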

Related

A question about Spark distributed aggregation

I am reading up on Spark from here.
At one point the blog says:
consider an app that wants to count the occurrences of each word in a corpus and pull the results into the driver as a map. One approach, which can be accomplished with the aggregate action, is to compute a local map at each partition and then merge the maps at the driver. The alternative approach, which can be accomplished with aggregateByKey, is to perform the count in a fully distributed way, and then simply collectAsMap the results to the driver.
So, as I understand this, the two approaches described are:
Approach 1:
Create a hash map within each executor
Collect key 1 from all the executors on the driver and aggregate
Collect key 2 from all the executors on the driver and aggregate
and so on and so forth
This is where my problem is. I do not think approach 1 ever happens in Spark, unless the user is hell-bent on doing it and starts using collect along with filter to get the data key by key onto the driver, and then writes code on the driver to merge the results.
Approach 2 (I think this is what usually happens in Spark, unless you use groupBy, in which case the combiner is not run; this is the typical reduceByKey mechanism):
Compute first level of aggregation on map side
Shuffle
Compute the second level of aggregation from all the partially aggregated results of step 1
Which leads me to believe that I am misunderstanding approach 1 and what the author is trying to say. Can you please help me understand what approach 1 in the quoted text is?
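To make my reading concrete, here is a rough sketch of what I think the two quoted approaches look like in code, assuming an existing SparkContext sc and a toy corpus (I may well be misreading approach 1):

val words = sc.parallelize(Seq("a b", "b c", "a a")).flatMap(_.split(" "))

// Approach 1: build a local map per partition, then merge the maps on the driver.
val viaAggregate: Map[String, Long] = words.aggregate(Map.empty[String, Long])(
  // seqOp: runs inside each partition, building a per-partition map
  (m, w) => m + (w -> (m.getOrElse(w, 0L) + 1L)),
  // combOp: runs on the driver, merging the per-partition maps
  (m1, m2) => m2.foldLeft(m1) { case (m, (w, c)) => m + (w -> (m.getOrElse(w, 0L) + c)) }
)

// Approach 2: fully distributed aggregation, then pull the (small) result to the driver.
val viaAggregateByKey: collection.Map[String, Long] = words
  .map((_, 1L))
  .aggregateByKey(0L)(_ + _, _ + _)
  .collectAsMap()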

Use of countByKeyApprox() for Partial manual broadcast hash join

I read about the partial manual broadcast hash join, which can be used when joining pair RDDs in Spark. It is suggested to be useful when a key is so large that it can't fit on a single partition. In this case you can use countByKeyApprox on the large RDD to get an approximate idea of which keys would most benefit from a broadcast.
You then filter the smaller RDD for only these keys, collecting the result locally in a HashMap. Using sc.broadcast you can broadcast the HashMap so that each worker only has one copy and manually perform the join against the HashMap. Using the same HashMap you can then filter your large RDD down to not include the large number of duplicate keys and perform your standard join, unioning it with the result of your manual join. This approach is quite convoluted but may allow you to handle highly skewed data you couldn’t otherwise process.
The question is about the usage of countByKeyApprox(long timeout). What is the unit of this timeout? If I write countByKeyApprox(10), does that mean it will wait for 10 seconds, 10 ms, or something else?
It's in milliseconds.
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/rdd/PairRDDFunctions.html#countByKeyApprox-long-double-
Parameters:
timeout - maximum time to wait for the job, in milliseconds
confidence - the desired statistical confidence in the result
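For example, here is a rough sketch of how the call might fit into the partial manual broadcast join described above, assuming an existing SparkContext sc; the RDD contents, the 10-second (10000 ms) timeout, the 0.9 confidence and the skew threshold are all made up for illustration.

// Hypothetical pair RDDs; replace with your real data.
val largeRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val smallRdd = sc.parallelize(Seq(("a", "x"), ("b", "y")))

// Approximate counts of the keys in the (skewed) large RDD.
// 10000 is the timeout in milliseconds, i.e. wait up to 10 seconds;
// .initialValue blocks until an approximate result is available (at most about the timeout).
val approxCounts = largeRdd.countByKeyApprox(10000, 0.9).initialValue

// Keys whose estimated count crosses some skew threshold (placeholder value).
val skewedKeys = approxCounts.filter { case (_, bound) => bound.mean > 1e6 }.keySet

// Collect the matching part of the smaller RDD locally and broadcast it.
val smallSide = smallRdd.filter { case (k, _) => skewedKeys.contains(k) }.collectAsMap()
val bcast = sc.broadcast(smallSide)

// Manual map-side join for the skewed keys...
val joinedSkewed = largeRdd
  .filter { case (k, _) => skewedKeys.contains(k) }
  .flatMap { case (k, v) => bcast.value.get(k).map(w => (k, (v, w))) }

// ...a standard join for everything else, then union the two results.
val joinedRest = largeRdd.filter { case (k, _) => !skewedKeys.contains(k) }.join(smallRdd)
val result = joinedSkewed.union(joinedRest)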

reducer concept in Spark

I'm coming from a Hadoop background and have limited knowledge about Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to mappers, and the driver is (somehow) similar to a reducer. As there is only one driver program, there will be only one reducer. If so, how can simple programs like word count for very big data sets get done in Spark? The driver could simply run out of memory.
The driver is more of a controller of the work, only pulling data back if an operator calls for it. If the operator you're working with returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type, then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the shuffle points by the stage splits, either in the UI or via toDebugString (where each indentation level is a shuffle).
All that being said, for a rough understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Last, to equate to your word count example:
sc.textFile(path)
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_+_)
In the above, this will be done in one stage, as the data loading (textFile), splitting (flatMap), and mapping can all be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, since it has to combine all of the data to perform the operation... HOWEVER, this operation has to be associative for a reason: each node will perform the operation defined in reduceByKey locally, and only merge the partial results afterwards. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data is NOT pulled back to the driver; it merely moves to the nodes that hold the same keys, so that each key can have its final value merged.
Now, if you use an action such as reduce or, worse yet, collect, then you will NOT get an RDD back, which means the data is pulled back to the driver and you will need room for it.
Here is my fuller explanation of reduceByKey if you want more, or how this breaks down in something like combineByKey.
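As a rough illustration of the last two points, here is a sketch that shows the stage boundary via toDebugString and contrasts a distributed write with pulling the result back to the driver (path and out are placeholders):

val counts = sc.textFile(path)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)          // transformation: the result stays distributed

// Each indentation level in the debug string marks a shuffle (stage) boundary.
println(counts.toDebugString)

counts.saveAsTextFile(out)     // written directly from the executors
val local = counts.collect()   // action: the full result is pulled to the driver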

How to find the number of keys created in map part?

I am trying to write a Spark application that would find the number of keys that have been created in the map function. I could not find any function that would allow me to do that.
One way I've thought of is using an accumulator, where I'd add 1 to the accumulator variable in the reduce function. My idea is based on the assumption that accumulator variables are shared across nodes as counters.
Please guide.
If you are looking for something like Hadoop counters in Spark, the most accurate approximation is an Accumulator that you can increment in every task; however, it does not give you any information about how much data Spark has processed so far.
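For example, a minimal sketch using a LongAccumulator incremented in the map phase (the input is a toy placeholder; note that accumulators updated inside transformations can over-count if tasks are retried):

val keysEmitted = sc.longAccumulator("keysEmitted")

val pairs = sc.parallelize(Seq("a b", "b c"))
  .flatMap(_.split(" "))
  .map { w =>
    keysEmitted.add(1)   // one increment per key-value pair emitted
    (w, 1)
  }

pairs.reduceByKey(_ + _).count()   // run an action so the accumulator gets updated
println(s"key-value pairs emitted: ${keysEmitted.value}")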
If you only want to know how many distinct keys you have in your RDD, you could simply count the distinct mapped keys: rdd.map(t => t._1).distinct.count
Hope this will be useful for you

Cassandra - iterate over all Row Keys without duplicates on random partitioner

get_range_slices iterates over all keys even in the case of the random partitioner. As I understand it, the result of this query will not return duplicated keys, because it goes in ascending order over the ring. Since keys are hashed, Cassandra would need an additional "index" to be able to execute such a query - as if each key kept a reference to the next key (which is not the case).
Could someone give me some hints on how Cassandra implements iteration over all keys in the case of the random partitioner?
Results are returned in random order - or, more specifically, in token order (the order of the hashed key values).
EDIT: I am not sure I understood the original question, because if you have 100 nodes, you would never want to run get_range_slices from a single node. Typically you would install Hadoop map/reduce on top of Cassandra with Cassandra's adapter so you can process all keys in parallel.
get_range_slices is in general never used for getting "all" the keys on the random partitioner. Instead, map/reduce is used, as it is MUCH faster to send your code to each machine and have every machine execute in parallel, so you can traverse the entire data set far more quickly.
i.e. maybe you need to look into map/reduce instead of get_range_slices?
Another option is PlayOrm's partitioning, if you use PlayOrm, since you can use Storm and have a machine processing each partition. You can also do a
PARTITIONS(:partitionId) SELECT * FROM Table
to get all the rows for a partition.
You can of course do joins and such too, and they are fast, since they read from multiple disks in parallel; when dealing with disks, you want that parallelism to speed things up.
