How to find the number of keys created in the map phase? - apache-spark

I am trying to write a Spark application that would find the number of keys created in the map function. I could find no function that would allow me to do that.
One way I've thought of is using an accumulator, where I'd add 1 to the accumulator variable in the reduce function. My idea is based on the assumption that accumulator variables are shared across nodes as counters.
Please guide.

If you are looking for something like Hadoop counters in Spark, the closest approximation is an accumulator that you can increment in every task, but it does not give you any information about how much data Spark has processed so far.
If you only want to know how many distinct keys you have in your RDD, you can count the distinct mapped keys: rdd.map(t => t._1).distinct.count
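For example, a rough sketch of both ideas, assuming a hypothetical pair RDD named rdd and the Spark 2.x accumulator API:

// exact number of distinct keys in the pair RDD (this runs a Spark job)
val distinctKeys = rdd.map(t => t._1).distinct.count

// counter-style accumulator, incremented once per record in each task
val processed = sc.longAccumulator("records processed")
rdd.foreach(_ => processed.add(1L))
// processed.value can then be read on the driver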
Hope this will be useful for you

Related

A question about Spark distributed aggregation

I am reading up on spark from here
At one point the blog says:
consider an app that wants to count the occurrences of each word in a corpus and pull the results into the driver as a map. One approach, which can be accomplished with the aggregate action, is to compute a local map at each partition and then merge the maps at the driver. The alternative approach, which can be accomplished with aggregateByKey, is to perform the count in a fully distributed way, and then simply collectAsMap the results to the driver.
So, as I understand this, the two approaches described are:
Approach 1:
Create a hash map within each executor
Collect key 1 from all the executors on the driver and aggregate
Collect key 2 from all the executors on the driver and aggregate
and so on and so forth
This is where the problem is. I do not think approach 1 ever happens in Spark unless the user is hell-bent on doing it and starts using collect along with filter to get the data key by key on the driver, and then writes code on the driver to merge the results.
Approach 2 (I think this is what usually happens in Spark unless you use groupBy, in which case the combiner is not run; this is the typical reduceByKey mechanism):
Compute first level of aggregation on map side
Shuffle
Compute the second level of aggregation from all the partially aggregated results of step 1
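In code, the way I read the two approaches is roughly this (a sketch only, assuming a hypothetical words: RDD[String]):

// Approach 1: aggregate builds a local map per partition, then merges the maps on the driver
val viaAggregate = words.map(w => (w, 1)).aggregate(Map.empty[String, Int])(
  (m, kv) => m + (kv._1 -> (m.getOrElse(kv._1, 0) + kv._2)),                            // within a partition
  (m1, m2) => m2.foldLeft(m1) { case (m, (k, v)) => m + (k -> (m.getOrElse(k, 0) + v)) } // merge on the driver
)

// Approach 2: aggregateByKey counts in a fully distributed way, then collectAsMap pulls the result back
val viaAggregateByKey = words.map(w => (w, 1)).aggregateByKey(0)(_ + _, _ + _).collectAsMap()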
Which leads me to believe that I am misunderstanding approach 1 and what the author is trying to say. Can you please help me understand what approach 1 in the quoted text is?

Spark (large dataset) groupBy, sort, and then map

With a Spark RDD, is there a way to groupByKey, then sort within each group, and then map, for large datasets? The naive way maps over each group, creates a list for each group, and sorts it. However, building that list can cause out-of-memory problems for groups with many entries. Is there a way to have Spark do the sorting so as to avoid out-of-memory issues?
It sounds like you are getting a data skew error. This can happen when an executor runs out of memory because too much data is associated with a single key. A solution to that problem would be to adjust/play with the number of executors and the amount of RAM allocated to each...
However I believe this would be the solution to your problem:
JavaPairRDD<Key, Iterable<Value>> pair = ...;                 // the grouped data
JavaRDD<Iterable<Value>> values = pair.map(t2 -> t2._2());    // keep only the value iterables
JavaRDD<Value> onlyValues = values.flatMap(it -> it);         // flatten them into a single RDD of values
source: Convert iterable to RDD
Please follow up with this possible solution. I am genuinely curious.
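If you are on the Scala API, the same flattening is a one-liner (pair here is the hypothetical RDD[(Key, Iterable[Value])] from above):

val onlyValues = pair.flatMap { case (_, values) => values }  // flatten every group's iterable into one RDD of values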

Global variable value update at task level in spark

I have a requirement to update some integer variables based on computations in my transformations. For example, if I find a discrepancy in record matching, I want to increment a value and use it then and there.
I have explored the use of accumulators, but their value can only be read on the driver, which will be very tedious for me as I am dealing with billions of rows.
Please suggest a possible solution for global variable updates in Spark, like COUNTERS in the MapReduce framework.
Accumulators are the best alternative to counters in Spark, but they have to be used in an action instead of a transformation, as computations inside transformations are evaluated lazily. You can find an example at the link below.
https://github.com/prithvirajbose/spark-dev/blob/master/src/main/scala/examples/PurchaseLogAnalysis.scala
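A minimal sketch of that pattern (Spark 2.x API; records, matches and the accumulator name are placeholders, not from the question):

val mismatches = spark.sparkContext.longAccumulator("record mismatches")

records.foreach { row =>                  // foreach is an action, so this runs eagerly and the counts are reliable
  if (!matches(row)) mismatches.add(1L)   // matches() stands in for your record-matching check
}

println(s"mismatched rows: ${mismatches.value}")  // the value is readable only on the driver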
I faced the same problem. Using accumulators is good practice for this use case; it won't affect Spark performance and is safer to use.

DataFrame orderBy followed by limit in Spark

I have a program that generates a DataFrame, on which it will run something like
Select Col1, Col2...
orderBy(ColX) limit(N)
However, when I collect the data in the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I do just the sort, or just the top, this problem does not happen. So it happens only when there is a sort and a top at the same time.
I am wondering why this could be happening. In particular, what is really going on underneath this combination of the two transforms? How does Spark evaluate a query with both sorting and limit, and what is the corresponding execution plan underneath?
Also, just curious: does Spark handle sort and top differently between DataFrame and RDD?
EDIT:
Sorry, I didn't mean collect specifically.
What I originally meant is that this happens when I call any action to materialize the data, regardless of whether it is collect (or any other action sending data back to the driver) or not. (So the problem is definitely not the output size.)
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:
When you use limit, it simply puts all the data on a single partition, no matter how big N is. So while it doesn't explicitly collect, it is almost as bad.
On top of that, orderBy requires a full shuffle with range partitioning, which can result in different issues when the data distribution is skewed.
Finally, when you collect, the results can be larger than the amount of memory available on the driver.
If you collect anyway, there is not much you can improve here. At the end of the day driver memory will be the limiting factor, but there are still some possible improvements:
First of all, don't use limit.
Replace collect with toLocalIterator.
Use either orderBy |> rdd |> zipWithIndex |> filter or, if the exact number of values is not a hard requirement, filter the data directly based on an approximated distribution, as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is a handy approxQuantile method).
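A rough sketch of the orderBy |> rdd |> zipWithIndex |> filter variant (df, ColX and n are placeholders):

import org.apache.spark.sql.functions.col

val topN = df.orderBy(col("ColX"))      // full sort with range partitioning across the cluster
  .rdd                                  // switch to the RDD API
  .zipWithIndex()                       // attach a global position to every row
  .filter { case (_, idx) => idx < n }  // keep only the first n rows, still distributed
  .map { case (row, _) => row }

Unlike limit, the rows stay distributed until you decide what to do with them.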

When should the groupByKey API be used in Spark programming?

groupByKey suffers from shuffling the data, and its functionality can be achieved either by using combineByKey or reduceByKey. So when should this API be used? Is there any use case?
Combine and reduce will also eventually shuffle, but they have better memory and speed performance characteristics because they are able to do more work to reduce the volume of data before the shuffle.
Consider if you had to sum a numeric attribute by group in an RDD[(group, num)]. groupByKey will give you RDD[(group, List[num])], which you can then manually reduce using map. The shuffle would need to move all the individual nums to the destination partitions/nodes to build that list, so many rows are shuffled.
Because reduceByKey knows what you are doing with the nums (i.e. summing them), it can sum each partition individually before the shuffle, so you'd have at most one row per group being written out to each shuffle partition/node.
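As a sketch of that difference (nums here is a hypothetical RDD[(String, Int)]):

// groupByKey: every individual num travels through the shuffle, then gets summed from the collected values
val sumsViaGroup = nums.groupByKey().map { case (group, values) => (group, values.sum) }

// reduceByKey: each partition pre-sums its own nums first, so at most one row per group per partition is shuffled
val sumsViaReduce = nums.reduceByKey(_ + _)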
According to the link below, GroupByKey should be avoided.
Avoid GroupByKey
Avoid groupByKey when the data in the merge field will be reduced to a single value, e.g. in the case of a sum for a particular key.
Use groupByKey when you know that the merge field is not going to be reduced to a single value. E.g. a List reduce(_ ++ _) --> avoid this.
The reason is that reducing a list allocates memory on both the map side and the reduce side. Memory allocated on an executor that doesn't own the key is wasted during the shuffle.
A good example would be TopN; see the sketch below.
More on this -
https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
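A hedged sketch of the TopN case (pairs: RDD[(String, Int)] and n are placeholders, not taken from the linked page):

// reduceByKey that only concatenates lists buffers those lists on both sides of the shuffle: avoid
val viaReduce = pairs.mapValues(List(_)).reduceByKey(_ ++ _).mapValues(_.sorted.takeRight(n))

// groupByKey is reasonable here, since the values are never reduced to a single value anyway
val viaGroup = pairs.groupByKey().mapValues(_.toList.sorted.takeRight(n))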
I would say that if groupByKey is the last transformation in your chain of work (or anything you do after it has only narrow dependencies), then you may consider it.
The reasons reduceByKey is preferred are:
1. The combine step, as alister mentioned above.
2. reduceByKey also partitions the data so that the sum/aggregation becomes narrow, i.e. it can happen within partitions.
