job hangs when using pipe() with reduceByKey() - apache-spark

I have run into the following situation. When I use
val a = rdd.pipe("./my_cpp_program").persist()
a.count() // just to force a to be persisted
val b = a.map(s => (s, 1)).reduceByKey().count()
it is very fast, but when I use
val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey().count()
it is very slow, and my executors log many messages like these:
15/10/31 19:53:58 INFO collection.ExternalSorter: Thread 78 spilling in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:14 INFO collection.ExternalSorter: Thread 74 spilling in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:17 INFO collection.ExternalSorter: Thread 79 spilling in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:29 INFO collection.ExternalSorter: Thread 77 spilling in-memory map of 633.1 MB to disk (8 times so far)
15/10/31 19:54:50 INFO collection.ExternalSorter: Thread 76 spilling in-memory map of 633.1 MB to disk (9 times so far)

You haven't passed a function to reduceByKey(). From the docs for reduceByKey:
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
In this case, you want to pass the anonymous function (a, b) => a + b to aggregate values across your keys (can also be written as _ + _ using Scala's shortened underscore notation).
Since you are calling count() though (which essentially will count the number of unique keys after a reduceByKey()), it probably makes sense for you to just use distinct() instead. The implementation of distinct is actually very similar to what you're currently trying to do (mapping instead to (s, null) and then calling reduceByKey) but from a code readability standpoint, a distinct would better indicate what your end goal is. Something like this would work:
val b = rdd.pipe("./my_cpp_program").distinct().count()
Since you may also be interested in the counts for each unique key, there are other functions in the PairRDDFunctions class that can help with this. I would check out countByKey(), countByKeyApprox(), and countApproxDistinctByKey(). Each has different use cases but offers an interesting solution to its respective problem.
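To illustrate why distinct().count() gives the same number as counting the keys after a reduceByKey, here is a minimal sketch in plain Python rather than Spark. The piped_lines list is a made-up stand-in for whatever your C++ program emits, and Counter only models the per-key aggregation; it is not how Spark implements it.

```python
from collections import Counter

# Hypothetical stand-in for the lines produced by rdd.pipe("./my_cpp_program").
piped_lines = ["a", "b", "a", "c", "b", "a"]

# Models map(s => (s, 1)).reduceByKey(_ + _): one (key, count) entry per distinct line.
counts_per_key = Counter(piped_lines)

# Models reduceByKey(...).count(): the number of distinct keys.
keys_after_reduce = len(counts_per_key)

# Models distinct().count().
distinct_count = len(set(piped_lines))

print(counts_per_key["a"], keys_after_reduce, distinct_count)  # 3 3 3
```

If you only need the final number, the distinct form says so directly; the per-key counts are only worth computing if you actually use them.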


How to calculate space for number of records

I am trying to calculate the space required by a dataset using the formula below, but I am going wrong somewhere, because the result does not match an existing dataset in the system. Please help me.
1st Dataset:
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 32760
Number of records....: 51560
Using the formula below to calculate:
optimal block length (OBL) = 32760/record length = 32760/449 = 73
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 73*2 = 146
Find number of physical records (PR) = Number of records/TOBL = 51560/146 = 354
Number of tracks = PR/2 = 354/2 = 177
But I see the following in the dataset information:
Current Allocation
Allocated tracks . : 100
Allocated extents . : 1
Current Utilization
Used tracks . . . . : 100
Used extents . . . : 1
2nd Dataset :
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 27998
Number of Records....: 127,252
Using the formula below to calculate:
optimal block length (OBL) = 27998/record length = 27998/449 = 63
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 63*2 = 126
Find number of physical records (PR) = Number of records/TOBL = 127252/126 = 1010
Number of tracks = PR/2 = 1010/2 = 505
Number of Cylinders = 505/15 = 34
But I see the following in the dataset information:
Current Allocation
Allocated cylinders : 69
Allocated extents . : 1
Current Utilization
Used cylinders . . : 69
Used extents . . . : 1
A few observations on your approach.
First, since you're dealing with variable-length records, it would be helpful to know the "average" record length, as that would give a more accurate prediction of storage. Your approach assumes a worst-case scenario in which all records are at the maximum length. That is fine for planning purposes, but in reality the actual allocation will likely be lower if the average record length is lower than the maximum.
The approach you are taking is reasonable but consider that you can inform z/OS of the space requirements in blocks, records, DASD geometry or let DFSMS perform the calculation on your behalf. Refer to this article to get some additional information on options.
Back to your calculations:
Your Optimum Block Length (OBL) is really a records-per-block (RPB) number. Block size divided by maximum record length yields the number of full-length records that can be stored in a block. If your average record length is smaller, you can store more records per block.
The assumption of two blocks per track may be true for your situation, but it depends on the actual device type used for the underlying allocation. Here is a link to the geometries of some supported DASD devices.
The assumption of two blocks per track is not correct for 3390s at a block size of 32,760: you would need about 64K for two such blocks on a track, but as you can see 3390 tracks max out at about 56K, so you would get only one block per track on that device.
Also, it looks like you did factor in the RDW by adding 4 bytes, but someone looking at the question might be confused if they are not familiar with V records on z/OS. In the case of your calculation that would be 62 records per block at 27998 (which is the "optimal block length", so two blocks fit comfortably on a track).
I'll use the following values:
MaximumRecordLength = RecordLength + 4 for RDW
TotalRecords = Total Records at Maximum Length (worst case)
BlockSize = modeled blocksize
RecordsPerBlock = number of records that can fit in a block (worst case)
BlocksNeeded = number of blocks needed to contain estimated records (worst case)
BlocksPerTrack = from IBM device geometry information
TracksNeeded = TotalRecords / RecordsPerBlock / BlocksPerTrack (rounding up at each step)
Cylinders = TracksNeeded / tracks per cylinder (15 for most devices)
Example 1:
Total Records = 51,560
BlockSize = 32,760
BlocksPerTrack = 1 (from device table)
RecordsPerBlock: 32,760 / 449 = 72.96 (72)
Total Blocks = 51,560 / 72 = 716.11 (717)
Total Tracks = 717 / 1 = 717
Cylinders = 717 / 15 = 47.8 (48)
Example 2:
Total Records = 127,252
BlockSize = 27,998
BlocksPerTrack = 2 (from device table)
RecordsPerBlock: 27,998 / 449 = 62.35 (62)
Total Blocks = 127,252 / 62 = 2052.45 (2,053)
Total Tracks = 2,053 / 2 = 1,026.5 (1,027)
Cylinders = 1027 / 15 = 68.5 (69)
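The two examples above can be reproduced with a few lines of Python. This is only a sketch of the worst-case arithmetic as I described it: the function name and parameter names are mine, it ignores the 4-byte BDW per block just as the hand calculation does, and the blocks-per-track values still have to come from the device geometry table.

```python
import math

def tracks_and_cylinders(total_records, record_length, block_size,
                         blocks_per_track, tracks_per_cylinder=15):
    """Worst-case space estimate for a VB data set (all records at maximum length)."""
    max_record_length = record_length + 4                  # 4-byte RDW per record
    records_per_block = block_size // max_record_length    # round down
    blocks_needed = math.ceil(total_records / records_per_block)
    tracks_needed = math.ceil(blocks_needed / blocks_per_track)
    cylinders = math.ceil(tracks_needed / tracks_per_cylinder)
    return records_per_block, blocks_needed, tracks_needed, cylinders

example1 = tracks_and_cylinders(51560, 445, 32760, blocks_per_track=1)
example2 = tracks_and_cylinders(127252, 445, 27998, blocks_per_track=2)
print(example1)  # (72, 717, 717, 48)
print(example2)  # (62, 2053, 1027, 69)
```

Each tuple is (records per block, blocks, tracks, cylinders), so the last two values are the ones to compare against the allocation shown for the data set.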
Now, as to the actual allocation: it depends on how you allocated the space and on the size of the records. Assuming it was in JCL, you could use the RLSE subparameter of SPACE= to release space when the data set is created and closed. This should release unused resources.
Given that the records are variable length, the estimates are worst case, and you would need to know more about the average record lengths to understand the actual allocation in terms of space used.
Final thought, all of the work you're doing can be overridden by your storage administrator through ACS routines. I believe that most people today would specify a BLKSIZE=0 and let DFSMS do all of the hard work because that component has more information about where a file will go, what the underlying devices are and the most efficient way of doing the allocation. The days of disk geometry and allocation are more of a campfire story unless your environment has not been administered to do these things for you.
Instead of trying to calculate tracks or cylinders, go for MBs, or KBs. z/OS (DFSMS) will calculate for you, how many tracks or cylinders are needed.
In JCL it is not straight forward but also not too complicated, once you got it.
There is a DD statement parameter called AVGREC=, which is the trigger. Let me do an example for your first case above:
//anydd DD DISP=(NEW,CATLG),
//         SPACE=(445,(51560,1000)),AVGREC=U
//*                |     |    |            |
//*                V     V    V            V
//*               (1)   (2)  (3)           (4)
Parameter AVGREC=U (4) tells the system three things:
Firstly, the first subparameter in SPACE= (1) shall be interpreted as an average record length. (Note that this value is completely independent of the value specified in LRECL=.)
Secondly, it tells the system that the second (2) and third (3) SPACE= subparameters are the numbers of records of average length (1) that the data set shall be able to store.
Thirdly, it tells the system that numbers (2) and (3) are in records (AVGREC=U). Alternatives are thousands (AVGREC=K) and millions (AVGREC=M).
So, this DD statement will allocate enough space to hold the estimated number of records. You don't have to care for track capacity, block capacity, device geometry, etc.
Given the number of records you expect and the (average) record length, you can easily calculate the number of kilobytes or megabytes you need. Unfortunately, you cannot directly specify KB, or MB in JCL, but there is a way using AVGREC= as follows.
Your first data set will get 51560 records of (maximum) length 445, i.e. 22'944'200 bytes, or ~22'945 KB, or ~23 MB. The JCL for an allocation in KB looks like this:
//anydd DD DISP=(NEW,CATLG),
//         SPACE=(1,(22945,10000)),AVGREC=K
//*               |    |     |            |
//*               V    V     V            V
//*              (1)  (2)   (3)          (4)
You want the system to allocate primary space for 22945 (2) thousands (4) records of length 1 byte (1), which is 22945 KB, and secondary space for 10'000 (3) thousands (4) records of length 1 byte (1), i.e. 10'000 KB.
Now the same allocation specifying MB:
//anydd DD DISP=(NEW,CATLG),
//         SPACE=(1,(23,10)),AVGREC=M
//*               |   |  |          |
//*               V   V  V          V
//*              (1) (2)(3)        (4)
You want the system to allocate primary space for 23 (2) millions (4) records of length 1 byte (1), which is 23 MB, and secondary space for 10 (3) millions (4) records of length 1 byte (1), i.e. 10 MB.
I rarely use anything other than the latter.
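If you want to derive the primary quantities for the KB and MB forms without a calculator, the arithmetic is short enough to script. The sketch below follows the decimal arithmetic used above (dividing by 1,000 and 1,000,000); the function name is made up. As far as I know the system treats K and M as multipliers of 1024 and 1,048,576, so these figures err slightly on the generous side.

```python
import math

def avgrec_quantities(records, record_length):
    """Return (total bytes, primary quantity for AVGREC=K, primary quantity for AVGREC=M)."""
    total_bytes = records * record_length
    kilo_units = math.ceil(total_bytes / 1_000)        # rounded up to whole thousands
    mega_units = math.ceil(total_bytes / 1_000_000)    # rounded up to whole millions
    return total_bytes, kilo_units, mega_units

total_bytes, kb, mb = avgrec_quantities(51560, 445)
print(total_bytes, kb, mb)                   # 22944200 22945 23
print(f"SPACE=(1,({kb},10000)),AVGREC=K")    # secondary quantity copied from the example
print(f"SPACE=(1,({mb},10)),AVGREC=M")
```

The secondary quantities (10000 and 10) are simply the ones I picked in the JCL above; choose whatever growth increment suits the data set.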
In ISPF, it is even easier: Data Set Allocation (3.2) allows KB, and MB as space units (amongst all the old ones).
A useful and usually simpler alternative to SPACE and AVGREC is to use a DATACLAS for space, if your site has appropriately sized ones defined. In ISMF Option 4 you can list the available DATACLASes and see what space values they provide. You'd expect to see a number of size ranges, some with and some without Extended Format and/or Compression. Even if a DATACLAS overallocates a bit, the overallocated space will likely be released by the MGMTCLAS assigned to the dataset, at close or during space management. You can also code both DATACLAS and SPACE, in which case any coded space (or other) value overrides the DATACLAS, which helps with exceptions. It still depends on how your storage admins have coded the ACS routines, but generally users are allowed to specify a DATACLAS and it will be honored by the ACS routines.
For a basic dataset size calculation I just multiply LRECL by the expected maximum record count and divide by 1000 a couple of times to get a rough MB figure. Obviously variable records and blocks add 4 bytes each for the RDW and/or BDW, but unless the number of records is massive or DASD is extremely tight on space, it shouldn't be significant enough to matter.
=(51560*445)/1000/1000 shows as ~23MB
Also, don't expect your allocation to be exactly what you requested, because the minimum allocation on z/OS is 1 track, or ~56K. The BLKSIZE also comes into effect by adding interblock gaps of ~32 bytes per block. With SDB (System Determined Blocksize), invoked by omitting BLKSIZE or coding BLKSIZE=0, the system will always try to provide half-track blocking as close to 28K as possible, so two blocks per track, which is the most space efficient. That does matter: a BLKSIZE of 80 bytes wastes ~80% of a track on interblock gaps. The BLKSIZE is also the unit of transfer when reading from or writing to disk, so generally the larger the better, with some exceptions such as KSDSs being randomly accessed by key, where a large block might result in more data transfer than desired in an OLTP transaction.

Memory footprint of splitOn?

I wrote a file indexing program that should read thousands of text file lines as records and finally group those records by fingerprint. It uses Data.List.Split.splitOn to split the lines at tabs and retrieve the record fields. The program consumes 10-20 GB of memory.
Probably there is not much I can do to reduce that huge memory footprint, but I cannot explain why a function like splitOn (breakDelim) can consume that much memory:
Mon Dec 9 21:07 2019 Time and Allocation Profiling Report (Final)
group +RTS -p -RTS file1 file2 -o 2 -h
total time = 7.40 secs (7399 ticks @ 1000 us, 1 processor)
total alloc = 14,324,828,696 bytes (excludes profiling overheads)
fileToPairs.linesIncludingEmptyLines ImageFileRecordParser ImageFileRecordParser.hs:35:7-47 25.0 33.8
breakDelim Data.List.Split.Internals src/Data/List/Split/Internals.hs:(151,1)-(156,36) 24.9 39.3
sortAndGroup Aggregations Aggregations.hs:6:1-85 12.9 1.7
fileToPairs ImageFileRecordParser ImageFileRecordParser.hs:(33,1)-(42,14) 8.2 10.7
matchDelim Data.List.Split.Internals src/Data/List/Split/Internals.hs:(73,1)-(77,23) 7.4 0.4
onSublist Data.List.Split.Internals src/Data/List/Split/Internals.hs:278:1-72 3.6 0.0
toHashesView ImageFileRecordStatistics ImageFileRecordStatistics.hs:(48,1)-(51,24) 3.0 6.3
main Main group.hs:(47,1)-(89,54) 2.9 0.4
numberOfUnique ImageFileRecord ImageFileRecord.hs:37:1-40 1.6 0.1
toHashesView.sortedLines ImageFileRecordStatistics ImageFileRecordStatistics.hs:50:7-30 1.4 0.1
imageFileRecordFromFields ImageFileRecordParser ImageFileRecordParser.hs:(11,1)-(30,5) 1.1 0.3
toHashView ImageFileRecord ImageFileRecord.hs:(67,1)-(69,23) 0.7 1.7
Or is type [Char] too memory inefficient (compared to Text), causing splitOn to take that much memory?
UPDATE 1 (+RTS -s suggestion of user HTNW)
23,446,268,504 bytes allocated in the heap
10,753,363,408 bytes copied during GC
1,456,588,656 bytes maximum residency (22 sample(s))
29,282,936 bytes maximum slop
3620 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 45646 colls, 0 par 4.055s 4.059s 0.0001s 0.0013s
Gen 1 22 colls, 0 par 4.034s 4.035s 0.1834s 1.1491s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.477s ( 7.475s elapsed)
GC time 8.089s ( 8.094s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.114s ( 0.114s elapsed)
Total time 15.687s ( 15.683s elapsed)
%GC time 51.6% (51.6% elapsed)
Alloc rate 3,135,625,407 bytes per MUT second
Productivity 48.4% of total user, 48.4% of total elapsed
The processed text files are smaller than usual (UTF-8 encoded, 37 MB). But still 3 GB of memory are used.
UPDATE 2 (critical part of the code)
Explanation: fileToPairs processes a text file. It returns a list of key-value pairs (key: fingerprint of record, value: record).
sortAndGroup associations = Map.fromListWith (++) [(k, [v]) | (k, v) <- associations]
main = do
  CommandLineArguments{..} <- cmdArgs $ CommandLineArguments {
    ignored_paths_file = def &= typFile,
    files = def &= typ "FILES" &= args,
    number_of_occurrences = def &= name "o",
    minimum_number_of_occurrences = def &= name "l",
    maximum_number_of_occurrences = def &= name "u",
    number_of_hashes = def &= name "n",
    having_record_errors = def &= name "e",
    hashes = def
    }
    &= summary "Group image/video files"
    &= program "group"
  let ignoredPathsFilenameMaybe = ignored_paths_file
  let filenames = files
  let hashesMaybe = hashes
  ignoredPaths <- case ignoredPathsFilenameMaybe of
    Just ignoredPathsFilename -> ioToLines (readFile ignoredPathsFilename)
    _ -> return []
  recordPairs <- mapM (fileToPairs ignoredPaths) filenames
  let allRecordPairs = concat recordPairs
  let groupMap = sortAndGroup allRecordPairs
  let statisticsPairs = map toPair (Map.toList groupMap) where toPair item = (fst item, imageFileRecordStatisticsFromRecords . snd $ item)
  let filterArguments = FilterArguments {
    numberOfOccurrencesMaybe = number_of_occurrences,
    minimumNumberOfOccurrencesMaybe = minimum_number_of_occurrences,
    maximumNumberOfOccurrencesMaybe = maximum_number_of_occurrences,
    numberOfHashesMaybe = number_of_hashes,
    havingRecordErrorsMaybe = having_record_errors
    }
  let filteredPairs = filterImageRecords filterArguments statisticsPairs
  let filteredMap = Map.fromList filteredPairs
  case hashesMaybe of
    Just True -> mapM_ putStrLn (map toHashesView (map snd filteredPairs))
    _ -> Char8.putStrLn (encodePretty filteredMap)
As I'm sure you're aware, there's not really enough information here for us to help you make your program more efficient. It might be worth posting some (complete, self-contained) code on the Code Review site for that.
However, I think I can answer your specific question about why splitOn allocates so much memory. In fact, there's nothing particularly special about splitOn or how it's been implemented. Many straightforward Haskell functions will allocate lots of memory, and this in itself doesn't indicate that they've been poorly written or are running inefficiently. In particular, splitOn's memory usage seems similar to other straightforward approaches to splitting a string based on delimiters.
The first thing to understand is that GHC compiled code works differently than other compiled code you're likely to have seen. If you know a lot of C and understand stack frames and heap allocation, or if you've studied some JVM implementations, you might reasonably expect that some of that understanding would translate to GHC executables, but you'd be mostly wrong.
A GHC program is more or less an engine for allocating heap objects, and -- with a few exceptions -- that's all it really does. Nearly every argument passed to a function or constructor (as well as the constructor application itself) allocates a heap object of at least 16 bytes, and often more. Take a simple function like:
fact :: Int -> Int
fact 0 = 1
fact n = n * fact (n-1)
With optimization turned off, it compiles to the following so-called "STG" form (simplified from the actual -O0 -ddump-stg output):
fact = \n -> case n of I# n' -> case n' of
0# -> I# 1#
_ -> let sat1 = let sat2 = let one = I#! 1# in n-one
in fact sat2;
in n*sat1
Everywhere you see a let, that's a heap allocation (16+ bytes), and there are presumably more hidden in the (-) and (*) calls. Compiling and running this program with:
main = print $ fact 1000000
gives:
113,343,544 bytes allocated in the heap
44,309,000 bytes copied during GC
25,059,648 bytes maximum residency (5 sample(s))
29,152 bytes maximum slop
23 MB total memory in use (0 MB lost due to fragmentation)
meaning that each iteration allocates over a hundred bytes on the heap, though it's literally just performing a comparison, a subtraction, a multiplication, and a recursive call.
This is what @HTNW meant in saying that total allocation in a GHC program is a measure of "work". A GHC program that isn't allocating probably isn't doing anything (again, with some rare exceptions), and a typical GHC program that is doing something will usually allocate at a relatively constant rate of several gigabytes per second when it's not garbage collecting. So, total allocation has more to do with total runtime than anything else, and it isn't a particularly good metric for assessing code efficiency. Maximum residency is also a poor measure of overall efficiency, though it can be helpful for assessing whether or not you have a space leak, if you find that it tends to grow linearly (or worse) with the size of the input where you expect the program should run in constant memory regardless of input size.
For most programs, the most important true efficiency metric in the +RTS -s output is probably the "productivity" rate at the bottom -- it's the amount of time the program spends not garbage collecting. And, admittedly, your program's productivity of 48% is pretty bad, which probably means that it is, technically speaking, allocating too much memory, but it's probably only allocating two or three times the amount it should be, so, at a guess, maybe it should "only" be allocating around 7-8 Gigs instead of 23 Gigs for this workload (and, consequently, running for about 5 seconds instead of 15 seconds).
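To make the numbers concrete, here is a small back-of-the-envelope check in Python (not Haskell, just arithmetic) using figures copied from the +RTS -s output in the question and from the fact run above. For these figures the reported productivity is simply the share of total time not spent in GC; I'm not claiming that is exactly how the RTS computes it in every case.

```python
# Figures copied from the +RTS -s output in the question.
total_time_s = 15.687
gc_time_s = 8.089

gc_share = gc_time_s / total_time_s
productivity = 1 - gc_share

print(f"GC share:     {gc_share:.1%}")      # 51.6%
print(f"Productivity: {productivity:.1%}")  # 48.4%

# Figures from the fact example: bytes allocated for 1,000,000 recursive calls.
bytes_per_iteration = 113_343_544 / 1_000_000
print(f"{bytes_per_iteration:.0f} bytes per iteration")  # 113
```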
With that in mind, if you consider the following simple breakDelim implementation:
breakDelim :: String -> [String]
breakDelim str = case break (=='\t') str of
(a,_:b) -> a : breakDelim b
(a,[]) -> [a]
and use it like so in a simple tab-to-comma delimited file converter:
main = interact (unlines . map (intercalate "," . breakDelim) . lines)
Then, unoptimized and run on a file with 10000 lines of 1000 3-character fields each, it allocates a whopping 17 Gigs:
17,227,289,776 bytes allocated in the heap
2,807,297,584 bytes copied during GC
127,416 bytes maximum residency (2391 sample(s))
32,608 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
and profiling it places a lot of blame on breakDelim:
main Main Delim.hs:8:1-71 57.7 72.6
breakDelim Main Delim.hs:(4,1)-(6,16) 42.3 27.4
In this case, compiling with -O2 doesn't make much difference. The key efficiency metric, productivity, is only 46%. All these results seem to be in line with what you're seeing in your program.
The split package has a lot going for it, but looking through the code, it's pretty clear that little effort has been made to make it particularly efficient or fast, so it's no surprise that splitOn performs no better than my quick-and-dirty custom breakDelim function. And, as I said before, there's nothing special about splitOn that makes it unusually memory hungry -- my simple breakDelim has similar behavior.
With respect to inefficiencies of the String type, it can often be problematic. But, it can also participate in optimizations like list fusion in ways that Text can't. The utility above could be rewritten in simpler form as:
main = interact $ map (\c -> if c == '\t' then ',' else c)
which uses String but runs pretty fast (about a quarter as fast as a naive C getchar/putchar implementation) at 84% productivity, while allocating about 5 Gigs on the heap.
It's quite likely that if you just take your program and "convert it to Text", you'll find it's slower and more memory hungry than the original! While Text has the potential to be much more efficient than String, it's a complicated package, and the way in which Text objects behave with respect to allocation when they're sliced and diced (as when you're chopping a big Text file up into little Text fields) makes it more difficult to get right.
So, some take-home lessons:
Total allocation is a poor measure of efficiency. Most well written GHC programs can and should allocate several gigabytes per second of runtime.
Many innocuous Haskell functions will allocate lots of memory because of the way GHC compiled code works. This isn't necessarily a sign that there's something wrong with the function.
The split package provides a flexible framework for all manner of cool list splitting manipulations, but it was not designed with speed in mind, and it may not be the best method of processing a tab-delimited file.
The String data type has a potential for terrible inefficiency, but isn't always inefficient, and Text is a complicated package that won't be a plug-in replacement to fix your String performance woes.
Most importantly:
Unless your program is too slow for its intended purpose, its run-time statistics and the theoretical advantages of Text over String are largely irrelevant.

Internals of reduce function in spark-shell

The input file contains 20 lines. I am trying to count the total number of records using the reduce function. Can anyone please explain why there is a difference in the results? Because here value of y is nothing but only 1.
Default number of partitions : 4
scala> val rdd = sc.textFile("D:\\LearningPythonTomaszDenny\\Codebase\\wholeTextFiles\\names1.txt")
scala> rdd.map(x=>1).reduce((acc,y) => acc+1)
res17: Int = 8
scala> rdd.map(x=>1).reduce((acc,y) => acc+y)
res18: Int = 20
"Because here value of y is nothing but only 1."
That is simply not true. reduce consists of three stages (not in the strict Spark meaning of the word):
Distributed reduce on each partition.
Collection of the partial results to the driver (synchronous or asynchronous depending on the backend).
Local driver reduction.
In your case the first two stages give the same results for both approaches, but in the final stage the first approach simply ignores the partial results. In other words, no matter what the result for a partition was, it always adds only 1.
Such an approach would work only with a non-parallel, strictly sequential reduce implementation.
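The three stages can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not of Spark itself: distributed_reduce is a made-up helper, and the even split of the 20 records over 4 partitions is an assumption that happens to reproduce the numbers in the question.

```python
from functools import reduce

def distributed_reduce(partitions, func):
    """Reduce each partition on its own, then reduce the partial results on the 'driver'."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

# 20 records, each mapped to 1, spread evenly over 4 partitions (assumed layout).
partitions = [[1] * 5 for _ in range(4)]

ignore_y = distributed_reduce(partitions, lambda acc, y: acc + 1)
use_y = distributed_reduce(partitions, lambda acc, y: acc + y)
print(ignore_y, use_y)  # 8 20
```

Every partition reduces to 5 in both cases. In the final step y is a partial result of 5, not 1, so acc + 1 gives 5 + 1 + 1 + 1 = 8 while acc + y gives 20.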

Spark flatmap: how much memory can a map task get?

I have an RDD containing tuples of arrays, i.e. of type
RDD[(Array[Int], Array[Int])]
val rdd = sc.parallelize(Array((Array(1, 2, 3), Array(3, 4, 5)),
  (Array(5, 6, 7), Array(4, 5, 6))))
and I am trying to do the following:
rdd.flatMap { case (arr1, arr2) =>
  for (i <- arr1; j <- arr2) yield (i, j)
}
I noticed that as I increase the sizes of the arrays from 500 to 5000, the runtime increases from several minutes to about 10 minutes;
however, if I increase the sizes of the arrays from 5K to 6K, the runtime of this operation increases to several hours.
So I am wondering why there is such a big jump in runtime from 5K to 6K, while from 1K to 5K the runtime increases smoothly.
I suspect that the memory limit of the map task is reached and disk operations get involved, resulting in the long runtime, but the sizes are not really big, since I allocated 14 GB of memory and 8 cores to Spark in local mode.

Spark difference between reduceByKey vs. groupByKey vs. aggregateByKey vs. combineByKey

Can anyone explain the difference between reduceByKey, groupByKey, aggregateByKey and combineByKey? I have read the documentation on these, but couldn't understand the exact differences.
An explanation with examples would be great.
groupByKey:
// lines is assumed to be an RDD[String], e.g. the result of sc.textFile(...)
lines.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupByKey()
  .map { case (word, counts) => (word, counts.sum) }
groupByKey can cause out-of-disk problems, as all the data is sent over the network and collected on the reduce workers.
reduceByKey:
lines.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)
Data is combined at each partition, so only one output per key per partition is sent over the network. reduceByKey requires combining all your values into another value of exactly the same type.
aggregateByKey:
The same as reduceByKey, but it takes an initial value.
3 parameters as input:
initial value
sequence op logic (merges a value into the accumulator within a partition)
combiner logic (merges accumulators across partitions)
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
val data = sc.parallelize(keysWithValuesList)
//Create key value pairs
val kv = data.map(_.split("=")).map(v => (v(0), v(1))).cache()
val initialCount = 0;
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2
val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
Aggregate By Key sum Results
bar -> 3
foo -> 5
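For readers who want to see what the two functions do without starting Spark, here is a rough model in plain Python. The aggregate_by_key helper and the split of the data into two partitions are assumptions made for illustration; only the roles of the zero value, the sequence op and the combiner mirror the Scala example above.

```python
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """seq_op folds values into an accumulator within a partition; comb_op merges partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq_op(local.get(key, zero_value), value)
        for key, acc in local.items():
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged

pairs = [tuple(s.split("=")) for s in
         ["foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D"]]
partitions = [pairs[:4], pairs[4:]]  # assumed split into two partitions

count_by_key = aggregate_by_key(
    partitions,
    0,                         # initialCount
    lambda n, v: n + 1,        # addToCounts: the value is ignored, only counted
    lambda p1, p2: p1 + p2,    # sumPartitionCounts
)
print(count_by_key)  # {'foo': 5, 'bar': 3}
```

Note that the accumulator is an Int while the values are Strings, which is exactly the case reduceByKey cannot express.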
combineByKey:
3 parameters as input:
Initial value: unlike aggregateByKey, it need not always be a constant; we can pass a function that returns a new value.
merging function
combine function
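val result = rdd.combineByKey(
  (v: Int) => (v, 1),                                                              // create (sum, count) from the first value of a key
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),                           // merge a value within a partition
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)   // merge across partitions
).map { case (k, v) => (k, v._1 / v._2.toDouble) }                                 // average per key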
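The same per-key average can be traced by hand with a small Python model. Again this is a sketch under assumptions: combine_by_key is a made-up helper that mimics the three roles (create, merge value, merge combiners), and the input pairs and their partitioning are invented.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Build one combiner per key per partition, then merge the combiners across partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        for key, acc in local.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged

partitions = [[("a", 1), ("b", 4), ("a", 3)], [("a", 5), ("b", 6)]]  # hypothetical data

sums_and_counts = combine_by_key(
    partitions,
    lambda v: (v, 1),                          # first value for a key: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # merge a value within a partition
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # merge combiners across partitions
)
averages = {k: total / count for k, (total, count) in sums_and_counts.items()}
print(averages)  # {'a': 3.0, 'b': 5.0}
```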
reduceByKey,aggregateByKey,combineByKey preferred over groupByKey
Avoid groupByKey
groupByKey() just groups your dataset based on a key. It will result in data shuffling when the RDD is not already partitioned.
reduceByKey() is something like grouping + aggregation. We can say reduceByKey() is equivalent to grouping by key followed by a reduce over each group. It shuffles less data than groupByKey().
aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregate result of type y. For example, (1,2),(1,4) as input and (1,"six") as output. It also takes a zero value that will be applied at the beginning of each key.
Note: One similarity is they all are wide operations.
While both reduceByKey and groupByKey will produce the same answer, the
reduceByKey example works much better on a large dataset. That's
because Spark knows it can combine output with a common key on each
partition before shuffling the data.
On the other hand, when calling groupByKey, all the key-value pairs
are shuffled around. This is a lot of unnecessary data to be
transferred over the network.
For more detail, check the link below.
Although both of them will fetch the same results, there is a significant difference in the performance of both the functions. reduceByKey() works better with larger datasets when compared to groupByKey().
In reduceByKey(), pairs on the same machine with the same key are combined (by using the function passed into reduceByKey()) before the data is shuffled. Then the function is called again to reduce all the values from each partition to produce one final result.
In groupByKey(), all the key-value pairs are shuffled around. This is a lot of unnecessary data to be transferred over the network.
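A quick way to see the difference in shuffle volume is to count records on a toy dataset. The sketch below uses plain Python, with made-up (word, 1) pairs already split over two partitions; it models only how many records would cross the network, not Spark's actual shuffle machinery.

```python
from collections import Counter

# Hypothetical (word, 1) pairs, already split across two partitions.
partitions = [
    [("spark", 1), ("spark", 1), ("scala", 1), ("spark", 1)],
    [("scala", 1), ("spark", 1), ("scala", 1)],
]

# groupByKey: every pair crosses the shuffle unchanged.
group_by_key_shuffled = sum(len(part) for part in partitions)

# reduceByKey: each partition combines its own pairs first,
# so at most one record per key per partition is shuffled.
combined = [Counter() for _ in partitions]
for local, part in zip(combined, partitions):
    for word, n in part:
        local[word] += n
reduce_by_key_shuffled = sum(len(local) for local in combined)

# After the shuffle, both strategies arrive at the same final counts.
final_counts = dict(sum(combined, Counter()))
print(group_by_key_shuffled, reduce_by_key_shuffled, final_counts)
```

Here 7 records are shuffled in the groupByKey model against 4 in the reduceByKey model; with many repeated keys per partition the gap grows accordingly.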
reduceByKey - reduceByKey(func, [numTasks])
Data is combined so that each partition has at most one value for each key.
Then the shuffle happens, and the data is sent over the network to a particular executor for some action such as reduce.
groupByKey - groupByKey([numTasks])
It doesn't merge the values for a key; the shuffle happens directly,
so a lot of data gets sent to each partition, almost as much as the initial data.
The merging of values for each key is done after the shuffle.
A lot of data is then stored on the final worker node, which can result in out-of-memory issues.
aggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
It is similar to reduceByKey, but you can provide initial values when performing the aggregation.
Use of reduceByKey:
reduceByKey can be used when we run on a large data set.
Prefer reduceByKey over aggregateByKey when the input and output value types are the same.
Moreover, it is recommended not to use groupByKey and to prefer reduceByKey. For details you can refer here.
You can also refer to this question to understand in more detail how reduceByKey and aggregateByKey work.
Apart from these 4, we have
foldByKey, which is the same as reduceByKey but with a user-defined zero value.
aggregateByKey takes 3 parameters as input and uses 2 functions for merging (one for merging within a partition and another for merging values across partitions). The first parameter is the zero value.
reduceByKey takes only 1 parameter, which is a function for merging.
combineByKey takes 3 parameters, and all 3 are functions. It is similar to aggregateByKey, except that it takes a function in place of the zero value.
groupByKey takes no parameters and groups everything. It also incurs overhead for data transfer across partitions.
