How to force Spark to perform reduction locally - apache-spark

I'm looking for a way to force Spark to perform the reduction locally, across all the tasks executed by a worker's cores, before performing it across all tasks.
Indeed, my driver node and the network bandwidth seem to be overloaded because of the large task results (400 MB).
val arg0 = sc.broadcast(fs.read(0, 4))
val arg1 = sc.broadcast(fs.read(1, 4))
val arg2 = fs.read(5, 4)
val index = sc.parallelize(0L until 10000L)
val mapres = index.map { x => function(arg0.value, arg1.value, x, arg2) }
val output = mapres.reduce(Util.bitor)
The driver distributes one partition per processor core, so 8 partitions per worker.

There is nothing to force, because reduce already applies the reduction locally to each partition; only the final merge of the per-partition results is applied on the driver. Not to mention that 400 MB shouldn't be a problem in any sensible configuration.
Still, if you want to perform more of the merging on the workers you can use treeReduce, although with 8 partitions there is almost nothing to gain.
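If you do switch, the change is a one-liner; a hedged sketch (treeReduce has the same contract as reduce, plus a depth parameter that controls how many rounds of partial aggregation run on the executors before the final value reaches the driver):
val output = mapres.treeReduce(Util.bitor, depth = 2)  // depth 2 is the default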

Related

Why does a Spark stage have only one function, instead of the many transformations in a map stage?

I am confused about why a stage has only one function.
In the code below, the map stage should contain two map functions instead of one:
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("test").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val data = Array("Runoob", "Baidu", "Google")
  val distData = sc.parallelize(data)
    .map(x => (x, 1))
    .map(x => x._2 + 1)
    .collect()
  distData.length
}
A stage is a set of independent tasks all computing the same function that need to run as part of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the DAGScheduler runs these stages in topological order.
(That is how the Spark source documentation defines a stage.)
Consecutive map operations are often combined into a single map operation. Presumably, Spark realized that composing
x => (x, 1)
with
x => x._2 + 1
is equivalent to
x => 1 + 1
i.e. x => 2, which is a single function. That's why you only saw one function in the Spark stage.
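As a plain-Scala illustration (no Spark involved; the value names are mine), the two lambdas compose into a single constant function:
val f: String => (String, Int) = x => (x, 1)       // first map
val g: ((String, Int)) => Int  = t => t._2 + 1     // second map
val composed: String => Int    = f.andThen(g)      // behaves as x => 2
// composed("Runoob") == 2, composed("Baidu") == 2, composed("Google") == 2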

Internals of reduce function in spark-shell

The input file contains 20 lines. I am trying to count the total number of records using the reduce function. Can anyone please explain why there is a difference between the results? The value of y here is nothing but 1.
Default number of partitions: 4
scala> val rdd = sc.textFile("D:\\LearningPythonTomaszDenny\\Codebase\\wholeTextFiles\\names1.txt")
scala> rdd.map(x=>1).reduce((acc,y) => acc+1)
res17: Int = 8
scala> rdd.map(x=>1).reduce((acc,y) => acc+y)
res18: Int = 20
"Because here value of y is nothing but only 1."
That is simply not true. reduce consists of three stages (not in the strict Spark meaning of the word):
Distributed reduction within each partition.
Collection of the partial results to the driver (synchronous or asynchronous depending on the backend).
Local reduction on the driver.
In your case the results of the first and second stage will be the same either way, but the first approach simply ignores the partial results. In other words, no matter what the result for a partition was, it always adds only 1.
Such an approach would work only with a non-parallel, single-pass sequential reduce implementation.
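A minimal plain-Scala sketch (no Spark; the partition layout is my assumption of 4 partitions with 5 records each) of how the two variants diverge:
val partitions = Seq.fill(4)(Seq.fill(5)(1))                    // 4 partitions x 5 records, each mapped to 1
val partials   = partitions.map(_.reduce((acc, y) => acc + 1))  // per-partition reduce: Seq(5, 5, 5, 5)
val ignoreY    = partials.reduce((acc, y) => acc + 1)           // driver merge ignoring partials: 5 + 1 + 1 + 1 = 8
val useY       = partials.reduce((acc, y) => acc + y)           // driver merge using partials:  5 + 5 + 5 + 5 = 20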

Spark flatmap: how much memory can a map task get?

Hi, I have an RDD containing tuples of arrays, i.e. of type
RDD[(Array[Int], Array[Int])]
val rdd = sc.parallelize(Array(
  (Array(1, 2, 3), Array(3, 4, 5)),
  (Array(5, 6, 7), Array(4, 5, 6))
  // ...
))
and I am trying to do the following:
rdd.flatMap { case (arr1, arr2) =>
  for (i <- arr1; j <- arr2) yield (i, j)
}
I noticed that as I increase the size of the arrays from 500 to 5000, the runtime increases from several minutes to about 10 minutes;
however, if I increase the size of the arrays from 5K to 6K, the runtime of this operation increases to several hours.
So I am wondering why I get such a big increase in runtime from 5K to 6K, while from 1K to 5K the runtime increases smoothly.
I suspect that maybe the memory limit of the map task is reached and disk operations get involved, resulting in the long runtime, but the sizes are not really that big, since I allocated 14 GB of memory and 8 cores to Spark in local mode.
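For what it's worth, a small back-of-the-envelope sketch (plain Scala, numbers taken from the array sizes in the question) of how fast the flatMap output grows, since every input record emits |arr1| * |arr2| pairs:
// Output grows quadratically with the array size:
// 500 -> 250,000 pairs, 5000 -> 25,000,000 pairs, 6000 -> 36,000,000 pairs per record.
Seq(500, 5000, 6000).foreach { n =>
  println(s"array size $n -> ${n.toLong * n.toLong} pairs per record")
}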

Spark difference between reduceByKey vs. groupByKey vs. aggregateByKey vs. combineByKey

Can anyone explain the difference between reduceByKey, groupByKey, aggregateByKey and combineByKey? I have read the documentation regarding this, but couldn't understand the exact differences.
An explanation with examples would be great.
groupByKey:
Syntax:
sparkContext.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupByKey()
  .map { case (word, counts) => (word, counts.sum) }
groupByKey can cause out-of-disk problems, as the data is sent over the network and collected on the reducing workers.
reduceByKey:
Syntax:
sparkContext.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)
Data is combined at each partition, with only one output per key at each partition to send over the network. reduceByKey requires combining all your values into another value of the exact same type.
aggregateByKey:
Same as reduceByKey, but it additionally takes an initial (zero) value.
It takes 3 inputs:
the initial (zero) value
the sequence op (merges a value into the accumulator within a partition)
the combiner op (merges accumulators across partitions)
Example:
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
val data = sc.parallelize(keysWithValuesList)
//Create key value pairs
val kv = data.map(_.split("=")).map(v => (v(0), v(1))).cache()
val initialCount = 0;
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2
val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
output:
Aggregate By Key sum Results
bar -> 3
foo -> 5
combineByKey:
It takes 3 parameters as input:
createCombiner: unlike aggregateByKey, you need not always pass a constant; you can pass a function that returns a new accumulator from the first value seen for a key.
mergeValue: the merging function, which folds a value into the accumulator within a partition.
mergeCombiners: the combine function, which merges accumulators across partitions.
Example:
val result = rdd.combineByKey(
  (v: Int) => (v, 1),
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map { case (k, v) => (k, v._1 / v._2.toDouble) }
result.collect.foreach(println)
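The question does not show what rdd contains; assuming it is a pair RDD of, say, scores per subject (and a SparkContext named sc), the snippet computes a per-key average:
// Hypothetical input, not from the original post: RDD[(String, Int)]
val rdd = sc.parallelize(Seq(("math", 50), ("math", 60), ("english", 70), ("english", 80)))
// After the combineByKey + map above, the result is roughly:
// (math, 55.0), (english, 75.0)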
reduceByKey, aggregateByKey and combineByKey are preferred over groupByKey.
Reference:
Avoid groupByKey
groupByKey() just groups your dataset based on a key. It results in a data shuffle when the RDD is not already partitioned.
reduceByKey() is something like grouping plus aggregation. You can say reduceByKey() is equivalent to dataset.group(...).reduce(...). It shuffles less data than groupByKey().
aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregated result of type y, for example (1,2),(1,4) as input and (1,"six") as output (see the sketch after this note). It also takes a zero value that is applied at the beginning of each key.
Note: one similarity is that they are all wide operations.
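A minimal sketch (toy data of my own, assuming a SparkContext named sc) of aggregateByKey producing a value type different from the input type:
// Input values are Int, but the aggregated result per key is a String:
// the zero value "" and both functions work on String accumulators.
val input = sc.parallelize(Seq((1, 2), (1, 4)))
val concatenated = input.aggregateByKey("")(
  (acc, v) => if (acc.isEmpty) v.toString else acc + "+" + v,          // seqOp: fold an Int into the String
  (a, b)   => if (a.isEmpty) b else if (b.isEmpty) a else a + "+" + b  // combOp: merge the Strings
)
// e.g. (1, "2+4"): the result type (String) differs from the input value type (Int)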
While both reduceByKey and groupByKey will produce the same answer, the reduceByKey example works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
On the other hand, when calling groupByKey, all the key-value pairs are shuffled around. This is a lot of unnecessary data to be transferred over the network.
For more detail, check the link below:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
Although both of them will fetch the same results, there is a significant difference in the performance of the two functions. reduceByKey() works better with larger datasets than groupByKey() does.
In reduceByKey(), pairs on the same machine with the same key are combined (using the function passed into reduceByKey()) before the data is shuffled. Then the function is called again to reduce all the values from each partition and produce one final result.
In groupByKey(), all the key-value pairs are shuffled around, which is a lot of unnecessary data to transfer over the network.
reduceByKey - reduceByKey(func, [numTasks]):
Data is combined at each partition, so that there is at most one output value per key at each partition to send over the network.
Then the shuffle happens and the combined values are sent to particular executors for an action such as reduce.
groupByKey - groupByKey([numTasks]):
It doesn't merge the values for a key; the shuffle happens directly,
and a lot of data gets sent to each partition, almost the same amount as the initial data.
The merging of values for each key is done after the shuffle.
A lot of data ends up stored on the final worker nodes, which can result in out-of-memory issues.
aggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]):
It is similar to reduceByKey, but you can provide an initial value when performing the aggregation.
Use of reduceByKey:
reduceByKey can be used when we run on a large data set.
Prefer reduceByKey over aggregateByKey when the input and output value types are of the same type.
Moreover, it is recommended not to use groupByKey and to prefer reduceByKey.
Then, apart from these four, we have
foldByKey, which is the same as reduceByKey but with a user-defined zero value.
aggregateByKey takes 3 parameters as input and uses 2 functions for merging (one for merging within a partition and another to merge values across partitions; the first parameter is the zero value),
whereas
reduceByKey takes only 1 parameter, which is a function for merging.
combineByKey takes 3 parameters, all of which are functions. It is similar to aggregateByKey, except that it can take a function for the zero value.
groupByKey takes no parameter and groups everything. Also, it adds overhead for data transfer across partitions.
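To make the comparison concrete, here is a minimal sketch (toy data of my own, assuming a SparkContext named sc) that expresses the same per-key sum with each operation:
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val viaReduce    = pairs.reduceByKey(_ + _)                           // one merging function
val viaFold      = pairs.foldByKey(0)(_ + _)                          // zero value + merging function
val viaAggregate = pairs.aggregateByKey(0)(_ + _, _ + _)              // zero value + within/across-partition functions
val viaCombine   = pairs.combineByKey((v: Int) => v,                  // create the accumulator from the first value
                                      (acc: Int, v: Int) => acc + v,  // merge within a partition
                                      (a: Int, b: Int) => a + b)      // merge across partitions
val viaGroup     = pairs.groupByKey().mapValues(_.sum)                // shuffles every pair, then sums
// All five yield ("a", 2), ("b", 1).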

Modify an existing RDD without replicating memory

I am trying to implement a distributed algorithm using Spark. It is a computer vision algorithm with tens of thousands of images. The images are divided into "partitions" that are processed in a distributed fashion, with the help of a master process. The pseudocode goes something like this:
# Iterate
for t = 1 ... T
    # Each partition
    for p = 1 ... P
        d[p] = f1(b[p], z[p], u[p])
    # Master
    y = f2(d)
    # Each partition
    for p = 1 ... P
        u[p] = f3(u[p], y)
    # Each partition
    for p = 1 ... P
        # Iterate
        for t = 1 ... T
            z[p] = f4(b[p], y, v[p])
            v[p] = f5(z[p])
where b[p] contains the pth partition of the images and is a numpy ndarray, z[p] contains some function of b[p] and is also a numpy ndarray, y is computed on the master knowing all the partitions of d, and u[p] is then updated on each partition knowing y. In my attempted implementation, all of b, z, and u are separate RDDs with corresponding keys (e.g. (1, b[1]), (1, z[1]) and (1, u[1]) correspond to the first partition, etc.).
The problem with using Spark is that b and z are extremely large, on the order of GBs. Since RDDs are immutable, whenever I want to "join" them (e.g. bring z[1] and b[1] onto the same machine for processing) they are replicated, i.e. new copies of the numpy arrays are returned. This just multiplies the amount of memory needed and limits the number of images that can be processed.
I thought a way to avoid the joins is to have an RDD that combines all of the variables, e.g. (p, (z[p], b[p], u[p], v[p])), but then the immutability problem is still there.
So my question: is there a workaround to update an RDD in place? For example, if I have the RDD as (p, (z[p], b[p], u[p], v[p])), can I update z[p] in memory?
