Can Spark map function assign dynamic variable? - apache-spark

I know the map function can do something like
val a = 5
rdd.map(data => data + a)
Is it possible for the variable a to be dynamic?
For example, the value of a ranges from 1 to 5, so a = 1, 2, 3, 4, 5.
When I call the map function, can it execute in a distributed way like
data + 1
data + 2
data + 3
data + 4
data + 5

If I'm understanding your question correctly, it doesn't make sense from a Spark perspective. What you're asking for makes sense in a non-distributed, sequential processing environment (where a different function can be deterministically applied to each datum). However, Spark applies transformations across distributed datasets, and the functions applied by these transformations are identical.
One way to achieve what you are trying to do is to use some inherent quality of the input when transforming your data. This way, even though your transformation function is identical, the arguments provided to it allow it to behave like (what you described as) a "dynamic variable". In your example, the zipWithIndex() function can suffice. It is important to note, though, that if ordering is not guaranteed, the indexes are subject to change on each run of the transformation.
scala> val rdd = sc.parallelize(Array(1,1,1,1,1,1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> val newRDD = rdd.zipWithIndex().map { case (elem, idx) => elem + idx }
...
scala> newRDD.take(6)
...
res0: Array[Long] = Array(1, 2, 3, 4, 5, 6)
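If the goal is specifically a value that cycles through 1 to 5, the index can be folded back into that range. A minimal sketch (names are illustrative), assuming the positional order produced by zipWithIndex is an acceptable source of the "dynamic" value:
scala> val rdd = sc.parallelize(Array(10, 10, 10, 10, 10))
scala> // idx runs 0, 1, 2, 3, 4, so (idx % 5) + 1 adds 1, 2, 3, 4, 5 in turn
scala> val result = rdd.zipWithIndex().map { case (data, idx) => data + (idx % 5) + 1 }
scala> result.collect()   // Array(11, 12, 13, 14, 15)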

Related

spark parallelize(List(1,2,3,4),2) always partition the list in order?

I've run the code below and the result is 37.
val z = sc.parallelize(List(1,2,7,4,30,6), 2)
z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 37
It seems that Spark partitions the list into 2 lists: [1, 2, 7] and [4, 30, 6].
Then I changed the order of 7 and 4 in the list and I got 34.
scala> val z = sc.parallelize(List(1,2,4,7,30,6), 2)
z: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> z.aggregate(0)(math.max(_, _), _ + _)
res11: Int = 34
What I want to know is whether Spark always keeps the order of the elements in the list when partitioning.
Thanks!
There are two different concepts here.
The order of items, which is preserved when using parallelize and applying transformations that don't require shuffling.
The order of items during aggregation, which is not preserved and is non-deterministic. While each partition is aggregated sequentially, the order in which partial results are merged is arbitrary.
In general, don't depend on the order of values and operations unless you enforce it explicitly (for example by sorting) or you know exactly what you're doing.
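To make the two concepts concrete, here is a quick check you can run in the shell (glom gathers each partition into an array, so you can see the layout parallelize produced; the exact split depends on the number of partitions you request):
scala> val z = sc.parallelize(List(1, 2, 7, 4, 30, 6), 2)
scala> z.glom().collect()   // Array(Array(1, 2, 7), Array(4, 30, 6)) - input order kept within partitions
scala> // per-partition maxima are 7 and 30; the combine step adds them to the zero value: 0 + 7 + 30 = 37
scala> z.aggregate(0)(math.max(_, _), _ + _)   // 37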

Can we prevent laziness of Apache Spark Transformation?

Recently, an interviewer asked me how we can prevent the laziness of an Apache Spark transformation. I know that we can persist and cache an RDD dataset, but in case of failure it is recomputed from its parent.
Can anyone please explain whether there is any function to stop the laziness of a Spark transformation?
By design, Spark transformations are lazy, and you must use an action in order to retrieve a concrete value out of them.
For example, the following transformations will always remain lazy:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
Functions like map return RDDs, and you can only turn those RDDs into real values by performing actions, such as reduce:
int totalLength = lineLengths.reduce((a, b) -> a + b);
There is no flag that will make map return a concrete value (for example, a list of integers).
The bottom line is that you can use collect or any other Spark action to 'prevent the laziness' of a transformation:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
List<Integer> collectedLengths = lineLengths.collect();
Remember, though, that using collect on a large dataset is probably very bad practice and may make your driver run out of memory.
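If the intent is just to force eager evaluation (rather than to pull results to the driver), a common pattern is to persist the RDD and then trigger it with a cheap action such as count. A minimal sketch in Scala, under the same data.txt assumption as above:
val lineLengths = sc.textFile("data.txt").map(_.length).persist()
lineLengths.count()   // action: materializes the RDD and populates the cache without collecting it
Subsequent actions then reuse the cached data, although, as the question notes, lost partitions are still recomputed from the parent on failure.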

How do I split an RDD into two or more RDDs?

I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD.
If you're familiar with SAS, something like this:
data work.split1, work.split2;
    set work.preSplit;
    if (condition1)
        output work.split1
    else if (condition2)
        output work.split2
run;
which resulted in two distinct data sets. It would have to be immediately persisted to get the results I intend...
It is not possible to yield multiple RDDs from a single transformation*. If you want to split an RDD you have to apply a filter for each split condition. For example:
def even(x): return x % 2 == 0
def odd(x): return not even(x)
rdd = sc.parallelize(range(20))
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If you have only a binary condition and computation is expensive you may prefer something like this:
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
It means only a single predicate computation, but it requires an additional pass over all the data.
It is important to note that as long as the input RDD is properly cached and there are no additional assumptions regarding data distribution, there is no significant difference in time complexity between repeated filter and a for-loop with nested if-else.
With N elements and M conditions, the number of operations you have to perform is clearly proportional to N times M. In the case of the for-loop it should be closer to (N + MN) / 2, while repeated filter is exactly NM, but at the end of the day it is nothing other than O(NM). You can see my discussion** with Jason Lenderman to read about some pros and cons.
At a very high level you should consider two things:
Spark transformations are lazy; until you execute an action your RDD is not materialized.
Why does it matter? Going back to my example:
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If later I decide that I need only rdd_odd then there is no reason to materialize rdd_even.
If you take a look at your SAS example, to compute work.split2 you need to materialize both the input data and work.split1.
RDDs provide a declarative API. When you use filter or map, it is completely up to the Spark engine how the operation is performed. As long as the functions passed to transformations are side-effect free, this creates multiple possibilities to optimize the whole pipeline.
At the end of the day this case is not special enough to justify its own transformation.
This map-with-filter pattern is actually used in core Spark. See my answer to How does Spark's RDD.randomSplit actually split the RDD and the relevant part of the randomSplit method.
If the only goal is to achieve a split of the input, it is possible to use the partitionBy clause of DataFrameWriter with the text output format:
def makePairs(row: T): (String, String) = ???

data
  .map(makePairs).toDF("key", "value")
  .write.partitionBy("key").format("text").save(...)
* There are only 3 basic types of transformations in Spark:
RDD[T] => RDD[T]
RDD[T] => RDD[U]
(RDD[T], RDD[U]) => RDD[W]
where T, U, W can be either atomic types or products / tuples (K, V). Any other operation has to be expressed using some combination of the above. You can check the original RDD paper for more details.
** https://chat.stackoverflow.com/rooms/91928/discussion-between-zero323-and-jason-lenderman
*** See also Scala Spark: Split collection into several RDD?
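To make footnote * concrete, here is a small sketch (names are illustrative only, assuming a SparkContext sc) showing one standard operation of each shape:
import org.apache.spark.rdd.RDD

val ints: RDD[Int]            = sc.parallelize(1 to 10)
val evens: RDD[Int]           = ints.filter(_ % 2 == 0)   // RDD[T] => RDD[T]
val strs: RDD[String]         = ints.map(_.toString)      // RDD[T] => RDD[U]
val pairs: RDD[(Int, String)] = ints.zip(strs)            // (RDD[T], RDD[U]) => RDD[W]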
As other posters mentioned above, there is no single, native RDD transform that splits RDDs, but here are some "multiplex" operations that can efficiently emulate a wide variety of "splitting" on RDDs, without reading the input multiple times:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions
Some methods specific to random splitting:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
These methods are available from the open-source silex project:
https://github.com/willb/silex
A blog post explaining how they work:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
def muxPartitions[U: ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[U],
    persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => Iterator.single(itr.next()(j)) } }
}

def flatMuxPartitions[U: ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[TraversableOnce[U]],
    persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => itr.next()(j).toIterator } }
}
As mentioned elsewhere, these methods do involve a trade-off of memory for speed, because they operate by computing entire partition results "eagerly" instead of "lazily." Therefore, it is possible for these methods to run into memory problems on large partitions, where more traditional lazy transforms will not.
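A hypothetical usage sketch, based only on the flatMuxPartitions signature quoted above (it assumes silex's RDD implicits are in scope; see the silex docs for the exact import): splitting an RDD[Int] into evens and odds with a single pass over each partition.
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 20)
// For each partition, build both output slices at once; slice j feeds the j-th returned RDD.
val Seq(evens, odds) = rdd.flatMuxPartitions(2, (_: Int, itr: Iterator[Int]) => {
  val (e, o) = itr.toSeq.partition(_ % 2 == 0)
  Seq(e, o)
}, StorageLevel.MEMORY_ONLY)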
One way is to use a custom partitioner to partition the data depending upon your filter condition. This can be achieved by extending Partitioner and implementing something similar to the RangePartitioner.
mapPartitions can then be used to construct multiple RDDs from the partitioned RDD without reading all the data.
import org.apache.spark.TaskContext

val filtered = partitioned.mapPartitions { iter =>
  new Iterator[Int] {
    override def hasNext: Boolean = {
      // Only emit elements from partitions we want to keep; other partitions appear empty.
      if (rangeOfPartitionsToKeep.contains(TaskContext.get().partitionId)) {
        iter.hasNext
      } else {
        false
      }
    }
    override def next(): Int = iter.next()
  }
}
Just be aware that the number of partitions in the filtered RDDs will be the same as the number in the partitioned RDD so a coalesce should be used to reduce this down and remove the empty partitions.
If you split an RDD using the randomSplit API call, you get back an array of RDDs.
If you want 5 RDDs returned, pass in 5 weight values.
e.g.
val sourceRDD = sc.parallelize(1 to 100, 4)
val seedValue = 5
val splitRDD = sourceRDD.randomSplit(Array(1.0,1.0,1.0,1.0,1.0), seedValue)
splitRDD(1).collect()
res7: Array[Int] = Array(1, 6, 11, 12, 20, 29, 40, 62, 64, 75, 77, 83, 94, 96, 100)

Are multiple reduceByKey on the same RDD compiled into a single scan?

Suppose I have an RDD (50M records/day) which I want to summarize in several different ways.
The RDD records are 4-tuples: (keep, foo, bar, baz).
keep - boolean
foo, bar, baz - 0/1 int
I want to count how many of each of foo, etc. are kept and dropped, i.e., I have to do the following for foo (and the same for bar and baz):
# r is the (keep, foo, bar, baz) tuple
rdd.filter(lambda r: r[1] == 1) \
   .map(lambda r: (r[0], 1)) \
   .reduceByKey(operator.add)
which would return (after collect) a list like [(True,40000000),(False,10000000)].
The question is: is there an easy way to avoid scanning rdd 3 times (once for each of foo, bar, baz)?
What I mean is not a way to rewrite the above code to handle all 3 fields, but telling spark to process all 3 pipelines in a single pass.
It's possible to execute the three pipelines in parallel by submitting the job with different threads, but this will pass through the RDD three times and require up to 3x more resources on the cluster.
It's possible to get the job done in one pass by rewriting the job to handle all counts at once - the answer regarding aggregate is an option. Splitting the data into pairs (keep, foo), (keep, bar), (keep, baz) would be another.
It's not possible to get the job done in one pass without any code changes, as there would be no way for Spark to know that those jobs relate to the same dataset. At most, the speed of the jobs after the first one can be improved by caching the initial rdd with rdd.cache before the .filter().map().reduce() steps; this will still pass through the RDD 3 times, but the 2nd and 3rd passes can be much faster if all the data fits in the memory of the cluster:
rdd.cache
// first reduceByKey action will trigger the cache and rdd data will be kept in memory
val foo = rdd.filter(fooFilter).map(fooMap).reduceByKey(???)
// subsequent operations will execute faster as the rdd is now available in mem
val bar = rdd.filter(barFilter).map(barMap).reduceByKey(???)
val baz = rdd.filter(bazFilter).map(bazMap).reduceByKey(???)
If I were doing this, I would create pairs of the relevant data and count them in a single pass:
// We split the initial tuple into pairs keyed by the data type ("foo", "bar", "baz")
// and the keep flag. dataPairs will contain data like: (("bar", true), 1), (("foo", false), 1)
val dataPairs = rdd.flatMap { case (keep, foo, bar, baz) =>
  def condPair(name: String, x: Int): Option[((String, Boolean), Int)] =
    if (x == 1) Some(((name, keep), x)) else None
  Seq(condPair("foo", foo), condPair("bar", bar), condPair("baz", baz)).flatten
}
val totals = dataPairs.reduceByKey(_ + _)
This is easy and will pass over the data only once, but it requires rewriting the code. I'd say it scores 66.66% in answering the question.
If I'm reading your question correctly, you want RDD.aggregate.
val zeroValue = (0L, 0L, 0L, 0L, 0L, 0L) // tfoo, tbar, tbaz, ffoo, fbar, fbaz
rdd.aggregate(zeroValue)(
  (prior, current) => if (current._1) {
    (prior._1 + current._2, prior._2 + current._3, prior._3 + current._4,
     prior._4, prior._5, prior._6)
  } else {
    (prior._1, prior._2, prior._3,
     prior._4 + current._2, prior._5 + current._3, prior._6 + current._4)
  },
  (left, right) =>
    (left._1 + right._1,
     left._2 + right._2,
     left._3 + right._3,
     left._4 + right._4,
     left._5 + right._5,
     left._6 + right._6)
)
Aggregate is conceptually like the familiar reduce/fold on a list, but RDDs aren't lists; they're distributed, so you provide two function arguments: one to operate on each partition, and one to combine the results of processing the partitions.
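To illustrate the two roles on something smaller, here is a toy sketch (assuming a SparkContext sc) that computes a sum and a count in one pass:
val nums = sc.parallelize(1 to 6, 2)
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),       // seqOp: runs inside each partition
  (a, b)   => (a._1 + b._1, a._2 + b._2)      // combOp: merges the per-partition results
)
// sum = 21, count = 6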

pyspark fold method output

I'm surprised at this output from fold; I can't imagine what it's doing.
I would expect that something.fold(0, lambda a,b: a+1) would return the number of elements in something, since the fold starts at 0 and adds 1 for each element.
sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
8
I'm coming from Scala, where fold works as the way I've described. So how is fold supposed to work in pyspark? Thanks for your thoughts.
To understand what's going on here, let's look at the definition of Spark's fold operation. Since you're using PySpark, I'm going to show the Python version of the code, but the Scala version exhibits the exact same behavior (you can also browse the source on GitHub):
def fold(self, zeroValue, op):
    """
    Aggregate the elements of each partition, and then the results for all
    the partitions, using a given associative function and a neutral "zero
    value."

    The function C{op(t1, t2)} is allowed to modify C{t1} and return it
    as its result value to avoid object allocation; however, it should not
    modify C{t2}.

    >>> from operator import add
    >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
    15
    """
    def func(iterator):
        acc = zeroValue
        for obj in iterator:
            acc = op(obj, acc)
        yield acc
    vals = self.mapPartitions(func).collect()
    return reduce(op, vals, zeroValue)
(For comparison, see the Scala implementation of RDD.fold).
Spark's fold operates by first folding each partition and then folding the results. The problem is that an empty partition gets folded down to the zero element, so the final driver-side fold ends up folding one value for every partition rather than one value for each non-empty partition. This means that the result of fold is sensitive to the number of partitions:
>>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
100
>>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
50
>>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
1
In this last case, what's happening is that the single partition is being folded down to the correct value, then that value is folded with the zero-value at the driver to yield 1.
It seems that Spark's fold() operation actually requires the fold function to be commutative in addition to associative. There are actually other places in Spark that impose this requirement, such as the fact that the ordering of elements within a shuffled partition can be non-deterministic across runs (see SPARK-5750).
I've opened a Spark JIRA ticket to investigate this issue: https://issues.apache.org/jira/browse/SPARK-6416.
Let me try to give simple examples to explain the fold method of Spark. I will be using PySpark here.
rdd1 = sc.parallelize(list([]), 1)
The above line creates an empty RDD with one partition.
rdd1.fold(10, lambda x, y: x + y)
This yields 20 as output.
rdd2 = sc.parallelize(list([1, 2, 3, 4, 5]), 2)
The above line creates an RDD with the values 1 to 5, split across a total of 2 partitions.
rdd2.fold(10, lambda x, y: x + y)
This yields 45 as output.
For the sake of simplicity, what is happening here is that the zero value is 10. The sum you would otherwise get of all numbers in the RDD is increased by 10 at the driver (i.e. zero value + all other elements => 10 + 1 + 2 + 3 + 4 + 5 = 25). In addition, the zero value is applied once per partition (i.e. number of partitions * zero value => 2 * 10 = 20).
The final output that fold emits is therefore 25 + 20 = 45.
By the same reasoning it is clear why the fold operation on rdd1 yielded 20 as output.
reduce fails when we have an empty RDD, for example rdd1.reduce(lambda x, y: x + y):
ValueError: Can not reduce() empty RDD
fold can be used if we think the RDD may be empty:
rdd1.fold(0, lambda x, y: x + y)
As expected, this yields 0 as output.
