RDD Sliding error not understood - apache-spark

Given that this works:
(1 to 5).iterator.sliding(3).toList
Then why does this not work?
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
val z = rdd1.iterator.sliding(3).toList
I get the following error, and when I try to apply the suggested fix it does not work either:
notebook:3: error: missing argument list for method iterator in class RDD
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `iterator _` or
`iterator(_,_)` instead of `iterator`.
val z = rdd1.iterator.sliding(3).toList
^
I am just trying out examples, and this one I cannot really follow.

It doesn't work because RDD is not a collection, and its iterator method has a different signature:
final def iterator(split: Partition, context: TaskContext): Iterator[T]
Internal method to this RDD; will read from cache if applicable, or otherwise compute it. This should not be called by users directly, but is available for implementors of custom subclasses of RDD.
If you want to convert RDD to local Iterator use toLocalIterator:
def toLocalIterator: Iterator[T]
Return an iterator that contains all of the elements in this RDD.
rdd1.toLocalIterator
but what you probably want is RDDFunctions.sliding from spark-mllib, which operates on neighboring elements of an RDD in Spark.
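As a rough sketch of both options (assuming sc is a SparkContext and that spark-mllib is on the classpath for the sliding variant):

// Local approach: toLocalIterator returns a plain Scala Iterator, so the
// standard sliding method applies (data is pulled to the driver one
// partition at a time).
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
val localWindows = rdd1.toLocalIterator.sliding(3).toList

// Distributed approach: the implicit conversion from spark-mllib adds a
// sliding method that keeps the windowing on the cluster and yields an
// RDD[Array[Int]].
import org.apache.spark.mllib.rdd.RDDFunctions._
val distributedWindows = rdd1.sliding(3).collect()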

Related

CombineByKey object is not iterable [duplicate]

If I have an RDD, how do I tell whether the data is in key:value
format? Is there a way to find out, something like type(object)
telling me an object's type? I tried print type(rdd.take(1)), but it
just says <type 'list'>.
Let's say I have data like (x,1),(x,2),(y,1),(y,3) and I use
groupByKey and get (x,(1,2)),(y,(1,3)). Is there a way to define
(1,2) and (1,3) as values where x and y are keys? Or does a key have to be a single value? I noticed that if I use reduceByKey with a sum function to get the data ((x,3),(y,4)), it becomes much easier to treat this data as a key-value pair.
Python is a dynamically typed language and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD operations is that it can be unpacked as follows:
k, v = kv
Typically you would use a two-element tuple due to its semantics (an immutable object of fixed size) and its similarity to Scala Product classes, but this is just a convention and nothing stops you from doing something like this:
key_value.py:

class KeyValue(object):
    def __init__(self, k, v):
        self.k = k
        self.v = v

    def __iter__(self):
        for x in [self.k, self.v]:
            yield x

and then, with add imported from the operator module:

from operator import add
from key_value import KeyValue

rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])
rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]
and make an arbitrary class behave like a key-value pair. So once again: if something can be correctly unpacked as a pair of objects, then it is a valid key-value. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
Also, rdd.take(n) returns a list of n elements, so type(rdd.take(1)) will always be the same (list), regardless of what the RDD contains.

How do you get Apache Spark to reduce before finishing map to reduce memory usage?

I'm doing a map-reduce job with Apache Spark, but the mapping step produces a structure, which uses up a lot of memory. How can I get it to reduce and delete from memory the map before adding additional mapped objects to memory?
I'm basically doing myrdd.map(f).reduce(r). However, f returns a very big object, so I need the reducer to run and then delete the mapped objects from memory before too many pile up. Can I do this somehow?
Similar to the combiner in MapReduce, when working with key/value pairs the combineByKey() interface can be used to customize the combiner functionality. Methods like reduceByKey() by default use their own combiner to combine the data locally within each partition for a given key.
Similar to aggregate() (which is used with single-element RDDs), combineByKey() allows the user to return an RDD element type that differs from the element type of the input RDD.
trait SmallThing
trait BigThing
val mapFunction: SmallThing => BigThing = ???
val reduceFunction: (BigThing, BigThing) => BigThing = ???
val rdd: RDD[SmallThing] = ???
//initial implementation:
val result1: BigThing = rdd.map(mapFunction).reduce(reduceFunction)
//equivalent implementation:
val emptyBigThing: BigThing = ???
val result2: BigThing = rdd.aggregate(emptyBigThing)(
  seqOp = (agg, small) => reduceFunction(agg, mapFunction(small)),
  combOp = reduceFunction
)
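For the key/value case described above, a combineByKey sketch in the same placeholder style (the pair RDD here is hypothetical) might look like:

// Hypothetical key/value variant: build a combiner per key locally within
// each partition, then merge the partial results across partitions.
val pairRdd: RDD[(String, SmallThing)] = ???

val result3: RDD[(String, BigThing)] = pairRdd.combineByKey(
  (small: SmallThing) => mapFunction(small),                                     // createCombiner
  (agg: BigThing, small: SmallThing) => reduceFunction(agg, mapFunction(small)), // mergeValue
  (a: BigThing, b: BigThing) => reduceFunction(a, b)                             // mergeCombiners
)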

Too many Values to Unpack in SortByKey [duplicate]

This is the same underlying question and answer as "CombineByKey object is not iterable" above.

Use of implicit parameter in spark

The code below is used to find the average of values. I am not sure why the implicit num: Numeric[T] parameter is used in the average function.
Code:
val data = List(("32540b03",-0.00699), ("a93dec11",0.00624),
("32cc6532",0.02337) , ("32540b03",0.256023),
("32cc6532",-0.03591),("32cc6532",-0.03591))
val rdd = sc.parallelize(data.toSeq).groupByKey().sortByKey()
def average[T](ts: Iterable[T])(implicit num: Numeric[T]) = {
num.toDouble( ts.sum ) / ts.size
}
val avgs = rdd.map(x => (x._1, average(x._2)))
Please help me understand the reason for using the (implicit num: Numeric[T]) parameter.
Scala does not have a common superclass for numeric types. This means you can't constrain T (e.g. T <: Number) so that the average makes sense (you can't really take an average of arbitrary objects). By adding the implicit Numeric[T] parameter you make sure there is a toDouble method that converts T to Double.
You could always pass that conversion function explicitly, but that would mean an additional parameter, so a Numeric instance is used instead. If you tried something like average(List("bla")) you would get a compile error saying that no num instance can be found.
See also https://twitter.github.io/scala_school/advanced-types.html#otherbounds
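A self-contained sketch of the idea (the context-bound form is just sugar for the same implicit parameter):

// Generic average: the implicit Numeric[T] evidence supplies both the
// summation (ts.sum uses it) and the conversion to Double.
def average[T](ts: Iterable[T])(implicit num: Numeric[T]): Double =
  num.toDouble(ts.sum) / ts.size

// Equivalent form using a context bound, which desugars to the implicit
// parameter above.
def averageCB[T: Numeric](ts: Iterable[T]): Double = {
  val num = implicitly[Numeric[T]]
  num.toDouble(ts.sum) / ts.size
}

average(List(1, 2, 3, 4))    // 2.5
average(List(1.5, 2.5))      // 2.0
// average(List("bla"))      // does not compile: no implicit Numeric[String]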

Broadcasting a single value Spark(python)

Let's say I have the following list and a single value:
alist = [1,2,3,4,5]
alistRDD = sc.parallelize(alist)
single_value = 3
and the following function:
def a_fun(x, y):
    return x + y
And I am doing the following:
alistRDD.map(lambda x:a_fun(x,single_value))
So the second argument I pass to this function is single_value. Does it make sense to broadcast single_value so that it is available on all the nodes?
When the driver submits this transformation to the workers, it simply passes the value itself, not the parameter, so from a performance perspective it may even be better. There is no value in broadcasting a single value that is assigned directly without any logic. You are better off simply passing the variable and letting closure serialization turn it into the value itself. Hope this answers your question.
