flatMap() function returns RDD[Char] instead of RDD[String] - apache-spark

I am trying to understand how map and flatMap work, but I got stuck on the piece of code below. The flatMap() function returns an RDD[Char], but I was expecting an RDD[String] instead.
Can someone explain why it yields an RDD[Char]?
scala> val inputRDD = sc.parallelize(Array(Array("This is Spark"), Array("It is a processing language"),Array("Very fast"),Array("Memory operations")))
scala> val mapRDD = inputRDD.map(x => x(0))
mapRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[28] at map at <console>:26
scala> mapRDD.collect
res27: Array[String] = Array(This is Spark, It is a processing language, Very fast, Memory operations)
scala> val mapRDD = inputRDD.flatMap(x => x(0))
mapRDD: org.apache.spark.rdd.RDD[Char] = MapPartitionsRDD[29] at flatMap at <console>:26
scala> mapRDD.collect
res28: Array[Char] = Array(T, h, i, s, , i, s, , S, p, a, r, k, I, t, , i, s, , a, , p, r, o, c, e, s, s, i, n, g, , l, a, n, g, u, a, g, e, V, e, r, y, , f, a, s, t, M, e, m, o, r, y, , o, p, e, r, a, t, i, o, n, s)

Take a look at this answer: https://stackoverflow.com/a/22510434/1547734
Basically, flatMap transforms an RDD of N elements into (logically) an RDD of N collections, and then flattens it into an RDD of all elements of the inner collections.
So when you do inputRDD.flatMap(x => x(0)), you convert each element into a String. A String is a collection of characters, so the "flattening" step turns the entire RDD into an RDD of the resulting characters.
Since RDDs follow the Scala collections API, the following post might help you understand it further: http://www.brunton-spall.co.uk/post/2011/12/02/map-map-and-flatmap-in-scala/
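As a minimal illustration in plain Scala (the same semantics carry over to RDDs), map keeps each String whole while flatMap flattens each String into its Chars:
val words = Seq("ab", "cd")
words.map(s => s)     // Seq("ab", "cd"): still two Strings
words.flatMap(s => s) // Seq('a', 'b', 'c', 'd'): each String is flattened into its Chars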

The goal of flatMap is to convert a single item into multiple items (i.e. a one-to-many relationship). For example, for an RDD[Order], where each order is likely to have multiple items, I can use flatMap to get an RDD[Item] (rather than an RDD[Seq[Item]]).
In your case, a String is effectively a Seq[Char]. flatMap therefore assumes that what you want to do is take that one string and break it up into its constituent characters.
Now, if what you want is to use flatMap to get all of the raw Strings in your RDD, your flatMap function should probably look like this: x => x.
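As a minimal sketch of that fix, assuming the same inputRDD as above: returning the inner Array itself lets flatMap flatten the Array[String] into Strings, rather than a String into Chars.
val flatRDD = inputRDD.flatMap(x => x) // RDD[String]
// flatRDD.collect
// => Array(This is Spark, It is a processing language, Very fast, Memory operations)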

Related

Can anyone implement combineByKey() instead of groupByKey() in Spark in order to group elements?

I am trying to group the elements of an RDD I have created. One simple but expensive way is to use groupByKey(), but I recently learned that combineByKey() can do the job more efficiently. My RDD is very simple. It looks like this:
(1,5)
(1,8)
(1,40)
(2,9)
(2,20)
(2,6)
val grouped_elements = first_RDD.groupByKey().mapValues(x => x.toList)
the result is:
(1,List(5,8,40))
(2,List(9,20,6))
I want to group them based on the first element (the key).
Can anyone help me do this with the combineByKey() function? I am really confused by combineByKey().
To begin with, take a look at the combineByKey API docs:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
So it accepts three functions, which I have defined below:
scala> val createCombiner = (v:Int) => List(v)
createCombiner: Int => List[Int] = <function1>
scala> val mergeValue = (a:List[Int], b:Int) => a.::(b)
mergeValue: (List[Int], Int) => List[Int] = <function2>
scala> val mergeCombiners = (a:List[Int],b:List[Int]) => a.++(b)
mergeCombiners: (List[Int], List[Int]) => List[Int] = <function2>
Once you define these, you can use them in your combineByKey call as below:
scala> val list = List((1,5),(1,8),(1,40),(2,9),(2,20),(2,6))
list: List[(Int, Int)] = List((1,5), (1,8), (1,40), (2,9), (2,20), (2,6))
scala> val temp = sc.parallelize(list)
temp: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:30
scala> temp.combineByKey(createCombiner,mergeValue, mergeCombiners).collect
res27: Array[(Int, List[Int])] = Array((1,List(8, 40, 5)), (2,List(20, 9, 6)))
Please note that I tried this out in the Spark shell, which is why you can see the outputs below the executed commands. They should help build your understanding.
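For reference, here is the same call written inline as a single expression (a sketch using the temp RDD from above), with comments marking which argument plays which role. Note that the order of elements inside each list depends on partitioning, since mergeValue prepends within a partition:
temp.combineByKey(
  (v: Int) => List(v),                   // createCombiner: start a List for a key's first value
  (acc: List[Int], v: Int) => v :: acc,  // mergeValue: prepend a value within a partition
  (a: List[Int], b: List[Int]) => a ++ b // mergeCombiners: concatenate lists across partitions
).collect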

Search for a sequence of items in a list

Is there an easy way to search for a sequence of strings in a list? For example:
testlist = [a,b,c,d,e,f,g,a,b,c,d,j,k,j]
and I want to search for the sequence abc and get the index returned. To clarify: the string I want to search for consists of more than one element of the list. For some context: I have a list of data blocks and I want to find out how big each block is, hence the search for a recurring string in the list.
There are many good string search algorithms: KMP, Boyer-Moore, Rabin-Karp. If you are dealing with characters, you can use the built-in str.index function on ''.join(L) (in CPython, str.index uses a fast Boyer-Moore-style search: https://github.com/python/cpython/blob/3.7/Objects/stringlib/fastsearch.h).
But in most cases, the naive algorithm is good enough. Check every index of the haystack to find the needle:
>>> a, b, c, d, e, f, g, j, k = [object() for _ in range(9)]
>>> haystack = [a, b, c, d, e, f, g, a, b, c, d, j, k, j]
>>> needle = [a, b, c]
>>> for i in range(len(haystack)-len(needle)+1):
...     if haystack[i:i+len(needle)] == needle:
...         print(i)
...
0
7
The complexity is O(|haystack|*|needle|).

Scala count chars in a string logical error

Here is the code:
val a = "abcabca"
a.groupBy((c: Char) => a.count( (d:Char) => d == c))
Here is the result I want:
scala.collection.immutable.Map[Int,String] = Map(2 -> b, 2 -> c, 3 -> a)
but the result I get is:
scala.collection.immutable.Map[Int,String] = Map(2 -> bcbc, 3 -> aaa)
Why?
Thank you.
Write an expression like
"abcabca".groupBy(identity).collect{
case (k,v) => (k,v.length)
}
which will give output as
res0: scala.collection.immutable.Map[Char,Int] = Map(b -> 2, a -> 3, c -> 2)
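Note that collect here is the Scala collections method (a map and a filter combined via a partial function), not Spark's RDD action of the same name. An equivalent spelling uses mapValues:
"abcabca".groupBy(identity).mapValues(_.length).toMap
// Map(b -> 2, a -> 3, c -> 2)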
Let's dissect your initial attempt:
a.groupBy((c: Char) => a.count( (d:Char) => d == c))
So, what are you grouping by? The result of a.count(...), so the key of your Map will be an Int. For the char a we get 3; for the chars b and c we get 2.
Now, the original String is traversed and the results are accumulated, char by char.
So after traversing the first "ab", the current state is "3 -> a, 2 -> b". (Note that .count() is called for each char in the string, which is a wasteful O(n²) algorithm, but anyway.)
The string is progressively traversed, and at the end the accumulated result is shown. As it turns out, the three a's have been sent under the 3 key, and the b's and c's have been sent to the 2 key, in the order the string was traversed, which is left to right.
Now, a usual groupBy on a list returns something like Map[T, List[T]], so you may have expected a List[Char] somewhere. It doesn't happen (because the Repr for String is String), and your list of chars is effectively recombobulated into a String, and is given to you as such.
Hence your final result!
Your question header reads "Scala count chars in a string logical error". But you are using a Map with the counts as keys, and duplicate keys are not allowed in a Map: entries with equal keys get collapsed, keeping just one. What you may want instead is a Seq of (count, char) tuples, i.e. a List[(Int, Char)]. Try this:
val x = "abcabca"
x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}
In the Scala REPL:
scala> x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}
res13: List[(Int, Char)] = List((2,b), (3,a), (2,c))
The above gives the counts and their respective chars as a list of tuples. So this may be what you really wanted.
If you try converting this to a Map:
scala> x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}.toMap
res14: scala.collection.immutable.Map[Int,Char] = Map(2 -> c, 3 -> a)
So this is obviously not what you want.
Even more concisely, use:
x.distinct.map(v=>(x.filter(_==v).size,v))
scala> x.distinct.map(v=>(x.filter(_==v).size,v))
res19: scala.collection.immutable.IndexedSeq[(Int, Char)] = Vector((3,a), (2,b), (2,c))
The problem with your approach is that you are mapping counts to characters. That is:
In the case of
val str = "abcabca"
while traversing the string str, a has count 3, b has count 2 and c has count 2. While creating the map (with the use of groupBy), all the characters that produce the same key are put into the same value, that is:
Map(3 -> aaa, 2 -> bcbc)
That's the reason you are getting this output from your program.
As you can see in the definition of the groupBy function:
def groupBy[K](f: (A) ⇒ K): immutable.Map[K, Repr]
Partitions this traversable collection into a map of traversable collections according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new traversable collection.
K: the type of keys returned by the discriminator function.
f: the discriminator function.
Returns: a map from keys to traversable collections such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a traversable collection of those elements x for which f(x) equals k.
groupBy returns a Map which holds the following invariant:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
which means it returns, for each key, the collection of elements that map to that key.
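A quick REPL check of this invariant, taking xs = "abcabca" and f = identity:
val xs = "abcabca"
(xs groupBy identity)('a') == xs.filter(_ == 'a') // true: both sides are "aaa"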

How to find all pairs of sets and pairs of shared elements in a collection using MapReduce in Spark?

I have a collection of sets, and each set contains many items. I want to retrieve all pairs of sets and elements using Spark, where each pair after the reduce processing contains two sets and two items they share,
for example:
If I have this list of sets
Set A = {1, 2, 3, 4}
Set B = {1, 2, 4, 5}
Set C = {2, 3, 5, 6}
The map process will be:
(A,1)
(A,2)
(A,3)
(A,4)
(B,1)
(B,2)
(B,4)
(B,5)
(C,2)
(C,3)
(C,5)
(C,6)
The target result after reduce is:
(A B, 1 2) // since 1 2 exist in both A and B
(A B, 1 4)
(A B, 2 4)
(A C,2 3)
(B C,2 5)
Here (A B, 1 3) is not in the result because 3 does not exist in B.
Could you help me solve this problem in Spark with one map and one reduce function, in any language (Python, Scala, or Java)?
Let's break this problem into multiple parts. I consider the transformation from the input lists to the map output trivial, so let us start from there:
you have an RDD of (String, Int) pairs looking like
("A", 1)
("A", 2)
....
Let's first forget that you need 2 integer elements in the result set, and solve for getting the intersection set between any 2 keys from the mapped output.
The result from your input would look like:
(AB, Set(1,2,4))
(BC, Set(2,5))
(AC, Set(2,3))
To do this, first extract all keys from your mapped output (mappedOutput, an RDD of (String, Int)), convert them to a set, and get all combinations of 2 elements (I am using a naive method here; a good way to do this at scale would be to use a combination generator):
val combinations = mappedOutput.map(x => x._1).collect.toSet
.subsets.filter(x => x.size == 2).toList
.map(x => x.mkString(""))
The output would be List(ab, ac, bc); these combination codes will serve as the keys to be joined.
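Since the subset enumeration is flagged as wasteful above, a hedged alternative is the standard library's combinations iterator, which yields the same codes without enumerating every subset:
val combinations = mappedOutput.map(_._1).distinct.collect.toList
  .sorted.combinations(2).map(_.mkString("")).toList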
Next, convert the mapped output into pairs of set key (a, b, c) => set of elements:
val step1 = mappedOutput.groupByKey().map(x => (x._1, x._2.toSet))
Then attach the combination codes as keys to step1:
val step2 = step1.map(x => combinations.filter(y => y.contains(x._1)).map(y => (y, x))).flatMap(x => x)
The output would be (ab, (a, set of elements in a)), (ac, (a, set of elements in a)), etc. Because of the filter, we will not attach the combination code bc to set a.
Now obtain the result we want using a reduce:
val result = step2.reduceByKey((a, b) => ("", a.intersect(b))).map(x => (x._1, x._2._2))
So we now have the output I said we wanted at the start. What is left is to transform this result into what you need, which is very simple to do:
val transformed = result.map(x => x._2.subsets.filter(x => x.size == 2).map(y => (x._1, y.mkString(" ")))).flatMap(x => x)
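For convenience, here is a sketch assembling all of the steps into one runnable whole; the parallelized Seq below stands in for the question's mapped output, and the variable names match the steps above:
val mappedOutput = sc.parallelize(Seq(
  ("a", 1), ("a", 2), ("a", 3), ("a", 4),
  ("b", 1), ("b", 2), ("b", 4), ("b", 5),
  ("c", 2), ("c", 3), ("c", 5), ("c", 6)))
val combinations = mappedOutput.map(_._1).collect.toSet
  .subsets.filter(_.size == 2).toList.map(_.mkString(""))
val step1 = mappedOutput.groupByKey().map { case (k, vs) => (k, vs.toSet) }
val step2 = step1.flatMap { case (k, s) =>
  combinations.filter(_.contains(k)).map(code => (code, (k, s)))
}
val result = step2
  .reduceByKey((a, b) => ("", a._2.intersect(b._2)))
  .map { case (code, (_, s)) => (code, s) }
val transformed = result.flatMap { case (code, s) =>
  s.subsets.filter(_.size == 2).map(pair => (code, pair.mkString(" ")))
}
// transformed.collect contains pairs such as (ab, "1 2"), (ab, "1 4"), (ab, "2 4"), ...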
end :)

How to replace contents of RDD with another while preserving order?

I have two RDDs: one is (a, b, a, c, b, c, a) and the other is a paired RDD ((a, 0), (b, 1), (c, 2)).
I want to replace the as, bs and cs in the first RDD with 0, 1, 2 (which are the values of the keys a, b, c in the second RDD), respectively. I'd like to preserve the order of the events in the first RDD.
How can I achieve this in Spark?
For example like this:
val rdd1 = sc.parallelize(Seq("a", "b", "a", "c", "b", "c", "a"))
val rdd2 = sc.parallelize(Seq(("a", 0), ("b", 1), ("c", 2)))
rdd1
.map((_, 1)) // Map first to PairwiseRDD with dummy values
.join(rdd2)
.map { case (_, (_, x)) => x } // Drop keys and dummy values
If the mapping RDD is small, it can be faster to broadcast and map:
val bd = sc.broadcast(rdd2.collectAsMap)
// This assumes all values are present. If not use get / getOrElse
// or map withDefault
rdd1.map(bd.value)
It will also preserve the order of the elements.
In the case of join, you can add increasing identifiers (zipWithIndex / zipWithUniqueId) to be able to restore the initial ordering, but it is substantially more expensive.
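A sketch of that zipWithIndex variant, assuming the rdd1 and rdd2 defined above:
rdd1
  .zipWithIndex                       // (value, original position)
  .join(rdd2)                         // keyed by value: (value, (position, mapped value))
  .map { case (_, (i, x)) => (i, x) }
  .sortByKey()                        // restore the original order by position
  .values                             // drop the positions again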
You can do this by using join.
First, to simulate your RDDs:
val rdd = sc.parallelize(List("a","b","a","c","b","c","a"))
val mapping = sc.parallelize(List(("a",0),("b",1),("c",2)))
You can only join pair RDDs, so map the original rdd to a pair RDD and then join with mapping:
rdd.map(s => (s, None)).join(mapping).map{case(_, (_, intValue)) => intValue}
