Why is the count function not working with mapValues in Spark? - apache-spark

I am doing some basic hands-on in Spark using Scala.
I would like to know why the count function is not working with mapValues and the map function.
When I apply sum, min or max it works. Also, is there any place where I can look up all the functions that can be applied on the Iterable[String] coming from groupbykeyRDD?
My code:
scala> val records = List( "CHN|2", "CHN|3" , "BNG|2","BNG|65")
records: List[String] = List(CHN|2, CHN|3, BNG|2, BNG|65)
scala> val recordsRDD = sc.parallelize(records)
recordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[119] at parallelize at <console>:23
scala> val mapRDD = recordsRDD.map(elem => elem.split("\\|"))
mapRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[120] at map at <console>:25
scala> val keyvalueRDD = mapRDD.map(elem => (elem(0),elem(1)))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at map at <console>:27
scala> val groupbykeyRDD = keyvalueRDD.groupByKey()
groupbykeyRDD: org.apache.spark.rdd.RDD[(String, Iterable[String])] = ShuffledRDD[122] at groupByKey at <console>:29
scala> groupbykeyRDD.mapValues(elem => elem.count).collect
<console>:32: error: missing arguments for method count in trait TraversableOnce;
follow this method with `_' if you want to treat it as a partially applied function
groupbykeyRDD.mapValues(elem => elem.count).collect
^
scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count)).collect
<console>:32: error: missing arguments for method count in trait TraversableOnce;
follow this method with `_' if you want to treat it as a partially applied function
groupbykeyRDD.map(elem => (elem._1 ,elem._2.count)).collect
Expected output :
Array((CHN,2) ,(BNG,2))

The error you are getting has nothing to do with Spark; it's a pure Scala compilation error.
You can try it in a plain Scala console (no Spark at all):
scala> val iterableTest: Iterable[String] = Iterable("test")
iterableTest: Iterable[String] = List(test)
scala> iterableTest.count
<console>:29: error: missing argument list for method count in trait TraversableOnce
This is because Iterable does not define a zero-argument count method. It does define a count method, but one that takes a predicate function as an argument, which is why you get this specific error about partially applied functions.
It does have a size method, though, which you could swap into your sample to make it work.
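For example, swapping size in (a minimal sketch against the groupbykeyRDD from the question):
// size returns the number of elements in each Iterable, which is what the question is after
groupbykeyRDD.mapValues(elem => elem.size).collect
// expected: Array((CHN,2), (BNG,2))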

The elem you are getting is of type Iterable[String], which does not have a parameterless count method, so try the size (or length) method. If that does not work,
you can convert the Iterable[String] to a List and use its length method.
The count method (with no arguments) is available on RDDs.
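For instance, along the lines of this answer (a sketch, assuming the groupbykeyRDD from the question):
// Convert each Iterable[String] to a List and take its length
groupbykeyRDD.mapValues(elem => elem.toList.length).collect
// expected: Array((CHN,2), (BNG,2))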

count - counts the occurrences of values that satisfy the predicate (Boolean condition) passed as its parameter.
count with your code: here it counts the number of occurrences of "2" and "3":
scala> groupbykeyRDD.collect().foreach(println)
(CHN,CompactBuffer(2, 3))
(BNG,CompactBuffer(2, 65))
scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count(_ == "2"))).collect
res14: Array[(String, Int)] = Array((CHN,1), (BNG,1))
scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count(_ == "3"))).collect
res15: Array[(String, Int)] = Array((CHN,1), (BNG,0))
count with a small fix to your code: if you tweak your code this way, then count should give you the expected results:
val keyvalueRDD = mapRDD.map(elem => (elem(0),1))
Test:
scala> val groupbykeyRDD = mapRDD.map(elem => (elem(0),1)).groupByKey()
groupbykeyRDD: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[9] at groupByKey at <console>:18
scala> groupbykeyRDD.collect().foreach(println)
(CHN,CompactBuffer(1, 1))
(BNG,CompactBuffer(1, 1))
scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count(_ == 1))).collect
res18: Array[(String, Int)] = Array((CHN,2), (BNG,2))

Related

Spark Scala convert RDD with Case Class to simple RDD

This is fine:
case class trans(atm : String, num: Int)
val array = Array((20254552,"ATM",-5100), (20174649,"ATM",5120))
val rdd = sc.parallelize(array)
val rdd1 = rdd.map(x => (x._1, trans(x._2, x._3)))
How to convert back to a simple RDD like rdd again?
E.g. rdd: org.apache.spark.rdd.RDD[(Int, String, Int)]
I can do this, for sure:
val rdd2 = rdd1.mapValues(v => (v.atm, v.num)).map(x => (x._1, x._2._1, x._2._2))
but what if the case class is a big record with many fields? E.g. if I want to do this dynamically.
Not sure exactly how generic you want to go, but in your example of an RDD[(Int, trans)] you can make use of the unapply method of the trans companion object in order to flatten your case class to a tuple.
So, if you have your setup:
case class trans(atm : String, num: Int)
val array = Array((20254552,"ATM",-5100), (20174649,"ATM",5120))
val rdd = sc.parallelize(array)
val rdd1 = rdd.map(x => (x._1, trans(x._2, x._3)))
You can do the following:
import shapeless.syntax.std.tuple._
val output = rdd1.map {
  case (myInt, myTrans) => {
    myInt +: trans.unapply(myTrans).get
  }
}
output
res15: org.apache.spark.rdd.RDD[(Int, String, Int)]
We're importing shapeless.syntax.std.tuple._ in order to be able to make a tuple from our Int + flattened tuple (the myInt +: trans.unapply(myTrans).get operation).
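If you would rather not pull in shapeless for a fixed-arity case like this one, a plain pattern match on the case class gives the same flattening (a sketch, not the approach above):
// No shapeless: destructure the case class explicitly
val output2: org.apache.spark.rdd.RDD[(Int, String, Int)] = rdd1.map {
  case (myInt, trans(atm, num)) => (myInt, atm, num)
}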
Case class method "productIterator" can help convert to array:
case class trans(atm : String, num: Int)
val value = trans("ATM", 5120)
val rdd = spark.sparkContext.parallelize(Seq(value))
rdd
.map(_.productIterator.toArray)
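Note that productIterator yields Any values, so the result above is an RDD[Array[Any]] rather than a typed tuple; if you need the static types back you would have to cast, for example (a sketch, assuming the field order of trans):
// Recover the field types by casting each element of the Array[Any]
val typed: org.apache.spark.rdd.RDD[(String, Int)] =
  rdd.map(_.productIterator.toArray).map(a => (a(0).asInstanceOf[String], a(1).asInstanceOf[Int]))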

Can anyone implement combineByKey() instead of groupByKey() in Spark in order to group elements?

I am trying to group the elements of an RDD that I have created. One simple but expensive way is to use groupByKey(), but recently I learned that combineByKey() can do this work more efficiently. My RDD is very simple. It looks like this:
(1,5)
(1,8)
(1,40)
(2,9)
(2,20)
(2,6)
val grouped_elements = first_RDD.groupByKey().mapValues(x => x.toList)
the result is:
(1,List(5,8,40))
(2,List(9,20,6))
I want to group them based on the first element (the key).
Can anyone help me do this with the combineByKey() function? I am really confused by combineByKey().
To begin with, take a look at the API (refer to the docs):
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
So it accepts three functions which I have defined below
scala> val createCombiner = (v:Int) => List(v)
createCombiner: Int => List[Int] = <function1>
scala> val mergeValue = (a:List[Int], b:Int) => a.::(b)
mergeValue: (List[Int], Int) => List[Int] = <function2>
scala> val mergeCombiners = (a:List[Int],b:List[Int]) => a.++(b)
mergeCombiners: (List[Int], List[Int]) => List[Int] = <function2>
Once you define these then you can use it in your combineByKey call as below
scala> val list = List((1,5),(1,8),(1,40),(2,9),(2,20),(2,6))
list: List[(Int, Int)] = List((1,5), (1,8), (1,40), (2,9), (2,20), (2,6))
scala> val temp = sc.parallelize(list)
temp: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:30
scala> temp.combineByKey(createCombiner,mergeValue, mergeCombiners).collect
res27: Array[(Int, List[Int])] = Array((1,List(8, 40, 5)), (2,List(20, 9, 6)))
Please note that I tried this out in the Spark shell, hence you can see the outputs below the commands executed. They should help build your understanding.
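One extra point about this particular choice of functions: mergeValue prepends with :: and mergeCombiners concatenates per-partition lists, so the element order inside each list (as in res27 above) depends on partitioning rather than on the input order. If a stable order matters, one way to handle it afterwards (a sketch):
// Sort each grouped list after combining, since combineByKey does not preserve input order here
val grouped = temp.combineByKey(createCombiner, mergeValue, mergeCombiners).mapValues(_.sorted)
// grouped.collect() should give Array((1,List(5, 8, 40)), (2,List(6, 9, 20))), up to key order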

How to get the specified output without combineByKey and aggregateByKey in Spark RDD

Below is my data:
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
Now I want the below types of output, but without using combineByKey and aggregateByKey:
1) Array[(String, Int)] = Array((foo,5), (bar,3))
2) Array((foo,Set(B, A)),
(bar,Set(C, D)))
Below is my attempt:
scala> val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C",
| "bar=D", "bar=D")
scala> val sample=keysWithValuesList.map(_.split("=")).map(p=>(p(0),(p(1))))
sample: Array[(String, String)] = Array((foo,A), (foo,A), (foo,A), (foo,A), (foo,B), (bar,C), (bar,D), (bar,D))
Now when I type the variable name followed by tab to see the applicable methods for the mapped RDD, I can see the below options, none of which satisfies my requirement:
scala> sample.
apply asInstanceOf clone isInstanceOf length toString update
So how can I achieve this?
Here is a standard approach.
Point to note: you need to be working with an RDD (your sample is a plain Scala Array); I think that is what is tripping you up.
Here you go:
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C","bar=D", "bar=D")
val sample=keysWithValuesList.map(_.split("=")).map(p=>(p(0),(p(1))))
val sample2 = sc.parallelize(sample.map(x => (x._1, 1)))
val sample3 = sample2.reduceByKey(_+_)
sample3.collect()
val sample4 = sc.parallelize(sample.map(x => (x._1, x._2))).groupByKey()
sample4.collect()
val sample5 = sample4.map(x => (x._1, x._2.toSet))
sample5.collect()
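If you also want to avoid groupByKey for the second output, one alternative (a sketch, not part of the answer above) is to reduce singleton sets per key:
// Distinct values per key without groupByKey: wrap each value in a Set and union them
val sample6 = sc.parallelize(sample).mapValues(v => Set(v)).reduceByKey(_ ++ _)
sample6.collect()  // expected: Array((foo,Set(A, B)), (bar,Set(C, D))), up to ordering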

How to execute a Column expression in Spark without a DataFrame

Is there any way that I can evaluate my Column expression if I am only using literals (no DataFrame columns)?
For example, something like:
val result: Int = someFunction(lit(3) * lit(5))
//result: Int = 15
or
import org.apache.spark.sql.functions.sha1
val result: String = someFunction(sha1(lit("5")))
//result: String = ac3478d69a3c81fa62e60f5c3696165a4e5e6ac4
I am able to evaluate it using a DataFrame:
val result = Seq(1).toDF.select(sha1(lit("5"))).as[String].first
//result: String = ac3478d69a3c81fa62e60f5c3696165a4e5e6ac4
But is there any way to get the same result without using a DataFrame?
To evaluate a literal column you can convert it to an Expression and eval it without providing an input row:
scala> sha1(lit("1").cast("binary")).expr.eval()
res1: Any = 356a192b7913b04c54574d18c28d46e6395428ab
As long as the function is a UserDefinedFunction it will work the same way:
scala> val f = udf((x: Int) => x)
f: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(IntegerType)))
scala> f(lit(3) * lit(5)).expr.eval()
res3: Any = 15
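Since eval() returns Any, getting the typed result from the question still needs a cast (a sketch, assuming the expression only involves literals as above):
// Cast the evaluated result back to the expected Scala type
val result: Int = f(lit(3) * lit(5)).expr.eval().asInstanceOf[Int]
// result: Int = 15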
The following code can help:
val isUuid = udf((uuid: String) => uuid.matches("[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}"))
df.withColumn("myCol_is_uuid",isUuid(col("myCol")))
.filter("myCol_is_uuid = true")
.show(10, false)

RDD of Tuple and RDD of Row differences

I have two different RDDs; I apply a foreach on both of them and notice a difference that I cannot resolve.
First one:
val data = Array(("CORN",6), ("WHEAT",3),("CORN",4),("SOYA",4),("CORN",1),("PALM",2),("BEANS",9),("MAIZE",8),("WHEAT",2),("PALM",10))
val rdd = sc.parallelize(data,3) // NOT sorted
rdd.foreach{ x => {
println (x)
}}
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[103] at parallelize at command-325897530726166:8
Works fine in this sense.
Second one:
rddX.foreach{ x => {
val prod = x(0)
val vol = x(1)
val prt = counter
val cnt = counter * 100
println(prt,cnt,prod,vol)
}}
rddX: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[128] at rdd at command-686855653277634:51
Works fine.
Question: why can I not do val prod = x(0) in the first example, as in the second case? And how could I do that with the foreach? Or would we always need to use map for the first case? Is this due to the Row internals in the second example?
As you can see, the difference is in the datatypes.
The first one is an RDD[(String, Int)].
This is an RDD of Tuple2 containing (String, Int), so you access the first value (a String) as val prod = x._1 and the second (an Int) as x._2.
Since it is a tuple, you can't access it as val prod = x(0).
The second one is an RDD[org.apache.spark.sql.Row], which can be accessed as
val prod = x.getString(0) or val prod = x(0).
I hope this helped!
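A minimal sketch of the two access styles side by side (reusing rdd and rddX from the question; the names inside the closures are just illustrative):
// Tuple RDD: access fields with _1 / _2, or pattern match on the pair
rdd.foreach { case (prod, vol) => println(prod, vol) }
// Row RDD: access fields by index (x(0) returns Any) or with typed getters
rddX.foreach { x => println(x.getString(0), x.get(1)) }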
