Creating Pair RDDs - apache-spark

From the RDD below, I would like to create a pair RDD.
val line = sc.parallelize(Array("2,SMITH,AARON"))
I used the below code:
val pair = line.map(x => (x.split(",")(0).toInt, x))
The output generated is Array[(Int, String)] = Array((2,2,SMITH,AARON))
But I would like the output to be Array[(Int, String)] = Array((2,SMITH,AARON))
Please help me out.
I am a newbie.

Just take the rest:
val pair = line.map(x => x.split(",") match {
  // bind the first field to `key` and all remaining fields to `rest`
  case Array(key, rest @ _*) => (key.toInt, rest.mkString(","))
})

An easy way to do this is to split the line, take the first field as the key, and join the remaining fields back together:
line.map(r => {
val split = r.split(",")
(split(0).toInt, (split.tail.mkString(",")))
})
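If a record can contain more than two fields, another option is to cap the number of splits so the remainder of the record never has to be re-joined. The sketch below is plain Scala (no Spark needed); `toPair` is a hypothetical helper name, and the same function could be passed to `line.map`.

```scala
// Hypothetical helper: split only at the first comma, so the rest of the
// record is kept intact as the value.
def toPair(record: String): (Int, String) = {
  val parts = record.split(",", 2) // limit = 2 -> at most two pieces
  val rest  = if (parts.length > 1) parts(1) else "" // record had no comma
  (parts(0).trim.toInt, rest)
}

println(toPair("2,SMITH,AARON")) // prints (2,SMITH,AARON)
```

Note that this still throws a NumberFormatException if the first field is not a number, just like the versions above.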

Related

Can anyone implement combineByKey() instead of groupByKey() in Spark in order to group elements?

I am trying to group the elements of an RDD that I have created. One simple but expensive way is to use groupByKey(), but I recently learned that combineByKey() can do this work more efficiently. My RDD is very simple. It looks like this:
(1,5)
(1,8)
(1,40)
(2,9)
(2,20)
(2,6)
val grouped_elements = first_RDD.groupByKey().mapValues(x => x.toList)
The result is:
(1,List(5,8,40))
(2,List(9,20,6))
I want to group the elements by the first element (the key).
Can anyone help me do this with combineByKey()? I am really confused by it.
To begin with, take a look at the API docs:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
So it accepts three functions, which I have defined below:
scala> val createCombiner = (v:Int) => List(v)
createCombiner: Int => List[Int] = <function1>
scala> val mergeValue = (a:List[Int], b:Int) => a.::(b)
mergeValue: (List[Int], Int) => List[Int] = <function2>
scala> val mergeCombiners = (a:List[Int],b:List[Int]) => a.++(b)
mergeCombiners: (List[Int], List[Int]) => List[Int] = <function2>
Once these are defined, you can pass them to combineByKey as below:
scala> val list = List((1,5),(1,8),(1,40),(2,9),(2,20),(2,6))
list: List[(Int, Int)] = List((1,5), (1,8), (1,40), (2,9), (2,20), (2,6))
scala> val temp = sc.parallelize(list)
temp: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:30
scala> temp.combineByKey(createCombiner,mergeValue, mergeCombiners).collect
res27: Array[(Int, List[Int])] = Array((1,List(8, 40, 5)), (2,List(20, 9, 6)))
Please note that I tried this out in the Spark shell, which is why you can see the output below each command. The outputs should help you build your understanding.
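To see when each of the three functions is called, here is a rough plain-Scala model of the idea. It is only a sketch, not Spark's actual implementation: `combineByKeyLocal` is a hypothetical name, and the nested `Seq` stands in for the RDD's partitions.

```scala
// Plain-Scala model of combineByKey; `partitions` stands in for an RDD's partitions.
def combineByKeyLocal[K, V, C](
    partitions: Seq[Seq[(K, V)]],
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): Map[K, C] = {
  // Step 1: within each partition, build one combiner per key.
  val perPartition: Seq[Map[K, C]] = partitions.map { part =>
    part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case Some(c) => acc.updated(k, mergeValue(c, v))  // key already seen in this partition
        case None    => acc.updated(k, createCombiner(v)) // first value for this key
      }
    }
  }
  // Step 2: merge the per-partition combiners of each key.
  perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
    m.foldLeft(acc) { case (a, (k, c)) =>
      a.updated(k, a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
    }
  }
}

// Assumed layout: the six pairs from the question spread over two partitions.
val partitions = Seq(Seq((1, 5), (1, 8), (2, 9)), Seq((1, 40), (2, 20), (2, 6)))
val grouped = combineByKeyLocal[Int, Int, List[Int]](
  partitions,
  v => List(v),         // createCombiner
  (acc, v) => v :: acc, // mergeValue
  (a, b) => a ++ b)     // mergeCombiners
```

The order of the values inside each list depends on how the data is partitioned, which is why the Spark shell output above is not in insertion order either.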

RDD of Tuple and RDD of Row differences

I have two different RDDs and apply a foreach on both of them and note a difference that I cannot resolve.
First one:
val data = Array(("CORN",6), ("WHEAT",3),("CORN",4),("SOYA",4),("CORN",1),("PALM",2),("BEANS",9),("MAIZE",8),("WHEAT",2),("PALM",10))
val rdd = sc.parallelize(data,3) // NOT sorted
rdd.foreach{ x => {
println (x)
}}
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[103] at parallelize at command-325897530726166:8
Works fine in this sense.
Second one:
rddX.foreach{ x => {
val prod = x(0)
val vol = x(1)
val prt = counter
val cnt = counter * 100
println(prt,cnt,prod,vol)
}}
rddX: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[128] at rdd at command-686855653277634:51
Works fine.
Question: why can't I use val prod = x(0) in the first example, as I do in the second? How could I do that inside the foreach? Or would we always need to use map in the first case? Is it due to Row internals in the second example?
The difference is in the data types.
The first one is RDD[(String, Int)].
This is an RDD of Tuple2 containing (String, Int), so you access the first value (a String) as val prod = x._1 and the second value (an Int) as x._2.
Since it is a tuple, you can't access it as val prod = x(0).
The second one is RDD[org.apache.spark.sql.Row], whose fields can be accessed as
val prod = x.getString(0) or val prod = x(0) (the latter returns Any)
I hope this helped!
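As a small illustration of the tuple side, here are three ways to get at the fields. This is plain Scala on a local Seq with made-up variable names; the same lambdas should work unchanged inside rdd.foreach or rdd.map.

```scala
val data = Seq(("CORN", 6), ("WHEAT", 3))

// Option 1: positional accessors on Tuple2.
val viaAccessors = data.map(x => s"${x._1}:${x._2}")

// Option 2: pattern matching names the fields directly.
val viaPattern = data.map { case (prod, vol) => s"$prod:$vol" }

// Option 3: productElement gives index-based access, but the result is typed Any.
val firstField: Any = data.head.productElement(0)
```

With the RDD from the question, option 2 would read rdd.foreach { case (prod, vol) => println((prod, vol)) }.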

How to do non-random Dataset splitting on Apache Spark?

I know I can do random splitting with the randomSplit method:
val splittedData: Array[Dataset[Row]] =
preparedData.randomSplit(Array(0.5, 0.3, 0.2))
Can I split the data into consecutive parts with some 'nonRandomSplit method'?
Apache Spark 2.0.1.
Thanks in advance.
UPD: data order is important, I'm going to train my model on data with 'smaller IDs' and test it on data with 'larger IDs'. So I want to split data into consecutive parts without shuffling.
e.g.
my dataset = (0,1,2,3,4,5,6,7,8,9)
desired splitting = (0.8, 0.2)
splitting = (0,1,2,3,4,5,6,7), (8,9)
The only solution I can think of is to use count and limit, but there probably is a better one.
This is the solution I've implemented: Dataset -> Rdd -> Dataset.
I'm not sure whether it is the most effective way to do it, so I'll be glad to accept a better solution.
val count = allData.count()
val trainRatio = 0.6
val trainSize = math.round(count * trainRatio).toInt
val dataSchema = allData.schema
// Zip with indices and keep only rows with index < trainSize.
// Could have possibly used .limit(n) here
val trainingRdd =
allData
.rdd
.zipWithIndex()
.filter { case (_, index) => index < trainSize }
.map { case (row, _) => row }
// Can't use .limit() :(
val testRdd =
allData
.rdd
.zipWithIndex()
.filter { case (_, index) => index >= trainSize }
.map { case (row, _) => row }
val training = MySession.createDataFrame(trainingRdd, dataSchema)
val test = MySession.createDataFrame(testRdd, dataSchema)
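The index-based idea can be checked on a local collection first. The following is a minimal sketch in plain Scala; `splitInOrder` is a hypothetical helper, not a Spark API, and it uses the same rounding as the code above.

```scala
// Hypothetical helper: keep the first `ratio` share of rows, in order, as the
// training part and the remainder as the test part.
def splitInOrder[A](rows: Seq[A], ratio: Double): (Seq[A], Seq[A]) = {
  val trainSize = math.round(rows.size * ratio) // Long, same rounding as above
  val (head, tail) = rows.zipWithIndex.partition { case (_, i) => i < trainSize }
  (head.map(_._1), tail.map(_._1)) // drop the indices again
}

val (trainPart, testPart) = splitInOrder(0 to 9, 0.8)
println(trainPart.mkString(",")) // prints 0,1,2,3,4,5,6,7
println(testPart.mkString(","))  // prints 8,9
```

In the Spark version, the zipped RDD could probably be computed once, cached, and reused for both filters instead of calling zipWithIndex twice.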

Document Count of a Word in Spark/Scala

I have a text variable which is an RDD of String in scala
val data = sc.parallelize(List("i am a good boy.Are you a good boy.","You are also working here.","I am posting here today.You are good."))
I have another variable, a Scala Map (given below):
// list of words for which the doc count needs to be found; initial doc count is 1
val dictionary = Map("""good""" -> 1, """working""" -> 1, """posting""" -> 1)
I want to compute the document count of each dictionary term and get the output in key-value format.
My output should be like below for the above data.
(good,2)
(working,1)
(posting,1)
What I have tried is:
dictionary.map { case(k,v) => k -> k.r.findFirstIn(data.map(line => line.trim()).collect().mkString(",")).size}
I am getting counts as 1 for all the words.
Please help me in fixing the above line
Thanks in advance.
Why not use flatMap to build the word counts and then query those?
val dictionary = data.flatMap {case line => line.split(" ")}.map {case word => (word, 1)}.reduceByKey(_+_)
If I collect this in the REPL I get the following result:
res9: Array[(String, Int)] = Array((here,1), (good.,1), (good,2), (here.,1), (You,1), (working,1), (today.You,1), (boy.Are,1), (are,2), (a,2), (posting,1), (i,1), (boy.,1), (also,1), (I,1), (am,2), (you,1))
Obviously you would need to do a better split than in my simple example.
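One possible "better split", shown on a local List so it can be run without Spark: tokenize on non-word characters, lowercase, and keep each word only once per document so the totals are document counts rather than word counts. The names below are made up for the example, and the tokenization rule is an assumption about what counts as a word.

```scala
val docs = List(
  "i am a good boy.Are you a good boy.",
  "You are also working here.",
  "I am posting here today.You are good.")

// Split on runs of non-word characters and lowercase, then keep each word
// once per document so a word is counted at most once per document.
val docCounts: Map[String, Int] = docs
  .flatMap(doc => doc.toLowerCase.split("\\W+").filter(_.nonEmpty).distinct.toList)
  .groupBy(identity)
  .map { case (word, hits) => word -> hits.size }

println(docCounts("good")) // prints 2
```

On an RDD the same per-document step could be followed by .map(word => (word, 1)).reduceByKey(_ + _) instead of groupBy.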
First of all, your dictionary should be a Set, because in general you need to map a set of terms to the number of documents that contain them.
So your data should look like:
scala> val docs = List("i am a good boy.Are you a good boy.","You are also working here.","I am posting here today.You are good.")
docs: List[String] = List(i am a good boy.Are you a good boy., You are also working here., I am posting here today.You are good.)
Your dictionary should look like:
scala> val dictionary = Set("good", "working", "posting")
dictionary: scala.collection.immutable.Set[String] = Set(good, working, posting)
Then you have to implement your transformation. With the simplest logic, based on the contains function, it might look like:
scala> dictionary.map(k => k -> docs.count(_.contains(k))) toMap
res4: scala.collection.immutable.Map[String,Int] = Map(good -> 2, working -> 1, posting -> 1)
For a better solution, I'd recommend implementing a function specific to your requirements, of type
(String, String) => Boolean
that determines whether the term is present in the document:
scala> def foo(doc: String, term: String): Boolean = doc.contains(term)
foo: (doc: String, term: String)Boolean
Then the final solution will look like:
scala> dictionary.map(k => k -> docs.count(d => foo(d, k))) toMap
res3: scala.collection.immutable.Map[String,Int] = Map(good -> 2, working -> 1, posting -> 1)
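As an example of a more specific predicate, the sketch below matches the term only as a whole word and ignores case, so "goodness" does not count as "good". `containsWord` is a hypothetical replacement for foo; whether whole-word, case-insensitive matching is the right rule depends on your requirements.

```scala
import java.util.regex.Pattern

// Hypothetical replacement for `foo`: true only if `term` occurs as a whole word.
def containsWord(doc: String, term: String): Boolean =
  Pattern
    .compile("\\b" + Pattern.quote(term) + "\\b", Pattern.CASE_INSENSITIVE)
    .matcher(doc)
    .find()

println(containsWord("You are good.", "good")) // prints true
println(containsWord("goodness me", "good"))   // prints false
```

It can be dropped into the expression above as docs.count(d => containsWord(d, k)).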
The last thing to do is to compute the result map using the SparkContext. First, decide which data you want to parallelize. Assuming we parallelize the collection of documents, the solution might look like the following:
val docsRDD = sc.parallelize(List(
"i am a good boy.Are you a good boy.",
"You are also working here.",
"I am posting here today.You are good."
))
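// Adds the counts of m2 into m1, key by key; keys found only in m1 keep their count.
def merge(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] =
  m1 ++ m2.map { case (k, v) => k -> (v + m1.getOrElse(k, 0)) }
// One small map per document (term -> 1 if present), then merge all of them.
docsRDD.mapPartitions(_.map(doc => dictionary.collect {
  case term if doc.contains(term) => term -> 1
})).map(_.toMap).reduce { case (m1, m2) => merge(m1, m2) }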

HowTo get a Map from a csv string

I'm fairly new to Scala, but I'm doing my exercises now.
I have a string like "A>Augsburg;B>Berlin". What I want at the end is a map
val mymap = Map("A"->"Augsburg", "B"->"Berlin")
What I did is:
val st = locations.split(";").map(dynamicListExtract _)
with the function
private def dynamicListExtract(input: String) = {
if (input contains ">") {
val split = input split ">"
Some((split(0), split(1))) // return (key, value)
} else {
None
}
}
Now I have an Array[Option[(String, String)]].
How do I elegantly convert this into a Map[String, String]?
Can anybody help?
Thanks
Just change your map call to flatMap:
scala> sPairs.split(";").flatMap(dynamicListExtract _)
res1: Array[(java.lang.String, java.lang.String)] = Array((A,Augsburg), (B,Berlin))
scala> Map(sPairs.split(";").flatMap(dynamicListExtract _): _*)
res2: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map((A,Augsburg), (B,Berlin))
For comparison:
scala> Map("A" -> "Augsburg", "B" -> "Berlin")
res3: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map((A,Augsburg), (B,Berlin))
In Scala 2.8, you can do this:
val locations = "A>Augsburg;B>Berlin"
val result = locations.split(";").map(_ split ">") collect { case Array(k, v) => (k, v) } toMap
collect is like map but also filters values that aren't defined in the partial function. toMap will create a Map from a Traversable as long as it's a Traversable[(K, V)].
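To make the filtering behaviour of collect concrete, here is the same pipeline run on an input with two malformed entries. The input string is made up for illustration.

```scala
// "broken" has no '>' and "C>" has no value, so neither splits into exactly two parts.
val input = "A>Augsburg;B>Berlin;broken;C>"

val parsed: Map[String, String] = input
  .split(";")
  .map(_.split(">"))
  .collect { case Array(k, v) => k -> v } // entries that do not match are dropped
  .toMap

println(parsed) // prints Map(A -> Augsburg, B -> Berlin)
```

With map instead of collect, the same partial function would throw a MatchError on the first malformed entry.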
It's also worth seeing Randall's solution in for-comprehension form, which might be clearer, or at least give you a better idea of what flatMap is doing.
Map.empty ++ (for(possiblePair<-sPairs.split(";"); pair<-dynamicListExtract(possiblePair)) yield pair)
A simple solution (not handling error cases):
val str = "A>Aus;B>Ber"
var map = Map[String,String]()
str.split(";").map(_.split(">")).foreach(a=>map += a(0) -> a(1))
but Ben Lings' is better.
val str= "A>Augsburg;B>Berlin"
Map(str.split(";").map(_ split ">").map(s => (s(0),s(1))):_*)
--or--
str.split(";").map(_ split ">").foldLeft(Map[String,String]())((m,s) => m + (s(0) -> s(1)))

Resources