I have two different RDDs, apply a foreach to both of them, and notice a difference that I cannot resolve.
First one:
val data = Array(("CORN",6), ("WHEAT",3),("CORN",4),("SOYA",4),("CORN",1),("PALM",2),("BEANS",9),("MAIZE",8),("WHEAT",2),("PALM",10))
val rdd = sc.parallelize(data,3) // NOT sorted
rdd.foreach { x =>
  println(x)
}
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[103] at parallelize at command-325897530726166:8
This works fine, in the sense that it prints each tuple.
Second one:
rddX.foreach { x =>
  val prod = x(0)
  val vol = x(1)
  val prt = counter       // counter is defined elsewhere
  val cnt = counter * 100
  println(prt, cnt, prod, vol)
}
rddX: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[128] at rdd at command-686855653277634:51
Works fine.
Question: why can I not do val prod = x(0) on the first example, as in the second case? And how could I do that with foreach? Or would we always need to use map for the first case? Is this due to Row internals in the second example?
As you can see, the difference is in the data types.
The first one is an RDD[(String, Int)].
This is an RDD of Tuple2 containing (String, Int), so you access the first value (the String) as val prod = x._1 and the second (the Int) as x._2.
Since it is a tuple, you can't access it as val prod = x(0).
The second one is an RDD[org.apache.spark.sql.Row], which can be accessed as
val prod = x.getString(0) or val prod = x(0)
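For example, pattern matching gives you named fields on the tuple RDD directly inside the foreach (a minimal sketch against the rdd defined above):
rdd.foreach { case (prod, vol) =>
  // prod is the String, vol is the Int
  println(prod, vol)
}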
I hope this helped!
Related
This is fine:
case class trans(atm : String, num: Int)
val array = Array((20254552,"ATM",-5100), (20174649,"ATM",5120))
val rdd = sc.parallelize(array)
val rdd1 = rdd.map(x => (x._1, trans(x._2, x._3)))
How can I convert rdd1 back to a simple RDD like rdd again, i.e. rdd: org.apache.spark.rdd.RDD[(Int, String, Int)]?
I can do this, for sure:
val rdd2 = rdd1.mapValues(v => (v.atm, v.num)).map(x => (x._1, x._2._1, x._2._2))
but what if the class has many fields, e.g. to be handled dynamically?
Not sure exactly how generic you want to go, but in your example of an RDD[(Int, trans)] you can use the unapply method of the trans companion object to flatten your case class to a tuple.
So, if you have your setup:
case class trans(atm : String, num: Int)
val array = Array((20254552,"ATM",-5100), (20174649,"ATM",5120))
val rdd = sc.parallelize(array)
val rdd1 = rdd.map(x => (x._1, trans(x._2, x._3)))
You can do the following:
import shapeless.syntax.std.tuple._
val output = rdd1.map {
  case (myInt, myTrans) =>
    myInt +: trans.unapply(myTrans).get
}
scala> output
res15: org.apache.spark.rdd.RDD[(Int, String, Int)]
We're importing shapeless.syntax.std.tuple._ in order to be able to make a tuple from our Int + flattened tuple (the myInt +: trans.unapply(myTrans).get operation).
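If you'd rather avoid the shapeless dependency, the same fixed-arity flattening works with a plain case-class pattern match (a sketch against the rdd1 defined above; output2 is just an illustrative name):
val output2: org.apache.spark.rdd.RDD[(Int, String, Int)] = rdd1.map {
  // the trans extractor binds atm and num directly
  case (myInt, trans(atm, num)) => (myInt, atm, num)
}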
The case class method productIterator can help convert a case class to an array:
case class trans(atm : String, num: Int)
val value = trans("ATM", 5120)
val rdd = spark.sparkContext.parallelize(Seq(value))
rdd.map(_.productIterator.toArray)
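Note that productIterator erases the element types, so the result is an RDD[Array[Any]] rather than a typed tuple (a quick check against the rdd above):
val arrays = rdd.map(_.productIterator.toArray)          // RDD[Array[Any]]
arrays.collect().foreach(a => println(a.mkString(",")))  // prints ATM,5120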
I created an empty Seq() using
scala> var x = Seq[DataFrame]()
x: Seq[org.apache.spark.sql.DataFrame] = List()
I have a function called createSamplesForOneDay() that returns a DataFrame, which I would like to add to this Seq x.
val temp = createSamplesForOneDay(some_inputs) // this returns a Spark DF
x = x + temp // this throws an error
I get the error below:
scala> x = x + temp
<console>:59: error: type mismatch;
found : org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: String
x = x + temp
What I am trying to do is build a Seq of DataFrames in a for loop and, at the end, union them all, something like this:
val newDFs = Seq(DF1,DF2,DF3)
newDFs.reduce(_ union _)
as mentioned here - scala - Spark : How to union all dataframe in loop
You cannot append to a Seq using +; you can append like this:
x = x :+ temp
But since the underlying collection is a List, it is cheaper to prepend your elements:
x = temp +: x
Instead of adding elements one by one, you can write it more functionally if you pack your inputs into a sequence too:
val inputs = Seq(....) // create Seq of inputs
val x = inputs.map(i => createSamplesForOneDay(i))
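Combining this with the union from the linked answer, the loop and the manual appends collapse into a single pipeline (a sketch, assuming createSamplesForOneDay and a non-empty inputs sequence as above):
val combined = inputs
  .map(createSamplesForOneDay)  // Seq[DataFrame]
  .reduce(_ union _)            // one DataFrame; note reduce throws on an empty Seq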
Is there any way to evaluate my Column expression if I am only using literals (no DataFrame columns)?
For example, something like:
val result: Int = someFunction(lit(3) * lit(5))
//result: Int = 15
or
import org.apache.spark.sql.functions.sha1
val result: String = someFunction(sha1(lit("5")))
//result: String = ac3478d69a3c81fa62e60f5c3696165a4e5e6ac4
I am able to evaluate it using a DataFrame:
val result = Seq(1).toDF.select(sha1(lit("5"))).as[String].first
//result: String = ac3478d69a3c81fa62e60f5c3696165a4e5e6ac4
But is there any way to get the same result without using a DataFrame?
To evaluate a literal column you can convert it to an Expression and eval it without providing an input row:
scala> sha1(lit("1").cast("binary")).expr.eval()
res1: Any = 356a192b7913b04c54574d18c28d46e6395428ab
As long as the function is a UserDefinedFunction, it will work the same way:
scala> val f = udf((x: Int) => x)
f: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(IntegerType)))
scala> f(lit(3) * lit(5)).expr.eval()
res3: Any = 15
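If you want a typed result rather than Any, a small helper can wrap the eval and the cast (a hypothetical evalLiteral, assuming the column is built only from literals; note that string expressions evaluate to Spark's internal UTF8String, so convert those with toString):
import org.apache.spark.sql.Column

// hypothetical helper: evaluate a literal-only column and cast the result
def evalLiteral[T](c: Column): T = c.expr.eval().asInstanceOf[T]

val n = evalLiteral[Int](lit(3) * lit(5))                   // 15
val s = sha1(lit("5").cast("binary")).expr.eval().toString  // UTF8String -> String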
The following code can help:
val isUuid = udf((uuid: String) => uuid.matches("[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}"))
df.withColumn("myCol_is_uuid", isUuid(col("myCol")))
  .filter("myCol_is_uuid = true")
  .show(10, false)
From the RDD below, I would like to create a pair RDD.
val line = sc.parallelize(Array("2,SMITH,AARON"))
I used the following code:
val pair = line.map(x => (x.split(",")(0).toInt, x))
The output generated is Array[(Int, String)] = Array((2,2,SMITH,AARON)), but the desired output is Array[(Int, String)] = Array((2,SMITH,AARON)).
Please help me out; I am a newbie.
Just take the rest:
val pair = line.map(_.split(",") match {
  case Array(head, tail @ _*) => (head.toInt, tail.mkString(","))
})
An easy way to do this is to split and then take the head and tail of the resulting array:
line.map { r =>
  val split = r.split(",")
  (split(0).toInt, split.tail.mkString(","))
}
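Either version, collected, yields the requested output (a quick check against the single-line RDD above):
pair.collect()
// Array((2,SMITH,AARON))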
I'm fairly new to Scala and currently working through exercises.
I have a string like "A>Augsburg;B>Berlin". What I want at the end is a map:
val mymap = Map("A"->"Augsburg", "B"->"Berlin")
What I did is:
val st = locations.split(";").map(dynamicListExtract _)
with the function
private def dynamicListExtract(input: String) = {
if (input contains ">") {
val split = input split ">"
Some(split(0), split(1)) // return key , value
} else {
None
}
}
Now I have an Array[Option[(String, String)]].
How do I elegantly convert this into a Map[String, String]?
Can anybody help?
Thanks
Just change your map call to flatMap:
scala> sPairs.split(";").flatMap(dynamicListExtract _)
res1: Array[(java.lang.String, java.lang.String)] = Array((A,Augsburg), (B,Berlin))
scala> Map(sPairs.split(";").flatMap(dynamicListExtract _): _*)
res2: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map((A,Augsburg), (B,Berlin))
For comparison:
scala> Map("A" -> "Augsburg", "B" -> "Berlin")
res3: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map((A,Augsburg), (B,Berlin))
In 2.8, you can do this:
val locations = "A>Augsburg;B>Berlin"
val result = locations.split(";").map(_ split ">") collect { case Array(k, v) => (k, v) } toMap
collect is like map, but it also filters out values for which the partial function is not defined. toMap will create a Map from a Traversable as long as it's a Traversable[(K, V)].
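To see that filtering in action, feed it an entry without a > separator (the "oops" fragment is made up for illustration); the one-element array doesn't match Array(k, v), so collect silently drops it:
"A>Augsburg;oops;B>Berlin".split(";").map(_ split ">") collect { case Array(k, v) => (k, v) } toMap
// Map(A -> Augsburg, B -> Berlin)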
It's also worth seeing Randall's solution in for-comprehension form, which might be clearer, or at least give you a better idea of what flatMap is doing.
Map.empty ++ (for (possiblePair <- sPairs.split(";"); pair <- dynamicListExtract(possiblePair)) yield pair)
A simple solution (not handling error cases):
val str = "A>Aus;B>Ber"
var map = Map[String,String]()
str.split(";").map(_.split(">")).foreach(a=>map += a(0) -> a(1))
but Ben Lings' is better.
val str = "A>Augsburg;B>Berlin"
Map(str.split(";").map(_ split ">").map(s => (s(0),s(1))):_*)
or:
str.split(";").map(_ split ">").foldLeft(Map[String,String]())((m,s) => m + (s(0) -> s(1)))