How to execute Column expression in Spark without a DataFrame

Is there any way that I can evaluate my Column expression if I am only using literals (no DataFrame columns)?
For example, something like:
val result: Int = someFunction(lit(3) * lit(5))
//result: Int = 15
or
import org.apache.spark.sql.functions.sha1
val result: String = someFunction(sha1(lit("5")))
//result: String = ac3478d69a3c81fa62e60f5c3696165a4e5e6ac4
I am able to evaluate it using a DataFrame:
val result = Seq(1).toDF.select(sha1(lit("5"))).as[String].first
//result: String = ac3478d69a3c81fa62e60f5c3696165a4e5e6ac4
But is there any way to get the same result without using a DataFrame?

To evaluate a literal column you can convert it to an Expression and eval it without providing an input row:
scala> sha1(lit("1").cast("binary")).expr.eval()
res1: Any = 356a192b7913b04c54574d18c28d46e6395428ab
As long as the function is a UserDefinedFunction it will work the same way:
scala> val f = udf((x: Int) => x)
f: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(IntegerType)))
scala> f(lit(3) * lit(5)).expr.eval()
res3: Any = 15
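Putting both together as a self-contained sketch (note the explicit cast to binary: eval() bypasses the analyzer, which would normally insert that cast for sha1):
import org.apache.spark.sql.functions.{lit, sha1}

// eval() returns Any, so cast the result to the expected type.
val product: Int = (lit(3) * lit(5)).expr.eval().asInstanceOf[Int]
// product: Int = 15

// sha1 expects binary input; cast explicitly since no analyzer runs here.
val hash: String = sha1(lit("5").cast("binary")).expr.eval().toString
// hash: String = ac3478d69a3c81fa62e60f5c3696165a4e5e6ac4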

A UDF-based variant (note that this one still evaluates against a DataFrame column):
val isUuid = udf((uuid: String) => uuid.matches("[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}"))
df.withColumn("myCol_is_uuid", isUuid(col("myCol")))
  .filter("myCol_is_uuid = true")
  .show(10, false)

Related

How to add Spark DataFrames to a Seq() one by one in Scala

I created an empty Seq() using
scala> var x = Seq[DataFrame]()
x: Seq[org.apache.spark.sql.DataFrame] = List()
I have a function called createSamplesForOneDay() that returns a DataFrame, which I would like to add to this Seq() x.
val temp = createSamplesForOneDay(some_inputs) // this returns a Spark DF
x = x + temp // this throws an error
I get the following error:
scala> x = x + temp
<console>:59: error: type mismatch;
found : org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: String
x = x + temp
What I am trying to do is create a Seq() of DataFrames using a for loop, and at the end union them all using something like this:
val newDFs = Seq(DF1,DF2,DF3)
newDFs.reduce(_ union _)
as mentioned here - scala - Spark : How to union all dataframe in loop
You cannot append to a Seq using + (the compiler falls back to string concatenation via any2stringadd, which is why the error says required: String). Append like this:
x = x :+ temp
But as you have a List, you should rather prepend your elements, since prepending to a List is O(1) while appending is O(n):
x = temp +: x
Instead of adding elements one by one, you can write it more functionally if you pack your inputs in a sequence too:
val inputs = Seq(....) // create Seq of inputs
val x = inputs.map(i => createSamplesForOneDay(i))
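Combined with the reduce from the question, the whole loop collapses into one pipeline. A sketch, assuming createSamplesForOneDay and its inputs from the question:
import org.apache.spark.sql.DataFrame

// One DataFrame per input, then a single union at the end.
val combined: DataFrame = inputs
  .map(createSamplesForOneDay)
  .reduce(_ union _) // reduce throws on an empty Seq; use reduceOption if that can happen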

RDD of Tuple and RDD of Row differences

I have two different RDDs. I apply a foreach to both of them and notice a difference that I cannot explain.
First one:
val data = Array(("CORN",6), ("WHEAT",3), ("CORN",4), ("SOYA",4), ("CORN",1), ("PALM",2), ("BEANS",9), ("MAIZE",8), ("WHEAT",2), ("PALM",10))
val rdd = sc.parallelize(data, 3) // NOT sorted
// rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[103] at parallelize at command-325897530726166:8
rdd.foreach { x =>
  println(x)
}
This works fine.
Second one:
rddX.foreach { x =>
  val prod = x(0)
  val vol = x(1)
  val prt = counter
  val cnt = counter * 100
  println(prt, cnt, prod, vol)
}
// rddX: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[128] at rdd at command-686855653277634:51
This also works fine.
Question: why can I not do val prod = x(0) in the first example, the way I do in the second? How could I do that with foreach? Or do I always need to use map for the first case? Is it due to Row internals in the second example?
As you can see, the difference is in the data types.
The first one is an RDD[(String, Int)].
This is an RDD of Tuple2 containing (String, Int), so you access the first value (the String) as val prod = x._1 and the second (the Int) as x._2.
Since it is a tuple, you can't access it as val prod = x(0).
The second one is an RDD[org.apache.spark.sql.Row], which can be accessed as
val prod = x.getString(0) or val prod = x(0)
I hope this helps!
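As an addendum, a minimal sketch of both access styles for the tuple RDD, using the rdd from the question:
// Tuple2 has no positional apply, so x(0) fails; use _1/_2 instead.
rdd.foreach { x =>
  println(x._1, x._2)
}

// Or destructure the tuple with a pattern match, which names the fields:
rdd.foreach { case (prod, vol) =>
  println(prod, vol)
}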

Why is the count function not working with mapValues in Spark?

I am doing some basic hands-on exercises in Spark using Scala.
I would like to know why the count function does not work with the mapValues and map functions,
while sum, min and max all do. Also, is there any place where I can look up all the functions applicable to the Iterable[String] values produced by groupbykeyRDD?
MyCode:
scala> val records = List( "CHN|2", "CHN|3" , "BNG|2","BNG|65")
records: List[String] = List(CHN|2, CHN|3, BNG|2, BNG|65)
scala> val recordsRDD = sc.parallelize(records)
recordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[119] at parallelize at <console>:23
scala> val mapRDD = recordsRDD.map(elem => elem.split("\\|"))
mapRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[120] at map at <console>:25
scala> val keyvalueRDD = mapRDD.map(elem => (elem(0),elem(1)))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at map at <console>:27
scala> val groupbykeyRDD = keyvalueRDD.groupByKey()
groupbykeyRDD: org.apache.spark.rdd.RDD[(String, Iterable[String])] = ShuffledRDD[122] at groupByKey at <console>:29
scala> groupbykeyRDD.mapValues(elem => elem.count).collect
<console>:32: error: missing arguments for method count in trait TraversableOnce;
follow this method with `_' if you want to treat it as a partially applied function
groupbykeyRDD.mapValues(elem => elem.count).collect
^
scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count)).collect
<console>:32: error: missing arguments for method count in trait TraversableOnce;
follow this method with `_' if you want to treat it as a partially applied function
groupbykeyRDD.map(elem => (elem._1 ,elem._2.count)).collect
Expected output :
Array((CHN,2) ,(BNG,2))
The error you are getting has nothing to do with Spark; it's a pure Scala compilation error.
You can try it in a plain Scala console (no Spark at all):
scala> val iterableTest: Iterable[String] = Iterable("test")
iterableTest: Iterable[String] = List(test)
scala> iterableTest.count
<console>:29: error: missing argument list for method count in trait TraversableOnce
This is because Iterable does not define a zero-argument count method. It does define a count method, but it takes a predicate function as an argument, which is why you get this specific error about partially applied functions.
It does have a size method, though, which you can swap into your sample to make it work, as shown below.
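Applied to the question, that one-word swap is all it takes (a quick sketch; groupByKey output order may vary):
groupbykeyRDD.mapValues(_.size).collect
// res: Array[(String, Int)] = Array((CHN,2), (BNG,2))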
The elem you are getting is of type Iterable[String]; try the size method instead, because Iterable does not have a zero-argument count method. If that does not work, you can convert the Iterable[String] to a List and try its length method.
The count method that is available here (on the Iterable values, not the RDD) takes a predicate:
count(p) counts the occurrences of values satisfying the Boolean condition p.
count with your code: here it counts the number of occurrences of "2" and "3":
scala> groupbykeyRDD.collect().foreach(println)
(CHN,CompactBuffer(2, 3))
(BNG,CompactBuffer(2, 65))
scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count(_ == "2"))).collect
res14: Array[(String, Int)] = Array((CHN,1), (BNG,1))
scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count(_ == "3"))).collect
res15: Array[(String, Int)] = Array((CHN,1), (BNG,0))
count with a small fix to your code: if you tweak your code this way, then count gives you the expected results:
val keyvalueRDD = mapRDD.map(elem => (elem(0),1))
Test:
scala> val groupbykeyRDD = mapRDD.map(elem => (elem(0),1)).groupByKey()
groupbykeyRDD: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[9] at groupByKey at <console>:18
scala> groupbykeyRDD.collect().foreach(println)
(CHN,CompactBuffer(1, 1))
(BNG,CompactBuffer(1, 1))
scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count(_ == 1))).collect
res18: Array[(String, Int)] = Array((CHN,2), (BNG,2))
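For reference, a sketch that sidesteps the issue entirely (my addition, not from the thread): since only per-key counts are needed, reduceByKey avoids building the per-key buffers that groupByKey materializes.
// Map each record to (key, 1) and sum the ones per key.
val countsRDD = mapRDD
  .map(elem => (elem(0), 1))
  .reduceByKey(_ + _)
countsRDD.collect
// res: Array[(String, Int)] = Array((CHN,2), (BNG,2)) -- ordering may vary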

Scala runtime string interpolation/formatting

Are there any standard library facilities to do string interpolation/formatting at runtime? I'd like the formatting to behave exactly the same as the macro-based s"${...}" interpolation, except that my string format is loaded at runtime from a config file.
val format = config.getString("my.key")
val stringContext = parseFormat(format)
val res = stringContext.f("arg1", "arg2")
with parseFormat returning a StringContext.
I imagine, worst case, I could just split the string on "{}" sequences and use the parts to construct the StringContext.
// untested; note the brace must be escaped, since a bare "{" is an illegal regex
def parseFormat(format: String): StringContext =
  new StringContext("""\{}""".r.split(format): _*)
Is there an obvious solution that I'm missing or would the above hack do the trick?
There are no silly questions. Only Sunday mornings.
First, don't use String.format.
scala> val s = "Count to %d"
s: String = Count to %d
scala> String format (s, 42)
<console>:9: error: overloaded method value format with alternatives:
(x$1: java.util.Locale,x$2: String,x$3: Object*)String <and>
(x$1: String,x$2: Object*)String
cannot be applied to (String, Int)
String format (s, 42)
^
scala> s format 42
res1: String = Count to 42
But formatting can be expensive. So with your choice of escape handling:
scala> StringContext("Hello, {}. Today is {}." split "\\{}" : _*).s("Bob", "Tuesday")
res2: String = Hello, Bob. Today is Tuesday.
scala> StringContext("""Hello, \"{}.\" Today is {}.""" split "\\{}" : _*).s("Bob", "Tuesday")
res3: String = Hello, "Bob." Today is Tuesday.
scala> StringContext("""Hello, \"{}.\" Today is {}.""" split "\\{}" : _*).raw("Bob", "Tuesday")
res4: String = Hello, \"Bob.\" Today is Tuesday.
It turns out that split doesn't quite hack it, because it drops the trailing empty part that a trailing placeholder should produce:
scala> StringContext("Count to {}" split "\\{}" : _*) s 42
java.lang.IllegalArgumentException: wrong number of arguments (1) for interpolated string with 1 parts
at scala.StringContext.checkLengths(StringContext.scala:65)
at scala.StringContext.standardInterpolator(StringContext.scala:121)
at scala.StringContext.s(StringContext.scala:94)
... 33 elided
So given
scala> val r = "\\{}".r
r: scala.util.matching.Regex = \{}
scala> def parts(s: String) = r split s
parts: (s: String)Array[String]
Maybe
scala> def f(parts: Seq[String], args: Any*) = (parts zip args map (p => p._1 + p._2)).mkString
f: (parts: Seq[String], args: Any*)String
So
scala> val count = parts("Count to {}")
count: Array[String] = Array("Count to ")
scala> f(count, 42)
res7: String = Count to 42
scala> f(parts("Hello, {}. Today is {}."), "Bob", "Tuesday")
res8: String = Hello, Bob. Today is Tuesday
Hey, wait! The final period went missing, because zip stops at the shorter collection:
scala> def f(parts: Seq[String], args: Any*) = (parts.zipAll(args, "", "") map (p => p._1 + p._2)).mkString
f: (parts: Seq[String], args: Any*)String
scala> f(parts("Hello, {}. Today is {}."), "Bob", "Tuesday")
res9: String = Hello, Bob. Today is Tuesday.
or
scala> def f(parts: Seq[String], args: Any*) = (for (i <- 0 until (parts.size max args.size)) yield (parts.applyOrElse(i, (_: Int) => "") + args.applyOrElse(i, (_: Int) => ""))).mkString
f: (parts: Seq[String], args: Any*)String
or
scala> def f(parts: Seq[String], args: Any*) = { val sb = new StringBuilder ; for (i <- 0 until (parts.size max args.size) ; ss <- List(parts, args)) { sb append ss.applyOrElse(i, (_: Int) => "") } ; sb.toString }
f: (parts: Seq[String], args: Any*)String
scala> f(parts("Hello, {}. Today is {}. {}"), "Bob", "Tuesday", "Bye!")
res16: String = Hello, Bob. Today is Tuesday. Bye!
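Pulling those pieces together into the parseFormat shape the question asked for (a sketch; it returns the parts directly rather than a StringContext, since StringContext.s rejects a trailing placeholder as shown above):
// Runtime "{}" interpolation, assembled from the pieces above.
val placeholder = "\\{}".r

def parseFormat(format: String): Seq[String] =
  placeholder.split(format).toSeq

// zipAll pads the shorter side with "", so both a trailing placeholder
// and trailing literal text survive.
def interp(parts: Seq[String], args: Any*): String =
  parts.zipAll(args, "", "").map { case (part, arg) => s"$part$arg" }.mkString

interp(parseFormat("Hello, {}. Today is {}."), "Bob", "Tuesday")
// res: String = Hello, Bob. Today is Tuesday.
interp(parseFormat("Count to {}"), 42)
// res: String = Count to 42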
A. As of Scala 2.10.3, you can't use StringContext.f unless you know the number of arguments at compile time since the .f method is a macro.
B. Use String.format, just like you would in the good ol' days of Java.
I had a similar requirement: I was loading a Seq[String] from a config file which would become a command to be executed (using scala.sys.process). To simplify the format and avoid any potential escaping problems, I made the variable names a configurable option too.
The config looked something like this:
command = ["""C:\Program Files (x86)\PuTTY\pscp.exe""", "-P", "2222", "-i",
".vagrant/machines/default/virtualbox/private_key", "$source", "~/$target"]
source = "$source"
target = "$target"
I couldn't find a nice (or efficient) way of using StringContext or "string".format, so I rolled my own VariableCommand. It is quite similar to StringContext, except that a single variable can appear zero or more times, in any order, and in any of the items.
The basic idea was to create a function that takes the variable values and then repeatedly appends either a chunk of the string (e.g. "~/") or a variable value (e.g. "test.conf") to build up the result (e.g. "~/test.conf"). This function is created once, which is where all the complexity lives; at substitution time it is really simple (and hopefully fast, although I haven't done any performance testing, or much testing at all for that matter).
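A rough sketch of that shape, purely illustrative (VariableCommand itself is not shown here, so every name below is hypothetical):
import java.util.regex.Pattern

// Compile the template once; substitution is then a cheap walk over precomputed chunks.
final case class CommandTemplate(items: Seq[Seq[Either[String, String]]]) {
  // Left = literal chunk of an item, Right = variable name to look up.
  def substitute(vars: Map[String, String]): Seq[String] =
    items.map(_.map {
      case Left(text)  => text
      case Right(name) => vars.getOrElse(name, "")
    }.mkString)
}

object CommandTemplate {
  def compile(command: Seq[String], variables: Set[String]): CommandTemplate = {
    require(variables.nonEmpty, "need at least one variable token")
    // One regex matching any variable token literally, e.g. \Q$source\E|\Q$target\E
    val varRegex = variables.map(Pattern.quote).mkString("|").r
    def tokenize(item: String): Seq[Either[String, String]] = {
      val pad      = Left(""): Either[String, String]
      val literals = varRegex.pattern.split(item, -1).toList // n matches -> n+1 chunks
      val found    = varRegex.findAllIn(item).toList         // the n tokens, in order
      // Interleave the literal chunks with the variable tokens found between them.
      literals.map(l => Left(l): Either[String, String])
        .zipAll(found.map(v => Right(v): Either[String, String]), pad, pad)
        .flatMap { case (l, v) => Seq(l, v) }
    }
    CommandTemplate(command.map(tokenize))
  }
}

// e.g. CommandTemplate.compile(command, Set("$source", "$target"))
//        .substitute(Map("$source" -> "test.conf", "$target" -> "files/test.conf"))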
For those that might wonder why I was doing this it was for running automation tests cross platform using ansible (which doesn't support Windows control machines) for provisioning. This allowed me to copy the files to the target machine and run ansible locally.

How to get a Map from a CSV string

I'm fairly new to Scala and working through some exercises.
I have a string like "A>Augsburg;B>Berlin". What I want at the end is a map
val mymap = Map("A"->"Augsburg", "B"->"Berlin")
What I did is:
val st = locations.split(";").map(dynamicListExtract _)
with the function
private def dynamicListExtract(input: String) = {
if (input contains ">") {
val split = input split ">"
Some(split(0), split(1)) // return key , value
} else {
None
}
}
Now I have an Array[Option[(String, String)]].
How do I elegantly convert this into a Map[String, String]?
Can anybody help?
Thanks
Just change your map call to flatMap:
scala> sPairs.split(";").flatMap(dynamicListExtract _)
res1: Array[(java.lang.String, java.lang.String)] = Array((A,Augsburg), (B,Berlin))
scala> Map(sPairs.split(";").flatMap(dynamicListExtract _): _*)
res2: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map((A,Augsburg), (B,Berlin))
For comparison:
scala> Map("A" -> "Augsburg", "B" -> "Berlin")
res3: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map((A,Augsburg), (B,Berlin))
In 2.8, you can do this:
val locations = "A>Augsburg;B>Berlin"
val result = locations.split(";").map(_ split ">") collect { case Array(k, v) => (k, v) } toMap
collect is like map, but it also filters out values for which the partial function is not defined. toMap will create a Map from a Traversable as long as it's a Traversable[(K, V)].
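A quick illustration of that filtering (my own example input): an entry without ">" splits into a one-element Array, matches no case, and is silently dropped.
val messy = "A>Augsburg;B;C>Chemnitz" // hypothetical input; "B" is malformed
val m = messy.split(";").map(_ split ">").collect { case Array(k, v) => (k, v) }.toMap
// m: Map[String,String] = Map(A -> Augsburg, C -> Chemnitz)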
It's also worth seeing Randall's solution in for-comprehension form, which might be clearer, or at least give you a better idea of what flatMap is doing.
Map.empty ++ (for(possiblePair<-sPairs.split(";"); pair<-dynamicListExtract(possiblePair)) yield pair)
A simple solution (not handling error cases):
val str = "A>Aus;B>Ber"
var map = Map[String,String]()
str.split(";").map(_.split(">")).foreach(a=>map += a(0) -> a(1))
but Ben Lings' is better.
val str= "A>Augsburg;B>Berlin"
Map(str.split(";").map(_ split ">").map(s => (s(0),s(1))):_*)
or
str.split(";").map(_ split ">").foldLeft(Map[String,String]())((m,s) => m + (s(0) -> s(1)))
