spark sql dynamic filter condition - apache-spark

How can I construct a boolean filter condition dynamically in spark sql?
Having:
val d = Seq(1, 2, 3, 5, 6).toDF
d.filter(col("value") === 1 or col("value") === 3).show
How can I replicate this dynamically:
val desiredThings = Seq(1,3)
I try to build the filter:
val myCondition = desiredThings.map(col("value") === _)
d.filter(myCondition).show
but executing d.filter(myCondition).show fails with:
overloaded method value filter with alternatives:
org.apache.spark.api.java.function.FilterFunction[org.apache.spark.sql.Row]
cannot be applied to (Seq[org.apache.spark.sql.Column])
Also when experimenting with foldLeft:
val myCondition = desiredThings.foldLeft()((result, entry) => result && col(c.columnCounterId) === entry)
I get compile errors.
How can I adapt the code to dynamically generate the filter predicate?

Just use isin:
d.filter(col("value").isin(desiredThings: _*))
but if you really want to foldLeft you have to provide the base condition:
d.filter(desiredThings.foldLeft(lit(false))(
  (acc, x) => acc || col("value") === x
))
Alternatively, to use with filter or where, you can generate a SQL expression using:
val filterExpr = desiredThings.map( v => s"value = $v").mkString(" or ")
And then use it like
d.filter(filterExpr).show
// or
d.where(filterExpr).show
//+-----+
//|value|
//+-----+
//|    1|
//|    3|
//+-----+
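For completeness, the Seq[Column] from the original map attempt can also be combined directly. A minimal sketch using reduce (assuming desiredThings is non-empty, otherwise reduce throws):
// Combine the mapped columns with || instead of using foldLeft
val combinedCondition = desiredThings.map(col("value") === _).reduce(_ || _)
d.filter(combinedCondition).show
This is equivalent to the foldLeft version but needs no lit(false) base condition.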

Related

Print characters at even and odd indices from a String

Using Scala, how can I print the characters at the even and odd indices of a given string? I am aware of the imperative approach using var. I am looking for an approach that is concise, uses immutability, and avoids side effects (except, of course, for printing the result).
Here is a tail-recursive solution returning the even and odd chars as a (List[Char], List[Char]) pair in one go:
import scala.annotation.tailrec

def f(in: String): (List[Char], List[Char]) = {
  @tailrec
  def run(s: String, idx: Int, accEven: List[Char], accOdd: List[Char]): (List[Char], List[Char]) = {
    if (idx < 0) (accEven, accOdd)
    else if (idx % 2 == 0) run(s, idx - 1, s.charAt(idx) :: accEven, accOdd)
    else run(s, idx - 1, accEven, s.charAt(idx) :: accOdd)
  }
  run(in, in.length - 1, Nil, Nil)
}
which could be printed like so:
val (even, odd) = f("abcdefg")
println(even.mkString) // aceg
println(odd.mkString)  // bdf
Another way to explore is using zipWithIndex:
def printer(evenOdd: Int): Unit = {
  val str = "1234"
  str.zipWithIndex.foreach { i =>
    i._2 % 2 match {
      case x if x == evenOdd => print(i._1)
      case _ =>
    }
  }
}
In this case you can check the results by using the printer function
scala> printer(1)
24
scala> printer(0)
13
.zipWithIndex takes a sequence and returns tuples of the elements paired with their index; a String is implicitly treated as a sequence of Char.
Looking at str
scala> val str = "1234"
str: String = 1234
str.zipWithIndex
res: scala.collection.immutable.IndexedSeq[(Char, Int)] = Vector((1,0), (2,1), (3,2), (4,3))
Lastly, since you only need to print, foreach is a better fit than map, as no values need to be returned.
You can use the sliding function, which is quite simple: sliding(1, 2) takes windows of length 1 advancing by 2 characters, so it picks every second character:
scala> "abcdefgh".sliding(1,2).mkString("")
res16: String = aceg
scala> "abcdefgh".tail.sliding(1,2).mkString("")
res17: String = bdfh
val s = "abcd"
// ac
(0 until s.length by 2).map(i => s(i))
// bd
(1 until s.length by 2).map(i => s(i))
Just pure functions with the map operator.

Update CoordinateMatrix entry

Is there an efficient way to update the value at a certain index (i, j) of a CoordinateMatrix?
Currently I'm using map to iterate over all values and update only when I find the given indices, but I don't think this is the right way to do it.
There is not. CoordinateMatrix is backed by an RDD and is immutable. Even if you optimize access by:
Getting its entries:
val mat: CoordinateMatrix = ???
val entries = mat.entries
Converting to RDD of ((row, col), value) and hash partitioning.
val n: Int = ???
val partitioner = new org.apache.spark.HashPartitioner(n)
val pairs = entries.map(e => ((e.i, e.j), e.value)).partitionBy(partitioner)
Mapping only a single partition:
import org.apache.spark.rdd.RDD

def update(mat: RDD[((Long, Long), Double)], i: Long, j: Long, v: Double) = {
  // Find the partition holding (i, j), if the RDD has a partitioner
  val target = mat.partitioner.map(_.getPartition((i, j)))
  target.map(p => mat.mapPartitionsWithIndex {
    // Rewrite only the single partition containing the entry
    case (pi, iter) if pi == p => iter.map {
      case ((ii, jj), _) if ii == i && jj == j => ((ii, jj), v)
      case x => x
    }
    case (_, iter) => iter
  })
}
you'll still make a new RDD for each update.
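If you need a CoordinateMatrix back after such an update, the pair RDD can simply be wrapped again. A minimal sketch, assuming updated is an RDD[((Long, Long), Double)] like the one produced above:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
// Rebuild a CoordinateMatrix from the updated ((row, col), value) pairs
val updatedMat = new CoordinateMatrix(
  updated.map { case ((i, j), v) => MatrixEntry(i, j, v) }
)
Keep in mind this is still a full rebuild, not an in-place modification.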

I want to collect the data frame column values in an array list to conduct some computations, is it possible?

I am loading data from Phoenix through this:
val tableDF = sqlContext.phoenixTableAsDataFrame("Hbtable", Array("ID", "distance"), conf = configuration)
and I want to carry out the following computation on the values of the column distance:
val list = Array(10, 20, 30, 40, 10, 20, 0, 10, 20, 30, 40, 50, 60) // values from the column distance
val first = list(0)
val last = list(list.length - 1)
var m = 0
for (a <- 0 to list.length - 2) {
  if (list(a + 1) < list(a) && list(a + 1) >= 0) {
    m = m + list(a)
  }
}
val totalDist = m + last - first
You can do something like this. It returns an Array[Any]:
val array = df.select("distance").rdd.map(r => r(0)).collect()
If you want the values with their proper type, cast them instead. This returns an Array[Int]:
val array = df.select("distance").rdd.map(r => r(0).asInstanceOf[Int]).collect()
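As a follow-up sketch (not part of the original answer), the computation from the question can then be run over the collected values without mutable state, assuming array is the Array[Int] from the line above:
// Sum list(a) whenever the next value drops but stays non-negative, as in the original loop
val m = array.sliding(2).collect { case Array(a, b) if b < a && b >= 0 => a }.sum
val totalDist = m + array.last - array.head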

How to find out the machine in the cluster which stores a given element in RDD and send a message to it?

Given an RDD, for example RDD = {"0", "1", "2", ..., "99999"}, can I find out which machine in the cluster stores a given element (e.g. 100)?
And then, during a shuffle, can I aggregate some data and send it to that specific machine? I know that the partitioning of an RDD is transparent to users, but could I use something like key/value pairs to achieve this?
Generally speaking the answer is no, or at least not with the RDD API. If you can express your logic using graphs, you can try the message-based APIs in GraphX or Giraph. If not, using Akka directly instead of Spark could be a better choice.
Still, there are some workarounds, but I wouldn't expect high performance. Let's start with some dummy data:
import org.apache.spark.rdd.RDD
val toPairs = (s: Range) => s.map(_.toChar.toString)
val rdd: RDD[(Int, String)] = sc.parallelize(Seq(
(0, toPairs(97 to 100)), // a-d
(1, toPairs(101 to 107)), // e-k
(2, toPairs(108 to 115)) // l-s
)).flatMap{ case (i, vs) => vs.map(v => (i, v)) }
and partition it using a custom partitioner:
import org.apache.spark.Partitioner
class IdentityPartitioner(n: Int) extends Partitioner {
def numPartitions: Int = n
def getPartition(key: Any): Int = key.asInstanceOf[Int]
}
val partitioner = new IdentityPartitioner(4)
val parts = rdd.partitionBy(partitioner)
Now we have an RDD with 4 partitions, one of them empty:
parts.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size))).collect
// Array[(Int, Int)] = Array((0,4), (1,7), (2,8), (3,0))
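Note that with an explicit partitioner in place, the partition holding a given key is deterministic, so you can at least compute where an element lives (a partition index rather than a physical machine, but it is the unit Spark schedules and shuffles by):
// Sketch: locate the partition of a given key
val targetPartition = partitioner.getPartition(1) // Int = 1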
The simplest thing you can do is to leverage partitioning itself. First a dummy function and a helper:
// Dummy map function
def transform(s: String) =
Map("e" -> "x", "k" -> "y", "l" -> "z").withDefault(identity)(s)
// Map String to partition
def address(curr: Int, s: String) = {
val m = Map("x" -> 3, "y" -> 3, "z" -> 3).withDefault(x => curr)
(m(s), s)
}
and "send" data:
val transformed: RDD[(Int, String)] = parts
// Emit pairs (partition, string)
.map{case (i, s) => address(i, transform(s))}
// Repartition
.partitionBy(partitioner)
transformed
.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
.collect
// Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
Another approach is to collect "messages":
val tmp = parts.mapValues(s => transform(s))
val messages: Map[Int,Iterable[String]] = tmp
.flatMap{case (i, s) => {
val target = address(i, s)
if (target != (i, s)) Seq(target) else Seq()
}}
.groupByKey
.collectAsMap
create a broadcast variable:
val messagesBD = sc.broadcast(messages)
and use it to send messages:
val transformed = tmp
.filter{case (i, s) => address(i, s) == (i, s)}
.mapPartitionsWithIndex((i, iter) => {
val combined = iter ++ messagesBD.value.getOrElse(i, Seq())
combined.map((i, _))
}, true)
transformed
.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
.collect
// Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
Note the following line:
val combined = iter ++ messagesBD.value.getOrElse(i, Seq())
messagesBD.value is the entire broadcast value, which is actually a Map[Int, Iterable[String]], but the getOrElse call returns only the data that was mapped to partition i (if any).
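For the dummy data above, the broadcast map has a single non-empty entry, so a quick check (hypothetical REPL output; the contents depend on the transform and address functions defined earlier) would look like:
// messagesBD.value.getOrElse(3, Seq())  -> the re-addressed strings "x", "y", "z"
// messagesBD.value.getOrElse(0, Seq())  -> Seq()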

Scala split string and sort data

Hi, I am new to Scala. My string contains the following data:
CLASS: Win32_PerfFormattedData_PerfProc_Process$$(null)|CreatingProcessID|Description|ElapsedTime|Frequency_Object|Frequency_PerfTime|Frequency_Sys100NS|HandleCount|IDProcess|IODataBytesPersec|IODataOperationsPersec|IOOtherBytesPersec|IOOtherOperationsPersec|IOReadBytesPersec|IOReadOperationsPersec|IOWriteBytesPersec|IOWriteOperationsPersec|Name|PageFaultsPersec|PageFileBytes|PageFileBytesPeak|PercentPrivilegedTime|PercentProcessorTime|PercentUserTime|PoolNonpagedBytes|PoolPagedBytes|PriorityBase|PrivateBytes|ThreadCount|Timestamp_Object|Timestamp_PerfTime|Timestamp_Sys100NS|VirtualBytes|VirtualBytesPeak|WorkingSet|WorkingSetPeak|WorkingSetPrivate$$(null)|0|(null)|8300717|0|0|0|0|0|0|0|0|0|0|0|0|0|Idle|0|0|0|100|100|0|0|0|0|0|8|0|0|0|0|0|24576|24576|24576$$(null)|0|(null)|8300717|0|0|0|578|4|0|0|0|0|0|0|0|0|System|0|114688|274432|17|0|0|0|0|8|114688|124|0|0|0|3469312|8908800|311296|5693440|61440$$(null)|4|(null)|8300717|0|0|0|42|280|0|0|0|0|0|0|0|0|smss|0|782336|884736|110|0|0|1864|10664|11|782336|3|0|0|0|5701632|19357696|1388544|1417216|700416$$(null)|372|(null)|8300715|0|0|0|1438|380|0|0|0|0|0|0|0|0|csrss|0|3624960|3747840|0|0|0|15008|157544|13|3624960|10|0|0|0|54886400|55345152|5586944|5648384|2838528$$(null)|424|(null)|8300714|0|0|0|71|432|0|0|0|0|0|0|0|0|csrss#1|0|8605696|8728576|0|0|0|8720|96384|13|8605696|9|0|0|0|50515968|50909184|7438336|9342976|4972544
Now I want to find the data whose keys are PercentProcessorTime, ElapsedTime, etc. To do this, I first split the above string on $$, then split each part on |, searched for the part containing PercentProcessorTime to get its index, skipped the first two arrays produced by the $$ split, and fetched the PercentProcessorTime data using that index. It looks complicated, but I think the following code should help:
// First split the string as below
val processData = winProcessData.split("\\$\\$")
// get the index here
val getIndex: Int = processData.find(part => part.contains("PercentProcessorTime"))
  .map { case getData => getData } match {
    case Some(s) => s.split("\\|").indexOf("PercentProcessorTime")
    case None => -1
  }
val getIndexOfElapsedTime: Int = processData.find(part => part.contains("ElapsedTime"))
  .map { case getData => getData } match {
    case Some(s) => s.split("\\|").indexOf("ElapsedTime")
    case None => -1
  }
// now fetch the data at the above indices as below
for (i <- 2 to (processData.length - 1)) {
  val getValues = processData(i).split("\\|")
  val getPercentProcessTime = getValues(getIndex).toFloat
  val getElapsedTime = getValues(getIndexOfElapsedTime).toFloat
  Logger.info("(" + getPercentProcessTime + "," + getElapsedTime + "),")
}
Now the problem is that with the above code I get the data for the given keys by index, so my output was (8300717,100), (8300717,17), (8300717,110), ... I want to sort this data by getPercentProcessTime, so the output should be (8300717,110), (8300717,100), (8300717,17), ...
The data should also end up in a list so I can pass it to a case class.
Do you mean PercentProcessorTime or PercentPrivilegedTime?
Here it is
val str = "your very long string"
val heads = Seq("PercentPrivilegedTime", "ElapsedTime")
val Array(elap, perc) = str.split("\\$\\$").tail.map(_.split("\\|"))
.transpose.filter(x => heads.contains(x.head))
//elap: Array[String] = Array(ElapsedTime, 8300717, 8300717, 8300717, 8300715, 8300714)
//perc: Array[String] = Array(PercentPrivilegedTime, 100, 17, 110, 0, 0)
val res = (elap.tail, perc.tail).zipped.toList.sortBy(-_._2.toInt)
//List[(String, String)] = List((8300717,110), (8300717,100), (8300717,17), (8300715,0), (8300714,0))
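To get the data into a case class, as mentioned at the end of the question, the sorted pairs can simply be mapped over; a small sketch with a hypothetical case class:
// ProcessSample is a made-up name; adjust the fields to your needs
case class ProcessSample(elapsedTime: Long, percentProcessorTime: Int)
val samples: List[ProcessSample] =
  res.map { case (elapsed, percent) => ProcessSample(elapsed.toLong, percent.toInt) }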
