How to reuse the result from spark stream?

How to reuse the result from spark stream? - apache-spark

How can we use the value inside the map, it seems the values not being filled.
val goalScore = rawScore.transform(rdd=>{
val minMax = rdd.flatMap(x=>{
x.behaviorProfileType match {
case Some("mapper") => Some((x.sourceType, x.targetType, "mapper"), x)
case Some("non-mappe") => Some((x.sourceType, x.targetType, "non-mapper"), x)
case _ => None
}
})
.reduceByKey(reduceMinMax(_, _))
.collectAsMap()
rdd.map(x => (populateMinMaxWindowGoalScore(x, minMax)))
})
Why minMax is always empty inside populateMinMaxWindowGoalScore function? rawScore is a DStream.

Related

Update CoordinateMatrix entry

Is there an efficient way to update a value for a certain index (i,j) for CoordinateMatrix?
Currently I'm using map to iterate all values and update only when I find those certain indexes but I don't think this is the right way to do it

There is not. CoordinateMatrix is backed by and RDD and is immutable. Even if you optimize access by:
Getting its entries:
val mat: CoordinateMatrix = ???
val entries = mat.entries
Converting to RDD of ((row, col), value) and hash partitioning.
val n: Int = ???
val partitioner = new org.apache.spark.HashPartitioner(n)
val pairs = entries.map(e => ((e.i, e.j), e.value)).partitionBy(partitioner)
Mapping only a single partition:
def update(mat: RDD[((Long, Long), Double)], i: Long, j: Long, v: Double) = {
val p = mat.partitioner.map(_.getPartition((i, j)))
p.map(p => mat.mapPartitionsWithIndex{
case (pi, iter) if pi == p => iter.map {
case ((ii, jj), _) if ii == i && jj == j => ((ii, jj), v)
case x => x
}
case (_, iter) => iter
})
}
you'll still make a new RDD for each update.

How to use scala strings in list-like pattern matching

So I was reading up about how scala lets you treat string as a sequence of chars through its implicit mechanism. I created a generic Trie class for a general element type and wanted to use it's Char based implementation with string like syntax.
import collection.mutable
import scala.annotation.tailrec
case class Trie[Elem, Meta](children: mutable.Map[Elem, Trie[Elem, Meta]], var metadata: Option[Meta] = None) {
def this() = this(mutable.Map.empty)
#tailrec
final def insert(item: Seq[Elem], metadata: Meta): Unit = {
item match {
case Nil =>
this.metadata = Some(metadata)
case x :: xs =>
children.getOrElseUpdate(x, new Trie()).insert(xs, metadata)
}
}
def insert(items: (Seq[Elem], Meta)*): Unit = items.foreach { case (item, meta) => insert(item, meta) }
def find(item: Seq[Elem]): Option[Meta] = {
item match {
case Nil => metadata
case x :: xs => children.get(x).flatMap(_.metadata)
}
}
}
object Trie extends App {
type Dictionary = Trie[Char, String]
val dict = new Dictionary()
dict.insert( "hello", "meaning of hello")
dict.insert("hi", "another word for hello")
dict.insert("bye", "opposite of hello")
println(dict)
}
Weird thing is, it compiles fine but gives error on running:
Exception in thread "main" scala.MatchError: hello (of class scala.collection.immutable.WrappedString)
at Trie.insert(Trie.scala:11)
at Trie$.delayedEndpoint$com$inmobi$data$mleap$Trie$1(Trie.scala:34)
at Trie$delayedInit$body.apply(Trie.scala:30)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Trie$.main(Trie.scala:30)
at Trie.main(Trie.scala)
It's able to implicitly convert String to WrappedString, but that doesn't match the ::. Any workarounds for this?

You can use startsWith as follows:
val s = "ThisIsAString"
s match {
case x if x.startsWith("T") => 1
case _ => 0
}
Or convert your String to List of chars with toList
scala> val s = "ThisIsAString"
s: String = ThisIsAString
scala> s.toList
res10: List[Char] = List(T, h, i, s, I, s, A, S, t, r, i, n, g)
An then use it as any other List
s.toList match {
case h::t => whatever
case _ => anotherThing
}

Your insert method declares item to be a Seq, but your pattern match only matches on List. A string can be implicitly converted to a Seq[Char], but it isn't a List. Use a pattern match on Seq instead of List using +:.
#tailrec
final def insert(item: Seq[Elem], metadata: Meta): Unit = {
item match {
case Seq() =>
this.metadata = Some(metadata)
case x +: xs =>
children.getOrElseUpdate(x, new Trie()).insert(xs, metadata)
}
}
The same applies to your find method.

How to find out the machine in the cluster which stores a given element in RDD and send a message to it?

I want to know if in an RDD, for example, RDD = {"0", "1", "2",... "99999"}, can I find out the machine in the cluster which stores a given element (e.g.: 100)?
And then in shuffle, can I aggregate some data and send it to the certain machine? I know that the partition of RDD is transparent for users but could I use some method like key/value to achieve that?

Generally speaking the answer is no or at least not with RDD API. If you can express your logic using graphs then you can try message based API in GraphX or Giraph. If not then using Akka directly instead of Spark could be a better choice.
Still, there are some workarounds but I wouldn't expect high performance. Lets start with some dummy data:
import org.apache.spark.rdd.RDD
val toPairs = (s: Range) => s.map(_.toChar.toString)
val rdd: RDD[(Int, String)] = sc.parallelize(Seq(
(0, toPairs(97 to 100)), // a-d
(1, toPairs(101 to 107)), // e-k
(2, toPairs(108 to 115)) // l-s
)).flatMap{ case (i, vs) => vs.map(v => (i, v)) }
and partition it using custom partitioner:
import org.apache.spark.Partitioner
class IdentityPartitioner(n: Int) extends Partitioner {
def numPartitions: Int = n
def getPartition(key: Any): Int = key.asInstanceOf[Int]
}
val partitioner = new IdentityPartitioner(4)
val parts = rdd.partitionBy(partitioner)
Now we have RDD with 4 partitions including one empty:
parts.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size))).collect
// Array[(Int, Int)] = Array((0,4), (1,7), (2,8), (3,0))
The simplest thing you can do is to leverage partitioning itself. First a dummy function and a helper:
// Dummy map function
def transform(s: String) =
Map("e" -> "x", "k" -> "y", "l" -> "z").withDefault(identity)(s)
// Map String to partition
def address(curr: Int, s: String) = {
val m = Map("x" -> 3, "y" -> 3, "z" -> 3).withDefault(x => curr)
(m(s), s)
}
and "send" data:
val transformed: RDD[(Int, String)] = parts
// Emit pairs (partition, string)
.map{case (i, s) => address(i, transform(s))}
// Repartition
.partitionBy(partitioner)
transformed
.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
.collect
// Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
another approach is to collect "messages":
val tmp = parts.mapValues(s => transform(s))
val messages: Map[Int,Iterable[String]] = tmp
.flatMap{case (i, s) => {
val target = address(i, s)
if (target != (i, s)) Seq(target) else Seq()
}}
.groupByKey
.collectAsMap
create broadcast
val messagesBD = sc.broadcast(messages)
and use it to send messages:
val transformed = tmp
.filter{case (i, s) => address(i, s) == (i, s)}
.mapPartitionsWithIndex((i, iter) => {
val combined = iter ++ messagesBD.value.getOrElse(i, Seq())
combined.map((i, _))
}, true)
transformed
.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
.collect
// Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
Note the following line:
val combined = iter ++ messagesBD.value.getOrElse(i, Seq())
messagesBD.value is the entire broadcast data, which is actually a Map[Int,Iterable[String]], but then getOrElse method returns only the data that was mapped to i (if available).

Spark GraphX subgraph method generate null.

I use subgraph to filter the graph vertices.
However when I collect the vertices, some null value are in there.
I can guarantee that the original graph vertices contains no null value.
class A ....
val graph = .... // no null value contained
val selected_id:Set[Int] = SomeAlgorithm collect() toSet
val sub = graph.subgraph(vpred = (id,data) => data match {
case x:A => selected_id contains x.id
case _ => true
})
sub.vertices.map(_._2).collect() filter (_ == null) foreach println //null printed out

Scala split string and sort data

Hi I am new in scala and I achieved following things in scala, my string contain following data
CLASS: Win32_PerfFormattedData_PerfProc_Process$$(null)|CreatingProcessID|Description|ElapsedTime|Frequency_Object|Frequency_PerfTime|Frequency_Sys100NS|HandleCount|IDProcess|IODataBytesPersec|IODataOperationsPersec|IOOtherBytesPersec|IOOtherOperationsPersec|IOReadBytesPersec|IOReadOperationsPersec|IOWriteBytesPersec|IOWriteOperationsPersec|Name|PageFaultsPersec|PageFileBytes|PageFileBytesPeak|PercentPrivilegedTime|PercentProcessorTime|PercentUserTime|PoolNonpagedBytes|PoolPagedBytes|PriorityBase|PrivateBytes|ThreadCount|Timestamp_Object|Timestamp_PerfTime|Timestamp_Sys100NS|VirtualBytes|VirtualBytesPeak|WorkingSet|WorkingSetPeak|WorkingSetPrivate$$(null)|0|(null)|8300717|0|0|0|0|0|0|0|0|0|0|0|0|0|Idle|0|0|0|100|100|0|0|0|0|0|8|0|0|0|0|0|24576|24576|24576$$(null)|0|(null)|8300717|0|0|0|578|4|0|0|0|0|0|0|0|0|System|0|114688|274432|17|0|0|0|0|8|114688|124|0|0|0|3469312|8908800|311296|5693440|61440$$(null)|4|(null)|8300717|0|0|0|42|280|0|0|0|0|0|0|0|0|smss|0|782336|884736|110|0|0|1864|10664|11|782336|3|0|0|0|5701632|19357696|1388544|1417216|700416$$(null)|372|(null)|8300715|0|0|0|1438|380|0|0|0|0|0|0|0|0|csrss|0|3624960|3747840|0|0|0|15008|157544|13|3624960|10|0|0|0|54886400|55345152|5586944|5648384|2838528$$(null)|424|(null)|8300714|0|0|0|71|432|0|0|0|0|0|0|0|0|csrss#1|0|8605696|8728576|0|0|0|8720|96384|13|8605696|9|0|0|0|50515968|50909184|7438336|9342976|4972544
now I want to find data who's value is PercentProcessorTime, ElapsedTime,.. so for this I first split above string $$ and then again split string using | and this new split string I searched string where PercentProcessorTime' presents and get Index of that string when I get string then skipped first two arrays which split from$$and get data ofPercentProcessorTime` using index , it's looks like complicated but I think following code should helps
// First split string as below
val processData = winProcessData.split("\\$\\$")
// get index here
val getIndex: Int = processData.find(part => part.contains("PercentProcessorTime"))
.map {
case getData =>
getData
} match {
case Some(s) => s.split("\\|").indexOf("PercentProcessorTime")
case None => -1
}
val getIndexOfElapsedTime: Int = processData.find(part => part.contains("ElapsedTime"))
.map {
case getData =>
getData
} match {
case Some(s) => s.split("\\|").indexOf("ElapsedTime")
case None => -1
}
// now fetch data of above index as below
for (i <- 2 to (processData.length - 1)) {
val getValues = processData(i).split("\\|")
val getPercentProcessTime = getValues(getIndex).toFloat
val getElapsedTime = getValues(getIndexOfElapsedTime).toFloat
Logger.info("("+getPercentProcessTime+","+getElapsedTime+"),")
}
Now Problem is that using above code I was getting data of given key in index, so my output was (8300717,100),(8300717,17)(8300717,110)... Now I want sort this data using getPercentProcessTime so my output should be (8300717,110),(8300717,100)(8300717,17)...
and that data should be in lists so I will pass list to case class.

Are you find PercentProcessorTime or PercentPrivilegedTime ?
Here it is
val str = "your very long string"
val heads = Seq("PercentPrivilegedTime", "ElapsedTime")
val Array(elap, perc) = str.split("\\$\\$").tail.map(_.split("\\|"))
.transpose.filter(x => heads.contains(x.head))
//elap: Array[String] = Array(ElapsedTime, 8300717, 8300717, 8300717, 8300715, 8300714)
//perc: Array[String] = Array(PercentPrivilegedTime, 100, 17, 110, 0, 0)
val res = (elap.tail, perc.tail).zipped.toList.sortBy(-_._2.toInt)
//List[(String, String)] = List((8300717,110), (8300717,100), (8300717,17), (8300715,0), (8300714,0))

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to reuse the result from spark stream? - apache-spark

Related

Update CoordinateMatrix entry

How to use scala strings in list-like pattern matching

How to find out the machine in the cluster which stores a given element in RDD and send a message to it?

Spark GraphX subgraph method generate null.

Scala split string and sort data

Categories

Resources