spark scala most efficient way to do partial string count

I have a question about the most efficient way to do a partial string match in a Spark RDD (or Scala Array) of length 10 million. Consider the following:
val set1 = Array("star wars", "ipad") // These are the strings I am looking for
val set2 = RDD[("user1", "star wars 7 is coming out"),
("user1", "where to watch star wars"),
("user2", "star wars"),
("user2", "cheap ipad")]
I want to be able to count the number of occurrences of each string in set1 that also occurs in set2. So the result should be something like:
Result = ("star wars", 3),("ipad", 1)
I also want to count the number of users (i.e. distinct users) who have searched for the term, so the result should be:
Result = ("star wars", 2), ("ipad", 1)
I have tried two methods. The first involves converting the RDD strings to sets, using flatMapValues and then doing a join operation, but it is memory-intensive. The other method I was considering is a regex approach, since only the count is needed and the exact strings are given, but I don't know how to make it efficient (by making a function and calling it when I map the RDD?)
I seem to be able to do this quite easily in pgsql using LIKE, but I'm not sure if there is an RDD join that works the same way.
Any help would be greatly appreciated.

So as advised by Yijie Shen you could use regular expressions:
val regex = set1.mkString("(", "|", ")").r
val results = rdd.flatMap {
  case (user, str) => regex.findAllIn(str).map(user -> _)
}
val count = results.map(_._2).countByValue()
val byUser = results.distinct().map(_._2).countByValue()
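For completeness, here is a minimal end-to-end sketch using the question's sample data (the SparkContext setup is assumed; if the search terms could contain regex metacharacters, quote them first, e.g. with java.util.regex.Pattern.quote):

val set1 = Array("star wars", "ipad")
val rdd = sc.parallelize(Seq(
  ("user1", "star wars 7 is coming out"),
  ("user1", "where to watch star wars"),
  ("user2", "star wars"),
  ("user2", "cheap ipad")))

val regex = set1.mkString("(", "|", ")").r

// Pair every match with the user who produced it.
val results = rdd.flatMap { case (user, str) => regex.findAllIn(str).map(user -> _) }

results.map(_._2).countByValue()            // Map(star wars -> 3, ipad -> 1)
results.distinct().map(_._2).countByValue() // Map(star wars -> 2, ipad -> 1)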

Related

spark parallelize(List(1,2,3,4),2) always partition the list in order?

I've run the code below and the result is 37.
val z = sc.parallelize(List(1,2,7,4,30,6), 2)
z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 37
It seems that Spark partitions the list into two lists: [1,2,7] and [4,30,6].
Then I changed the order of 7 and 4 in the list and I got 34.
scala> val z = sc.parallelize(List(1,2,4,7,30,6), 2)
z: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> z.aggregate(0)(math.max(_, _), _ + _)
res11: Int = 34
What I want to know is if spark always keeps the order of the elements in the list when partitioning?
Thanks!
There are two different concepts here.
The order of items, which is preserved when using parallelize and when applying transformations that don't require shuffling.
The order of items during aggregation, which is not preserved and is non-deterministic. While each partition is aggregated sequentially, the order in which the partial results are merged is arbitrary.
In general, don't depend on the order of values and operations unless you enforce it explicitly (for example by sorting) or you know exactly what you're doing.
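If you want to see how parallelize sliced the list (and confirm that the slicing preserves the original order), a small sketch using glom, which turns each partition into an array:

val z = sc.parallelize(List(1, 2, 7, 4, 30, 6), 2)

// The partition boundaries become visible: Array(Array(1, 2, 7), Array(4, 30, 6))
z.glom().collect()

// max per partition (7 and 30), then combined: 37.
// The order in which the partial results are merged is still arbitrary.
z.aggregate(0)(math.max(_, _), _ + _)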

takeRightWhile() method in scala

I might be missing something, but recently I came across a task to get the last symbols of a string according to some condition. For example I have a string: "this_is_separated_values_5". Now I want to extract 5 as an Int.
Note: number of parts separated by _ is not defined.
If I had a method takeRightWhile(f: Char => Boolean) on a string it would be trivial: takeRightWhile(ch => ch != '_'). Moreover it would be efficient: a straightforward implementation would actually involve finding the last index of _ and taking a substring, while the use of this method would save the first step and provide better average time complexity.
UPDATE: Guys, all the variations of str.reverse.takeWhile(_!='_').reverse are quite inefficient as you actually use additional O(n) space. If you want to implement the method takeRightWhile efficiently you could iterate starting from the right, accumulating the result in a string builder or whatever else, and returning the result. I am asking about this kind of method, not the implementation which was already described and dismissed in the question itself.
Question: Does this kind of method exist in the Scala standard library? If not, is there a combination of methods from the standard library that achieves the same in a minimal number of lines?
Thanks in advance.
Possible solution:
str.reverse.takeWhile(_!='_').reverse
Update
You can go from right to left with the following expression using foldRight:
str.toList.foldRight(List.empty[Char]) {
  case (item, acc) => item :: acc
}
Here you need to check the condition and stop adding items after the condition is met. For this you can carry a flag in the accumulated value:
val (_, list) = str.toList.foldRight((false, List.empty[Char])) {
  case (item, (false, list)) if item != '_' => (false, item :: list)
  case (_, (_, list)) => (true, list)
}
val res = list.mkString.toInt
This solution is even less efficient than the solution with the double reverse:
The implementation of foldRight uses a combination of List reverse and foldLeft
You cannot break out of foldRight, so you need the flag to skip all items after the condition is met
I'd go with this:
val s = "string_with_following_number_42"
s.split("_").reverse.head
// res:String = 42
This is a naive attempt and by no means optimized. What it does is split the String into an Array of Strings, reverse it and take the first element. Note that, because the reversing happens after the splitting, the order of the characters is correct.
I am not exactly sure about the problem you are facing. My understanding is that you have a string of the format xxx_xxx_xx_...._xxx_123 and you want to extract the part at the end as an Int.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val yourInt = yourStr.split('_').last.toInt
// But remember that the above is unsafe so you may want to take it as Option
val yourIntOpt = Try(yourStr.split('_').last.toInt).toOption
Or... let's say your requirement is to collect a right suffix while some boolean condition remains true.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val rightSuffix = yourStr.reverse.takeWhile(c => c != '_').reverse
val yourInt = rightSuffix.toInt
// but above is unsafe so
val yourIntOpt = Try(rightSuffix.toInt).toOption
Comment if your requirement is different from this.
You can use lastIndexWhere (available on strings via StringOps, so no StringBuilder is needed) together with substring:
val str = "this_is_separated_values_5"
val lastSep = str.lastIndexWhere(ch => ch == '_')
val suffix = str.substring(lastSep + 1) // "5"
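Coming back to the question's own suggestion of iterating from the right and accumulating: here is a minimal sketch of such a takeRightWhile extension (it does not exist in the standard library; the names are purely illustrative):

implicit class TakeRightWhileOps(s: String) {
  // Scan from the right while the predicate holds, then take that suffix with a single substring call.
  def takeRightWhile(p: Char => Boolean): String = {
    var i = s.length
    while (i > 0 && p(s.charAt(i - 1))) i -= 1
    s.substring(i)
  }
}

"this_is_separated_values_5".takeRightWhile(_ != '_')       // "5"
"this_is_separated_values_5".takeRightWhile(_ != '_').toInt // 5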

How do I split an RDD into two or more RDDs?

I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD.
If you're familiar with SAS, something like this:
data work.split1, work.split2;
set work.preSplit;
if (condition1)
output work.split1
else if (condition2)
output work.split2
run;
which resulted in two distinct data sets. It would have to be immediately persisted to get the results I intend...
It is not possible to yield multiple RDDs from a single transformation*. If you want to split a RDD you have to apply a filter for each split condition. For example:
def even(x): return x % 2 == 0
def odd(x): return not even(x)
rdd = sc.parallelize(range(20))
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
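The snippets in this answer are PySpark; the same repeated-filter idea in Scala might look like this (a sketch):

def even(x: Int): Boolean = x % 2 == 0
def odd(x: Int): Boolean = !even(x)

val rdd = sc.parallelize(0 until 20)

// One filter per split condition; each call produces its own RDD.
val rddOdd  = rdd.filter(odd)
val rddEven = rdd.filter(even)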
If you have only a binary condition and computation is expensive you may prefer something like this:
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
It means only a single predicate computation but requires an additional pass over all the data.
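In Scala, reusing the definitions from the sketch above, the cached-flag variant might look like this:

// Compute the (possibly expensive) predicate once, cache the result, then split on the flag.
val kvRdd = rdd.map(x => (x, odd(x)))
kvRdd.cache()

val rddOdd2  = kvRdd.filter { case (_, isOdd) => isOdd }.keys
val rddEven2 = kvRdd.filter { case (_, isOdd) => !isOdd }.keys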
It is important to note that, as long as the input RDD is properly cached and there are no additional assumptions regarding the data distribution, there is no significant difference in time complexity between a repeated filter and a for-loop with nested if-else.
With N elements and M conditions, the number of operations you have to perform is clearly proportional to N times M. In the case of the for-loop it should be closer to (N + MN) / 2, and the repeated filter is exactly NM, but at the end of the day it is nothing other than O(NM). You can see my discussion** with Jason Lenderman to read about some pros and cons.
At a very high level you should consider two things:
Spark transformations are lazy; until you execute an action your RDD is not materialized.
Why does it matter? Going back to my example:
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If later I decide that I need only rdd_odd then there is no reason to materialize rdd_even.
If you take a look at your SAS example, to compute work.split2 you need to materialize both the input data and work.split1.
RDDs provide a declarative API. When you use filter or map it is completely up to the Spark engine how the operation is performed. As long as the functions passed to transformations are side-effect free, it creates multiple possibilities to optimize the whole pipeline.
At the end of the day this case is not special enough to justify its own transformation.
This map with filter pattern is actually used in core Spark. See my answer to How does Spark's RDD.randomSplit actually split the RDD and the relevant part of the randomSplit method.
If the only goal is to split the data when writing it out, it is possible to use the partitionBy clause of DataFrameWriter with the text output format:
def makePairs(row: T): (String, String) = ???

// toDF requires the SQL implicits in scope, e.g. import spark.implicits._
data
  .map(makePairs).toDF("key", "value")
  .write.partitionBy("key").format("text").save(...)
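Since partitionBy writes each key value into its own key=... subdirectory, an individual split can later be read back on its own; a small sketch, with the output path purely hypothetical:

// Hypothetical path; the key=... directory layout comes from partitionBy("key") above.
val fooSplit = spark.read.format("text").load("/some/output/path/key=foo")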
* There are only 3 basic types of transformations in Spark:
RDD[T] => RDD[T]
RDD[T] => RDD[U]
(RDD[T], RDD[U]) => RDD[W]
where T, U, W can be either atomic types or products / tuples (K, V). Any other operation has to be expressed using some combination of the above. You can check the original RDD paper for more details.
** https://chat.stackoverflow.com/rooms/91928/discussion-between-zero323-and-jason-lenderman
*** See also Scala Spark: Split collection into several RDD?
As other posters mentioned above, there is no single, native RDD transform that splits RDDs, but here are some "multiplex" operations that can efficiently emulate a wide variety of "splitting" on RDDs, without reading multiple times:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions
Some methods specific to random splitting:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
These methods are available from the open-source silex project:
https://github.com/willb/silex
A blog post explaining how they work:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
def muxPartitions[U: ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[U],
    persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => Iterator.single(itr.next()(j)) } }
}

def flatMuxPartitions[U: ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[TraversableOnce[U]],
    persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => itr.next()(j).toIterator } }
}
As mentioned elsewhere, these methods do involve a trade-off of memory for speed, because they operate by computing entire partition results "eagerly" instead of "lazily." Therefore, it is possible for these methods to run into memory problems on large partitions, where more traditional lazy transforms will not.
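For illustration, here is a minimal self-contained sketch of the same single-pass idea without silex, splitting an RDD of Ints into evens and odds by materializing both splits per partition and persisting the intermediate result (all names here are made up for the example):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 20, 4)

// One pass over each partition: build both splits eagerly, then persist the pair.
val mux = rdd.mapPartitionsWithIndex { case (_, itr) =>
  val (evens, odds) = itr.toVector.partition(_ % 2 == 0)
  Iterator.single(Seq(evens, odds))
}.persist(StorageLevel.MEMORY_ONLY)

// Each output RDD just projects its slot out of the persisted per-partition pair.
val Seq(evenRDD, oddRDD) = Vector.tabulate(2) { j =>
  mux.mapPartitions(itr => itr.next()(j).iterator)
}

As with the silex versions, both splits are computed eagerly per partition, so the memory caveat above applies here too.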
One way is to use a custom partitioner to partition the data depending upon your filter condition. This can be achieved by extending Partitioner and implementing something similar to the RangePartitioner.
mapPartitions can then be used to construct multiple RDDs from the partitioned RDD without reading all of the data.
import org.apache.spark.TaskContext

val filtered = partitioned.mapPartitions { iter =>
  new Iterator[Int]() {
    override def hasNext: Boolean = {
      if (rangeOfPartitionsToKeep.contains(TaskContext.get().partitionId)) {
        // this partition is one we want to keep, so expose its elements
        iter.hasNext
      } else {
        false
      }
    }
    override def next(): Int = iter.next()
  }
}
Just be aware that the number of partitions in the filtered RDDs will be the same as the number in the partitioned RDD so a coalesce should be used to reduce this down and remove the empty partitions.
If you split an RDD using the randomSplit API call, you get back an array of RDDs.
If you want 5 RDDs returned, pass in 5 weight values.
e.g.
val sourceRDD = sc.parallelize(1 to 100, 4)
val seedValue = 5
val splitRDD = sourceRDD.randomSplit(Array(1.0,1.0,1.0,1.0,1.0), seedValue)
splitRDD(1).collect()
res7: Array[Int] = Array(1, 6, 11, 12, 20, 29, 40, 62, 64, 75, 77, 83, 94, 96, 100)

How to find maximum overlap between two strings in Scala?

Suppose I have two strings: s and t. I need to write a function f to find a max. t prefix, which is also an s suffix. For example:
s = "abcxyz", t = "xyz123", f(s, t) = "xyz"
s = "abcxxx", t = "xx1234", f(s, t) = "xx"
How would you write it in Scala ?
This first solution is easily the most concise, and it's more efficient than a recursive version as it uses a lazily evaluated iteration:
s.tails.find(t.startsWith).get
Now there has been some discussion regarding whether tails would end up copying the whole string over and over, in which case you could use toList on s and then mkString the result.
s.toList.tails.find(t.startsWith(_: List[Char])).get.mkString
For some reason the type annotation is required to get it to compile. I've not actually tried to see which one is faster.
UPDATE - OPTIMIZATION
As som-snytt pointed out, t cannot start with any string that is longer than it, and therefore we could make the following optimization:
s.drop(s.length - t.length).tails.find(t.startsWith).get
Efficient, this is not, but it is a neat (IMO) one-liner.
val s = "abcxyz"
val t ="xyz123"
(s.tails.toSet intersect t.inits.toSet).maxBy(_.size)
//res8: String = xyz
(take all the suffixes of s that are also prefixes of t, and pick the longest)
If we only need to find the common overlapping part, then we can recursively take the tail of the first string (whose end should overlap with the beginning of the second string) until the remaining part is a prefix of the second string. This also covers the case when the strings have no overlap, because then the empty string will be returned.
scala> def findOverlap(s:String, t:String):String = {
if (s == t.take(s.size)) s else findOverlap (s.tail, t)
}
findOverlap: (s: String, t: String)String
scala> findOverlap("abcxyz", "xyz123")
res3: String = xyz
scala> findOverlap("one","two")
res1: String = ""
UPDATE: It was pointed out that tail might not be implemented in the most efficient way (i.e. it creates a new string when it is called). If that becomes an issue, then using substring(1) instead of tail (or converting both Strings to Lists, where tail/head have O(1) complexity) might give better performance. And by the same token, we can replace t.take(s.size) with t.substring(0, s.size).
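If allocation during the recursion is a concern, the same idea can be expressed with an index and regionMatches, so no intermediate strings are created until the final result (a sketch, including the length optimization som-snytt pointed out in the first answer):

import scala.annotation.tailrec

def findOverlap(s: String, t: String): String = {
  @tailrec
  def loop(i: Int): String =
    // Does t start with the suffix of s beginning at index i?
    if (t.regionMatches(0, s, i, s.length - i)) s.substring(i)
    else loop(i + 1)
  // t cannot start with a string longer than itself, so skip ahead.
  loop(math.max(0, s.length - t.length))
}

findOverlap("abcxyz", "xyz123") // "xyz"
findOverlap("one", "two")       // ""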

Better String formatting in Scala

With too many arguments, String.format easily gets confusing. Is there a more powerful way to format a String, like so:
"This is #{number} string".format("number" -> 1)
Or is this not possible because of type issues (format would need to take a Map[String, Any], I assume; don’t know if this would make things worse).
Or is it better to do it like this:
val number = 1
<plain>This is { number } string</plain> text
even though it pollutes the namespace?
Edit:
While a simple pimping might do in many cases, I’m also looking for something going in the same direction as Python’s format() (See: http://docs.python.org/release/3.1.2/library/string.html#formatstrings)
In Scala 2.10 you can use string interpolation.
val height = 1.9d
val name = "James"
println(f"$name%s is $height%2.2f meters tall") // James is 1.90 meters tall
Well, if your only problem is making the order of the parameters more flexible, this can be easily done:
scala> "%d %d" format (1, 2)
res0: String = 1 2
scala> "%2$d %1$d" format (1, 2)
res1: String = 2 1
And there's also regex replacement with the help of a map:
scala> val map = Map("number" -> 1)
map: scala.collection.immutable.Map[java.lang.String,Int] = Map((number,1))
scala> val getGroup = (_: scala.util.matching.Regex.Match) group 1
getGroup: (util.matching.Regex.Match) => String = <function1>
scala> val pf = getGroup andThen map.lift andThen (_ map (_.toString))
pf: (util.matching.Regex.Match) => Option[java.lang.String] = <function1>
scala> val pat = "#\\{([^}]*)\\}".r
pat: scala.util.matching.Regex = #\{([^}]*)\}
scala> pat replaceSomeIn ("This is #{number} string", pf)
res43: String = This is 1 string
You can easily implement a richer formatting yourself (with the "enhance my library" approach):
scala> implicit def RichFormatter(string: String) = new {
| def richFormat(replacement: Map[String, Any]) =
| (string /: replacement) {(res, entry) => res.replaceAll("#\\{%s\\}".format(entry._1), entry._2.toString)}
| }
RichFormatter: (string: String)java.lang.Object{def richFormat(replacement: Map[String,Any]): String}
scala> "This is #{number} string" richFormat Map("number" -> 1)
res43: String = This is 1 string
Or on more recent Scala versions since the original answer:
implicit class RichFormatter(string: String) {
def richFormat(replacement: Map[String, Any]): String =
replacement.foldLeft(string) { (res, entry) =>
res.replaceAll("#\\{%s\\}".format(entry._1), entry._2.toString)
}
}
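Usage with the newer implicit class is the same (a quick check, mirroring the REPL example above):

"This is #{number} string".richFormat(Map("number" -> 1)) // This is 1 string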
Maybe the Scala-Enhanced-Strings-Plugin can help you. Look here:
Scala-Enhanced-Strings-Plugin Documentation
This is the answer I came here looking for:
"This is %s string".format(1)
If you're using 2.10 then go with built-in interpolation. Otherwise, if you don't care about extreme performance and are not afraid of functional one-liners, you can use a fold + several regexp scans:
val template = "Hello #{name}!"
val replacements = Map( "name" -> "Aldo" )
replacements.foldLeft(template)((s:String, x:(String,String)) => ( "#\\{" + x._1 + "\\}" ).r.replaceAllIn( s, x._2 ))
You might also consider the use of a template engine for really complex and long strings. Off the top of my head there is Scalate, which implements, amongst others, the Mustache template engine.
It might be overkill and a performance loss for simple strings, but you seem to be in that area where they start becoming real templates.
