Suffix array beginning using scala - string

Today I am trying to create suffix arrays in Scala. I managed to do it with a large amount of code, but then I heard that it can be done in only a few lines using zipping and sorting.
The problem I have at the moment is with the first step. I tried using binary search and zipWithIndex to create the following "tree", but so far I haven't been able to produce anything. I don't even know if it is possible in a single line, but I bet it is lol.
What I want to get from the word "cheesecake" is a Seq:
Seq((cheesecake, 0),
(heesecake, 1),
(eesecake, 2),
(esecake, 3),
(secake, 4),
(ecake, 5),
(cake, 6),
(ake, 7),
(ke, 8),
(e, 9))
Could someone nudge me onto the correct path?

To generate all the possible suffixes of a String (or any other scala.collection.TraversableLike) you can simply use the tails method:
scala> "cheesecake".tails.toList
res25: List[String] = List(cheesecake, heesecake, eesecake, esecake, secake, ecake, cake, ake, ke, e, "")
If you need the indexes, then you can use GenIterable.zipWithIndex (note that tails also yields the empty string, which ends up at index 10):
scala> "cheesecake".tails.toList.zipWithIndex
res0: List[(String, Int)] = List((cheesecake,0), (heesecake,1), (eesecake,2), (esecake,3), (secake,4), (ecake,5), (cake,6), (ake,7), (ke,8), (e,9), ("",10))
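Since the question is ultimately about a suffix array, here is a minimal sketch of the remaining "sorting" step, assuming the plain lexicographic ordering of String is what you want. The names SuffixArrayDemo and suffixArray are made up for illustration, and this naive version is nowhere near the efficient construction algorithms; it just shows that tails, zipWithIndex and sortBy are enough.

```scala
object SuffixArrayDemo {
  // Naive suffix array: pair every non-empty suffix with its start index,
  // sort the pairs by suffix, and keep only the indices.
  def suffixArray(s: String): Vector[Int] =
    s.tails.zipWithIndex
      .filter { case (suffix, _) => suffix.nonEmpty } // drop the trailing ""
      .toVector
      .sortBy { case (suffix, _) => suffix }
      .map { case (_, index) => index }
}
```

For "cheesecake" this should give Vector(7, 6, 0, 9, 5, 2, 3, 1, 8, 4), i.e. ake, cake, cheesecake, e, ecake, eesecake, esecake, heesecake, ke, secake.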

You're looking for the .scan methods, specifically .scanRight, since you want to start building from the end (i.e. the right-hand side) of the string, prepending the next character each time (read your pyramid from bottom to top).
Quoting the documentation:
Produces a collection containing cumulative results of applying the
operator going right to left.
Here the operator is :
Prepend the current character
Decrement the counter (since your first element is "cheesecake".length, counting down)
So, with s = "cheesecake":
scala> s.scanRight (List[(String, Int)]())
{ case (char, (stringAcc, count)::tl) => (char + stringAcc, count-1)::tl
case (c, Nil) => List((c.toString, s.length-1))
}
.dropRight(1)
.map(_.head)
res12: scala.collection.immutable.IndexedSeq[(String, Int)] =
Vector((cheesecake,0),
(heesecake,1),
(eesecake,2),
(esecake,3),
(secake,4),
(ecake,5),
(cake,6),
(ake,7),
(ke,8),
(e,9)
)
The dropRight(1) at the end removes the List[(String, Int)]() (the first argument), which serves as the seed element on which to start building. You could instead pass the last e of your string and iterate on cheesecak, but I find it easier to do it this way.
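If you don't need to carry the counter through the scan, a shorter variant is possible. This is only a sketch under that assumption: it scans with an empty String as the seed, drops the seed at the end, and lets zipWithIndex supply the positions. ScanRightSuffixes and suffixesWithIndex are illustrative names, not anything from the standard library.

```scala
object ScanRightSuffixes {
  // scanRight builds "", "e", "ke", ... from right to left and returns them
  // in left-to-right order, so the seed "" ends up as the last element.
  def suffixesWithIndex(s: String): Vector[(String, Int)] =
    s.scanRight("")((c, acc) => s"$c$acc") // prepend the current character
      .toVector
      .init                                // drop the empty-string seed
      .zipWithIndex
}
```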

One approach,
"cheesecake".reverse.inits.map(_.reverse).zipWithIndex.toArray
Scala strings are equipped with ordered-collection methods such as reverse and inits; the latter delivers a collection of strings in which each string has dropped the last character of the previous one.
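As a point of comparison, the same pairs can be produced without reversing anything by walking the indices directly. A rough sketch, with SuffixesByIndex and suffixesWithIndex as made-up names; unlike the inits version it does not include the empty suffix.

```scala
object SuffixesByIndex {
  // For every start position i, take the substring from i to the end.
  def suffixesWithIndex(s: String): Vector[(String, Int)] =
    s.indices.map(i => (s.substring(i), i)).toVector
}
```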

EDIT - From a previous suffix question that I posted (from a Purely Functional Data Structures exercise), I believe that suffix should/may include the empty list too, i.e. "" for String.
scala> def suffix(x: String): List[String] = x.toList match {
| case Nil => Nil
| case xxs @ (_ :: xs) => xxs.mkString :: suffix(xs.mkString)
| }
suffix: (x: String)List[String]
scala> def f(x: String): List[(String, Int)] = suffix(x).zipWithIndex
f: (x: String)List[(String, Int)]
Test
scala> f("cheesecake")
res10: List[(String, Int)] = List((cheesecake,0), (heesecake,1), (eesecake,2),
(esecake,3), (secake,4), (ecake,5), (cake,6), (ake,7), (ke,8), (e,9))
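Note that the definition above stops at Nil, so it does not actually return "". If you do want the empty suffix included, something along these lines might do; it is a sketch with an illustrative name (SuffixesWithEmpty), written tail-recursively so long strings don't blow the stack.

```scala
import scala.annotation.tailrec

object SuffixesWithEmpty {
  // Collect suffixes from longest to shortest, finishing with "".
  def suffixes(x: String): List[String] = {
    @tailrec
    def loop(rest: String, acc: List[String]): List[String] =
      if (rest.isEmpty) ("" :: acc).reverse
      else loop(rest.substring(1), rest :: acc)
    loop(x, Nil)
  }
}
```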

Related

takeRightWhile() method in scala

I might be missing something but recently I came across a task to get last symbols according to some condition. For example I have a string: "this_is_separated_values_5". Now I want to extract 5 as Int.
Note: number of parts separated by _ is not defined.
If I would have a method takeRightWhile(f: Char => Boolean) on a string it would be trivial: takeRightWhile(ch => ch != '_'). Moreover it would be efficient: a straightforward implementation would actually involve finding the last index of _ and taking a substring while the use of this method would save first step and provide better average time complexity.
UPDATE: Guys, all the variations of str.reverse.takeWhile(_!='_').reverse are quite inefficient, as you actually use additional O(n) space. If you want to implement the method takeRightWhile efficiently, you could iterate starting from the right, accumulating the result in a string builder or whatever else, and return the result. I am asking about this kind of method, not the implementation which was already described and declined in the question itself.
Question: Does this kind of method exist in the Scala standard library? If not, is there a combination of standard library methods that achieves the same in a minimal number of lines?
Thanks in advance.
Possible solution:
str.reverse.takeWhile(_!='_').reverse
Update
You can go from right to left with following expression using foldRight:
str.toList.foldRight(List.empty[Char]) {
case (item, acc) => item::acc
}
Here you need to check condition and stop adding items after condition met. For this you can pass a flag to accumulated value:
val (_, list) = str.toList.foldRight((false, List.empty[Char])) {
case (item, (false, list)) if item!='_' => (false, item::list)
case (_, (_, list)) => (true, list)
}
val res = list.mkString.toInt
This solution is even more inefficient than the solution with the double reverse:
The implementation of foldRight uses a combination of List reverse and foldLeft.
You cannot break out of foldRight, so you need a flag to skip all items after the condition is met.
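For what the question's update describes (walk from the right, stop at the first failing character), a hand-rolled helper is probably the simplest route. The following is a sketch only: takeRightWhile here is a hypothetical helper in a made-up TakeRight object, not a standard library method. It moves an index leftwards and takes one substring at the end, so no reversed copy or intermediate list is built.

```scala
object TakeRight {
  // Longest suffix of s whose characters all satisfy p.
  def takeRightWhile(s: String)(p: Char => Boolean): String = {
    var i = s.length
    // Move left while the character just before position i satisfies p.
    while (i > 0 && p(s.charAt(i - 1))) i -= 1
    s.substring(i)
  }
}
```

With the string from the question, TakeRight.takeRightWhile("this_is_separated_values_5")(_ != '_').toInt gives 5.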
I'd go with this:
val s = "string_with_following_number_42"
s.split("_").reverse.head
// res:String = 42
This is a naive attempt and by no means optimized. It splits the String into an Array of Strings, reverses it, and takes the first element. Note that, because the reversing happens after the splitting, the order of the characters is correct.
I am not exactly sure about the problem you are facing. My understanding is that you have a string of the format xxx_xxx_xx_...._xxx_123 and you want to extract the part at the end as an Int.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val yourInt = yourStr.split('_').last.toInt
// But remember that the above is unsafe so you may want to take it as Option
val yourIntOpt = Try(yourStr.split('_').last.toInt).toOption
Or... let's say your requirement is to collect a right-hand suffix for as long as some boolean condition holds.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val rightSuffix = yourStr.reverse.takeWhile(c => c != '_').reverse
val yourInt = rightSuffix.toInt
// but above is unsafe so
val yourIntOpt = Try(rightSuffix.toInt).toOption
Comment if your requirement is different from this.
You can use StringBuilder and lastIndexWhere.
val str = "this_is_separated_values_5"
val sb = new StringBuilder(str)
// index of the last '_' (-1 if there is none)
val lastIdx = sb.lastIndexWhere(ch => ch == '_')
// everything after the last '_'
val last = str.substring(lastIdx + 1)

spark scala most efficient way to do partial string count

I have a question about the most efficient way to do a partial string match in a spark RDD (or scala Array) of 10 million length. Consider the following:
val set1 = Array("star wars", "ipad") //These are the String I am looking for
val set2 = RDD[("user1", "star wars 7 is coming out"),
("user1", "where to watch star wars"),
("user2", "star wars"),
("user2", "cheap ipad")]
I want to be able to count the number of occurrences of each string that belongs in Set1 that also occurs in Set2. So the result should be something like:
Result = ("star wars", 3),("ipad", 1)
I also want to count the number of users (i.e. distinct users) who have searched for the term, so the result should be:
Result = ("star wars", 2), ("ipad", 1)
I had a try at 2 methods, the first involves converting the RDD string to set, flatMapValues and then doing a join operation, but it is memory consuming. The other method I was considering is a regex approach, as only the count is needed and the exact string is given, but I don't know how to make it efficient (by making a function and calling it when I map the RDD?)
I seem to be able to do this quite easily in pgsql using LIKE, but not sure if there is a RDD join that works the same way.
Any help would be greatly appreciated.
So as advised by Yijie Shen you could use regular expressions:
val regex = set1.mkString("(", "|", ")").r
val results = rdd.flatMap {
case (user, str) => regex.findAllIn(str).map(user -> _)
}
val count = results.map(_._2).countByValue()
val byUser = results.distinct().map(_._2).countByValue()
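To see what those two counts come out as, here is the same logic on plain Scala collections instead of an RDD, so it can be run without Spark. This is only a sketch: the names PartialCount, terms and searches are made up, groupBy plus size stands in for countByValue, and the Pattern.quote call is my own addition in case a search term contains regex metacharacters.

```scala
import java.util.regex.Pattern

object PartialCount {
  val terms = Seq("star wars", "ipad")
  val searches = Seq(
    "user1" -> "star wars 7 is coming out",
    "user1" -> "where to watch star wars",
    "user2" -> "star wars",
    "user2" -> "cheap ipad"
  )

  // One alternation over all terms, each quoted so it is matched literally.
  val regex = terms.map(t => Pattern.quote(t)).mkString("|").r

  // One (user, term) pair per match found in a search string.
  val hits: Seq[(String, String)] = searches.flatMap {
    case (user, text) => regex.findAllIn(text).map(user -> _)
  }

  // Total occurrences per term.
  val occurrences: Map[String, Int] =
    hits.groupBy(_._2).map { case (term, hs) => term -> hs.size }

  // Distinct users per term: deduplicate (user, term) pairs first.
  val distinctUsers: Map[String, Int] =
    hits.distinct.groupBy(_._2).map { case (term, hs) => term -> hs.size }
}
```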

How to find maximum overlap between two strings in Scala?

Suppose I have two strings: s and t. I need to write a function f that finds the longest prefix of t that is also a suffix of s. For example:
s = "abcxyz", t = "xyz123", f(s, t) = "xyz"
s = "abcxxx", t = "xx1234", f(s, t) = "xx"
How would you write it in Scala ?
This first solution is easily the most concise; it is also more efficient than a recursive version, as it uses lazily evaluated iteration:
s.tails.find(t.startsWith).get
Now there has been some discussion regarding whether tails would end up copying the whole string over and over. If that is a concern, you could call toList on s and then mkString the result.
s.toList.tails.find(t.startsWith(_: List[Char])).get.mkString
For some reason the type annotation is required to get it to compile. I haven't actually tested which one is faster.
UPDATE - OPTIMIZATION
As som-snytt pointed out, t cannot start with any string that is longer than it, and therefore we could make the following optimization:
s.drop(s.length - t.length).tails.find(t.startsWith).get
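If the copying done by tails is a worry, the same search can be done on indices alone. A sketch under that assumption, with Overlap and maxOverlap as made-up names: it uses java.lang.String.regionMatches to compare in place and only allocates the final substring. The search cannot fail, because the empty suffix at i == s.length always matches.

```scala
object Overlap {
  // Longest suffix of s that is also a prefix of t.
  def maxOverlap(s: String, t: String): String = {
    // A suffix longer than t can never be a prefix of t.
    val start = math.max(0, s.length - t.length)
    val from = (start to s.length)
      .find(i => s.regionMatches(i, t, 0, s.length - i))
      .getOrElse(s.length)
    s.substring(from)
  }
}
```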
Efficient, this is not, but it is a neat (IMO) one-liner.
val s = "abcxyz"
val t ="xyz123"
(s.tails.toSet intersect t.inits.toSet).maxBy(_.size)
//res8: String = xyz
(take all the suffixes of s that are also prefixes of t, and pick the longest)
If we only need to find the common overlapping part, then we can recursively take the tail of the first string (which should overlap with the beginning of the second string) until the remaining part is something the second string begins with. This also covers the case when the strings have no overlap, because then the empty string will be returned.
scala> def findOverlap(s:String, t:String):String = {
if (s == t.take(s.size)) s else findOverlap (s.tail, t)
}
findOverlap: (s: String, t: String)String
scala> findOverlap("abcxyz", "xyz123")
res3: String = xyz
scala> findOverlap("one","two")
res1: String = ""
UPDATE: It was pointed out that tail might not be implemented in the most efficient way (i.e. it creates a new string each time it is called). If that becomes an issue, then using substring(1) instead of tail (or converting both Strings to Lists, where tail / head should have O(1) complexity) might give better performance. By the same token, we can replace t.take(s.size) with t.substring(0,s.size).

List[String] -> Vector[Vector[Char]]

I am trying to convert a list of strings to a vector of char vectors:
import collection.breakOut
def stringsToCharVectors(xs: List[String]) =
xs.map(stringToCharVector)(breakOut) : Vector[Vector[Char]]
def stringToCharVector(x: String) =
x.map(a => a)(breakOut) : Vector[Char]
Is there a way to implement stringToCharVector that does not involve mapping with the identity function? Generally, are there shorter/better ways to implement stringsToCharVectors?
You can pass a String directly to the varargs constructor for Vector:
def stringToCharVector(x: String) = Vector(x: _*)
at which point having a separate method seems kind of silly. breakOut is for optimization; if you just want to convert, you can write
Vector(xs.map(x => Vector(x: _*)): _*)
at the relatively modest expense of one extra object per list element. (All the chars will most likely be the memory-intensive part.)
In Scala 2.10:
scala> val xs = List("hello")
xs: List[String] = List(hello)
scala> xs.map(_.to[Vector]).to[Vector]
res0: Vector[Vector[Char]] = Vector(Vector(h, e, l, l, o))
The other way is just to add all the elements to an empty Vector; this is what happens behind the scenes anyway when you call a conversion method:
def stringsToCharVectors(xs: List[String]) =
Vector() ++ xs.map(Vector() ++ _)
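For readers on Scala 2.13 or later: breakOut was removed there and .to[Vector] became .to(Vector), so the snippets above won't all compile as written. A minimal sketch that should work across versions uses toVector; CharVectors is just an illustrative wrapper name.

```scala
object CharVectors {
  // Convert each String to a Vector[Char], then the outer List to a Vector.
  def stringsToCharVectors(xs: List[String]): Vector[Vector[Char]] =
    xs.map(_.toVector).toVector
}
```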

Better String formatting in Scala

With too many arguments, String.format easily gets too confusing. Is there a more powerful way to format a String? Like so:
"This is #{number} string".format("number" -> 1)
Or is this not possible because of type issues (format would need to take a Map[String, Any], I assume; don’t know if this would make things worse).
Or is the better way doing it like this:
val number = 1
<plain>This is { number } string</plain> text
even though it pollutes the name space?
Edit:
While a simple pimping might do in many cases, I’m also looking for something going in the same direction as Python’s format() (See: http://docs.python.org/release/3.1.2/library/string.html#formatstrings)
In Scala 2.10 you can use string interpolation.
val height = 1.9d
val name = "James"
println(f"$name%s is $height%2.2f meters tall") // James is 1.90 meters tall
Well, if your only problem is making the order of the parameters more flexible, this can be easily done:
scala> "%d %d" format (1, 2)
res0: String = 1 2
scala> "%2$d %1$d" format (1, 2)
res1: String = 2 1
And there's also regex replacement with the help of a map:
scala> val map = Map("number" -> 1)
map: scala.collection.immutable.Map[java.lang.String,Int] = Map((number,1))
scala> val getGroup = (_: scala.util.matching.Regex.Match) group 1
getGroup: (util.matching.Regex.Match) => String = <function1>
scala> val pf = getGroup andThen map.lift andThen (_ map (_.toString))
pf: (util.matching.Regex.Match) => Option[java.lang.String] = <function1>
scala> val pat = "#\\{([^}]*)\\}".r
pat: scala.util.matching.Regex = #\{([^}]*)\}
scala> pat replaceSomeIn ("This is #{number} string", pf)
res43: String = This is 1 string
You can easily implement a richer formatting yourself (with the "enhance my library" approach):
scala> implicit def RichFormatter(string: String) = new {
| def richFormat(replacement: Map[String, Any]) =
| (string /: replacement) {(res, entry) => res.replaceAll("#\\{%s\\}".format(entry._1), entry._2.toString)}
| }
RichFormatter: (string: String)java.lang.Object{def richFormat(replacement: Map[String,Any]): String}
scala> "This is #{number} string" richFormat Map("number" -> 1)
res43: String = This is 1 string
Or on more recent Scala versions since the original answer:
implicit class RichFormatter(string: String) {
def richFormat(replacement: Map[String, Any]): String =
replacement.foldLeft(string) { (res, entry) =>
res.replaceAll("#\\{%s\\}".format(entry._1), entry._2.toString)
}
}
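One caveat with the replaceAll-based versions: a value containing $ or \ is interpreted as a replacement pattern and can throw or produce odd output. A possible variant, sketched here with made-up names (NamedFormat, format), makes a single pass with replaceAllIn and escapes each value with Regex.quoteReplacement. Leaving unknown placeholders untouched is my own choice, not something the original answers specify.

```scala
import scala.util.matching.Regex

object NamedFormat {
  // Matches #{name} and captures the name.
  private val Placeholder = """#\{([^}]*)\}""".r

  def format(template: String, values: Map[String, Any]): String =
    Placeholder.replaceAllIn(template, m => {
      // Look up the captured name; keep the placeholder text if it is unknown.
      val replacement = values.get(m.group(1)).map(_.toString).getOrElse(m.matched)
      // Escape $ and \ so the value is inserted literally.
      Regex.quoteReplacement(replacement)
    })
}
```

For example, NamedFormat.format("This is #{number} string", Map("number" -> 1)) gives "This is 1 string".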
Maybe the Scala-Enhanced-Strings-Plugin can help you. Look here:
Scala-Enhanced-Strings-Plugin Documentation
This is the answer I came here looking for:
"This is %s string".format(1)
If you're using 2.10 then go with built-in interpolation. Otherwise, if you don't care about extreme performance and are not afraid of functional one-liners, you can use a fold + several regexp scans:
val template = "Hello #{name}!"
val replacements = Map( "name" -> "Aldo" )
replacements.foldLeft(template)((s:String, x:(String,String)) => ( "#\\{" + x._1 + "\\}" ).r.replaceAllIn( s, x._2 ))
You might also consider using a template engine for really complex and long strings. Off the top of my head, there is Scalate, which implements, among others, the Mustache template engine.
Might be overkill and performance loss for simple strings, but you seem to be in that area where they start becoming real templates.

Resources