List[String] -> Vector[Vector[Char]] - string

I am trying to convert a list of strings to a vector of char vectors:
import collection.breakOut
def stringsToCharVectors(xs: List[String]) =
xs.map(stringToCharVector)(breakOut) : Vector[Vector[Char]]
def stringToCharVector(x: String) =
x.map(a => a)(breakOut) : Vector[Char]
Is there a way to implement stringToCharVector that does not involve mapping with the identity function? Generally, are there shorter/better ways to implement stringsToCharVectors?

You can pass a String directly to the varargs constructor for Vector:
def stringToCharVector(x: String) = Vector(x: _*)
at which point having a separate method seems kind of silly. breakOut is for optimization; if you just want to convert, you can
Vector(xs.map(x => Vector(x: _*)): _*)
at the relatively modest expense of one extra object per list element. (All the chars will most likely be the memory-intensive part.)

In Scala 2.10:
scala> val xs = List("hello")
xs: List[String] = List(hello)
scala> xs.map(_.to[Vector]).to[Vector]
res0: Vector[Vector[Char]] = Vector(Vector(h, e, l, l, o))

The other way is just to add all the elements to an empty Vector; this is what happens behind the scenes anyway when you call a conversion method:
def stringsToCharVectors(xs: List[String]) =
Vector() ++ xs.map(Vector() ++ _)
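Note that breakOut was removed in Scala 2.13, and to[Vector] is spelled to(Vector) there. A minimal sketch that should compile unchanged across versions uses toVector instead; going through an iterator is just one way to avoid building an intermediate List, and the name example is only for illustration:

```scala
// Sketch: convert each String to a Vector[Char], then collect into a Vector.
// xs.iterator avoids materialising an intermediate List[Vector[Char]].
def stringsToCharVectors(xs: List[String]): Vector[Vector[Char]] =
  xs.iterator.map(_.toVector).toVector

val example = stringsToCharVectors(List("hello", "hi"))
// example == Vector(Vector('h', 'e', 'l', 'l', 'o'), Vector('h', 'i'))
```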

Related

For comprehension parsing of optional string to int

Say I have the following for comprehension:
val validatedInput = for {
stringID <- parseToInt(optionalInputID)
} yield (stringID)
where optionalInputID is an input parameter of type Option[String]. I want to convert the Option[String] into a plain String, provided a value is actually present. As far as I'm aware, you cannot pattern match inside a for comprehension.
Some details have been omitted, such as the other items in the for comprehension. I would therefore like to know whether it's possible to do this inside the for comprehension. If not, what's a suitable alternative? Can I do it outside of the for comprehension?
Simply add it to the for comprehension:
val validatedInput = for {
inputID <- optionalInputID
stringID <- parseToInt(inputID)
} yield (stringID)
This works only if parseToInt returns an Option. If it returns a Try, it will not compile, because you can't mix Try and Option in the same for-comprehension.
If parseToInt returns a Try, you can convert it with toOption:
val validatedInput = for {
inputID <- optionalInputID
stringID <- parseToInt(inputID).toOption
} yield (stringID)
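A self-contained sketch of the Try variant, for trying it out. The body of parseToInt is an assumption (the question does not show it), and validate is a hypothetical wrapper name:

```scala
import scala.util.Try

// Assumed implementation: the original parseToInt is not shown in the question.
def parseToInt(s: String): Try[Int] = Try(s.trim.toInt)

// Both generators are Options, so the for-comprehension type-checks.
def validate(optionalInputID: Option[String]): Option[Int] =
  for {
    inputID  <- optionalInputID
    parsedID <- parseToInt(inputID).toOption
  } yield parsedID
```

A missing input and an unparsable input both end up as None, so use this shape only if you don't need to tell the two apart.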
I want to be able to convert an Option[String] into just a String.
Therefore I would like to know if it's possible to do this inside the for comprehension
In Scala, a for-comprehension desugars into a combination of map, flatMap and filter, none of which lets you extract the value out of the Option.
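To make the desugaring concrete, here is a rough sketch of the two equivalent forms. parseOpt is a hypothetical parser that returns an Option, and the translation shown is approximately, not literally, what the compiler emits:

```scala
import scala.util.Try

// Hypothetical parser, assumed to return an Option.
def parseOpt(s: String): Option[Int] = Try(s.toInt).toOption

// For-comprehension form.
def viaFor(in: Option[String]): Option[Int] =
  for {
    id <- in
    n  <- parseOpt(id)
  } yield n

// Roughly the desugared form: every generator but the last becomes flatMap,
// the last one becomes map. The result is still wrapped in an Option.
def viaFlatMap(in: Option[String]): Option[Int] =
  in.flatMap(id => parseOpt(id).map(n => n))
```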
If not, then what's a suitable alternative? Can I do it outside of the for comprehension?
To do so you can use get (unsafe: it throws on None), its safer counterpart getOrElse, or fold:
val validatedInput: Option[String] = Some("myString")
scala>validatedInput.get
// res1: String = "myString"
scala>validatedInput.getOrElse("empty")
// res2: String = "myString"
scala>validatedInput.fold("empty")(identity)
// res3: String = "myString"

takeRightWhile() method in scala

I might be missing something but recently I came across a task to get last symbols according to some condition. For example I have a string: "this_is_separated_values_5". Now I want to extract 5 as Int.
Note: number of parts separated by _ is not defined.
If I would have a method takeRightWhile(f: Char => Boolean) on a string it would be trivial: takeRightWhile(ch => ch != '_'). Moreover it would be efficient: a straightforward implementation would actually involve finding the last index of _ and taking a substring while the use of this method would save first step and provide better average time complexity.
UPDATE: Guys, all the variations of str.reverse.takeWhile(_!='_').reverse are quite inefficient, as they use additional O(n) space. To implement takeRightWhile efficiently you could iterate starting from the right, accumulating the result in a string builder or whatever else, and return the result. I am asking about this kind of method, not the implementation that was already described and rejected in the question itself.
Question: Does this kind of method exist in scala standard library? If no, is there method combination from the standard library to achieve the same in minimum amount of lines?
Thanks in advance.
Possible solution:
str.reverse.takeWhile(_!='_').reverse
Update
You can go from right to left with following expression using foldRight:
str.toList.foldRight(List.empty[Char]) {
case (item, acc) => item::acc
}
Here you need to check condition and stop adding items after condition met. For this you can pass a flag to accumulated value:
val (_, list) = str.toList.foldRight((false, List.empty[Char])) {
case (item, (false, list)) if item!='_' => (false, item::list)
case (_, (_, list)) => (true, list)
}
val res = list.mkString.toInt
This solution is even less efficient than the solution with the double reverse:
The implementation of foldRight uses a combination of List reverse and foldLeft.
You cannot break out of foldRight early, so you need a flag to skip all items once the condition is met.
I'd go with this:
val s = "string_with_following_number_42"
s.split("_").reverse.head
// res:String = 42
This is a naive attempt and by no means optimized. It splits the String into an Array of Strings, reverses the array, and takes the first element. Note that, because the reversing happens after the splitting, the order of the characters within each part is preserved.
I am not exactly sure about the problem you are facing. My understanding is that you have a string of the format xxx_xxx_xx_...._xxx_123 and you want to extract the part at the end as an Int.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val yourInt = yourStr.split('_').last.toInt
// But remember that the above is unsafe so you may want to take it as Option
val yourIntOpt = Try(yourStr.split('_').last.toInt).toOption
Or... let's say your requirement is to collect the right-hand suffix for as long as some boolean condition remains true.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val rightSuffix = yourStr.reverse.takeWhile(c => c != '_').reverse
val yourInt = rightSuffix.toInt
// but above is unsafe so
val yourIntOpt = Try(rightSuffix.toInt).toOption
Comment if your requirement is different from this.
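You can use StringBuilder and lastIndexWhere to find the last separator, then take everything after it.
val str = "this_is_separated_values_5"
val sb = new StringBuilder(str)
// index of the last '_' (-1 if there is none)
val lastIdx = sb.lastIndexWhere(ch => ch == '_')
// everything after the last '_'; the whole string if there is no '_'
val lastPart = str.substring(lastIdx + 1)
val number = lastPart.toInt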
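For the efficiency concern raised in the question, a right-to-left scan can be sketched by hand. takeRightWhile below is a hypothetical helper written for this example, not a standard library method. It walks backwards from the end and allocates only the resulting substring:

```scala
// Hypothetical helper: returns the longest suffix whose characters all satisfy p.
def takeRightWhile(s: String)(p: Char => Boolean): String = {
  var i = s.length
  // Move i left while the character just before it still satisfies p.
  while (i > 0 && p(s.charAt(i - 1))) i -= 1
  s.substring(i)
}

val lastPart = takeRightWhile("this_is_separated_values_5")(_ != '_')
// lastPart == "5"
```

The scan stops at the first character from the right that fails the predicate, so the cost is proportional to the length of the suffix rather than the whole string.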

Reading and learning Spark API?

I am learning Spark by example, but I don't know a good way to read and understand the API. For instance, take the classic word count example:
val input = sc.textFile("README.md")
val words = input.flatMap(x => x.split(" "))
val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
When I read the reduceByKey API, I see:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
The API states: Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
In the programming guide: When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
OK, from the example I know that (x, y) is (V, V), and that V is the value part of each pair. I supply a function that computes a V and I get back an RDD[(K, V)]. My question is: in reduceByKey(func: (V, V) ⇒ V), why are there two Vs? Are the first and second V in (V, V) the same thing or not?
I suspect I am asking this, and chose this question title, because I still don't know how to read the API correctly, or because I am missing some basic Spark concept.
in the code below:
reduceByKey((x, y) => x + y)
you could read for more clarity, something like this:
reduceByKey((sum, addend) => sum + addend)
So, for every key, the function is applied repeatedly across all the values that share that key.
Basically, (func: (V, V) ⇒ V) means that you have a function that takes two inputs of a certain type (let's say Int) and returns a single output of the same type.
Usually the data set will be of the form ("key1",val11),("key2",val21),("key1",val12),("key2",val22) and so on.
The same key will appear with multiple values in the RDD[(K,V)].
When you use reduceByKey, the function is applied to the values of each key, two at a time, until a single value per key remains.
For example consider the following program
val data = Array(("key1",2),("key1",20),("key2",21),("key1",2),("key2",10),("key2",33))
val rdd = sc.parallelize(data)
val res = rdd.reduceByKey((x,y) => x+y)
res.foreach(println)
You will get the output as
(key2,64)
(key1,24)
Here the values for each key are passed pairwise to the function. For key1 the values are (2, 20, 2), so the function computes (2 + 20) + 2 = 24.
In the end, you will have a single value for each key.
You could use spark shell to try out the APIs.
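If you don't have a Spark shell at hand, the meaning of (V, V) ⇒ V can be explored with plain Scala collections. The sketch below is only a local model of the semantics: reduceByKeyLocal is a hypothetical helper, not Spark API, and it ignores partitioning and map-side combining entirely:

```scala
// Hypothetical local model of reduceByKey: group the values by key, then fold
// each group pairwise with func. Both arguments of func have the value type V:
// one is the running result, the other is the next value for the same key.
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }

val data = Seq(("key1", 2), ("key1", 20), ("key2", 21), ("key1", 2), ("key2", 10), ("key2", 33))

val sums  = reduceByKeyLocal(data)((x, y) => x + y)
val maxes = reduceByKeyLocal(data)((x, y) => math.max(x, y))
// sums  == Map("key1" -> 24, "key2" -> 64)
// maxes == Map("key1" -> 20, "key2" -> 33)
```

Swapping the function from addition to max shows that the two V parameters are simply two values of the same type belonging to the same key.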

Suffix array beginning using scala

Today I am trying to create suffix arrays using Scala. I was able to do it with a lot of code, but then I heard that it can be done in only a few lines by using zipping and sorting.
The problem I have at the moment is with the beginning. I tried using binary search and zipWithIndex to create the following "tree", but so far I haven't been able to create anything. I don't even know if it is possible in a single line, but I bet it is.
What I want to get from the word "cheesecake" is a Seq:
Seq((cheesecake, 0),
(heesecake, 1),
(eesecake, 2),
(esecake, 3),
(secake, 4),
(ecake, 5),
(cake, 6),
(ake, 7),
(ke, 8),
(e, 9))
Could someone nudge me to the correct path ?
To generate all the possible postfixes of a String (or any other scala.collection.TraversableLike) you can simply use the tails method:
scala> "cheesecake".tails.toList
res25: List[String] = List(cheesecake, heesecake, eesecake, esecake, secake, ecake, cake, ake, ke, e, "")
If you need the indexes, then you can use GenIterable.zipWithIndex:
scala> "cheesecake".tails.toList.zipWithIndex
res0: List[(String, Int)] = List((cheesecake,0), (heesecake,1), (eesecake,2), (esecake,3), (secake,4), (ecake,5), (cake,6), (ake,7), (ke,8), (e,9), ("",10))
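Building on tails and zipWithIndex, the "zipping and sorting" idea from the question might be sketched as follows. The names suffixes and suffixArray are made up for this example, and sorting whole suffix strings is the naive approach, not an efficient suffix-array construction:

```scala
// All non-empty suffixes paired with their start index.
// filter(_.nonEmpty) drops the trailing "" that tails produces.
def suffixes(word: String): List[(String, Int)] =
  word.tails.filter(_.nonEmpty).zipWithIndex.toList

// Naive suffix array: start indices ordered by their suffix, lexicographically.
def suffixArray(word: String): List[Int] =
  suffixes(word).sortBy(_._1).map(_._2)

val sa = suffixArray("cheesecake")
// sa == List(7, 6, 0, 9, 5, 2, 3, 1, 8, 4)
// i.e. ake, cake, cheesecake, e, ecake, eesecake, esecake, heesecake, ke, secake
```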
You're looking for the .scan methods, specifically .scanRight, since you want to start building from the end (i.e. the right-hand side) of the string, prepending the next character each time (look at your pyramid from bottom to top).
Quoting the documentation :
Produces a collection containing cumulative results of applying the
operator going right to left.
Here the operator is :
Prepend the current character
Decrement the counter (since your first element is "cheesecake".length, counting down)
So :
scala> s.scanRight (List[(String, Int)]())
{ case (char, (stringAcc, count)::tl) => (char + stringAcc, count-1)::tl
case (c, Nil) => List((c.toString, s.length-1))
}
.dropRight(1)
.map(_.head)
res12: scala.collection.immutable.IndexedSeq[(String, Int)] =
Vector((cheesecake,0),
(heesecake,1),
(eesecake,2),
(esecake,3),
(secake,4),
(ecake,5),
(cake,6),
(ake,7),
(ke,8),
(e,9)
)
The dropRight(1) at the end is to remove the (List[(String, Int)]()) (the first argument), which serves as the first element on which to start building (you could pass the last e of your string and iterate on cheesecak, but I find it easier to do it this way).
One approach,
"cheesecake".reverse.inits.map(_.reverse).zipWithIndex.toArray
Scala strings are equipped with ordered collection methods such as reverse and inits; the latter yields the successive prefixes of a string, each one dropping one more trailing character than the one before.
EDIT - From a previous suffix question that I posted (from a Purely Functional Data Structures exercise), I believe that the suffixes should/may include the empty list too, i.e. "" for a String.
scala> def suffix(x: String): List[String] = x.toList match {
| case Nil => Nil
| case xxs @ (_ :: xs) => xxs.mkString :: suffix(xs.mkString)
| }
suffix: (x: String)List[String]
scala> def f(x: String): List[(String, Int)] = suffix(x).zipWithIndex
f: (x: String)List[(String, Int)]
Test
scala> f("cheesecake")
res10: List[(String, Int)] = List((cheesecake,0), (heesecake,1), (eesecake,2),
(esecake,3), (secake,4), (ecake,5), (cake,6), (ake,7), (ke,8), (e,9))

Better String formatting in Scala

With too many arguments, String.format easily gets too confusing. Is there a more powerful way to format a String? Like so:
"This is #{number} string".format("number" -> 1)
Or is this not possible because of type issues (format would need to take a Map[String, Any], I assume; don’t know if this would make things worse).
Or is the better way doing it like this:
val number = 1
<plain>This is { number } string</plain> text
even though it pollutes the name space?
Edit:
While a simple pimping might do in many cases, I’m also looking for something going in the same direction as Python’s format() (See: http://docs.python.org/release/3.1.2/library/string.html#formatstrings)
In Scala 2.10 you can use string interpolation.
val height = 1.9d
val name = "James"
println(f"$name%s is $height%2.2f meters tall") // James is 1.90 meters tall
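For comparison, here is a small sketch of the s and f interpolators side by side (both available since Scala 2.10). The variable names are only illustrative:

```scala
val number = 1
val kind = "string"

// s-interpolator: splices values and arbitrary expressions via ${...}.
val viaS = s"This is $number $kind; next is ${number + 1}"
// viaS == "This is 1 string; next is 2"

// f-interpolator: each value may be followed by a printf-style format,
// and the format is checked against the value's type at compile time.
val viaF = f"$number%03d $kind%s"
// viaF == "001 string"
```

Because the names are resolved from the enclosing scope, this is close to the named-placeholder style asked for in the question, without passing a Map around.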
Well, if your only problem is making the order of the parameters more flexible, this can be easily done:
scala> "%d %d" format (1, 2)
res0: String = 1 2
scala> "%2$d %1$d" format (1, 2)
res1: String = 2 1
And there's also regex replacement with the help of a map:
scala> val map = Map("number" -> 1)
map: scala.collection.immutable.Map[java.lang.String,Int] = Map((number,1))
scala> val getGroup = (_: scala.util.matching.Regex.Match) group 1
getGroup: (util.matching.Regex.Match) => String = <function1>
scala> val pf = getGroup andThen map.lift andThen (_ map (_.toString))
pf: (util.matching.Regex.Match) => Option[java.lang.String] = <function1>
scala> val pat = "#\\{([^}]*)\\}".r
pat: scala.util.matching.Regex = #\{([^}]*)\}
scala> pat replaceSomeIn ("This is #{number} string", pf)
res43: String = This is 1 string
You can easily implement a richer formatting yourself (with the "enhance my library" approach):
scala> implicit def RichFormatter(string: String) = new {
| def richFormat(replacement: Map[String, Any]) =
| (string /: replacement) {(res, entry) => res.replaceAll("#\\{%s\\}".format(entry._1), entry._2.toString)}
| }
RichFormatter: (string: String)java.lang.Object{def richFormat(replacement: Map[String,Any]): String}
scala> "This is #{number} string" richFormat Map("number" -> 1)
res43: String = This is 1 string
Or on more recent Scala versions since the original answer:
implicit class RichFormatter(string: String) {
def richFormat(replacement: Map[String, Any]): String =
replacement.foldLeft(string) { (res, entry) =>
res.replaceAll("#\\{%s\\}".format(entry._1), entry._2.toString)
}
}
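One caveat with the replaceAll-based versions above: the key is used as a regex and the value as a regex replacement, so a value containing $ or \ can misbehave. A possible single-pass variant is sketched below. fillTemplate is a hypothetical name, and leaving unknown placeholders untouched is a design choice made for this example:

```scala
import scala.util.matching.Regex

// Hypothetical helper: one pass over the template, looking up each #{key}.
def fillTemplate(template: String, values: Map[String, Any]): String = {
  val placeholder = "#\\{([^}]*)\\}".r
  placeholder.replaceAllIn(template, m => {
    // Unknown keys keep their original "#{key}" text.
    val replacement = values.get(m.group(1)).map(_.toString).getOrElse(m.matched)
    // quoteReplacement escapes '$' and '\' so they are inserted literally.
    Regex.quoteReplacement(replacement)
  })
}

val filled = fillTemplate("This is #{number} string", Map("number" -> 1))
// filled == "This is 1 string"
```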
Maybe the Scala-Enhanced-Strings-Plugin can help you. Look here:
Scala-Enhanced-Strings-Plugin Documentation
This is the answer I came here looking for:
"This is %s string".format(1)
If you're using 2.10 then go with built-in interpolation. Otherwise, if you don't care about extreme performance and are not afraid of functional one-liners, you can use a fold + several regexp scans:
val template = "Hello #{name}!"
val replacements = Map( "name" -> "Aldo" )
replacements.foldLeft(template)((s:String, x:(String,String)) => ( "#\\{" + x._1 + "\\}" ).r.replaceAllIn( s, x._2 ))
You might also consider using a template engine for really complex and long strings. Off the top of my head, there is Scalate, which implements, among others, the Mustache template engine.
Might be overkill and performance loss for simple strings, but you seem to be in that area where they start becoming real templates.
