P[Node].rep produces P[String] rather than array of nodes - fastparse

I would expect the result for plusses to be some kind of array
case class Plus()
val plus: P[Plus] = P ("+") map {_ => Plus()}
val plusses: P[List[Plus]] = P ( plus.rep.! ) // type mismatch; found: Parser[String] required: Parser[List[Plus]]
but compiler says
type mismatch; found : Parser[String] required: Parser[List[Plus]]

Answer
First, you don't need to capture something with .! as you already have a result in the plus parser. The .rep then "creates" a parser that repeats the plus parser 0 to n times and concatenates the result into a Seq[Plus].
case class Plus()
val plus: P[Plus] = P ("+") map {_ ⇒ Plus()}
val plusses: P[Seq[Plus]] = P ( plus.rep )
Second, if using the repeat operation rep, Parsers return Seq and not List If you really need a list, use map to convert Seq to List.
val plussesAsList: P[List[Plus]] = P( plus.rep ).map( seq ⇒ seq.toList)

.! is "to capture the section of the input string the parser parsed" (see http://lihaoyi.github.io/fastparse/#Capture). So the return type of .! is String.
I think what you want is:
val plusses: P[List[Plus]] = P ( plus.rep ) map (_.toList)
Here's what you will get:
# plusses.parse("+++")
res6: core.Result[List[Plus]] = Success(List(Plus(), Plus(), Plus()),3)

Related

Scala count chars in a string logical error

here is the code:
val a = "abcabca"
a.groupBy((c: Char) => a.count( (d:Char) => d == c))
here is the result I want:
scala.collection.immutable.Map[Int,String] = Map(2 -> b, 2 -> c, 3 -> a)
but the result I get is
scala.collection.immutable.Map[Int,String] = Map(2 -> bcbc, 3 -> aaa)
why?
thank you.
Write an expression like
"abcabca".groupBy(identity).collect{
case (k,v) => (k,v.length)
}
which will give output as
res0: scala.collection.immutable.Map[Char,Int] = Map(b -> 2, a -> 3, c -> 2)
Let's dissect your initial attempt :
a.groupBy((c: Char) => a.count( (d:Char) => d == c))
So, you're grouping by something which is what ? the result of a.count(...), so the key of your Map will be an Int. For the char a, we will get 3, for the chars b and c, we'll get 2.
Now, the original String will be traversed and for the results accumulated, char by char.
So after traversing the first "ab", the current state is "2-> b, 3->c". (Note that for each char in the string, the .count() is called, which is a n² wasteful algorithm, but anyway).
The string is progressively traversed, and at the end the accumulated results is shown. As it turns out, the 3 "a" have been sent under the "3" key, and the b and c have been sent to the key "2", in the order the string was traversed, which is the left to right order.
Now, a usual groupBy on a list returns something like Map[T, List[T]], so you may have expected a List[Char] somewhere. It doesn't happen (because the Repr for String is String), and your list of chars is effectively recombobulated into a String, and is given to you as such.
Hence your final result !
Your question header reads as "Scala count chars in a string logical error". But you are using Map and you wanted counts as keys. Equal keys are not allowed in Map objects. Hence equal keys get eliminated in the resulting Map, keeping just one, because no duplicate keys are allowed. What you want may be a Seq of tuples like (count, char) like List[Int,Char]. Try this.
val x = "abcabca"
x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}
In Scal REPL:
scala> x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}
res13: List[(Int, Char)] = List((2,b), (3,a), (2,c))
The above gives a list of counts and respective chars as a list of tuples.So this is what you may really wanted.
If you try converting this to a Map:
scala> x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}.toMap
res14: scala.collection.immutable.Map[Int,Char] = Map(2 -> c, 3 -> a)
So this is not what you want obviously.
Even more concisely use:
x.distinct.map(v=>(x.filter(_==v).size,v))
scala> x.distinct.map(v=>(x.filter(_==v).size,v))
res19: scala.collection.immutable.IndexedSeq[(Int, Char)] = Vector((3,a), (2,b), (2,c))
The problem with your approach is you are mapping count to characters. Which is:
In case of
val str = abcabca
While traversing the string str a has count 3, b has count 2 and c has count 2 while creating the map (with the use of groupBy) it will put all the characters in the value which has the same key that is.
Map(3->aaa, 2->bc)
That’s the reason you are getting such output for your program.
As you can see in the definition of the groupBy function:
def
groupBy[K](f: (A) ⇒ K): immutable.Map[K, Repr]
Partitions this traversable collection into a map of traversable collections according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new traversable collection.
K
the type of keys returned by the discriminator function.
f
the discriminator function.
returns
A map from keys to traversable collections such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a traversable collection of those elements x for which f(x) equals k.
GroupBy returns a Map which holds the following invariant.
(xs groupBy f)(k) = xs filter (x => f(x) == k)
Which means it return collection of elements for which the key is same.

Document Count of a Word in Spark/Scala

I have a text variable which is an RDD of String in scala
val data = sc.parallelize(List("i am a good boy.Are you a good boy.","You are also working here.","I am posting here today.You are good."))
I have another variable in Scala Map(given below)
//list of words for which doc count needs to be found,initial doc count is 1
val dictionary = Map( """good""" -> 1,"""working""" -> 1,"""posting""" -> 1 ).
I want to do a document count of each of the dictionary terms and get the output in key value format
My output should be like below for the above data.
(good,2)
(working,1)
(posting,1)
What i have tried is
dictionary.map { case(k,v) => k -> k.r.findFirstIn(data.map(line => line.trim()).collect().mkString(",")).size}
I am getting counts as 1 for all the words.
Please help me in fixing the above line
Thanks in advance.
Why not use flatMap to create the dictionary and then you can query that.
val dictionary = data.flatMap {case line => line.split(" ")}.map {case word => (word, 1)}.reduceByKey(_+_)
If I collect this in the REPL I get the following result:
res9: Array[(String, Int)] = Array((here,1), (good.,1), (good,2), (here.,1), (You,1), (working,1), (today.You,1), (boy.Are,1), (are,2), (a,2), (posting,1), (i,1), (boy.,1), (also,1), (I,1), (am,2), (you,1))
Obviously you would need to do a better split than in my simple example.
First of all your dictionary should be a Set, because in general sense you need to map the Set of terms to the number of documents which contain them.
So your data should look like:
scala> val docs = List("i am a good boy.Are you a good boy.","You are also working here.","I am posting here today.You are good.")
docs: List[String] = List(i am a good boy.Are you a good boy., You are also working here., I am posting here today.You are good.)
Your dictionary should look like:
scala> val dictionary = Set("good", "working", "posting")
dictionary: scala.collection.immutable.Set[String] = Set(good, working, posting)
Then you have to implement your transformation, for the simplest logic of the contains function it might look like:
scala> dictionary.map(k => k -> docs.count(_.contains(k))) toMap
res4: scala.collection.immutable.Map[String,Int] = Map(good -> 2, working -> 1, posting -> 1)
For better solution I'd recommend you to implement specific function for your requirements
(String, String) => Boolean
to determine the presence of the term in the document:
scala> def foo(doc: String, term: String): Boolean = doc.contains(term)
foo: (doc: String, term: String)Boolean
Then final solution will look like:
scala> dictionary.map(k => k -> docs.count(d => foo(d, k))) toMap
res3: scala.collection.immutable.Map[String,Int] = Map(good -> 2, working -> 1, posting -> 1)
The last thing you have to do is to calculate the result map using SparkContext. First of all you have to define what data you want to have parallelised. Let's assume we want to parallelize the collection of the documents, then solution might be like following:
val docsRDD = sc.parallelize(List(
"i am a good boy.Are you a good boy.",
"You are also working here.",
"I am posting here today.You are good."
))
docsRDD.mapPartitions(_.map(doc => dictionary.collect {
case term if doc.contains(term) => term -> 1
})).map(_.toMap) reduce { case (m1, m2) => merge(m1, m2) }
def merge(m1: Map[String, Int], m2: Map[String, Int]) =
m1 ++ m2 map { case (k, v) => k -> (v + m1.getOrElse(k, 0)) }

Scala - modify strings in a list based on their number of occurences

Another Scala newbie question since I am not getting how to achieve this in a functional way (mostly coming from a scripting language background):
I have a list of strings:
val food-list = List("banana-name", "orange-name", "orange-num", "orange-name", "orange-num", "grape-name")
and where they are duplicated, I'd like to add an incrementing number into the string and get that in a list similar to the input list, like so:
List("banana-name", "orange1-name", "orange1-num", "orange2-name", "orange2-num", "grape-name")
I've grouped them up to get counts for them with:
val freqs = list.groupBy(identity).mapValues(v => List.range(1, v.length + 1))
Which gives me:
Map(orange-num -> List(1, 2), banana-name -> List(1), grape-name -> List(1), orange-name -> List(1, 2))
The order of the list is important (it should be in the original order of food-list) so I know it's problematic for me to use a Map at this point. The closest I feel I have gotten to a solution is:
food-list.map{l =>
if (freqs(l).length > 1){
freqs(l).map(n =>
l.split("-")(0) + n.toString + "-" + l.split("-")(1))
} else {
l
}
}
This of course gives me a wonky output since I am mapping the list of frequencies from the words value in freqs
List(banana-name, List(orange1-name, orange2-name), List(orange1-num, orange2-num), List(orange1-name, orange2-name), List(orange1-num, orange2-num), grape-name)
How is this done in a Scala fp way without resorting to clumsy for loops and counters?
If the indices are important, sometimes it's best to keep track of them explicitly using zipWithIndex (very similar to Python's enumerate):
food-list.zipWithIndex.groupBy(_._1).values.toList.flatMap{
//if only one entry in this group, don't change the values
//x is actually a tuple, could write case (str, idx) :: Nil => (str, idx) :: Nil
case x :: Nil => x :: Nil
//case where there are duplicate strings
case xs => xs.zipWithIndex.map {
//idx is index in the original list, n is index in the new list i.e. count
case ((str, idx), n) =>
//destructuring assignment, like python's (fruit, suffix) = ...
val Array(fruit, suffix) = str.split("-")
//string interpolation, returning a tuple
(s"$fruit${n+1}-$suffix", idx)
}
//We now have our list of (string, index) pairs;
//sort them and map to a list of just strings
}.sortBy(_._2).map(_._1)
Efficient and simple:
val food = List("banana-name", "orange-name", "orange-num",
"orange-name", "orange-num", "grape-name")
def replaceName(s: String, n: Int) = {
val tokens = s.split("-")
tokens(0) + n + "-" + tokens(1)
}
val indicesMap = scala.collection.mutable.HashMap.empty[String, Int]
val res = food.map { name =>
{
val n = indicesMap.getOrElse(name, 1)
indicesMap += (name -> (n + 1))
replaceName(name, n)
}
}
Here is an attempt to provide what you expected with foldLeft:
foodList.foldLeft((List[String](), Map[String, Int]()))//initial value
((a/*accumulator, list, map*/, v/*value from the list*/)=>
if (a._2.isDefinedAt(v))//already seen
(s"$v+${a._2(v)}" :: a._1, a._2.updated(v, a._2(v) + 1))
else
(v::a._1, a._2.updated(v, 1)))
._1/*select the list*/.reverse/*because we created in the opposite order*/

Scala split string to tuple

I would like to split a string on whitespace that has 4 elements:
1 1 4.57 0.83
and I am trying to convert into List[(String,String,Point)] such that first two splits are first two elements in the list and the last two is Point. I am doing the following but it doesn't seem to work:
Source.fromFile(filename).getLines.map(string => {
val split = string.split(" ")
(split(0), split(1), split(2))
}).map{t => List(t._1, t._2, t._3)}.toIterator
How about this:
scala> case class Point(x: Double, y: Double)
defined class Point
scala> s43.split("\\s+") match { case Array(i, j, x, y) => (i.toInt, j.toInt, Point(x.toDouble, y.toDouble)) }
res00: (Int, Int, Point) = (1,1,Point(4.57,0.83))
You could use pattern matching to extract what you need from the array:
case class Point(pts: Seq[Double])
val lines = List("1 1 4.34 2.34")
val coords = lines.collect(_.split("\\s+") match {
case Array(s1, s2, points # _*) => (s1, s2, Point(points.map(_.toDouble)))
})
You are not converting the third and fourth tokens into a Point, nor are you converting the lines into a List. Also, you are not rendering each element as a Tuple3, but as a List.
The following should be more in line with what you are looking for.
case class Point(x: Double, y: Double) // Simple point class
Source.fromFile(filename).getLines.map(line => {
val tokens = line.split("""\s+""") // Use a regex to avoid empty tokens
(tokens(0), tokens(1), Point(tokens(2).toDouble, tokens(3).toDouble))
}).toList // Convert from an Iterator to List
case class Point(pts: Seq[Double])
val lines = "1 1 4.34 2.34"
val splitLines = lines.split("\\s+") match {
case Array(s1, s2, points # _*) => (s1, s2, Point(points.map(_.toDouble)))
}
And for the curious, the # in pattern matching binds a variable to the pattern, so points # _* is binding the variable points to the pattern *_ And *_ matches the rest of the array, so points ends up being a Seq[String].
There are ways to convert a Tuple to List or Seq, One way is
scala> (1,2,3).productIterator.toList
res12: List[Any] = List(1, 2, 3)
But as you can see that the return type is Any and NOT an INTEGER
For converting into different types you use Hlist of
https://github.com/milessabin/shapeless

HowTo get a Map from a csv string

I'm fairly new to Scala, but I'm doing my exercises now.
I have a string like "A>Augsburg;B>Berlin". What I want at the end is a map
val mymap = Map("A"->"Augsburg", "B"->"Berlin")
What I did is:
val st = locations.split(";").map(dynamicListExtract _)
with the function
private def dynamicListExtract(input: String) = {
if (input contains ">") {
val split = input split ">"
Some(split(0), split(1)) // return key , value
} else {
None
}
}
Now I have an Array[Option[(String, String)
How do I elegantly convert this into a Map[String, String]
Can anybody help?
Thanks
Just change your map call to flatMap:
scala> sPairs.split(";").flatMap(dynamicListExtract _)
res1: Array[(java.lang.String, java.lang.String)] = Array((A,Augsburg), (B,Berlin))
scala> Map(sPairs.split(";").flatMap(dynamicListExtract _): _*)
res2: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map((A,Augsburg), (B,Berlin))
For comparison:
scala> Map("A" -> "Augsburg", "B" -> "Berlin")
res3: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map((A,Augsburg), (B,Berlin))
In 2.8, you can do this:
val locations = "A>Augsburg;B>Berlin"
val result = locations.split(";").map(_ split ">") collect { case Array(k, v) => (k, v) } toMap
collect is like map but also filters values that aren't defined in the partial function. toMap will create a Map from a Traversable as long as it's a Traversable[(K, V)].
It's also worth seeing Randall's solution in for-comprehension form, which might be clearer, or at least give you a better idea of what flatMap is doing.
Map.empty ++ (for(possiblePair<-sPairs.split(";"); pair<-dynamicListExtract(possiblePair)) yield pair)
A simple solution (not handling error cases):
val str = "A>Aus;B>Ber"
var map = Map[String,String]()
str.split(";").map(_.split(">")).foreach(a=>map += a(0) -> a(1))
but Ben Lings' is better.
val str= "A>Augsburg;B>Berlin"
Map(str.split(";").map(_ split ">").map(s => (s(0),s(1))):_*)
--or--
str.split(";").map(_ split ">").foldLeft(Map[String,String]())((m,s) => m + (s(0) -> s(1)))

Resources