Scala - modify strings in a list based on their number of occurrences - string

Another Scala newbie question since I am not getting how to achieve this in a functional way (mostly coming from a scripting language background):
I have a list of strings:
val foodList = List("banana-name", "orange-name", "orange-num", "orange-name", "orange-num", "grape-name")
and where they are duplicated, I'd like to add an incrementing number into the string and get that in a list similar to the input list, like so:
List("banana-name", "orange1-name", "orange1-num", "orange2-name", "orange2-num", "grape-name")
I've grouped them up to get counts for them with:
val freqs = foodList.groupBy(identity).mapValues(v => List.range(1, v.length + 1))
Which gives me:
Map(orange-num -> List(1, 2), banana-name -> List(1), grape-name -> List(1), orange-name -> List(1, 2))
The order of the list is important (it should be in the original order of foodList), so I know it's problematic for me to use a Map at this point. The closest I feel I have gotten to a solution is:
foodList.map { l =>
  if (freqs(l).length > 1) {
    freqs(l).map(n =>
      l.split("-")(0) + n.toString + "-" + l.split("-")(1))
  } else {
    l
  }
}
This of course gives me a wonky output, since I am mapping over the whole list of frequencies stored under the word's key in freqs:
List(banana-name, List(orange1-name, orange2-name), List(orange1-num, orange2-num), List(orange1-name, orange2-name), List(orange1-num, orange2-num), grape-name)
How is this done in a Scala fp way without resorting to clumsy for loops and counters?

If the indices are important, sometimes it's best to keep track of them explicitly using zipWithIndex (very similar to Python's enumerate):
foodList.zipWithIndex.groupBy(_._1).values.toList.flatMap {
  // if only one entry in this group, don't change the values
  // x is actually a tuple, could write case (str, idx) :: Nil => (str, idx) :: Nil
  case x :: Nil => x :: Nil
  // case where there are duplicate strings
  case xs => xs.zipWithIndex.map {
    // idx is index in the original list, n is index in the new list i.e. count
    case ((str, idx), n) =>
      // destructuring assignment, like python's (fruit, suffix) = ...
      val Array(fruit, suffix) = str.split("-")
      // string interpolation, returning a tuple
      (s"$fruit${n+1}-$suffix", idx)
  }
  // We now have our list of (string, index) pairs;
  // sort them and map to a list of just strings
}.sortBy(_._2).map(_._1)

Efficient and simple, although it numbers every string, including those that occur only once (banana-name becomes banana1-name):
val food = List("banana-name", "orange-name", "orange-num",
  "orange-name", "orange-num", "grape-name")
def replaceName(s: String, n: Int) = {
  val tokens = s.split("-")
  tokens(0) + n + "-" + tokens(1)
}
// running count of the occurrences seen so far for each string
val indicesMap = scala.collection.mutable.HashMap.empty[String, Int]
val res = food.map { name =>
  val n = indicesMap.getOrElse(name, 1)
  indicesMap += (name -> (n + 1))
  replaceName(name, n)
}

Here is an attempt to provide what you expected with foldLeft:
foodList.foldLeft((List.empty[String], Map.empty[String, Int])) { // initial value: (result list, counts seen so far)
  case ((acc, seen), v) => // accumulator (list, map) and current value from the list
    if (seen.isDefinedAt(v)) // already seen: a repeat becomes e.g. "orange-name+1"
      (s"$v+${seen(v)}" :: acc, seen.updated(v, seen(v) + 1))
    else
      (v :: acc, seen.updated(v, 1))
}._1.reverse // select the list; reverse because it was built in the opposite order
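For comparison, here is a minimal sketch of a fold that reproduces the output requested in the question. The name numberDuplicates is hypothetical, and the code assumes every string contains a hyphen. It counts the totals first so that strings occurring only once stay unchanged, then keeps a running count per string during the fold.

```scala
// Hypothetical helper, not from any of the answers above.
// Assumes each item has the form "<name>-<suffix>".
def numberDuplicates(items: List[String]): List[String] = {
  // total number of occurrences of each string
  val totals: Map[String, Int] =
    items.groupBy(identity).map { case (k, v) => k -> v.size }

  val (result, _) =
    items.foldLeft((List.empty[String], Map.empty[String, Int])) {
      case ((acc, seen), item) =>
        // n is the 1-based occurrence number of this item so far
        val n = seen.getOrElse(item, 0) + 1
        val renamed =
          if (totals(item) > 1) {
            val Array(name, suffix) = item.split("-", 2)
            s"$name$n-$suffix"
          } else item
        (renamed :: acc, seen.updated(item, n))
    }

  // the list was built by prepending, so restore the original order
  result.reverse
}

val numbered = numberDuplicates(
  List("banana-name", "orange-name", "orange-num",
       "orange-name", "orange-num", "grape-name"))
```

The two passes keep the code purely functional; the mutable-map approach does the same work with less allocation if that matters.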

Related

Merge two strings in Kotlin

I have two strings
val a = "abc"
val b = "xyz"
I want to merge them and get the output below:
axbycz
I added both strings to arraylist and then flatmap it
val c = listOf(a, b)
val d = c.flatMap {
it.toList()
}
but not getting the desired result
Use the zip function. It creates a list of pairs with "adjacent" letters. You can then use joinToString with a transformer to create your final result.
a.zip(b) // Returns the list [(a, x), (b, y), (c, z)]
.joinToString("") { (a, b) -> "$a$b" } // Joins the list back to a string with no separator
You can always use a simple loop, assuming both strings have the same size. That way you only allocate a StringBuilder and a counter variable, without any lists, arrays or pairs:
val a = "abc"
val b = "xyz"
val sb = StringBuilder()
for (i in 0 until a.length) {
    sb.append(a[i]).append(b[i])
}
val d = sb.toString()
marstran's answer is really concise and Pawel's answer is really fast. Using buildString you can have the best of both worlds:
buildString {
    a.zip(b).forEach { (a, b) ->
        append(a).append(b)
    }
}
buildString creates a StringBuilder and offers it as receiver in the lambda. It returns the built string.
Try it out here: Kotlin Playground. Thanks to Pawel for creating the original benchmark.

count number of chars in String

In SML, how can I count the number of appearances of each char in a string using recursion?
The output should be in the form (char, #AppearancesOfChar).
What I have managed to do is
fun frequency(x) = if x = [] then [] else [(hd x,1)]@frequency(tl x)
which returns tuples of the form (char,1). I can also eliminate duplicates in this list, so what I fail to do now is write a function like
fun count(s:string,l: (char * int) list)
which 'iterates' through the string, incrementing the matching tuple component. How can I do this recursively? Sorry for the noob question, I am new to functional programming, but I hope the question is at least understandable :)
I'd break the problem into two: Increasing the frequency of a single character, and iterating over the characters in a string and inserting each of them. Increasing the frequency depends on whether you have already seen the character before.
fun increaseFrequency (c, []) = [(c, 1)]
  | increaseFrequency (c, ((c1, count)::freqs)) =
      if c = c1
      then (c1, count+1)::freqs
      else (c1, count)::increaseFrequency (c, freqs)
This provides a function with the following type declaration:
val increaseFrequency = fn : ''a * (''a * int) list -> (''a * int) list
So given a character and a list of frequencies, it returns an updated list of frequencies where either the character has been inserted with frequency 1, or its existing frequency has been increased by 1, by performing a linear search through each tuple until either the right one is found or the end of the list is met. All other character frequencies are preserved.
The simplest way to iterate over the characters in a string is to explode it into a list of characters and insert each character into an accumulating list of frequencies that starts with the empty list:
fun frequencies s =
    let fun freq [] freqs = freqs
          | freq (c::cs) freqs = freq cs (increaseFrequency (c, freqs))
    in freq (explode s) [] end
But this isn't a very efficient way to iterate a string one character at a time. Alternatively, you can visit each character by indexing without converting to a list:
fun foldrs f e s =
    let val len = size s
        fun loop i e' = if i = len
                        then e'
                        else loop (i+1) (f (String.sub (s, i), e'))
    in loop 0 e end
fun frequencies s = foldrs increaseFrequency [] s
You might also consider using a more efficient representation of sets than lists to reduce the linear-time insertions.

How to generate random vector in Spark

I want to generate random vectors with norm 1 in Spark.
Since the vector could be very large, I want it to be distributed. And since data in an RDD has no order, I want to store the vector in the form of RDD[(Int, Double)], because I also need to use this vector for matrix-vector multiplication.
So how could I generate this kind of vector?
Here is my plan for now:
val v = normalRDD(sc, n, NUM_NODE)
val mod = GetMod(v) // Get the modularity of v
val res = v.map(x => x / mod)
val arr: Array[Double] = res.toArray()
var tuples = List.empty[(Int, Double)]
for (i <- arr.indices) {
  tuples = (i, arr(i)) :: tuples
}
// Get the entries and length of the vector.
entries = sc.parallelize(tuples)
length = arr.length
I think it is not elegant enough because it goes through a "distributed -> single node -> distributed" process.
Is there a better way? Thanks :D
Try this:
import scala.util.Random
import scala.math.sqrt
val n = 5 // insert length of your array here
val randomRDD = sc.parallelize(for (i <- 0 until n) yield (i, Random.nextDouble)) // until, so exactly n entries
val norm = sqrt(randomRDD.map(x => x._2 * x._2).sum())
val finalRDD = randomRDD.mapValues(x => x/norm)
You can use this function to generate a random vector, then normalise it by dividing each element by the norm of the vector (the square root of the sum of squares, since the question asks for norm 1), or by using a normalizer.
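The normalisation arithmetic can be checked locally without Spark. The following is a minimal sketch in which a plain Scala Vector of (index, value) pairs stands in for the RDD[(Int, Double)]; the name randomUnitVector and the seed parameter are assumptions made for illustration, and Gaussian entries are used as in the question's normalRDD plan.

```scala
import scala.util.Random

// Hypothetical local stand-in for the distributed version:
// builds n indexed Gaussian entries and scales them to Euclidean norm 1.
def randomUnitVector(n: Int, seed: Long): Vector[(Int, Double)] = {
  val rng = new Random(seed)
  // keep the index next to each value, as the question requires
  val raw = Vector.tabulate(n)(i => (i, rng.nextGaussian()))
  // Euclidean norm: square root of the sum of squares
  val norm = math.sqrt(raw.map { case (_, x) => x * x }.sum)
  raw.map { case (i, x) => (i, x / norm) }
}

val unit = randomUnitVector(5, 42L)
// sum of squares of the normalised entries, expected to be 1 up to rounding
val squaredNorm = unit.map { case (_, x) => x * x }.sum
```

On an RDD the same two steps would be a map followed by a sum to get the norm, then a mapValues to scale, so the data never has to be collected on a single node.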

Scala String Similarity

I have Scala code that computes the similarity between a set of strings and gives all the unique strings.
val filtered = z.reverse.foldLeft((List.empty[String], z.reverse)) {
  case ((acc, zt), zz) =>
    if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) (acc, zt.tail)
    else (zz :: acc, zt.tail)
}._1
I'll try to explain what is going on here :
This uses a fold over the reversed input data, starting from an empty list of Strings (to accumulate results) and the (reverse of the) remaining input data (to compare against - I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)
If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
The problem is that if, say, the 1st, 4th and 8th strings are similar, I get only the 1st string. Instead, I should get the set (1st, 4th, 8th); then, if the 2nd, 5th, 14th and 21st strings are similar, I should get the set (2nd, 5th, 14th, 21st).
If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).
If so - I can't see a trivial change to your implementation that would achieve this, as the similar values are lost (when you enter the if(true) branch and just return the acc - you skip an item and you'll never "see" it again).
Two possible solutions I can think of:
Based on your idea, but using a 3-tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that doesn't have preceding similar elements:
val filtered = z.reverse.foldLeft((List.empty[List[String]], z.reverse, List.empty[String])) {
  case ((acc, zt, scanned), zz) =>
    val hasSimilarPreceeding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
    val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
    (if (hasSimilarPreceeding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1
A probably-slower but much simpler solution would be to just groupBy the group of similar strings:
val alternative = z.groupBy(s => z.collect {
case other if similarity(s, other) < threshold => other
}.toSet ).values.toList
All of this assumes that the function:
def f(a: String, b: String): Boolean = similarity(a, b) < threshold
is symmetric and transitive, i.e.:
f(a, b) && f(a, c) means that f(b, c)
f(a, b) if and only if f(b, a)
To test both implementations I used:
// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
And both options produce the same results:
List(List(aa, ab, a), List(c), List(e), List(fa, fb))
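Under the same symmetry and transitivity assumption, the grouping can also be written as a single fold that compares each string only with the first member of each existing group. This is a sketch, not one of the two solutions above; groupSimilar is a hypothetical name, and similarity, threshold and z are copied from the test setup.

```scala
// Test setup copied from the answer: strings are similar
// if they start with the same character.
def similarity(s1: String, s2: String): Int = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")

// Hypothetical helper: puts each string into the first group whose
// representative (first member) is similar, else starts a new group.
// Only valid when `similar` behaves like an equivalence relation.
def groupSimilar(items: List[String])(similar: (String, String) => Boolean): List[List[String]] =
  items.foldLeft(Vector.empty[Vector[String]]) { (groups, s) =>
    groups.indexWhere(g => similar(g.head, s)) match {
      case -1 => groups :+ Vector(s)                   // no similar group yet
      case i  => groups.updated(i, groups(i) :+ s)     // append to existing group
    }
  }.map(_.toList).toList

val grouped = groupSimilar(z)((a, b) => similarity(a, b) < threshold)
```

Because groups are kept in a Vector in first-seen order, the result order does not depend on Map iteration order.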

Detecting the index in a string that is not printable character with Scala

I have a method that detects the index in a string that is not printable as follows.
def isPrintable(v:Char) = v >= 0x20 && v <= 0x7E
val ba = List[Byte](33,33,0,0,0)
ba.zipWithIndex.filter { v => !isPrintable(v._1.toChar) } map {v => v._2}
> res115: List[Int] = List(2, 3, 4)
The first element of the result list is the index, but I wonder if there is a simpler way to do this.
If you want an Option[Int] of the first non-printable character (if one exists), you can do:
ba.zipWithIndex.collectFirst{
case (char, index) if (!isPrintable(char.toChar)) => index
}
> res4: Option[Int] = Some(2)
If you want all the indices like in your example, just use collect instead of collectFirst and you'll get back a List.
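A minimal sketch of that collect variant, repeating isPrintable and ba from the question so the snippet runs on its own; the name allIndices is an assumption.

```scala
// Definitions repeated from the question.
def isPrintable(v: Char): Boolean = v >= 0x20 && v <= 0x7E
val ba = List[Byte](33, 33, 0, 0, 0)

// collect keeps every index whose byte is not printable,
// where collectFirst would stop at the first match.
val allIndices: List[Int] = ba.zipWithIndex.collect {
  case (b, index) if !isPrintable(b.toChar) => index
}
```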
For getting only the first index that meets the given condition:
ba.indexWhere(v => !isPrintable(v.toChar))
(it returns -1 if nothing is found)
You can use a regular expression directly to find unprintable characters by their Unicode category.
Resource: Regexp page
That way you can filter your string directly with such a pattern, for instance:
val text = "this is \n sparta\t\r\n!!!"
text.zipWithIndex.filter(_._1.toString.matches("\\p{C}")).map(_._2)
> res3: Vector(8, 16, 17, 18)
As a result you get a Vector with the indices of all unprintable characters in the String.
If only the first occurrence of a non-printable char is desired
The span method applied to a List returns two sublists: the first contains the leading elements that satisfy a condition, the second starts with the first element that fails it. In this case consider,
val (l,r) = ba.span(b => isPrintable(b.toChar))
l: List(33, 33)
r: List(0, 0, 0)
To get the index of the first non printable char,
l.size
res: Int = 2
If all the occurrences of non-printable chars are desired
Consider partitioning the List by a criterion. For instance, for
val ba2 = List[Byte](33,33,0,33,33)
val (l,r) = ba2.zipWithIndex.partition(b => isPrintable(b._1.toChar))
l: List((33,0), (33,1), (33,3), (33,4))
r: List((0,2))
where r includes tuples with non printable chars and their position in the original List.
I am not sure whether a list of indexes or a list of tuples is needed, and I am not sure whether 'ba' needs to be a list of bytes or starts off as a string.
for { i <- 0 until ba.length if !isPrintable(ba(i).toChar) } yield i
Here, because people need performance :)
def getNonPrintable(ba: List[Byte]): List[Int] = {
  import scala.annotation.tailrec
  import scala.collection.mutable.ListBuffer
  val buffer = ListBuffer[Int]()
  @tailrec
  def go(xs: List[Byte], cur: Int): ListBuffer[Int] = {
    xs match {
      case Nil => buffer
      case y :: ys =>
        if (!isPrintable(y.toChar)) buffer += cur
        go(ys, cur + 1)
    }
  }
  go(ba, 0)
  buffer.toList
}
