Scala String Similarity

Scala String Similarity - string

I have a Scala code that computes similarity between a set of strings and give all the unique strings.
val filtered = z.reverse.foldLeft((List.empty[String],z.reverse)) {
case ((acc, zt), zz) =>
if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc
else zz :: acc, zt.tail
}._1
I'll try to explain what is going on here :
This uses a fold over the reversed input data, starting from the empty String (to accumulate results) and the (reverse of the) remaining input data (to compare against - I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)
If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
The problem is like in first iteration, if 1st, 4th and 8th strings are similar, I am getting only the 1st string. Instead of it, I should get a set of (1st,4th,8th), then if 2nd,5th,14th and 21st strings are similar, I should get a set of (2nd,5th,14th,21st).

If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).
If so - I can't see a trivial change to your implementation that would achieve this, as the similar values are lost (when you enter the if(true) branch and just return the acc - you skip an item and you'll never "see" it again).
Two possible solutions I can think of:
Based on your idea, but using a 3-Tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that doesn't have preceeding similar elements:
val filtered = z.reverse.foldLeft((List.empty[List[String]],z.reverse,List.empty[String])) {
case ((acc, zt, scanned), zz) =>
val hasSimilarPreceeding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
(if (hasSimilarPreceeding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1
A probably-slower but much simpler solution would be to just groupBy the group of similar strings:
val alternative = z.groupBy(s => z.collect {
case other if similarity(s, other) < threshold => other
}.toSet ).values.toList
All of this assumes that the function:
f(a: String, b: String): Boolean = similarity(a, b) < threshold
Is commutative and transitive, i.e.:
f(a, b) && f(a. c) means that f(b, c)
f(a, b) if and only if f(b, a)
To test both implementations I used:
// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
And both options produce the same results:
List(List(aa, ab, a), List(c), List(e), List(fa, fb))

Related

Scala count chars in a string logical error

here is the code:
val a = "abcabca"
a.groupBy((c: Char) => a.count( (d:Char) => d == c))
here is the result I want:
scala.collection.immutable.Map[Int,String] = Map(2 -> b, 2 -> c, 3 -> a)
but the result I get is
scala.collection.immutable.Map[Int,String] = Map(2 -> bcbc, 3 -> aaa)
why?
thank you.

Write an expression like
"abcabca".groupBy(identity).collect{
case (k,v) => (k,v.length)
}
which will give output as
res0: scala.collection.immutable.Map[Char,Int] = Map(b -> 2, a -> 3, c -> 2)

Let's dissect your initial attempt :
a.groupBy((c: Char) => a.count( (d:Char) => d == c))
So, you're grouping by something which is what ? the result of a.count(...), so the key of your Map will be an Int. For the char a, we will get 3, for the chars b and c, we'll get 2.
Now, the original String will be traversed and for the results accumulated, char by char.
So after traversing the first "ab", the current state is "2-> b, 3->c". (Note that for each char in the string, the .count() is called, which is a n² wasteful algorithm, but anyway).
The string is progressively traversed, and at the end the accumulated results is shown. As it turns out, the 3 "a" have been sent under the "3" key, and the b and c have been sent to the key "2", in the order the string was traversed, which is the left to right order.
Now, a usual groupBy on a list returns something like Map[T, List[T]], so you may have expected a List[Char] somewhere. It doesn't happen (because the Repr for String is String), and your list of chars is effectively recombobulated into a String, and is given to you as such.
Hence your final result !

Your question header reads as "Scala count chars in a string logical error". But you are using Map and you wanted counts as keys. Equal keys are not allowed in Map objects. Hence equal keys get eliminated in the resulting Map, keeping just one, because no duplicate keys are allowed. What you want may be a Seq of tuples like (count, char) like List[Int,Char]. Try this.
val x = "abcabca"
x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}
In Scal REPL:
scala> x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}
res13: List[(Int, Char)] = List((2,b), (3,a), (2,c))
The above gives a list of counts and respective chars as a list of tuples.So this is what you may really wanted.
If you try converting this to a Map:
scala> x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}.toMap
res14: scala.collection.immutable.Map[Int,Char] = Map(2 -> c, 3 -> a)
So this is not what you want obviously.
Even more concisely use:
x.distinct.map(v=>(x.filter(_==v).size,v))
scala> x.distinct.map(v=>(x.filter(_==v).size,v))
res19: scala.collection.immutable.IndexedSeq[(Int, Char)] = Vector((3,a), (2,b), (2,c))

The problem with your approach is you are mapping count to characters. Which is:
In case of
val str = abcabca
While traversing the string str a has count 3, b has count 2 and c has count 2 while creating the map (with the use of groupBy) it will put all the characters in the value which has the same key that is.
Map(3->aaa, 2->bc)
That’s the reason you are getting such output for your program.
As you can see in the definition of the groupBy function:
def
groupBy[K](f: (A) ⇒ K): immutable.Map[K, Repr]
Partitions this traversable collection into a map of traversable collections according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new traversable collection.
K
the type of keys returned by the discriminator function.
f
the discriminator function.
returns
A map from keys to traversable collections such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a traversable collection of those elements x for which f(x) equals k.
GroupBy returns a Map which holds the following invariant.
(xs groupBy f)(k) = xs filter (x => f(x) == k)
Which means it return collection of elements for which the key is same.

count number of chars in String

In SML, how can i count the number of appearences of chars in a String using recursion?
Output should be in the form of (char,#AppearenceOfChar).
What i managed to do is
fun frequency(x) = if x = [] then [] else [(hd x,1)]#frequency(tl x)
which will return tupels of the form (char,1). I can too eliminate duplicates in this list, so what i fail to do now is to write a function like
fun count(s:string,l: (char,int) list)
which 'iterates' trough the string incrementing the particular tupel component. How can i do this recursively? Sorry for noob question but i am new to functional programming but i hope the question is at least understandable :)

I'd break the problem into two: Increasing the frequency of a single character, and iterating over the characters in a string and inserting each of them. Increasing the frequency depends on whether you have already seen the character before.
fun increaseFrequency (c, []) = [(c, 1)]
| increaseFrequency (c, ((c1, count)::freqs)) =
if c = c1
then (c1, count+1)
else (c1,count)::increaseFrequency (c, freqs)
This provides a function with the following type declaration:
val increaseFrequency = fn : ''a * (''a * int) list -> (''a * int) list
So given a character and a list of frequencies, it returns an updated list of frequencies where either the character has been inserted with frequency 1, or its existing frequency has been increased by 1, by performing a linear search through each tuple until either the right one is found or the end of the list is met. All other character frequencies are preserved.
The simplest way to iterate over the characters in a string is to explode it into a list of characters and insert each character into an accumulating list of frequencies that starts with the empty list:
fun frequencies s =
let fun freq [] freqs = freqs
| freq (c::cs) freqs = freq cs (increaseFrequency (c, freqs))
in freq (explode s) [] end
But this isn't a very efficient way to iterate a string one character at a time. Alternatively, you can visit each character by indexing without converting to a list:
fun foldrs f e s =
let val len = size s
fun loop i e' = if i = len
then e'
else loop (i+1) (f (String.sub (s, i), e'))
in loop 0 e end
fun frequencies s = foldrs increaseFrequency [] s
You might also consider using a more efficient representation of sets than lists to reduce the linear-time insertions.

Scala Comprehension Errors

I am working on some of the exercism.io exercises. The current one I am working on is for Scala DNA exercise. Here is my code and the errors that I am receiving:
For reference, DNA is instantiated with a strand String. This DNA can call count (which counts the strand for the single nucleotide passed) and nucletideCounts which counts all of the respective occurrences of each nucleotide in the strand and returns a Map[Char,Int].
class DNA(strand:String) {
def count(nucleotide:Char): Int = {
strand.count(_ == nucleotide)
}
def nucleotideCounts = (
for {
n <- strand
c <- count(n)
} yield (n, c)
).toMap
}
The errors I am receiving are:
Error:(10, 17) value map is not a member of Int
c <- count(n)
^
Error:(12, 5) Cannot prove that Char <:< (T, U). ).toMap
^
Error:(12, 5) not enough arguments for method toMap: (implicit ev:
<:<[Char,(T, U)])scala.collection.immutable.Map[T,U]. Unspecified
value parameter ev. ).toMap
^
I am quite new to Scala, so any enlightenment on why these errors are occurring and suggestions to fixing them would be greatly appreciated.

for comprehensions work over Traversable's that have flatMap and map methods defined, as the error message is pointing out.
In your case count returns with a simple integer so no need to "iterate" over it, just simply add it to your result set.
for {
n <- strand
} yield (n, count(n))
On a side note this solution is not too optimal as in the case of a strand AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA count is going to be called many times. I would recommend calling toSet so you get the distinct Chars only:
for {
n <- strand.toSet
} yield (n, count(n))

In line with Akos's approach, consider a parallel traversal of a given strand (String),
strand.distinct.par.map( n => n -> count(n) )
Here we use distinct to gather unique items and construct each Map association in map.

A pipeline solution would look like:
def nucleotideCounts() = strand.groupBy(identity).mapValues(_.length)

Another approach is
Map() ++ {for (n <- strand; c = count(n)) yield n->c}
Not sure why it's different than {...}.toMap() but it gets the job done!
Another way to go is
Map() ++ {for (n <- strand; c <- Seq(count(n))) yield n->c}

SML - Find same elements in a string

Sorry but I need your help again!
I need a function which takes as input a
list of (string*string)
and returns as output a
list of (int*int)
The function should manipulate the list such that, every element that is represented by the same string has to be replaced by the same number.
For example if I insert:
changState([("s0","l0"),("l0","s1"),("s1","l1"),("l1","s0")]);
the output should be:
val it = [(0,1),(1,2),(2,3),(3,0)]
Is there someone who has an idea to solve this problem?
I would really appreciate that. Thanks a lot!

Here is one way:
structure StringKey =
struct
type ord_key = string
val compare = String.compare
end
structure Map = RedBlackMapFn(StringKey)
fun changState l =
let fun lp(l, map, max) =
case l
of nil => nil
| (s1,s2)::l' =>
case (Map.find(map, s1), Map.find(map, s2))
of (SOME i1, SOME i2) => (i1, i2)::lp(l', map, max)
| (NONE, SOME i) => (max, i)::lp(l', Map.insert(map, s1, max), max+1)
| (SOME i, NONE) => (i, max)::lp(l', Map.insert(map, s2, max), max+1)
| (NONE, NONE) =>
if s1 <> s2
then (max, max+1)::lp(l', Map.insert(Map.insert(map, s1, max), s2, max+1), max+2)
else (max, max)::lp(l', Map.insert(map, s1, max), max+1)
in lp(l, Map.empty, 0) end
Here lp takes the list of string pairs, a map which relates strings to integers, and a variable max, which keeps track of the next unused number. In each iteration, we lookup the two strings in the map. If they are found, return that integer, otherwise, use the next available integer. The last case, where neither string exists in the map, we need to check if the strings are equal, in case your input was [("s0", "s0")], we would expect [(0, 0)]. If they are equal, we map them to the same number, otherwise create distinct ones.
If you are not familiar with functors, the first 5 lines are creating a structure that satisfies the ORD_KEY signature. You can find more details in the documentation at
http://www.smlnj.org/doc/smlnj-lib/Manual/ord-map.html
http://www.smlnj.org/doc/smlnj-lib/Manual/ord-key.html#ORD_KEY:SIG:SPEC
By applying the ReBlackMapFn functor to the StringKey structure, it creates a new structure of a map (implemented as a red black tree) that maps strings to things of type 'a

Scala - modify strings in a list based on their number of occurences

Another Scala newbie question since I am not getting how to achieve this in a functional way (mostly coming from a scripting language background):
I have a list of strings:
val food-list = List("banana-name", "orange-name", "orange-num", "orange-name", "orange-num", "grape-name")
and where they are duplicated, I'd like to add an incrementing number into the string and get that in a list similar to the input list, like so:
List("banana-name", "orange1-name", "orange1-num", "orange2-name", "orange2-num", "grape-name")
I've grouped them up to get counts for them with:
val freqs = list.groupBy(identity).mapValues(v => List.range(1, v.length + 1))
Which gives me:
Map(orange-num -> List(1, 2), banana-name -> List(1), grape-name -> List(1), orange-name -> List(1, 2))
The order of the list is important (it should be in the original order of food-list) so I know it's problematic for me to use a Map at this point. The closest I feel I have gotten to a solution is:
food-list.map{l =>
if (freqs(l).length > 1){
freqs(l).map(n =>
l.split("-")(0) + n.toString + "-" + l.split("-")(1))
} else {
l
}
}
This of course gives me a wonky output since I am mapping the list of frequencies from the words value in freqs
List(banana-name, List(orange1-name, orange2-name), List(orange1-num, orange2-num), List(orange1-name, orange2-name), List(orange1-num, orange2-num), grape-name)
How is this done in a Scala fp way without resorting to clumsy for loops and counters?

If the indices are important, sometimes it's best to keep track of them explicitly using zipWithIndex (very similar to Python's enumerate):
food-list.zipWithIndex.groupBy(_._1).values.toList.flatMap{
//if only one entry in this group, don't change the values
//x is actually a tuple, could write case (str, idx) :: Nil => (str, idx) :: Nil
case x :: Nil => x :: Nil
//case where there are duplicate strings
case xs => xs.zipWithIndex.map {
//idx is index in the original list, n is index in the new list i.e. count
case ((str, idx), n) =>
//destructuring assignment, like python's (fruit, suffix) = ...
val Array(fruit, suffix) = str.split("-")
//string interpolation, returning a tuple
(s"$fruit${n+1}-$suffix", idx)
}
//We now have our list of (string, index) pairs;
//sort them and map to a list of just strings
}.sortBy(_._2).map(_._1)

Efficient and simple:
val food = List("banana-name", "orange-name", "orange-num",
"orange-name", "orange-num", "grape-name")
def replaceName(s: String, n: Int) = {
val tokens = s.split("-")
tokens(0) + n + "-" + tokens(1)
}
val indicesMap = scala.collection.mutable.HashMap.empty[String, Int]
val res = food.map { name =>
{
val n = indicesMap.getOrElse(name, 1)
indicesMap += (name -> (n + 1))
replaceName(name, n)
}
}

Here is an attempt to provide what you expected with foldLeft:
foodList.foldLeft((List[String](), Map[String, Int]()))//initial value
((a/*accumulator, list, map*/, v/*value from the list*/)=>
if (a._2.isDefinedAt(v))//already seen
(s"$v+${a._2(v)}" :: a._1, a._2.updated(v, a._2(v) + 1))
else
(v::a._1, a._2.updated(v, 1)))
._1/*select the list*/.reverse/*because we created in the opposite order*/

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Scala String Similarity - string

Related

Scala count chars in a string logical error

count number of chars in String

Scala Comprehension Errors

SML - Find same elements in a string

Scala - modify strings in a list based on their number of occurences

Categories

Resources