Longest common suffix - string

I would like to find the longest common suffix of two strings in Scala.
def longestSuffix(s1: String, s2: String) = {
val it = (s1.reverseIterator zip s2.reverseIterator) takeWhile {case (x, y) => x == y}
it.map (_._1).toList.reverse.mkString
}
This code is clumsy and probably inefficient (e.g. because of the reversing). How would you find the longest common suffix functionally, i.e. without mutable variables?

One way to improve it is to combine the reverse and the map into a single final operation:
str1.reverseIterator.zip(str2.reverseIterator).takeWhile( c => c._1 == c._2)
.toList.reverseMap(c => c._1) mkString ""
This first builds a list and then calls reverseMap on it, which reverses and maps in one step. (Note that reverseMap is deprecated as of Scala 2.13.)

We can iterate over substrings, without reverse:
def longestSuffix(s1: String, s2: String) = {
s1.substring(s1.length to 0 by -1 takeWhile { n => s2.endsWith(s1.substring(n)) } last)
}

Let tails produce the sub-strings and then return the first that fits.
def longestSuffix(s1: String, s2: String) =
s1.tails.dropWhile(!s2.endsWith(_)).next
Some efficiency might be gained by calling tails on the shorter of the two inputs.

I came up with a solution like this:
def commonSuffix(s1: String, s2: String): String = {
  val n = (s1.reverseIterator zip s2.reverseIterator) // mutable !
    .takeWhile { case (a, b) => a == b }
    .size
  s1.substring(s1.length - n) // is it efficient ?
}
Note that I am using substring for efficiency (I am not sure whether that is correct).
This solution is also not completely "functional": I use reverseIterator even though iterators are mutable, because I did not find another way to iterate over a string in reverse order. How would you suggest fixing or improving it?
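One possible way to avoid both the reversal and the iterator is to compare characters by index from the end, find the first offset at which the strings disagree, and take a single substring. This is only a sketch under that assumption; the names SuffixDemo and commonSuffix are chosen here for illustration:

```scala
object SuffixDemo {
  // Finds the first offset from the end (1-based) where the strings differ;
  // everything before that offset belongs to the common suffix.
  def commonSuffix(s1: String, s2: String): String = {
    val max = math.min(s1.length, s2.length)
    val n = (1 to max)
      .find(i => s1.charAt(s1.length - i) != s2.charAt(s2.length - i))
      .map(_ - 1)      // the offset just before the first mismatch
      .getOrElse(max)  // no mismatch: the shorter string is the suffix
    s1.substring(s1.length - n)
  }
}
```

Nothing is reversed and no intermediate list is built; the only allocation is the final substring.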

Related

Scala String Similarity

I have a Scala code that computes similarity between a set of strings and give all the unique strings.
val filtered = z.reverse.foldLeft((List.empty[String], z.reverse)) {
  case ((acc, zt), zz) =>
    // the outer parentheses make the result a pair: (accumulator, remaining tail)
    (if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc
     else zz :: acc, zt.tail)
}._1
I'll try to explain what is going on here :
This uses a fold over the reversed input data, starting from an empty list of Strings (to accumulate results) and the (reverse of the) remaining input data (to compare against - I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)
If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
The problem is this: if, say, the 1st, 4th and 8th strings are similar, I get only the 1st string. Instead, I should get the set (1st, 4th, 8th); then, if the 2nd, 5th, 14th and 21st strings are similar, I should get the set (2nd, 5th, 14th, 21st), and so on.
If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).
If so - I can't see a trivial change to your implementation that would achieve this, as the similar values are lost (when you enter the if(true) branch and just return the acc - you skip an item and you'll never "see" it again).
Two possible solutions I can think of:
Based on your idea, but using a 3-tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that has no preceding similar elements:
val filtered = z.reverse.foldLeft((List.empty[List[String]],z.reverse,List.empty[String])) {
case ((acc, zt, scanned), zz) =>
val hasSimilarPreceeding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
(if (hasSimilarPreceeding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1
A probably-slower but much simpler solution would be to just groupBy the group of similar strings:
val alternative = z.groupBy(s => z.collect {
case other if similarity(s, other) < threshold => other
}.toSet ).values.toList
All of this assumes that the function:
def f(a: String, b: String): Boolean = similarity(a, b) < threshold
is symmetric and transitive, i.e.:
f(a, b) && f(a, c) implies f(b, c)
f(a, b) if and only if f(b, a)
To test both implementations I used:
// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
And both options produce the same results:
List(List(aa, ab, a), List(c), List(e), List(fa, fb))
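A third variant, sketched here under the same symmetry and transitivity assumption, folds once over the input and appends each string to the first existing group whose first element is similar to it. The names GroupDemo and groupSimilar are made up for illustration; similarity and threshold are the test definitions from above:

```scala
object GroupDemo {
  // Test definitions from above: strings are similar if they share a first character.
  def similarity(s1: String, s2: String): Int = if (s1.head == s2.head) 0 else 100
  val threshold = 1

  // Each string joins the first group whose representative (its first element)
  // is similar; otherwise it starts a new group. Input order is preserved.
  def groupSimilar(z: List[String]): List[List[String]] =
    z.foldLeft(Vector.empty[Vector[String]]) { (groups, s) =>
      groups.indexWhere(g => similarity(g.head, s) < threshold) match {
        case -1 => groups :+ Vector(s)
        case i  => groups.updated(i, groups(i) :+ s)
      }
    }.map(_.toList).toList
}
```

Because only one representative per group is compared, this needs fewer similarity calls than the groupBy version, but it depends just as much on the relation being transitive.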

Scala Comprehension Errors

I am working on some of the exercism.io exercises. The current one is the Scala DNA exercise. Here is my code and the errors that I am receiving:
For reference, DNA is instantiated with a strand String. A DNA can call count (which counts the occurrences in the strand of the single nucleotide passed) and nucleotideCounts, which counts the occurrences of every nucleotide in the strand and returns a Map[Char,Int].
class DNA(strand:String) {
def count(nucleotide:Char): Int = {
strand.count(_ == nucleotide)
}
def nucleotideCounts = (
for {
n <- strand
c <- count(n)
} yield (n, c)
).toMap
}
The errors I am receiving are:
Error:(10, 17) value map is not a member of Int
c <- count(n)
^
Error:(12, 5) Cannot prove that Char <:< (T, U). ).toMap
^
Error:(12, 5) not enough arguments for method toMap: (implicit ev:
<:<[Char,(T, U)])scala.collection.immutable.Map[T,U]. Unspecified
value parameter ev. ).toMap
^
I am quite new to Scala, so any enlightenment on why these errors are occurring and suggestions to fixing them would be greatly appreciated.
A for comprehension is desugared into calls to flatMap and map, so the right-hand side of every generator (<-) must provide those methods. That is what the first error message points out: count(n) is an Int, and Int has no map.
In your case count returns a plain integer, so there is nothing to "iterate" over; simply put it into the result directly.
for {
n <- strand
} yield (n, count(n))
On a side note, this solution is not optimal: for a strand such as AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, count is called once per character, i.e. many times for the same nucleotide. I would recommend calling toSet so that you only visit the distinct Chars:
for {
n <- strand.toSet
} yield (n, count(n))
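Putting the pieces together, a corrected class might look like the sketch below. It uses distinct on a Seq[Char] rather than toSet, which is a substitution made here for illustration and not part of the original exercise code:

```scala
class DNA(strand: String) {
  def count(nucleotide: Char): Int = strand.count(_ == nucleotide)

  // Only one generator: count(n) is a plain Int, so it goes straight into the yield.
  def nucleotideCounts: Map[Char, Int] =
    (for (n <- strand.toSeq.distinct) yield n -> count(n)).toMap
}
```

The comprehension now desugars to a single map call on a sequence of Chars, so the yielded pairs satisfy the evidence that toMap asks for.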
In line with Akos's approach, consider a parallel traversal of a given strand (String),
strand.distinct.par.map( n => n -> count(n) )
Here we use distinct to gather unique items and construct each Map association in map.
A pipeline solution would look like:
def nucleotideCounts() = strand.groupBy(identity).mapValues(_.length)
Another approach is
Map() ++ {for (n <- strand; c = count(n)) yield n->c}
Not sure why it's different than {...}.toMap() but it gets the job done!
Another way to go is
Map() ++ {for (n <- strand; c <- Seq(count(n))) yield n->c}

Scala - modify strings in a list based on their number of occurrences

Another Scala newbie question since I am not getting how to achieve this in a functional way (mostly coming from a scripting language background):
I have a list of strings:
val foodList = List("banana-name", "orange-name", "orange-num", "orange-name", "orange-num", "grape-name")
and where they are duplicated, I'd like to add an incrementing number into the string and get that in a list similar to the input list, like so:
List("banana-name", "orange1-name", "orange1-num", "orange2-name", "orange2-num", "grape-name")
I've grouped them up to get counts for them with:
val freqs = foodList.groupBy(identity).mapValues(v => List.range(1, v.length + 1))
Which gives me:
Map(orange-num -> List(1, 2), banana-name -> List(1), grape-name -> List(1), orange-name -> List(1, 2))
The order of the list is important (it should be in the original order of foodList), so I know it's problematic for me to use a Map at this point. The closest I feel I have gotten to a solution is:
foodList.map{l =>
if (freqs(l).length > 1){
freqs(l).map(n =>
l.split("-")(0) + n.toString + "-" + l.split("-")(1))
} else {
l
}
}
This of course gives me wonky output, since for every duplicated word I map over its whole list of frequencies in freqs:
List(banana-name, List(orange1-name, orange2-name), List(orange1-num, orange2-num), List(orange1-name, orange2-name), List(orange1-num, orange2-num), grape-name)
How is this done in a Scala fp way without resorting to clumsy for loops and counters?
If the indices are important, sometimes it's best to keep track of them explicitly using zipWithIndex (very similar to Python's enumerate):
foodList.zipWithIndex.groupBy(_._1).values.toList.flatMap{
//if only one entry in this group, don't change the values
//x is actually a tuple, could write case (str, idx) :: Nil => (str, idx) :: Nil
case x :: Nil => x :: Nil
//case where there are duplicate strings
case xs => xs.zipWithIndex.map {
//idx is index in the original list, n is index in the new list i.e. count
case ((str, idx), n) =>
//destructuring assignment, like python's (fruit, suffix) = ...
val Array(fruit, suffix) = str.split("-")
//string interpolation, returning a tuple
(s"$fruit${n+1}-$suffix", idx)
}
//We now have our list of (string, index) pairs;
//sort them and map to a list of just strings
}.sortBy(_._2).map(_._1)
Efficient and simple:
val food = List("banana-name", "orange-name", "orange-num",
"orange-name", "orange-num", "grape-name")
def replaceName(s: String, n: Int) = {
val tokens = s.split("-")
tokens(0) + n + "-" + tokens(1)
}
val indicesMap = scala.collection.mutable.HashMap.empty[String, Int]
val res = food.map { name =>
{
val n = indicesMap.getOrElse(name, 1)
indicesMap += (name -> (n + 1))
replaceName(name, n)
}
}
Here is an attempt to provide what you expected with foldLeft:
foodList.foldLeft((List[String](), Map[String, Int]()))//initial value
((a/*accumulator, list, map*/, v/*value from the list*/)=>
if (a._2.isDefinedAt(v))//already seen
(s"$v+${a._2(v)}" :: a._1, a._2.updated(v, a._2(v) + 1))
else
(v::a._1, a._2.updated(v, 1)))
._1/*select the list*/.reverse/*because we created in the opposite order*/
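For comparison, here is a hedged sketch of a foldLeft that reproduces the output asked for in the question: it numbers only the strings that occur more than once and keeps the original order. It assumes every item has the form "name-suffix"; items without a hyphen are passed through unchanged. The names NumberDemo and numberDuplicates are invented for this example:

```scala
object NumberDemo {
  def numberDuplicates(items: List[String]): List[String] = {
    // How often each string occurs overall, to decide whether to number it at all.
    val totals = items.groupBy(identity).map { case (k, v) => k -> v.size }
    val (result, _) =
      items.foldLeft((Vector.empty[String], Map.empty[String, Int])) {
        case ((acc, seen), item) =>
          val n = seen.getOrElse(item, 0) + 1 // occurrence number of this item
          val renamed = item.split("-", 2) match {
            case Array(name, suffix) if totals(item) > 1 => s"$name$n-$suffix"
            case _                                       => item
          }
          (acc :+ renamed, seen.updated(item, n))
      }
    result.toList
  }
}
```

The immutable Map carried through the fold plays the role of the counter, so no mutable state is needed.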

Why does this recursive function fail with a stack overflow? How can I improve it?

I am trying to write a recursive function in Scala that concatenates a string n times.
My code is below:
def repeat(s: String, n: Int): String = {
if(n==1) s
else
s+repeat(s,n-1)
}
Is it possible that I did not use "+" properly? But the "+" is indeed a sign of concatenation, as I was originally trying
def repeat(s: String, n: Int): String = {
if(n==1) s
else
repeat(s+s,n-1)
}
That version repeats my string 2^(n-1) times.
@annotation.tailrec
final def repeat(s: String, n: Int, ss: String = ""): String = {
if (n == 0) ss else repeat(s, n - 1, ss + s)
}
repeat("test", 5)
Your first version is NOT tail recursive.
It has to call itself and then prepend s to the result. For tail recursion, the self-call must be the very last operation. This means the first version will blow the stack for large values of n.
The second version is tail recursive.
Put @annotation.tailrec before both definitions, and the compiler will report an error wherever it cannot perform tail-call optimisation.
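As an illustration, the accumulator can also be a StringBuilder, which avoids copying the partial result on every step. This is a sketch rather than the answer's own code: the names RepeatDemo and loop are made up, and the builder is mutable, although it never escapes the function:

```scala
import scala.annotation.tailrec

object RepeatDemo {
  def repeat(s: String, n: Int): String = {
    // The recursive call is the last operation, so @tailrec compiles it to a loop.
    @tailrec
    def loop(remaining: Int, acc: StringBuilder): String =
      if (remaining <= 0) acc.toString
      else loop(remaining - 1, acc.append(s))

    loop(n, new StringBuilder)
  }
}
```

With this shape, a call such as RepeatDemo.repeat("a", 100000) runs in constant stack space.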
A simple approach that bypasses recursion,
def repeat(s: String, n: Int) = s * n
Hence,
scala> repeat("mystring",3)
res0: String = mystringmystringmystring
This function may be rewritten into a recursive form for instance as follows,
def re(s: String, n: Int): String = n match {
case m if (m <= 0) => ""
case 1 => s
case m => s + re(s,n-1)
}
Note the uses of + and * operators on String.

How can I determine if a string is a concatenation of a list of strings

Suppose we are given a string S, and a list of some other strings L.
How can we know whether S is one of the possible concatenations of all the strings in L?
For example:
S = "abcdabce"
L = ["abcd", "a", "bc", "e"]
S is "abcd" + "a" + "bc" + "e", so S is a concatenation of L, whereas "ababcecd" is not.
In order to solve this question, I tried to use DFS/backtracking. The pseudo code is as follows:
boolean isConcatenation(S, L) {
if (L.length == 1 && S == L[0]) return true;
for (String s: L) {
if (S.startsWith(s)) {
markAsVisited(s);
if (isConcatenation(S.exclude(s), L.exclude(s)))
return true;
markAsUnvisited(s);
}
}
return false;
}
However, DFS/backtracking is not an efficient solution. I am curious what the fastest algorithm for this problem is, or whether there is any other algorithm that solves it faster. I hope there are algorithms like KMP that can solve it in O(n) time.
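For reference, the pseudocode above could be written in Scala roughly as follows. This is a sketch of the same backtracking idea, not a faster algorithm; it assumes every string in L must be used exactly once, and the object name Backtrack is invented here:

```scala
object Backtrack {
  def isConcatenation(s: String, parts: List[String]): Boolean = {
    // Try every distinct remaining part that is a prefix of the rest of s.
    def go(rest: String, remaining: List[String]): Boolean =
      if (remaining.isEmpty) rest.isEmpty
      else remaining.distinct.exists { p =>
        rest.startsWith(p) && go(rest.drop(p.length), remaining.diff(List(p)))
      }

    // Empty strings contribute nothing to a concatenation, so drop them up front.
    go(s, parts.filter(_.nonEmpty))
  }
}
```

Here diff(List(p)) removes a single occurrence of p, which corresponds to marking one list entry as visited.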
In python:
>>> import itertools
>>> yes = 'abcdabce'
>>> no = 'ababcecd'
>>> L = ['abcd','a','bc','e']
>>> yes in [''.join(p) for p in itertools.permutations(L)]
True
>>> no in [''.join(p) for p in itertools.permutations(L)]
False
edit: as pointed out, this is O(n!), so it is inappropriate for large L. But hey, development time under 10 seconds.
You can instead build your own permutation generator, starting with the basic permutator:
def all_perms(elements):
    if len(elements) <= 1:
        yield elements
    else:
        for perm in all_perms(elements[1:]):
            for i in range(len(elements)):
                yield perm[:i] + elements[0:1] + perm[i:]
And then discard branches that you don't care about by tracking what the concatenation of the elements would be and only iterating if it adds up to your target string.
def all_perms(elements, conc=''):
...
for perm in all_perms(elements[1:], conc + elements[0]):
...
if target.startswith(''.join(conc)):
...
A dynamic programming approach would be to work left to right, building up an array A[x] where A[x] is true if the first x characters of the string form one of the possible concatenations of L. You can work out A[n] from the earlier entries A[0..n-1] by checking each possible string in the list: if the characters of S ending at the nth character match a candidate string of length k and A[n-k] is true, then you can set A[n] to true.
I note that you can use https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm to find the matches you need as input to the dynamic program - the matching costs will be linear in the size of the input string, the total size of all candidate strings, and the number of matches between the input string and candidate strings.
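A minimal Scala sketch of this dynamic program follows. It uses plain startsWith checks instead of Aho-Corasick, and, like the recurrence as described, it does not enforce that each string of L is used exactly once, so it answers the weaker question of whether S can be built from strings in L with reuse allowed. The names ConcatDP, canBuild and reachable are assumptions made for the example:

```scala
object ConcatDP {
  def canBuild(s: String, parts: Seq[String]): Boolean = {
    val pieces = parts.filter(_.nonEmpty).distinct
    // reachable(i) corresponds to A[i]: the first i characters of s can be
    // written as a concatenation of pieces.
    val reachable = Array.fill(s.length + 1)(false)
    reachable(0) = true
    for (i <- 1 to s.length)
      reachable(i) = pieces.exists { p =>
        p.length <= i &&
        reachable(i - p.length) &&
        s.startsWith(p, i - p.length) // p matches the characters ending at position i
      }
    reachable(s.length)
  }
}
```

The running time is proportional to the length of S times the total length of the candidate strings, which is where a multi-pattern matcher could help.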
I would try the following:
find all positions at which a pattern L_i occurs in S
let n = length(S) + 1
create a graph with n nodes
for every match of L_i at position i, add directed edges: node_i --> (match node for L_i) --> node_{i+length(L_i)}
to enforce the permutation constraint you have to add some more nodes/edges that exclude using the same pattern more than once
now I can ask a new question: does a directed path from node 0 to node length(S) exist?
notes:
if there exists a node (0 < i < n) with degree < 2, then no match is possible
all nodes which have d- = 1, d+ = 1 are part of the permutation
use breadth-first search or Dijkstra to look for the solution
You can use the Trie data structure. First, construct a trie from strings in L.
Then, for the input string S, search for the S in the trie.
While searching, for every visited node that is the end of one of the words in L, start a new search on the trie (from the root) with the remaining (not yet matched) suffix of S. So we are using recursion. If you consume all characters of S in that process, then you know that S is a concatenation of some strings from L.
I would suggest this solution:
Take an array of size 256 that stores the occurrence count of each character over all strings of L. Now compare it with the count of each character of S. If the counts differ, we can confidently say that the strings of L cannot form the given string.
If the counts are the same, do the following: using the KMP algorithm, try to find each string of L in S simultaneously. Whenever there is a match, remove that string from L and continue searching for the other strings of L. If at any point we do not find a match, we report that S cannot be represented. If L is empty at the end, we conclude that S is indeed a concatenation of L.
Assuming that L is a set of unique strings.
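The counting step can be sketched in Scala as below, with a Map standing in for the 256-entry array (an assumption made here so that any Char works). The names CountCheck, charCounts and countsMatch are illustrative only:

```scala
object CountCheck {
  // Occurrence count of every character in a string.
  def charCounts(s: String): Map[Char, Int] =
    s.groupBy(identity).map { case (c, cs) => c -> cs.length }

  // Necessary condition: S and the strings of L together contain
  // exactly the same characters the same number of times.
  def countsMatch(s: String, parts: Seq[String]): Boolean =
    charCounts(s) == charCounts(parts.mkString)
}
```

Note that this is only a filter: "ababcecd" passes the count check against the example list even though it is not a concatenation, so the second step is still required.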
Two Haskell propositions:
There may be some counter examples to this...just for fun...sort L by a custom sort:
import Data.List (sortBy, isInfixOf)
h s l = (concat . sortBy wierd $ l) == s where
  wierd a b | isInfixOf (a ++ b) s = LT
            | isInfixOf (b ++ a) s = GT
            | otherwise = EQ
More boring...attempt to build S from L:
import Data.List (delete, isPrefixOf)
f s l = g s l [] where
  g str subs result
    | concat result == s = [result]
    | otherwise =
        if null str || null subs'
          then []
          else do sub <- subs'
                  g (drop (length sub) str) (delete sub subs) (result ++ [sub])
    where subs' = filter (flip isPrefixOf str) subs
Output:
*Main> f "abcdabce" ["abcd", "a", "bc", "e", "abc"]
[["abcd","a","bc","e"],["abcd","abc","e"]]
*Main> h "abcdabce" ["abcd", "a", "bc", "e", "abc"]
False
*Main> h "abcdabce" ["abcd", "a", "bc", "e"]
True
Your algorithm has complexity N^2 (N is the length of list). Let's see in actual C++
#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
using namespace std;
typedef pair<string::const_iterator, string::const_iterator> stringp;
typedef vector<string> strings;
bool isConcatenation(stringp S, const strings& L) {
    for (strings::const_iterator p = L.begin(); p != L.end(); ++p) {
        // Four-iterator overload (C++14): never reads past the end of S.
        auto M = mismatch(p->begin(), p->end(), S.first, S.second);
        if (M.first == p->end()) {
            // The last remaining string must consume the rest of S exactly.
            if (L.size() == 1 && M.second == S.second)
                return true;
            strings T;
            T.insert(T.end(), L.begin(), p);
            strings::const_iterator v = p;
            T.insert(T.end(), ++v, L.end());
            if (isConcatenation(make_pair(M.second, S.second), T))
                return true;
        }
    }
    return false;
}
Instead of looping over the entire vector, we could sort it and then reduce the search to O(log N) steps in the best case, where all strings start with different characters. The worst case will remain O(N^2).

Resources