How to find maximum overlap between two strings in Scala?

Suppose I have two strings: s and t. I need to write a function f that finds the longest prefix of t that is also a suffix of s. For example:
s = "abcxyz", t = "xyz123", f(s, t) = "xyz"
s = "abcxxx", t = "xx1234", f(s, t) = "xx"
How would you write it in Scala?

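This first solution is easily the most concise. It is also more efficient than a recursive version, because tails returns a lazily evaluated iterator. The .get is safe here: the last element of tails is the empty string, which every string starts with, so find always succeeds.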
s.tails.find(t.startsWith).get
There has been some discussion about whether tails ends up copying the whole string over and over. If that is a concern, you could call toList on s and then mkString the result:
s.toList.tails.find(t.startsWith(_: List[Char])).get.mkString
For some reason the type annotation is required to get it to compile. I haven't actually measured which one is faster.
UPDATE - OPTIMIZATION
As som-snytt pointed out, t cannot start with any string that is longer than it, and therefore we could make the following optimization:
s.drop(s.length - t.length).tails.find(t.startsWith).get
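A minimal sketch that wraps the optimized one-liner as the f from the question. The object name Overlap is an assumption made for illustration, and the explicit lambda is used only to sidestep any overload ambiguity with startsWith.

```scala
object Overlap {
  // Longest suffix of s that is also a prefix of t.
  def f(s: String, t: String): String =
    s.drop(s.length - t.length)              // a negative count drops nothing
      .tails                                 // suffixes, longest first, ending with ""
      .find(suffix => t.startsWith(suffix))  // first (longest) suffix that t starts with
      .getOrElse("")                         // fallback only; "" always matches
}
```

With the inputs from the question, Overlap.f("abcxyz", "xyz123") should give "xyz" and Overlap.f("abcxxx", "xx1234") should give "xx".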

Efficient, this is not, but it is a neat (IMO) one-liner.
val s = "abcxyz"
val t = "xyz123"
(s.tails.toSet intersect t.inits.toSet).maxBy(_.size)
//res8: String = xyz
(take all the suffixes of s that are also prefixes of t, and pick the longest)

If we only need to find the common overlapping part, we can recursively take the tail of the first string (which should overlap with the beginning of the second string) until the remaining part is something the second string begins with. This also covers the case when the strings have no overlap, because then the empty string is returned.
scala> def findOverlap(s:String, t:String):String = {
if (s == t.take(s.size)) s else findOverlap (s.tail, t)
}
findOverlap: (s: String, t: String)String
scala> findOverlap("abcxyz", "xyz123")
res3: String = xyz
scala> findOverlap("one","two")
res1: String = ""
UPDATE: It was pointed out that tail might not be implemented in the most efficient way (i.e. it creates a new string each time it is called). If that becomes an issue, then using substring(1) instead of tail (or converting both Strings to Lists, where tail / head should have O(1) complexity) might give better performance. By the same token, we can replace t.take(s.size) with t.substring(0, math.min(s.size, t.length)); the min is needed because substring, unlike take, throws when the end index exceeds the string length.
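One way to avoid the intermediate strings altogether is to recurse on an index instead of on s.tail. The sketch below is an assumption-laden variant, not the answer's original code: it swaps in java.lang.String.regionMatches for the comparison, and the object name FindOverlap is made up for illustration.

```scala
import scala.annotation.tailrec

object FindOverlap {
  def findOverlap(s: String, t: String): String = {
    // `from` marks where the candidate suffix of s starts; no substring is
    // created until the final result is known.
    @tailrec
    def loop(from: Int): String = {
      val len = s.length - from
      // regionMatches returns false (rather than throwing) when len > t.length,
      // and returns true for len == 0, so the recursion always terminates.
      if (t.regionMatches(0, s, from, len)) s.substring(from)
      else loop(from + 1)
    }
    loop(0)
  }
}
```

The behaviour should match the recursive version above, including the empty string for inputs with no overlap.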

Related

How to construct an array with multiple possible lengths using immutability and functional programming practices?

We're in the process of converting our imperative brains to a mostly-functional paradigm. This function is giving me trouble. I want to construct an array that EITHER contains two pairs or three pairs, depending on a condition (whether refreshToken is null). How can I do this cleanly using a FP paradigm? Of course with imperative code and mutation, I would just conditionally .push() the extra value onto the end which looks quite clean.
Is this an example of the "local mutation is ok" FP caveat?
(We're using ReadonlyArray in TypeScript to enforce immutability, which makes this somewhat more ugly.)
const itemsToSet = [
[JWT_KEY, jwt],
[JWT_EXPIRES_KEY, tokenExpireDate.toString()],
[REFRESH_TOKEN_KEY, refreshToken /*could be null*/]]
.filter(item => item[1] != null) as ReadonlyArray<ReadonlyArray<string>>;
AsyncStorage.multiSet(itemsToSet.map(roArray => [...roArray]));

What's wrong with itemsToSet as given in the OP? It looks functional to me, but it may be because of my lack of knowledge of TypeScript.
In Haskell, there's no null, but if we use Maybe for the second element, I think that itemsToSet could be translated to this:
itemsToSet :: [(String, String)]
itemsToSet = foldr folder [] values
  where
    values = [
      (jwt_key, jwt),
      (jwt_expires_key, tokenExpireDate),
      (refresh_token_key, refreshToken)]
    folder (key, Just value) acc = (key, value) : acc
    folder _ acc = acc
Here, jwt, tokenExpireDate, and refreshToken are all of the type Maybe String.
itemsToSet performs a right fold over values, pattern-matching the Maybe String elements against Just and (implicitly) Nothing. If it's a Just value, it conses the (key, value) pair onto the accumulator acc. If not, folder just returns acc.
foldr traverses the values list from right to left, building up the accumulator as it visits each element. The initial accumulator value is the empty list [].
You don't need 'local mutation' in functional programming. In general, you can refactor from 'local mutation' to proper functional style by using recursion and introducing an accumulator value.
While foldr is a built-in function, you could implement it yourself using recursion.
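Since most of this page is Scala, here is a rough Scala rendering of the same idea: a hand-written right fold built from plain recursion and an accumulator, then used to drop the pairs whose value is missing. The names Fold, foldRight and itemsToSet are illustrative assumptions, and Option stands in for Haskell's Maybe.

```scala
object Fold {
  // Hand-written right fold over a List (not stack-safe for very long lists).
  def foldRight[A, B](xs: List[A], z: B)(f: (A, B) => B): B = xs match {
    case Nil    => z
    case h :: t => f(h, foldRight(t, z)(f))
  }

  // Keep only the pairs whose value is present; no mutation involved.
  def itemsToSet(values: List[(String, Option[String])]): List[(String, String)] =
    foldRight(values, List.empty[(String, String)]) {
      case ((key, Some(value)), acc) => (key, value) :: acc
      case (_, acc)                  => acc
    }
}
```

Passing a list whose last pair holds None should return just the first two pairs, in their original order.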

In Haskell, I'd just create an array with three elements and, depending on the condition, pass it on either as-is or pass on just a slice of two elements. Thanks to laziness, no computation effort will be spent on the third element unless it's actually needed. In TypeScript, you probably will get the cost of computing the third element even if it's not needed, but perhaps that doesn't matter.
Alternatively, if you don't need the structure to be an actual array (for String elements, performance probably isn't that critical, and the O(n) direct-access cost isn't an issue if the length is limited to three elements), I'd use a singly-linked list instead. Create the list with two elements and, depending on the condition, prepend the third. This does not require any mutation: the 3-element list simply contains the unchanged 2-element list as a substructure.
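Structural sharing is easiest to see when the extra element is added at the head of an immutable list. The Scala sketch below uses placeholder values and the made-up names Sharing, base and items; it only illustrates that the longer list reuses the shorter one rather than copying it.

```scala
object Sharing {
  type Pair = (String, String)

  // Placeholder values; only the list structure matters here.
  val base: List[Pair] =
    List("JWT_KEY" -> "jwt", "JWT_EXPIRES_KEY" -> "1700000000")

  def items(refreshToken: Option[String]): List[Pair] =
    refreshToken match {
      // One new head cell is allocated; the existing list becomes its tail.
      case Some(token) => ("REFRESH_TOKEN_KEY" -> token) :: base
      case None        => base
    }
}
```

Reference equality (eq) between Sharing.items(Some("r")).tail and Sharing.base confirms that nothing was copied or mutated.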

Based on the description, I don't think arrays are the best solution simply because you know ahead of time that they contain either 2 values or 3 values depending on some condition. As such, I would model the problem as follows:
type alias Pair = (String, String)
type TokenState
    = WithoutRefresh (Pair, Pair)
    | WithRefresh (Pair, Pair, Pair)
itemsToTokenState : String -> Date -> Maybe String -> TokenState
itemsToTokenState jwtKey jwtExpiry maybeRefreshToken =
    case maybeRefreshToken of
        Just refreshToken ->
            WithRefresh (("JWT_KEY", jwtKey), ("JWT_EXPIRES_KEY", toString jwtExpiry), ("REFRESH_TOKEN_KEY", refreshToken))
        Nothing ->
            WithoutRefresh (("JWT_KEY", jwtKey), ("JWT_EXPIRES_KEY", toString jwtExpiry))
This way you leverage the type system more effectively. It could be improved further by returning something more ergonomic than tuples.

takeRightWhile() method in scala

I might be missing something but recently I came across a task to get last symbols according to some condition. For example I have a string: "this_is_separated_values_5". Now I want to extract 5 as Int.
Note: number of parts separated by _ is not defined.
If I had a method takeRightWhile(f: Char => Boolean) on a string, it would be trivial: takeRightWhile(ch => ch != '_'). Moreover, it would be efficient: a straightforward implementation would involve finding the last index of _ and then taking a substring, while this method would save the first step and provide better average time complexity.
UPDATE: Guys, all the variations of str.reverse.takeWhile(_!='_').reverse are quite inefficient, as they use additional O(n) space. To implement takeRightWhile efficiently, you could iterate starting from the right, accumulating the result in a string builder or whatever else, and return the result. I am asking about this kind of method, not the implementation that was already described and declined in the question itself.
Question: Does this kind of method exist in the Scala standard library? If not, is there a combination of standard library methods that achieves the same in a minimal number of lines?
Thanks in advance.

Possible solution:
str.reverse.takeWhile(_!='_').reverse
Update
You can go from right to left with following expression using foldRight:
str.toList.foldRight(List.empty[Char]) {
case (item, acc) => item::acc
}
Here you need to check the condition and stop adding items once it is met. For this you can pass a flag along with the accumulated value:
val (_, list) = str.toList.foldRight((false, List.empty[Char])) {
case (item, (false, list)) if item!='_' => (false, item::list)
case (_, (_, list)) => (true, list)
}
val res = list.mkString.toInt
This solution is even less efficient than the solution with the double reverse, for two reasons:
The implementation of foldRight uses a combination of List reverse and foldLeft.
You cannot break out of foldRight, so you need a flag to skip all items after the condition is met.

I'd go with this:
val s = "string_with_following_number_42"
s.split("_").reverse.head
// res:String = 42
This is a naive attempt and by no means optimized. It splits the String into an Array of Strings, reverses it and takes the first element. Note that, because the reversing happens after the splitting, the order of the characters is correct.

I am not exactly sure about the problem you are facing. My understanding is that you have a string of the format xxx_xxx_xx_...._xxx_123 and you want to extract the part at the end as an Int.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val yourInt = yourStr.split('_').last.toInt
// But remember that the above is unsafe so you may want to take it as Option
val yourIntOpt = Try(yourStr.split('_').last.toInt).toOption
Or... let's say your requirement is to collect the right-hand suffix for as long as some boolean condition remains true.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val rightSuffix = yourStr.reverse.takeWhile(c => c != '_').reverse
val yourInt = rightSuffix.toInt
// but above is unsafe so
val yourIntOpt = Try(rightSuffix.toInt).toOption
Comment if your requirement is different from this.

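You can use StringBuilder and lastIndexWhere.
val str = "this_is_separated_values_5"
val sb = new StringBuilder(str)
// index of the last separator, or -1 if there is none
val lastIdx = sb.lastIndexWhere(ch => ch == '_')
// everything after the last separator
val suffix = str.substring(lastIdx + 1)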
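The right-to-left scan described in the question's update can be sketched as a small helper. Here takeRightWhile is a hypothetical standalone function written for illustration (no claim that the standard library offers it under this name), and RightOps is a made-up container object.

```scala
object RightOps {
  // Hypothetical helper: the longest suffix of str whose characters all satisfy p.
  // It walks left from the end and allocates only the final substring.
  def takeRightWhile(str: String)(p: Char => Boolean): String = {
    var i = str.length
    while (i > 0 && p(str.charAt(i - 1))) i -= 1
    str.substring(i)
  }
}
```

For the string in the question, RightOps.takeRightWhile("this_is_separated_values_5")(_ != '_').toInt should yield 5. Wrapping the toInt call in scala.util.Try, as in the earlier answer, would guard against non-numeric suffixes.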

Scala string manipulation

I have the following Scala code :
val res = for {
i <- 0 to 3
j <- 0 to 3
if (similarity(z(i),z(j)) < threshold) && (i<=j)
} yield z(j)
z here represents Array[String] and similarity(z(i),z(j)) calculates similarity between two strings.
The loop works like this: similarity is calculated between the 1st string and all the other strings, then between the 2nd string and all the other strings except the first, then for the 3rd string, and so on.
My requirement is that if 1st string matches with 3rd, 4th and 8th string, then
all these 3 strings shouldn't participate in loops further and loop should jump to 2nd string, then 5th string, 6th string and so on.
I am stuck at this step and don't know how to proceed further.

I am presuming that your intent is to keep the first String of two similar Strings (eg. if 1st String is too similar to 3rd, 4th, and 8th Strings, keep only the 1st String [out of these similar strings]).
I have a couple of ways to do this. They both work, in a sense, in reverse: for each String, if it is too similar to any later Strings, then that current String is filtered out (not the later Strings). If you first reverse the input data before applying this process, you will find that the desired outcome is produced (although in the first solution below the resulting list is itself reversed - so you can just reverse it again, if order is important):
1st way (likely easier to understand):
def filterStrings(z: Array[String]) = {
val revz = z.reverse
val filtered = for {
i <- 0 until revz.length if !revz.drop(i+1).exists(zz => similarity(zz, revz(i)) < threshold)
} yield revz(i)
filtered.reverse // re-reverses output if order is important
}
The 'drop' call is to ensure that each String is only checked against later Strings.
2nd option (fully functional, but harder to follow):
val filtered = z.reverse.foldLeft((List.empty[String],z.reverse)) { case ((acc, zt), zz) =>
(if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc else zz :: acc, zt.tail)
}._1
I'll try to explain what is going on here (in case you - or any readers - aren't used to following folds):
This uses a fold over the reversed input data, starting from the empty List (to accumulate results) and the (reverse of the) remaining input data (to compare against - I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)
If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.

If I understand correctly, you want to loop through the elements of the array, comparing each element to later elements, and removing ones that are too similar as you go.
You can't (easily) do this within a simple loop. You'd need to keep track of which items had been filtered out, which would require another array of booleans, which you update and test against as you go. It's not a bad approach and is efficient, but it's not pretty or functional.
So you need to use a recursive function, and this kind of thing is best done using an immutable data structure, so let's stick to List.
def removeSimilar(xs: List[String]): List[String] = xs match {
case Nil => Nil
case y :: ys => y :: removeSimilar(ys filter {x => similarity(y, x) < threshold})
}
It's a simple recursive function. Not much to explain: if xs is empty, it returns the empty list; otherwise it prepends the head of the list to the result of applying the function to the filtered tail.
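For comparison, the boolean-array bookkeeping mentioned above might look roughly like this. The predicate similar is a hypothetical stand-in for whatever test the real code uses (for example similarity(a, b) compared against threshold), and Dedup and keepFirstOfSimilar are made-up names.

```scala
object Dedup {
  // Keeps the first string of every group of mutually "similar" strings.
  // `similar` is a hypothetical predicate supplied by the caller.
  def keepFirstOfSimilar(z: Array[String])(similar: (String, String) => Boolean): Vector[String] = {
    val removed = Array.fill(z.length)(false) // removed(j): z(j) matched an earlier kept string
    val kept = Vector.newBuilder[String]
    for (i <- z.indices) {
      if (!removed(i)) {
        kept += z(i)
        // Mark every later string that matches z(i) so it is skipped from now on.
        for (j <- i + 1 until z.length) {
          if (similar(z(i), z(j))) removed(j) = true
        }
      }
    }
    kept.result()
  }
}
```

The mutation stays local to the function, so callers still see a pure function from an array to a Vector.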

Suffix array beginning using scala

Today I am trying to create suffix arrays using scala. I was able to do it with massive lines of code but then I heard that it can be created by using only few lines by using zipping and sorting.
The problem I have at the moment is with the beginning. I tried using binary search and zipWithIndex to create the following "tree" but so far I haven't been able to create anything. I don't even know if it is possible by only using a line but I bet it is lol.
What I want to get from the word "cheesecake" is a Seq:
Seq((cheesecake, 0),
(heesecake, 1),
(eesecake, 2),
(esecake, 3),
(secake, 4),
(ecake, 5),
(cake, 6),
(ake, 7),
(ke, 8),
(e, 9))
Could someone nudge me to the correct path ?

To generate all the possible postfixes of a String (or any other scala.collection.TraversableLike) you can simply use the tails method:
scala> "cheesecake".tails.toList
res25: List[String] = List(cheesecake, heesecake, eesecake, esecake, secake, ecake, cake, ake, ke, e, "")
If you need the indexes, then you can use GenIterable.zipWithIndex:
scala> "cheesecake".tails.toList.zipWithIndex
res0: List[(String, Int)] = List((cheesecake,0), (heesecake,1), (eesecake,2), (esecake,3), (secake,4), (ecake,5), (cake,6), (ake,7), (ke,8), (e,9), ("",10))
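The output above still contains the empty suffix, which the question's expected Seq does not. A small sketch that drops it and then sorts the pairs to obtain the suffix array itself; the names Suffixes, indexedSuffixes and suffixArray are assumptions for illustration, and sorting whole suffixes like this is simple rather than efficient.

```scala
object Suffixes {
  // All non-empty suffixes paired with their start index.
  // Only the final tail is empty, so filtering does not shift the indexes.
  def indexedSuffixes(s: String): List[(String, Int)] =
    s.tails.filter(_.nonEmpty).zipWithIndex.toList

  // Start indexes of the suffixes in lexicographic order.
  def suffixArray(s: String): List[Int] =
    indexedSuffixes(s).sortBy(_._1).map(_._2)
}
```

Suffixes.indexedSuffixes("cheesecake") should run from (cheesecake,0) to (e,9), matching the Seq asked for in the question.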

You're looking for the .scan methods, specifically .scanRight, since you want to start building from the end (i.e. the right side) of the string, prepending the next character (look at your pyramid from bottom to top).
Quoting the documentation :
Produces a collection containing cumulative results of applying the
operator going right to left.
Here the operator is :
Prepend the current character
Decrement the counter (since your first element is "cheesecake".length, counting down)
So :
scala> s.scanRight (List[(String, Int)]())
{ case (char, (stringAcc, count)::tl) => (char + stringAcc, count-1)::tl
case (c, Nil) => List((c.toString, s.length-1))
}
.dropRight(1)
.map(_.head)
res12: scala.collection.immutable.IndexedSeq[List[(String, Int)]] =
Vector((cheesecake,0),
(heesecake,1),
(eesecake,2),
(esecake,3),
(secake,4),
(ecake,5),
(cake,6),
(ake,7),
(ke,8),
(e,9)
)
The dropRight(1) at the end removes the (List[(String, Int)]()) (the first argument), which serves as the first element on which to start building (you could pass the last e of your string and iterate on cheesecak, but I find it easier to do it this way).

One approach,
"cheesecake".reverse.inits.map(_.reverse).zipWithIndex.toArray
Scala strings are equipped with ordered collection methods such as reverse and inits; the latter delivers a collection of strings where each string has dropped the last character of the previous one.

EDIT - From a previous suffix question that I posted (from a Purely Functional Data Structures exercise), I believe that suffix should/may include the empty list too, i.e. "" for String.
scala> def suffix(x: String): List[String] = x.toList match {
| case Nil => Nil
| case xxs @ (_ :: xs) => xxs.mkString :: suffix(xs.mkString)
| }
suffix: (x: String)List[String]
scala> def f(x: String): List[(String, Int)] = suffix(x).zipWithIndex
f: (x: String)List[(String, Int)]
Test
scala> f("cheesecake")
res10: List[(String, Int)] = List((cheesecake,0), (heesecake,1), (eesecake,2),
(esecake,3), (secake,4), (ecake,5), (cake,6), (ake,7), (ke,8), (e,9))

Is there a better way to write a "string contains X" method?

Just started using Haskell and realized (as far as I can tell) there is no direct way to check whether a string contains a smaller string. So I figured I'd just take a shot at it.
Essentially the idea was to check if the two strings were the same size and were equal. If the string being checked was longer, recursively lop off the head and run the check again until the string being checked was the same length.
I used pattern matching to handle the rest of the possibilities. This is what I came up with:
stringExists "" wordToCheckAgainst = False
stringExists wordToCheckFor "" = False
stringExists wordToCheckFor wordToCheckAgainst
    | length wordToCheckAgainst < length wordToCheckFor = False
    | length wordToCheckAgainst == length wordToCheckFor = wordToCheckAgainst == wordToCheckFor
    | take (length wordToCheckFor) wordToCheckAgainst == wordToCheckFor = True
    | otherwise = stringExists wordToCheckFor (tail wordToCheckAgainst)

If you search Hoogle for the signature of the function you're looking for (String -> String -> Bool) you should see isInfixOf among the top results.

isInfixOf from Data.List will surely solve the problem, however in case of longer haystacks or perverse¹ needles you should consider more advanced string matching algorithms with a much better average and worst case complexity.
¹ Consider a really long string consisting only of a's and a needle with a lot of a's at the beginning and one b at the end.
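To tie this back to the Scala questions on this page: the naive search amounts to asking whether the needle is a prefix of some suffix of the haystack. A Scala sketch of that idea follows; the name isInfixOf is borrowed from Haskell and Infix is a made-up object. In real Scala code, String.contains already covers this.

```scala
object Infix {
  // Naive search: the needle occurs in the haystack exactly when it is a
  // prefix of one of the haystack's suffixes. Worst case is roughly
  // haystack.length * needle.length character comparisons.
  def isInfixOf(needle: String, haystack: String): Boolean =
    haystack.tails.exists(_.startsWith(needle))
}
```

The pathological input from the footnote (a long run of a's, with a needle of a's ending in b) forces nearly every startsWith call to scan the whole needle before failing.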

Consider using the text package (text on Hackage, now also part of the Haskell Platform) for your text-processing needs. It provides a Unicode text type, which is more time- and space-efficient than the built-in list-based String. For string search, the text package implements a Boyer-Moore-based algorithm, which has better complexity than the naïve method used by Data.List.isInfixOf.
Usage example:
Prelude> :set -XOverloadedStrings
Prelude> import qualified Data.Text as T
Prelude Data.Text> T.breakOnAll "abc" "defabcged"
[("def","abcged")]
Prelude Data.Text> T.isInfixOf "abc" "defabcged"
True
