Scala: replace char at position i in String - string

I have an initial String (binary) looking like this :
val mask = "00000000000000000000000000000000" of length 32
Additionally, I have a list of positions i (0 <= i <= 31) at which I want the mask to have value 1.
For instance List(0,12,30,4) should give the following result :
mask = "10001000000010000000000000000010"
How can I do this efficiently in scala ?
Thank you

A naive approach would be to fold over the positions with zero element 'mask' and successively update the char at the given position:
List(0,12,30,4).foldLeft(mask)((s, i) => s.updated(i, '1'))
- Daniel

Unfortunately the most efficient way I can think of to do this is the same as you'd do it in any other (not functional) programming language:
val mask = "00000000000000000000000000000000"
val l = List(0, 12, 30, 4)
val sb = new StringBuilder(mask)
for (i <- l) { sb(i) = '1' }
println(sb.toString)
This should actually be more efficient than Daniel answer, but I'd prefer Daniel's due to clarity. Still you've asked for the most efficient way
Updated
Ok, I think this should be more or less efficient and FP-style - the trick is to use views:
val view : SeqView[Char, Seq[_]] = (mask: Seq[Char]).view
println(List(0,12,30,4).foldLeft(view)((s, i) => s.updated(i, '1')).mkString)

I know this isn't exactly what you asked—but I have to wonder why you're using a String. There's a very efficient data structure for storing this type of information, called a BitSet.
If you're using a BitSet, then setting the bits corresponding to a list of integers is trivial.
If you want a mutable BitSet:
scala.collection.mutable.BitSet.empty ++= List(0,12,30,4)
If you want an immutable BitSet:
scala.collection.immutable.BitSet.empty ++ List(0,12,30,4)

Related

takeRightWhile() method in scala

I might be missing something but recently I came across a task to get last symbols according to some condition. For example I have a string: "this_is_separated_values_5". Now I want to extract 5 as Int.
Note: number of parts separated by _ is not defined.
If I would have a method takeRightWhile(f: Char => Boolean) on a string it would be trivial: takeRightWhile(ch => ch != '_'). Moreover it would be efficient: a straightforward implementation would actually involve finding the last index of _ and taking a substring while the use of this method would save first step and provide better average time complexity.
UPDATE: Guys, all the variations of str.reverse.takeWhile(_!='_').reverse are quite inefficient as you actually use additional O(n) space. If you want to implement method takeRightWhile efficiently you could iterate starting from the right, accumulating result in string builder of whatever else, and returning the result. I am asking about this kind of method, not implementation which was already described and declined in the question itself.
Question: Does this kind of method exist in scala standard library? If no, is there method combination from the standard library to achieve the same in minimum amount of lines?
Thanks in advance.
Possible solution:
str.reverse.takeWhile(_!='_').reverse
Update
You can go from right to left with following expression using foldRight:
str.toList.foldRight(List.empty[Char]) {
case (item, acc) => item::acc
}
Here you need to check condition and stop adding items after condition met. For this you can pass a flag to accumulated value:
val (_, list) = str.toList.foldRight((false, List.empty[Char])) {
case (item, (false, list)) if item!='_' => (false, item::list)
case (_, (_, list)) => (true, list)
}
val res = list.mkString.toInt
This solution is even more inefficient then solution with double reverse:
Implementation of foldRight uses combination of List reverse and foldLeft
You cannot break foldRight execution, so you need flag to skip all items after condition met
I'd go with this:
val s = "string_with_following_number_42"
s.split("_").reverse.head
// res:String = 42
This is a naive attempt and by no means optimized. What it does is splitting the String into an Array of Strings, reverses it and takes the first element. Note that, because the reversing happens after the splitting, the order of the characters is correct.
I am not exactly sure about the problem you are facing. My understanding is that you want have a string of format xxx_xxx_xx_...._xxx_123 and you want to extract the part at the end as Int.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val yourInt = yourStr.split('_').last.toInt
// But remember that the above is unsafe so you may want to take it as Option
val yourIntOpt = Try(yourStr.split('_').last.toInt).toOption
Or... lets say your requirement is to collect a right-suffix till some boolean condition remains true.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val rightSuffix = yourStr.reverse.takeWhile(c => c != '_').reverse
val yourInt = rightSuffix.toInt
// but above is unsafe so
val yourIntOpt = Try(righSuffix.toInt).toOption
Comment if your requirement is different from this.
You can use StringBuilder and lastIndexWhere.
val str = "this_is_separated_values_5"
val sb = new StringBuilder(str)
val lastIdx = sb.lastIndexWhere(ch => ch != '_')
val lastCh = str.charAt(lastIdx)

Understanding String in Scala and the map method

I wrote the following simple example to understand how the map method works:
object Main{
def main (args : Array[String]) = {
val test = "abc"
val t = Vector(97, 98, 99)
println(test.map(c => (c + 1))) //1 Vector(98, 99, 100)
println(test.map(c => (c + 1).toChar)) //2 bcd
println(t.map(i => (i + 1))) //3 Vector(98, 99, 100)
println(t.map(i => (i + 1).toChar)) //4 Vector(b, c, d)
};
}
I didn't quite understand why bcd is printed at //2. Since every String is treated by Scala as being a Seq I thought that test.map(c => (c + 1).toChar) should have produced another Seq. As //1 suggests Vector(b, c, d). But as you can see, it didn't. Why? How does it actually work?
This is a feature of Scala collections (String in this case is treated as a collection of characters). The real explanation is quite complex, and involves understanding of typeclasses (I guess, this is why Haskell was mentioned in the comment), but the simple explanation is, well, not quite hard.
The point is, Scala collections library authors tried very hard to avoid code duplication. For example, the map function on String is actually defined here: scala.collection.TraversableLike#map. On the other hand, a naive approach to such task would make map return TraversableLike, not the original type the map was called on (it was the String). That's why they've came up with an approach that allows to avoid both code duplication and unnecessary type casting or too general return type.
Basically, Scala collections methods like map produce the type that is as close to the type it was called at as possible. This is achieved using a typeclass called CanBuildFrom. The full signature of the map looks as follows:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That
There is a lot of explanations what is a typeclass and CanBuildFrom around. I'd suggest looking here first: http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html#factoring-out-common-operations. Another good explanation is here: Scala 2.8 CanBuildFrom
When you use map, this is what is happening : [List|Seq|etc].map([eachElement] => [do something])
map applies some operation to each element of the variable on the left hand-side : "abc".map(letter => letter + 1) will add 1 to each element of the String "abc". And each element of the String abc is called here "letter" (which is of type Char)
"abc" is a String, and as in C++, it is treated as an array of characters. But since test is of type String, the map function gives a String as well.
I tried the following :
val test2 : Seq[Char] = "abc"
but I still get a result of type String, I guess Scala does the conversion automatically from a Seq[Char] to a String
I hope it helped!

Matlab - dynamically produce string for anonymous function

I am trying to produce the argument string for an anonymous function based on the number of input arguments without using for loops. For example, if N=3, then I want a string that reads
#(ax(1),ax(2),ax(3),ay(1),ay(2),ay(3))
I tried using repmat('ax',1,N) but I cannot figure out how to interleave the (i) index.
Any ideas?
Aside: Great answers so far, the above problem has been solved. To provide some intuition for those who are wondering why I want to do this: I need to construct a very large matrix anonymous function (a Jacobian) on the order of 3000x3000. I initially used the Matlab operations jacobian and matlabFunction to construct the anonymous function; however, this was quite slow. Instead, since the closed form of the derivative was quite simple, I decided to form the anonymous function directly. This was done by forming the symbolic Jacobian matrix, J, then appending it to the above #() string by using char(J{:})' and using eval to form the final anonymous function. This may not be the most elegant solution but I find it runs much faster than the jacobian/matlabFunction combination, especially for large N (additionally the structure of the new approach allows for the evaluation to be done in parallel).
EDIT: Just for completeness, the correct form of the argument string for the anonymous function should read
#(ax1,ax2,ax3,ay1,ay2,ay3)
to avoid a syntax error associated with indexing.
I suggest the following:
N = 3;
argumentString = [repmat('ax(%i),',1,N),repmat('ay(%i),',1,N)];
functionString = sprintf(['#(',argumentString(1:end-1),')'], 1:N, 1:N)
First, you create input masks for sprintf (e.g. 'ax(%i)'), which you then fill in with the appropriate numbers to create the function string.
Note: the syntax #(ax(1),...) will not actually work. More likely, you want to use either #()someFunction(ax(1),...), or you are trying to pass multiple input arguments to an existing function, in which case storing the inputs in a cell array and calling the function as fun(axCell{:}) would work.
A solution would be to use arrayfun:
sx = strjoin(arrayfun(#(x) ['ax(' num2str(x) ')'], 1:3, 'UniformOutput', false), ',');
sy = strjoin(arrayfun(#(x) ['ay(' num2str(x) ')'], 1:3, 'UniformOutput', false), ',');
s = ['#(' sx ',' sy ')'];
contains
'#(ax(1),ax(2),ax(3),ay(1),ay(2),ay(3))'
Best,
Try this:
N = 3;
sx = strcat('ax(', arrayfun(#num2str, 1:N, 'uniformoutput', 0), '),');
sy = strcat('ay(', arrayfun(#num2str, 1:N, 'uniformoutput', 0), '),');
str = [sx{:} sy{:}];
str = ['#(' str(1:end-1) ')']

How to find maximum overlap between two strings in Scala?

Suppose I have two strings: s and t. I need to write a function f to find a max. t prefix, which is also an s suffix. For example:
s = "abcxyz", t = "xyz123", f(s, t) = "xyz"
s = "abcxxx", t = "xx1234", f(s, t) = "xx"
How would you write it in Scala ?
This first solution is easily the most concise, also it's more efficient than a recursive version as it's using a lazily evaluated iteration
s.tails.find(t.startsWith).get
Now there has been some discussion regarding whether tails would end up copying the whole string over and over. In which case you could use toList on s then mkString the result.
s.toList.tails.find(t.startsWith(_: List[Char])).get.mkString
For some reason the type annotation is required to get it to compile. I've not actually trying seeing which one is faster.
UPDATE - OPTIMIZATION
As som-snytt pointed out, t cannot start with any string that is longer than it, and therefore we could make the following optimization:
s.drop(s.length - t.length).tails.find(t.startsWith).get
Efficient, this is not, but it is a neat (IMO) one-liner.
val s = "abcxyz"
val t ="xyz123"
(s.tails.toSet intersect t.inits.toSet).maxBy(_.size)
//res8: String = xyz
(take all the suffixes of s that are also prefixes of t, and pick the longest)
If we only need to find the common overlapping part, then we can recursively take tail of the first string (which should overlap with the beginning of the second string) until the remaining part will not be the one that second string begins with. This also covers the case when the strings have no overlap, because then the empty string will be returned.
scala> def findOverlap(s:String, t:String):String = {
if (s == t.take(s.size)) s else findOverlap (s.tail, t)
}
findOverlap: (s: String, t: String)String
scala> findOverlap("abcxyz", "xyz123")
res3: String = xyz
scala> findOverlap("one","two")
res1: String = ""
UPDATE: It was pointed out that tail might not be implemented in the most efficient way (i.e. it creates a new string when it is called). If that becomes an issue, then using substring(1) instead of tail (or converting both Strings to Lists, where it's tail / head should have O(1) complexity) might give a better performance. And by the same token, we can replace t.take(s.size) with t.substring(0,s.size).

Data Structure for Subsequence Queries

In a program I need to efficiently answer queries of the following form:
Given a set of strings A and a query string q return all s ∈ A such that q is a subsequence of s
For example, given A = {"abcdef", "aaaaaa", "ddca"} and q = "acd" exactly "abcdef" should be returned.
The following is what I have considered considered so far:
For each possible character, make a sorted list of all string/locations where it appears. For querying interleave the lists of the involved characters, and scan through it looking for matches within string boundaries.
This would probably be more efficient for words instead of characters, since the limited number of different characters will make the return lists very dense.
For each n-prefix q might have, store the list of all matching strings. n might realistically be close to 3. For query strings longer than that we brute force the initial list.
This might speed things up a bit, but one could easily imagine some n-subsequences being present close to all strings in A, which means worst case is the same as just brute forcing the entire set.
Do you know of any data structures, algorithms or preprocessing tricks which might be helpful for performing the above task efficiently for large As? (My ss will be around 100 characters)
Update: Some people have suggested using LCS to check if q is a subsequence of s. I just want to remind that this can be done using a simple function such as:
def isSub(q,s):
i, j = 0, 0
while i != len(q) and j != len(s):
if q[i] == s[j]:
i += 1
j += 1
else:
j += 1
return i == len(q)
Update 2: I've been asked to give more details on the nature of q, A and its elements. While I'd prefer something that works as generally as possible, I assume A will have length around 10^6 and will need to support insertion. The elements s will be shorter with an average length of 64. The queries q will only be 1 to 20 characters and be used for a live search, so the query "ab" will be sent just before the query "abc". Again, I'd much prefer the solution to use the above as little as possible.
Update 3: It has occurred to me, that a data-structure with O(n^{1-epsilon}) lookups, would allow you to solve OVP / disprove the SETH conjecture. That is probably the reason for our suffering. The only options are then to disprove the conjecture, use approximation, or take advantage of the dataset. I imagine quadlets and tries would do the last in different settings.
It could done by building an automaton. You can start with NFA (nondeterministic finite automaton which is like an indeterministic directed graph) which allows edges labeled with an epsilon character, which means that during processing you can jump from one node to another without consuming any character. I'll try to reduce your A. Let's say you A is:
A = {'ab, 'bc'}
If you build NFA for ab string you should get something like this:
+--(1)--+
e | a| |e
(S)--+--(2)--+--(F)
| b| |
+--(3)--+
Above drawing is not the best looking automaton. But there are a few points to consider:
S state is the starting state and F is the ending state.
If you are at F state it means your string qualifies as a subsequence.
The rule of propagating within an autmaton is that you can consume e (epsilon) to jump forward, therefore you can be at more then one state at each point in time. This is called e closure.
Now if given b, starting at state S I can jump one epsilon, reach 2, and consume b and reach 3. Now given end string I consume epsilon and reach F, thus b qualifies as a sub-sequence of ab. So does a or ab you can try yourself using above automata.
The good thing about NFA is that they have one start state and one final state. Two NFA could be easily connected using epsilons. There are various algorithms that could help you to convert NFA to DFA. DFA is a directed graph which can follow precise path given a character -- in particular, it is always in exactly one state at any point in time. (For any NFA, there is a corresponding DFA whose states correspond to sets of states in the NFA.)
So, for A = {'ab, 'bc'}, we would need to build NFA for ab then NFA for bc then join the two NFAs and build the DFA of the entire big NFA.
EDIT
NFA of subsequence of abc would be a?b?c?, so you can build your NFA as:
Now, consider the input acd. To query if ab is subsequence of {'abc', 'acd'}, you can use this NFA: (a?b?c?)|(a?c?d). Once you have NFA you can convert it to DFA where each state will contain whether it is a subsequence of abc or acd or maybe both.
I used link below to make NFA graphic from regular expression:
http://hackingoff.com/images/re2nfa/2013-08-04_21-56-03_-0700-nfa.svg
EDIT 2
You're right! In case if you've 10,000 unique characters in the A. By unique I mean A is something like this: {'abc', 'def'} i.e. intersection of each element of A is empty set. Then your DFA would be worst case in terms of states i.e. 2^10000. But I'm not sure when would that be possible given that there can never be 10,000 unique characters. Even if you have 10,000 characters in A still there will be repetitions and that might reduce states alot since e-closure might eventually merge. I cannot really estimate how much it might reduce. But even having 10 million states, you will only consume less then 10 mb worth of space to construct a DFA. You can even use NFA and find e-closures at run-time but that would add to run-time complexity. You can search different papers on how large regex are converted to DFAs.
EDIT 3
For regex (a?b?c?)|(e?d?a?)|(a?b?m?)
If you convert above NFA to DFA you get:
It actually lot less states then NFA.
Reference:
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
EDIT 4
After fiddling with that website more. I found that worst case would be something like this A = {'aaaa', 'bbbbb', 'cccc' ....}. But even in this case states are lesser than NFA states.
Tests
There have been four main proposals in this thread:
Shivam Kalra suggested creating an automaton based on all the strings in A. This approach has been tried slightly in the literature, normally under the name "Directed Acyclic Subsequence Graph" (DASG).
J Random Hacker suggested extending my 'prefix list' idea to all 'n choose 3' triplets in the query string, and merging them all using a heap.
In the note "Efficient Subsequence Search in Databases" Rohit Jain, Mukesh K. Mohania and Sunil Prabhakar suggest using a Trie structure with some optimizations and recursively search the tree for the query. They also have a suggestion similar to the triplet idea.
Finally there is the 'naive' approach, which wanghq suggested optimizing by storing an index for each element of A.
To get a better idea of what's worth putting continued effort into, I have implemented the above four approaches in Python and benchmarked them on two sets of data. The implementations could all be made a couple of magnitudes faster with a well done implementation in C or Java; and I haven't included the optimizations suggested for the 'trie' and 'naive' versions.
Test 1
A consists of random paths from my filesystem. q are 100 random [a-z] strings of average length 7. As the alphabet is large (and Python is slow) I was only able to use duplets for method 3.
Construction times in seconds as a function of A size:
Query times in seconds as a function of A size:
Test 2
A consists of randomly sampled [a-b] strings of length 20. q are 100 random [a-b] strings of average length 7. As the alphabet is small we can use quadlets for method 3.
Construction times in seconds as a function of A size:
Query times in seconds as a function of A size:
Conclusions
The double logarithmic plot is a bit hard to read, but from the data we can draw the following conclusions:
Automatons are very fast at querying (constant time), however they are impossible to create and store for |A| >= 256. It might be possible that a closer analysis could yield a better time/memory balance, or some tricks applicable for the remaining methods.
The dup-/trip-/quadlet method is about twice as fast as my trie implementation and four times as fast as the 'naive' implementation. I used only a linear amount of lists for the merge, instead of n^3 as suggested by j_random_hacker. It might be possible to tune the method better, but in general it was disappointing.
My trie implementation consistently does better than the naive approach by around a factor of two. By incorporating more preprocessing (like "where are the next 'c's in this subtree") or perhaps merging it with the triplet method, this seems like todays winner.
If you can do with a magnitude less performance, the naive method does comparatively just fine for very little cost.
As you point out, it might be that all strings in A contain q as a subsequence, in which case you can't hope to do better than O(|A|). (That said, you might still be able to do better than the time taken to run LCS on (q, A[i]) for each string i in A, but I won't focus on that here.)
TTBOMK there are no magic, fast ways to answer this question (in the way that suffix trees are the magic, fast way to answer the corresponding question involving substrings instead of subsequences). Nevertheless if you expect the set of answers for most queries to be small on average then it's worth looking at ways to speed up these queries (the ones yielding small-size answers).
I suggest filtering based on a generalisation of your heuristic (2): if some database sequence A[i] contains q as a subsequence, then it must also contain every subsequence of q. (The reverse direction is not true unfortunately!) So for some small k, e.g. 3 as you suggest, you can preprocess by building an array of lists telling you, for every length-k string s, the list of database sequences containing s as a subsequence. I.e. c[s] will contain a list of the ID numbers of database sequences containing s as a subsequence. Keep each list in numeric order to enable fast intersections later.
Now the basic idea (which we'll improve in a moment) for each query q is: Find all k-sized subsequences of q, look up each in the array of lists c[], and intersect these lists to find the set of sequences in A that might possibly contain q as a subsequence. Then for each possible sequence A[i] in this (hopefully small) intersection, perform an O(n^2) LCS calculation with q to see whether it really does contain q.
A few observations:
The intersection of 2 sorted lists of size m and n can be found in O(m+n) time. To find the intersection of r lists, perform r-1 pairwise intersections in any order. Since taking intersections can only produce sets that are smaller or of the same size, time can be saved by intersecting the smallest pair of lists first, then the next smallest pair (this will necessarily include the result of the first operation), and so on. In particular: sort lists in increasing size order, then always intersect the next list with the "current" intersection.
It is actually faster to find the intersection a different way, by adding the first element (sequence number) of each of the r lists into a heap data structure, then repeatedly pulling out the minimum value and replenishing the heap with the next value from the list that the most recent minimum came from. This will produce a list of sequence numbers in nondecreasing order; any value that appears fewer than r times in a row can be discarded, since it cannot be a member of all r sets.
If a k-string s has only a few sequences in c[s], then it is in some sense discriminating. For most datasets, not all k-strings will be equally discriminating, and this can be used to our advantage. After preprocessing, consider throwing away all lists having more than some fixed number (or some fixed fraction of the total) of sequences, for 3 reasons:
They take a lot of space to store
They take a lot of time to intersect during query processing
Intersecting them will usually not shrink the overall intersection much
It is not necessary to consider every k-subsequence of q. Although this will produce the smallest intersection, it involves merging (|q| choose k) lists, and it might well be possible to produce an intersection that is nearly as small using just a fraction of these k-subsequences. E.g. you could limit yourself to trying all (or a few) k-substrings of q. As a further filter, consider just those k-subsequences whose sequence lists in c[s] are below some value. (Note: if your threshold is the same for every query, you might as well delete all such lists from the database instead, since this will have the same effect, and saves space.)
One thought;
if q tends to be short, maybe reducing A and q to a set will help?
So for the example, derive to { (a,b,c,d,e,f), (a), (a,c,d) }. Looking up possible candidates for any q should be faster than the original problem (that's a guess actually, not sure how exactly. maybe sort them and "group" similar ones in bloom filters?), then use bruteforce to weed out false positives.
If A strings are lengthy, you could make the characters unique based on their occurence, so that would be {(a1,b1,c1,d1,e1,f1),(a1,a2,a3,a4,a5,a6),(a1,c1,d1,d2)}. This is fine, because if you search for "ddca" you only want to match the second d to a second d. The size of your alphabet would go up (bad for bloom or bitmap style operations) and would be different ever time you get new A's, but the amount of false positives would go down.
First let me make sure my understanding/abstraction is correct. The following two requirements should be met:
if A is a subsequence of B, then all characters in A should appear in B.
for those characters in B, their positions should be in an ascending order.
Note that, a char in A might appear more than once in B.
To solve 1), a map/set can be used. The key is the character in string B, and the value doesn't matter.
To solve 2), we need to maintain the position of each characters. Since a character might appear more than once, the position should be a collection.
So the structure is like:
Map<Character, List<Integer>)
e.g.
abcdefab
a: [0, 6]
b: [1, 7]
c: [2]
d: [3]
e: [4]
f: [5]
Once we have the structure, how to know if the characters are in the right order as they are in string A? If B is acd, we should check the a at position 0 (but not 6), c at position 2 and d at position 3.
The strategy here is to choose the position that's after and close to the previous chosen position. TreeSet is a good candidate for this operation.
public E higher(E e)
Returns the least element in this set strictly greater than the given element, or null if there is no such element.
The runtime complexity is O(s * (n1 + n2)*log(m))).
s: number of strings in the set
n1: number of chars in string (B)
n2: number of chars in query string (A)
m: number of duplicates in string (B), e.g. there are 5 a.
Below is the implementation with some test data.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
public class SubsequenceStr {
public static void main(String[] args) {
String[] testSet = new String[] {
"abcdefgh", //right one
"adcefgh", //has all chars, but not the right order
"bcdefh", //missing one char
"", //empty
"acdh",//exact match
"acd",
"acdehacdeh"
};
List<String> subseqenceStrs = subsequenceStrs(testSet, "acdh");
for (String str : subseqenceStrs) {
System.out.println(str);
}
//duplicates in query
subseqenceStrs = subsequenceStrs(testSet, "aa");
for (String str : subseqenceStrs) {
System.out.println(str);
}
subseqenceStrs = subsequenceStrs(testSet, "aaa");
for (String str : subseqenceStrs) {
System.out.println(str);
}
}
public static List<String> subsequenceStrs(String[] strSet, String q) {
System.out.println("find strings whose subsequence string is " + q);
List<String> results = new ArrayList<String>();
for (String str : strSet) {
char[] chars = str.toCharArray();
Map<Character, TreeSet<Integer>> charPositions = new HashMap<Character, TreeSet<Integer>>();
for (int i = 0; i < chars.length; i++) {
TreeSet<Integer> positions = charPositions.get(chars[i]);
if (positions == null) {
positions = new TreeSet<Integer>();
charPositions.put(chars[i], positions);
}
positions.add(i);
}
char[] qChars = q.toCharArray();
int lowestPosition = -1;
boolean isSubsequence = false;
for (int i = 0; i < qChars.length; i++) {
TreeSet<Integer> positions = charPositions.get(qChars[i]);
if (positions == null || positions.size() == 0) {
break;
} else {
Integer position = positions.higher(lowestPosition);
if (position == null) {
break;
} else {
lowestPosition = position;
if (i == qChars.length - 1) {
isSubsequence = true;
}
}
}
}
if (isSubsequence) {
results.add(str);
}
}
return results;
}
}
Output:
find strings whose subsequence string is acdh
abcdefgh
acdh
acdehacdeh
find strings whose subsequence string is aa
acdehacdeh
find strings whose subsequence string is aaa
As always, I might be totally wrong :)
You might want to have a look into the Book Algorithms on Strings and Sequences by Dan Gusfield. As it turns out part of it is available on the internet. You might also want to read Gusfield's Introduction to Suffix Trees. As it turns out this book covers many approaches for you kind of question. It is considered one of the standard publications in this field.
Get a fast longest common subsequence algorithm implementation. Actually it suffices to determine the length of the LCS. Notice that Gusman's book has very good algorithms and also points to more sources for such algorithms.
Return all s ∈ A with length(LCS(s,q)) == length(q)

Resources