faster string sorting with long common prefix?

I have a set of strings. 90% of them are URLs that start with "http://www.". I want to sort them alphabetically.
Currently I use C++ std::sort(), but std::sort is a comparison-based quicksort variant, and comparing two strings with a long common prefix is not efficient. However, (I think) a radix sort won't work either, since most strings would be put in the same bucket because of the long common prefix.
Is there any better algorithm than normal quick-sort/radix-sort for this problem?

I would suspect that the processing time you spend trying to exploit common prefixes on the order of 10 characters per URL doesn't even pay for itself when you consider the average length of URLs.
Just try a completely standard sort. If that's not fast enough, look at parallelizing or distributing a completely standard sort. It's a straightforward approach that will work.

Common prefixes seem to naturally imply that a trie data structure could be useful. So the idea is to build a trie of all the words and then sort each node: the children of a particular node reside in a list and are kept sorted. This can be done easily, since at a particular node we need only sort the children, so naturally a recursive solution reveals itself. See this paper for more inspiration: http://goanna.cs.rmit.edu.au/~jz/fulltext/acsc03sz.pdf
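For illustration, a minimal sketch of that recursive idea (Python for brevity; this is not the burst-trie structure from the linked paper):

def trie_sort(strings):
    END = object()                          # marks "a string terminates here"
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})  # shared prefixes are walked only once
        node[END] = node.get(END, 0) + 1    # count duplicates

    out = []
    def walk(node, prefix):
        for _ in range(node.get(END, 0)):   # strings ending at this node
            out.append(prefix)
        for ch in sorted(k for k in node if k is not END):
            walk(node[ch], prefix + ch)     # visit children in sorted order
    walk(root, "")
    return out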

Create two groups: the ones with the prefix and the ones without. For the first group, remove the prefix, sort, and add the prefix back. For the second group, just sort. After that, divide the second group into the part that sorts before the prefix and the part that sorts after it. Now concatenate the three lists (list_2_before, list_1, list_2_after).
Instead of removing and re-adding the prefix for the first list, you can write custom comparison code that starts comparing after the prefix (i.e. ignores the prefix part while comparing).
Addendum: If you have multiple common prefixes you can exploit them further: it is better to create a shallow tree with the very common prefixes and join the results.
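A minimal sketch of the two-group idea in Python (the fixed prefix "http://www." is taken from the question). Any string without the prefix compares entirely before or entirely after every string with it, so one binary search locates the split point:

import bisect

def prefix_sort(strings, prefix="http://www."):
    n = len(prefix)
    with_p = sorted(s[n:] for s in strings if s.startswith(prefix))    # sort suffixes only
    without = sorted(s for s in strings if not s.startswith(prefix))
    cut = bisect.bisect_left(without, prefix)                          # before/after the prefixed block
    return without[:cut] + [prefix + t for t in with_p] + without[cut:]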

If you figure out the minimum and maximum values in the vector before you start quicksorting, then you always know the range of values for each call to partition(), since the partition value is either the minimum or the maximum (or at least close to the minimum/maximum) of each subrange and the containing partition's minimum and maximum are the other end of each subrange.
If the minimum and the maximum of a subrange share a common prefix, then you can do all of the partition comparisons starting from the character position following the common prefix. As the quicksort progresses, the ranges get smaller and smaller so their common prefixes should get longer and longer, and ignoring them for the comparisons will save more and more time. How much, I don't know; you'd have to benchmark this to see if it actually helps.
In any event, the additional overhead is fairly small: one pass through the vector to find the minimum and maximum strings, costing 1.5 comparisons per string (*), and then one check for each partition to find the maximum shared prefix for the partition; the check is equivalent to a comparison, and it can start from the maximum shared prefix of the containing partition, so it's not even a full string comparison.
The min/max algorithm: Scan the vector two elements at a time. For each pair, first compare them with each other, then compare the smaller one with the running minimum and the larger one with the running maximum. Result: three comparisons for two elements, or 1.5 comparisons per element.
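A sketch of that scan in Python (illustrative only; assumes a non-empty vector): three comparisons per pair gives roughly 1.5 per element, and the shared prefix of the resulting minimum and maximum bounds the shared prefix of the whole range.

def min_max(strings):
    it = iter(strings)
    lo = hi = next(it)                    # assumes at least one element
    for a in it:
        b = next(it, None)
        if b is None:                     # odd element left over
            lo, hi = min(lo, a), max(hi, a)
            break
        if b < a:                         # 1st comparison: order the pair
            a, b = b, a
        if a < lo:                        # 2nd: smaller vs running minimum
            lo = a
        if b > hi:                        # 3rd: larger vs running maximum
            hi = b
    return lo, hi

def common_prefix_len(lo, hi):
    i = 0
    while i < len(lo) and i < len(hi) and lo[i] == hi[i]:
        i += 1
    return i                              # later comparisons may start at this offset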

In the end I found that a ternary quicksort works well; I found the algorithm at www.larsson.dogma.net/qsufsort.c.
Here is my modified implementation, with an interface similar to std::sort. It's about 40% faster than std::sort on my machine and dataset.
#include <iterator>

template<class RandIt>
static inline void multiway_qsort(RandIt beg, RandIt end, size_t depth = 0, size_t step_len = 6) {
    if(beg + 1 >= end) {
        return;
    }

    /* bounded comparison: look at up to step_len characters starting at depth;
       returns +1 if a sorts before b, -1 if after, 0 if equal on this window */
    struct {
        inline int operator() (
                const typename std::iterator_traits<RandIt>::value_type& a,
                const typename std::iterator_traits<RandIt>::value_type& b,
                size_t depth, size_t step_len) {
            for(size_t i = 0; i < step_len; i++) {
                if(a[depth + i] == b[depth + i] && a[depth + i] == 0) return 0;
                if(a[depth + i] < b[depth + i]) return +1;
                if(a[depth + i] > b[depth + i]) return -1;
            }
            return 0;
        }
    } bounded_cmp;

    RandIt i = beg;
    RandIt j = beg + std::distance(beg, end) / 2;
    RandIt k = end - 1;
    typename std::iterator_traits<RandIt>::value_type key = ( /* median of left, middle, right */
        bounded_cmp(*i, *j, depth, step_len) > 0 ?
        (bounded_cmp(*i, *k, depth, step_len) > 0 ? (bounded_cmp(*j, *k, depth, step_len) > 0 ? *j : *k) : *i) :
        (bounded_cmp(*i, *k, depth, step_len) < 0 ? (bounded_cmp(*j, *k, depth, step_len) < 0 ? *j : *k) : *i));

    /* 3-way partition into [x < key][x == key][x > key] */
    for(j = i; j <= k; ++j) {
        switch(bounded_cmp(*j, key, depth, step_len)) {
        case +1: std::iter_swap(i, j); ++i; break;
        case -1: std::iter_swap(k, j); --k; --j; break;
        }
    }
    ++k;

    if(beg + 1 < i) multiway_qsort(beg, i, depth, step_len); /* recursively sort [x < pivot] subset */
    if(k + 1 < end) multiway_qsort(k, end, depth, step_len); /* recursively sort [x > pivot] subset */

    /* recursively sort the [x == pivot] subset at a greater depth, but only if
       those strings have not already terminated inside the compared window */
    if(i < k) {
        bool terminated = false;
        for(size_t t = 0; t < step_len; t++) {
            if((*i)[depth + t] == 0) { terminated = true; break; }
        }
        if(!terminated) multiway_qsort(i, k, depth + step_len, step_len);
    }
}

The paper Engineering Parallel String Sorting has, surprisingly, benchmarks of a large number of single-threaded algorithms on a URL dataset (see page 29). Rantala's variant of multikey quicksort with caching comes out ahead; you can test multikey_cache8 in the repository cited in the paper.
I've tested the dataset in that paper, and if it's any indication you're seeing barely one bit of entropy in the first ten characters, and distinguishing prefixes in the 100-character range. Doing 100 passes of radix sort will thrash the cache with little benefit; e.g. sorting a million URLs means you're looking for ~20 distinguishing bits in each key at a cost of ~100 cache misses.
While in general radix sort will not perform well on long strings, the optimizations that Kärkkäinen and Rantala describe in Engineering Radix Sort for Strings are sufficient for the URL dataset. In particular they read ahead 8 characters and cache them with the string pointers; sorting on these cached values yields enough entropy to get past the cache-miss problem.
For longer strings try some of the LCP-based algorithms in that repository; in my experience the URL dataset is about at the break-even point between highly optimized radix-type sorts and LCP-based algorithms, which do asymptotically better on longer strings.
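As a rough illustration of the caching idea (a sketch only, not the authors' implementation): pull a fixed-size prefix of each string into the sort key, so that most comparisons touch only the small cached keys and fall back to the full strings on ties.

def cached_sort(strings, k=8):
    # Tuples compare by the k-character cache first; the full string is
    # examined only when two caches tie.
    return [s for _, s in sorted((s[:k], s) for s in strings)]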

Related

What does the int value returned from compareTo function in Kotlin really mean?

In the documentation of compareTo function, I read:
Returns zero if this object is equal to the specified other object, a
negative number if it's less than other, or a positive number if it's
greater than other.
What does this less than or greater than mean in the context of strings? Is, for example, Hello World less than the single character a?
val epicString = "Hello World"
println(epicString.compareTo("a")) //-25
Why -25 and not -10 or -1 (for example)?
Other examples:
val epicString = "Hello World"
println(epicString.compareTo("HelloWorld")) //-55
Is Hello World less than HelloWorld? Why?
Why does it return -55 and not -1, -2, -3, etc.?
val epicString = "HelloWorld"
println(epicString.compareTo("Hello World")) //55
Is HelloWorld greater than Hello World? Why?
Why does it return 55 and not 1, 2, 3, etc.?
I believe you're asking about the implementation of the compareTo method of java.lang.String. Here is the source code for Java 11:
public int compareTo(String anotherString) {
    byte v1[] = value;
    byte v2[] = anotherString.value;
    if (coder() == anotherString.coder()) {
        return isLatin1() ? StringLatin1.compareTo(v1, v2)
                          : StringUTF16.compareTo(v1, v2);
    }
    return isLatin1() ? StringLatin1.compareToUTF16(v1, v2)
                      : StringUTF16.compareToLatin1(v1, v2);
}
So we have a delegation to either StringLatin1 or StringUTF16 here, so we should look further.
Fortunately StringLatin1 and StringUTF16 have similar implementations when it comes to the compare functionality.
Here is the implementation for StringLatin1, for example:
public static int compareTo(byte[] value, byte[] other) {
    int len1 = value.length;
    int len2 = other.length;
    return compareTo(value, other, len1, len2);
}

public static int compareTo(byte[] value, byte[] other, int len1, int len2) {
    int lim = Math.min(len1, len2);
    for (int k = 0; k < lim; k++) {
        if (value[k] != other[k]) {
            return getChar(value, k) - getChar(other, k);
        }
    }
    return len1 - len2;
}
As you see, it iterates over the characters of the shorter string, and if the characters at the same index of the two strings differ, it returns the difference between them. If during the iteration it doesn't find any difference (one string is a prefix of the other), it falls back to comparing the lengths of the two strings.
In your case, there is a difference in the first iteration already...
So it's the same as "H".compareTo("a") --> -25.
The code of "H" is 72
The code of "a" is 97
So, 72 - 97 = -25
Short answer: The exact value doesn't have any meaning; only its sign does.
As the specification for compareTo() says, it returns a -ve number if the receiver is smaller than the other object, a +ve number if the receiver is larger, or 0 if the two are considered equal (for the purposes of this ordering).
The specification doesn't distinguish between different -ve numbers, nor between different +ve numbers — and so neither should you.  Some classes always return -1, 0, and 1, while others return different numbers, but that's just an implementation detail — and implementations vary.
Let's look at a very simple hypothetical example:
class Length(val metres: Int) : Comparable<Length> {
    override fun compareTo(other: Length)
        = metres - other.metres
}
This class has a single numerical property, so we can use that property to compare them.  One common way to do the comparison is simply to subtract the two lengths: that gives a number which is positive if the receiver is larger, negative if it's smaller, and zero if they're the same length — which is just what we need.
In this case, the value of compareTo() would happen to be the signed difference between the two lengths.
However, that method has a subtle bug: the subtraction could overflow, and give the wrong results if the difference is bigger than Int.MAX_VALUE.  (Obviously, to hit that you'd need to be working with astronomical distances, both positive and negative — but that's not implausible.  Rocket scientists write programs too!)
To fix it, you might change it to something like:
class Length(val metres: Int) : Comparable<Length> {
    override fun compareTo(other: Length) = when {
        metres > other.metres -> 1
        metres < other.metres -> -1
        else -> 0
    }
}
That fixes the bug; it works for all possible lengths.
But notice that the actual return value has changed in most cases: now it only ever returns -1, 0, or 1, and no longer gives an indication of the actual difference in lengths.
If this was your class, then it would be safe to make this change because it still matches the specification.  Anyone who just looked at the sign of the result would see no change (apart from the bug fix).  Anyone using the exact value would find that their programs were now broken — but that's their own fault, because they shouldn't have been relying on that, because it was undocumented behaviour.
Exactly the same applies to the String class and its implementation.  While it might be interesting to poke around inside it and look at how it's written, the code you write should never rely on that sort of detail.  (It could change in a future version.  Or someone could apply your code to another object which didn't behave the same way.  Or you might want to expand your project to be cross-platform, and discover the hard way that the JavaScript implementation didn't behave exactly the same as the Java one.)
In the long run, life is much simpler if you don't assume anything more than the specification promises!

I made the following binary search algorithm, and I am trying to improve it. Suggestions?

I am doing a homework assignment where I have to check large lists of numbers for a given number. The list length is <= 20000, and I can be searching for just as many numbers. If the number we are searching for is in the list, return the index of that number; otherwise, return -1. Here is what I did.
I wrote the following code, which outputs the correct answer, but does not do it fast enough. It has to be done in less than 1 second.
Here is my binary search code; I am looking for suggestions to make it faster.
def binary_search(list1, target):
    p = list1
    upper = len(list1)
    lower = 0
    found = False
    check = int((upper+lower)//2)
    while found == False:
        upper = len(list1)
        lower = 0
        check = int(len(list1)//2)
        if list1[check] > target:
            list1 = list1[lower:check]
            check = int((len(list1))//2)
        if list1[check] < target:
            list1 = list1[check:upper]
            check = int((len(list1))//2)
        if list1[check] == target:
            found = True
            return p.index(target)
        if len(list1) == 1:
            if target not in list1:
                return -1
Grateful for any help!
The core problem is that this is not a correctly written Binary Search (see the algorithm there).
There is no need of index(target) to find the solution; a binary search is O(lg n), and so the very presence of index-of, being O(n), violates this! As written, the entire "binary search" function could be replaced with list1.index(value), as it is "finding" the value index simply to "find" the value index.
The code is also slicing the lists, which is not needed1; simply move the upper/lower indices to "hone in" on the value. The index of the found value is where the upper and lower bounds eventually meet.
Also, make sure that list1 is really a list such that the item access is O(1).
(And int is not needed with //.)
1 I believe the number of iterations is still O(lg n) with the slice, but each slice creates a new list and copies the relevant items, which is non-idiomatic and adds overhead (strictly, the copying makes the total work linear). It also doesn't allow the correct generation of the found item's index - at least without similar variable maintenance as found in a traditional implementation.
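To make the suggestion concrete, here is a minimal sketch of a conventional index-based binary search (an illustration, not the asker's code): lower and upper bracket the live range and move toward each other, with no slicing and no index().

def binary_search(list1, target):
    lower, upper = 0, len(list1) - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        if list1[mid] < target:
            lower = mid + 1               # target can only be to the right
        elif list1[mid] > target:
            upper = mid - 1               # target can only be to the left
        else:
            return mid                    # found: mid is the index
    return -1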
Try using else ifs: for example, if the value that's being checked is greater, then you don't also need to check if it's smaller.

Data Structure for Subsequence Queries

In a program I need to efficiently answer queries of the following form:
Given a set of strings A and a query string q return all s ∈ A such that q is a subsequence of s
For example, given A = {"abcdef", "aaaaaa", "ddca"} and q = "acd" exactly "abcdef" should be returned.
The following is what I have considered so far:
For each possible character, make a sorted list of all the strings/locations where it appears. For querying, interleave the lists of the involved characters, and scan through them looking for matches within string boundaries.
This would probably be more efficient for words instead of characters, since the limited number of different characters will make the return lists very dense.
For each n-prefix q might have, store the list of all matching strings. n might realistically be close to 3. For query strings longer than that we brute force the initial list.
This might speed things up a bit, but one could easily imagine some n-subsequences being present in close to all strings in A, which means the worst case is the same as just brute-forcing the entire set.
Do you know of any data structures, algorithms or preprocessing tricks which might be helpful for performing the above task efficiently for large A? (My strings s will be around 100 characters.)
Update: Some people have suggested using LCS to check if q is a subsequence of s. I just want to point out that this can be done using a simple function such as:
def isSub(q, s):
    i, j = 0, 0
    while i != len(q) and j != len(s):
        if q[i] == s[j]:
            i += 1
            j += 1
        else:
            j += 1
    return i == len(q)
Update 2: I've been asked to give more details on the nature of q, A and its elements. While I'd prefer something that works as generally as possible, I assume A will have length around 10^6 and will need to support insertion. The elements s will be shorter with an average length of 64. The queries q will only be 1 to 20 characters and be used for a live search, so the query "ab" will be sent just before the query "abc". Again, I'd much prefer the solution to use the above as little as possible.
Update 3: It has occurred to me, that a data-structure with O(n^{1-epsilon}) lookups, would allow you to solve OVP / disprove the SETH conjecture. That is probably the reason for our suffering. The only options are then to disprove the conjecture, use approximation, or take advantage of the dataset. I imagine quadlets and tries would do the last in different settings.
It could be done by building an automaton. You can start with an NFA (nondeterministic finite automaton, which is like a nondeterministic directed graph) which allows edges labeled with an epsilon character, meaning that during processing you can jump from one node to another without consuming any character. I'll try to reduce your A. Let's say your A is:
A = {'ab', 'bc'}
If you build an NFA for the string ab you should get something like this:
+--(1)--+
e | a| |e
(S)--+--(2)--+--(F)
| b| |
+--(3)--+
The drawing above is not the best-looking automaton, but there are a few points to consider:
The S state is the starting state and F is the ending state.
If you are at the F state it means your string qualifies as a subsequence.
The rule for propagating within an automaton is that you can consume e (epsilon) to jump forward, therefore you can be in more than one state at each point in time. This is called the e-closure.
Now, given b, starting at state S I can jump one epsilon, reach 2, consume b and reach 3. Now, at the end of the string, I consume an epsilon and reach F; thus b qualifies as a subsequence of ab. So do a and ab, as you can verify yourself using the automaton above.
The good thing about NFAs is that they have one start state and one final state. Two NFAs can easily be connected using epsilons. There are various algorithms that can help you convert an NFA to a DFA. A DFA is a directed graph which follows a precise path given a character; in particular, it is always in exactly one state at any point in time. (For any NFA, there is a corresponding DFA whose states correspond to sets of states in the NFA.)
So, for A = {'ab', 'bc'}, we would need to build an NFA for ab, then an NFA for bc, then join the two NFAs and build the DFA of the entire big NFA.
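A lazily memoized sketch in Python of the same idea (my own formulation of the product automaton, not the regex-to-DFA pipeline described above): each string contributes a position pointer, and memoizing step() effectively enumerates DFA states on demand, which also suits the incremental queries mentioned in the question.

from functools import lru_cache

def make_oracle(A):
    @lru_cache(maxsize=None)              # memoized states stand in for DFA states
    def step(state, ch):
        new = []
        for s, p in zip(A, state):
            if p is None:                 # this string has already failed
                new.append(None)
            else:
                i = s.find(ch, p)         # next occurrence of ch at or after p
                new.append(i + 1 if i >= 0 else None)
        return tuple(new)

    def matches(q):
        state = tuple(0 for _ in A)
        for ch in q:
            state = step(state, ch)
        return [s for s, p in zip(A, state) if p is not None]

    return matches

With A = ["abcdef", "aaaaaa", "ddca"], make_oracle(A)("acd") returns ['abcdef'], matching the example in the question.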
EDIT
The NFA for subsequences of abc would be a?b?c?, so you can build your NFA from that regular expression.
Now, consider the input acd. To query whether ab is a subsequence of {'abc', 'acd'}, you can use this NFA: (a?b?c?)|(a?c?d?). Once you have the NFA you can convert it to a DFA, where each state will record whether the input is a subsequence of abc or acd or maybe both.
I used the link below to make the NFA graphic from the regular expression:
http://hackingoff.com/images/re2nfa/2013-08-04_21-56-03_-0700-nfa.svg
EDIT 2
You're right! Suppose you have 10,000 unique characters in A. By unique I mean A is something like {'abc', 'def'}, i.e. the intersection of the elements of A is the empty set. Then your DFA would be worst case in terms of states, i.e. 2^10000. But I'm not sure when that would be possible, given that there can never be 10,000 unique characters. Even if you have 10,000 characters in A, there will still be repetitions, and that might reduce the number of states a lot, since e-closures might eventually merge. I cannot really estimate by how much. But even with 10 million states, you will only consume less than 10 MB of space constructing the DFA. You can even use the NFA and find e-closures at run time, but that would add to the run-time complexity. You can look at different papers on how large regexes are converted to DFAs.
EDIT 3
For regex (a?b?c?)|(e?d?a?)|(a?b?m?)
If you convert the above NFA to a DFA you get:
It actually has far fewer states than the NFA.
Reference:
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
EDIT 4
After fiddling with that website some more, I found that the worst case would be something like A = {'aaaa', 'bbbbb', 'cccc', ...}. But even in this case the DFA has fewer states than the NFA.
Tests
There have been four main proposals in this thread:
Shivam Kalra suggested creating an automaton based on all the strings in A. This approach has been tried a little in the literature, normally under the name "Directed Acyclic Subsequence Graph" (DASG).
J Random Hacker suggested extending my 'prefix list' idea to all 'n choose 3' triplets in the query string, and merging them all using a heap.
In the note "Efficient Subsequence Search in Databases", Rohit Jain, Mukesh K. Mohania and Sunil Prabhakar suggest using a trie structure with some optimizations and recursively searching the tree for the query. They also have a suggestion similar to the triplet idea.
Finally there is the 'naive' approach, which wanghq suggested optimizing by storing an index for each element of A.
To get a better idea of what's worth putting continued effort into, I have implemented the above four approaches in Python and benchmarked them on two sets of data. The implementations could all be made a couple of orders of magnitude faster with a well-done implementation in C or Java; and I haven't included the optimizations suggested for the 'trie' and 'naive' versions.
Test 1
A consists of random paths from my filesystem. q are 100 random [a-z] strings of average length 7. As the alphabet is large (and Python is slow) I was only able to use duplets for method 3.
Construction times in seconds as a function of A size:
Query times in seconds as a function of A size:
Test 2
A consists of randomly sampled [a-b] strings of length 20. q are 100 random [a-b] strings of average length 7. As the alphabet is small we can use quadlets for method 3.
Construction times in seconds as a function of A size:
Query times in seconds as a function of A size:
Conclusions
The double logarithmic plot is a bit hard to read, but from the data we can draw the following conclusions:
Automatons are very fast at querying (constant time), however they are impossible to create and store for |A| >= 256. It might be possible that a closer analysis could yield a better time/memory balance, or some tricks applicable for the remaining methods.
The dup-/trip-/quadlet method is about twice as fast as my trie implementation and four times as fast as the 'naive' implementation. I used only a linear amount of lists for the merge, instead of n^3 as suggested by j_random_hacker. It might be possible to tune the method better, but in general it was disappointing.
My trie implementation consistently does better than the naive approach by around a factor of two. By incorporating more preprocessing (like "where are the next 'c's in this subtree") or perhaps merging it with the triplet method, this seems like today's winner.
If you can make do with an order of magnitude less performance, the naive method does comparatively just fine for very little cost.
As you point out, it might be that all strings in A contain q as a subsequence, in which case you can't hope to do better than O(|A|). (That said, you might still be able to do better than the time taken to run LCS on (q, A[i]) for each string i in A, but I won't focus on that here.)
TTBOMK there are no magic, fast ways to answer this question (in the way that suffix trees are the magic, fast way to answer the corresponding question involving substrings instead of subsequences). Nevertheless if you expect the set of answers for most queries to be small on average then it's worth looking at ways to speed up these queries (the ones yielding small-size answers).
I suggest filtering based on a generalisation of your heuristic (2): if some database sequence A[i] contains q as a subsequence, then it must also contain every subsequence of q. (The reverse direction is not true unfortunately!) So for some small k, e.g. 3 as you suggest, you can preprocess by building an array of lists telling you, for every length-k string s, the list of database sequences containing s as a subsequence. I.e. c[s] will contain a list of the ID numbers of database sequences containing s as a subsequence. Keep each list in numeric order to enable fast intersections later.
Now the basic idea (which we'll improve in a moment) for each query q is: Find all k-sized subsequences of q, look up each in the array of lists c[], and intersect these lists to find the set of sequences in A that might possibly contain q as a subsequence. Then for each possible sequence A[i] in this (hopefully small) intersection, perform an O(n^2) LCS calculation with q to see whether it really does contain q.
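A sketch of this preprocessing and the basic query in Python, reusing the isSub() helper from the question; plain set intersection stands in for the sorted-list merges discussed below, and k = 3 is an assumed choice:

from itertools import combinations

def build_index(A, k=3):
    # c[sub] lists, in increasing ID order, the strings containing the
    # k-subsequence sub
    c = {}
    for i, s in enumerate(A):
        for sub in set(combinations(s, k)):
            c.setdefault(sub, []).append(i)
    return c

def query(A, c, q, k=3):
    subs = set(combinations(q, k))
    if not subs:                          # q shorter than k: no filter available
        candidates = range(len(A))
    else:
        candidates = set.intersection(*(set(c.get(sub, ())) for sub in subs))
    return [A[i] for i in candidates if isSub(q, A[i])]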
A few observations:
The intersection of 2 sorted lists of size m and n can be found in O(m+n) time. To find the intersection of r lists, perform r-1 pairwise intersections in any order. Since taking intersections can only produce sets that are smaller or of the same size, time can be saved by intersecting the smallest pair of lists first, then the next smallest pair (this will necessarily include the result of the first operation), and so on. In particular: sort lists in increasing size order, then always intersect the next list with the "current" intersection.
It is actually faster to find the intersection a different way, by adding the first element (sequence number) of each of the r lists into a heap data structure, then repeatedly pulling out the minimum value and replenishing the heap with the next value from the list that the most recent minimum came from. This will produce a list of sequence numbers in nondecreasing order; any value that appears fewer than r times in a row can be discarded, since it cannot be a member of all r sets. (See the sketch after this list.)
If a k-string s has only a few sequences in c[s], then it is in some sense discriminating. For most datasets, not all k-strings will be equally discriminating, and this can be used to our advantage. After preprocessing, consider throwing away all lists having more than some fixed number (or some fixed fraction of the total) of sequences, for 3 reasons:
They take a lot of space to store
They take a lot of time to intersect during query processing
Intersecting them will usually not shrink the overall intersection much
It is not necessary to consider every k-subsequence of q. Although this will produce the smallest intersection, it involves merging (|q| choose k) lists, and it might well be possible to produce an intersection that is nearly as small using just a fraction of these k-subsequences. E.g. you could limit yourself to trying all (or a few) k-substrings of q. As a further filter, consider just those k-subsequences whose sequence lists in c[s] are below some value. (Note: if your threshold is the same for every query, you might as well delete all such lists from the database instead, since this will have the same effect, and saves space.)
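Here is the heap-based r-way merge from the observations above, sketched in Python; it streams the r sorted ID lists and keeps exactly the values that occur in all of them (IDs are assumed strictly increasing within each list):

import heapq

def intersect_sorted(lists):
    r = len(lists)
    heap = [(lst[0], idx, 0) for idx, lst in enumerate(lists) if lst]
    if len(heap) < r:
        return []                      # an empty list makes the intersection empty
    heapq.heapify(heap)
    result, prev, run = [], None, 0
    while heap:
        value, idx, pos = heapq.heappop(heap)
        run = run + 1 if value == prev else 1
        prev = value
        if run == r:
            result.append(value)       # appeared r times in a row: in every list
        if pos + 1 < len(lists[idx]):
            heapq.heappush(heap, (lists[idx][pos + 1], idx, pos + 1))
    return result

For example, intersect_sorted([[1, 3, 5], [1, 2, 3], [1, 3, 4]]) yields [1, 3].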
One thought:
if q tends to be short, maybe reducing A and q to sets will help?
So for the example, reduce to { (a,b,c,d,e,f), (a), (a,c,d) }. Looking up possible candidates for any q should be faster than the original problem (that's a guess actually, not sure how exactly; maybe sort them and "group" similar ones in Bloom filters?), then use brute force to weed out false positives.
If the A strings are lengthy, you could make the characters unique based on their occurrence, so that would be {(a1,b1,c1,d1,e1,f1), (a1,a2,a3,a4,a5,a6), (a1,c1,d1,d2)}. This is fine, because if you search for "ddca" you only want to match the second d to a second d. The size of your alphabet would go up (bad for Bloom or bitmap style operations) and would be different every time you get new A's, but the amount of false positives would go down.
First let me make sure my understanding/abstraction is correct. The following two requirements should be met:
1) if A is a subsequence of B, then all characters in A should appear in B;
2) for those characters in B, their positions should be in ascending order.
Note that a char in A might appear more than once in B.
To solve 1), a map/set can be used. The key is the character in string B, and the value doesn't matter.
To solve 2), we need to maintain the position of each character. Since a character might appear more than once, the position should be a collection.
So the structure is like:
Map<Character, List<Integer>>
e.g.
abcdefab
a: [0, 6]
b: [1, 7]
c: [2]
d: [3]
e: [4]
f: [5]
Once we have the structure, how do we know if the characters are in the right order as they are in string A? If A is acd, we should check the a at position 0 (but not 6), the c at position 2 and the d at position 3.
The strategy here is to choose the position that's after and closest to the previously chosen position. TreeSet is a good candidate for this operation.
public E higher(E e)
Returns the least element in this set strictly greater than the given element, or null if there is no such element.
The runtime complexity is O(s * (n1 + n2) * log(m)), where:
s: number of strings in the set
n1: number of chars in a string (B)
n2: number of chars in the query string (A)
m: number of duplicates of a char in a string (B), e.g. there are 5 a's.
Below is the implementation with some test data.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class SubsequenceStr {

    public static void main(String[] args) {
        String[] testSet = new String[] {
            "abcdefgh", //right one
            "adcefgh",  //has all chars, but not the right order
            "bcdefh",   //missing one char
            "",         //empty
            "acdh",     //exact match
            "acd",
            "acdehacdeh"
        };
        List<String> subseqenceStrs = subsequenceStrs(testSet, "acdh");
        for (String str : subseqenceStrs) {
            System.out.println(str);
        }
        //duplicates in query
        subseqenceStrs = subsequenceStrs(testSet, "aa");
        for (String str : subseqenceStrs) {
            System.out.println(str);
        }
        subseqenceStrs = subsequenceStrs(testSet, "aaa");
        for (String str : subseqenceStrs) {
            System.out.println(str);
        }
    }

    public static List<String> subsequenceStrs(String[] strSet, String q) {
        System.out.println("find strings whose subsequence string is " + q);
        List<String> results = new ArrayList<String>();
        for (String str : strSet) {
            // index every character of the candidate string by its positions
            char[] chars = str.toCharArray();
            Map<Character, TreeSet<Integer>> charPositions = new HashMap<Character, TreeSet<Integer>>();
            for (int i = 0; i < chars.length; i++) {
                TreeSet<Integer> positions = charPositions.get(chars[i]);
                if (positions == null) {
                    positions = new TreeSet<Integer>();
                    charPositions.put(chars[i], positions);
                }
                positions.add(i);
            }
            // walk the query, always taking the next position after the previous one
            char[] qChars = q.toCharArray();
            int lowestPosition = -1;
            boolean isSubsequence = false;
            for (int i = 0; i < qChars.length; i++) {
                TreeSet<Integer> positions = charPositions.get(qChars[i]);
                if (positions == null || positions.size() == 0) {
                    break;
                } else {
                    Integer position = positions.higher(lowestPosition);
                    if (position == null) {
                        break;
                    } else {
                        lowestPosition = position;
                        if (i == qChars.length - 1) {
                            isSubsequence = true;
                        }
                    }
                }
            }
            if (isSubsequence) {
                results.add(str);
            }
        }
        return results;
    }
}
Output:
find strings whose subsequence string is acdh
abcdefgh
acdh
acdehacdeh
find strings whose subsequence string is aa
acdehacdeh
find strings whose subsequence string is aaa
As always, I might be totally wrong :)
You might want to have a look at the book Algorithms on Strings, Trees and Sequences by Dan Gusfield. As it turns out, part of it is available on the internet. You might also want to read Gusfield's Introduction to Suffix Trees. As it turns out this book covers many approaches for your kind of question. It is considered one of the standard publications in this field.
Get a fast longest common subsequence algorithm implementation. Actually it suffices to determine the length of the LCS. Notice that Gusfield's book has very good algorithms and also points to more sources for such algorithms.
Return all s ∈ A with length(LCS(s,q)) == length(q)
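For completeness, a sketch of that criterion with a standard LCS-length dynamic program (the isSub() check from the question is cheaper; this just shows the LCS formulation):

def lcs_len(a, b):
    # classic O(len(a)*len(b)) LCS length with a rolling row
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def query_lcs(A, q):
    return [s for s in A if lcs_len(s, q) == len(q)]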

Find a specific char in an array of chars

I am doing an application where I want to find a specific char in an array of chars. In other words, I have the following char array:
char[] charArray= new char[] {'\uE001', '\uE002', '\uE003', '\uE004', '\uE005', '\uE006', '\uE007', '\uE008', '\uE009'};
At some point, I want to check if the character '\uE002' exists in the charArray. My method was to loop over every character in the charArray to find out if it exists:
for (int z = 0; z < charArray.length; z++) {
    if (charArray[z] == myChar) {
        //Do the work
    }
}
Is there any other solution than making a char array and finding the character by looping over every single char?
One option is to pre-sort charArray and use Arrays.binarySearch(charArray, myChar). A non-negative return value will indicate that myChar is present in charArray.
char[] charArray = new char[] {'\uE001', '\uE002', '\uE003', '\uE004', '\uE005', '\uE006', '\uE007', '\uE008', '\uE009'};
Arrays.sort(charArray); // can be omitted if you know that the values are already sorted
...
if (Arrays.binarySearch(charArray, myChar) >= 0) {
    // Do the work
}
Edit: An alternative that avoids using the Arrays class is to put the characters into a string (at initialization time) and then use String.indexOf():
String chars = "\uE001...";
...
if (chars.indexOf(myChar) >= 0) {
    // Do the work
}
This is not hugely different to what you're doing already, except that it requires less code.
If n is the size of charArray, the first solution is O(log n) whereas the second one is O(n).
You can use a hash/map to check the existence of the character. This approach has a better time of O(log n) or O(1), depending on the hash/map's inner structure.
If you don't have access to Arrays, since you are working in Java ME, then you should try to:
either implement a sorted array and the binary search yourself,
or just use an O(n) solution, which is a good solution anyway.
Your solution is O(n), along with the one stated by aix.
You could try to use a Map, but it depends on how many elements you have in your array. If you think there won't be more than 1000 elements in the array, just use an O(n) solution. But if you could have an unknown number of elements, a Map would be a reasonable choice, providing a better solution.
You can use net.rim.device.api.util.Arrays.getIndex(char[] array, char element)
It depends on what Java ME configuration/profile you are using. If you're on CDC, then check what parts of the Java SE 1.3 Collections framework are supported (just find the javadocs targeted at your device and look in the java.util package). Another thing worth checking is whether your device has some BlackBerry-specific API extensions for working with collections.
If you are limited to the bare minimum of CLDC/MIDP, then another solution to the one you mentioned would be to add the chars to a Vector and use Vector.contains(Object). (Note that this requires wrapping each char in a Character object.)
If you don't want to implement it yourself, you could use ArrayUtils from the Apache Commons project:
ArrayUtils apache-commons

Design a data structure for reporting resource conflicts

I have a memory address pool with 1024 addresses. There are 16 threads running inside a program which access these memory locations, doing either read or write operations. The output of this program is a series of quadruples, defined like this:
Quadruple q : (thread number, memory address, read/write, time)
e.g. q1 = (12, 578, r, 2t), q2 = (16, 578, w, 6t)
I want to design a program which takes the stream of quadruples as input and reports all the conflicts that occur when two or more threads access the same memory resource within an interval of 5t seconds, with at least one write operation.
I have several solutions in mind but I am not sure if they are the best ones to address this problem. I am looking for a solution from a design and data structure perspective.
So the basic problem here is collision detection. I would generally look for a solution where elements are added to some kind of associative collection. As a new element is about to be added, you need to be able to tell whether the collection already contains a similar element, indicating a collision. Here you would seem to need a collection type that allows for duplicate elements, such as the STL multimap.
The quadruple would obviously be the value type in the associative collection, and the key type would contain the data necessary to determine whether two elements represent a collision, i.e. memory address and time. In order to use a standard associative collection like STL multimap, you need to define some ordering on the keys by defining operator< for the key type (I'm assuming C++ here, since you didn't specify).
You define a collision as two elements where the memory location is identical and the time values differ by less than some threshold amount. The ordering of the key type has to be such that two keys that represent a collision come out as equivalent under the ordering. Equivalence under the < operator means that a < b is false and b < a is false as well, so the ordering might be defined by this operator:
bool operator<( Key const& a, Key const& b ) {
    if ( a.address == b.address ) {
        if ( abs(a.time - b.time) < threshold ) {
            return false;
        }
        return a.time < b.time;
    }
    return a.address < b.address;
}
There is a problem with this design, due to the fact that two keys may be equivalent under < without being equal. This means that two different but similar Quadruples, i.e. two values that collide with one another, would be stored under the same key in the collection. You could use a simpler definition of the ordering:
bool operator<( Key const& a, Key const& b ) {
    if ( a.address == b.address ) {
        return a.time < b.time;
    }
    return a.address < b.address;
}
Under this ordering definition, colliding elements end up adjacent in an ordered associative container (but under different keys), so you'd be able to find them easily in a post-processing step after they have all been added to the collection.
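To illustrate that post-processing variant, here is a sketch in Python rather than C++ (field layout and the 5t window are taken from the question; reporting is pairwise): sort the events by (address, time), then for each event scan forward only while later events are at the same address and inside the window.

def find_conflicts(quads, threshold=5):
    # quads: iterable of (thread, address, op, time), with op 'r' or 'w'
    events = sorted(quads, key=lambda q: (q[1], q[3]))   # by address, then time
    conflicts = []
    for i, (t1, a1, op1, tm1) in enumerate(events):
        for t2, a2, op2, tm2 in events[i + 1:]:
            if a2 != a1 or tm2 - tm1 >= threshold:
                break                                    # sorted: no later event can match
            if t1 != t2 and 'w' in (op1, op2):
                conflicts.append(((t1, a1, op1, tm1), (t2, a2, op2, tm2)))
    return conflicts

For the question's example, find_conflicts([(12, 578, 'r', 2), (16, 578, 'w', 6)]) reports the single conflicting pair, since the accesses hit the same address within 5t and one of them is a write.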
