Efficient algorithm for grouping array of strings by prefixes

I wonder what the best way is to group an array of strings according to a list of prefixes (of arbitrary length).
For example, if we have this:
prefixes = ['GENERAL', 'COMMON', 'HY-PHE-NATED', 'UNDERSCORED_']
Then
tasks = ['COMMONA', 'COMMONB', 'GENERALA', 'HY-PHE-NATEDA', 'UNDERSCORED_A', 'HY-PHE-NATEDB']
Should be grouped this way:
[['GENERALA'], ['COMMONA', 'COMMONB'], ['HY-PHE-NATEDA', 'HY-PHE-NATEDB'], ['UNDERSCORED_A'] ]
The naïve approach is to loop through all the tasks with an inner loop over the prefixes (or vice versa) and test each task against each prefix.
Can anyone give me a hint on how to do this more efficiently?

It depends a bit on the size of your problem, of course, but your naive approach should be okay if you sort both your prefixes and your tasks and then build your sub-arrays by traversing both sorted lists only forwards.
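A minimal Python sketch of that sorted, forward-only traversal might look like the following. The function name group_by_prefix_sorted and the separate "unmatched" list are assumptions made for illustration, and the sketch assumes each task belongs to at most one group (if one prefix is itself a prefix of another, the shorter one wins).

```python
def group_by_prefix_sorted(prefixes, tasks):
    """Group tasks by prefix using one forward pass over two sorted lists."""
    groups = {p: [] for p in prefixes}
    unmatched = []
    sorted_prefixes = sorted(prefixes)
    i = 0  # position in sorted_prefixes; it only ever moves forward
    for task in sorted(tasks):
        # Skip prefixes that sort before this task without matching it:
        # they cannot match any later (larger) task either.
        while (i < len(sorted_prefixes)
               and sorted_prefixes[i] < task
               and not task.startswith(sorted_prefixes[i])):
            i += 1
        if i < len(sorted_prefixes) and task.startswith(sorted_prefixes[i]):
            groups[sorted_prefixes[i]].append(task)
        else:
            unmatched.append(task)
    # Report the groups in the order the prefixes were originally given.
    return [groups[p] for p in prefixes], unmatched
```

After the two sorts, the pass itself touches each prefix and each task once, so the sorting cost dominates.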

There are a few options, but you might be interested in looking into the trie data structure.
http://en.wikipedia.org/wiki/Trie
The trie data structure is easy to understand and implement and works well for this type of problem. If you find that it works for your situation, you can also look at Patricia tries, which achieve similar performance characteristics but typically have better memory utilization. They are a little more involved to implement but not overly complex.
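As a rough illustration of the trie idea, here is a small Python sketch that stores the prefixes in a trie built from nested dicts and then walks each task down it. The names END, build_trie, find_prefix_index and group_by_prefix_trie are made up for this example, and it assumes a task should go to the shortest prefix that matches it.

```python
END = None  # dict key marking "a prefix ends at this node" (illustrative choice)


def build_trie(prefixes):
    """Store every prefix in a trie of nested dicts, remembering its index."""
    root = {}
    for index, prefix in enumerate(prefixes):
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[END] = index
    return root


def find_prefix_index(root, task):
    """Walk the task down the trie; return the index of the first prefix hit."""
    node = root
    for ch in task:
        if END in node:
            return node[END]
        node = node.get(ch)
        if node is None:
            return None  # fell off the trie: no prefix matches
    return node.get(END)  # the task may be exactly equal to a prefix


def group_by_prefix_trie(prefixes, tasks):
    root = build_trie(prefixes)
    groups = [[] for _ in prefixes]
    for task in tasks:
        index = find_prefix_index(root, task)
        if index is not None:
            groups[index].append(task)
    return groups
```

Each task is matched in time proportional to its own length, independent of how many prefixes there are.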

Related

Concurrently check if string is in slice?

Generally, to check whether a string is in a slice, I write a function with a for loop and an if statement. But that is really inefficient for large slices of strings or struct types. Is it possible to make this check concurrent?
Searching sequential data concurrently is usually not a great idea, simply because binary search already scales really well, even to billions of records. All you have to do to use it is build an index on top of the slice you are searching. For the most trivial index, save the keys into another slice along with the index of the data each one points to. Once you have that slice, sort it by key, and the index is done.
You then perform the binary search on the index you just created. This way each lookup has complexity O(log N).
Another, much simpler option is to create a map[string]int and insert all keys along with their indexes, then look the index up in the map. That lookup is O(1) on average.
The important thing to note is that if you only have to perform one search on a given slice, this is not worth it, as building the index is a lot heavier than a linear search.
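The two indexing options can be sketched as follows. This is written in Python rather than Go purely for illustration, with a dict standing in for map[string]int; the function names are hypothetical, and when a key occurs more than once both variants report its first position.

```python
import bisect


def build_sorted_index(items):
    """Pair each key with its original position, then sort by key."""
    return sorted((value, position) for position, value in enumerate(items))


def find_sorted(index, key):
    """Binary search the sorted index: O(log N) per lookup."""
    # (key,) sorts just before every (key, position) pair with the same key.
    i = bisect.bisect_left(index, (key,))
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return None


def build_map_index(items):
    """Hash-map alternative: key -> first position, average O(1) lookup."""
    positions = {}
    for position, value in enumerate(items):
        positions.setdefault(value, position)
    return positions
```

Either index costs at least one full pass over the data to build, which is why it only pays off when the same slice is searched many times.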

Haskell data structure that is efficient for swapping elements?

I am looking for a Haskell data structure that stores an ordered list of elements and that is time-efficient at swapping pairs of elements at arbitrary locations within the list. It's not [a], obviously. It's not Vector because swapping creates new vectors. Which data structure is efficient at this?
The most efficient implementations of persistent data structures, which exhibit effectively O(1) updates (as well as appending, prepending, counting and slicing), are based on the Array Mapped Trie algorithm. The Vector data structures of Clojure and Scala are based on it, for instance. The only Haskell implementation of that data structure that I know of is provided by the "persistent-vector" package.
This algorithm is very young: it was first presented in the year 2000, which might be the reason why not many people have heard of it. But it turned out to be such a universal solution that it was adapted for hash tables soon after. The adapted version of the algorithm is called Hash Array Mapped Trie. It is also used in Clojure and Scala to implement the Set and Map data structures, and it is more ubiquitous in Haskell, with packages like "unordered-containers" and "stm-containers" revolving around it.
To learn more about the algorithm I recommend the following links:
http://blog.higher-order.net/2009/02/01/understanding-clojures-persistentvector-implementation.html
http://lampwww.epfl.ch/papers/idealhashtrees.pdf
Data.Sequence from the containers package would likely be a not-terrible data structure to start with for this use case.
Haskell is a (nearly) pure functional language, so any data structure you update will need to make a new copy of the structure, and re-using the data elements is close to the best you can do. Also, the new list would be lazily evaluated and typically only the spine would need to be created until you need the data. If the number of updates is small compared to the number of elements, you could make a difference list that checks a sparse set of updates first, and only then looks in the original vector.
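The last idea, a sparse set of updates consulted before the original vector, can be sketched roughly like this. It is shown in Python only as an illustration of the lookup order, not as Haskell code; OverlayVector and its methods are hypothetical names, and indices are assumed to be non-negative.

```python
class OverlayVector:
    """An unchanging base sequence plus a sparse dict of overridden slots."""

    def __init__(self, base, updates=None):
        self._base = base               # shared between versions, never mutated
        self._updates = updates or {}   # index -> replacement value

    def __getitem__(self, i):
        # Check the sparse updates first, then fall back to the base.
        if i in self._updates:
            return self._updates[i]
        return self._base[i]

    def swap(self, i, j):
        """Return a new version with slots i and j exchanged."""
        updates = dict(self._updates)   # copies only the sparse overlay
        updates[i], updates[j] = self[j], self[i]
        return OverlayVector(self._base, updates)

    def to_list(self):
        return [self[i] for i in range(len(self._base))]
```

Each swap copies only the overlay, so this stays cheap while the number of updates is small compared to the number of elements, which is the condition stated above.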

Meta-information in DAWG/DAFSA

I would like to implement a string look-up data structure, for dynamic strings, that will support efficient search and insertion. Currently, I am using a trie but I would like to reduce the memory footprint if possible. This Wikipedia article describes a DAWG/DAFSA, which will obviously save a lot of space over a trie by compressing suffixes. However, while it will clearly test whether a string is legal, it is not obvious to me if there is any way to exclude illegal strings. For example, using the words "cite" and "cat" where the "t" and "e" are terminal states, a DAWG/DAFSA would look like this:
    c
   / \
  a   i
   \ /
    t
    |
    e
and "cit" and "cate" will be incorrectly recognized as legal strings without some meta-information.
Questions:
1) Is there a preferred way to store meta-information about strings/paths (such as legality) in a DAWG/DAFSA?
2) If a DAWG/DAFSA is incompatible with the requirements (efficient search/insertion and storing meta-information) what's the best data structure to use? A minimal memory footprint would be nice, but perhaps not absolutely necessary.
In a DAWG, you only compress states together if they're completely indistinguishable from one another. This means that you actually wouldn't combine the T nodes for CAT and CITE together for precisely the reason you've noted - that gives you either a false positive on CIT or a false negative on CAT.
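To make the "indistinguishable states" rule concrete, here is a small Python sketch that builds a plain trie and then gives every node a canonical id, so that two nodes get the same id only if they agree on being terminal and on all of their outgoing edges. The names build_trie and minimize are illustrative, and this is not an incremental DAWG construction algorithm.

```python
def build_trie(words):
    root = {"terminal": False, "children": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["children"].setdefault(
                ch, {"terminal": False, "children": {}})
        node["terminal"] = True
    return root


def minimize(node, registry):
    """Return a canonical state id; equal ids mean the nodes can be shared."""
    key = (
        node["terminal"],
        tuple(sorted((ch, minimize(child, registry))
                     for ch, child in node["children"].items())),
    )
    # Hand out a fresh id the first time a signature is seen.
    return registry.setdefault(key, len(registry))


root = build_trie(["cat", "cite"])
registry = {}
minimize(root, registry)
under_c = root["children"]["c"]["children"]
t_of_cat = under_c["a"]["children"]["t"]    # terminal, no outgoing edges
t_of_cite = under_c["i"]["children"]["t"]   # not terminal, has an "e" edge
```

Here the two "t" nodes get different ids, so they are not merged, while the final "t" of "cat" and the final "e" of "cite" do share one state because both are terminal with no outgoing edges.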
DAWGs are typically most effective for static dictionaries when you have a huge number of words with common suffixes. A DAWG for all of English, for example, could save a lot of space by combining all the suffix "s"'s at the end of plural words and most of the "ING" suffixes from gerunds. If you're going to be doing a lot of insertions or deletions, DAWGs are almost certainly the wrong data structure for the job because adding or removing a single word from a DAWG can cause ripple effects that require lots of branches that were previously combined to be split or vice-versa.
Quite honestly, for reasonably-sized data sets, a trie isn't a bad call. A trie for all of English would only use up something like 26MB, which isn't very much. I would only go with the DAWG if space usage really is at a premium and you aren't doing many insertions or deletions.
Hope this helps!

efficient functional data structure for finite bijections

I'm looking for a functional data structure that represents finite bijections between two types, that is space-efficient and time-efficient.
For instance, I'd be happy if, considering a bijection f of size n:
extending f with a new pair of elements has complexity O(ln n)
querying f(x) or f^-1(x) has complexity O(ln n)
the internal representation for f is more space efficient than having 2 finite maps (representing f and its inverse)
I am aware of efficient representation of permutations, like this paper, but it does not seem to solve my problem.
Please have a look at my answer for a relatively similar question. The provided code can handle general NxM relations, but also be specialized to just bijections (just as you would for a binary search tree).
Pasting the answer here for completeness:
The simplest way is to use a pair of unidirectional maps. It has some cost, but you won't get much better (you could get a bit better using dedicated binary trees, but you have a huge complexity cost to pay if you have to implement it yourself). In essence, lookups will be just as fast, but addition and deletion will be twice as slow, which isn't so bad for a logarithmic operation. Another advantage of this technique is that you can use specialized map types for the key or value type if you have one available. You won't get as much flexibility with a single general-purpose data structure.
A different solution is to use a quadtree (instead of considering an NxN relation as a pair of 1xN and Nx1 relations, you see it as a set of elements in the cartesian product (Key*Value) of your types, that is, a spatial plane), but it's not clear to me that the time and memory costs are better than with two maps. I suppose it needs to be tested.
Although it doesn't satisfy your third requirement, bimaps seem like the way to go. (They just make two finite maps, one in each direction, convenient to use.)
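A minimal sketch of the two-maps idea, kept one-to-one on insertion, could look like this. For brevity it uses mutable Python dicts rather than a persistent functional map, so it only illustrates the bookkeeping; the class name Bimap and its methods are assumptions for this example, and None is assumed not to be stored as a key or value.

```python
class Bimap:
    """A bijection stored as two dicts, one per direction."""

    def __init__(self):
        self._forward = {}    # x -> y
        self._backward = {}   # y -> x

    def insert(self, x, y):
        # Drop any existing pair that mentions x or y so that both
        # directions stay one-to-one.
        if x in self._forward:
            del self._backward[self._forward[x]]
        if y in self._backward:
            del self._forward[self._backward[y]]
        self._forward[x] = y
        self._backward[y] = x

    def lookup(self, x):
        """f(x), or None if x has no partner."""
        return self._forward.get(x)

    def inverse(self, y):
        """f^-1(y), or None if y has no partner."""
        return self._backward.get(y)

    def __len__(self):
        return len(self._forward)
```

Every pair is stored twice, which is exactly the space cost the third requirement asks to avoid.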

Looking for an efficient array-like structure that supports "replace-one-member" and "append"

As an exercise I wrote an implementation of the longest increasing subsequence algorithm, initially in Python but I would like to translate this to Haskell. In a nutshell, the algorithm involves a fold over a list of integers, where the result of each iteration is an array of integers that is the result of either changing one element of or appending one element to the previous result.
Of course in Python you can just change one element of the array. In Haskell, you could rebuild the array while replacing one element at each iteration - but that seems wasteful (copying most of the array at each iteration).
In summary what I'm looking for is an efficient Haskell data structure that is an ordered collection of 'n' objects and supports the operations: lookup i, replace i foo, and append foo (where i is in [0..n-1]). Suggestions?
Perhaps the standard Seq type from Data.Sequence. It's not quite O(1), but it's pretty good:
index (your lookup) and adjust (your replace) are O(log(min(index, length - index)))
(><) (concatenating two sequences) is O(log(min(length1, length2))); appending a single element with (|>) is O(1)
It's based on a tree structure (specifically, a 2-3 finger tree), so it should have good sharing properties (meaning that it won't copy the entire sequence for incremental modifications, and will perform them faster too). Note that Seqs are strict in their spine (and therefore finite), unlike lists.
I would try to just use mutable arrays in this case, preferably in the ST monad.
The main advantages would be making the translation more straightforward and making things simple and efficient.
The disadvantage, of course, is losing on purity and composability. However I think this should not be such a big deal since I don't think there are many cases where you would like to keep intermediate algorithm states around.
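For reference, the replace-one-or-append pattern described in the question looks roughly like this in Python with a mutable list. This is a sketch of the standard O(n log n) length computation for strictly increasing subsequences, not the asker's original code; lis_length and tails are illustrative names.

```python
from bisect import bisect_left


def lis_length(values):
    """Length of the longest strictly increasing subsequence."""
    # tails[k] is the smallest value that can end an increasing
    # subsequence of length k + 1 seen so far; tails stays sorted.
    tails = []
    for v in values:
        i = bisect_left(tails, v)   # "lookup": binary search in tails
        if i == len(tails):
            tails.append(v)         # "append foo"
        else:
            tails[i] = v            # "replace i foo"
    return len(tails)
```

Each step performs exactly one binary search followed by either one replacement or one append, which is the access pattern a mutable array in ST or a Seq would need to support.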

Resources