Generate ngrams with Julia

Generate ngrams with Julia - nlp

To generate word bigrams in Julia, I could simply zip through the original list and a list that drops the first element, e.g.:
julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
"the"
"lazy"
"fox"
"jumps"
"over"
"the"
"brown"
"dog"
julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
("the","lazy")
("lazy","fox")
("fox","jumps")
("jumps","over")
("over","the")
("the","brown")
("brown","dog")
To generate a trigram I could use the same collect(zip(...)) idiom to get:
julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox")
("lazy","fox","jumps")
("fox","jumps","over")
("jumps","over","the")
("over","the","brown")
("the","brown","dog")
But I have to manually add in the 3rd list to zip through, is there an idiomatic way such that I can do any order of n-gram?
e.g. I'll like to avoid doing this to extract 5-gram:
julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox","jumps","over")
("lazy","fox","jumps","over","the")
("fox","jumps","over","the","brown")
("jumps","over","the","brown","dog")

By changing the output slightly and using SubArrays instead of Tuples, little is lost, but it is possible to avoid allocations and memory copying. If the underlying word list is static, this is OK and faster (in my benchmarks too). The code:
ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1]
and the output:
julia> ngram(s,5)
SubString{String}["the","lazy","fox","jumps","over"]
SubString{String}["lazy","fox","jumps","over","the"]
SubString{String}["fox","jumps","over","the","brown"]
SubString{String}["jumps","over","the","brown","dog"]
julia> ngram(s,5)[1][3]
"fox"
For larger word lists the memory requirements are substantially smaller also.
Also note using a generator allows processing the ngrams one-by-one faster and with less memory and might be enough for the desired processing code (counting something or passing through some hash). For example, using #Gnimuc's solution without the collect i.e. just partition(s, n, 1).

Here's a clean one-liner for n-grams of any length.
ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))
It uses a generator comprehension to iterate over the number of elements, k, to drop. Then, using the splat (...) operator, it unpacks the Drops into zip, and finally collects the Zip into an Array.
julia> ngram(s, 2)
7-element Array{Tuple{SubString{String},SubString{String}},1}:
("the","lazy")
("lazy","fox")
("fox","jumps")
("jumps","over")
("over","the")
("the","brown")
("brown","dog")
julia> ngram(s, 5)
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox","jumps","over")
("lazy","fox","jumps","over","the")
("fox","jumps","over","the","brown")
("jumps","over","the","brown","dog")
As you can see, this is very similar to your solution - only a simple comprehension was added to iterate over the number of elements to drop, so that the length could be dynamic.

Another way is to use Iterators.jl's partition():
ngram(s,n) = collect(partition(s, n, 1))

Related

Difference between tuples and lists [duplicate]

What's the difference between tuples/lists and what are their advantages/disadvantages?

Apart from tuples being immutable there is also a semantic distinction that should guide their usage. Tuples are heterogeneous data structures (i.e., their entries have different meanings), while lists are homogeneous sequences. Tuples have structure, lists have order.
Using this distinction makes code more explicit and understandable.
One example would be pairs of page and line number to reference locations in a book, e.g.:
my_location = (42, 11) # page number, line number
You can then use this as a key in a dictionary to store notes on locations. A list on the other hand could be used to store multiple locations. Naturally one might want to add or remove locations from the list, so it makes sense that lists are mutable. On the other hand it doesn't make sense to add or remove items from an existing location - hence tuples are immutable.
There might be situations where you want to change items within an existing location tuple, for example when iterating through the lines of a page. But tuple immutability forces you to create a new location tuple for each new value. This seems inconvenient on the face of it, but using immutable data like this is a cornerstone of value types and functional programming techniques, which can have substantial advantages.
There are some interesting articles on this issue, e.g. "Python Tuples are Not Just Constant Lists" or "Understanding tuples vs. lists in Python". The official Python documentation also mentions this
"Tuples are immutable, and usually contain an heterogeneous sequence ...".
In a statically typed language like Haskell the values in a tuple generally have different types and the length of the tuple must be fixed. In a list the values all have the same type and the length is not fixed. So the difference is very obvious.
Finally there is the namedtuple in Python, which makes sense because a tuple is already supposed to have structure. This underlines the idea that tuples are a light-weight alternative to classes and instances.

Difference between list and tuple
Literal
someTuple = (1,2)
someList = [1,2]
Size
a = tuple(range(1000))
b = list(range(1000))
a.__sizeof__() # 8024
b.__sizeof__() # 9088
Due to the smaller size of a tuple operation, it becomes a bit faster, but not that much to mention about until you have a huge number of elements.
Permitted operations
b = [1,2]
b[0] = 3 # [3, 2]
a = (1,2)
a[0] = 3 # Error
That also means that you can't delete an element or sort a tuple.
However, you could add a new element to both list and tuple with the only difference that since the tuple is immutable, you are not really adding an element but you are creating a new tuple, so the id of will change
a = (1,2)
b = [1,2]
id(a) # 140230916716520
id(b) # 748527696
a += (3,) # (1, 2, 3)
b += [3] # [1, 2, 3]
id(a) # 140230916878160
id(b) # 748527696
Usage
As a list is mutable, it can't be used as a key in a dictionary, whereas a tuple can be used.
a = (1,2)
b = [1,2]
c = {a: 1} # OK
c = {b: 1} # Error

If you went for a walk, you could note your coordinates at any instant in an (x,y) tuple.
If you wanted to record your journey, you could append your location every few seconds to a list.
But you couldn't do it the other way around.

The key difference is that tuples are immutable. This means that you cannot change the values in a tuple once you have created it.
So if you're going to need to change the values use a List.
Benefits to tuples:
Slight performance improvement.
As a tuple is immutable it can be used as a key in a dictionary.
If you can't change it neither can anyone else, which is to say you don't need to worry about any API functions etc. changing your tuple without being asked.

Lists are mutable; tuples are not.
From docs.python.org/2/tutorial/datastructures.html
Tuples are immutable, and usually contain an heterogeneous sequence of
elements that are accessed via unpacking (see later in this section)
or indexing (or even by attribute in the case of namedtuples). Lists
are mutable, and their elements are usually homogeneous and are
accessed by iterating over the list.

This is an example of Python lists:
my_list = [0,1,2,3,4]
top_rock_list = ["Bohemian Rhapsody","Kashmir","Sweet Emotion", "Fortunate Son"]
This is an example of Python tuple:
my_tuple = (a,b,c,d,e)
celebrity_tuple = ("John", "Wayne", 90210, "Actor", "Male", "Dead")
Python lists and tuples are similar in that they both are ordered collections of values. Besides the shallow difference that lists are created using brackets "[ ... , ... ]" and tuples using parentheses "( ... , ... )", the core technical "hard coded in Python syntax" difference between them is that the elements of a particular tuple are immutable whereas lists are mutable (...so only tuples are hashable and can be used as dictionary/hash keys!). This gives rise to differences in how they can or can't be used (enforced a priori by syntax) and differences in how people choose to use them (encouraged as 'best practices,' a posteriori, this is what smart programers do). The main difference a posteriori in differentiating when tuples are used versus when lists are used lies in what meaning people give to the order of elements.
For tuples, 'order' signifies nothing more than just a specific 'structure' for holding information. What values are found in the first field can easily be switched into the second field as each provides values across two different dimensions or scales. They provide answers to different types of questions and are typically of the form: for a given object/subject, what are its attributes? The object/subject stays constant, the attributes differ.
For lists, 'order' signifies a sequence or a directionality. The second element MUST come after the first element because it's positioned in the 2nd place based on a particular and common scale or dimension. The elements are taken as a whole and mostly provide answers to a single question typically of the form, for a given attribute, how do these objects/subjects compare? The attribute stays constant, the object/subject differs.
There are countless examples of people in popular culture and programmers who don't conform to these differences and there are countless people who might use a salad fork for their main course. At the end of the day, it's fine and both can usually get the job done.
To summarize some of the finer details
Similarities:
Duplicates - Both tuples and lists allow for duplicates
Indexing, Selecting, & Slicing - Both tuples and lists index using integer values found within brackets. So, if you want the first 3 values of a given list or tuple, the syntax would be the same:
>>> my_list[0:3]
[0,1,2]
>>> my_tuple[0:3]
[a,b,c]
Comparing & Sorting - Two tuples or two lists are both compared by their first element, and if there is a tie, then by the second element, and so on. No further attention is paid to subsequent elements after earlier elements show a difference.
>>> [0,2,0,0,0,0]>[0,0,0,0,0,500]
True
>>> (0,2,0,0,0,0)>(0,0,0,0,0,500)
True
Differences: - A priori, by definition
Syntax - Lists use [], tuples use ()
Mutability - Elements in a given list are mutable, elements in a given tuple are NOT mutable.
# Lists are mutable:
>>> top_rock_list
['Bohemian Rhapsody', 'Kashmir', 'Sweet Emotion', 'Fortunate Son']
>>> top_rock_list[1]
'Kashmir'
>>> top_rock_list[1] = "Stairway to Heaven"
>>> top_rock_list
['Bohemian Rhapsody', 'Stairway to Heaven', 'Sweet Emotion', 'Fortunate Son']
# Tuples are NOT mutable:
>>> celebrity_tuple
('John', 'Wayne', 90210, 'Actor', 'Male', 'Dead')
>>> celebrity_tuple[5]
'Dead'
>>> celebrity_tuple[5]="Alive"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
Hashtables (Dictionaries) - As hashtables (dictionaries) require that its keys are hashable and therefore immutable, only tuples can act as dictionary keys, not lists.
#Lists CAN'T act as keys for hashtables(dictionaries)
>>> my_dict = {[a,b,c]:"some value"}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
#Tuples CAN act as keys for hashtables(dictionaries)
>>> my_dict = {("John","Wayne"): 90210}
>>> my_dict
{('John', 'Wayne'): 90210}
Differences - A posteriori, in usage
Homo vs. Heterogeneity of Elements - Generally list objects are homogenous and tuple objects are heterogeneous. That is, lists are used for objects/subjects of the same type (like all presidential candidates, or all songs, or all runners) whereas although it's not forced by), whereas tuples are more for heterogenous objects.
Looping vs. Structures - Although both allow for looping (for x in my_list...), it only really makes sense to do it for a list. Tuples are more appropriate for structuring and presenting information (%s %s residing in %s is an %s and presently %s % ("John","Wayne",90210, "Actor","Dead"))

It's been mentioned that the difference is largely semantic: people expect a tuple and list to represent different information. But this goes further than a guideline; some libraries actually behave differently based on what they are passed. Take NumPy for example (copied from another post where I ask for more examples):
>>> import numpy as np
>>> a = np.arange(9).reshape(3,3)
>>> a
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> idx = (1,1)
>>> a[idx]
4
>>> idx = [1,1]
>>> a[idx]
array([[3, 4, 5],
[3, 4, 5]])
The point is, while NumPy may not be part of the standard library, it's a major Python library, and within NumPy lists and tuples are completely different things.

Lists are for looping, tuples are for structures i.e. "%s %s" %tuple.
Lists are usually homogeneous, tuples are usually heterogeneous.
Lists are for variable length, tuples are for fixed length.

The values of list can be changed any time but the values of tuples can't be change.
The advantages and disadvantages depends upon the use. If you have such a data which you never want to change then you should have to use tuple, otherwise list is the best option.

Difference between list and tuple
Tuples and lists are both seemingly similar sequence types in Python.
Literal syntax
We use parenthesis () to construct tuples and square brackets [ ] to get a new list. Also, we can use call of the appropriate type to get required structure — tuple or list.
someTuple = (4,6)
someList = [2,6]
Mutability
Tuples are immutable, while lists are mutable. This point is the base the for the following ones.
Memory usage
Due to mutability, you need more memory for lists and less memory for tuples.
Extending
You can add a new element to both tuples and lists with the only difference that the id of the tuple will be changed (i.e., we’ll have a new object).
Hashing
Tuples are hashable and lists are not. It means that you can use a tuple as a key in a dictionary. The list can't be used as a key in a dictionary, whereas a tuple can be used
tup = (1,2)
list_ = [1,2]
c = {tup : 1} # ok
c = {list_ : 1} # error
Semantics
This point is more about best practice. You should use tuples as heterogeneous data structures, while lists are homogenous sequences.

Lists are intended to be homogeneous sequences, while tuples are heterogeneous data structures.

As people have already answered here that tuples are immutable while lists are mutable, but there is one important aspect of using tuples which we must remember
If the tuple contains a list or a dictionary inside it, those can be changed even if the tuple itself is immutable.
For example, let's assume we have a tuple which contains a list and a dictionary as
my_tuple = (10,20,30,[40,50],{ 'a' : 10})
we can change the contents of the list as
my_tuple[3][0] = 400
my_tuple[3][1] = 500
which makes new tuple looks like
(10, 20, 30, [400, 500], {'a': 10})
we can also change the dictionary inside tuple as
my_tuple[4]['a'] = 500
which will make the overall tuple looks like
(10, 20, 30, [400, 500], {'a': 500})
This happens because list and dictionary are the objects and these objects are not changing, but the contents its pointing to.
So the tuple remains immutable without any exception

The PEP 484 -- Type Hints says that the types of elements of a tuple can be individually typed; so that you can say Tuple[str, int, float]; but a list, with List typing class can take only one type parameter: List[str], which hints that the difference of the 2 really is that the former is heterogeneous, whereas the latter intrinsically homogeneous.
Also, the standard library mostly uses the tuple as a return value from such standard functions where the C would return a struct.

As people have already mentioned the differences I will write about why tuples.
Why tuples are preferred?
Allocation optimization for small tuples
To reduce memory fragmentation and speed up allocations, Python reuses old tuples. If a
tuple no longer needed and has less than 20 items instead of deleting
it permanently Python moves it to a free list.
A free list is divided into 20 groups, where each group represents a
list of tuples of length n between 0 and 20. Each group can store up
to 2 000 tuples. The first (zero) group contains only 1 element and
represents an empty tuple.
>>> a = (1,2,3)
>>> id(a)
4427578104
>>> del a
>>> b = (1,2,4)
>>> id(b)
4427578104
In the example above we can see that a and b have the same id. That is
because we immediately occupied a destroyed tuple which was on the
free list.
Allocation optimization for lists
Since lists can be modified, Python does not use the same optimization as in tuples. However,
Python lists also have a free list, but it is used only for empty
objects. If an empty list is deleted or collected by GC, it can be
reused later.
>>> a = []
>>> id(a)
4465566792
>>> del a
>>> b = []
>>> id(b)
4465566792
Source: https://rushter.com/blog/python-lists-and-tuples/
Why tuples are efficient than lists? -> https://stackoverflow.com/a/22140115

The most important difference is time ! When you do not want to change the data inside the list better to use tuple ! Here is the example why use tuple !
import timeit
print(timeit.timeit(stmt='[1,2,3,4,5,6,7,8,9,10]', number=1000000)) #created list
print(timeit.timeit(stmt='(1,2,3,4,5,6,7,8,9,10)', number=1000000)) # created tuple
In this example we executed both statements 1 million times
Output :
0.136621
0.013722200000000018
Any one can clearly notice the time difference.

A direction quotation from the documentation on 5.3. Tuples and Sequences:
Though tuples may seem similar to lists, they are often used in different situations and for different purposes. Tuples are immutable, and usually contain a heterogeneous sequence of elements that are accessed via unpacking (see later in this section) or indexing (or even by attribute in the case of namedtuples). Lists are mutable, and their elements are usually homogeneous and are accessed by iterating over the list.

In other words, TUPLES are used to store group of elements where the contents/members of the group would not change while LISTS are used to store group of elements where the members of the group can change.
For instance, if i want to store IP of my network in a variable, it's best i used a tuple since the the IP is fixed. Like this my_ip = ('192.168.0.15', 33, 60). However, if I want to store group of IPs of places I would visit in the next 6 month, then I should use a LIST, since I will keep updating and adding new IP to the group. Like this
places_to_visit = [
('192.168.0.15', 33, 60),
('192.168.0.22', 34, 60),
('192.168.0.1', 34, 60),
('192.168.0.2', 34, 60),
('192.168.0.8', 34, 60),
('192.168.0.11', 34, 60)
]

First of all, they both are the non-scalar objects (also known as a compound objects) in Python.
Tuples, ordered sequence of elements (which can contain any object with no aliasing issue)
Immutable (tuple, int, float, str)
Concatenation using + (brand new tuple will be created of course)
Indexing
Slicing
Singleton (3,) # -> (3) instead of (3) # -> 3
List (Array in other languages), ordered sequence of values
Mutable
Singleton [3]
Cloning new_array = origin_array[:]
List comprehension [x**2 for x in range(1,7)] gives you
[1,4,9,16,25,36] (Not readable)
Using list may also cause an aliasing bug (two distinct paths
pointing to the same object).

Just a quick extension to list vs tuple responses:
Due to dynamic nature, list allocates more bit buckets than the actual memory required. This is done to prevent costly reallocation operation in case extra items are appended in the future.
On the other hand, being static, lightweight tuple object does not reserve extra memory required to store them.

Lists are mutable and tuples are immutable.
Just consider this example.
a = ["1", "2", "ra", "sa"] #list
b = ("1", "2", "ra", "sa") #tuple
Now change index values of list and tuple.
a[2] = 1000
print a #output : ['1', '2', 1000, 'sa']
b[2] = 1000
print b #output : TypeError: 'tuple' object does not support item assignment.
Hence proved the following code is invalid with tuple, because we attempted to update a tuple, which is not allowed.

Lists are mutable. whereas tuples are immutable. Accessing an offset element with index makes more sense in tuples than lists, Because the elements and their index cannot be changed.

List is mutable and tuples is immutable. The main difference between mutable and immutable is memory usage when you are trying to append an item.
When you create a variable, some fixed memory is assigned to the variable. If it is a list, more memory is assigned than actually used. E.g. if current memory assignment is 100 bytes, when you want to append the 101th byte, maybe another 100 bytes will be assigned (in total 200 bytes in this case).
However, if you know that you are not frequently add new elements, then you should use tuples. Tuples assigns exactly size of the memory needed, and hence saves memory, especially when you use large blocks of memory.

Finding item at index i of an iterable (syntactic sugar)?

I know that maps, range, filters etc. in python3 return iterables, and only calculate value when required. Suppose that there is a map M. I want to print the i^th element of M.
One way would be to iterate till i^th value, and print it:
for _ in range(i):
next(M)
print(next(M))
The above takes O(i) time, where I have to find the i^th value.
Another way is to convert to a list, and print the i^th value:
print(list(M)[i])
This however, takes O(n) time and O(n) space (where n is the size of the list from which the map M is created). However, this suits the so-called "Pythonic way of writing one-liners."
I was wondering if there is a syntactic sugar to minimise writing in the first way? (i.e., if there is a way which takes O(i) time, no extra space, and is more suited to the "Pythonic way of writing".)

You can use islice:
from itertools import islice
i = 3
print(next(islice(iterable), i, i + 1))
This outputs '3'.
It actually doesn't matter what you use as the stop argument, as long as you call next once.

Thanks to #DeepSpace for the reference to the official docs, I found the following:
from more_itertools import nth
print(nth(M, i))
It prints the element at i^th index of the iterable.

Julia: Using split to construct string arrays with multiple columns

I have a tuple which contains two pieces of information separated by # that looks like x = ("aa#b", "a#c", "a#d"). I can use a comprehension to transform this data into an array in the following way [split(x[i], "#")[j] for i in 1:length(x), j in 1:2].
However, this seems inefficient since I am effectively running the split command twice. Is there a preferred way of handling this case?
Thank you

function hashsplit(x)
out = Array{SubString{String},2}(2,length(x))
for (ind,j) in enumerate(x)
out[:,ind] = split(j,"#")
end
return out
end
Should be faster. Else a simple way with a list comprehension would be
[(split(x[i], "#")...) for i in eachindex(x)] (for a vector of tuples)
cat(2,ans...) or reduce(hcat, ans) if you want a matrix.

Despite it being an old question, here it is my proposed solution
String.(reduce(hcat,split.(a,'#')))
I checked with the #time operator. Compared with the list comprehension proposed by Alexander, which takes 0.042378 seconds (90.93 k allocations: 4.330 MiB),
the amount of time required for the same operation with the proposed solution is 0.000028 seconds (34 allocations: 1.500 KiB).

Data Structure for Subsequence Queries

In a program I need to efficiently answer queries of the following form:
Given a set of strings A and a query string q return all s ∈ A such that q is a subsequence of s
For example, given A = {"abcdef", "aaaaaa", "ddca"} and q = "acd" exactly "abcdef" should be returned.
The following is what I have considered considered so far:
For each possible character, make a sorted list of all string/locations where it appears. For querying interleave the lists of the involved characters, and scan through it looking for matches within string boundaries.
This would probably be more efficient for words instead of characters, since the limited number of different characters will make the return lists very dense.
For each n-prefix q might have, store the list of all matching strings. n might realistically be close to 3. For query strings longer than that we brute force the initial list.
This might speed things up a bit, but one could easily imagine some n-subsequences being present close to all strings in A, which means worst case is the same as just brute forcing the entire set.
Do you know of any data structures, algorithms or preprocessing tricks which might be helpful for performing the above task efficiently for large As? (My ss will be around 100 characters)
Update: Some people have suggested using LCS to check if q is a subsequence of s. I just want to remind that this can be done using a simple function such as:
def isSub(q,s):
i, j = 0, 0
while i != len(q) and j != len(s):
if q[i] == s[j]:
i += 1
j += 1
else:
j += 1
return i == len(q)
Update 2: I've been asked to give more details on the nature of q, A and its elements. While I'd prefer something that works as generally as possible, I assume A will have length around 10^6 and will need to support insertion. The elements s will be shorter with an average length of 64. The queries q will only be 1 to 20 characters and be used for a live search, so the query "ab" will be sent just before the query "abc". Again, I'd much prefer the solution to use the above as little as possible.
Update 3: It has occurred to me, that a data-structure with O(n^{1-epsilon}) lookups, would allow you to solve OVP / disprove the SETH conjecture. That is probably the reason for our suffering. The only options are then to disprove the conjecture, use approximation, or take advantage of the dataset. I imagine quadlets and tries would do the last in different settings.

It could done by building an automaton. You can start with NFA (nondeterministic finite automaton which is like an indeterministic directed graph) which allows edges labeled with an epsilon character, which means that during processing you can jump from one node to another without consuming any character. I'll try to reduce your A. Let's say you A is:
A = {'ab, 'bc'}
If you build NFA for ab string you should get something like this:
+--(1)--+
e | a| |e
(S)--+--(2)--+--(F)
| b| |
+--(3)--+
Above drawing is not the best looking automaton. But there are a few points to consider:
S state is the starting state and F is the ending state.
If you are at F state it means your string qualifies as a subsequence.
The rule of propagating within an autmaton is that you can consume e (epsilon) to jump forward, therefore you can be at more then one state at each point in time. This is called e closure.
Now if given b, starting at state S I can jump one epsilon, reach 2, and consume b and reach 3. Now given end string I consume epsilon and reach F, thus b qualifies as a sub-sequence of ab. So does a or ab you can try yourself using above automata.
The good thing about NFA is that they have one start state and one final state. Two NFA could be easily connected using epsilons. There are various algorithms that could help you to convert NFA to DFA. DFA is a directed graph which can follow precise path given a character -- in particular, it is always in exactly one state at any point in time. (For any NFA, there is a corresponding DFA whose states correspond to sets of states in the NFA.)
So, for A = {'ab, 'bc'}, we would need to build NFA for ab then NFA for bc then join the two NFAs and build the DFA of the entire big NFA.
EDIT
NFA of subsequence of abc would be a?b?c?, so you can build your NFA as:
Now, consider the input acd. To query if ab is subsequence of {'abc', 'acd'}, you can use this NFA: (a?b?c?)|(a?c?d). Once you have NFA you can convert it to DFA where each state will contain whether it is a subsequence of abc or acd or maybe both.
I used link below to make NFA graphic from regular expression:
http://hackingoff.com/images/re2nfa/2013-08-04_21-56-03_-0700-nfa.svg
EDIT 2
You're right! In case if you've 10,000 unique characters in the A. By unique I mean A is something like this: {'abc', 'def'} i.e. intersection of each element of A is empty set. Then your DFA would be worst case in terms of states i.e. 2^10000. But I'm not sure when would that be possible given that there can never be 10,000 unique characters. Even if you have 10,000 characters in A still there will be repetitions and that might reduce states alot since e-closure might eventually merge. I cannot really estimate how much it might reduce. But even having 10 million states, you will only consume less then 10 mb worth of space to construct a DFA. You can even use NFA and find e-closures at run-time but that would add to run-time complexity. You can search different papers on how large regex are converted to DFAs.
EDIT 3
For regex (a?b?c?)|(e?d?a?)|(a?b?m?)
If you convert above NFA to DFA you get:
It actually lot less states then NFA.
Reference:
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
EDIT 4
After fiddling with that website more. I found that worst case would be something like this A = {'aaaa', 'bbbbb', 'cccc' ....}. But even in this case states are lesser than NFA states.

Tests
There have been four main proposals in this thread:
Shivam Kalra suggested creating an automaton based on all the strings in A. This approach has been tried slightly in the literature, normally under the name "Directed Acyclic Subsequence Graph" (DASG).
J Random Hacker suggested extending my 'prefix list' idea to all 'n choose 3' triplets in the query string, and merging them all using a heap.
In the note "Efficient Subsequence Search in Databases" Rohit Jain, Mukesh K. Mohania and Sunil Prabhakar suggest using a Trie structure with some optimizations and recursively search the tree for the query. They also have a suggestion similar to the triplet idea.
Finally there is the 'naive' approach, which wanghq suggested optimizing by storing an index for each element of A.
To get a better idea of what's worth putting continued effort into, I have implemented the above four approaches in Python and benchmarked them on two sets of data. The implementations could all be made a couple of magnitudes faster with a well done implementation in C or Java; and I haven't included the optimizations suggested for the 'trie' and 'naive' versions.
Test 1
A consists of random paths from my filesystem. q are 100 random [a-z] strings of average length 7. As the alphabet is large (and Python is slow) I was only able to use duplets for method 3.
Construction times in seconds as a function of A size:
Query times in seconds as a function of A size:
Test 2
A consists of randomly sampled [a-b] strings of length 20. q are 100 random [a-b] strings of average length 7. As the alphabet is small we can use quadlets for method 3.
Construction times in seconds as a function of A size:
Query times in seconds as a function of A size:
Conclusions
The double logarithmic plot is a bit hard to read, but from the data we can draw the following conclusions:
Automatons are very fast at querying (constant time), however they are impossible to create and store for |A| >= 256. It might be possible that a closer analysis could yield a better time/memory balance, or some tricks applicable for the remaining methods.
The dup-/trip-/quadlet method is about twice as fast as my trie implementation and four times as fast as the 'naive' implementation. I used only a linear amount of lists for the merge, instead of n^3 as suggested by j_random_hacker. It might be possible to tune the method better, but in general it was disappointing.
My trie implementation consistently does better than the naive approach by around a factor of two. By incorporating more preprocessing (like "where are the next 'c's in this subtree") or perhaps merging it with the triplet method, this seems like todays winner.
If you can do with a magnitude less performance, the naive method does comparatively just fine for very little cost.

As you point out, it might be that all strings in A contain q as a subsequence, in which case you can't hope to do better than O(|A|). (That said, you might still be able to do better than the time taken to run LCS on (q, A[i]) for each string i in A, but I won't focus on that here.)
TTBOMK there are no magic, fast ways to answer this question (in the way that suffix trees are the magic, fast way to answer the corresponding question involving substrings instead of subsequences). Nevertheless if you expect the set of answers for most queries to be small on average then it's worth looking at ways to speed up these queries (the ones yielding small-size answers).
I suggest filtering based on a generalisation of your heuristic (2): if some database sequence A[i] contains q as a subsequence, then it must also contain every subsequence of q. (The reverse direction is not true unfortunately!) So for some small k, e.g. 3 as you suggest, you can preprocess by building an array of lists telling you, for every length-k string s, the list of database sequences containing s as a subsequence. I.e. c[s] will contain a list of the ID numbers of database sequences containing s as a subsequence. Keep each list in numeric order to enable fast intersections later.
Now the basic idea (which we'll improve in a moment) for each query q is: Find all k-sized subsequences of q, look up each in the array of lists c[], and intersect these lists to find the set of sequences in A that might possibly contain q as a subsequence. Then for each possible sequence A[i] in this (hopefully small) intersection, perform an O(n^2) LCS calculation with q to see whether it really does contain q.
A few observations:
The intersection of 2 sorted lists of size m and n can be found in O(m+n) time. To find the intersection of r lists, perform r-1 pairwise intersections in any order. Since taking intersections can only produce sets that are smaller or of the same size, time can be saved by intersecting the smallest pair of lists first, then the next smallest pair (this will necessarily include the result of the first operation), and so on. In particular: sort lists in increasing size order, then always intersect the next list with the "current" intersection.
It is actually faster to find the intersection a different way, by adding the first element (sequence number) of each of the r lists into a heap data structure, then repeatedly pulling out the minimum value and replenishing the heap with the next value from the list that the most recent minimum came from. This will produce a list of sequence numbers in nondecreasing order; any value that appears fewer than r times in a row can be discarded, since it cannot be a member of all r sets.
If a k-string s has only a few sequences in c[s], then it is in some sense discriminating. For most datasets, not all k-strings will be equally discriminating, and this can be used to our advantage. After preprocessing, consider throwing away all lists having more than some fixed number (or some fixed fraction of the total) of sequences, for 3 reasons:
They take a lot of space to store
They take a lot of time to intersect during query processing
Intersecting them will usually not shrink the overall intersection much
It is not necessary to consider every k-subsequence of q. Although this will produce the smallest intersection, it involves merging (|q| choose k) lists, and it might well be possible to produce an intersection that is nearly as small using just a fraction of these k-subsequences. E.g. you could limit yourself to trying all (or a few) k-substrings of q. As a further filter, consider just those k-subsequences whose sequence lists in c[s] are below some value. (Note: if your threshold is the same for every query, you might as well delete all such lists from the database instead, since this will have the same effect, and saves space.)

One thought;
if q tends to be short, maybe reducing A and q to a set will help?
So for the example, derive to { (a,b,c,d,e,f), (a), (a,c,d) }. Looking up possible candidates for any q should be faster than the original problem (that's a guess actually, not sure how exactly. maybe sort them and "group" similar ones in bloom filters?), then use bruteforce to weed out false positives.
If A strings are lengthy, you could make the characters unique based on their occurence, so that would be {(a1,b1,c1,d1,e1,f1),(a1,a2,a3,a4,a5,a6),(a1,c1,d1,d2)}. This is fine, because if you search for "ddca" you only want to match the second d to a second d. The size of your alphabet would go up (bad for bloom or bitmap style operations) and would be different ever time you get new A's, but the amount of false positives would go down.

First let me make sure my understanding/abstraction is correct. The following two requirements should be met:
if A is a subsequence of B, then all characters in A should appear in B.
for those characters in B, their positions should be in an ascending order.
Note that, a char in A might appear more than once in B.
To solve 1), a map/set can be used. The key is the character in string B, and the value doesn't matter.
To solve 2), we need to maintain the position of each characters. Since a character might appear more than once, the position should be a collection.
So the structure is like:
Map<Character, List<Integer>)
e.g.
abcdefab
a: [0, 6]
b: [1, 7]
c: [2]
d: [3]
e: [4]
f: [5]
Once we have the structure, how to know if the characters are in the right order as they are in string A? If B is acd, we should check the a at position 0 (but not 6), c at position 2 and d at position 3.
The strategy here is to choose the position that's after and close to the previous chosen position. TreeSet is a good candidate for this operation.
public E higher(E e)
Returns the least element in this set strictly greater than the given element, or null if there is no such element.
The runtime complexity is O(s * (n1 + n2)*log(m))).
s: number of strings in the set
n1: number of chars in string (B)
n2: number of chars in query string (A)
m: number of duplicates in string (B), e.g. there are 5 a.
Below is the implementation with some test data.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
public class SubsequenceStr {
public static void main(String[] args) {
String[] testSet = new String[] {
"abcdefgh", //right one
"adcefgh", //has all chars, but not the right order
"bcdefh", //missing one char
"", //empty
"acdh",//exact match
"acd",
"acdehacdeh"
};
List<String> subseqenceStrs = subsequenceStrs(testSet, "acdh");
for (String str : subseqenceStrs) {
System.out.println(str);
}
//duplicates in query
subseqenceStrs = subsequenceStrs(testSet, "aa");
for (String str : subseqenceStrs) {
System.out.println(str);
}
subseqenceStrs = subsequenceStrs(testSet, "aaa");
for (String str : subseqenceStrs) {
System.out.println(str);
}
}
public static List<String> subsequenceStrs(String[] strSet, String q) {
System.out.println("find strings whose subsequence string is " + q);
List<String> results = new ArrayList<String>();
for (String str : strSet) {
char[] chars = str.toCharArray();
Map<Character, TreeSet<Integer>> charPositions = new HashMap<Character, TreeSet<Integer>>();
for (int i = 0; i < chars.length; i++) {
TreeSet<Integer> positions = charPositions.get(chars[i]);
if (positions == null) {
positions = new TreeSet<Integer>();
charPositions.put(chars[i], positions);
}
positions.add(i);
}
char[] qChars = q.toCharArray();
int lowestPosition = -1;
boolean isSubsequence = false;
for (int i = 0; i < qChars.length; i++) {
TreeSet<Integer> positions = charPositions.get(qChars[i]);
if (positions == null || positions.size() == 0) {
break;
} else {
Integer position = positions.higher(lowestPosition);
if (position == null) {
break;
} else {
lowestPosition = position;
if (i == qChars.length - 1) {
isSubsequence = true;
}
}
}
}
if (isSubsequence) {
results.add(str);
}
}
return results;
}
}
Output:
find strings whose subsequence string is acdh
abcdefgh
acdh
acdehacdeh
find strings whose subsequence string is aa
acdehacdeh
find strings whose subsequence string is aaa
As always, I might be totally wrong :)

You might want to have a look into the Book Algorithms on Strings and Sequences by Dan Gusfield. As it turns out part of it is available on the internet. You might also want to read Gusfield's Introduction to Suffix Trees. As it turns out this book covers many approaches for you kind of question. It is considered one of the standard publications in this field.
Get a fast longest common subsequence algorithm implementation. Actually it suffices to determine the length of the LCS. Notice that Gusman's book has very good algorithms and also points to more sources for such algorithms.
Return all s ∈ A with length(LCS(s,q)) == length(q)

Efficient mass string search problem

The Problem: A large static list of strings is provided. A pattern string comprised of data and wildcard elements (* and ?). The idea is to return all the strings that match the pattern - simple enough.
Current Solution: I'm currently using a linear approach of scanning the large list and globbing each entry against the pattern.
My Question: Are there any suitable data structures that I can store the large list into such that the search's complexity is less than O(n)?
Perhaps something akin to a suffix-trie? I've also considered using bi- and tri-grams in a hashtable, but the logic required in evaluating a match based on a merge of the list of words returned and the pattern is a nightmare, furthermore I'm not convinced its the correct approach.

I agree that a suffix trie is a good idea to try, except that the sheer size of your dataset might make it's construction use up just as much time as its usage would save. Theyre best if youve got to query them multiple times to amortize the construction cost. Perhaps a few hundred queries.
Also note that this is a good excuse for parallelism. Cut the list in two and give it to two different processors and have your job done twice as fast.

you could build a regular trie and add wildcard edges. then your complexity would be O(n) where n is the length of the pattern. You would have to replace runs of ** with * in the pattern first (also an O(n) operation).
If the list of words were I am an ox then the trie would look a bit like this:
(I ($ [I])
a (m ($ [am])
n ($ [an])
? ($ [am an])
* ($ [am an]))
o (x ($ [ox])
? ($ [ox])
* ($ [ox]))
? ($ [I]
m ($ [am])
n ($ [an])
x ($ [ox])
? ($ [am an ox])
* ($ [I am an ox]
m ($ [am]) ...)
* ($ [I am an ox]
I ...
...
And here is a sample python program:
import sys
def addWord(root, word):
add(root, word, word, '')
def add(root, word, tail, prev):
if tail == '':
addLeaf(root, word)
else:
head = tail[0]
tail2 = tail[1:]
add(addEdge(root, head), word, tail2, head)
add(addEdge(root, '?'), word, tail2, head)
if prev != '*':
for l in range(len(tail)+1):
add(addEdge(root, '*'), word, tail[l:], '*')
def addEdge(root, char):
if not root.has_key(char):
root[char] = {}
return root[char]
def addLeaf(root, word):
if not root.has_key('$'):
root['$'] = []
leaf = root['$']
if word not in leaf:
leaf.append(word)
def findWord(root, pattern):
prev = ''
for p in pattern:
if p == '*' and prev == '*':
continue
prev = p
if not root.has_key(p):
return []
root = root[p]
if not root.has_key('$'):
return []
return root['$']
def run():
print("Enter words, one per line terminate with a . on a line")
root = {}
while 1:
line = sys.stdin.readline()[:-1]
if line == '.': break
addWord(root, line)
print(repr(root))
print("Now enter search patterns. Do not use multiple sequential '*'s")
while 1:
line = sys.stdin.readline()[:-1]
if line == '.': break
print(findWord(root, line))
run()

If you don't care about memory and you can afford to pre-process the list, create a sorted array of every suffix, pointing to the original word, e.g., for ['hello', 'world'], store this:
[('d' , 'world'),
('ello' , 'hello'),
('hello', 'hello'),
('ld' , 'world'),
('llo' , 'hello'),
('lo' , 'hello'),
('o' , 'hello'),
('orld' , 'world'),
('rld' , 'world'),
('world', 'world')]
Use this array to build sets of candidate matches using pieces of the pattern.
For instance, if the pattern is *or*, find the candidate match ('orld' , 'world') using a binary chop on the substring or, then confirm the match using a normal globbing approach.
If the wildcard is more complex, e.g., h*o, built sets of candidates for h and o and find their intersection before the final linear glob.

You say you're currently doing linear search. Does this give you any data on the most frequently performed query patterns? e.g. is blah* much more common than bl?h (which i'd assume it was) among your current users?
With that kind of prior knowledge you can focus your indexing efforts on the commonly used cases and get them down to O(1), rather than trying to solve the much more difficult, and yet much less worthwhile, problem of making every possible query equally fast.

You can achieve a simple speedup by keeping counts of the characters in your strings. A string with no bs or a single b can never match the query abba*, so there is no point in testing it. This works much better on whole words, if your strings are made of those, since there are many more words than characters; plus, there are plenty of libraries that can build the indexes for you. On the other hand, it is very similar to the n-gram approach you mentioned.
If you do not use a library that does it for you, you can optimize queries by looking up the most globally infrequent characters (or words, or n-grams) first in your indexes. This allows you to discard more non-matching strings up front.
In general, all speedups will be based on the idea of discarding things that cannot possibly match. What and how much to index depends on your data. For example, if the typical pattern length is near to the string length, you can simply check to see if the string is long enough to hold the pattern.

There are plenty of good algorithms for multi-string search. Google "Navarro string search" and you'll see a good analysis of multi-string options. A number of algorithsm are extremely good for "normal" cases (search strings that are fairly long: Wu-Manber; search strings with characters that are modestly rare in the text to be searched: parallel Horspool). Aho-Corasick is an algorithm that guarantees a (tiny) bounded amount of work per input character, no matter how the input text is tuned to create worst behaviour in the search. For programs like Snort, that's really important, in the face of denial-of-service attacks. If you are interested in how a really efficient Aho-Corasick search can be implemented, take a look at ACISM - an Aho-Corasick Interleaved State Matrix.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string