How to quickly filter a list using regex? - python-3.x

Well... I have a trivial request of building an Entry that filter on-the-fly a list of entries. (think of an editor auto-complete feature)
The request is to support a regex filter over the whole list and display only matching entries.
e.g.,
The list contains:
abc.efg.hij.entry
abc.ddd.hij.entry2
hij.some.value.entry
Typing in the Entry
Value : List
hij : abc.efg.hij.entry, abc.ddd.hij.entry2, hij.some.value.entry
ddd : abc.ddd.hij.entry2
dd*entry : abc.ddd.hij.entry2
val : hij.some.value.entry
Here is the code i'm using for filtering the list:
regex = re.compile(r"{0}".format(entry_value), re.IGNORECASE)
display_list = list(filter(regex.search, display_list))
The real life list contains ~300K entries of strings (up to 100 char each) and the performance of the above is very poor, considering a GUI response time.
I've profiled my real test case and it yields ~0.8s for each key typing in the Entry.
Is there a faster way?

If you are doing regular expression pattern matching against a normal python list that contains 300,000 items, it's just naturally going to be slow. Also, if you are going to display 300,000 items in a listbox it's going to be slow to display all of those items.
Your best bet might be to pick a better data structure. For example, on my system I can run your filter against 300,000 items in about 250ms, but a query against an in-memory sqlite database with 300,000 rows takes about half that time. In either case, it can add another second to fully update the display if the result is very large (for example, if all 300,000 match)
Of course, sqlite doesn't support regex out of the box, but you can translate some common patterns to sql patterns (eg: 'foo.*bar' could be translated to 'foo%bar'). For more information on sqlite and regex see How do I use regex in a SQLite query?
Another strategy to employ would be to not search on every character typed. Wait until the user pauses in their typing. So, for example, if they type "Lorem", you don't need to search on "L" and then "Lo", and then "Lor", etc. Instead, schedule the search to happen in 100 ms, and with each keypress you can reschedule the search. This will prevent the searching from slowing down, while still giving the user what appears to be a fairly rapid result.

Related

Mongo DB like search with count is very slow on 50 million collection data

In my application, I have a collection of 50 million data. I am using like search and then count the results on a particular field(i.e Patientfirstname). I also created an index on the Patientfirstname field it improved the performance but still it is taking a lot of time.
db.patients.find({"Patientfirstname":{"$regex":"Testuser"}}).count() without index 40 sec
db.patients.find({"Patientfirstname":{"$regex":"Testuser"}}).count() after adding index on the Patientfirstname field 31 sec
db.patients.find({"Patientfirstname":{"$regex":"Testuser"}}).count()
I tried with a different approach (aggregate) but still, response is very slow
db.patients.aggregate.([{$match:{"Patientfirstname":{"$regex":"Testuser"}}},
{$project:{"Patientfirstname":1,"_id":1}},
{$group : {_id:"$Patientfirstname", count:{$sum:1}}},
{$sort:{"count":-1}} ])
this query also takes the same to time fetch the results 31 sec
another approach was tried but the results are not correct
select only the field from the entire collection and then apply like search and count and result.
db.patients.find({},{Patientfirstname:1,_id:1}).count({"Patientfirstname":{"$regex":"Testuser"}})
applying a filter in the count is not working, entire collection count is displayed
Please help in this query to fetch results faster.Thanks in advance
So here is the deal:
As rightly pointed in the comments, $regex is an operator that would not perform well with or without indexes. Here is the reason why:
Queries without indexes are slow because they executed using COLLSCAN - which is essentially iteration of the whole 50 Million documents on the disk one-by-one, filtering data and returning only the ones that match. Disks being an inherently slow piece of hardware does not help the situation either.
Now, When indexed - MongoDB creates a B-Tree in the RAM. And $regex operator being not very selective in nature, it forces a complete Tree Scan (as compared to a reduced / partial tree scan in case of equalities or ranges) in the index b-tree - which is as bad as a Collection Scan itself. The only reason you get a benefit on 9 seconds is because this Tree Scan occurs in the RAM and not the disk.
Having said that, there are a few alternatives to it:
Optimize your $regex. From the MongoDB Documentation itself:
For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a "prefix expression", which means that all potential matches start with the same string. This allows MongoDB to construct a "range" from that prefix and only match against those values from the index that fall within that range.
A regular expression is a "prefix expression" if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. For example, the regex /^abc.*/ will be optimized by matching only against the values from the index that start with abc.
Additionally, while /^a/, /^a./, and /^a.$/ match equivalent strings, they have different performance characteristics. All of these expressions use an index if an appropriate index exists; however, /^a./, and /^a.$/ are slower. /^a/ can stop scanning after matching the prefix.
Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.
Create a Text Index - This would tokenize your text string and enable faster text based searches
If you are deployed on MongoDB Atlas - Then you can use Atlas Search which is a Lucene based Text Search Engine (Works almost like elasticsearch on steroids). This offers significantly greater performance and functionalities like fuzzy text search, text automcomplete etc.

Data structure to index entire document and algorithm for quick search of any size substring

I'm trying to find a data structure (and algorithm) that would allow me to index an entire text document and search for substring of it, no matter the size of the substring. The data structure should be stored in disk, during or at the end of the indexing procedure.
For instance, given the following sentence:
The book is on the table
The algorithm should quickly (O(log(n))) find the occurrences of any subset of the text.
For instance, if the input is book it should find all occurrences of it, but this should also be true for book is and The book is.
Unfortunately, the majority of solutions work by tokenizing the text and making searches using individual tokens. Ordinary databases also index any text without worrying about subset searching (that is why SELECT '%foo%' is done with linear search and takes a lot?).
I could try to develop something from scratch (maybe a variation of reverse index?) but I'd love to discover that somebody did that.
The most similar thing I found is SQLite3 Full-text search.
Thanks!
One approach is to index your document in a suffix tree, and then - each prefix of some suffix - is a substring in the document.
With this approach, all you have to do, is build your suffix tree, and upon querying a substring s, follow nodes in the tree, and if you can follow through the entire query string - it means there is a suffix, which its prefix is the query string - and thus it is also a substring.
If you are querying only complete words, inverted index could be just enough. Inverted index is usually mapping a term (word) to a list of documents it appears in. Instead, for you it will mapping to locations in the document.
Upon query, you need to find for each occurance of word i in the query, its positions (let it be p), and if term i+1 of your query, appears as well in position p+1.
This can be done pretty efficiently, similarly to how inverted index is traditionally doing AND queries, but instead of searching all terms in same document, search terms in increasing positions.

inverted index sets - querying key prefixes

I'm using Redis in order to build an inverted index system for words and the documents that contains those words.
the setup is really simple: Redis Sets where the key of the Set is: i:word and the values of the Set are the documents ids that have this word
let's say i have 2 sets: i:example and i:result
the query - "example result" will intersect i:example and i:result and return all the ids that have both example and result as members
but what i'm looking for is a way to perform (in efficient manner) a query like: "ex res". the result set should contain at least all the ids from the query "example result"
Solutions that i thought of:
create prefix sets of size 2: p:ex - contains {"example", "expertise", "ex"...}. the lookup running time will not be a problem - O(1) to get the set and O(n) to check all elements in the set for words that start with the prefix (where n = set.size()) but i worry about the added size price.
Using scan: but i'm not sure about the running time - query like scan 0 match ex* will take O(n) where n is the number of keys in the db? I know redis is fast but it's probably not an optimized solution for query like "ex machi cont".
The usual way to go about this is the first approach you had mentioned, but usually you'd go with segments that are 3+ chars long. Note that you'll need to have a set for each segment, i.e.g. i:exa, i:exam, i:examp, i:exampl and of course i:example.
This will naturally take up space in your database (hence the suggestion to start at 3 rather than 2 characters). A possible tweak is to keep in the i:len(3) sets only references to i:len(4+) sets instead of document ids. This will required more read operations but will have significant savings in terms of RAM.
You should explore v2.8.9's addition of lexicographical ranges for Sorted Sets. By calling ZRANGEBYLEX you can get ranges of members (i.e.g. all the words that start with ex). While this could be useful in this context by itself, consider that you can also use your Sorted Set's members creatively to encode a word and its document reference. This can help you get over the "loss" of the score (since all scores need to be the same for lexicographical ordering to work). For example, assuming the words "bed" and "beg" in docs 1 and 2:
ZADD index 0 "beg:1" 0 "bed:2"
Lastly, here's a little something to think about too - adding suffix searching (i.e.g, everything that ends with "ample"): https://redislabs.com/blog/how-to-use-redis-at-least-x1000-more-efficiently

String matching algorithm : (multi token strings)

I have a dictionary which contains a big number of strings. Each string could have a range of 1 to 4 tokens (words). Example :
Dictionary :
The Shawshank Redemption
The Godfather \
Pulp Fiction
The Dark Knight
Fight Club
Now I have a paragraph and I need to figure out how many strings in the para are part of the dictionary.
Example, when the para below :
The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been
battling The Godfather for the top spot.
is run against the dictionary, I should be getting the ones in bold as the ones that are part of the dictionary.
How can I do this with the least dictionary calls.
Thanks
You might be better off using a Trie. A Trie is better suited to finding partial matches (i.e. as you search through the text of a paragraph) that are potentially what you're looking for, as opposed to making a bunch of calls to a dictionary that will mostly fail.
The reason why I think a Trie (or some variation) is appropriate is because it's built to do exactly what you're trying to do:
If you use this (or some modification that has the tokenized words at each node instead of a letter), this would be the most efficient (at least that I know of) in terms of storage and retrieval; Storage because instead of storing the word "The" a couple thousand times in each Dict entry that has that word in the title (as is the case with movie titles), it would be stored once in one of the nodes right under the root. The next word, "Shawshank" would be in a child node, and then "redemption" would be in the next, with a total of 3 lookups; then you would move to the next phrase. If it fails, i.e. the phrase is only "The Shawshank Looper", then you fail after the same 3 lookups, and you move to the failed word, Looper (which as it happens, would also be a child node under the root, and you get a hit. This solution works assuming you're reading a paragraph without mashup movie names).
Using a hash table, you're going to have to split all the words, check the first word, and then while there's no match, keep appending words and checking if THAT phrase is in the dictionary, until you get a hit, or you reach the end of the paragraph. So if you hit a paragraph with no movie titles, you would have as many lookups as there are words in the paragraph.
This is not a complete answer, more like an extended-comment.
In literature it's called "multi-pattern matching problem". Since you mentioned that the set of patterns has millions of elements, Trie based solutions will most probably perform poorly.
As far as I know, in practice traditional string search is used with a lot of heuristics. DNA search, antivirus detection, etc. all of these fields need fast and reliable pattern matching, so there should be decent amount of research done.
I can imagine how Rabin-Karp with rolling-hash functions and some filters (Bloom filter) can be used in order to speed up the process. For example, instead of actually matching the substrings, you could first filter (e.g. with weak-hashes) and then actually verify, thus reducing number of verifications needed. Plus this should reduce the work done with the original dictionary itself, as you would store it's hashes, or other filters.
In Python:
import re
movies={1:'The Shawshank Redemption', 2:'The Godfather', 3:'Pretty Woman', 4:'Pulp Fiction'}
text = 'The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been battling The Godfather for the top spot.'
repl_str ='(?P<title>' + '|'.join(['(?:%s)' %movie for movie in movies.values()]) + ')'
result = re.sub(repl_str, '<b>\g<title></b>',text)
Basically it consists of forming up a big substitution instruction string out of your dict values.
I don't know whether regex and sub have a limitation in the size of the substitution instructions you give them though. You might want to check.
lai

Search with attribute values correspondence in Lucene

Here's a text with ambiguous words:
"A man saw an elephant."
Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.
For "saw" it is like:
{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
All this attributes come from the 3rd party tools, Lucene itself is not involved in the word disambiguation.
I want to perform a query like "pos=verb & number=singular" and NOT to get "saw" in the result.
I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
Maybe some hacks with posting list payloads can be applied?
UPD: A draft of my solution
Here are the specifics of my project: there is a fixed maximum of parses a word can have (say, 8).
So, I thought of inserting the parse number in each attribute's payload and use this payload at the posting lists intersectiion stage.
E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular': ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = singular' at all stages of posting list processing the 'x.1234' entries would be accepted until the intersection stage where they would be rejected because of non-corresponding parse numbers.
I think this is a pretty compact solution, but how hard would be incorporating it into Lucene?
So... the cheater way of doing this is (indeed) to control how you build the lucene index.
When constructing the lucene index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do a lookup in the same way.
One way:
This means for each type of query you do, you must also build an index in the same way.
Example:
saw becomes noun-saw -- index it as that.
saw also becomes noun-past-see -- index it as that.
saw also becomes noun-past-singular-see -- index it as that.
The other way:
If you want attribute based lookup in a single index, you'd probably have to do something like permutation completion on the word 'saw' so that instead of noun-saw, you'd have all possible permutations of the attributes necessary in a big logic statement.
Not sure if this is a good answer, but that's all I could think of.

Resources