Searching for strings withing a string - string

I have a large number of strings (potentially 1,000,000+), and I want to search another string (a document) to see which of these search strings appears in the document.
Not all of the search strings are a single word, so it's not just a case of searching for each word in the document in the list of search strings.
What's the most efficient way of doing this?
I will be doing this for a large number of documents (coming from a feed), and need to do it fast enough that I can process the documents quicker than they're coming in (a second or two at most ideally).
I can potentially come up with a list of stop words that won't appear in the search strings (e.g. 'the', 'and').
Ideally the solution will be in Java, but that's not a requirement as I can always port the code into Java. If it makes any difference, the search strings are currently stored in a MongoDB.

Take a look at Radix trees and Suffix trees.
There is an example on the concurrent-trees project, of how to scan unseen documents efficiently for large numbers of keywords stored in the inverted radix tree in that project. Example code here.

Check out High-perfor­mance pattern matching algo­rithms Java

Related

Data structure for fast full text search

A trie seems like it would work for small strings, but not for large documents, so not sure (1-100's of pages of text). Maybe it is possible to combine an inverted index with a suffix tree to get the best of both worlds. Or perhaps using a b-tree with words stored as nodes, and a trie for each node. Not sure. Wondering what a good data structure would be (b-tree, linked-list, etc.).
I'm thinking of searching documents such as regular books, web pages, and source code, so the idea of storing just words in an inverted index doesn't seem quite right. Would be helpful to know if you need alternative solutions for each or if there is a general one that works for them all, or a combination of them.
You do need an inverted index at the end of the day for interleaving matching results from each of your query terms but an inverted index can be built either from Trie or a Hash Map. A trie would allow fuzzy look-ups, while an hash map based inverted-index would only allow an exact look up of a token.
To optimize for memory usage, you can use memory optimized versions of Trie like Radix Tree or Adaptive Radix Tree (ART). I've had great success using ART for an open source fuzzy search engine project I've been working on: https://github.com/typesense/typesense
With Typesense, I was able to index about 1 million Hacker News titles in about 165 MB of RAM (uncompressed size on disk was 85 MB). You can probably squeeze it in even further if your use case is more specific and don't need some metadata fields I added to the data structure.

inexact string search - short query string to huge database (blast?)

I have an OCR that recognises a few short query strings (4-12 letters) in a given picture. And I would like to match these recognised words against a big database of known words. I've already build a confusion matrix with the used alphabet from the most common mistakes and I tried to do a whole gotoh alignment against all words in my database and found (not surprinsingly) that this is too time consuming.
So I am looking for a heuristic approach to match these words to the database (allowing mismatches). Does anyone know of an available library or algorithm that could help me out?
I've already thought about using BLAST or FASTA but the way I understood it both are limited to the standard amino acid alphabet and I would like to use all letters and numbers.
Thank you for your help!
Not an expert but I've done some reading on bioinformatics (which aren't the topic but related). You could use suffix trees or related data structures to more-quickly search through the database. I believe currently the time required for construction of the tree is linear wrt database length, and time required for querying the tree is linear wrt length of query string, so if you have a lot of query strings that are relatively short this sounds like the perfect data structure for you. More reading can be found on the wikipedia page for suffix trees.

Searching substrings from a large set of strings

Is there a space efficient data structure that can help answer the following question:
Assume I have a database of a large number of strings (in the
millions). I need to be able to answer quickly if a given string is
a substring of one these strings in the database.
Note that it's not even necessary in this case to tell which string it is a substring of, just that it's a substring of one.
As clarification, the ideal is to keep the data as small as possible, but query speed is really the most important issue. The minimum requirement is being able to hold the query data structure in RAM.
The right way to go about this is to avoid using your Java application to answer the question. If you solve the problem in Java, your app is guaranteed to read the entire table, and this is in addition to logic you will have to run on each record.
A better strategy would be to use your database to answer the question. Consider the following SQL query (assuming your database is some SQL flavor):
SELECT COUNT(*) FROM your_table WHERE column LIKE "%substring%"
This query will return the number of rows where 'column' contains some 'substring'. You can issue a JDBC call from your Java application. As a general rule, you should leave the heavy database lifting to your RDBMS; it was created for that.
I am giving a hat tip to this SO post which was the basis for my response: http://www.stackoverflow.com/questions/4122193/how-to-search-for-rows-containing-a-substring
Strings are highly compact structures, so for regular English text it is unlikely that you will find any other kind of structure that would be more space efficient than strings. You can perform various tricks with bits so as to make each character occupy less space in memory, (at the expense of supporting other languages,) but the savings there will be linear.
However, if your strings have a very low degree of variation, (very high level of repetition,) then you might be able to save space by constructing a tree in which each node corresponds to a letter. Each path of nodes in the tree then forms a possible word, as follows:
[c]-+-[a]-+-[t]
+
+-[r]
So, the above tree encodes the following words: cat, car. Of course this will only result in savings if you have a huge number of mostly similar strings.

Search strategy

I'm writing a java program that needs to find possible matches for specified strings. Strings will generally be in the form of
onetwothree one.two.three
onesomethingtwoblah onesomething
where one two and three are parts of an actual title. Candidate matches from the database are in the form one+two+three. The method i have come up with is to compare each token from database candidates with the entire specified string using regex. A counter for the number of database token matches will be used to determine the rank of possible matches.
My concern is the accuracy of matches presented and the method's ability to successfully find matches if they do exist. Is this method efficient?
Depends, if you have a lot of database records and large strings to compare against the search may end up being quite expensive. It would need to pass the entire input string for each record.
You could consider doing a single pass over the input string and search tokens against the database. Some smart search indexed could help speed this up.
When pairing multiple tokens you would need to figure out a way knowing when to stop scanning and advance to a next token. Partial matches could help here; store one+two+three also as seperate one, two and three. Or if the order matters store it also as one, one+two and one+two+three.
Basically as you scan you have a list of candidate DB entries that gets smaller and smaller, comparable to a facet search.

Best way to support wildcard search on a large dictionary?

I am working on a project to search in a large dictionary (100k~1m words). The dictionary items look like {key,value,freq}. Myy task is the development of an incremental search algoritm to support exact match, prefix match and wildcard match. The results should be ordered by freq.
For example:
the dictionary looks like
key1=a,value1=v1,freq1=4
key2=ab,value2=v2,freq2=2
key3=abc,value3=v3 freq3=1
key4=abcd,value4=v4,freq4=3
when a user types 'a', return v1,v4,v2,v3
when a user types 'a?c', return v4,v3
Now my best choice is a suffix tree represented by DAWG data struct, but this method does not support wildcard matches effectively.
Any suggestion?
You need to look at n-grams for indexing your content. If you want to something Out-of-the box, you might want to look at Apache Solr which does a lot of the hard work for you. It also supports prefix, wildcard queries etc.

Resources