Fastest way to search a SQL Server table (or indexed view) column with "like '%search%'"?

Suppose there's a table with columns (UserID, FieldID, Value), with half a million records. I want to see if some search term T(N) occurs anywhere in each Value (i.e. Value.Contains( T(N) ) ).
I think I'm just hitting a wall volume-wise: there are simply too many values to sift through. I don't think a full-text index will help, because it's only useful for StartsWith queries that look at individual words, not for occurrences anywhere within the string.
Is there a good approach to indexing this kind of data for such a search in SQL Server?

Half a million records is not terribly large, although I don't know the size of the field contents. Here are a couple of ideas; this was too long for a comment, or else I would have posted it as one.
You could run a full-text search engine such as Elasticsearch or Solr as a sidecar. If your text searches don't otherwise make much use of the other data, this might be easy enough. Note that you could put other data for searching into Elasticsearch or Solr, but I'm not sure you'd want to duplicate all your data, and those tools aren't really great as a transactional data store.
Another option for volumes this small, assuming you only need basic "contains" searching: create two more tables, keywords and keyword_index (or whatever). When saving, tokenize your text content, write any new keywords to the keywords table, and then add a row per keyword to the join table. Index everything, and then run your search against the keywords table, joining back to the master table via the intermediate keyword_index table.
This is fairly hackish, and getting your keyword handling really dialed in (for stemming, etc) may be a pain. It is a reasonable quick & dirty solution for smaller-scale needs though.
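To make the two-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The table name field_values, the column names, and the regex tokenizer are all my own assumptions for illustration; there is no stemming, and the search term is not escaped for LIKE wildcards.

```python
import re
import sqlite3


def build_db(rows):
    """Create a master table plus keywords/keyword_index and populate them."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE field_values (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE keyword_index (
            keyword_id INTEGER REFERENCES keywords(id),
            value_id INTEGER REFERENCES field_values(id),
            PRIMARY KEY (keyword_id, value_id)
        );
    """)
    for value in rows:
        cur = db.execute("INSERT INTO field_values (value) VALUES (?)", (value,))
        value_id = cur.lastrowid
        # Tokenize on save: lowercase alphanumeric runs, deduplicated per value.
        for word in set(re.findall(r"[a-z0-9]+", value.lower())):
            db.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
            (keyword_id,) = db.execute(
                "SELECT id FROM keywords WHERE word = ?", (word,)
            ).fetchone()
            db.execute(
                "INSERT INTO keyword_index VALUES (?, ?)", (keyword_id, value_id)
            )
    return db


def search(db, term):
    """Return values that have a keyword containing `term`."""
    sql = """
        SELECT DISTINCT v.id, v.value
        FROM keywords k
        JOIN keyword_index ki ON ki.keyword_id = k.id
        JOIN field_values v ON v.id = ki.value_id
        WHERE k.word LIKE '%' || ? || '%'
        ORDER BY v.id
    """
    return [row[1] for row in db.execute(sql, (term.lower(),))]
```

The LIKE still scans, but only over the list of distinct keywords, which is usually far smaller than the original column. Note that this finds fragments inside a single word only; a term spanning two words would not match.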

Related

ArangoDB: Querying multiple fields at the same time for partial match

I have a database containing product information (SKU, model number, descriptions, etc.) and I'd like to have a relatively quick search function where a user can just type in a few letters or a word from any of the text fields and then get a list of products that contain that phrase in any of those fields.
The number of items in the database will probably not be more than 100,000.
What would be the easiest way to accomplish this, without creating complex queries?
It sounds like you're looking for an autocomplete. There are numerous ways to do this.
Indexing
No matter the solution you choose, you'll want to put some indices on your data. I recommend adding a skiplist to everything you're going to be searching, and an additional fulltext index on any long-form text (such as product description). String comparison uses skiplists, while only a FULLTEXT search will leverage a fulltext index.
Querying
You have some choices here.
LIKE
https://docs.arangodb.com/3.1/AQL/Functions/String.html#like
You could run your search something like:
// @searchTerm is a bind parameter, e.g. "%abc%"; true makes the match case-insensitive
for product in warehouse
    filter like(product.model, @searchTerm, true) or
           like(product.sku, @searchTerm, true)
    return product
Advantage: simple query syntax, multiple attributes in one search, supports substrings, can search the middle of a body of text.
Disadvantage: relatively slow.
Fulltext
This is a lot more complex for querying, but is very responsive, and is the approach my application uses for its autocomplete.
// @searchTerm is a bind parameter; concat() builds the "prefix:..." fulltext query
let sku = (for result in fulltext("warehouse", "sku", concat("prefix:", @searchTerm))
    return {sku: result.sku, model: result.model, description: result.description})
let model = (for result in fulltext("warehouse", "model", concat("prefix:", @searchTerm))
    return {sku: result.sku, model: result.model, description: result.description})
let description = (for result in fulltext("warehouse", "description", concat("prefix:", @searchTerm))
    return {sku: result.sku, model: result.model, description: result.description})
let resultsMatch = union(sku, model, description)
return resultsMatch
Advantage: Very fast, extremely responsive, can handle very long bodies of text with ease, matches words anywhere in a text body.
Disadvantage: Complex query structure, as you need one variable for every attribute you're searching, a fulltext index on each of those attributes, and a union at the end. You may need to do a union of the unioned results depending on how advanced your search needs to be. Doesn't support substring searching: it matches whole words or word prefixes, not arbitrary fragments inside a word.
Raw string comparison
Simply create a query that filters for results to be greater than or equal to your search term, but less than your search term with the last letter incremented by 1. Example is in the link under the Foxx portion of my answer. This leverages skiplists.
Advantage: Very fast as long as the field is not tremendously long. Extremely easy to implement.
Disadvantage: Doesn't support substring searches. Only searches the first part of a string. I.e. you must know the beginning of the field you're searching.
This will work very well for quickly searching something like a model number where your users will probably know the beginning of it, but poorly for something like a description in which your users are probably searching for words somewhere in the middle of a body of text.
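The range trick described under "Raw string comparison" can be sketched outside the database too. This is a Python illustration over a sorted list, standing in for what a sorted index does; the function names are made up for the example, and it ignores the edge case where the last character is already the highest possible code point.

```python
import bisect


def prefix_range(term):
    """Return (low, high) such that low <= s < high exactly when s starts with term."""
    if not term:
        raise ValueError("search term must not be empty")
    # Increment the last character by one code point: "AX" -> "AY".
    high = term[:-1] + chr(ord(term[-1]) + 1)
    return term, high


def prefix_search(sorted_values, term):
    """Binary-search a sorted list for every value that starts with term."""
    low, high = prefix_range(term)
    start = bisect.bisect_left(sorted_values, low)
    end = bisect.bisect_left(sorted_values, high)
    return sorted_values[start:end]
```

For example, with the sorted model numbers AX-100, AX-200, AY-1 and BX-9, searching for "AX" uses the bounds "AX" and "AY" and returns only the two AX entries. Two binary searches replace a scan, which is why this stays fast as long as the data is kept sorted.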
Foxx
Jan's little Cookbook example is a good place to start:
https://docs.arangodb.com/cookbook/UseCases/PopulatingAnAutocompleteTextbox.html
I would recommend abstracting whatever you do into a Foxx service. It is especially liberating if you need to dynamically build up AQL queries in database, in case you have a huge number of fields and collections to search and you need to generate a Fulltext search dynamically.
Bottom line
Experiment and see which of these works best for you. My best guess is that you will find the Fulltext solution the best if you need to search on product descriptions. If you expect your users to always search the first few letters of a field, just use the comparison with a skiplist as it is very very fast.

full text search in databases

I have two fairly general questions about full text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into whoosh search, which does index table columns and the result of whoosh are actual table rows.
When using solr or elastic search, should I put the row id into the document which gets searched and after I have my result use that id to retrieve the relevant rows from the table? Or is there a better solution?
My other question: if I have an ID like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with FTS? It seems to me there is not much to be gained by indexing it. Or am I wrong?
thanks
Elasticsearch can store the indexed document, and you can retrieve it as part of the query result. Usually people still store the original data in a regular DB, which gives you more reliability and flexibility when reindexing. Mind that ES indexes non-relational data. You can keep your data stored in a relational manner and compose denormalized documents for indexing.
As for "abc/123.64664", you can index it as a tokenized string, or you can tune the index for prefix search, etc. It's up to you.
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching for.
Content storage for good full text search is quite different from relational database standard storage. So, your data going into Search Engine can end up looking quite differently from the way you stored it.
This is all driven by your expected search results. You may increase granularity of the data or - opposite - denormalize it so the parent/related record content shows up in the records you actually want returned as part of search. Text processing (copyField, tokenization, pre-processing, etc) is also where a lot of content modifications happen to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to only use search engine to get the right - relevant - IDs out and then merge them in the client code with the details from the original database records.
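As a rough illustration of that last point, here is a Python sketch of the "IDs from the search engine, details from the database" merge, using sqlite3 as a stand-in database. The products table, its columns, and the function name are assumptions made up for the example; ranked_ids stands for whatever ID list the search engine returned, in relevance order.

```python
import sqlite3


def fetch_in_rank_order(db, ranked_ids):
    """Load full rows for the given ids, keeping the search engine's ranking."""
    if not ranked_ids:
        return []
    placeholders = ", ".join("?" for _ in ranked_ids)
    rows = db.execute(
        f"SELECT id, name, price FROM products WHERE id IN ({placeholders})",
        ranked_ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) gives no ordering guarantee, so restore relevance order here.
    # Ids missing from the database (a stale search index) are skipped.
    return [by_id[i] for i in ranked_ids if i in by_id]
```

Skipping missing IDs is a design choice: it lets the search index lag behind the database a little without the result page breaking.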

What indexer do I use to find the list in the collection that is most similar to my list?

Let's say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'}
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like mongo, map reduce, hadoop... All I know are the names of other technologies and I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will, by default, return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list) you can use the mm EDisMax parameter, cf. Solr Wiki
Having said that, I must add that I've never searched for 200 query terms (do I understand correctly that the list whose contents should be searched will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
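To show what "closest match" with a minimum-match threshold amounts to, here is a small Python sketch built on a plain inverted index. It ranks by raw overlap count only, which is a simplification and not Solr's actual relevance scoring; the function names and the min_match parameter are my own stand-ins for the idea behind mm.

```python
from collections import Counter, defaultdict


def build_index(lists):
    """Map each item to the set of list ids that contain it."""
    index = defaultdict(set)
    for list_id, items in enumerate(lists):
        for item in items:
            index[item].add(list_id)
    return index


def most_similar(index, query, min_match=0.0):
    """Rank list ids by shared items; min_match is the required fraction of query terms."""
    terms = set(query)
    hits = Counter()
    for item in terms:
        for list_id in index.get(item, ()):
            hits[list_id] += 1
    needed = min_match * len(terms)
    # Highest overlap first; ties broken by list id for a stable order.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(list_id, count) for list_id, count in ranked if count >= needed]
```

With the three ingredient lists from the question, the carrot/rice/corn/apple list ranks first with three shared items, and a min_match of 0.4 drops the list that only shares 'potato'. Only lists sharing at least one item are ever touched, which is what makes an index practical against a million lists.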

data modelling in cassandra to optimize search results

I was just wondering if I could get some clue/pointers to our kind of simple data modelling problem.
It would be great if somebody can help me in the right direction.
So we have kind of a flat table ex. document
which has all kinds of meta data attached to a document like
UUID documentId,
String organizationId,
Integer totalPageCount,
String docType,
String acountNumber,
String branchNumber,
Double amount,
etc etc...
which we are storing in Cassandra.
The UUID is the row key and we have certain secondary indexes, such as one on organizationId.
This table is actually supposed to hold millions of records.
Placing proper indices helps with a lot of queries but with the generic queries I am stuck.
The problem is even with something like 100k records if I throw in a query like
select * from document where orgId='something' and amount > 5 and amount < 50 ... I start seeing read timeout problems.
The query still works (although quite slowly) if I limit the number of records to, let's say, 2000.
The above can probably be solved by placing certain params properly, but there are dozens of such columns that we need to search on.
I am still trying to scale it horizontally so as to place multiple records in a single row.
Hoping for a sense of direction.
This is a broad problem, and general solutions are hard to give. However, here's my 2 pennies:
You WANT queries to hit single partitions for quick querying. If you don't hit a rowkey in your query, it's a cluster wide operation. So select * from docs where orgId='something' and amount > 5 and amount < 50 means you will have issues. Hitting a partition key AND an index is way way better than hitting the index without the partition key.
Again, you don't want all docs in a single partition... that's an obvious hotspot, not to mention it can cause size issues; keeping a row around the 100 MB mark is a good idea. Several thousand or even several hundred thousand metadata entries per row should be fine, though much of this depends on your specific data.
So we want to hit partition keys, but also want to take advantage of distribution, while preserving efficiency. Hmmm.....
You can create artificial buckets. Decide how many buckets you want, based on expected data volumes. Assuming a few hundred thousand per partition, n buckets gives you n * hundreds of thousands. Make the bucket id the row key. When querying, use something like:
select * from documents where bucketid in (...) and orgId='something' and amount > 5;
[Note: for this, you may want to make the docid the last clustering key, so you don't have to specify it when doing the range query.]
That will result in n fast queries hitting n partitions, where n is the number of buckets.
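Here is one way the bucketing could look on the client side, as a hedged Python sketch. The bucket count of 16, the choice of SHA-256 for spreading IDs, and both function names are assumptions for illustration; the query text mirrors the one above, with %s as placeholders to be bound by whatever driver you use.

```python
import hashlib
import uuid

NUM_BUCKETS = 16  # assumption: size this from your expected data volume


def bucket_for(document_id: uuid.UUID, num_buckets: int = NUM_BUCKETS) -> int:
    """Derive a stable bucket id from the document id at write time."""
    # Hash first so the spread does not depend on how the UUID was generated.
    digest = hashlib.sha256(document_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def bucket_query(num_buckets: int = NUM_BUCKETS) -> str:
    """Build the fan-out query that names every bucket explicitly."""
    buckets = ", ".join(str(b) for b in range(num_buckets))
    return (
        "SELECT * FROM documents "
        f"WHERE bucketid IN ({buckets}) AND orgid = %s AND amount > %s"
    )
```

The bucket id is computed once when a document is written and stored as the row key; reads then enumerate all buckets. Changing the bucket count later means rewriting the data, so it is worth choosing with some headroom.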
Also, consider limiting your results. Do you really need 2000 records at a time?
For some information, it may make sense to have separate tables (i.e. some information with one particular clustering scheme in one table, and another in another). Duplication of some information is often ok - but again, this depends on particular scenarios.
Again, it's hard to give a general answer. But does that help?
The problem is not in Cassandra, but in your data model. You need to shift from relational thinking to NoSQL/Cassandra thinking. In Cassandra, you write your queries first and design your tables around them if you want decent O(1) speed. Using secondary indexes in Cassandra is frankly a poor choice, because the indexes are distributed across the cluster.
If you don't know your queries upfront, use some other technology, not Cassandra. Relational servers are really good if you can fit all the data on one server; otherwise have a look at Elasticsearch.
Another option is the DataStax edition, which includes Solr for full-text search.
Lastly, you can have several tables that duplicate information. This will allow you to query for a specific property. This process is called denormalisation, and the idea is that you take a property of your object, make it a primary key, and insert it into its own table. The outcome is that you can query that particular table for that particular property value in O(1) time. The downside is that you now have to duplicate data.

Why should (or shouldn't) a Search Query return back only document IDs?

So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and instead of inserting them directly into our catalog, we would store all the information in a staging area. Each supplier has their own stage (i.e. table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table but later on perhaps into Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description) and be shown a list of products that match and then choose to have those products pushed into the live catalog. The search will query on the single table (the flattened staging areas).
My design calls to only store searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id etc. And the search queries will return only the ID's of the items matching and a class (supplier_id) that would be used to identify which staging area the product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on), but could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs which could be used to fetch all the other necessary metadata about the product (simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer why it is important to keep only searchable fields and return only ID's? Or more specifically, why should a search application return only IDs of objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerful index; since an index gives IDs back, it is logical for Solr to do the same.
You can use the solr query parameter fl to ask for identifier only results, for instance fl=id.
However, there's a feature that needs solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using solr to retrieve the identifiers only is fine (I assume you need only the documents list, and no other features, like facets, related docs or spell checking).
That said, it shouldn't matter how you build your objects in your search function: from the DB, using Solr only to retrieve IDs; from the fields Solr returns (provided they're stored); or even a mix of both. Think Solr for the 'highlighted' content fields and the DB for the other ones. Again, if you don't need highlighting, this is not an issue.
I'm using Solr with thousands of documents but only return the IDs, for the following reasons:
For Solr:
- if some sync mistake happens, it's not a big deal (especially in your case, where displaying a wrong price can be a big issue; at worst the item will not be in the right place, but the data shown are right)
- you will save a lot of time because you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB:
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr every time!)
- you build your results in the same way (you don't need one method to build HTML from Solr and another for your DB)
I think there is a lot more...
