ArangoDB: Querying multiple fields at the same time for partial match - arangodb

I have a database containing product information (SKU, model number, descriptions, etc.) and I'd like to have a relatively quick search function where a user can just type in a few letters or a word from any of the text fields and then get a list of products that contain that phrase in any of those fields.
The number of items in the database will probably not be more than 100,000.
What would be the easiest way to accomplish this, without creating complex queries?

It sounds like you're looking for an autocomplete. There are numerous ways to do this.
Indexing
No matter the solution you choose, you'll want to put some indices on your data. I recommend adding a skiplist index to every attribute you're going to be searching, and an additional fulltext index on any long-form text (such as the product description). String comparison uses skiplists, while only a FULLTEXT search will leverage a fulltext index.
Querying
You have some choices here.
LIKE
https://docs.arangodb.com/3.1/AQL/Functions/String.html#like
You could run your search something like:
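// @searchTerm is a bind parameter; the % wildcards allow a match anywhere in the value
FOR product IN warehouse
  FILTER LIKE(product.model, CONCAT("%", @searchTerm, "%"), true) OR
         LIKE(product.sku, CONCAT("%", @searchTerm, "%"), true)
  RETURN product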
Advantage: simple query syntax, multiple attributes in one search, supports substrings, can search the middle of a body of text.
Disadvantage: relatively slow.
Fulltext
This is a lot more complex for querying, but is very responsive, and is the approach my application uses for its autocomplete.
// @searchTerm is a bind parameter; "prefix:" matches words that start with it
LET sku = (FOR result IN FULLTEXT("warehouse", "sku", CONCAT("prefix:", @searchTerm))
  RETURN {sku: result.sku, model: result.model, description: result.description})
LET model = (FOR result IN FULLTEXT("warehouse", "model", CONCAT("prefix:", @searchTerm))
  RETURN {sku: result.sku, model: result.model, description: result.description})
LET description = (FOR result IN FULLTEXT("warehouse", "description", CONCAT("prefix:", @searchTerm))
  RETURN {sku: result.sku, model: result.model, description: result.description})
LET resultsMatch = UNION(sku, model, description)
RETURN resultsMatch
Advantage: Very fast, extremely responsive, can handle very long bodies of text with ease, matches words anywhere in a text body.
Disadvantage: Complex query structure, as you need one variable for every attribute you're searching, a fulltext index created on each of those attributes, and a union at the end. You may need to do a union of the unioned results depending on how advanced your search needs to be. It only matches whole words or word prefixes, so it doesn't support arbitrary substring searching.
Raw string comparison
Simply create a query that filters for results to be greater than or equal to your search term, but less than your search term with the last letter incremented by 1. Example is in the link under the Foxx portion of my answer. This leverages skiplists.
Advantage: Very fast as long as the field is not tremendously long. Extremely easy to implement.
Disadvantage: Doesn't support substring searches. Only searches the first part of a string. I.e. you must know the beginning of the field you're searching.
This will work very well for quickly searching something like a model number where your users will probably know the beginning of it, but poorly for something like a description in which your users are probably searching for words somewhere in the middle of a body of text.
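To make the "last letter incremented by 1" trick concrete, here is a minimal Python sketch. It is not ArangoDB code: the helper names are made up for illustration, a sorted Python list stands in for the skiplist index, and it assumes plain code-point ordering, which may differ from the collation the database uses. In AQL the two bounds would typically be passed as bind parameters to a filter along the lines of FILTER product.model >= @lower AND product.model < @upper.

```python
from bisect import bisect_left


def prefix_bounds(term):
    """Return (lower, upper) so that lower <= value < upper matches the prefix.

    The upper bound is the term with its last character incremented by one.
    Assumption: the last character is not the highest possible code point.
    """
    if not term:
        raise ValueError("search term must not be empty")
    upper = term[:-1] + chr(ord(term[-1]) + 1)
    return term, upper


def prefix_search(sorted_values, term):
    """Range scan over a sorted list, standing in for a sorted index."""
    lower, upper = prefix_bounds(term)
    start = bisect_left(sorted_values, lower)
    end = bisect_left(sorted_values, upper)
    return sorted_values[start:end]


# Hypothetical model numbers, kept sorted like an index would keep them.
models = sorted(["AB-100", "AB-200", "AC-300", "XY-1"])
matches = prefix_search(models, "AB")
```

Only values that start with the term fall inside the range, which is why this approach cannot find a term in the middle of a field.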
Foxx
Jan's little Cookbook example is a good place to start:
https://docs.arangodb.com/cookbook/UseCases/PopulatingAnAutocompleteTextbox.html
I would recommend abstracting whatever you do into a Foxx service. It is especially liberating if you need to dynamically build up AQL queries in the database, for example when you have a huge number of fields and collections to search and need to generate a Fulltext search dynamically.
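As a rough idea of what generating the Fulltext query dynamically could look like, here is a small sketch. A Foxx service would be written in JavaScript; Python is used here only to show the string-building logic, and the function name, variable names (r0, r1, ...) and the @searchTerm bind parameter are assumptions for illustration. It also assumes every listed field already has a fulltext index.

```python
import re

# Only plain identifiers are accepted so names can be embedded in the query text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_fulltext_query(collection, fields):
    """Build one FULLTEXT subquery per field and union the results."""
    names = [collection, *fields]
    if not fields or not all(_IDENTIFIER.match(name) for name in names):
        raise ValueError("need at least one field; names must be plain identifiers")
    lines = []
    variables = []
    for index, field in enumerate(fields):
        var = f"r{index}"
        variables.append(var)
        lines.append(
            f'LET {var} = (FOR doc IN FULLTEXT("{collection}", "{field}", '
            f'CONCAT("prefix:", @searchTerm)) RETURN doc)'
        )
    if len(variables) == 1:
        # UNION needs at least two arrays, so a single field is returned directly.
        lines.append(f"RETURN {variables[0]}")
    else:
        lines.append(f"RETURN UNION({', '.join(variables)})")
    return "\n".join(lines)


query = build_fulltext_query("warehouse", ["sku", "model", "description"])
```

The search term itself stays a bind parameter, so only the collection and field names are ever interpolated into the query text.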
Bottom line
Experiment and see which of these works best for you. My best guess is that you will find the Fulltext solution the best if you need to search on product descriptions. If you expect your users to always search the first few letters of a field, just use the comparison with a skiplist as it is very very fast.

Related

How to search only inside one string of a Collection in Azure Search?

I have a collection field with values like:
["city of god"]
["god of war", "city of war"]
I want to perform a search on the field with 'city' AND 'god' and I want only 'city of god' to be returned.
Yet the second field is also returned, even though the terms are in two different strings within the collection.
Is there any way to make the search strict to within individual strings and not the entire collection?
Each searchable field in the index is treated as a bag of terms, so for “city AND god” you’re matching on all terms of that field in the whole document, not only the terms within sub-documents (in this case individual strings in the collection).
One way to get around this would be to specify a reasonable estimate of distance between these terms within a single string of a collection and use that to issue a proximity search query to get the desired result. For your specific example, assuming that the terms would be within 5 words of each other, the following query should work -
&queryType=full&search=fieldName:"city god"~5
Proximity search is also useful because the words don’t have to appear in the order given in the phrase query: with a large enough proximity value, “city god”~5 would also match “god bless the city”.
Make sure to include queryType=full in your query string as proximity search is part of the full query syntax and would not work otherwise. You can check some other examples here.
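If you build the request in code, the query string above can be assembled like this. This is only a sketch of the string construction: the helper name is made up, fieldName is the same placeholder used above, and the service URL, index name and api-version are deliberately left out.

```python
from urllib.parse import parse_qs, urlencode


def proximity_query(field, terms, distance):
    """Build URL-encoded parameters for a fielded proximity search."""
    phrase = " ".join(terms)
    return urlencode({
        "queryType": "full",  # proximity search needs the full query syntax
        "search": f'{field}:"{phrase}"~{distance}',
    })


params = proximity_query("fieldName", ["city", "god"], 5)
decoded = parse_qs(params)
```

urlencode takes care of escaping the quotes, colon and tilde, so the phrase reaches the service intact.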

Fastest way to search a SQL Server table (or indexed view) column with "like '%search%'"?

Suppose there's a table with columns (UserID, FieldID, Value), with half a million records. I want to see if some search term T(N) occurs anywhere in each Value (i.e. Value.Contains( T(N) ) ).
I think I'm just hitting a wall volume wise, just too many values to sift through. I don't think a Full Text index will help, because it's only useful for StartsWith queries that look at individual words, not occurrences anywhere within the string at all.
Is there a good approach to indexing this kind of data for such a search in SQL Server?
A half-million records is not terribly large, although I don't know the size of the field contents. A couple of ideas - this was too long for a comment or else I may have posted as such.
You could implement a full-text search engine like Elastic, Solr, etc. and use it as a sidecar. If your text searches don't otherwise make much use of the other data, this might be easy enough. Note that you could put other data for searching into Elastic or Solr, but I'm not sure if you'd want to duplicate all your data, and those tools aren't really great as a transactional data store.
Another option for volumes this small, assuming you only need basic "contains" searching: create two more tables: keywords and keyword_index (or whatever). When saving, tokenize your text content and write out any new keywords to keywords table and then add the data to the join table. Index everything, and then do your search off the keywords table, joining back to the master via the intermediate keyword_index table.
This is fairly hackish, and getting your keyword handling really dialed in (for stemming, etc) may be a pain. It is a reasonable quick & dirty solution for smaller-scale needs though.
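Here is a minimal sketch of the keywords / keyword_index idea. It uses Python's built-in sqlite3 module purely as a stand-in for SQL Server, and the records table, column names and the simple regex tokenizer are assumptions for illustration. The LIKE '%term%' scan runs against the (much smaller) table of distinct keywords rather than against every stored value.

```python
import re
import sqlite3


def build_index(conn, values):
    """Create the three tables and index every value by its keywords."""
    conn.executescript("""
        CREATE TABLE records (id INTEGER PRIMARY KEY, value TEXT NOT NULL);
        CREATE TABLE keywords (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
        CREATE TABLE keyword_index (
            keyword_id INTEGER NOT NULL REFERENCES keywords(id),
            record_id INTEGER NOT NULL REFERENCES records(id),
            PRIMARY KEY (keyword_id, record_id)
        );
    """)
    for value in values:
        record_id = conn.execute(
            "INSERT INTO records (value) VALUES (?)", (value,)
        ).lastrowid
        # Naive tokenizer: lowercase runs of word characters, no stemming.
        for word in set(re.findall(r"\w+", value.lower())):
            conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
            keyword_id = conn.execute(
                "SELECT id FROM keywords WHERE word = ?", (word,)
            ).fetchone()[0]
            conn.execute(
                "INSERT OR IGNORE INTO keyword_index VALUES (?, ?)",
                (keyword_id, record_id),
            )


def search(conn, term):
    """Find records having at least one keyword that contains the term."""
    rows = conn.execute("""
        SELECT DISTINCT r.id, r.value
        FROM keywords k
        JOIN keyword_index ki ON ki.keyword_id = k.id
        JOIN records r ON r.id = ki.record_id
        WHERE k.word LIKE ?
        ORDER BY r.id
    """, (f"%{term.lower()}%",)).fetchall()
    return [value for _, value in rows]


conn = sqlite3.connect(":memory:")
build_index(conn, ["Blue widget", "Red gadget", "Widgets galore"])
widget_hits = search(conn, "widg")
```

Note that a search term spanning two words (for example "blue wid") would not match with this scheme, since matching happens per keyword.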

What indexer do I use to find the list in the collection that is most similar to my list?

Let's say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'}
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like mongo, map reduce, hadoop... All I know are the names of other technologies and I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will – by default – return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list) you can use the mm EDisMax parameter, cf. Solr Wiki
Having said that, I must add that I’ve never searched for 200 query terms (do I understand correctly that the list whose contents should be searched will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
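To illustrate what "closest match" and a minimum-match threshold mean here, this is a small sketch that ranks lists by how many items they share with the query list. It is not how Solr scores documents (Solr's relevance scoring is more involved than a plain overlap count), and the function name and the min_fraction parameter are made up for illustration; the data is the example from the question.

```python
def rank_by_overlap(query_items, candidate_lists, min_fraction=0.0):
    """Rank candidates by the number of items shared with the query.

    Candidates with no shared items, or sharing fewer than min_fraction
    of the query items, are dropped.
    """
    query = set(query_items)
    if not query:
        raise ValueError("query list must not be empty")
    scored = []
    for candidate in candidate_lists:
        shared = len(query & set(candidate))
        if shared > 0 and shared / len(query) >= min_fraction:
            scored.append((shared, candidate))
    # Stable sort: ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored]


mine = ["potato", "rice", "carrot", "corn"]
stored = [
    ["beans", "potato", "oranges", "lettuce"],
    ["carrot", "rice", "corn", "apple"],
    ["onion", "garlic", "radish", "eggs"],
]
ranked = rank_by_overlap(mine, stored)
strict = rank_by_overlap(mine, stored, min_fraction=0.4)
```

A loop like this does not scale to a million lists, which is the reason to let an inverted index such as Solr's do the matching; the sketch only shows the ranking idea.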

Tagging and Analysing a Search Query

I'm developing a search engine which takes the semantics of data into account, unlike the usual keyword-based index. I managed to develop a reasonable index for the search using metadata extraction methods and RDF, but I have difficulty using such methods on the search query itself, since the search query is very much shorter than the actual data. Any idea how to perform a successful tagging of a search query, using similar methods, natural language processing, etc.?
Thank You!
Yes, the sample size of a typical query is too small for semantic analysis to be of any value.
One approach might be to constrain or expand your query using drop-down menus for things like "Named Entities" or "Subject Verb Object" tuples.
Another approach would be to expand simple keywords using rules created from your metadata so that, for example, a query for 'car' might be expanded to the tuple pattern
(*,[drive,operate,sell],[car,automobile,vehicle])
before submission.
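A rule-based expansion like this could be sketched as follows. The rule table, the tuple layout (subject, verbs, objects) and the function name are assumptions for illustration; in practice the rules would be generated from your own metadata.

```python
from typing import List, Optional, Tuple

TuplePattern = Tuple[str, List[str], List[str]]

# Hypothetical rules; the 'car' entry mirrors the example above.
EXPANSION_RULES = {
    "car": ("*", ["drive", "operate", "sell"], ["car", "automobile", "vehicle"]),
}


def expand_keyword(keyword: str) -> Optional[TuplePattern]:
    """Return the tuple pattern for a keyword, or None if no rule applies."""
    return EXPANSION_RULES.get(keyword.strip().lower())


pattern = expand_keyword("Car")
```

Keywords without a rule return None, so the caller can fall back to submitting them unchanged.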
Finally, you might try expanding the query with a non-semantically valuable prefix and/or suffix to get the query size large enough to trigger OpenCalais' recognizer.
Something like 'The user has specified the following terms in her query: one, two, three.'.
And once the results are returned, filter out all results that match only the added prefix/suffix.
Just a few quick thoughts.
You need to build a semantic tree. It will be based on the combination of keywords.
For example, automobile --> vehicle --> car: this relation covers the technical aspect of a car.
travel --> hire/rent --> vehicle --> car: this relation is about travel and renting a car.
In this case MongoDB will help you a lot.

Complex search query in lucene (querying fields which are indexed as numeric, analyzed or not-analyzed using a simple analyzer)

Hi, I am building a search application using Lucene. Some of my queries are complex. For example, my documents contain the fields location and population, where location is a not-analyzed field and population is a numeric field. Now I need to return all the documents that have location "san-francisco" and population between 10000 and 20000. If I combine these two fields and build a query like this:
location:san-francisco AND population:[10000 TO 20000], I am not getting the correct result. Any suggestions on why this could be happening and what I can do?
Also, while building complex queries, some of the fields that I am including are analyzed while others are not. For instance, the location field is not analyzed and contains terms like chicago, san-francisco and so on, while the summary field is analyzed and generally contains a descriptive paragraph.
Consider this query:
location:san-francisco AND summary:"great restaurants"
Now if I use a StandardAnalyzer while searching I do not get the correct results when the location field contains a term like san-francisco or los-angeles (i.e it cannot handle the hyphen in between) but if I use a keyword analyzer for the query I do not get correct results either because it cannot search for the phrase "great restaurants" in the summary field.
First, I would recommend tackling this one problem at a time. From my reading of your post, it sounds like you have multiple issues:
You're unsure why a particular query is not returning any results.
You're unsure why some fields are not being analyzed.
You're having problems with the built-in analyzers dealing with hyphens.
That's how your post reads. If that's correct, I would suggest you post each question separately. You'll get better answers if the question is precise. It's overwhelming trying to answer your question in the current format.
Now, let me take a stab in the dark at some of your problems:
For your first problem, if you're getting into really complex queries in Lucene, ask yourself whether it makes sense to be doing these queries here, rather than in a proper database. For a more generic answer, I'd try isolating the problem by removing parts of the query until you get results back. Once you find out what part of the query is causing no results, we can debug that further.
For the second problem, check the document you're adding to Lucene. Lucene provides options to store data but not index it. Make sure you've got the right option specified when adding fields to the document.
For the third problem, if the built-in analyzers don't work out for you, breaking on hyphens, just build your own analyzer. I ran into a similar issue with the '#' symbol, and to solve the problem, I wrote a custom analyzer that dealt with it properly. You could do the same for hyphens.
You should use PerFieldAnalyzerWrapper. As the name suggests, it lets you use different analyzers for different fields. In this case, you can use KeywordAnalyzer for the location field and StandardAnalyzer for the summary text.
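To show why a per-field analyzer helps, here is a small Python simulation of the idea. It does not use the Lucene API: the two analyzer functions are rough stand-ins (the real StandardAnalyzer does more than split on non-alphanumeric characters), and the class name is made up for illustration.

```python
import re


def keyword_analyzer(text):
    """Keep the whole value as a single token, hyphens included."""
    return [text]


def standard_like_analyzer(text):
    """Rough stand-in: lowercase and split on anything non-alphanumeric."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PerFieldAnalyzer:
    """Dispatch to a field-specific analyzer, falling back to a default."""

    def __init__(self, default, per_field):
        self.default = default
        self.per_field = dict(per_field)

    def analyze(self, field, text):
        return self.per_field.get(field, self.default)(text)


analyzer = PerFieldAnalyzer(standard_like_analyzer, {"location": keyword_analyzer})
location_tokens = analyzer.analyze("location", "san-francisco")
summary_tokens = analyzer.analyze("summary", "Great restaurants")
```

With a single splitting analyzer, "san-francisco" would be broken into "san" and "francisco" and no longer match the not-analyzed term stored in the index; dispatching per field keeps it whole while still tokenizing the summary phrase.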

Resources