Tagging and Analysing a Search Query - search

I'm developing a search engine which functions taking the semantics of data into account, unlike the usual keyword based index. I managed to develop a reasonable index for the search using metadata extraction methods and RDF, but I have difficulty in using such methods on the search query itself since the search query is very much shorter that the actual data. any idea how to perform a successful tagging of a search query, using similar methods, natural language processing, etc. ?
Thank You!

Yes, the sample size of a typical query is too small for semantic analysis to be of any value.
One approach might be to constrain or expand your query using drop-down menus for things like "Named Entities" or "Subject Verb Object" tuples.
Another approach would be to expand simple keywords using rules created from your metadata so that, for example, a query for 'car' might be expanded to the tuple pattern
(*,[drive,operate,sell],[car,automobile,vehicle])
before submission.
Finally, you might try expanding the query with a non-semantically valuable prefix and/or suffix to get the query size large enough to trigger OpenCalais' recognizer.
Something like 'The user has specified the following terms in her query: one, two, three.'.
And once the results are returned, filter out all results that match only the added prefix/suffix.
Just a few quick thoughts.

You need to build semantic tree. It will based on the combination of keywords.
For example, automobile -->vehicle --> car this relation technical aspect of car. travel --
hire/rent-->vehicle-->car this is something related to travel and rent a car.
In this case MongoDB will help you a lot.

Related

What is the difference between arango search and filter keyword

Can somebody explain in details on difference between :
Search and filter keyword
I have already gone through https://www.arangodb.com/learn/search/tutorial/ -> SEARCH vs FILTER
Do anybody has any other experience on the difference?
Thanks,
Nilotpal
FILTER corresponds to the WHERE clause in SQL. It does, what the name says. It uses all sorts of arithmetic and AQL operators to filter the search result. It can make use of regular indexes. There is no ranking of filtered results. Filters operate on single collection result sets.
SEARCH offers a full fledged search engine very much like what you would get from regular search engines like Google's page ranking based on a grammar that you could formulate on your own and can operate on multiple collection contents. Its most natural functionality would be a full text search and ranking. In that use it would be a much more powerful version of the full-text index. But it can do much more: normalisation, tokenisation based on language ...
The list goes on and on. Please refer to the documentation of search here:
https://www.arangodb.com/docs/stable/arangosearch.html

In Azure Search, how do you run a "contains" search with multiple terms in searchtext?

I am using Azure Search in full query mode on top of CosmosDB and I want to run a query for any documents with a field that contains the string "azy do". This should match, for example, a document containing "lazy dog".
Reading the Azure Search documentation, it looks like this is impossible due to the term-based indexes it uses.
Rejected solutions
0 matches since it is looking for whole words:"azy do"
Doesn't work since regexes are not allowed to span multiple terms:/.azy do./
This "works", to the extent that it will match "lazy dog", but this does not respect the ordering of the query and will also match "dog lazy", for example /.azy./ AND /.do./
Is there any way of doing this correctly in Azure Search?
If you cannot achieve that via regular expression in the Lucene Query syntax, then is not possible. You may want to vote for supporting contains here.
It should be /.lazy|dog./
So, split the terms based on whitespaces and add a pipe(|) delimiter which stands for OR.
Shortly, Azure Search is not designed to support this scenario. You might be better off using the CONTAINS function in Cosmos DB or its equivalent, depending on what query language you use.
Azure Search is designed for finding terms or phrases that occur in unstructured content (documents) and returning the most relevant documents. The process of extracting and indexing those searchable terms is customizable and described here: How full text search works in Azure Search.

Using top with Azure Search Suggestions

I am building a search page with Azure Search. On my page, I have a search box. I want to provide suggestions to the users. In an attempt to do this, I'm using the Suggestions endpoint on my index. At this time, I have a request that includes the following query string:
search=sta&suggesterName=sites&$top=3
My question is, how does top determine which three results to return? Is it the first three matches it encounters when going through the search index? Or is it something else? Based on the URL structure, I don't think it's using a scoring profile. So, I ruled out relevancy. But then I started reading about the minimumCoverage field and I got confused.
If the suggest endpoint just returns the first [top] matches it encounters, then why is the minimumCoverage field even needed?
In general, $top will give you the top N results based on whatever order the rest of the query specifies. For queries with no $orderby, the sort order is descending by relevance score. This applies to both Suggest and Search.
Note that just because you don't have a scoring profile (such as with Suggest), that doesn't mean Azure Search doesn't calculate relevance scores for each document. Scoring profiles can influence the score, but they do not completely define it.
For queries with an $orderby, the order of results is defined first by the fields in the $orderby, and then by score if there are any ties to be broken.
minimumCoverage has nothing to do with ordering or $top. It has to do with the way search queries are distributed. Every query is executed concurrently against different subsets of the index (this happens regardless of whether or not you have multiple search units). Sometimes one of these subsets fails to execute for whatever reason, usually when your search service is under heavy load. The minimumCoverage parameter provides a way to relax the rule that normally says "X% of the index must successfully execute the query in order to consider the overall query a success" (X is 100 by default for Search and 80 by default for Suggest). This is a way to tradeoff completeness of search results for higher availability in case of heavy load or partial outages.

What indexer do I use to find the list in the collection that is most similar to my list?

Lets say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'}
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like mongo, map reduce, hadoop... All I know are the names of other technologies and I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will – by default – return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list) you can use the mm EDisMax parameter, cf. Solr Wiki
Having said that, I must add that I’ve never searched for 200 query terms (do I unerstand correctly that the list whose contents should be searched will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.

Complex search query in lucene (querying fields which are indexed as numeric, analyzed or not-analyzed using a simple analyzer)

Hi I am building a search application using lucene. Some of my queries are complex. For example, My documents contain the fields location and population where location is a not-analyzed field and population is a numeric field. Now I need to return all the documents that have location as "san-francisco" and population between 10000 and 20000. If I combine these two fields and build a query like this:
location:san-francisco AND population:[10000 TO 20000], i am not getting the correct result. Any suggestions on why this could be happening and what I can do.
Also while building complex queries some of the fields that I am including are analyzed while others are not analyzed. For instance the location field is not analyzed and contains terms like chicago, san-francisco and so on. While the summary field is analyzed and it generally contains a descriptive paragraph.
Consider this query:
location:san-francisco AND summary:"great restaurants"
Now if I use a StandardAnalyzer while searching I do not get the correct results when the location field contains a term like san-francisco or los-angeles (i.e it cannot handle the hyphen in between) but if I use a keyword analyzer for the query I do not get correct results either because it cannot search for the phrase "great restaurants" in the summary field.
First, I would recommend tackling this one problem at a time. From my reading of your post, it sounds like you have multiple issues:
You're unsure why a particular query
is not returning any results.
You're unsure why some fields are not being analyzed.
You're having problems with the built-in analyzers dealing with
hyphens.
That's how your post reads. If that's correct, I would suggest you post each question separately. You'll get better answers if the question is precise. It's overwhelming trying to answer your question in the current format.
Now, let me take a stab in the dark at some of your problems:
For your first problem, if you're getting into really complex queries in Lucene, ask yourself whether it makes sense to be doing these queries here, rather than in a proper database. For a more generic answer, I'd try isolating the problem by removing parts of the query until you get results back. Once you find out what part of the query is causing no results, we can debug that further.
For the second problem, check the document you're adding to Lucene. Lucene provides options to store data but not index it. Make sure you've got the right option specified when adding fields to the document.
For the third problem, if the built-in analyzers don't work out for you, breaking on hyphens, just build your own analyzer. I ran into a similar issue with the '#' symbol, and to solve the problem, I wrote a custom analyzer that dealt with it properly. You could do the same for hyphens.
You should use PerFieldAnalyzerWrapper. As the name suggests, you can use different analyzers for different field. In this case, you can use KeywordAnalyzer for city name and StandardAnalyzer for text.

Resources