Is it possible to find documents using incomplete words and ArangoSearch? - arangodb

For example, let's pretend my document contained the attribute "description" and the value of it was "Quick brown fox". Could ArangoSearch use the input, "Quic" and be able to find the document that contains the description, "Quick brown fox"?
As far as I know, ArangoSearch can only find matches if the token/word is completed. Is this true?
Here's some query code to show what I'm talking about. If the bind variable @searchInputValue takes the value "Quic", it won't find the document, but if it takes the value "Quick", it does find the document.
FOR document IN v_test
SEARCH ANALYZER(
(
document.description IN TOKENS(@searchInputValue, 'text_en')
)
, 'text_en'
)
RETURN document

You can use the FULLTEXT function of AQL:
https://docs.arangodb.com/3.0/AQL/Functions/Fulltext.html
However, you can't write the prefix syntax directly in AQL when using bind parameters. You have to format the searchInputValue so that it passes:
Quic,+prefix:Quic
So you can write your query as:
FOR res IN FULLTEXT(v_test, "description", @searchInputValue)
RETURN res
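The formatting step can happen in the client before the value is bound. Below is a minimal Python sketch of that idea, not an official helper: the function name to_fulltext_prefix_query and the decision to drop commas, colons and leading +, | or - from the input are my own assumptions, made so that user input cannot change the query syntax. It also assumes a fulltext index exists on description, which FULLTEXT() requires.

```python
def to_fulltext_prefix_query(user_input: str) -> str:
    """Turn raw user input into a FULLTEXT() query string of prefix terms."""
    # Commas separate search words in a fulltext query string and colons mark
    # the "prefix:" / "complete:" qualifiers, so both are treated as spaces.
    cleaned = user_input.replace(",", " ").replace(":", " ")
    # Leading +, | and - are operators in the query syntax; strip them.
    words = [w.lstrip("+-|") for w in cleaned.split()]
    return ",".join(f"prefix:{w}" for w in words if w)


# The result is what you would pass as the searchInputValue bind variable.
bind_vars = {"searchInputValue": to_fulltext_prefix_query("Quic")}
print(bind_vars)  # {'searchInputValue': 'prefix:Quic'}
```

With more than one word, each word becomes its own prefix term, for example "quick bro" becomes "prefix:quick,prefix:bro".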

Related

Arango wildcard query

I am working on building a simple arango query where if the user enters: "foo bar" (starting to type Foo Barber), the query returns results. The issue I am running into is going from a normal single space separated string (i.e. imagine LET str = "foo barber" at the top), to having multiple wildcard queries like the ones shown below.
Also, open to other queries that would work for this, i.e. LIKE, PHRASE or similar.
The goal is when we have a single string like 'foo bar', search results are returned for Foo Barber and similar.
FOR doc IN movies SEARCH PHRASE(doc.name,
[
{WILDCARD: ["%foo%"]},
{WILDCARD: ["%bar%"]}
], "text_en") RETURN doc
If you want to find Black Knight but not Knight Black if the search phrase is black kni, then you should probably avoid tokenizing Analyzers such as text_en.
Instead, create a norm Analyzer that removes diacritics and allows for case-insensitive searching. In arangosh:
var analyzers = require("@arangodb/analyzers");
analyzers.save("norm_en", "norm", {"locale": "en_US.utf-8", "accent": false, "case": "lower"}, []);
Add the Analyzer in the View definition for the desired field (should be title and not name, shouldn't it?). You should then be able to run queries like:
FOR doc IN movies SEARCH ANALYZER(STARTS_WITH(doc.title, TOKENS("Black Kni", "norm_en")[0]), "norm_en") RETURN doc
FOR doc IN movies SEARCH ANALYZER(LIKE(doc.title, TOKENS("Black Kni%", "norm_en")[0]), "norm_en") RETURN doc
FOR doc IN movies SEARCH ANALYZER(LIKE(doc.title, CONCAT(TOKENS(SUBSTITUTE("Black Kni", ["%", "_"], ["\\%", "\\_"]), "norm_en")[0], "%")), "norm_en") RETURN doc
The search phrase Black Kni is normalized to black kni and then used for a prefix search, either using STARTS_WITH() or LIKE() with a trailing wildcard %. The third example escapes user-entered wildcard characters.
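To see what the queries above do to the input, here is a rough client-side imitation in Python. It is only a sketch: normalize() approximates the norm Analyzer (lower-casing and accent removal) with unicodedata and is not guaranteed to match ArangoDB's ICU-based behavior exactly, and all three function names are made up for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate a 'norm' Analyzer with accent=false and case=lower."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.lower()


def like_prefix_pattern(user_input: str) -> str:
    """Escape the LIKE wildcards % and _, normalize, then append a trailing %."""
    escaped = user_input.replace("%", "\\%").replace("_", "\\_")
    return normalize(escaped) + "%"


def matches_prefix(title: str, user_input: str) -> bool:
    """Whole-string prefix test on normalized values, like STARTS_WITH() here."""
    return normalize(title).startswith(normalize(user_input))


print(like_prefix_pattern("Black Kni"))               # black kni%
print(matches_prefix("Black Knight", "black kni"))    # True
print(matches_prefix("Knight Black", "black kni"))    # False
```

Because the title is kept as one normalized string instead of being split into tokens, word order matters, which is why Knight Black is not matched.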

get list of collections having all words exists in field which added in given string in mongoDB

I want to search MongoDB for the documents that have all the keywords of a given string.
For example,
I have a collection with these documents:
{
"id":1,
"text":"go for shopping",
"description":"you can visit this branch as well"
}
{
"id":2,
"text":"check exiting discount",
"description":"We have various discount options"
}
Now, if I pass a string like "I want to go for shopping" against the text field in a MongoDB find query, I should get the first document as output, because its text field value "go for shopping" exists in the input string.
This can be achieved through the $text operator in MongoDB, but you have to create a text index on the "text" field in your database (or whichever field you want to match; I would suggest renaming it in your database to avoid confusion).
db.yourCollectionName.createIndex({"text":"text"})
The first field here is the "text" field in your database, and the second one is the mongo operator.
Then you can pass any query like,
db.yourCollectionName.find({$text: {$search: "I want to go for shopping"}})
The "$text" here is the mongo operator.
This would return all documents which have any of the keywords above.
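To make the "any of the keywords" behavior concrete, here is a small pure-Python sketch of the matching logic, run on the two sample documents from the question. It is a simplification I am assuming for illustration: it ignores the stemming and stop-word removal that a real text index applies, and the helper names are invented. The second function shows the stricter check the question seems to ask for, where every word of the field has to appear in the input string.

```python
import re

docs = [
    {"id": 1, "text": "go for shopping"},
    {"id": 2, "text": "check exiting discount"},
]


def words(s):
    """Lower-case the string and split it into a set of plain words."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def any_keyword(docs, search):
    """OR semantics: keep documents sharing at least one word with the search."""
    wanted = words(search)
    return [d["id"] for d in docs if words(d["text"]) & wanted]


def all_field_words(docs, search):
    """Stricter: every word of the field must occur in the search string."""
    wanted = words(search)
    return [d["id"] for d in docs if words(d["text"]) <= wanted]


print(any_keyword(docs, "I want to go for shopping"))      # [1]
print(any_keyword(docs, "discount on shopping"))           # [1, 2]
print(all_field_words(docs, "discount on shopping"))       # []
```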
Maybe you can read more around this and improvise and modify.
Ref: MongoDb $text
You can do so through regular expressions. MongoDB supports matching strings against regex patterns.
In your case you could do something like:
db.yourCollectionName.find({text:{$regex:"go for shopping" }})
This will return all the documents having the phrase "go for shopping" in the text field.
Ref: MongoDb Regex
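A quick way to get a feel for this is to try the same pattern with Python's re module. This is only an analogy under the assumption that the pattern is simple enough to behave the same in both engines (MongoDB uses PCRE, Python has its own engine); regex_find is a made-up helper. The point it shows is that an unanchored pattern matches anywhere in the value, and that user input should be escaped before it is used as a pattern.

```python
import re

texts = [
    "go for shopping",
    "let's go for shopping today",
    "check exiting discount",
]


def regex_find(pattern, values):
    """Keep the values in which the pattern matches anywhere (unanchored)."""
    return [v for v in values if re.search(pattern, v)]


# Unanchored: matches the phrase anywhere inside the field value.
print(regex_find("go for shopping", texts))
# Anchored with ^ and $: only the exact value matches.
print(regex_find("^go for shopping$", texts))
# Escape user input so characters such as "." are taken literally.
print(regex_find(re.escape("a.b"), ["a.b", "axb"]))
```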

Solr - Include a field only if other fields were found

Assuming I have the fields
textFieldA
textFieldB
specialC
in my index. Now I want to query these with
textFieldA:"searchVal" textFieldB:"searchVal" specialC:"somecode"
But I only want to boost matches on specialC if there were also matches on at least one of the other fields.
Example:
DocumentA:
textFieldA:"This is a test" textFieldB:"for clarification" specialC:"megacode"
DocumentB:
textFieldA:"Doesnt contain" textFieldB:"searched word here" specialC:"megacode"
DocumentC:
textFieldA:"But this again" textFieldB:"contains test" specialC:"supercode"
Now when searching for example with
textFieldA:"test" textFieldB:"test" specialC:"supercode"
I want the results
DocumentC
DocumentA
with document C having the highest rank, but document B being excluded.
How can this be achieved?
q=textFieldA:"test" OR textFieldB:"test" OR textFieldA:"test" AND specialC:"supercode" OR textFieldB:"test" AND specialC:"supercode"&bq=(specialC:"supercode")^100
Should return only DocumentC and DocumentA in the desired order. bq means boosting one field/ field value, see more here https://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F .
As far as I know query boosting works only if you actually query for the thing you want to boost (kind of intuitive). That is why I added the last 2 parts to the query.
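If you build the request from code, it helps to make the grouping explicit, because mixing AND and OR without parentheses is easy to misread. Here is a Python sketch that assembles the same query with parentheses and URL-encodes it. The helper names are invented, and defType=edismax is my assumption: bq is a parameter of the DisMax/eDisMax parsers, so adjust it to whatever your request handler uses.

```python
from urllib.parse import urlencode


def quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_solr_params(search_val: str, code: str) -> dict:
    """Build q and bq so specialC only counts together with a text match."""
    a = f"textFieldA:{quote(search_val)}"
    b = f"textFieldB:{quote(search_val)}"
    c = f"specialC:{quote(code)}"
    return {
        "defType": "edismax",  # assumption: bq is read by the (e)dismax parsers
        "q": f"{a} OR {b} OR ({a} AND {c}) OR ({b} AND {c})",
        "bq": f"({c})^100",
    }


params = build_solr_params("test", "supercode")
print(params["q"])
print(urlencode(params))  # query string to append to the select URL
```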

Lucene multiphrasequery search with wildcard

I have been trying to do a lucene search query where entering "Foo B" would return "Foo Bar", "Foo Bear", "Foo Build" etc. but will not return a record with an ID of "Foo" and the word "Bar" in say its 'description' field.
I have looked into multiphrasequery but it never returns any results, below is what I have been trying
Term firstTerm = new Term("jobTitle", "Entry");
Term secondTerm = new Term("jobTitle", "Artist");
Term asdTerm = new Term(fld)
Term[] tTerms = new Term[]{firstTerm, secondTerm};
MultiPhraseQuery multiPhrasequery = new MultiPhraseQuery();
multiPhrasequery.add( tTerms );
org.hibernate.Query hibQuery = fullTextSession.createFullTextQuery(multiPhrasequery, this.type).setSort(sort);
results = hibQuery.list();
The likely problem that I see is capitalization. "Entry" and "Artist" are not getting passed through a query parser, and so will not be run through an analyzer, and so are case sensitive. The field you are indexing is probably analyzed with an analyzer that includes a LowerCaseFilter, so the end terms would not contain leading capitals. Without knowing how you index your documents, I can't say that will fix it with any certainty, but it seems the most likely possibility.
That fixed, the query you've created should match anything with either the term "entry" or "artist" in the jobTitle field.
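The effect is easy to reproduce outside Lucene. The following Python toy is not Lucene code, just a stand-in I am assuming for illustration: analyze() plays the role of an analyzer with a lowercase filter, and position_matches() mimics a single MultiPhraseQuery position that accepts any of the given terms.

```python
def analyze(text):
    """Stand-in for an analyzer with a lowercase filter: lowercase and split."""
    return text.lower().split()


# Terms as they would end up in the index for one jobTitle value.
indexed_terms = set(analyze("Entry Level Artist"))


def position_matches(terms, indexed):
    """One multi-term position matches if any of its terms is indexed."""
    return any(t in indexed for t in terms)


# Raw terms built by hand keep their capitals and find nothing.
print(position_matches(["Entry", "Artist"], indexed_terms))   # False
# Lowercasing them the way the analyzer would makes them match.
print(position_matches(["entry", "artist"], indexed_terms))   # True
```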

How do I get all hits from a cts:search() in Marklogic

I have a collection containing lots of documents.
When I search the collection, I need to get a list of matches independent of documents. So if I search for the word "pie", I would get back a list of documents, properly sorted by relevance. However, some of these documents contain the word "pie" in more than one place. I would like to get back a list of all matches, unrelated to the document where the match was found. Also, this list of all hits would need to be sorted by relevance (weight), again totally independent of the document (not grouped by the document).
Following code searches and returns matches grouped by the document...
let $searchfor := "pie"
let $query := cts:and-query((
cts:element-word-query(xs:QName("title"), ($searchfor), (), 16),
cts:element-word-query(xs:QName("para"), ($searchfor), (), 10)
))
let $resultset := cts:search(fn:collection("docs"), $query)[0 to 100]
for $n in $resultset
return cts:score($n)
What I need is $n to be the "match-node", not a "document-node"...
Thanks!
Document relevance is determined by TFIDF. Matches contribute to a document's score but don't have scores relative to each other. cts:search already returns results ordered by document relevance, so you could do this to get match nodes ordered by their ancestor document score:
let $searchfor := "pie"
let $query := cts:and-query((
cts:element-word-query(xs:QName("title"), ($searchfor), (), 16),
cts:element-word-query(xs:QName("para"), ($searchfor), (), 10)
))
return
cts:search(//(title|para),$query)[0 to 100]/cts:highlight(.,$query,element match {$cts:node})//match/*
You need to split the document (fragment it) into smaller documents. Every text node could be a document, with a stored original XPath so that the context is not lost.
I recommend that you look at the Search API (http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf and http://community.marklogic.com/pubs/5.0/apidocs/SearchAPI.html). This API will give what you want, providing match nodes as well as the URIs for the actual documents. You should also find it easier to use for the general cases, although there will be edge cases where you will need to revert back to cts:search.
search:search is the specific function you will want to use. It will give you back responses similar to this:
<search:response total="1" start="1" page-length="10" xmlns=""
xmlns:search="http://marklogic.com/appservices/search">
<search:result index="1" uri="/foo.xml"
path="fn:doc(&quot;/foo.xml&quot;)" score="328"
confidence="0.807121" fitness="0.901397">
<search:snippet>
<search:match path="fn:doc(&quot;/foo.xml&quot;)/foo">
<search:highlight>hello</search:highlight></search:match>
</search:snippet>
</search:result>
<search:qtext>hello sample-property-constraint:boo</search:qtext>
<search:report id="SEARCH-FLWOR">(cts:search(fn:collection(),
cts:and-query((cts:word-query("hello", ("lang=en"), 1),
cts:properties-query(cts:word-query("boo", ("lang=en"), 1))),
()), ("score-logtfidf"), 1))[1 to 10]
</search:report>
<search:metrics>
<search:query-resolution-time>PT0.647S</search:query-resolution-time>
<search:facet-resolution-time>PT0S</search:facet-resolution-time>
<search:snippet-resolution-time>PT0.002S</search:snippet-resolution-time>
<search:total-time>PT0.651S</search:total-time>
</search:metrics>
</search:response>
Here you can see that every result has one or possibly more match elements defined.
How would you determine the relevance of a word independent of the document? Relevance is a measure of document relevance, not word relevance. I don't know how one would measure word relevance.
You could potentially return all words ordered by document relevance, then words for each document in "document order" which means the order in which they appear in the document. That would be relatively easy to do with search:search where you iterate over all results and extract each matching word. What would you present with each match? Its surrounding snippet?
Keep in mind that what you're asking for would potentially take a long time to execute.
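The iteration described above (documents by relevance, then matches in document order) could look roughly like this once the search response has been parsed. This is a Python sketch over a hypothetical in-memory structure, not MarkLogic API code: the Result type and the sample data are invented for illustration.

```python
from typing import List, NamedTuple, Tuple


class Result(NamedTuple):
    uri: str
    score: int
    matches: List[str]  # matched text, already in document order


def flatten_matches(results: List[Result]) -> List[Tuple[str, int, str]]:
    """List every match, ordered by document score, then by document order."""
    # sorted() is stable, so documents with equal scores keep their order.
    by_score = sorted(results, key=lambda r: r.score, reverse=True)
    return [(r.uri, r.score, m) for r in by_score for m in r.matches]


results = [
    Result("/a.xml", 120, ["apple pie"]),
    Result("/b.xml", 328, ["pie chart", "cherry pie"]),
]
for row in flatten_matches(results):
    print(row)
```

Each match inherits the score of its document, which is the closest thing to a per-match ranking that document-level relevance can give you.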
