MarkLogic diacritic-insensitive snippet - search

For now I'm using this code to generate a snippet from a JSON document that I get from a MarkLogic search.
xquery version "1.0-ml";
module namespace searchlib="http://ihs.com/lib/searchlib";
import module namespace search="http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";
import module namespace json="http://marklogic.com/xdmp/json" at "/MarkLogic/json/json.xqy";
declare function searchlib:get-snippet($docc,$words) {
let $doc:= json:transform-from-json($docc)
let $squery := search:parse($words)
let $result := <result>{search:snippet($doc,$squery,
<transform-results apply="snippet" xmlns="http://marklogic.com/appservices/search">
<max-snippet-chars>255</max-snippet-chars>
</transform-results>)}</result>
return $result//search:match
};
When performing search I'm using:
cts.jsonPropertyValueQuery(fieldname, values,
['case-insensitive', 'diacritic-insensitive'])
So the search is diacritic-insensitive and produces good results, but I can't find a way to pass the diacritic-insensitive option to search:snippet the way I do with cts.jsonPropertyValueQuery.
In the documentation, I can see this in the parameter description:
Options to define the search grammar and control the search. See description for $options for the function search:search. Note that you cannot specify the apply attribute on the transform-results option with search:snippet; to use a different snippetting function, use search:search or search:resolve instead.
But the signature given is:
search:snippet(
$result as node(),
$cts-query as schema-element(cts:query),
[$options as element(search:transform-results)?]
) as element(search:snippet)
So does it mean I can't pass other options to search:snippet? Or is there an option to do this?
I'm testing it with chávez and it returns results, but snippets are generated properly only for documents containing an exact match. That means that the document
Chavez did something
will not get a highlight on Chavez, and
Chávez did something
will get a highlight.
Thanks in advance!

The problem was not in the options passed to search:snippet, but in those passed to search:parse:
xquery version "1.0-ml";
module namespace searchlib="http://ihs.com/lib/searchlib";
import module namespace search="http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";
import module namespace json="http://marklogic.com/xdmp/json" at "/MarkLogic/json/json.xqy";
declare function searchlib:get-snippet($docc,$words) {
let $doc:= json:transform-from-json($docc)
let $squery := search:parse($words,
<options xmlns="http://marklogic.com/appservices/search">
<term>
<term-option>case-insensitive</term-option>
<term-option>diacritic-insensitive</term-option>
</term>
</options>, "cts:query")
let $result := <result>{search:snippet($doc,$squery,
<transform-results apply="snippet" xmlns="http://marklogic.com/appservices/search">
<max-snippet-chars>255</max-snippet-chars>
</transform-results>)}</result>
return $result//search:match
};
Passing
<term-option>diacritic-insensitive</term-option>
to search:parse made it work.
Here is the explanation from MarkLogic:
The search:snippet() function allows you to extract matching text and
returns the matches wrapped in a containing node, with highlights
tagged. However, to allow the search:snippet to extract the correct
text, the cts:query() that is passed to the snippet should match the
sequence of values. For search:snippet, cts:query is typically a
result of a call to search:parse. The search:parse() function parses
query text according to given options and returns the appropriate
cts:query XML.
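To see why the query itself has to carry the diacritic-insensitive option, here is a minimal Python sketch of the idea. It is not MarkLogic code: the `fold` and `highlight` helpers are hypothetical, and Unicode folding is only a rough stand-in for the case-insensitive and diacritic-insensitive term options.

```python
import re
import unicodedata

def fold(text):
    """Lowercase and strip combining marks (rough stand-in for the two term options)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

def highlight(text, term, insensitive):
    """Wrap every token equal to the term in [brackets].

    With insensitive=True both sides are folded first, mimicking a query
    parsed with case-insensitive and diacritic-insensitive term options.
    """
    key = fold(term) if insensitive else term
    out = []
    # The capturing group keeps the separators so the text can be rebuilt.
    for token in re.split(r"(\W+)", text):
        candidate = fold(token) if insensitive else token
        out.append("[" + token + "]" if candidate == key else token)
    return "".join(out)

print(highlight("Chavez did something", "chávez", insensitive=False))
print(highlight("Chavez did something", "chávez", insensitive=True))
```

The first call leaves the text untouched, because the highlighter compares the raw query term against the raw token. Only when the query is built with the insensitive options does "Chavez" get marked, which is the same effect as adding the term options to search:parse.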

Related

Chaining filters in JMESpath

I want to be able to chain together multiple filters using JMESpath but it appears you cannot filter on the output of a filter.
My working example is as follows:
// document:
{
pips: {
ancestors:[{
pid: 'p01234567'
}],
episode: {
more: 'data',
goes: 'here'
}
}
}
// working filter: `[pips][?ancestors[?pid=='p01234567'] && episode]`
But I would like to write my filter instead as follows, effectively to filter the output of another filter:
[pips][?ancestors[?pid=='p01234567']][?episode]
Any idea why this doesn't work?
I am building this in NodeJS using the following NPM package: https://www.npmjs.com/package/jmespath
Is there a mistake in the syntax I am using, is there a bug in the library I am using, or am I just trying to do something that is outside what JMESpath allows?
Thank you!
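I found the reason why: projections are evaluated in two steps. The left-hand side creates a JSON array of initial values, and the right-hand side is an expression that is applied to each element of that array. A second filter chained directly onto the first is therefore applied to each element (here an object) rather than to the array as a whole.
Solution: "pipe expressions", which allow you to "operate on the result of a projection".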
So instead of the incorrect expression from before: [pips][?ancestors[?pid=='p01234567']][?episode]
This instead should be written as: [pips][?ancestors[?pid=='p01234567']] | [?episode]
And to undo the conversion of the initial document into an array, we can convert this back to an object like this: [pips][?ancestors[?pid=='p01234567']] | [?episode] | [0]
As a side note, I observed that using parentheses () also works, but using pipes is a bit cleaner.
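The two-step evaluation can be modelled in a few lines of plain Python. This is only a hand-rolled sketch of the semantics, not the jmespath library: `filter_projection` and `project` are hypothetical helpers, and the sample document assumes each ancestor carries a `pid` key, as the filter expects.

```python
doc = {
    "pips": {
        "ancestors": [{"pid": "p01234567"}],
        "episode": {"more": "data", "goes": "here"},
    }
}

def filter_projection(value, predicate):
    """Model of `[?expr]`: defined only on arrays, null (None) on anything else."""
    if not isinstance(value, list):
        return None
    return [item for item in value if predicate(item)]

def project(values, expr):
    """Model of a projection's right-hand side: apply expr per element, drop nulls."""
    results = (expr(v) for v in values)
    return [r for r in results if r is not None]

def has_ancestor(item):
    matches = filter_projection(item.get("ancestors"), lambda a: a.get("pid") == "p01234567")
    return bool(matches)

def has_episode(item):
    return bool(item.get("episode"))

# [pips][?ancestors[?pid=='p01234567']]
selected = filter_projection([doc["pips"]], has_ancestor)

# ...[?episode] chained directly: the filter runs on each element (an object),
# yields null every time, and the nulls are dropped.
chained = project(selected, lambda item: filter_projection(item, has_episode))

# ... | [?episode]: the pipe stops the projection, so the filter sees the whole array.
piped = filter_projection(selected, has_episode)

# ... | [0]
first = piped[0] if piped else None

print(chained)
print(first)
```

In this model the chained form comes out empty while the piped form returns the `pips` object, which matches the behaviour described above.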

How to search for a field name matching a dynamic pattern in MongoDB?

I need to search a MongoDB collection for a field whose name matches a specific pattern. I tried using {$exists:true}, but this only gives results if you provide the exact field name, not a pattern.
{
"field1": "value1",
"field2": "value2",
"field3": {
"/arjun1/pat1": 1,
"/arjun2/pat2": 3,
"/arjun3/pat3": 5
},
"field4": "value4"
}
From some field, I get the keys pat3 & field3. From this I would need to find out if the value /arjun3/pat3 exists in the document.
If I use {"field3./arjun3/pat3":{$exists:true}}, this gives me results. But the problem is that I only get field3 and pat3, so I need some kind of pattern matching like field3.*.pat3 combined with $expr or $exists, and I'm not sure how to do that. Please help.
You could try something like this:
db.arjun.find(
{"field3" : {
"$elemMatch" : { $and: [
{"arjun3.pat3" : {$exists:true}},
{"arjun3.pat3" : 5}
]
}}}
);
You can either use regular expressions (the re module) for SQL-like pattern matching and compile your own custom wildcard, or, if you don't want that, simply use the fnmatch module. It is part of Python's standard library and supports wildcard matching for multiple characters (via *) or a single character (via ?).
import fnmatch
a = "hello"
print(fnmatch.fnmatch(a, "h*"))
Output:
True
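Applied to the document from the question, the same idea could look like the sketch below. It assumes the document has already been fetched into a Python dict, so the matching happens client-side and not inside a MongoDB query; `matching_keys` is a hypothetical helper name.

```python
import fnmatch

doc = {
    "field1": "value1",
    "field2": "value2",
    "field3": {"/arjun1/pat1": 1, "/arjun2/pat2": 3, "/arjun3/pat3": 5},
    "field4": "value4",
}

def matching_keys(document, field, suffix):
    """Return the keys under `field` that end in '/<suffix>'."""
    sub = document.get(field, {})
    pattern = "*/" + suffix
    # fnmatchcase avoids platform-specific case and separator normalisation.
    return [key for key in sub if fnmatch.fnmatchcase(key, pattern)]

print(matching_keys(doc, "field3", "pat3"))
```

Once the full key is known, it can be used in an exact {$exists:true} query like the one in the question.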

Is there a way to prevent partial word matching using Sitecore Search and Lucene?

Is there a way when using Sitecore Search and Lucene to not match partial words? For example when searching for "Bos" I would like to NOT match the word "Boston". Is there a way to require the entire word to match? Here is a code snippet. I am using FieldQuery.
bool _foundHits = false;
_index = SearchManager.GetIndex("product_version_index");
using (IndexSearchContext _searchContext = _index.CreateSearchContext())
{
QueryBase _query = new FieldQuery("title", txtProduct.Text.Trim());
SearchHits _hits = _searchContext.Search(_query, 1000);
...
}
You may want to try something like this to build the query you want to run. It puts the + in (indicating a required term) and quotes the term, so it should match exactly what you're looking for, provided you pass in BooleanClause.Occur.MUST. It has worked for me.
protected BooleanQuery GetBooleanQuery(string fieldName, string term, BooleanClause.Occur occur)
{
QueryParser parser = new QueryParser(fieldName, new StandardAnalyzer());
BooleanQuery query = new BooleanQuery();
query.Add(parser.Parse(term), occur);
return query;
}
Essentially, your query ends up being parsed to +title:"Bos". You could also download Luke and play around with the query syntax there; it's easier if you know what the syntax should be and then work backwards to see which query objects will generate it.
To get exact-match results, you have to place the query in double quotes. Lucene supports many such operators and boolean parameters, which can be found here: http://lucene.apache.org/core/2_9_4/queryparsersyntax.html
It depends on the field type. If you have a memo or text field, partial matching is applied. If you want exact matching, use a string field instead. You can find some details here: https://www.cmsbestpractices.com/bug-how-to-fix-solr-exact-string-matching-with-sitecore/ .
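The difference between a whole-term match and a partial match can be sketched in Python. This is only a conceptual model, not Lucene or Sitecore code: `tokenize`, `term_match` and `prefix_match` are hypothetical helpers, and the tokenizer is a crude approximation of what an analyzer does at index time.

```python
import re

def tokenize(text):
    """Lowercase and split on non-word characters, roughly like an analyzer."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

def term_match(title, term):
    """Whole-term match: the query term must equal one indexed token."""
    return term.lower() in tokenize(title)

def prefix_match(title, term):
    """Partial match: an indexed token only has to start with the query term."""
    needle = term.lower()
    return any(tok.startswith(needle) for tok in tokenize(title))

print(term_match("Boston Guide", "Bos"))
print(prefix_match("Boston Guide", "Bos"))
```

Under these assumptions the term match rejects "Boston" while the prefix match accepts it, which is the behaviour the question is trying to avoid.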

examine stripping out search words

I'm using Umbraco and I have Examine up and running; however, my query is having words stripped out.
For example:
I am searching on "man on the moon" with the following line of code, the variable "searchTerm" should contain "man on the moon":
var Searcher = ExamineManager.Instance.SearchProviderCollection["MySearcher"];
var searchCriteria = Searcher.CreateSearchCriteria();
var query = searchCriteria.Field("Name", searchTerm).Compile();
however, the query is generated as this when I debug:
{ SearchIndexType: , LuceneQuery: +Name:"man moon" }
Notice how it has removed the words "on the" from the searchTerm?
Presumably this is because they are deemed stop/reserved words. However, this means I do not get the search results I expect.
How can I get around this?
Internally the StopAnalyzer class is used by the StandardAnalyzer as part of the standard indexing process. The StopAnalyzer (http://lucenenet.apache.org/docs/3.0.3/d7/df5/_stop_analyzer_8cs_source.html#l00054) contains a method which allows you to substitute a different set of stopwords as an ISet type parameter rather than use the standard ENGLISH_STOP_WORDS_SET (line 134).
And I read here (http://webcache.googleusercontent.com/search?q=cache:sA-uyAC015UJ:our.umbraco.org/m%3Fmode%3Dtopic%26id%3D25600+&cd=2&hl=en&ct=clnk&gl=uk) that you can get Examine to use an empty set of stopwords by adding the following line to your application_start method in global.asax
Lucene.Net.Analysis.StopAnalyzer.ENGLISH_STOP_WORDS_SET = new System.Collections.Hashtable();
So with an empty set of stopwords, your "man on the moon" should be back.
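The effect of the stop-word set can be illustrated with a tiny Python model. It is not Lucene or Examine code: `analyze` is a hypothetical helper, and the stop-word set is only a small illustrative subset, not the full default list.

```python
# Small illustrative subset of English stop words (assumption, not the full list).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

def analyze(text, stop_words):
    """Lowercase, split on whitespace and drop any token found in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

print(analyze("man on the moon", STOP_WORDS))
print(analyze("man on the moon", set()))
```

With the stop words in place only "man" and "moon" survive, which corresponds to the +Name:"man moon" query seen in the debugger; with an empty set all four words are kept.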
A bit of an odd idea, but as an alternative you could also add a StopAnalyzer to ExamineSettings.config to create an index of docs with only the stop words, and then AND them with your StandardAnalyzer result set.

Is there a standard method for managing camel-cased strings in Groovy?

For example, Groovy converts the getSomeProperty() method to someProperty.
I need the same for my string: prefixMyString converted to myString.
Is there a standard way to do so?
Groovy doesn't actually convert getSomeProperty() into someProperty. It only converts the other way, turning someProperty into getSomeProperty()
It does this using the capitalize(String property) method on org.codehaus.groovy.runtime.MetaClassHelper. You can run this in the console to see it work:
org.codehaus.groovy.runtime.MetaClassHelper.capitalize('fredFlintstone')
// outputs 'FredFlintstone'
The full conversion, including adding set, get, or is, can be found in the class groovy.lang.MetaProperty, under the methods getGetterName and getSetterName.
To convert the other way, you'll have to write your own code. However, that's relatively simple:
def convertName(String fullName) {
def out = fullName.replaceAll(/^prefix/, '')
out[0].toLowerCase() + out[1..-1]
}
println convertName('prefixMyString') // outputs: myString
println convertName('prefixMyOTHERString') // outputs: myOTHERString
Just change the prefix to meet your needs. Note that it's a regex, so you have to escape any special characters in it.
EDIT: I made a mistake. There actually is a built-in Java method to decapitalize, so you can use this:
def convertName(String fullName) {
java.beans.Introspector.decapitalize(fullName.replaceAll(/^prefix/, ''))
}
It works nearly the same, but uses the built-in Java class for handling the decapitalization. This method handles uppercase characters a little differently, so that prefixUPPERCASETest returns UPPERCASETest.
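For comparison, the decapitalization rule can be written out in Python. This is only a sketch of the rule as I understand it (leave the name alone when its first two characters are both uppercase, otherwise lowercase the first character), not the Java source; `decapitalize` and `convert_name` are names chosen for the example.

```python
def decapitalize(name):
    """Lowercase the first character unless the first two are both uppercase."""
    if not name:
        return name
    if len(name) > 1 and name[0].isupper() and name[1].isupper():
        return name
    return name[0].lower() + name[1:]

def convert_name(full_name, prefix="prefix"):
    """Strip a leading prefix (plain string, not a regex) and decapitalize the rest."""
    if full_name.startswith(prefix):
        full_name = full_name[len(prefix):]
    return decapitalize(full_name)

print(convert_name("prefixMyString"))
print(convert_name("prefixUPPERCASETest"))
```

The second call shows the difference from the hand-written version: a name that starts with two capitals is returned unchanged after the prefix is removed.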

Resources