Lucene analyzer for first name - search

Is there a Lucene analyzer out there that tokenizes name parts together with their short-name equivalents (e.g. Mike and Michael, Rich and Richard, Suzie and Susan)?
Fuzzy match on Levenshtein distance is a solution I know, and some implementors seem to pair fuzzy match with the soundex algorithm. Surely somebody has made a swipe at just plain listing all of these short names somewhere?
EDIT: The toughest part of this question is where to get the synonym data from?

I am not aware of any specific nickname filter out there.
A SynonymFilter would make it reasonably easy to generate though, if you had a data source for it. This appears to be a good source of nickname data:
https://code.google.com/p/nickname-and-diminutive-names-lookup/
You would need to generate the SynonymMap to pass into the SynonymFilter ctor, which should look something like this (I think):
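// true = drop duplicate rules while building
SynonymMap.Builder builder = new SynonymMap.Builder(true);
// includeOrig = false: the nickname token is replaced by the full name
builder.add(new CharsRef("Mike"), new CharsRef("Michael"), false);
builder.add(new CharsRef("Rich"), new CharsRef("Richard"), false);
builder.add(new CharsRef("Suzie"), new CharsRef("Susan"), false);
SynonymMap map = builder.build(); // build() can throw IOException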
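To make the data-source step more concrete, here is a minimal sketch in plain Java that turns nickname rows into (nickname, full name) pairs, each of which could then be handed to builder.add(...). The row layout (full name first, nicknames after it, comma-separated) and the class name NicknamePairs are assumptions for illustration only; check the actual file format of whatever data set you end up using.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class NicknamePairs {

    /** One (nickname, fullName) pair, e.g. ("mike", "michael"). */
    public static final class Pair {
        public final String nickname;
        public final String fullName;

        Pair(String nickname, String fullName) {
            this.nickname = nickname;
            this.fullName = fullName;
        }
    }

    /**
     * Parses rows of the ASSUMED form "fullname,nick1,nick2,...".
     * Names are lower-cased; blank rows and rows without nicknames are skipped.
     */
    public static List<Pair> parse(List<String> rows) {
        List<Pair> pairs = new ArrayList<>();
        for (String row : rows) {
            String[] cols = row.trim().toLowerCase(Locale.ROOT).split("\\s*,\\s*");
            if (cols.length < 2 || cols[0].isEmpty()) {
                continue; // nothing to map on this row
            }
            for (int i = 1; i < cols.length; i++) {
                if (!cols[i].isEmpty()) {
                    pairs.add(new Pair(cols[i], cols[0]));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the real data set.
        List<String> rows = Arrays.asList("michael,mike,mickey", "richard,rich");
        for (Pair p : parse(rows)) {
            System.out.println(p.nickname + " -> " + p.fullName);
        }
    }
}
```

Because the pairs are lower-cased here, the analyzer chain would also need to lower-case tokens before the synonym filter for the rules to match.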

Related

AEM Query builder exclude a folder in search

I need to create a query where the params are like:
queryParams.put("path", "/content/myFolder");
queryParams.put("1_property", "myProperty");
queryParams.put("1_property.operation", "exists");
queryParams.put("p.limit", "-1");
But I need to exclude a certain path inside this folder, say "/content/myFolder/wrongFolder", and search in all the other folders (whose number keeps varying).
Is there a way to do so? I couldn't find exactly this online.
I also tried the unequals operation, since the parent path is saved in a JCR property, but had no luck. What I actually need is an "unlike" operation to avoid all occurrences of the path, but there is no such thing:
path=/main/path/to/search/in
group.1_property=cq:parentPath
group.1_property.operation=unequals
group.1_property.value=/path/to/be/avoided
group.2_property=myProperty
group.2_property.operation=exists
group.p.or=true
p.limit=-1
This is an old question but the reason you got more results later lies in the way in which you have constructed your query. The correct way to write a query like this would be something like:
path=/main/path/where
property=myProperty
property.operation=exists
property.value=true
group.p.or=true
group.p.not=true
group.1_path=/main/path/where/first/you/donot/want/to/search
group.2_path=/main/path/where/second/you/donot/want/to/search
p.limit=-1
A couple of notes: your group.p.or in your last comment would have applied to all of your groups because they weren't delineated by a group number. If you want an OR to be applied to a specific group (but not all groups), you would use:
path=/main/path/where
group.1_property=myProperty
group.1_property.operation=exists
group.1_property.value=true
2_group.p.or=true
2_group.p.not=true
2_group.3_path=/main/path/where/first/you/donot/want/to/search
2_group.4_path=/main/path/where/second/you/donot/want/to/search
Also, the numbers themselves don't matter. They don't have to be sequential, as long as predicate numbers aren't reused; a reused number will cause an exception to be thrown when the QB tries to parse the query. For readability and by general convention, though, they're usually presented sequentially.
I presume that your example was just thrown together for this question, but obviously your "do not search" paths would have to be children of the main path you want to search. Otherwise, including them in the query would be superfluous, since the query would not be searching them anyway.
AEM Query Builder Documentation for 6.3
Hope this helps someone in the future.
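For anyone building these parameters in Java, here is a minimal sketch of a helper that assembles the same predicate map for any number of excluded paths. The class and method names (ExcludePathsQuery, build) are hypothetical, and the sketch only produces the key/value map; in AEM you would then hand that map to the QueryBuilder (for example via PredicateGroup.create(map)) together with a session.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExcludePathsQuery {

    /**
     * Builds QueryBuilder parameters that search under searchPath for nodes
     * having the given property, while skipping every path in excludedPaths.
     */
    public static Map<String, String> build(String searchPath, String property,
                                            List<String> excludedPaths) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("path", searchPath);
        params.put("property", property);
        params.put("property.operation", "exists");
        params.put("property.value", "true");

        // OR the excluded paths together, then negate the whole group.
        params.put("group.p.or", "true");
        params.put("group.p.not", "true");
        int n = 1;
        for (String excluded : excludedPaths) {
            params.put("group." + n + "_path", excluded);
            n++;
        }

        params.put("p.limit", "-1");
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = build(
                "/content/myFolder",
                "myProperty",
                Arrays.asList("/content/myFolder/wrongFolder"));
        params.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

A LinkedHashMap is used only so the printed parameters keep the same order as the hand-written query above; the QueryBuilder itself does not depend on that order.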
Using QueryBuilder you can execute:
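// OR the excluded paths together, then negate the whole group
map.put("group.p.or", "true");
map.put("group.p.not", "true");
map.put("group.1_path", "/first/path/where/you/donot/want/to/search");
map.put("group.2_path", "/second/path/where/you/donot/want/to/search");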
Also I've checked PredicateGroup's class API and they provide a setNegated method. I've never used it myself, but I think you can negate a group and combine it into a common predicate with the path you are searching on like:
final PredicateGroup doNotSearchGroup = new PredicateGroup();
doNotSearchGroup.setNegated(true);
doNotSearchGroup.add(new Predicate("path").set("path", "/path/where/you/donot/want/to/search"));
final PredicateGroup combinedPredicate = new PredicateGroup();
combinedPredicate.add(new Predicate("path").set("path", "/path/where/you/want/to/search"));
combinedPredicate.add(doNotSearchGroup);
final Query query = queryBuilder.createQuery(combinedPredicate, session); // session: the current JCR Session
Here is a query that applies an operator to specific group IDs:
path=/content/course/
type=cq:Page
p.limit=-1
1_property=jcr:content/event
group.1_group.1_group.daterange.lowerBound=2019-12-26T13:39:19.358Z
group.1_group.1_group.daterange.property=jcr:content/xyz
group.1_group.2_group.daterange.upperBound=2019-12-26T13:39:19.358Z
group.1_group.2_group.daterange.property=jcr:content/abc
group.1_group.3_group.relativedaterange.property=jcr:content/courseStartDate
group.1_group.3_group.relativedaterange.lowerBound=0
group.1_group.2_group.p.not=true
group.1_group.1_group.p.not=true

Search phrase in a sentence using Lucene 5.5

Purpose: To build a dictionary (sample dictionary taken from Project Gutenberg). The application should be able to return the "word" if part of its meaning is provided. Example:
CONSOLE
Con*sole", v. t. [imp. & p.p. Consoled; p.pr. & vb.n. Consoling.]
Etym: [L. consolari,. p.p. consolatus; con- + solari to console, comfort: cf. F. consoler. See Solace.]
Defn: To cheer in distress or depression; to alleviate the grief and raise the spirits of; to relieve; to comfort; to soothe. And empty heads console with empty sound. Pope. I am much consoled by the reflection that the religion of Christ has been attacked in vain by all the wits and philosophers, and its triumph has been complete. P. Henry.
Syn. -- To comfort; solace; soothe; cheer; sustain; encourage; support. See Comfort.
So if my query is "To cheer in distress", it should return me "Console" as the output.
I am trying to build this tool using Lucene 5.5 (lower versions won't do for now). This is what I tried:
Indexing:
Document doc = new Document();
doc.add(new Field(MEANING, meaningOfWord, Store.YES, Field.Index.ANALYZED));
doc.add(new Field(WORD, word, Store.YES, Field.Index.ANALYZED));
indexWriter.addDocument(doc);
Analyzing:
Analyzer analyzer = new WhitespaceAnalyzer();
QueryParser parser = new QueryParser(MEANING, analyzer);
parser.setAllowLeadingWildcard(true);
parser.setAutoGeneratePhraseQueries(true);
Query query = parser.parse(".*" + searchString + ".*");
TopDocs tophits = isearcher.search(query, null, 1000);
This (tophits) is not returning what I want. (I have been trying Lucene for the last week or so, so please excuse me if this is very naive.) Any clues?
Sounds like a different analyzer was used when the documents were indexed. Probably KeywordAnalyzer or something. You (usually) need to pass the same analyzer to IndexWriter when indexing your documents as the one you will use when searching. Also, bear in mind, after correcting the IndexWriter's analyzer, you will need to reindex your documents in order for them to be indexed correctly.
Wrapping what should be a simple phrase query in wildcards is an extremely poor substitute for analyzing correctly.
Found the solution: use WildcardQuery, like this:
WildcardQuery wildCardQ = new WildcardQuery(new Term(MEANING, searchString));
But for incorrect words/phrases, it sometimes takes a long time to come back with the answer.

Sentence aware search with Lucene SpanQueries

Is it possible to use a Lucene SpanQuery to find all occurrences where the terms "red" "green" and "blue" all appear within a single sentence?
My first (incomplete/incorrect) approach is to write an analyzer that places a special sentence marker token at the beginning of a sentence, in the same position as the first word of the sentence, and to then query for something similar to the following:
SpanQuery termsInSentence = new SpanNearQuery(
SpanQuery[] {
new SpanTermQuery( new Term (MY_SPECIAL_SENTENCE_TOKEN)),
new SpanTermQuery( new Term ("red")),
new SpanTermQuery( new Term ("green")),
new SpanTermQuery( new Term ("blue")),
},
999999999999,
false
);
SpanQuery nextSentence = new SpanTermQuery( new Term (MY_SPECIAL_SENTENCE_TOKEN));
SpanNotQuery notInNextSentence = new SpanNotQuery(termsInSentence,nextSentence);
The problem, of course, is that nextSentence isn't really the next sentence, it's any sentence marker, including the one in the sentence that termsInSentence matches. Therefore this won't work.
My next approach is to create the analyzer that places the token before the sentence (that is before the first word rather than in the same position as the first word). The problem with this is that I then have to account for the extra offset caused by MY_SPECIAL_SENTENCE_TOKEN. What's more, this will particularly be bad at first when I'm using a naive pattern to split sentences (e.g. split on /\.\s+[A-Z0-9]/) because I'll have to account for all of the (false) sentence markers when I search for U. S. S. Enterprise.
So... how should I approach this?
I would index each sentence as a Lucene document, including a field that marks what source document the sentence came from. Depending on your source material, the overhead of one Lucene document per sentence may be acceptable.
Actually, looks like you are quite close to the solution. I think indexing an end-of-sentence flag is a good approach. The problem is that your end-of-sentence flag is in your SpanNearQuery, which is what is throwing you off. You are asking it to find a span which both contains and does not contain MY_SPECIAL_SENTENCE_TOKEN. The query contradicts itself, so, of course, it won't find any matches. What you really need to know, is that the three terms ("red", "green", and "blue") occur in a span that does not overlap with MY_SPECIAL_SENTENCE_TOKEN (that is, the sentence token doesn't appear in between those terms).
Also, the lack of field names in the Term ctors would be a problem, but Lucene should throw an exception complaining about that, so I'm guessing that's not the real problem here. It could be that the Lucene version at the time this was written did not complain about mismatched fields in SpanNears, so it is perhaps worth mentioning.
This appears to work for me:
SpanQuery termsInSentence = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery( new Term ("text", "red")),
new SpanTermQuery( new Term ("text", "green")),
new SpanTermQuery( new Term ("text", "blue")),
},
9999,
false
);
SpanQuery nextSentence = new SpanTermQuery( new Term ("text", MY_SPECIAL_SENTENCE_TOKEN));
SpanQuery notInNextSentence = new SpanNotQuery(termsInSentence,nextSentence);
As for where to split sentences, instead of using the naive regex approach, I would try using java.text.BreakIterator. It's not perfect, but it does a pretty good job.
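A minimal sketch of sentence splitting with BreakIterator, using only the JDK. The class name SentenceSplitter and the choice of Locale.US are assumptions for illustration; an analyzer could use the boundaries this finds to decide where to emit the sentence marker token.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    /** Splits text into trimmed sentences using the JDK's sentence BreakIterator. */
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);

        int start = it.first();
        // Walk boundary to boundary until the iterator reports DONE.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("The sky is blue. Grass is green.")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

How it handles abbreviations such as "U. S. S. Enterprise" is worth checking against your own text before relying on it.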

Using Groovy find String from Exclusion List

In groovy, I want to search text (which is typically an xml structure) and find an occurrence of the ignore list.
For example:
My different search data requests are (reduced for clarity, but most are large):
<CustomerRQ field='a'></CustomerRQ>
<AddressRQ field='a'></AddressRQ>
My ignore list is:
CustomerRQ
CustomerRS
Based on the above two incoming requests of "customer" and "address", I want to ignore "Customer" since it's in my ignore list, but I want to identify "address" as a hit.
The overall intent is to use this for logging. I want to not log some incoming requests based on my "ignore" list, but all others will be logged.
Here's some pseudo code that may be on the right track but not really.
def list = ["CustomerRQ", "CustomerRS"]
println(list.contains("<CustomerRQ field='a'>"))
I'm not sure, but I think a closure will work in this case; I'm still learning the Groovy ropes here. Maybe a regexp will work as well. The important part is to search the incoming string (indexOf, exists...) against every entry in my exclusion list.
A quick solution:
shouldIgnore = list.inject(false) { bool, val -> bool || line.contains(val) }
Whether or not this is the best idea depends on information we don't have; it may be better to do something more-XMLy rather than checking against a string.
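The same substring check can be written so that it stops at the first match. In Groovy that would be list.any { line.contains(it) }; below is a minimal plain-Java sketch of the same idea, which also runs on the JVM alongside Groovy code. The class and method names (IgnoreList, shouldIgnore) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class IgnoreList {

    /** Returns true if the line contains any entry of the ignore list. */
    public static boolean shouldIgnore(String line, List<String> ignoreList) {
        // anyMatch short-circuits: it stops at the first entry found in the line.
        return ignoreList.stream().anyMatch(line::contains);
    }

    public static void main(String[] args) {
        List<String> ignore = Arrays.asList("CustomerRQ", "CustomerRS");
        String[] requests = {
            "<CustomerRQ field='a'></CustomerRQ>",
            "<AddressRQ field='a'></AddressRQ>"
        };
        for (String request : requests) {
            // Log only the requests that are not on the ignore list.
            if (!shouldIgnore(request, ignore)) {
                System.out.println("LOG: " + request);
            }
        }
    }
}
```

Like the inject version, this is a plain string match, so an ignore entry that happens to appear inside attribute values or text content would also suppress logging.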

how to do OR search in nutch?

Say, search for results whose Field is 'A' or 'B'?
It seems the default is AND.
Never worked with Nutch actively, but since it's based on Lucene, shouldn't Lucene's rules apply? That is to say, the Query Parser Syntax should be applicable. See if this helps.
I recently started working with Nutch. You need to modify Query.java in Nutch to get an OR query executed.
Add this code to Query.java:
public void addShouldTerm(String term, String field) {
clauses.add(new Clause(new Term(term), field, false, false, this.conf));
}
public void addShouldTerm(String term) {
addShouldTerm(term, Clause.DEFAULT_FIELD);
}
and form your query like this:
Query query = new Query(conf);
query.addShouldTerm("A");
query.addShouldTerm("B");
You will get the results for A OR B.
Please correct me if there is another or better way of doing this.
Never used Nutch for querying (just for indexing), but the schema.xml should contain a defaultOperator which can be set to AND or OR.
