Search phrase in a sentence using Lucene 5.5

Purpose: To build a dictionary (sample dictionary taken from Project Gutenberg). The application should be able to return the word when part of its meaning is provided. Example:
CONSOLE
Con*sole", v. t. [imp. & p.p. Consoled; p.pr. & vb.n. Consoling.]
Etym: [L. consolari,. p.p. consolatus; con- + solari to console, comfort: cf. F. consoler. See Solace.]
Defn: To cheer in distress or depression; to alleviate the grief and raise the spirits of; to relieve; to comfort; to soothe. And empty heads console with empty sound. Pope. I am much consoled by the reflection that the religion of Christ has been attacked in vain by all the wits and philosophers, and its triumph has been complete. P. Henry.
Syn. -- To comfort; solace; soothe; cheer; sustain; encourage; support. See Comfort.
So if my query is "To cheer in distress", it should return "Console" as the output.
I am trying to build this tool using Lucene 5.5 (lower versions won't do for now). This is what I tried:
Indexing:
Document doc = new Document();
doc.add(new Field(MEANING, meaningOfWord, Store.YES, Field.Index.ANALYZED));
doc.add(new Field(WORD, word, Store.YES, Field.Index.ANALYZED));
indexWriter.addDocument(doc);
Analyzing:
Analyzer analyzer = new WhitespaceAnalyzer();
QueryParser parser = new QueryParser(MEANING, analyzer);
parser.setAllowLeadingWildcard(true);
parser.setAutoGeneratePhraseQueries(true);
Query query = parser.parse(".*" + searchString + ".*");
TopDocs tophits = isearcher.search(query, null, 1000);
This (tophits) is not returning what I want. (I have only been trying Lucene for the last week or so, so please excuse me if this is very naive.) Any clues?

Sounds like a different analyzer was used when the documents were indexed. Probably KeywordAnalyzer or something. You (usually) need to pass the same analyzer to IndexWriter when indexing your documents as the one you will use when searching. Also, bear in mind, after correcting the IndexWriter's analyzer, you will need to reindex your documents in order for them to be indexed correctly.
Wrapping what should be a simple phrase query in wildcards is an extremely poor substitute for analyzing correctly.
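To see why the choice of analyzer matters for a phrase like "to cheer in distress", here is a rough sketch in plain Java. It does not use Lucene at all: whitespaceTokens only imitates an analyzer that splits on whitespace and keeps case and punctuation, and normalizedTokens imitates one that lowercases and drops punctuation. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class TokenizationSketch {

    // Splits on whitespace only; case and punctuation are kept,
    // so "depression;" stays a single token and "To" differs from "to".
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Lowercases and splits on anything that is not a letter or digit.
    // A crude stand-in for an analyzer that normalizes its tokens.
    static List<String> normalizedTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if the phrase tokens occur consecutively, in order, in the document tokens.
    static boolean containsPhrase(List<String> doc, List<String> phrase) {
        return Collections.indexOfSubList(doc, phrase) >= 0;
    }

    public static void main(String[] args) {
        String meaning = "To cheer in distress or depression; to alleviate the grief";
        String query = "to cheer in distress";

        // Whitespace-only tokens: the phrase is not found because "To" != "to".
        System.out.println(containsPhrase(whitespaceTokens(meaning), whitespaceTokens(query)));

        // Normalized tokens on both sides: the phrase is found.
        System.out.println(containsPhrase(normalizedTokens(meaning), normalizedTokens(query)));
    }
}
```

The same idea carries over to a real index: whatever normalization is applied when the documents are written has to be applied to the query text as well, otherwise the terms will not line up.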

Found the solution: use WildcardQuery, like this:
WildcardQuery wildCardQ = new WildcardQuery(new Term(MEANING, searchString));
But for incorrect words/phrases, it sometimes takes a long time to come back with the answer.

Related

AEM Query builder exclude a folder in search

I need to create a query where the params are like:
queryParams.put("path", "/content/myFolder");
queryParams.put("1_property", "myProperty");
queryParams.put("1_property.operation", "exists");
queryParams.put("p.limit", "-1");
But I need to exclude a certain path inside this blanket folder, say "/content/myFolder/wrongFolder", and search in all the other folders (whose number keeps varying).
Is there a way to do so? I couldn't find exactly this online.
I also tried the unequals operation, as the parent path is saved in a JCR property, but still no luck. What I actually need is an "unlike" operation to avoid all occurrences of the path, but there is no such thing:
path=/main/path/to/search/in
group.1_property=cq:parentPath
group.1_property.operation=unequals
group.1_property.value=/path/to/be/avoided
group.2_property=myProperty
group.2_property.operation=exists
group.p.or=true
p.limit=-1

This is an old question but the reason you got more results later lies in the way in which you have constructed your query. The correct way to write a query like this would be something like:
path=/main/path/where
property=myProperty
property.operation=exists
property.value=true
group.p.or=true
group.p.not=true
group.1_path=/main/path/where/first/you/donot/want/to/search
group.2_path=/main/path/where/second/you/donot/want/to/search
p.limit=-1
A couple of notes: your group.p.or in your last comment would have applied to all of your groups because they weren't delineated by a group number. If you want an OR to be applied to a specific group (but not all groups), you would use:
path=/main/path/where
group.1_property=myProperty
group.1_property.operation=exists
group.1_property.value=true
2_group.p.or=true
2_group.p.not=true
2_group.3_path=/main/path/where/first/you/donot/want/to/search
2_group.4_path=/main/path/where/second/you/donot/want/to/search
Also, the numbers themselves don't matter. They don't have to be sequential, as long as property predicate numbers aren't reused; reusing one causes an exception when the QB tries to parse the query. For readability and by general convention, they're usually presented sequentially.
I presume that your example was just thrown together for this question, but obviously your "do not search" paths have to be children of the main path you want to search. Otherwise including them in the query is superfluous, because the query would not be searching them anyway.
AEM Query Builder Documentation for 6.3
Hope this helps someone in the future.
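Since the question mentions that the number of folders keeps varying, the excluded paths can be numbered in a loop instead of being written out by hand. The following is only a sketch of building the parameter map in plain Java; the class name ExcludePathsPredicates and the method build are hypothetical, and the resulting map would be handed to the QueryBuilder the same way as the queryParams map in the question.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExcludePathsPredicates {

    // Builds the predicate map: search under searchRoot for nodes where the
    // property exists, excluding every path listed in excludedPaths.
    static Map<String, String> build(String searchRoot, String property, List<String> excludedPaths) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("path", searchRoot);
        params.put("property", property);
        params.put("property.operation", "exists");

        // One negated group; its members are OR-ed together, so a node is
        // dropped if it lies under ANY of the excluded paths.
        params.put("group.p.not", "true");
        params.put("group.p.or", "true");
        int index = 1;
        for (String excluded : excludedPaths) {
            params.put("group." + index + "_path", excluded);
            index++;
        }

        params.put("p.limit", "-1");
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = build(
                "/content/myFolder",
                "myProperty",
                Arrays.asList("/content/myFolder/wrongFolder", "/content/myFolder/otherFolder"));
        // Print in key=value form, matching the query notation used above.
        for (Map.Entry<String, String> entry : params.entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
    }
}
```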

Using QueryBuilder you can execute:
map.put("group.p.not",true)
map.put("group.1_path","/first/path/where/you/donot/want/to/search")
map.put("group.2_path","/second/path/where/you/donot/want/to/search")
Also I've checked PredicateGroup's class API and they provide a setNegated method. I've never used it myself, but I think you can negate a group and combine it into a common predicate with the path you are searching on like:
final PredicateGroup doNotSearchGroup = new PredicateGroup();
doNotSearchGroup.setNegated(true);
doNotSearchGroup.add(new Predicate("path").set("path", "/path/where/you/donot/want/to/search"));
final PredicateGroup combinedPredicate = new PredicateGroup();
combinedPredicate.add(new Predicate("path").set("path", "/path/where/you/want/to/search"));
combinedPredicate.add(doNotSearchGroup);
final Query query = queryBuilder.createQuery(combinedPredicate);

Here is a query that applies an operator to specific groups, identified by their group IDs:
path=/content/course/
type=cq:Page
p.limit=-1
1_property=jcr:content/event
group.1_group.1_group.daterange.lowerBound=2019-12-26T13:39:19.358Z
group.1_group.1_group.daterange.property=jcr:content/xyz
group.1_group.2_group.daterange.upperBound=2019-12-26T13:39:19.358Z
group.1_group.2_group.daterange.property=jcr:content/abc
group.1_group.3_group.relativedaterange.property=jcr:content/courseStartDate
group.1_group.3_group.relativedaterange.lowerBound=0
group.1_group.2_group.p.not=true
group.1_group.1_group.p.not=true

DocumentDB Replace not Working

I recently realized that DocumentDB supports stand alone update operations via ReplaceDocumentAsync.
I've replaced the Upsert operation below with the Replace operation.
var result = _client
.UpsertDocumentAsync(_collectionUri, docObject)
.Result;
So this is now:
var result = _client
.ReplaceDocumentAsync(_collectionUri, docObject)
.Result;
However, now I get the exception:
Microsoft.Azure.Documents.BadRequestException : ResourceType Document is unexpected.
ActivityId: b1b2fd71-3029-4d0d-bd5d-87d8d0a2fc95
No idea why, upsert and replace are of the same vein and the object is the same that worked for upsert, so I would expect it to work without problems.
All help appreciated.
Thanks
Update: Have tried to implement this using the SelfLink approach, and it works for Replace, but selflink does not work with Upsert. The behavior is quite confusing. I don't like that I have to build a self link in code using string concatenation.

I'm afraid that building the selflink with string concatenation is your only option here because ReplaceDocument(...) requires a link to the document. You show a link to the collection in your example. It won't suck the id out and find the document as you might wish.
The NPM module, documentdb-utils, has library functions for building these links but it's just using string concatenation. I have seen an equivalent library for .NET but I can't remember where. Maybe it was in an Azure example or even in the SDK now.
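For illustration, the string concatenation itself is small. This is a sketch in plain Java (not tied to any SDK) that assumes the ID-based link layout dbs/{databaseId}/colls/{collectionId}/docs/{documentId}; the class and method names are hypothetical, and IDs containing characters that need escaping are not handled.

```java
public class DocumentLinks {

    // Link to a collection: this is what an upsert is addressed to.
    static String collectionLink(String databaseId, String collectionId) {
        return "dbs/" + databaseId + "/colls/" + collectionId;
    }

    // Link to a single document: this is what a replace is addressed to.
    static String documentLink(String databaseId, String collectionId, String documentId) {
        return collectionLink(databaseId, collectionId) + "/docs/" + documentId;
    }

    public static void main(String[] args) {
        System.out.println(collectionLink("mydb", "mycoll"));         // dbs/mydb/colls/mycoll
        System.out.println(documentLink("mydb", "mycoll", "doc-42")); // dbs/mydb/colls/mycoll/docs/doc-42
    }
}
```

The difference between the two links mirrors the difference between the two operations: an upsert targets the collection because the document may not exist yet, while a replace has to name the existing document.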

You can build a document link for a replace using the UriFactory helper class:
var result = _client
.ReplaceDocumentAsync(UriFactory.CreateDocumentUri(databaseId, collectionId, docObject.Id), docObject)
.Result;
Unfortunately it's not very intuitive, as Larry has already pointed out, but a replace expects a document to already be there, while an upsert is what it says on the tin. Two different use-cases, I would say.

In order to update a document, you need to provide the Collection Uri. If you provide the Document Uri it returns the following:
ResourceType Document is unexpected.
Maybe the _collectionUri is a Document Uri, the assignment should look like this:
_collectionUri = UriFactory.CreateDocumentCollectionUri(DatabaseName, CollectionName);

using SearchTerm in JavaMail

I have been using JavaMail for quite some time to develop a simple mail application. I have also developed a simple search facility using the SearchTerm concept in JavaMail. I wanted to search emails by sender, receiver, date, content or subject. So, I have the following sample SearchTerm combinations for the above parameters:
SearchTerm searchSenderOrSubjectTerm = new OrTerm(termSender, termSub);
SearchTerm searchSenderOrDate = new OrTerm(termSender, termRecvDate);
SearchTerm searchSubjectOrSenderOrDate = new OrTerm(searchSenderOrSubjectTerm, searchSenderOrDate);
SearchTerm searchSubjectOrContentOrSenderOrDate = new OrTerm(searchSubjectOrSenderOrDate, termContent);
SearchTerm searchSubjectOrContentOrSenderOrRecvrOrDate = new OrTerm(searchSubjectOrContentOrSenderOrDate, termRecvr);
//return the search results
searchResults = folder.search(searchSubjectOrContentOrSenderOrRecvrOrDate);
This works fine and returns the required results. But the problem with this approach is that it takes too much time to search and return the results. I was wondering whether the problem lies in the internal SearchTerm implementation or in the above approach. Can you share your experience with this, especially regarding performance? I am not sure exactly where the problem is.
regards,

If you're using IMAP, the searching is all done on the server, so the performance depends on the server. If you're using POP3, the searching is done by downloading all the messages to the client and searching there. Use IMAP. :-)
You can simplify your search by using a single OrTerm with an array of all the other terms. I don't know if that will make any performance difference, however.

Unless you use Google's IMAP extensions, you are applying the search criteria locally.
To search on the server with JavaMail, you'll want to do something like this:
GmailStore store = (GmailStore) session.getStore("gimap");
store.connect("imap.gmail.com", "[your-account]@gmail.com", "[your-pw]");
GmailFolder inbox = (GmailFolder) store.getFolder("[Gmail]/All Mail");
inbox.open(Folder.READ_ONLY);
Message[] foundMessages = inbox.search(new GmailRawSearchTerm("to:somebody@email.com"));
A more complete example here: http://scandilabs.com/technology/knowledge/How_to_search_gmail_accounts_via_JavaMail_and_IMAP

Lucene analyzer for first name

Is there a Lucene analyzer out there that tokenizes name parts with their short name equivalents (e.g. Mike and Michael, Rich and Richard, Suzie and Susan), etc?
Fuzzy match on Levenshtein distance is a solution I know, and some implementors seem to pair fuzzy match with the soundex algorithm. Surely somebody has made a swipe at just plain listing all of these short names somewhere?
EDIT: The toughest part of this question is where to get the synonym data from?

I am not aware of any specific nickname filter out there.
A SynonymFilter would make it reasonably easy to generate though, if you had a data source for it. This appears to be a good source of nickname data:
https://code.google.com/p/nickname-and-diminutive-names-lookup/
You would need to generate the SynonymMap to pass into the SynonymFilter ctor, which should look something like this (I think):
SynonymMap.Builder builder = new SynonymMap.Builder(true);
builder.add(new CharsRef("Mike"), new CharsRef("Michael"), false);
builder.add(new CharsRef("Rich"), new CharsRef("Richard"), false);
builder.add(new CharsRef("Suzie"), new CharsRef("Susan"), false);
SynonymMap map = builder.build();
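Getting from a data file to those builder.add calls is mostly a parsing job. Here is a sketch in plain Java that turns CSV lines into (short name, full name) pairs. The file layout is an assumption made for this example (full name first, followed by its short forms, comma-separated); check the actual data source before relying on it. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NicknamePairs {

    // Maps each short name to the full names it can stand for.
    // Assumed line layout: fullName,short1,short2,...
    static Map<String, List<String>> shortToFull(List<String> csvLines) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String line : csvLines) {
            String[] parts = line.trim().split("\\s*,\\s*");
            if (parts.length < 2 || parts[0].isEmpty()) {
                continue; // blank line or a name without short forms
            }
            String fullName = parts[0];
            for (int i = 1; i < parts.length; i++) {
                if (!parts[i].isEmpty()) {
                    // A short name may belong to several full names, so keep a list.
                    result.computeIfAbsent(parts[i], k -> new ArrayList<>()).add(fullName);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("michael,mike,mick", "richard,rich,rick", "susan,suzie,sue");
        // Each (short, full) pair printed here would become one builder.add(...) call.
        for (Map.Entry<String, List<String>> entry : shortToFull(lines).entrySet()) {
            for (String fullName : entry.getValue()) {
                System.out.println(entry.getKey() + " -> " + fullName);
            }
        }
    }
}
```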

Sentence aware search with Lucene SpanQueries

Is it possible to use a Lucene SpanQuery to find all occurrences where the terms "red" "green" and "blue" all appear within a single sentence?
My first (incomplete/incorrect) approach is to write an analyzer that places a special sentence marker token at the beginning of a sentence, in the same position as the first word of the sentence, and to then query for something similar to the following:
SpanQuery termsInSentence = new SpanNearQuery(
SpanQuery[] {
new SpanTermQuery( new Term (MY_SPECIAL_SENTENCE_TOKEN)),
new SpanTermQuery( new Term ("red")),
new SpanTermQuery( new Term ("green")),
new SpanTermQuery( new Term ("blue")),
},
999999999999,
false
);
SpanQuery nextSentence = new SpanTermQuery( new Term (MY_SPECIAL_SENTENCE_TOKEN));
SpanNotQuery notInNextSentence = new SpanNotQuery(termsInSentence,nextSentence);
The problem, of course, is that nextSentence isn't really the next sentence, it's any sentence marker, including the one in the sentence that termsInSentence matches. Therefore this won't work.
My next approach is to create the analyzer that places the token before the sentence (that is before the first word rather than in the same position as the first word). The problem with this is that I then have to account for the extra offset caused by MY_SPECIAL_SENTENCE_TOKEN. What's more, this will particularly be bad at first when I'm using a naive pattern to split sentences (e.g. split on /\.\s+[A-Z0-9]/) because I'll have to account for all of the (false) sentence markers when I search for U. S. S. Enterprise.
So... how should I approach this?

I would index each sentence as a Lucene document, including a field that marks what source document the sentence came from. Depending on your source material, the overhead of one Lucene document per sentence may be acceptable.

Actually, looks like you are quite close to the solution. I think indexing an end-of-sentence flag is a good approach. The problem is that your end-of-sentence flag is in your SpanNearQuery, which is what is throwing you off. You are asking it to find a span which both contains and does not contain MY_SPECIAL_SENTENCE_TOKEN. The query contradicts itself, so, of course, it won't find any matches. What you really need to know, is that the three terms ("red", "green", and "blue") occur in a span that does not overlap with MY_SPECIAL_SENTENCE_TOKEN (that is, the sentence token doesn't appear in between those terms).
Also, the lack of field names in the Term ctors would be a problem, but Lucene should throw an exception complaining about that, so I'm guessing that's not the real problem here. It could be that the Lucene version at the time this was written did not complain about mismatched fields in SpanNears, so it is perhaps worth mentioning.
This appears to work to me:
SpanQuery termsInSentence = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery( new Term ("text", "red")),
new SpanTermQuery( new Term ("text", "green")),
new SpanTermQuery( new Term ("text", "blue")),
},
9999,
false
);
SpanQuery nextSentence = new SpanTermQuery( new Term ("text", MY_SPECIAL_SENTENCE_TOKEN));
SpanQuery notInNextSentence = new SpanNotQuery(termsInSentence,nextSentence);
As for where to split sentences, instead of using the naive regex approach, I would try using java.text.BreakIterator. It's not perfect, but it does a pretty good job.
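A minimal sketch of sentence splitting with java.text.BreakIterator follows. The class name SentenceSplitter and the choice of Locale.US are assumptions for the example; how well it copes with abbreviations such as "U. S. S. Enterprise" is something to test against your own text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Returns the sentences of the text, trimmed of surrounding whitespace.
    static List<String> split(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        // Walk from boundary to boundary; each [start, end) range is one sentence.
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The sky is blue. The grass is green. The rose is red.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

The boundaries found this way could drive either approach discussed here: emitting the sentence marker token in the analyzer, or indexing one Lucene document per sentence.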
