Sentence aware search with Lucene SpanQueries

Is it possible to use a Lucene SpanQuery to find all occurrences where the terms "red" "green" and "blue" all appear within a single sentence?
My first (incomplete/incorrect) approach is to write an analyzer that places a special sentence-marker token at the beginning of each sentence, in the same position as the sentence's first word, and to then query for something similar to the following:
SpanQuery termsInSentence = new SpanNearQuery(
    new SpanQuery[] {
        new SpanTermQuery(new Term(MY_SPECIAL_SENTENCE_TOKEN)),
        new SpanTermQuery(new Term("red")),
        new SpanTermQuery(new Term("green")),
        new SpanTermQuery(new Term("blue")),
    },
    Integer.MAX_VALUE,
    false
);
SpanQuery nextSentence = new SpanTermQuery(new Term(MY_SPECIAL_SENTENCE_TOKEN));
SpanNotQuery notInNextSentence = new SpanNotQuery(termsInSentence, nextSentence);
The problem, of course, is that nextSentence isn't really the next sentence: it matches any sentence marker, including the one in the sentence that termsInSentence matches. Therefore this won't work.
My next approach is to create an analyzer that places the token before the sentence (that is, before the first word rather than in the same position as the first word). The problem with this is that I then have to account for the extra offset caused by MY_SPECIAL_SENTENCE_TOKEN. What's more, this will be particularly bad at first, while I'm using a naive pattern to split sentences (e.g. splitting on /\.\s+[A-Z0-9]/), because I'll have to account for all of the (false) sentence markers when I search for U. S. S. Enterprise.
So... how should I approach this?

I would index each sentence as a Lucene document, including a field that marks which source document the sentence came from. Depending on your source material, the overhead of one Lucene document per sentence may be acceptable.
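A rough sketch of what I mean; the field names, the Lucene 4+ field classes, and the writer/sentences/sourceDocId variables are my own assumptions standing in for your code:

// Index one sentence per Lucene document, tagging each with its source document.
for (String sentence : sentences) {  // however you choose to split them
    Document doc = new Document();
    doc.add(new TextField("text", sentence, Field.Store.YES));
    doc.add(new StringField("sourceDocId", sourceDocId, Field.Store.YES));
    writer.addDocument(doc);
}

Each hit is then a sentence by construction, and reading sourceDocId off the hit maps it back to the original document.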

Actually, it looks like you are quite close to the solution. I think indexing an end-of-sentence flag is a good approach. The problem is that your end-of-sentence flag is inside your SpanNearQuery, which is what is throwing you off. You are asking it to find a span which both contains and does not contain MY_SPECIAL_SENTENCE_TOKEN. The query contradicts itself, so, of course, it won't find any matches. What you really need to know is that the three terms ("red", "green", and "blue") occur in a span that does not overlap with MY_SPECIAL_SENTENCE_TOKEN (that is, the sentence token doesn't appear in between those terms).
Also, the lack of field names in the Term ctors would be a problem, but Lucene should throw an exception complaining about that, so I'm guessing that's not the real issue here. It could be that the Lucene version at the time this was written did not complain about mismatched fields in SpanNearQueries, so it seemed worth mentioning.
This appears to work for me:
SpanQuery termsInSentence = new SpanNearQuery(
    new SpanQuery[] {
        new SpanTermQuery(new Term("text", "red")),
        new SpanTermQuery(new Term("text", "green")),
        new SpanTermQuery(new Term("text", "blue")),
    },
    9999,
    false
);
SpanQuery nextSentence = new SpanTermQuery(new Term("text", MY_SPECIAL_SENTENCE_TOKEN));
SpanQuery notInNextSentence = new SpanNotQuery(termsInSentence, nextSentence);
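For completeness, this is how I ran it; the directory variable and the stored "text" field are assumptions from my test setup:

IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
TopDocs hits = searcher.search(notInNextSentence, 10);
for (ScoreDoc sd : hits.scoreDocs) {
    // print the text of each matching document
    System.out.println(searcher.doc(sd.doc).get("text"));
}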
As far as where to split sentences, instead of using the naive regex approach, I would try using java.text.BreakIterator. It's not perfect, but it does a pretty good job.
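For example (plain java.text, no external dependencies), a minimal sentence-splitting loop looks like this:

String text = "It was the best of times. It was the worst of times.";
BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
bi.setText(text);
int start = bi.first();
for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
    // each iteration yields one sentence, punctuation included
    System.out.println(text.substring(start, end).trim());
}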

Search phrase in a sentence using Lucene 5.5

Purpose: To build a dictionary (sample dictionary taken from the Gutenberg project). This application should have the capability to return the "word" if part of the meaning is provided. Example:
CONSOLE
Con*sole", v. t. [imp. & p.p. Consoled; p.pr. & vb.n. Consoling.]
Etym: [L. consolari,. p.p. consolatus; con- + solari to console, comfort: cf. F. consoler. See Solace.]
Defn: To cheer in distress or depression; to alleviate the grief and raise the spirits of; to relieve; to comfort; to soothe. And empty heads console with empty sound. Pope. I am much consoled by the reflection that the religion of Christ has been attacked in vain by all the wits and philosophers, and its triumph has been complete. P. Henry.
Syn. -- To comfort; solace; soothe; cheer; sustain; encourage; support. See Comfort.
So if my query is "To cheer in distress", it should return "Console" as the output.
I am trying to build this tool using Lucene 5.5 (lower versions won't do for now). This is what I tried:
Indexing:
Document doc = new Document();
doc.add(new TextField(MEANING, meaningOfWord, Field.Store.YES));
doc.add(new TextField(WORD, word, Field.Store.YES));
indexWriter.addDocument(doc);
Analyzing:
Analyzer analyzer = new WhitespaceAnalyzer();
QueryParser parser = new QueryParser(MEANING, analyzer);
parser.setAllowLeadingWildcard(true);
parser.setAutoGeneratePhraseQueries(true);
Query query = parser.parse(".*" + searchString + ".*");
TopDocs tophits = isearcher.search(query, 1000);
This (tophits) is not returning what I want. (I have only been trying Lucene for the last week or so, so please excuse me if this is very naive.) Any clues?
Sounds like a different analyzer was used when the documents were indexed, probably KeywordAnalyzer or something. You (usually) need to pass the same analyzer to IndexWriter when indexing your documents as the one you will use when searching. Also, bear in mind that after correcting the IndexWriter's analyzer, you will need to reindex your documents for them to be indexed correctly.
Wrapping what should be a simple phrase query in wildcards is an extremely poor substitute for analyzing correctly.
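A minimal sketch of what that looks like on Lucene 5.5; StandardAnalyzer, the index path, and the reuse of your MEANING/WORD constants are my assumptions:

Analyzer analyzer = new StandardAnalyzer();

// Index with the same analyzer that will be used at query time.
IndexWriterConfig config = new IndexWriterConfig(analyzer);
try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config)) {
    Document doc = new Document();
    doc.add(new TextField(MEANING, meaningOfWord, Field.Store.YES));
    doc.add(new TextField(WORD, word, Field.Store.YES));
    writer.addDocument(doc);
}

// Search with the same analyzer; a quoted string is parsed as a PhraseQuery.
QueryParser parser = new QueryParser(MEANING, analyzer);
Query query = parser.parse("\"To cheer in distress\"");
TopDocs tophits = isearcher.search(query, 1000);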
Found the solution; use WildcardQuery, like this:
WildcardQuery wildCardQ = new WildcardQuery(new Term(MEANING, searchString));
But for incorrect words/phrases, it sometimes takes a long time to come back with the answer.

Document has too many paragraphs

I'm writing a Java agent that outputs some text into a rich text field:
StringBuffer sb = new StringBuffer();
for (String line : datalines) {
    if (sb.length() + line.length() < 64000) {
        sb.append(line);
        sb.append(' ');
    } else {
        // flush buffer
        rt.appendText(sb.toString());
        rt.addNewLine();
        // start a fresh buffer with the current line so it isn't dropped
        sb = new StringBuffer();
        sb.append(line);
        sb.append(' ');
    }
}
// write the rest of the buffer
rt.appendText(sb.toString());
rt.addNewLine();
However, if the text is long, I end up unable to open the document in the UI, with the message: "Document has too many paragraphs - it must be split into several documents".
I know that "too many paragraphs" is an old issue. I've seen a lot of suffering and some unhelpful advice on old forums. But how many is "too many"? I just counted that I'm writing 533 paragraphs. Is that too many? The paragraph size is not bad at all, and the total size is some 34 MB, which size-wise is peanuts for rich text. I've tried cutting paragraphs at 30K size - same problem.
Found a funny document on the IBM site about this problem in 8.5.1: http://www-01.ibm.com/support/docview.wss?uid=swg1LO53879 which claims that the APAR is "Closed as fixed if next." And yes, I'm running 9.0.1, in case they meant "fixed in the next release" or some such.
Any thoughts on how many is too many, what limit we are really hitting, and how to estimate when we are approaching "too many"? And what are the strategies, apart from writing less?
Frankly, I can just write N non-summary text fields in this case, or use an attachment, but my passion for rich text does not let me really let this go.
(Notes really has these quirks, undocumented things, wrong error messages, etc., etc.)
Maybe try to insert .addPageBreak() every N lines, or force a paragraph with .addNewLine(1, true), in case the message is wrong and the problem is that you only have one paragraph.
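An untested sketch against the flush branch of your loop, using the lotus.domino RichTextItem calls:

// In the flush branch, force a real paragraph break instead of a plain line break.
rt.appendText(sb.toString());
rt.addNewLine(1, true);   // second argument true forces a new paragraph
// or, alternatively: rt.addPageBreak();
sb = new StringBuffer();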

Tracking a Twitter hashtag (keyword) with the streaming API

I am trying to track all tweets for a given hashtag or keyword. The problem is that I can stream the tweets when I use a simple keyword like 'animal', but when I change it to, say, 'animal4666', it doesn't work. No reply is received. I am using the code below.
twit.stream('statuses/filter', { track: 'animal4666' }, function(stream) {
    stream.on('data', function(data) {
        console.log(util.inspect(data));
    });
});
I have made two tweets from different accounts, like the following:
'#animal4666 a'
'#animal4666 trying to find out what is going on?'
The above tweets are successfully retrieved using the search API, but because of the rate limitations on the search API I need to use the streaming API, so that I can check for new tweets every two seconds with node.js.
The node.js add-on I am using: https://github.com/jdub/node-twitter
Can someone please help?
If you look at the code of the library you are using, it seems that there is nothing potentially wrong there; it only has a workaround for when you pass an array in the track param, which is not the case here:
// Workaround for node-oauth vs. twitter commas-in-params bug
if (params && params.track && Array.isArray(params.track)) {
    params.track = params.track.join(',')
}
So looking into the official API docs for the track method, I see two caveats that may be relevant.
Each phrase must be between 1 and 60 bytes, inclusive.
I think yours are shorter, but it is something to keep in mind.
And what I think is your real problem:
Exact matching of phrases (equivalent to quoted phrases in most search engines) is not supported.
Punctuation and special characters will be considered part of the term they are adjacent to. In this sense, "hello." is a different track term than "hello". However, matches will ignore punctuation present in the Tweet. So "hello" will match both "hello world" and "my brother says hello." Note that punctuation is not considered to be part of a #hashtag or @mention, so a track term containing punctuation will not match either #hashtags or @mentions.
You can check your tweet text online to see if it matches here.

Lucene analyzer for first name

Is there a Lucene analyzer out there that tokenizes name parts with their short name equivalents (e.g. Mike and Michael, Rich and Richard, Suzie and Susan), etc?
Fuzzy matching on Levenshtein distance is one solution I know of, and some implementers seem to pair fuzzy matching with the Soundex algorithm. But surely somebody has taken a swipe at just plain listing all of these short names somewhere?
EDIT: The toughest part of this question is where to get the synonym data from?
I am not aware of any specific nickname filter out there.
A SynonymFilter would make it reasonably easy to generate though, if you had a data source for it. This appears to be a good source of nickname data:
https://code.google.com/p/nickname-and-diminutive-names-lookup/
You would need to generate the SynonymMap to pass into the SynonymFilter ctor, which should look something like this (I think):
// true = deduplicate entries that map to the same synonyms
SynonymMap.Builder builder = new SynonymMap.Builder(true);
// add(input, output, includeOrig); false = don't also keep the original token
builder.add(new CharsRef("Mike"), new CharsRef("Michael"), false);
builder.add(new CharsRef("Rich"), new CharsRef("Richard"), false);
builder.add(new CharsRef("Suzie"), new CharsRef("Susan"), false);
SynonymMap map = builder.build();
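You would then wire the map into your analysis chain with a SynonymFilter, roughly like this against the Lucene 5.x Analyzer API (untested; tokenizer and filter choices are my assumptions):

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // last argument: ignore case when matching synonym input terms
        stream = new SynonymFilter(stream, map, true);
        return new TokenStreamComponents(source, stream);
    }
};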

Using Groovy to find a String from an Exclusion List

In Groovy, I want to search text (which is typically an XML structure) and find an occurrence of any entry from the ignore list.
For example:
My different search data requests are (reduced for clarity, but most are large):
<CustomerRQ field='a'></CustomerRQ>
<AddressRQ field='a'></AddressRQ>
My ignore list is:
CustomerRQ
CustomerRS
Based on the above two incoming requests of "customer" and "address", I want to ignore "Customer" since it's in my ignore list, but I want to identify "address" as a hit.
The overall intent is to use this for logging. I want to not log some incoming requests based on my "ignore" list, but all others will be logged.
Here's some pseudo code that may be on the right track but not really.
def list = ["CustomerRQ", "CustomerRS"]
println(list.contains("<CustomerRQ field='a'>"))
I'm not sure, but I think a closure will work in this case; I'm learning the Groovy ropes here. Maybe a regexp will work as well. But the important part is to search the incoming string (indexOf, exists...) against all of my exclusion list.
A quick solution:
// 'line' is the incoming request text; true if it contains any ignore-list entry
def shouldIgnore = list.inject(false) { bool, val -> bool || line.contains(val) }
Whether or not this is the best idea depends on information we don't have; it may be better to do something more XML-aware rather than checking against a raw string.
