Document has too many paragraphs - lotus-notes

I'm writing a Java agent that outputs some text into a rich text field:
StringBuffer sb = new StringBuffer();
for (String line : datalines) {
    if (sb.length() + line.length() < 64000) {
        sb.append(line);
        sb.append(' ');
    } else {
        // flush buffer
        rt.appendText(sb.toString());
        rt.addNewLine();
        // start the new buffer with the line that didn't fit
        sb = new StringBuffer();
        sb.append(line);
        sb.append(' ');
    }
}
// write the rest of the buffer
rt.appendText(sb.toString());
rt.addNewLine();
However, if the text is long, I end up unable to open the document in the UI; it fails with the message: "Document has too many paragraphs - it must be split into several documents".
I know that "too many paragraphs" is an old issue; I've seen a lot of suffering and some unhelpful advice on old forums. But how many is "too many"? I just counted that I'm writing 533 paragraphs - is that too many? Admittedly the paragraphs are not small, and the total size is some 34 MB, but size-wise that's peanuts for RT. I've also tried cutting paragraphs at 30 KB - same problem.
I found a funny document on the IBM site about this problem in 8.5.1: http://www-01.ibm.com/support/docview.wss?uid=swg1LO53879 which claims that the APAR is "Closed as fixed if next." And yes, I'm running 9.0.1, in case they meant "fixed in the next release" or some such.
Any thoughts on how many is too many, what limit we are really hitting, or at least how to estimate when we are approaching "too many"? And what are the strategies, apart from writing less?
Frankly, I could just write N non-summary text fields in this case, or use an attachment, but my passion for RT won't really let me let this go.

(Notes really has these quirks - undocumented things, wrong error messages, etc., etc.)
Maybe try inserting .addPageBreak() every N lines, or forcing a paragraph with .addNewLine(1,true), in case the message is wrong and the real problem is that you only have one paragraph.
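A minimal sketch of what I mean, assuming rt is the lotus.domino.RichTextItem from your code; the 30000 flush threshold is just an illustration value:
Stri­ngBuffer sb = new StringBuffer();
for (String line : datalines) {
    sb.append(line);
    sb.append(' ');
    if (sb.length() > 30000) {
        rt.appendText(sb.toString());
        rt.addNewLine(1, true); // force a real paragraph break, not just a line break
        sb = new StringBuffer();
    }
}
rt.appendText(sb.toString());
rt.addNewLine(1, true);
// optionally also call rt.addPageBreak() every N flushes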

Related

How does the fetchNext(int) method work in jOOQ?

From the documentation of fetchNext(int number) -
"This will conveniently close the Cursor, after the last Record was fetched."
Assuming number=100 and there are 1000 records in total.
Will it close the cursor once the 100th record is fetched, or when the 1000th is fetched?
In other words, what is the "last record" referred to in the documentation?
Cursor<Record> records = dsl.select...fetchLazy();
while (records.hasNext()) {
    records.fetchNext(100).formatCSV(out);
}
out.close();
This convenience is a historic feature in jOOQ, which will be removed eventually: https://github.com/jOOQ/jOOQ/issues/8884. As with every Closeable resource in Java, you should never rely on this sort of auto closing. It is always better to eagerly close the resource when you know you're done using it. In your case, ideally, wrap the code in a try-with-resources statement.
What the Javadoc means is that the underlying JDBC ResultSet will be closed as soon as jOOQ's call to ResultSet.next() yields false, i.e. the database returns no more records. So, no. If there are 1000 records in total from your select, and you're only fetching 100, then the cursor will not be closed. If it were, this wouldn't be a "convenience feature", but break all sorts of other API, including the one you've called. It's totally possible to call fetchNext(100) twice, or in a loop, as you did.
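For example, a sketch of the try-with-resources version of the loop above (the select is elided exactly as in the question):
try (Cursor<Record> records = dsl.select...fetchLazy()) {
    while (records.hasNext()) {
        records.fetchNext(100).formatCSV(out);
    }
}
out.close();
This way the underlying resource is released deterministically when the block exits, regardless of whether the last record was ever reached.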

Search phrase in a sentence using Lucene 5.5

Purpose: to build a dictionary (the sample dictionary is taken from the Gutenberg project). The application should be able to return the "word" when part of its meaning is provided. Example:
CONSOLE
Con*sole", v. t. [imp. & p.p. Consoled; p.pr. & vb.n. Consoling.]
Etym: [L. consolari,. p.p. consolatus; con- + solari to console, comfort: cf. F. consoler. See Solace.]
Defn: To cheer in distress or depression; to alleviate the grief and raise the spirits of; to relieve; to comfort; to soothe. And empty heads console with empty sound. Pope. I am much consoled by the reflection that the religion of Christ has been attacked in vain by all the wits and philosophers, and its triumph has been complete. P. Henry.
Syn. -- To comfort; solace; soothe; cheer; sustain; encourage; support. See Comfort.
So if my query is "To cheer in distress", it should return "Console" as the output.
I am trying to build this tool using Lucene 5.5 (lower versions won't do for now). This is what I tried:
Indexing:
Document doc = new Document();
doc.add(new Field(MEANING, meaningOfWord, Store.YES, Field.Index.ANALYZED));
doc.add(new Field(WORD, word, Store.YES, Field.Index.ANALYZED));
indexWriter.addDocument(doc);
Analyzing:
Analyzer analyzer = new WhitespaceAnalyzer();
QueryParser parser = new QueryParser(MEANING, analyzer);
parser.setAllowLeadingWildcard(true);
parser.setAutoGeneratePhraseQueries(true);
Query query = parser.parse(".*" + searchString + ".*");
TopDocs tophits = isearcher.search(query, null, 1000);
This (tophits) is not returning what I want. (I have only been using Lucene for a week or so, so please excuse me if this is very naive.) Any clues?
Sounds like a different analyzer was used when the documents were indexed. Probably KeywordAnalyzer or something. You (usually) need to pass the same analyzer to IndexWriter when indexing your documents as the one you will use when searching. Also, bear in mind, after correcting the IndexWriter's analyzer, you will need to reindex your documents in order for them to be indexed correctly.
Wrapping what should be a simple phrase query in wildcards is an extremely poor substitute for analyzing correctly.
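A minimal sketch of what that could look like in Lucene 5.5 - note that StandardAnalyzer and TextField here are my choices (the Field(name, value, Store, Index) constructor from your snippet no longer exists in the 5.x line), and MEANING, WORD, directory and isearcher are assumed from the question:
Analyzer analyzer = new StandardAnalyzer();

// Indexing: the same analyzer goes into the IndexWriterConfig
IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig(analyzer));
Document doc = new Document();
doc.add(new TextField(MEANING, meaningOfWord, Field.Store.YES));
doc.add(new TextField(WORD, word, Field.Store.YES));
indexWriter.addDocument(doc);
indexWriter.close();

// Searching: the same analyzer again, and a quoted phrase instead of wildcards
QueryParser parser = new QueryParser(MEANING, analyzer);
Query query = parser.parse("\"" + searchString + "\"");
TopDocs tophits = isearcher.search(query, 1000);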
Found the solution: use WildcardQuery, like this:
WildcardQuery wildCardQ = new WildcardQuery(new Term(MEANING, searchString));
But for incorrect words/phrases, it sometimes takes a long time to come back with the answer.

Can I read the font from one document, and embed that font in a brand new document, using iTextSharp?

In the below code (based on code previously provided by Chris Haas), I am reading the fonts from an existing document. Using this method, I am able to re-use those font objects elsewhere in the existing document. However, now I want to use this method to read the fonts in document "A", and embed them when I'm creating brand-new document "B". Can this be done?
The BaseFont.CreateFont overload here takes a PRIndirectReference as its argument, which keeps me from being able to specify "BaseFont.EMBEDDED", as can be done in the overloaded versions of the method where the specific path to a font is known.
internal static HybridDictionary findAllFonts(PdfReader reader)
{
    HybridDictionary fd = new HybridDictionary();
    //Get the document's acroform dictionary
    PdfDictionary acroForm = (PdfDictionary)PdfReader.GetPdfObject(reader.Catalog.Get(PdfName.ACROFORM));
    //Bail if there isn't one
    if (acroForm == null)
    {
        return null;
    }
    //Get the resource dictionary
    var DR = acroForm.GetAsDict(PdfName.DR);
    //Get the font dictionary (required per spec)
    var fontDict = DR.GetAsDict(PdfName.FONT);
    foreach (var internalFontName in fontDict.Keys)
    {
        var internalFontDict = (PdfDictionary)PdfReader.GetPdfObject(fontDict.Get(internalFontName));
        var baseFontName = (PdfName)PdfReader.GetPdfObject(internalFontDict.Get(PdfName.BASEFONT));
        //Console.WriteLine(baseFontName.ToString().Substring(1, baseFontName.ToString().Length - 1));
        var iRef = (PRIndirectReference)fontDict.GetAsIndirectObject(internalFontName);
        if (iRef != null)
        {
            fd.Add(baseFontName.ToString().Substring(1, baseFontName.ToString().Length - 1).ToLower(),
                BaseFont.CreateFont(iRef));
        }
    }
    return fd;
}
This won't always be possible because usually fonts aren't embedded entirely. Instead you'll have subsets of the font. A glyph that is present in one subset may not be present in another subset.
Moreover, you'll face encoding problems: suppose that you have a document where Arial is used as a simple font for Greek glyphs. In that case, you'll have a maximum of 256 characters that can't be reused if you want to use Arial in another document to render a Russian text, or a text in Latin-1.
Even if you use Unicode, then you'll still have a problem, because there is not a single font that contains all Unicode characters. There are 1,114,112 code points in Unicode whereas a character identifier in a composite font can only be a number from 0 to 65,535...
You should really abandon the idea of reusing fonts that are present in existing documents to create new documents. On the one hand, it smells like you're trying to do something that is illegal (don't you have a license to use the actual fonts?). On the other hand, your question sounds like: I have carrot soup; please tell me how to extract the original carrots from that soup so that I can reuse them for another purpose. You may have some results if you can still find large chunks of carrot in the soup, but in most cases you'll fail.
For instance: if you have an elementary Type 1 font that is fully embedded, you should be able to copy all the essential elements of the font descriptor, but as soon as you're faced with the modern way of storing font subsets inside a PDF, you'll get stuck discovering that you're trying to do something that is simply impossible.

Tracking a Twitter hashtag (keyword) with the streaming API

I am trying to track all tweets with a given hashtag or keyword. The problem is that I can stream tweets when I use a simple keyword like 'animal', but when I change it to, say, 'animal4666', it doesn't work - no reply is received. I am using the code below.
twit.stream('statuses/filter', { track: 'animal4666' }, function(stream) {
    stream.on('data', function(data) {
        console.log(util.inspect(data));
    });
});
I have made two tweets from different accounts, like the following:
'#animal4666 a'
'#animal4666 trying to find out what is going on?'
The above tweets are successfully retrieved using the search API, but because of the rate limits on the search API I need to use the streaming API, so that I can check for new tweets every two seconds with Node.js.
The Node.js module I am using: https://github.com/jdub/node-twitter
Can someone please help?
If you look at the code of the library you are using, it seems there is nothing potentially wrong; it only has a workaround for when you pass an array in the track param, which is not the case here:
// Workaround for node-oauth vs. twitter commas-in-params bug
if ( params && params.track && Array.isArray(params.track) ) {
    params.track = params.track.join(',')
}
So, looking into the official API docs for the track parameter, I see two caveats that may be relevant:
Each phrase must be between 1 and 60 bytes, inclusive.
I think yours are shorter, but it's something to keep in mind.
And what I think is your real problem:
Exact matching of phrases (equivalent to quoted phrases in most search engines) is not supported.
Punctuation and special characters will be considered part of the term they are adjacent to. In this sense, "hello." is a different track term than "hello". However, matches will ignore punctuation present in the Tweet. So "hello" will match both "hello world" and "my brother says hello." Note that punctuation is not considered to be part of a #hashtag or @mention, so a track term containing punctuation will not match either #hashtags or @mentions.
You can check your tweet text online to see if it matches here.

Sentence aware search with Lucene SpanQueries

Is it possible to use a Lucene SpanQuery to find all occurrences where the terms "red" "green" and "blue" all appear within a single sentence?
My first (incomplete/incorrect) approach is to write an analyzer that places a special sentence marker token at the beginning of a sentence, in the same position as the first word of the sentence, and then to query for something similar to the following:
SpanQuery termsInSentence = new SpanNearQuery(
    new SpanQuery[] {
        new SpanTermQuery( new Term (MY_SPECIAL_SENTENCE_TOKEN)),
        new SpanTermQuery( new Term ("red")),
        new SpanTermQuery( new Term ("green")),
        new SpanTermQuery( new Term ("blue")),
    },
    Integer.MAX_VALUE, // effectively unlimited slop
    false
);
SpanQuery nextSentence = new SpanTermQuery( new Term (MY_SPECIAL_SENTENCE_TOKEN));
SpanNotQuery notInNextSentence = new SpanNotQuery(termsInSentence,nextSentence);
The problem, of course, is that nextSentence isn't really the next sentence, it's any sentence marker, including the one in the sentence that termsInSentence matches. Therefore this won't work.
My next approach is to create the analyzer that places the token before the sentence (that is before the first word rather than in the same position as the first word). The problem with this is that I then have to account for the extra offset caused by MY_SPECIAL_SENTENCE_TOKEN. What's more, this will particularly be bad at first when I'm using a naive pattern to split sentences (e.g. split on /\.\s+[A-Z0-9]/) because I'll have to account for all of the (false) sentence markers when I search for U. S. S. Enterprise.
So... how should I approach this?
I would index each sentence as a Lucene document, including a field that marks which source document the sentence came from. Depending on your source material, the overhead of one Lucene document per sentence may be acceptable.
Actually, looks like you are quite close to the solution. I think indexing an end-of-sentence flag is a good approach. The problem is that your end-of-sentence flag is in your SpanNearQuery, which is what is throwing you off. You are asking it to find a span which both contains and does not contain MY_SPECIAL_SENTENCE_TOKEN. The query contradicts itself, so, of course, it won't find any matches. What you really need to know, is that the three terms ("red", "green", and "blue") occur in a span that does not overlap with MY_SPECIAL_SENTENCE_TOKEN (that is, the sentence token doesn't appear in between those terms).
Also, the lack of field names in the Term ctors would be a problem, but Lucene should throw an exception complaining about that, so I'm guessing that's not the real problem here. It could be that the Lucene version at the time this was written did not complain about mismatched fields in SpanNears, so it's perhaps worth mentioning.
This appears to work for me:
SpanQuery termsInSentence = new SpanNearQuery(
    new SpanQuery[] {
        new SpanTermQuery( new Term ("text", "red")),
        new SpanTermQuery( new Term ("text", "green")),
        new SpanTermQuery( new Term ("text", "blue")),
    },
    9999,
    false
);
SpanQuery nextSentence = new SpanTermQuery( new Term ("text", MY_SPECIAL_SENTENCE_TOKEN));
SpanQuery notInNextSentence = new SpanNotQuery(termsInSentence,nextSentence);
As far as where to split sentences, instead of using the naive regex approach, I would try using java.text.BreakIterator. It's not perfect, but it does a pretty good job.
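For example, a minimal sketch of sentence splitting with BreakIterator, where text and the token emission are placeholders for however you feed your analyzer:
import java.text.BreakIterator;
import java.util.Locale;

BreakIterator boundary = BreakIterator.getSentenceInstance(Locale.US);
boundary.setText(text);
int start = boundary.first();
for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
    String sentence = text.substring(start, end).trim();
    // emit MY_SPECIAL_SENTENCE_TOKEN here, then the sentence's terms as usual
}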
