Retain boilerplate using boilerpipe - boilerpipe

I am using the boilerpipe library to analyze news articles. These news articles contain a lot of boilerplate such as copyright information, side panes of related articles, etc. Boilerpipe removes all that information. Is it possible to return the boilerplate information? I need to analyze and extract some things from the copyright statement, etc.
Also, does it provide some sort of confidence for each text block as to whether it is boilerplate or not?
Thanks.

You can get the entire text or traverse the actual text blocks by using the document classes boilerpipe provides:
final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(url));
final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
// doc.getText(true, true) will give you all the text
// doc.getTextBlocks will let you traverse the document

Related

Using Aho-Corasick, can strings be added after the initial tree is built?

I want to search for strings inside a large number of documents. I have a predefined list of strings available that I want to find in each document. Each document contains a header at the beginning followed by the text and in the header are additional strings I want to search for in the text below the header.
On each iteration of document, is it possible to add the header strings after creating the initial tree that was made from the main list? Or modify the original data structure to include the new strings?
If this is not practical to do, is there an alternative search method that would be more appropriate?
If each document has its own set of strings to search for, it seems like you could just build one global Aho-Corasick matcher and then a second, per-document matcher. Then, as you process the characters in the document, feed each into both of the matching automata and report all matches found this way. That eliminates the need to add new strings to the master automaton and to remove them when you're done. Plus, the slowdown should be pretty minimal.
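The two-matcher idea above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the question names no language, so Python is used here, and the names AhoCorasick, feed and search_document are hypothetical helpers invented for this sketch. The global automaton is built once; a small per-document automaton is built from the header strings and thrown away afterwards.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton that is fed one character at a time."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link for each state
        self.out = [[]]    # patterns that end at each state
        for pattern in patterns:
            if pattern:    # ignore empty strings
                self._insert(pattern)
        self._build_failure_links()
        self.state = 0

    def _insert(self, pattern):
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first, so every failure target is finished before it is used.
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit the patterns that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def reset(self):
        self.state = 0

    def feed(self, ch):
        """Advance by one character and return the patterns ending here."""
        while self.state and ch not in self.goto[self.state]:
            self.state = self.fail[self.state]
        self.state = self.goto[self.state].get(ch, 0)
        return self.out[self.state]


def search_document(text, global_matcher, header_patterns):
    """Run the shared matcher and a throwaway per-document matcher in lockstep."""
    doc_matcher = AhoCorasick(header_patterns)
    global_matcher.reset()
    matches = []
    for end, ch in enumerate(text):
        for matcher in (global_matcher, doc_matcher):
            for pattern in matcher.feed(ch):
                matches.append((end - len(pattern) + 1, pattern))
    return matches


global_matcher = AhoCorasick(["he", "she", "hers"])   # built once, reused
found = search_document("ushers", global_matcher, ["us", "her"])
```

Each match is reported as a (start index, pattern) pair. Because the master automaton is never modified, nothing has to be removed from it before the next document is processed.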
Hope this helps!

How can I read the translation labels from profile documents in XPages?

I am XPages-enabling an old Notes application which uses profile documents to store translated labels. The translated labels in the Notes form are read from the profile document using @GetProfileField, depending on which language the user has selected in their profile.
I have read that profile documents are not recommended for use with XPages, so I need a better solution for my XPages users. However, it is important that users of the Notes client can still use the "old" profile document solution.
How can I provide these translation labels to my XPages users?
Thanks
Thomas
In addition to Knut's answer, there is also the option to "double" your translated labels, that is, to maintain them a second time using the approach preferred in XPages development: the localization options described here: http://www-10.lotus.com/ldd/ddwiki.nsf/dx/UsingLocalizationOptions.htm
You need to split the task into two. First have a function that is called inside the XPage to get the label you are looking for, secondly have a way to provide that value inside the function.
Making a direct call to the profile isn't a good idea since it fixes the way you provide the data (besides potentially creating a memory leak if you don't recycle diligently). I would see 4 potential solutions:
1. Define your profile document as an additional data source and simply bind the labels to items in the document. This saves you most of the recycling work, but couples the XPage tightly to the profile document.
2. Use an SSJS function: getLabel(name). It would check for a scope variable (a Map) and, if not found, load it, currently from your profile. If application scope is good enough, you touch the profile only once, which is fast. If you change the loader later on, you don't need to change anything in the XPage.
3. Use a managed bean. Same approach as #2, only now you can use EL data binding. Your bean needs to implement Map.
4. If the labels hardly change, do a design-time conversion and write the profile doc out into properties files (works nicely with ODP) and use the XPages internal mechanism for internationalization.
Let us know how it goes
You can use profile documents for this use case as the content gets changed only with new versions of your project probably. So, you can easily live with profile document's caching.
You get the label translation from a profile document with
var doc = database.getProfileDocument("LabelsEnglish", "");
var label = doc.getItemValueString("label1");
doc.recycle();
return label;
You could read all labels in an application scope variable Map too and do your own caching. This way profile documents would get read only once.
if (!applicationScope.labels) {
    var map = new java.util.HashMap();
    var doc = database.getProfileDocument("LabelsEnglish", "");
    var allItems = doc.getItems();
    for (var i = 0; i < allItems.size(); i++) {
        var item = allItems.elementAt(i);
        map.put(item.getName(), item.getValueString());
        item.recycle();
    }
    doc.recycle();
    applicationScope.labels = map;
}
Execute the SSJS code above in a custom control which is included in every XPage (e.g. the application layout custom control), in the beforePageLoad event, so you can be sure the application scope variable "labels" is initialized when you want to use it. You can then access the labels easily with EL:
applicationScope.labels.label1

Extracting Important words from a sentence using Node

I admit that I haven't searched extensively in the SO database. I tried reading about the natural npm package, but it doesn't seem to provide this feature. I would like to know whether the requirement below is possible.
I have a database that has a list of all cities of a country. I also have ratings of these cities (best place to live, worst place to live, best rated city, worst rated city, etc.). Now, from the user interface, I would like to let the user enter free text, and from that text I should be able to search my database.
For example: Best place to live in California
or places near California
or places in California
From the above sentence, I want to extract (maybe) only the nouns, as these will be the name of the city or country that I can search for.
Then extracting 'best' means I can sort in a particular order, etc.
Any suggestions or directions to look for?
I risk a chance that the question will be marked as 'debatable'. But the reason I posted is to get some direction to proceed.
[I came across this question whilst looking for some use cases to test a module I'm working on. Obviously the question is a little old, but since my module addresses the question I thought I might as well add some information here for future searchers.]
You should be able to do what you want with a POS chunker. I've recently released one for Node that is modelled on chunkers provided by the NLTK (Python) and Stanford NLP (Java) libraries (the chunk() and TokensRegex() methods, respectively).
The module processes strings that already contain parts-of-speech, so first you'll need to run your text through a parts-of-speech tagger, such as pos:
var pos = require('pos');
var words = new pos.Lexer().lex('Best place to live in California');
var tags = new pos.Tagger()
    .tag(words)
    .map(function(tag){return tag[0] + '/' + tag[1];})
    .join(' ');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN California/NNP ./.
Now you can use pos-chunker to find all proper nouns:
var chunker = require('pos-chunker');
var places = chunker.chunk(tags, '[{ tag: NNP }]');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN {California/NNP} ./.
Similarly you could extract verbs to understand what people want to do ('live', 'swim', 'eat', etc.):
var verbs = chunker.chunk(tags, '[{ tag: VB }]');
Which would yield:
Best/JJS place/NN to/TO {live/VB} in/IN California/NNP ./.
You can also match words, sequences of words and tags, use lookahead, group sequences together to create chunks (and then match on those), and other such things.
You probably don't have to identify what is a noun. Since you already have a list of city and country names that your system can handle, you just have to check whether the user input contains one of these names.
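A minimal sketch of that lookup, written in Python for brevity (the same word-boundary regex approach carries over to Node). Everything named here is an assumption made for illustration: KNOWN_PLACES stands in for the names loaded from your database, and RANKING_WORDS is a hypothetical mapping from a rating word in the query to a sort direction.

```python
import re

# Hypothetical data: in practice these names come from the city database.
KNOWN_PLACES = ["California", "New York", "Texas"]

# Hypothetical mapping from a rating word in the query to a sort direction.
RANKING_WORDS = {"best": "desc", "worst": "asc"}


def contains_phrase(phrase, text):
    """True if phrase occurs in text as whole words, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def parse_query(text, known_places=KNOWN_PLACES):
    """Pick out known place names and an optional sort direction."""
    places = [place for place in known_places if contains_phrase(place, text)]
    order = None
    for word, direction in RANKING_WORDS.items():
        if contains_phrase(word, text):
            order = direction
            break
    return {"places": places, "order": order}


result = parse_query("Best place to live in California")
```

With a long list of place names, checking each name separately becomes slow; a single multi-pattern matcher or a database full-text index would then be the more sensible choice.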
Well firstly you'll need to find a way to identify nouns. There is no core node module or anything that can do this for you. You need to loop through all words in the string and then compare them against some kind of dictionary database so you can find each word and check if it's a noun.
I found this api which looks pretty promising. You query the API for a word and it sends you back a blob of data like this:
<?xml version="1.0" encoding="UTF-8"?>
<results>
    <result>
        <term>consistent, uniform</term>
        <definition>the same throughout in structure or composition</definition>
        <partofspeech>adj</partofspeech>
        <example>bituminous coal is often treated as a consistent and homogeneous product</example>
    </result>
</results>
You can see that it includes a partofspeech member which tells you that the word "consistent" is an adjective.
Another (and better) option if you have control over the text being stored is to use some kind of markup language to identify important parts of the string before you save it. Something like BBCode. I even found a BBCode node module that will help you do this.
Then you can save your strings to the database like this:
Best place to live in [city]California[/city] or places near [city]California[/city] or places in [city]California[/city].
or
My name is [first]Alex[/first] [last]Ford[/last].
If you're letting users type whole sentences of text and then trying to figure out which parts of those sentences are data you should use in your app, then you're making things unnecessarily hard on yourself. You should either ask them to enter the important pieces of data into their own text boxes, or give the user a formatting language such as the aforementioned BBCode syntax so they can identify the important bits for you. I think the job of finding out which parts of a string are important is going to be a huge one for you.

How do I know if I need an XDocument or XElement?

I understand that an XDocument represents a whole document and an XElement represents a fragment of a whole document, but they seem to be very similar when it comes to loading and querying/updating the data. I am going to have templates saved to disk, and when I load them, I want to query over them and insert, update, and delete sections of data, be they attributes or elements. Does XDocument or XElement make a difference here? Does it make a difference if I build the template dynamically first?
For starters, they behave differently when loading a document, which means you'll have to write your queries differently depending on which one you choose. Apart from that, MSDN states that
The XDocument class contains the information necessary for a valid XML document. This includes an XML declaration, processing instructions, and comments.
Note that you only have to create XDocument objects if you require the specific functionality provided by the XDocument class. In many circumstances, you can work directly with XElement. Working directly with XElement is a simpler programming model
So I'd stick to XElement, unless any of the above-mentioned metadata about the XML is needed (which doesn't seem to be the case).

Help to restructure my Doc/View more correctly

Edited by OP.
My program is in need of a lot of cleanup and restructuring.
In another post I asked about leaving the MFC Doc/View framework and going the WinProc & message loop way (what is that called for short?). Well, at present I am thinking that I should clean up what I have in Doc/View and perhaps later convert to non-MFC, if that even makes sense. My Document class currently has almost nothing useful in it.
I think a place to start is the InitInstance() function (posted below).
In this part:
POSITION pos=pDocTemplate->GetFirstDocPosition();
CLCWDoc *pDoc=(CLCWDoc *)pDocTemplate->GetNextDoc(pos);
ASSERT_VALID(pDoc);
POSITION vpos=pDoc->GetFirstViewPosition();
CChildView *pCV=(CChildView *)pDoc->GetNextView(vpos);
This seems strange to me. I only have one doc and one view. I feel like I am going about it backwards with GetNextDoc() and GetNextView(). To use a silly analogy: it's like I have a book in my hand but I have to look in its index to find out what page the title of the book is on. I'm tired of feeling embarrassed about my code. I either need correction or reassurance, or both. :)
Also, all the miscellaneous items are in no particular order. I would like to rearrange them into an order that may be more standard, structured or straightforward.
ALL suggestions welcome!
BOOL CLCWApp::InitInstance()
{
    InitCommonControls();
    if(!AfxOleInit())
        return FALSE;
    // Initialize the Toolbar dll. (Toolbar code by Nikolay Denisov.)
    InitGuiLibDLL(); // NOTE: insert GuiLib.dll into the resource chain
    SetRegistryKey(_T("Real Name Removed"));
    // Register document templates
    CSingleDocTemplate* pDocTemplate;
    pDocTemplate = new CSingleDocTemplate(
        IDR_MAINFRAME,
        RUNTIME_CLASS(CLCWDoc),
        RUNTIME_CLASS(CMainFrame),
        RUNTIME_CLASS(CChildView));
    AddDocTemplate(pDocTemplate);
    // Parse command line for standard shell commands, DDE, file open
    CCmdLineInfo cmdInfo;
    ParseCommandLine(cmdInfo);
    // Dispatch commands specified on the command line
    // The window frame appears on the screen in here.
    if (!ProcessShellCommand(cmdInfo))
    {
        AfxMessageBox("Failure processing Command Line");
        return FALSE;
    }
    POSITION pos=pDocTemplate->GetFirstDocPosition();
    CLCWDoc *pDoc=(CLCWDoc *)pDocTemplate->GetNextDoc(pos);
    ASSERT_VALID(pDoc);
    POSITION vpos=pDoc->GetFirstViewPosition();
    CChildView *pCV=(CChildView *)pDoc->GetNextView(vpos);
    if(!cmdInfo.m_Fn1.IsEmpty() && !cmdInfo.m_Fn2.IsEmpty())
    {
        pCV->OpenF1(cmdInfo.m_Fn1);
        pCV->OpenF2(cmdInfo.m_Fn2);
        pCV->DoCompare(); // Sends a paint message when complete
    }
    // enable file manager drag/drop and DDE Execute open
    m_pMainWnd->DragAcceptFiles(TRUE);
    m_pMainWnd->ShowWindow(SW_SHOWNORMAL);
    m_pMainWnd->UpdateWindow(); // paints the window background
    pCV->bDoSize=true; //Prevent a dozen useless size calculations
    return TRUE;
}
Thanks
Hard to give you good recommendations without knowing what your program shall do. I have only a few general remarks:
Your InitInstance does not look very messed up to me. It's pretty much standard with a bit of custom code in it.
Also, the ugly construction to retrieve the first view from the application class (the chain GetDocTemplate -> GetDoc -> GetView) is standard to my knowledge. I actually don't know another way. You might think about moving it into a separate method like CChildView* CLCWApp::GetFirstView(), but that's only cosmetic as long as you need it in only one place.
Which data you place in your Document class and which in your View class(es) is more of a semantic question if you only have one view. (You have only one document anyway because it's an SDI application.) From a technical viewpoint, both are often possible.
But to be open for (perhaps) later extensions to more than one view and to follow the standard pattern of a doc/view architecture there are a few rules of thumb:
Data which exist and have a meaning independent of the way they are presented and viewed (a document file, a database handle, etc.) belong to the document class. I don't know what your pCV->OpenF1(cmdInfo.m_Fn1) and so on does, but if it handles something like a file, a filename, or a parameter used to access data in any way, OpenF1 might be better as a method of the document class.
Methods which do any kind of data processing or modification of your underlying data belong to the document class as well.
Data and methods which are only needed for a specific way to display a document belong to a view class (for instance a selected font, colours, etc.)
On the other side: If you have a fixed number of views which open with the document it might not be wrong to put view specific data into the document, especially if you want to make those view parameters persistent. An example would be a file with some statistical data - your document - and a splitter frame with two views: one displays the data as a grid table and the other as a pie chart. The table has "view data" describing the order of and width of columns, the pie chart has data to configure the colours of the pie pieces and the legend location, for instance. If you want to make sure that the user gets the last view configuration displayed when he opens the document file you have to store these view parameters somewhere. It wouldn't be wrong or bad design in my opinion to store those parameters in the document too, to store and retrieve them from any permanent storage, even if you need them only in the view classes.
If your application allows opening an unlimited number of views for a document dynamically, and those views are only temporary as long as the application runs, storing all view configuration parameters directly in the view classes seems more natural to me. Otherwise you would need to manage some kind of dynamic data structure in the document and establish a relationship between a View and an entry in this data structure (an index in an array, a key in a map, etc.).
If you are in doubt whether to place some data in the document or the view class, I'd prefer the document, because you always have the easy GetDocument() accessor in the View class to retrieve members or call methods of the Doc. Fetching data from the View into the Document requires iterating through the list of views. (Remember: Doc-View is a 1-n relationship, even in an SDI application.)
Just a few cents.

Resources