nlp- Difference between Sentences and a Document in Stanford OpenNLP? - nlp

Let us say we have an article that we want to annotate. If we input the text as one really long Sentence as opposed to a Document, does Stanford do anything differently between annotating that one long Sentence as opposed to looping through every Sentence in the Document and culminating all of its results together?
EDIT: I ran a test and it seems like the two approaches return two different NER sets. I might be just doing it wrong, but it's certainly super interesting and I'm curious as to why this happens.

To confirm: you mean Stanford CoreNLP (as opposed to Apache OpenNLP), right?
The main difference in the CoreNLP Simple API between a Sentence and a Document is tokenization. A Sentence will force the entire text to be considered as a single sentence, even if it has punctuation. A Document will first tokenize the text into a list of sentences, and then annotate each sentence.
Note that for annotators like the constituency parser, very long sentences will take prohibitively long to annotate. Also, note that coreference only works on documents, not sentences.

Related

Passing multiple sentences to BERT?

I have a dataset with paragraphs that I need to classify into two classes. These paragraphs are usually 3-5 sentences long. The overwhelming majority of them are less than 500 words long. I would like to make use of BERT to tackle this problem.
I am wondering how I should use BERT to generate vector representations of these paragraphs and especially, whether it is fine to just pass the whole paragraph into BERT?
There have been informative discussions of related problems here and here. These discussions focus on how to use BERT for representing whole documents. In my case the paragraphs are not that long, and indeed could be passed to BERT without exceeding its maximum length of 512. However, BERT was trained on sentences. Sentences are relatively self-contained units of meaning. I wonder if feeding multiple sentences into BERT doesn't conflict fundamentally with what the model was designed to do (although this appears to be done regularly).
I think your question is based on a misconception. Even though the BERT paper uses the term sentence quite often, it is not referring to a linguistic sentence. The paper defines a sentence as
an arbitrary span of contiguous text, rather than an actual linguistic sentence.
It is therefore completely fine to pass whole paragraphs to BERT and a reason why they can handle those.

How does TreeTagger get the lemma of a word?

I am using TreeTagger to get the lemmas of words in Spanish, but I have observed there are too much words which are not transformed as should be. I would like to know how this operations works, if it is done with techniques such as decision trees or machine learning algorithms or it simply contains a list of words with its corresponding lemma. Does someone know it?
Thanks!!
On basis of personal communication via email with H. Schmid, the author of TreeTagger, the answer to your question is:
The lemmatization function is based on the XTAG Project, which includes a morphological analyzer. Within the XTAG project several corpora have been analyzed. Considerung TreeTagger, especially the analysis of the Penn Treebank Corpus seems relevant, since this corpus is the training corpus for the English parameter file of TreeTagger. Considering lemmatization, the lemmata have simply been stored in a lexicon. TreeTagger finally uses this lexicon as a lookup table.
Hence, with TreeTagger you may only retreive the lemmata that are available in the lexicon.
In case you need additional funtionality regarding lemmatization beyond the options in TreeeTagger, you will need a morphological analyzer and, depending on your approach, a suitable training corpus, although this does not seem mandatoriy, since several analyzers perform quite well even when directly applied on the corpus of interest to be analyzed.

NLP: Pre-processing in doc2vec / word2vec

A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences:
The corpora were lemmatized and POS-tagged with the Stanford CoreNLP (Manning et al., 2014) and each token was replaced with its lemma and POS tag
(http://www.ep.liu.se/ecp/131/039/ecp17131039.pdf)
For pre-processing, we tokenise and lowercase the words using Stanford CoreNLP
(https://arxiv.org/pdf/1607.05368.pdf)
So my questions are:
Why does the first paper apply POS-tagging? Would each token then be replaced with something like {lemma}_{POS} and the whole thing used to train the model? Or are the tags used to filter tokens?
For example, gensims WikiCorpus applies lemmatization per default and then only keeps a few types of part of speech (verbs, nouns, etc.) and gets rid of the rest. So what is the recommended way?
The quote from the second paper seems to me like they only split up words and then lowercase them. This is also what I first tried before I used WikiCorpus. In my opinion, this should give better results for document embeddings as most of POS types contribute to the meaning of a sentence. Am I right?
In the original doc2vec paper I did not find details about their pre-processing.
For your first question, the answer is "it depends on what you are trying to accomplish!"
There isn't a recommended way per say, to pre-process text. To clean a text corpus, usually the first steps are tokenization and lemmatization. Next, to remove not important terms/tokens, you can remove stop-words or even apply POS tags, to be able to remove tokens based on their grammatical category, based on the assumption that some grammatical categories (such as adjectives), do not contain valuable information for modelling a topic for example. But this purely depends on the type of analysis you are going to follow after the pre-processing step.
For you second part of the question, as explained above, tokenisation and lower case tokens, are standard parts of the pre-processing routine. So I also suspect, that regardless of the ML algorithm used later on, your results will be better if you carefully pre-process your data. I am not sure whether POS tags contribute to the meaning of a sentence though.
Hope I provided some valuable feedback to your research. If not you could provide a code sample to further discuss this issue.

Detecting content based on position in sentence with OpenNLP

I've successfully used OpenNLP for document categorization and also was able to extract names from trained samples and using regular expressions.
I was wondering if it is also possible to extract names (or more generally speaking, subjects) based on their position in a sentence?
E.g. instead of training with concrete names that are know a priori, like Travel to <START:location> New York </START>, I would prefer not to provide concrete examples but let OpenNLP decide that anything appearing at the specified position could be an entity. That way, I wouldn't have to provide each and every possible option (which is impossible in my case anyway) but only provide one for the possible surrounding sentence.
that is context based learning and Opennlp already does that. you've to train it with proper and more examples to get good results.
for example, when there is Professor X in our sentence, Opennlp trained model.bin gives you output X as a name whereas when X is present in the sentence without professor infront of it, it might not give output X as a name.
according to its documentation, give 15000 sentences of training data and you can expect good results.

What is the best way to split a sentence for a keyword extraction task?

I'm doing a keyword extraction using TD-IDF on a large number of documents. Currenly I'm splitting each sentence based on n-gram. More particularly I'm using tri-gram. However, this is not the best way to split each sentence into ints constituting keywords. For example a noun phrase like 'triple heart bypass' may not always get detected as one term.
The other alternative to chunk each sentence into its constituting elements look to be part of speech tagging and chunking in Open NLP. In this approach phrase like 'triple heart bypass' always gets extracted as a whole but the downside is in TF-IDF the frequency of extracted terms (phrases) dramatically drops.
Does anyone have any suggestion on either or these two approaches or have any other ideas to improve the quality of the keywords?
What is :
the goal of your application ?
--impacts the tokenization rules and defines the quality of your keywords
type of documents?
--chunking is not the same if you have forum data or news article data.
You can implement some boundary recognizer by yourself, or using a statistical model as in openNLP.
The typical pipeline is that you should first tokenize as simple as possible, apply stop words removal (language-dependent), and then if needed POS tagging-based filtering (but this is a costly operation).
other options : java.text.BreakIterator, com.ibm.icu.text.BreakIterator, com.ibm.icu.text.RuleBasedBreakIterator...

Resources