I've just started working with Stanford CoreNLP.
My problem is that many of the sentences in my corpus do not end with a period (full stop).
Frankly, a bit of string parsing with regular expressions could probably fix the issue, but with some degree of error.
I am curious as to whether Stanford NLP can identify missing periods.
It looks like edu.stanford.nlp.process.DocumentPreprocessor can be used to split paragraphs into sentences, though I am not sure how well it works without proper punctuation.
There are many other sentence-level tokenizers that you can use to preprocess your corpus. Check out NLTK's nltk.tokenize.punkt module, which uses an ML algorithm to find sentence boundaries in the absence of good capitalization/punctuation.
For my bachelor's thesis I need to train different word embedding algorithms on the same corpus to benchmark them.
I am looking to find preprocessing steps but am not sure which ones to use and which ones might be less useful.
I already looked for some studies but also wanted to ask if someone has experience with this.
My objective is to train Word2Vec, FastText and GloVe embeddings on the same corpus. I have not settled on a corpus yet, but I am thinking of Wikipedia or something similar.
In my opinion:
removing non-alphabetic characters with a regex or similar
removing stopwords
detecting phrases
are the logical options.
But I have heard that stopword removal can be tricky: some embeddings may still contain stopwords, because an automatic stopword list does not necessarily fit every model/corpus.
Also, I have not decided whether to use spaCy or NLTK as the library. spaCy is more powerful, but NLTK is the one mainly used at the chair where I am writing my thesis.
Preprocessing is like hyperparameter optimization or neural architecture search. There isn't a theoretical answer to "which one should I use". The applied section of this field (NLP) is far ahead of the theory. You just run different combinations until you find the one that works best (according to your choice of metric).
Yes, Wikipedia is great, and almost everyone uses it (plus other datasets). I've tried spaCy and it's powerful, but I think I made a mistake with it, and I ended up writing my own tokenizer, which worked better. YMMV. Again, you just have to jump in and try almost everything. Check with your advisor that you have enough time and computing resources.
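To make the "run different combinations" advice a bit more concrete, here is a minimal sketch in plain Python. The names `preprocess` and `all_variants`, the flags, and the tiny stopword set are placeholders I made up for illustration, not part of any library. Phrase detection is left out, and the ASCII-only regex would also drop accented letters, so treat it as a starting point only.

```python
import re
from itertools import product

# Tiny illustrative stopword list (an assumption; real lists would come
# from NLTK, spaCy, or a list tuned to your corpus).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def preprocess(text, lowercase=True, alpha_only=True, remove_stopwords=False):
    """Return a list of tokens after applying the selected steps."""
    if lowercase:
        text = text.lower()
    if alpha_only:
        # Replace every run of non-ASCII-letter characters with one space.
        text = re.sub(r"[^A-Za-z]+", " ", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def all_variants(text):
    """Yield (settings, tokens) for every on/off combination of the steps."""
    names = ("lowercase", "alpha_only", "remove_stopwords")
    for flags in product([True, False], repeat=len(names)):
        settings = dict(zip(names, flags))
        yield settings, preprocess(text, **settings)


sample = "The capital of France is Paris, founded in the 3rd century BC."
for settings, tokens in all_variants(sample):
    print(settings, tokens)
```

In a real benchmark you would train each embedding model once per variant and compare the variants on whatever evaluation metric you have chosen.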
I'm trying to tokenize sentences using spacy.
The text includes lots of abbreviations and comments that end with a period. Also, the text was obtained with OCR, and sometimes there are line breaks in the middle of sentences. spaCy doesn't seem to perform well in these situations.
I have extracted some examples of how I want these sentences to be split. Is there any way to train spacy's sentence tokenizer?
Spacy is a little unusual in that the default sentence segmentation comes from the dependency parser, so you can't train a sentence boundary detector directly as such, but you can add your own custom component to the pipeline or pre-insert some boundaries that the parser will respect. See their documentation with examples: Spacy Sentence Segmentation
For the cases you're describing, it would also be useful to be able to specify that a particular position is NOT a sentence boundary, but as far as I can tell that's not currently possible.
I am using TreeTagger to get the lemmas of words in Spanish, but I have observed that too many words are not transformed as they should be. I would like to know how this operation works: is it done with techniques such as decision trees or machine learning algorithms, or does it simply contain a list of words with their corresponding lemmas? Does anyone know?
Based on personal communication via email with H. Schmid, the author of TreeTagger, the answer to your question is:
The lemmatization function is based on the XTAG Project, which includes a morphological analyzer. Within the XTAG project several corpora have been analyzed. For TreeTagger, the analysis of the Penn Treebank Corpus seems especially relevant, since this corpus is the training corpus for the English parameter file of TreeTagger. As for lemmatization, the lemmata have simply been stored in a lexicon. TreeTagger then uses this lexicon as a lookup table.
Hence, with TreeTagger you can only retrieve the lemmata that are available in the lexicon.
If you need lemmatization functionality beyond the options in TreeTagger, you will need a morphological analyzer and, depending on your approach, a suitable training corpus. The latter does not seem mandatory, though, since several analyzers perform quite well even when applied directly to the corpus of interest.
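To illustrate what "lexicon as a lookup table" means in practice, here is a small sketch. It is not TreeTagger's actual code or data: the mini-lexicon, the tag names and the `lemmatize` function are hypothetical, and the `<unknown>` marker is just a stand-in for "no entry found".

```python
# Hypothetical mini-lexicon mapping (word form, POS tag) -> lemma.
# A real parameter file would hold a far larger table.
LEXICON = {
    ("corrieron", "VERB"): "correr",
    ("casas", "NOUN"): "casa",
    ("fue", "VERB"): "ser",
}

UNKNOWN = "<unknown>"  # placeholder returned when the lexicon has no entry


def lemmatize(word, pos, lexicon=LEXICON):
    """Pure table lookup: no rules, no guessing for unseen forms."""
    return lexicon.get((word.lower(), pos), UNKNOWN)


for word, pos in [("Casas", "NOUN"), ("corrieron", "VERB"), ("tuitearon", "VERB")]:
    print(word, pos, "->", lemmatize(word, pos))
```

A form that is missing from the table (here the made-up case "tuitearon") simply comes back without a lemma, which would match the behaviour described in the question.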
A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences:
The corpora were lemmatized and POS-tagged with the Stanford CoreNLP (Manning et al., 2014) and each token was replaced with its lemma and POS tag
For pre-processing, we tokenise and lowercase the words using Stanford CoreNLP
So my questions are:
Why does the first paper apply POS-tagging? Would each token then be replaced with something like {lemma}_{POS} and the whole thing used to train the model? Or are the tags used to filter tokens?
For example, gensim's WikiCorpus applies lemmatization by default and then only keeps a few types of part of speech (verbs, nouns, etc.) and gets rid of the rest. So what is the recommended way?
The quote from the second paper reads to me as if they only split the text into words and then lowercase them. This is also what I first tried before I used WikiCorpus. In my opinion, this should give better results for document embeddings, as most POS types contribute to the meaning of a sentence. Am I right?
In the original doc2vec paper I did not find details about their pre-processing.
For your first question, the answer is "it depends on what you are trying to accomplish!"
There isn't a recommended way, per se, to pre-process text. To clean a text corpus, the first steps are usually tokenization and lemmatization. Next, to remove unimportant terms/tokens, you can remove stop-words or even apply POS tags, so that you can remove tokens based on their grammatical category, on the assumption that some grammatical categories (such as adjectives) do not contain valuable information for, say, modelling a topic. But this depends purely on the type of analysis you are going to run after the pre-processing step.
For the second part of your question: as explained above, tokenisation and lowercasing are standard parts of the pre-processing routine. So I also suspect that, regardless of the ML algorithm used later on, your results will be better if you carefully pre-process your data. I am not sure whether POS tags contribute to the meaning of a sentence, though.
I hope I provided some valuable feedback for your research. If not, you could provide a code sample to discuss this issue further.
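The two readings of the first paper raised in the question ({lemma}_{POS} tokens versus POS-based filtering) can be sketched as follows. The tagged triples are hand-written stand-ins for tagger output, and both function names are made up for this example. Neither is claimed to be what the papers or gensim actually do.

```python
# Hypothetical tagger output: (token, lemma, POS) triples with
# Penn Treebank style tags, written by hand for this example.
tagged = [
    ("The", "the", "DT"),
    ("cats", "cat", "NNS"),
    ("were", "be", "VBD"),
    ("sleeping", "sleep", "VBG"),
    ("quietly", "quietly", "RB"),
]


def lemma_pos_tokens(tagged):
    """Option 1: replace each token with a combined 'lemma_POS' string."""
    return [f"{lemma}_{pos}" for _, lemma, pos in tagged]


def filter_by_pos(tagged, keep_prefixes=("NN", "VB")):
    """Option 2: keep only lemmas whose tag starts with an allowed prefix."""
    return [lemma for _, lemma, pos in tagged if pos.startswith(keep_prefixes)]


print(lemma_pos_tokens(tagged))
print(filter_by_pos(tagged))
```

Option 1 keeps every token but lets the model distinguish, for example, a noun from a verb with the same spelling. Option 2 shrinks the vocabulary at the cost of discarding some categories entirely.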
I am extracting causal sentences from the accident reports on water. I am using NLTK as a tool here. I manually created my regExp grammar by taking 20 causal sentence structures [see examples below]. The constructed grammar is of the type
grammar = r'''Cause: {<DT|IN|JJ>?<NN.*|PRP|EX><VBD><NN.*|PRP|VBD>?<.*>+<VBD|VBN>?<.*>+}'''
Now the grammar has 100% recall on the test set (I built my own toy dataset with 50 causal and 50 non-causal sentences) but low precision. I would like to ask about:
How can I train NLTK to build the regexp grammar automatically for
extracting a particular type of sentence?
Has anyone ever tried to extract causal sentences? Example
causal sentences are:
There was poor sanitation in the village, as a consequence, she had
health problems.
The water was impure in her village, For this reason, she suffered
from parasites.
She had health problems because of poor sanitation in the village.
I would want to extract only the above type of sentences from a
large text.
I had a brief discussion with the author of the book "Python Text Processing with NLTK 2.0 Cookbook", Mr. Jacob Perkins. He said, "a generalized grammar for sentences is pretty hard. I would instead see if you can find common tag patterns, and use those. But then you're essentially doing classification by regexp matching. Parsing is usually used to extract phrases within a sentence, or to produce deep parse trees of a sentence, but you're just trying to identify/extract sentences, which is why I think classification is a much better approach. Consider including tagged words as features when you try this, since the grammar could be significant." Taking his suggestions, I looked at the causal sentences I had and found that these sentences contain connectors like
as a result
as a consequence
For this reason
For all these reasons
because of
on account of
due to
for the reason
so, that
These phrases do connect cause and effect in a sentence. Using these connectors, it is now easy to extract causal sentences. A detailed report can be found on arXiv: https://arxiv.org/pdf/1507.02447.pdf
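A minimal sketch of the connector-based idea, using only the standard library, could look like this. It is not the implementation from the report: the sentence splitter is a naive regex, the function names are my own, and I left out "so, that" because its exact surface form is unclear to me.

```python
import re

# Connectors taken from the list above (except "so, that").
CONNECTORS = [
    "as a result", "as a consequence", "for this reason",
    "for all these reasons", "because of", "on account of",
    "due to", "for the reason",
]

# One case-insensitive pattern that matches any connector as whole words.
CONNECTOR_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONNECTORS) + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def extract_causal(text):
    """Return the sentences that contain at least one causal connector."""
    return [s for s in split_sentences(text) if CONNECTOR_RE.search(s)]


text = (
    "There was poor sanitation in the village, as a consequence, she had "
    "health problems. The village is located near a river. "
    "The water was impure in her village, For this reason, she suffered "
    "from parasites. She had health problems because of poor sanitation "
    "in the village."
)
for sentence in extract_causal(text):
    print(sentence)
```

On the three example sentences from the question plus one made-up non-causal sentence, this keeps the three causal ones. Precision on real reports will depend on how often the connectors appear in non-causal uses.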