Clarification about opennlp algorithm - nlp

I'm trying to get in with openNlp. I need it to get new organizations(startups) from news websites (for example: techcrunch). I have a model with organizations, which I use to recognize organizations in publications(en-ner-organization). And here I have a question:
In case there is a publication about new startup, which was born yesterday,
will openNlp recognize it as organization?
As far as I understand - no. Until I don't train model with this new startup, right?
If all my assumptions are correct, the model partially contains of organizations names, so if I want my model to recognize new organization, I have to train it with it's name.
Thanks

As far as I know, OpenNLP should use a statistical model to address named entity recognition: this means that, if OpenNLP has been properly trained with enough data, it should be able to recognize new startups (it's not a grep of known tokens over a file).
Of course metrics such as precision, recall and F1 are useful to determine the accuracy of the algorithm.

Related

Building on existing models on spacy

This is a question regarding training models on SPACY3.x.
I couldn't find a good answer/solution on StackOverflow hence the query.
If I am using the existing model in spacy like the en model and want to add my own entities in the model and train it, let's say since I work in the biomedical domain, things like virus name, shape, length, temperature, temperature value, etc. I don't want to lose the entities tagged by Spacy like organization names, country, etc.
All suggestions are appreciated.
Thanks
There are a few ways to do that.
The best way is to train your own model separately and then combine both models in one pipeline, with one before the other. See the double NER example project for an overview of that.
It's also possible to update the pretrained NER model, see this example project. However this isn't usually a good idea, and definitely not if you're adding completely different entities. You'll run into what's called "catastrophic forgetting", where even though you're technically updating the model, it ends up forgetting everything not represented in your current training data.

Named Entity Recognition Systems for German Texts

I am working on a Named Entity Recognition (NER) project in which I got a large amount of text in the sense that it is too much to read or skim read. Therefore, I want to create an overview of what is mentioned by extracting named entities (places, names, times, maybe topics) and create an index of kind (entity, list of pages/lines where it is mentioned). I have worked through Standford's NLP lecture, (parts of) Eisenstein's Introduction to NLP book found some literature and systems for English texts. As my corpus is in German, I would like to ask how I can approach this problem. Also, this is my first NLP project, so I would not know if I could solve this challenge even if texts were in English.
As a first step
are there German NER systems out there which I could use?
The further roadmap of my project is:
How can I avoid mapping misspellings or rare names to a NUL/UNK token? This is relevant because there are also some historic passages that use words no longer in use or that follow old orthography. I think the relevant terms are tokenisation or stemming.
I thought about fine-tuning or transfer learning the base NER model to a corpus of historic texts to improve NER.
A major challenge is that there is no annotated dataset for my corpus available and I could only manually annotate a tiny fraction of it. So I would be happy for hints on German annotated datasets which I could incorporate into my project.
Thank you in advance for your inputs and fruitful discussions.
Most good NLP toolkits can perform NER in German:
Stanford NLP
Spacy
probably NLTK and OpenNLP as well
What is crucial to understand is that using NER software like the above means using a pretrained model, i.e. a model which has been previously trained on some standard corpus with standard annotated entities.
Btw you can usually find the original annotated dataset by looking at the documentation. There's one NER corpus here.
This is convenient and might suit your goal, but sometimes it doesn't collect exactly every that you would like it to collect, especially if your corpus is from a very specific domain. If you need more specific NER, you must train your own model and this requires obtaining some annotated data (i.e. manually annotating or paying somebody to do it).
Even in this case, a NER model is statistical and it will unavoidably make some mistakes, don't expect perfect results.
About misspellings or rare names: a NER model doesn't care (or not too much) about the actual entity, because it's not primarily based on the words in the entity. It's based on indications in the surrounding text, for example in the sentence "It was announced by Mr XYZ that the event would take place in July", the NER model should find 'Mr XYZ' as a person due to "announced by" and 'July' as a date because of "take place in". However if the language used in the corpus is very different from the training data used for the model, the performance could be very bad.

Sentiment Analysis: Is there a way to extract positive and negative aspects in reviews?

Currently, I'm working on a project where I need to extract the relevant aspects used in positive and negative reviews in real time.
For the notions of more negative and positive, it will be a question of contextualizing the word. Distinguish between a word that sounds positive in a negative context (consider irony).
Here is an example:
Very nice welcome!!! We ate very well with traditional dishes as at home, the quality but also the quantity are in appointment!!!*
Positive aspects: welcome, traditional dishes, quality, quantity
Can anyone suggest to me some tutorials, papers or ideas about this topic?
Thank you in advance.
This task is called Aspect Based Sentiment Analysis (ABSA). Most popular is the format and dataset specified in the 2014 Semantic Evaluation Workshop (Task 5) and its updated versions in the following years.
Overview of model efficiencies over the years:
https://paperswithcode.com/sota/aspect-based-sentiment-analysis-on-semeval
Good source for ressources and repositories on the topic (some are very advanced but there are some more starter friendly ressources in there too):
https://github.com/ZhengZixiang/ABSAPapers
Just from my general experience in this topic a very powerful starting point that doesn't require advanced knowledge in machine learning model design is to prepare a Dataset (such as the one provided for the SemEval2014 Task) that is in a Token Classification Format and use it to fine-tune a pretrained transformer model such as BERT, RoBERTa or similar. Check out any tutorial on how to do fine-tuning on a token classification model like this one in huggingface. They usually use the popular task of Named Entity Recognition (NER) as the example task but for the ABSA-Task you basically do the same thing but with other labels and a different dataset.
Obviously an even easier approach would be to take more rule-based approaches or combine a rule-based approach with a trained sentiment analysis model/negation detection etc., but I think generally with a rule-based approach you can expect a much inferior performance compared to using state-of-the-art models as transformers.
If you want to go even more advanced than just fine-tuning the pretrained transformer models then check out the second and third link I provided and look at some of the machine learning model designs specifically designed for Aspect Based Sentiment Analysis.

How to fine tune BERT on unlabeled data?

I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need in the order of 1000 or more samples including labels.
Pretraining, on the other hand, is basically trying to help BERT better "understand" data from a certain domain, by basically continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
If your ultimate objective is sentence embeddings, however, I would strongly suggest you to have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form ofloss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
#dennlinger gave an exhaustive answer. Additional pretraining is also referred as "post-training", "domain adaptation" and "language modeling fine-tuning". here you will find an example how to do it.
But, since you want to have good sentence embeddings, you better use Sentence Transformers. Moreover, they provide fine-tuned models, which already capable of understanding semantic similarity between sentences. "Continue Training on Other Data" section is what you want to further fine-tune the model on your domain. You do have to prepare training dataset, according to one of available loss functions. E.g. ContrastLoss requires a pair of texts and a label, whether this pair is similar.
I believe transfer learning is useful to train the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer based on your own training data. However, the data would need to be labelled.
Tensorflow has some useful guide on transfer learning.
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training and for getting started, you can take a look over here.

Is it possible to supplement Naive Bayes text classification algorithm with author information?

I am working on a text classification project where I am trying to assign topic classifications to speeches from the Congressional Record.
Using topic codes from the Congressional Bills Project (http://congressionalbills.org/), I've tagged speeches that mention a specific bill as belonging to the topic of the bill. I'm using this as my "training set" for the model.
I have a "vanilla" Naive Bayes classifier working well-enough, but I keep feeling like I could get better accuracy out of the algorithm by incorporating information about the member of Congress who is making the speech (e.g. certain members are much more likely to talk about Foreign Policy than others).
One possibility would be to replace the prior in the NB classifier (usually defined as the proportion of documents in the training set that have the given classification) with speaker's observed prior speeches.
Is this worth pursuing? Are there existing approaches that have followed this same kind of logic? I'm a little bit familiar with the "author-topic models" that come out of Latent Dirichlet Allocation models, but I like the simplicity of the NB model.
There is no need to modify anything, simply add this information to your Naive Bayes and it will work just fine.
And as it was previously mentioned in the comment - do not change any priors - prior probability is P(class), this has nothing to do with actual features.
Just add to your computations another feature corresponding to the authorship, e.g. "author:AUTHOR" and train Naive Bayes as usual, ie. compute P(class|author:AUTHOR) for each class and AUTHOR and use it later on in your classification process.If your current representation is a bag of words, it is sufficient to add a "artificial" word of form "author:AUTHOR" to it.
One other option would be to train independent classifier for each AUTHOR, which would capture person-specific type of speech, for example - one uses lots of words "environment" only when talking about "nature", while other simply likes to add this word in each speach "Oh, in our local environment of ...". Independent NBs would capture these kind of phenomena.

Resources