Which models are best for Named Entity Recognition in Gujarati language text?

I am trying to find the best-performing models for Named Entity Recognition in Gujarati text. The only one I know of is the IndicBERT model on Hugging Face. Can anyone suggest other models with documentation or code available for Named Entity Recognition in Gujarati, or any link where such code can be found?

The recent work by Joshi [1] offers L3Cube-GujaratiBERT, available on HuggingFace here. You'll have to fine-tune the model on your specific downstream task (i.e. Named Entity Recognition in Gujarati). There is a list of Indic NER datasets here; of particular relevance to your problem is the AI4Bharat Naamapadam dataset, which includes Gujarati as one of its 11 Indic languages.
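As a minimal sketch of what that fine-tuning could look like with the Hugging Face transformers Trainer (the model ID l3cube-pune/gujarati-bert, the dataset ID and config ai4bharat/naamapadam / "gu", and the tokens/ner_tags column names are assumptions; verify them against the model and dataset cards):

# Hedged sketch: fine-tuning L3Cube-GujaratiBERT for NER on Naamapadam.
# Model/dataset IDs and column names are assumptions; check the cards.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("ai4bharat/naamapadam", "gu")
labels = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("l3cube-pune/gujarati-bert")
model = AutoModelForTokenClassification.from_pretrained(
    "l3cube-pune/gujarati-bert", num_labels=len(labels))

def tokenize_and_align(batch):
    # Tokenize pre-split words; subword continuations and special tokens
    # get the ignore index -100 so only first subwords are scored.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, row = None, []
        for w in enc.word_ids(i):
            row.append(-100 if w is None or w == prev else tags[w])
            prev = w
        enc["labels"].append(row)
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gu-ner", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()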
Additional Info
In [1], Joshi initially created the L3Cube-HindBERT and L3Cube-DevBERT models, pre-trained on Hindi and Devanagari-script (Hindi + Marathi) monolingual corpora, respectively. These offered a modest improvement in performance over the alternative MuRIL, IndicBERT and XLM-R multilingual offerings. Given the improvement, the author released models for other Indic languages, namely Kannada, Telugu, Malayalam, Tamil, Gujarati, Assamese, Odia, Bengali, and Punjabi (all can be found at https://huggingface.co/l3cube-pune).
References
[1] Joshi, R., 2022. L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages. arXiv preprint arXiv:2211.11418.

Related

Named Entity Recognition Systems for German Texts

I am working on a Named Entity Recognition (NER) project in which I have a large amount of text, in the sense that it is too much to read or even skim. Therefore, I want to create an overview of what is mentioned by extracting named entities (places, names, times, maybe topics) and create an index of the kind (entity, list of pages/lines where it is mentioned). I have worked through Stanford's NLP lecture and (parts of) Eisenstein's Introduction to NLP book, and found some literature and systems for English texts. As my corpus is in German, I would like to ask how I can approach this problem. Also, this is my first NLP project, so I do not know whether I could solve this challenge even if the texts were in English.
As a first step
are there German NER systems out there which I could use?
The further roadmap of my project is:
How can I avoid mapping misspellings or rare names to a NUL/UNK token? This is relevant because there are also some historic passages that use words no longer in use or that follow old orthography. I think the relevant terms are tokenisation or stemming.
I thought about fine-tuning or transfer-learning the base NER model on a corpus of historic texts to improve NER.
A major challenge is that there is no annotated dataset for my corpus available and I could only manually annotate a tiny fraction of it. So I would be happy for hints on German annotated datasets which I could incorporate into my project.
Thank you in advance for your inputs and fruitful discussions.
Most good NLP toolkits can perform NER in German:
Stanford NLP
spaCy
probably NLTK and OpenNLP as well
What is crucial to understand is that using NER software like the above means using a pretrained model, i.e. a model which has been previously trained on some standard corpus with standard annotated entities.
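For instance, a minimal sketch with spaCy's pretrained German pipeline (assuming the de_core_news_sm model has been installed via python -m spacy download de_core_news_sm):

# Minimal sketch: pretrained German NER with spaCy.
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Angela Merkel besuchte im Juli das Siemens-Werk in München.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # labels such as PER, LOC, ORG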
By the way, you can usually find the original annotated dataset by looking at the documentation. There's one NER corpus here.
This is convenient and might suit your goal, but sometimes it doesn't capture exactly everything that you would like it to, especially if your corpus is from a very specific domain. If you need more specific NER, you must train your own model, and this requires obtaining some annotated data (i.e. manually annotating it yourself or paying somebody to do it).
Even in this case, a NER model is statistical and will unavoidably make some mistakes, so don't expect perfect results.
About misspellings or rare names: a NER model doesn't care (or not too much) about the actual entity, because it's not primarily based on the words in the entity itself. It's based on indications in the surrounding text. For example, in the sentence "It was announced by Mr XYZ that the event would take place in July", the NER model should find 'Mr XYZ' as a person due to "announced by", and 'July' as a date because of "take place in". However, if the language used in the corpus is very different from the training data used for the model, the performance could be very bad.

Which Deep Learning Algorithm does Spacy uses when we train Custom model?

When we train a custom model, I see that we have dropout and n_iter parameters to tune, but which deep learning algorithm does spaCy use to train custom models? Also, when adding a new entity type, is it better to create a blank model or to train an existing one?
Which learning algorithm does spaCy use?
spaCy has its own deep learning library called thinc, used under the hood for different NLP models. For most (if not all) tasks, spaCy uses a deep neural network based on CNNs with a few tweaks. Specifically for Named Entity Recognition, spaCy uses:
A transition-based approach borrowed from shift-reduce parsers, which is described in the paper Neural Architectures for Named Entity Recognition by Lample et al.
Matthew Honnibal describes how spaCy uses this on a YouTube video.
A framework that's called "Embed. Encode. Attend. Predict" (Starting here on the video), slides here.
Embed: Words are embedded using a Bloom-filter-inspired hashing scheme ("Bloom embeddings"): word hashes, rather than the words themselves, are kept as keys into the embedding table. This keeps the embedding table compact, with words potentially colliding and ending up with the same vector representation (see the sketch after this list).
Encode: The list of words is encoded into a sentence matrix, to take context into account. spaCy uses a CNN for encoding.
Attend: Decide which parts are more informative given a query, and get problem-specific representations.
Predict: spaCy uses a multi-layer perceptron for inference.
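To make the Embed step concrete, here is a toy sketch of the hashed-embedding idea (not spaCy's actual implementation, which hashes with MurmurHash and combines several hash seeds):

# Toy sketch of hashed ("Bloom") embeddings: words map to rows of a
# fixed-size table via a hash, so the table stays compact and unseen
# words still get a vector; distinct words may collide.
import numpy as np

N_ROWS, DIM = 1000, 64
table = np.random.default_rng(0).standard_normal((N_ROWS, DIM))

def embed(word: str) -> np.ndarray:
    return table[hash(word) % N_ROWS]  # hash() stands in for MurmurHash

print(embed("Berlin").shape)  # (64,)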
Advantages of this framework, per Honnibal, are:
Mostly equivalent to sequence tagging (another task spaCy offers models for)
Shares code with the parser
Easily excludes invalid sequences
Arbitrary features are easily defined
For a full overview, Matthew Honnibal describes how the model works in this YouTube video. Slides can be found here.
Note: This information is based on slides from 2017. The engine might have changed since then.
When adding a new entity type, should we create a blank model or train an existing one?
Theoretically, when fine-tuning a spaCy model with new entities, you have to make sure the model doesn't forget representations of previously learned entities (the "catastrophic forgetting" problem). The best thing, if possible, is to train a model from scratch, but that might not be easy or feasible due to a lack of data or resources.
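As a rough sketch of the fine-tuning route in the spaCy v3 API (the label name and training example are made up; in practice you would also mix in examples of the original entity types to limit forgetting):

# Hedged sketch: adding a new entity label to an existing spaCy pipeline.
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")
nlp.get_pipe("ner").add_label("GADGET")  # hypothetical new entity type

train = [("I bought a FooPhone yesterday.",
          {"entities": [(11, 19, "GADGET")]})]
optimizer = nlp.resume_training()
for _ in range(20):
    for text, ann in train:
        nlp.update([Example.from_dict(nlp.make_doc(text), ann)],
                   sgd=optimizer)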
EDIT Feb 2021: spaCy version 3 now supports the Transformer architecture as its deep learning model.

Training an NER classifier to recognise Author names

I want to use NER (a CRF classifier) to identify author names in a query. I trained the NER following the method given on the nlp.stanford.edu site, using the training file training-data.col, and tested using the file testing-data.tsv.
The NER is tagging every input as Author, even data that is tagged as non-Author in the training data. Can anyone tell me why the NER is tagging the non-Authors in the training data as Authors, and how to train the NER to identify Authors (I have a list of author names to train on)?
Any suggestions for reference material on NER other than the nlp.stanford.edu site would be helpful.
That's a very small piece of training data, so I'm not surprised that it made the wrong inferences. Since the only example it has seen of "Atal" is as Author, it's tagging "Atal" as such.
But more to the point, if you want to discriminate between people listed at the beginning as Author and people listed in the text as O, Stanford NER is not going to do that. Stanford NER is intended to make long-distance inferences about the named-entity tags of tokens in natural language text. In other words, it's doing the opposite of what you're trying to do.
You could probably do this with some simple pattern recognition: if your documents are formatted in a similar way, with the authors listed together, I would start by exploiting that. You could use the NER to tag the authors as PERSON, and then use that tag as a feature in your own tagging.
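A hedged sketch of that idea, assuming the author block is the first few lines of each document (the header-size heuristic and the model name are assumptions):

# Sketch: tag PERSON spans with a pretrained model, then mark those
# appearing in the assumed header region as AUTHOR.
import spacy

nlp = spacy.load("en_core_web_sm")

def tag_authors(text, header_lines=3):
    # characters covered by the first header_lines lines (+1 per newline)
    header_end = sum(len(l) + 1 for l in text.splitlines()[:header_lines])
    return [(ent.text, "AUTHOR" if ent.start_char < header_end else "PERSON")
            for ent in nlp(text).ents if ent.label_ == "PERSON"]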

Named entity recognition with NLTK or Stanford NER using custom corpus

I am trying to train a NER model for an Indian language with a custom NE (named entity) dictionary for chunking. I have looked at NLTK and Stanford NER respectively:
NLTK
I found that nltk.chunk.named_entity.NEChunkParser is able to train on a custom corpus. However, the format of the training corpus is not specified in the documentation or in the comments of the source code.
Where could I find some guide to the custom corpus for NER in NLTK?
Stanford NER
According to this question, the Stanford NER FAQ gives directions on how to train a custom NER model.
One of my major concerns is that the default Stanford NER does not support Indian languages. So is it viable to feed an Indian-language NER corpus to the model?
Your training corpus needs to be in a .tsv (tab-separated values) file.
The file should look somewhat like this:
John PER
works O
at O
Intel ORG
This is just a representation of the data, as I do not know which Indian language you are targeting. But your data must always be tab-separated values: the first column is the token and the second its associated label.
I have tried NER by building my own custom data (in English, though) and have built a model.
So I guess it's pretty much possible for Indian languages as well.
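For the training step itself, the Stanford NER FAQ describes a properties file along the lines of the sketch below (file names are placeholders; the feature flags follow the FAQ's example):

# Sketch of a Stanford NER training properties file (per the FAQ).
trainFile = train.tsv
serializeTo = my-indic-ner-model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

Training is then run with something like: java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop train.prop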

Is there a way to use french in Stanford CoreNLP sentiment analysis?

I am aware that only the English model is available for sentiment analysis, but I found edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz in stanford-parser-3.5.2-models.jar. I'm actually looking at https://github.com/stanfordnlp/CoreNLP. Is it possible to use this model instead of englishPCFG.ser.gz with CoreNLP, and if so, how?
CoreNLP does not include sentiment models for languages other than English. While we do ship French parser models, there is no available French sentiment model to use with the parser.
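If you only need French parsing (without sentiment), a properties sketch along these lines should work; the model paths follow the 3.5.x-era French models jar and may differ between versions:

# Hedged sketch: pointing CoreNLP at French models (parsing only).
annotators = tokenize, ssplit, pos, parse
tokenize.language = fr
pos.model = edu/stanford/nlp/models/pos-tagger/french/french.tagger
parse.model = edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz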
You may be able to find French sentiment analysis training data. There is plenty of information available about how to do this if you're interested; see e.g. this SO post.
