Can the Stanford NER tagger process Hindi or Nepali? - nlp

I would like to train an NER model using the stanford-ner.jar CRFClassifier for Nepali or Hindi. Can I simply use the Java command line mentioned here?

Yes, if you supply training data you can produce a new model. Note that when running the NER system, you will need to tokenize the text in the same way it was tokenized for training.
There is some more info about training NER models here: https://stanfordnlp.github.io/CoreNLP/ner.html
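Concretely, the FAQ linked above drives training from a properties file and a single java command. A minimal sketch of such a file; the file names are placeholders, and the feature flags are the commonly cited defaults from that FAQ, not settings given in this thread:

```
# train.prop -- illustrative property file for CRFClassifier training
# trainFile must be tab-separated: token TAB label, one token per line
trainFile = nepali-train.tsv
serializeTo = nepali-ner-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
wordShape = chris2useLC
```

Training would then be invoked along the lines of `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop train.prop`.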

Related

Retrain the multi-language NER model (ner_ontonotes_bert_mult) from DeepPavlov with a dataset in a different language

I have successfully installed the multi-language NER model from DeepPavlov (ner_ontonotes_bert_mult). I want to retrain this model with new data (in the same format as suggested on the documentation page) that is in Albanian. Is this possible (retraining the multi-language NER model from DeepPavlov with data in a different language), or does retraining only work with English data?
Yes, you can fine-tune the model on any language that was used for Multilingual BERT training: https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages.
It is also possible to fine-tune on languages that are not on that list, provided the multilingual vocabulary has good coverage of your language.
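In practice, retraining mostly comes down to supplying files in the expected layout. A minimal sketch, assuming the CoNLL-style layout the DeepPavlov NER docs describe (one token and tag per line separated by whitespace, sentences separated by blank lines); the Albanian tokens, tags, and file path below are toy placeholders:

```python
import os
import tempfile

# Toy Albanian sentences as (token, tag) pairs; real data would come
# from an annotated corpus.
sentences = [
    [("Tirana", "B-LOC"), ("eshte", "O"), ("kryeqytet", "O")],
    [("Skenderbeu", "B-PER"), ("ishte", "O"), ("hero", "O")],
]

# One token and tag per line, blank line between sentences.
path = os.path.join(tempfile.gettempdir(), "train.txt")
with open(path, "w", encoding="utf-8") as f:
    for sent in sentences:
        for token, tag in sent:
            f.write(f"{token} {tag}\n")
        f.write("\n")
```

The resulting `train.txt` (plus matching `valid.txt` and `test.txt`) is what the config's dataset reader would be pointed at.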

What is the difference between IBM NL Classifier and NLU custom model classification?

NL Classifier is trained on text (probably short text), while an NLU custom model can also be trained on custom data for classification.
Does anyone know what the differences are?
Thanks
You would typically use NLC to train a model on a corpus so that it can recognise intent.
NLU uses a lexicon-based corpus to pull out information such as language semantics, and to identify entities, keywords, concepts and so on.

Is there a "best" tokenization for NER training in OpenNLP?

Is there a "best" tokenization for NER training in OpenNLP? I noticed that OpenNLP provides a max-entropy tokenizer that allows you to tokenize based on a trained model. I also noticed that OpenNLP provides a simple tokenizer. If I use the same tokenizer during runtime that I used to train my model, does it matter which tokenizer I use?
I would rather use the simple tokenizer for my application.
For most applications the quality of your tokenizer isn't very important, and as long as you use the same one during training and afterwards, you should be fine.
However, the only way to be sure is to try the different tokenizers and compare the results - for some applications the difference between a good tokenizer and a great one may matter.
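The alignment issue behind "use the same tokenizer" is easy to see with two toy tokenizers (plain Python for illustration, not OpenNLP's actual implementations): they split the same sentence into different numbers of tokens, so per-token labels produced against one tokenization cannot be applied to the other:

```python
import re

def simple_tokenize(text):
    # whitespace-only splitting, like a "simple" tokenizer
    return text.split()

def punct_tokenize(text):
    # splits punctuation off as separate tokens
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Dr. Smith works at OpenNLP, Inc."
a = simple_tokenize(sentence)   # 6 tokens, e.g. "OpenNLP," kept whole
b = punct_tokenize(sentence)    # 9 tokens, punctuation split off
# A tag-per-token model trained on one tokenization cannot be applied
# to the other: the sequences do not even have the same length.
print(len(a), len(b))
```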

How to train new labels in NLTK for named entity recognition

I am new to Python. I need to extract job titles from text, and I need to know how to train data for named entity recognition, and where to train it.
To train a named-entity recognizer or other chunker with custom categories (including job titles), you need a corpus that is annotated with the categories you are interested in. Then you can read the NLTK book, especially section 7.2 on chunking, which shows you how to train a chunker with NLTK.
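For illustration, chapter 7 of the NLTK book represents chunked training data as (token, POS, IOB-tag) triples. A hedged sketch of that annotation style, with a made-up JOB_TITLE category and a small pure-Python helper that recovers the annotated spans (not NLTK's own API):

```python
# IOB-style annotation for a hypothetical JOB_TITLE category:
# B- marks the beginning of a span, I- its continuation, O is outside.
annotated = [
    ("Maria", "NNP", "O"),
    ("is", "VBZ", "O"),
    ("a", "DT", "O"),
    ("senior", "JJ", "B-JOB_TITLE"),
    ("data", "NN", "I-JOB_TITLE"),
    ("engineer", "NN", "I-JOB_TITLE"),
    (".", ".", "O"),
]

def extract_spans(tagged, category):
    """Collect contiguous B-/I- runs of `category` as strings."""
    spans, current = [], []
    for token, _pos, iob in tagged:
        if iob == "B-" + category:
            if current:
                spans.append(current)
            current = [token]
        elif iob == "I-" + category and current:
            current.append(token)
        else:
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return [" ".join(s) for s in spans]

print(extract_spans(annotated, "JOB_TITLE"))  # ['senior data engineer']
```

A corpus annotated this way, with your own category names, is what the chunker-training recipes in the NLTK book consume.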

Named entity recognition with NLTK or Stanford NER using custom corpus

I am trying to train an NER model for an Indian language with a custom NE (named entity) dictionary for chunking. I looked at NLTK and Stanford NER respectively:
NLTK
I found that nltk.chunk.named_entity.NEChunkParser can be trained on a custom corpus. However, the format of the training corpus is not specified in the documentation or in the comments of the source code.
Where could I find some guide to the custom corpus for NER in NLTK?
Stanford NER
The Stanford NER FAQ gives directions on how to train a custom NER model.
One major concern is that the default Stanford NER models do not support Indian languages. So is it viable to feed an Indian-language NER corpus to the model?
Your training corpus needs to be a .tsv file.
The file should look something like this:
John PER
works O
at O
Intel ORG
This is just a representation of the data, as I do not know which Indian language you are targeting. Your data must always be tab-separated values: the first column is the token, and the second is its associated label.
I have tried NER by building my own custom data (in English, though) and have built a model, so I would guess it is quite possible for Indian languages as well.
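A small sketch of producing that layout programmatically (pure Python; the tokens and labels are the ones from the example above, and the file path is a placeholder):

```python
import os
import tempfile

# (token, label) pairs matching the example above; for an Indian
# language, the tokens would simply be in that language's script.
sentence = [("John", "PER"), ("works", "O"), ("at", "O"), ("Intel", "ORG")]

# One token per line, token TAB label, as CRFClassifier expects.
path = os.path.join(tempfile.gettempdir(), "train.tsv")
with open(path, "w", encoding="utf-8") as f:
    for token, label in sentence:
        f.write(f"{token}\t{label}\n")

# Sanity check: every line must contain exactly one tab.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
assert all(line.count("\t") == 1 for line in lines)
print(lines[0])
```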
