Building Speech Recognition for a closed vocabulary - cmusphinx

I can create voice recognition for my limited set of words using the following link.
http://www.speech.cs.cmu.edu/tools/lmtool-new.html
But how do I give feedback to the language model so that it trains better for my voice?
For example, the phonetic values in the .dic files are for an American accent (I want to train it for an Indian accent).

The language model has nothing to do with voice; it operates on words. Use SphinxTrain to tailor the acoustic model to the accent you need, and read up on how to adapt an existing model or create a new one.

Related

Which models are best for Named Entity Recognition in Gujarati text?

I am looking for the best-performing models for Named Entity Recognition in Gujarati text. I know of only one, the IndicBERT model on Hugging Face. Can anyone suggest other models with documentation or code available for Named Entity Recognition in Gujarati?
I found only the IndicBERT model on Hugging Face. I would like to know of other models, or any link where code for Named Entity Recognition is available.
The recent work by Joshi [1] offers L3Cube-GujaratiBERT, available on Hugging Face here. You'll have to fine-tune the model on your specific downstream task (i.e. Named Entity Recognition in Gujarati). There is a list of Indic NER datasets here; of particular relevance to your problem is the AI4Bharat Naamapadam dataset, which has Gujarati as one of its 11 available Indic languages.
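One step that usually comes up when fine-tuning a BERT-style model for NER is that the dataset labels are per word while the model sees subword tokens. The sketch below shows one common way to align them, in plain Python. It assumes you already have a `word_ids` list of the kind a Hugging Face fast tokenizer returns (one word index per token, `None` for special tokens); the label ids and the example input are made up for illustration.

```python
from typing import List, Optional

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Spread word-level NER labels over subword tokens.

    Only the first subword of each word keeps the word's label; special
    tokens (word id None) and continuation subwords get IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)      # special token such as [CLS] or [SEP]
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword of a new word
        else:
            aligned.append(IGNORE_INDEX)      # continuation subword
        previous = wid
    return aligned


# Hypothetical example: three words, the second one split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 0, 3]  # made-up ids, e.g. 1 = B-PER, 0 = O, 3 = B-LOC
print(align_labels(word_ids, word_labels))
```

Labelling only the first subword is a design choice; another option is to copy the label to every subword of the word.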
Additional Info
In [1], Joshi initially created the L3Cube-HindBERT and L3Cube-DevBERT models pre-trained on Hindi and Devanagari script (Hindi + Marathi) monolingual corpora, respectively. These offered a modest improvement in performance over the alternative MuRIL, IndicBERT and XLM-R multi-lingual offerings. Given the improvement, the author released other Indic language-based models, namely: Kannada, Telugu, Malayalam, Tamil, Gujarati, Assamese, Odia, Bengali, and Punjabi (all can be found at https://huggingface.co/l3cube-pune).
References
[1] Joshi, R., 2022. L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages. arXiv preprint arXiv:2211.11418.

Named Entity Recognition Systems for German Texts

I am working on a Named Entity Recognition (NER) project in which I have a large amount of text, too much to read or even skim. Therefore, I want to create an overview of what is mentioned by extracting named entities (places, names, times, maybe topics) and create an index of the kind (entity, list of pages/lines where it is mentioned). I have worked through Stanford's NLP lecture and (parts of) Eisenstein's Introduction to NLP book, and found some literature and systems for English texts. As my corpus is in German, I would like to ask how I can approach this problem. Also, this is my first NLP project, so I would not know whether I could solve this challenge even if the texts were in English.
As a first step:
Are there German NER systems out there which I could use?
The further roadmap of my project is:
How can I avoid mapping misspellings or rare names to a NUL/UNK token? This is relevant because there are also some historic passages that use words no longer in use or that follow old orthography. I think the relevant terms are tokenisation or stemming.
I thought about fine-tuning or transfer learning the base NER model to a corpus of historic texts to improve NER.
A major challenge is that there is no annotated dataset for my corpus available and I could only manually annotate a tiny fraction of it. So I would be happy for hints on German annotated datasets which I could incorporate into my project.
Thank you in advance for your inputs and fruitful discussions.
Most good NLP toolkits can perform NER in German:
Stanford NLP
spaCy
probably NLTK and OpenNLP as well
What is crucial to understand is that using NER software like the above means using a pretrained model, i.e. a model which has been previously trained on some standard corpus with standard annotated entities.
By the way, you can usually find the original annotated dataset by looking at the documentation. There's one NER corpus here.
This is convenient and might suit your goal, but sometimes it doesn't collect exactly everything that you would like it to collect, especially if your corpus is from a very specific domain. If you need more specific NER, you must train your own model, and this requires obtaining some annotated data (i.e. manually annotating it or paying somebody to do it).
Even in this case, a NER model is statistical and will unavoidably make some mistakes, so don't expect perfect results.
About misspellings or rare names: a NER model doesn't care (or not too much) about the actual entity, because it's not primarily based on the words in the entity. It's based on indications in the surrounding text, for example in the sentence "It was announced by Mr XYZ that the event would take place in July", the NER model should find 'Mr XYZ' as a person due to "announced by" and 'July' as a date because of "take place in". However if the language used in the corpus is very different from the training data used for the model, the performance could be very bad.
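The index described in the question (entity, list of pages where it is mentioned) is independent of which NER toolkit produces the entities. A minimal sketch, assuming the NER step has already been run page by page and yields `(page, entity text, label)` tuples; the function name and the sample mentions are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Mention = Tuple[int, str, str]  # (page number, entity text, entity label)


def build_entity_index(mentions: Iterable[Mention]) -> Dict[Tuple[str, str], List[int]]:
    """Map (entity text, label) to the sorted, de-duplicated pages where it occurs."""
    pages = defaultdict(set)
    for page, text, label in mentions:
        pages[(text, label)].add(page)
    # Sort the keys so the index reads alphabetically, and the pages numerically.
    return {key: sorted(pages[key]) for key in sorted(pages)}


# Hypothetical output of a NER model: not real results from any toolkit.
mentions = [
    (3, "Berlin", "LOC"),
    (1, "Berlin", "LOC"),
    (1, "Mr XYZ", "PER"),
    (3, "Berlin", "LOC"),
    (7, "July", "DATE"),
]

index = build_entity_index(mentions)
for (text, label), page_list in index.items():
    print(f"{text} [{label}]: {page_list}")
```

Keying on the raw entity text means that spelling variants (for example old orthography) end up as separate entries; merging them would need an extra normalization step that is not shown here.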

How to train a custom model for speech to text Cognitive Services?

We are building a speech-to-text application. The conversations are always in Dutch, but in some cases English and Dutch words are the same. How can I train my model to handle this?
There are different ways to do the task:
Train the model with audio samples in Dutch (Belgian or standard) together with the related transcripts.
Without any audio files, provide a text file in the target language to train the model.
Default settings can be applied, such as the separation into training and test samples; check the sample count and divide the sets.
Create a training file with a few sentences (repeated content is also acceptable) and train the model with that file. Depending on the language priority, the file has to contain the relevant Dutch and English words.
The following can help you create a pronunciation file

speech to text training for impaired voice

I want to train and use an ML-based personal voice-to-text converter for a highly impaired voice, for a small set of 300-400 words. This is to be used by people with a voice impairment. It cannot be generic, because each person will have a unique voice input for each word, depending on their type of impairment.
I would like to know whether there are any ML engines which allow for such training. If not, what is the best approach to go about it?
Thanks
Most speech recognition engines support training (wav2letter, DeepSpeech, ESPnet, Kaldi, etc.); you just need to feed in the data. The only issue is that you need a lot of data to train reliably (thousands of samples for each word). You can check the Google Speech Commands dataset for an example of how to train from scratch.
Since the training dataset will be pretty small in your case and will consist of just a few samples, you can probably start with an existing pretrained model and fine-tune it on your samples to get the best accuracy. You need to look at "few-shot learning" setups.
You can probably look at the wav2vec 2.0 pretrained model; it should be effective for such learning. You can find examples and commands for fine-tuning and inference here.
You can also try fine-tuning Jasper models on Google Speech Commands with NVIDIA NeMo. It might be a little less effective but could still work and should be easier to set up.
I highly recommend watching the YouTube Originals series "The Age of A.I.", season one, episode two.
Basically, Google has already done this for people with an impaired voice who can't really form normal words. It is very interesting and talks a little about how they did it, and are still doing it, with ML technologies.

Text processing tool for tweets

I am collecting millions of sports related tweets daily. I want to process the text in those tweets. I want to recognize the entities, find the sentiment of the sentence and find the events in those tweets.
Entity recognizing :
For example :
"Rooney will play for England in their next match".
From this tweet I want to recognize the person entity "Rooney" and the place entity "England".
sentiment analysis:
I want to find the sentiment of a sentence. For example
Chelsea played their worst game ever
Ronaldo scored a beautiful goal
The first one should be marked as a "negative" sentence and the latter as "positive".
Event recognizing :
I want to find "goal scoring events" in tweets. Sentences like "messi scored goal in first half" and "that was a fantastic goal from gerrald" should be marked as a "goal scoring event".
I know that entity recognition and sentiment analysis tools are available and that I need to write the rules for event recognition myself. I have seen many tools, such as Stanford NER, AlchemyAPI, OpenCalais, the MeaningCloud API, LingPipe, Illinois, etc.
I'm really confused about which tool I should select. Are there any free tools available without daily rate limits? I want to process millions of tweets daily, and Java is my preferred language.
Thanks.
For NER you can also use TwitIE, which is a GATE pipeline, so you can call it through the GATE API in Java.
Given that your preferred language is Java, I would strongly suggest starting with the Stanford NLP project. Most of your basic needs, like cleansing, chunking and NER, can be covered with it. For NER click here.
Moving on, for sentiment analysis you can start with simple classifiers such as Naive Bayes and then add complexity. More here.
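A minimal sketch of the Naive Bayes starting point mentioned above, in plain Python with add-one smoothing (Python is used only for brevity; the logic ports directly to Java). The first two training sentences come from the question; the other two, the whitespace tokenizer and the class name are illustrative assumptions, and a real system would need far more labeled tweets.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial Naive Bayes over lowercased, whitespace-split words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()               # label -> number of training sentences
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log prior + sum of log likelihoods with add-one (Laplace) smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = NaiveBayesSentiment()
clf.train("Chelsea played their worst game ever", "negative")  # from the question
clf.train("Ronaldo scored a beautiful goal", "positive")       # from the question
clf.train("that was a terrible match", "negative")             # made-up example
clf.train("what a fantastic goal", "positive")                 # made-up example

print(clf.classify("a beautiful goal"))
print(clf.classify("their worst match"))
```

With so little data the classifier only reflects which words it has seen per class; it is meant as a baseline to replace or extend, not as a usable model.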
For event extraction, you can use a linguistic approach: identify the verbs in each sentence and map them to the event types in an ontology that you define yourself.
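As a rough illustration of such rules, the sketch below maps trigger words to event types and flags a tweet when any trigger occurs. The trigger lists and the second event type are made-up assumptions, not a real ontology, and matching on surface words rather than parsed verbs is a deliberate simplification.

```python
import re
from typing import Dict, List, Set

# Hypothetical mini-ontology: event type -> trigger words (illustrative only).
EVENT_TRIGGERS: Dict[str, Set[str]] = {
    "goal scoring event": {"goal", "goals", "scored", "scores"},
    "substitution event": {"substituted", "substitution"},
}


def detect_events(tweet: str) -> List[str]:
    """Return the event types whose trigger words appear in the tweet."""
    tokens = set(re.findall(r"[a-z']+", tweet.lower()))
    return sorted(
        event for event, triggers in EVENT_TRIGGERS.items() if tokens & triggers
    )


for tweet in [
    "messi scored goal in first half",
    "that was a fantastic goal from gerrald",
    "Rooney will play for England in their next match",
]:
    print(tweet, "->", detect_events(tweet))
```

Pure keyword rules will also fire on tweets like "he missed an open goal", which is where the verb and dependency information from a parser becomes useful.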
Just remember, this is only meant to get you started and is by no means an exhaustive answer.
There is no API available with unlimited calls. If you want to stick with Java, use the Stanford package and customize it to your needs.
If you are comfortable with Python, look at NLTK.
For person and organization entities Stanford will work. For your input query:
Rooney will play for England in their next match
[Text=Rooney CharacterOffsetBegin=0 CharacterOffsetEnd=6 PartOfSpeech=NNP Lemma=Rooney NamedEntityTag=PERSON]
[Text=will CharacterOffsetBegin=7 CharacterOffsetEnd=11 PartOfSpeech=MD Lemma=will NamedEntityTag=O]
[Text=play CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=VB Lemma=play NamedEntityTag=O]
[Text=for CharacterOffsetBegin=17 CharacterOffsetEnd=20 PartOfSpeech=IN Lemma=for NamedEntityTag=O]
[Text=England CharacterOffsetBegin=21 CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=England NamedEntityTag=LOCATION]
[Text=in CharacterOffsetBegin=29 CharacterOffsetEnd=31 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=their CharacterOffsetBegin=32 CharacterOffsetEnd=37 PartOfSpeech=PRP$ Lemma=they NamedEntityTag=O]
[Text=next CharacterOffsetBegin=38 CharacterOffsetEnd=42 PartOfSpeech=JJ Lemma=next NamedEntityTag=O]
[Text=match CharacterOffsetBegin=43 CharacterOffsetEnd=48 PartOfSpeech=NN Lemma=match NamedEntityTag=O]
If you want to add event recognition too, you need to retrain the Stanford package with an extra class, using an event-based dataset. That can help you classify event-based input.
Does the NER use part-of-speech tags?
None of our current models use pos tags by default. This is largely
because the features used by the Stanford POS tagger are very similar
to those used in the NER system, so there is very little benefit to
using POS tags.
However, it certainly is possible to train new models which do use POS
tags. The training data would need to have an extra column with the
tag information, and you would then add tag=X to the map parameter.
check - http://nlp.stanford.edu/software/crf-faq.shtml
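To make the "extra column" from the FAQ concrete, the sketch below writes tab-separated training rows with a POS column between the word and the NER label. The tokens are taken from the tagged output shown earlier in this answer. The column order, and therefore the `map` value `word=0,tag=1,answer=2`, is an assumption based on the FAQ's "add tag=X to the map parameter"; check the FAQ and the feature flags in your properties file before relying on it.

```python
from typing import Iterable, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, NER label)


def to_training_rows(tokens: Iterable[Token]) -> str:
    """Render one token per line as: word <TAB> POS tag <TAB> NER label."""
    return "\n".join("\t".join(token) for token in tokens) + "\n"


# Tokens and tags copied from the Stanford output for the example sentence.
tokens = [
    ("Rooney", "NNP", "PERSON"),
    ("will", "MD", "O"),
    ("play", "VB", "O"),
    ("for", "IN", "O"),
    ("England", "NNP", "LOCATION"),
]

rows = to_training_rows(tokens)

# Assumed properties line matching the column order above (word, tag, answer).
MAP_LINE = "map = word=0,tag=1,answer=2"

print(rows, end="")
print(MAP_LINE)
```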
Stanford NER and OpenNLP are both open source and have models that perform well on formal articles and texts.
But their accuracy drops significantly on Twitter (from 90% recall on formal text to 40% recall on tweets).
The informal nature of tweets (bad capitalization, spelling and punctuation), improper word usage, vernacular language and emoticons make the task more complicated.
NER, sentiment analysis and event extraction over tweets is a well-researched area because of its applications.
Take a look at this: https://github.com/aritter/twitter_nlp, see this demo of twitter NLP and event extraction: http://ec2-54-170-89-29.eu-west-1.compute.amazonaws.com:8000/
Thank you

Resources