How to train a custom model for Speech to Text in Cognitive Services? - speech-to-text

We are building a Speech to Text application. The conversations are always in Dutch, but in some cases English and Dutch words are the same. How can I train my model to handle this?

There are different ways to approach this task:
Train the model with audio samples of Dutch (Belgium or Standard) together with the related transcripts.
Without any audio files, provide a text file of the language to train the model.
Default settings can be applied, such as the train/test sampling separation: check the sample count and divide the sets.
Create a training file with a few sentences (repeated content is also acceptable) and train the model with that file. Based on the language priority, the file has to contain the related Dutch and English words.
The following can help you to create a pronunciation file:
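For illustration, a minimal pronunciation file is tab-separated, one entry per line, with the display form first and the spoken form second (this sample follows the format described in the Custom Speech documentation; the entries themselves are just examples):

3CPO	three c p o
CNTK	c n t k
IEEE	i triple e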

Related

Named Entity Recognition Systems for German Texts

I am working on a Named Entity Recognition (NER) project in which I have a large amount of text, in the sense that it is too much to read or even skim. Therefore, I want to create an overview of what is mentioned by extracting named entities (places, names, times, maybe topics) and build an index of the kind (entity, list of pages/lines where it is mentioned). I have worked through Stanford's NLP lecture and (parts of) Eisenstein's Introduction to NLP book, and found some literature and systems for English texts. As my corpus is in German, I would like to ask how I can approach this problem. Also, this is my first NLP project, so I would not know if I could solve this challenge even if the texts were in English.
As a first step: are there German NER systems out there which I could use?
The further roadmap of my project is:
How can I avoid mapping misspellings or rare names to a NUL/UNK token? This is relevant because there are also some historic passages that use words no longer in use, or that follow old orthography. I think the relevant terms are tokenisation and stemming.
I thought about fine-tuning or transfer learning the base NER model on a corpus of historic texts to improve NER.
A major challenge is that there is no annotated dataset available for my corpus, and I could only manually annotate a tiny fraction of it. So I would be happy for hints on annotated German datasets which I could incorporate into my project.
Thank you in advance for your inputs and fruitful discussions.
Most good NLP toolkits can perform NER in German:
Stanford NLP
spaCy
probably NLTK and OpenNLP as well
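For example, a minimal sketch with spaCy's small German model (assuming it has been installed with "python -m spacy download de_core_news_sm"):

import spacy

# load the pretrained German pipeline (trained on standard news text)
nlp = spacy.load("de_core_news_sm")

doc = nlp("Angela Merkel besuchte im Juli das Werk von Siemens in München.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Angela Merkel PER, Siemens ORG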
What is crucial to understand is that using NER software like the above means using a pretrained model, i.e. a model which has been previously trained on some standard corpus with standard annotated entities.
By the way, you can usually find the original annotated dataset by looking at the documentation. There's one NER corpus here.
This is convenient and might suit your goal, but sometimes it doesn't cover exactly everything that you would like it to, especially if your corpus is from a very specific domain. If you need more specific NER, you must train your own model, and this requires obtaining some annotated data (i.e. manually annotating or paying somebody to do it).
Even in this case, a NER model is statistical and it will unavoidably make some mistakes; don't expect perfect results.
About misspellings or rare names: a NER model doesn't care (or not much) about the actual entity, because it's not primarily based on the words in the entity. It's based on indications in the surrounding text; for example, in the sentence "It was announced by Mr XYZ that the event would take place in July", the NER model should find 'Mr XYZ' as a person due to "announced by" and 'July' as a date because of "take place in". However, if the language used in the corpus is very different from the training data used for the model, the performance could be very bad.

Train MS Custom Speech model to recognize dashed ids

I want to enable my Microsoft Custom Speech model to recognize designators containing numbers, characters, and dashes, something like this: 12-34 A-56 B78.
The speech model recognizes numbers and characters correctly. Is there a way to train it so it would output the string 12-34 A-56 B78 when I say "twelve thirty-four a fifty-six b seventy-eight"? I need this for a German speech model.
I've already tried training a model with 10,000 randomly generated strings like the one above, using them as related text.
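A sketch of such a generator, assuming the two-digit/letter pattern from the example above (the exact designator grammar is an assumption):

import random
import string

def random_designator() -> str:
    """Generate a string like '12-34 A-56 B78' (pattern assumed
    from the example; adjust to the real designator grammar)."""
    two = lambda: f"{random.randint(0, 99):02d}"
    letter = lambda: random.choice(string.ascii_uppercase)
    return f"{two()}-{two()} {letter()}-{two()} {letter()}{two()}"

# write 10,000 samples as related text for training
with open("related_text.txt", "w") as f:
    for _ in range(10000):
        f.write(random_designator() + "\n")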
Thanks in advance
These are very specific format requirements. Unfortunately, it is currently not possible to get results exactly like this from the Speech service. I suggest doing some post-processing on the results to format them this way.
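As an illustration, a post-processing sketch along these lines (assuming the service, with inverse text normalization enabled, already returns the digits as tokens like "12 34 a 56 b 78"; whether a given position gets a dash or no separator, as in A-56 versus B78, has to come from your own domain rules):

def format_designator(raw: str) -> str:
    """Join recognized tokens such as '12 34 a 56 b 78'
    into a dashed designator such as '12-34 A-56 B-78'."""
    tokens = raw.upper().split()
    parts = []
    i = 0
    while i < len(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # a two-digit group or a letter followed by digits gets a dash
        if nxt.isdigit() and (tokens[i].isdigit() or tokens[i].isalpha()):
            parts.append(tokens[i] + "-" + nxt)
            i += 2
        else:
            parts.append(tokens[i])
            i += 1
    return " ".join(parts)

print(format_designator("12 34 a 56 b 78"))  # -> 12-34 A-56 B-78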

Convert text file data into a dataset for natural language processing

I have to build a model for natural language processing. For that I have data in a text file, and I want to convert that data into a dataset for an NLP model which predicts whether the content is good or bad.
I have taken a sample of email data which contains some information, and it is in a text file.
I want to know how I can use that text file as a dataset for my NLP model. Please help me out.
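A rough sketch of a common first step, assuming each line of the text file holds a good/bad label and the email text separated by a tab (the file name and label format here are assumptions for illustration):

import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts, labels = [], []
# assumed format: each line is "<label><TAB><email text>"
with open("emails.txt", encoding="utf-8") as f:
    for label, text in csv.reader(f, delimiter="\t"):
        labels.append(label)
        texts.append(text)

# turn the raw text into numeric features and fit a simple classifier
X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression().fit(X, labels)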

Named entity recognition with NLTK or Stanford NER using custom corpus

I am trying to train an NER model for an Indian language with a custom NE (named entity) dictionary for chunking. I refer to NLTK and Stanford NER respectively:
NLTK
I found that nltk.chunk.named_entity.NEChunkParser is able to train on a custom corpus. However, the format of the training corpus was not specified in the documentation or the comments of the source code.
Where could I find some guide to the custom corpus for NER in NLTK?
Stanford NER
According to the question, the FAQ of Stanford NER gives directions on how to train a custom NER model.
One of the major concerns is that the default Stanford NER does not support Indian languages. So is it viable to feed an Indian NER corpus to the model?
Your training corpus needs to be a .tsv file.
The file should look somewhat like this:
John PER
works O
at O
Intel ORG
This is just a representation of the data, as I do not know which Indian language you are targeting. But your data must always be tab-separated values: the first column is the token and the second is its associated label.
I have tried NER by building my own custom data (in English, though) and have built a model.
So I guess it's pretty much possible for Indian languages also.
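For reference, the Stanford NER FAQ trains a CRF model from such a .tsv file using a properties file roughly like this (a sketch with a minimal subset of the feature flags; the file names are placeholders):

trainFile = train.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1
useWord = true
useNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true

Training is then started with:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop my.prop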

Building Speech Recognition for a closed vocabulary

I can create voice recognition for my limited set of words using the following link.
http://www.speech.cs.cmu.edu/tools/lmtool-new.html
But how do I give feedback to the language model to train it better for my voice?
For example, the phonetic values in .dic files are for an American accent (I want to train it for an Indian accent).
A language model has nothing to do with voice; it operates on words. Use SphinxTrain to tailor the acoustic model to the accent you need, and read how to adapt an existing model or create a new one.
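As a small illustration of the dictionary side (separate from acoustic adaptation), a CMU-style .dic file can carry alternative pronunciations per word, so an accent-specific variant can be added as a numbered entry (the phone sequences below are made-up examples in CMUdict notation):

HELLO HH AH L OW
HELLO(2) HH EH L OW
WATER W AO T ER
WATER(2) W AA T AH R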
