I am developing an ASR system for my local Kashmiri language. I have already done some work: I collected data and trained a CNN model on it, but the accuracy was not good. Now I want to change strategy and work at the phoneme level. Can anybody suggest a way to do this?
Thanks in advance.
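One common first step for phoneme-level ASR is converting word-level transcripts into phoneme-sequence targets via a pronunciation lexicon or a grapheme-to-phoneme (G2P) tool. Below is a minimal sketch of that mapping in plain Python; the lexicon entries and phoneme symbols are invented for illustration only and are not a real Kashmiri phoneme inventory.

```python
# Sketch: turning word-level transcripts into phoneme-level training targets.
# The lexicon below is a tiny HYPOTHETICAL example -- in practice you would
# build it from a pronunciation dictionary or a G2P tool for your language.

from collections import OrderedDict

PHONE_LEXICON = {
    "salaam": ["s", "a", "l", "aa", "m"],
    "asalaam": ["a", "s", "a", "l", "aa", "m"],
}

def transcript_to_phonemes(transcript, lexicon, unk="<unk>"):
    """Map a space-separated transcript to a flat phoneme sequence."""
    phones = []
    for word in transcript.lower().split():
        phones.extend(lexicon.get(word, [unk]))
    return phones

def build_phoneme_vocab(lexicon):
    """Collect the phoneme inventory and assign integer ids for training."""
    phones = sorted({p for prons in lexicon.values() for p in prons})
    # Reserve 0 for a blank symbol (useful for CTC training), 1 for unknowns.
    vocab = OrderedDict([("<blank>", 0), ("<unk>", 1)])
    for p in phones:
        vocab[p] = len(vocab)
    return vocab

vocab = build_phoneme_vocab(PHONE_LEXICON)
seq = transcript_to_phonemes("salaam", PHONE_LEXICON)
ids = [vocab[p] for p in seq]  # integer targets for an acoustic model
```

With targets like these, the acoustic model predicts phoneme ids per audio frame (typically trained with a CTC loss), and a lexicon plus language model maps phoneme sequences back to words at decoding time.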
Related
I want to create a language translation model using transformers. However, TensorFlow seems to only have a BERT model for English: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4 . If I want a BERT for another language, what is the best way to go about it? Should I create a new BERT, or can I train TensorFlow's own BertTokenizer on another language?
The Hugging Face model hub contains a plethora of pre-trained monolingual and multilingual transformers (and relevant tokenizers) which can be fine-tuned for your downstream task.
However, if you are unable to locate a suitable model for your language, then yes, training from scratch is the only option. Beware, though, that training from scratch is a resource-intensive task that requires significant compute power. Here is an excellent blog post to get you started.
I would like to know if it is possible to reuse gpt-3 in a different language, Spanish in this case.
Do I need a gpt-3 model specifically trained with a Spanish corpus, or can I use transfer learning to produce Spanish text?
GPT-3 is only available via an API, and only to people who apply for access. The model is too big to run locally on any reasonable hardware, so fine-tuning is hardly an option.
Given how well GPT-3 performs at machine translation, my guess is that it will work reasonably well for Spanish by default. However, if your task is text classification, you can do a much better job with a pre-trained BERT-like model; Hugging Face's Transformers already has several models for Spanish.
I'm going to implement a translator based on NMT (Neural Machine Translation). Here, I hope to use only monolingual corpora, without any parallel corpus data, for my dataset. Is it possible to train the model using only monolingual data? I would be grateful if someone could share ideas on this.
Good day, I am a student interested in NLP. I came across the demo on AllenNLP's homepage, which states:
The model is a simple LSTM using GloVe embeddings that is trained on the binary classification setting of the Stanford Sentiment Treebank. It achieves about 87% accuracy on the test set.
Is there any reference to sample code or a tutorial that I can follow to replicate this result, so that I can learn more about this subject? I am trying to obtain a regression output instead of a classification one.
I hope that someone can point me in the right direction. Any help is much appreciated. Thank you!
AllenAI provides all of the code for its examples and libraries open source on GitHub, including AllenNLP.
I found exactly how the example was run here: https://github.com/allenai/allennlp/blob/master/allennlp/tests/data/dataset_readers/stanford_sentiment_tree_bank_test.py
However, to make it a regression task, you'll have to work directly in PyTorch, which is the underlying framework of AllenNLP.
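The change is mostly confined to the model's head and loss: instead of projecting the LSTM's final state to class logits and using cross-entropy, project it to a single value and train with mean squared error. A minimal PyTorch sketch (all layer sizes are arbitrary placeholders, and this is not AllenNLP's own model code):

```python
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    """LSTM sentence encoder with a single-value regression head.

    Mirrors the classification setup described above, but the final
    linear layer has one output and the loss is MSE instead of
    cross-entropy. Sizes here are placeholders.
    """

    def __init__(self, vocab_size=1000, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # one output -> regression

    def forward(self, token_ids):
        embedded = self.embed(token_ids)       # (batch, seq, embed)
        _, (h_n, _) = self.lstm(embedded)      # final hidden state
        return self.head(h_n[-1]).squeeze(-1)  # (batch,) real-valued scores

model = LSTMRegressor()
tokens = torch.randint(0, 1000, (4, 12))    # a fake batch of 4 sentences
scores = model(tokens)                      # continuous sentiment scores
loss = nn.MSELoss()(scores, torch.rand(4))  # regression target in [0, 1]
```

In the SST case, the regression target would be the treebank's fine-grained sentiment score rather than the binary label; swapping GloVe vectors into the embedding layer works the same as in the classification model.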
Please suggest a good machine learning classifier for truecasing a dataset.
Also, is it possible to specify our own rules/features for truecasing in such a classifier? Thanks for all your suggestions.
I implemented a version of a truecaser in Python. It can be trained for any language when you provide enough data (i.e. correctly cased sentences).
For English, it achieves an accuracy of 98.38% on sample sentences from Wikipedia. A pre-trained model for English is provided.
You can find it here:
https://github.com/nreimers/truecaser
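To illustrate the idea (this is a minimal sketch of a frequency-based truecaser, not the linked repository's actual API), the core of such a system can be a per-word distribution over observed casings, with optional hand-written rules layered on top, which also answers the question about custom rules:

```python
from collections import Counter, defaultdict

class SimpleTruecaser:
    """Minimal frequency-based truecaser (illustrative sketch only).

    Training counts every cased form of each word in correctly cased text;
    at prediction time, each word is replaced by its most frequent casing.
    Hand-written rules override the learned statistics.
    """

    def __init__(self, rules=None):
        self.counts = defaultdict(Counter)  # lowercase word -> cased forms
        self.rules = rules or {}            # lowercase word -> forced casing

    def train(self, sentences):
        for sentence in sentences:
            # Skip the first token: its capital may only mark sentence start.
            for word in sentence.split()[1:]:
                self.counts[word.lower()][word] += 1

    def truecase(self, sentence):
        out = []
        for word in sentence.lower().split():
            if word in self.rules:
                out.append(self.rules[word])
            elif self.counts[word]:
                out.append(self.counts[word].most_common(1)[0][0])
            else:
                out.append(word)
        text = " ".join(out)
        # Capitalize the sentence-initial character.
        return text[:1].upper() + text[1:] if text else text

tc = SimpleTruecaser(rules={"nlp": "NLP"})
tc.train(["We visited Paris last spring .",
          "She moved to Paris in 2019 ."])
print(tc.truecase("he likes nlp and paris ."))  # He likes NLP and Paris .
```

A real system would add smoothing, context features (surrounding words, position), and a proper sequence model, but the rule-override mechanism shown here is how custom rules are typically injected.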
Please take a look at this whitepaper.
http://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf
They report 98% accuracy.