I want to build an English acoustic model for children under 14 in China, with a vocabulary of about 800 words, using CMUSphinx.
From my research, some commercial speech engines use thousands of hours of recorded speech to train their acoustic models (Nuance and Google spent 2000+ and 1000+ hours).
I need to achieve an accuracy of about 95%. How many hours of speech do I need in the corpus?
Is it true that the longer the speech corpus is, the better the accuracy will be?
300-400 hours is a good amount of data. Less than 100 will not work.
Increasing the amount of data will not necessarily increase accuracy if the training data itself has systematic issues. However, if you properly analyze the issues in the training data, the result could improve.
If you study machine learning in general, the course will cover these data preparation issues.
Is there a free dataset already available for testing doc2vec? And if I wanted to create my own dataset, what would be an appropriate way to do it?
Assuming you mean the 'Paragraph Vectors' algorithm, which is often called Doc2Vec, any textual dataset is a potential test/demo dataset.
The original papers by the creators of Doc2Vec showed results from applying it to:
movie reviews
search engine summary snippets
Wikipedia articles
scientific articles from Arxiv
People have also used it on…
titles of articles/books
abstracts of larger articles
full news articles or scientific papers
tweets
blogposts or social media posts
resumes
When learning, it's best to pick very simple, common datasets when you're first starting, and then larger datasets that you somewhat understand or that relate to your areas of interest, if you don't already have a sufficient project-related dataset.
Note that the algorithm, like others in the [something]2vec family of algorithms, works best with lots of varied training data: many tens of thousands of unique words each with many contrasting usage examples, over many tens of thousands (or many more) of documents.
If you crank the vector_size way down, & the training-epochs way up, you can eke some hints of its real performance out of smaller datasets of a few hundred contrasting documents. For example, in the Python Gensim library's Doc2Vec intro-tutorial & test-cases, a tiny set of 300 news-summaries (from about 20 years ago, called the 'Lee Corpus') is used, and each text is only a few hundred words long.
But the vector_size is reduced to 50, much smaller than the hundreds-of-dimensions typical with larger training data, and perhaps still too many dimensions for such a small amount of data. And the number of training epochs is increased to 40, much larger than the default of 5 or typical Doc2Vec choices in published papers of 10-20 epochs. Even with those changes, with so little data & textual variety, the effect of moving similar documents to similar vector coordinates will appear weaker to human review, & be less consistent between runs, than a better dataset will usually show (albeit using many more minutes/hours of training time).
Does anyone have an estimate of the number of generations one should search before concluding that the NEAT-algorithm is not able to reach the minima?
I am running NEAT on a very small dataset of cancer patients (~5K rows). After 5000 generations, the concordance index for prediction of survival index is not improving.
Does anyone have experience with how many generations one should try before deeming NEAT inefficient for the given problem?
There are a couple of other hyperparameters to consider before deciding NEAT cannot produce a usable neural network for your problem. First, make sure that your population is large enough. A larger dataset would obviously help too, but how much data you can get is limited. Finally, settings such as mutation rates, aggregation options, activation functions, and your fitness function will all affect the training process for each genome. Feel free to PM if you want suggestions on them.
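Since the fitness function is one of the things that shapes training, it can help to check exactly what is being scored. Below is a minimal sketch of a concordance index in the style of Harrell's C, assuming right-censored survival data. The names `times`, `events`, and `risk_scores` are hypothetical, and pairs with tied survival times are simply skipped, which survival libraries may handle differently.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the risk scores order correctly.

    times:       observed time for each patient (event or censoring time)
    events:      1/True if the event was observed, 0/False if censored
    risk_scores: model output, where a higher score means a higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the "earlier event" in a pair
        for j in range(n):
            # The pair is comparable when patient i had the event before time j.
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Toy data (made up): the riskiest patient has the event first.
print(concordance_index([5, 10, 15], [1, 1, 0], [0.9, 0.5, 0.1]))
```

A value near 0.5 means the ordering is no better than chance, so a fitness that stays there for thousands of generations points at the setup rather than at the generation count.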
I'm looking to use some tweets about measles/ the mmr vaccine to see how sentiment about vaccination changes over time. I plan on creating the training set from the corpus of data I currently have (unless someone has a recommendation on where I can get similar data).
I would like to classify a tweet as either: Pro-vaccine, Anti-Vaccine, or Neither (these would be factual tweets about outbreaks).
So the question is: How big is big enough? I want to avoid problems of overfitting (so I'll do a train/test split), but as I include more and more tweets, the number of features that need to be learned increases dramatically.
I was thinking 1000 tweets (333 of each). Any input is appreciated here, and if you could recommend some resources, that would be great too.
More is always better. 1000 tweets on a 3-way split seems quite ambitious; I would consider even 1000 per class quite low for a 3-way split on tweets. Label as many as you can within a feasible amount of time.
Also, it might be worth taking a cascaded approach (esp. with so little data), i.e. label a set vaccine vs non-vaccine, and within the vaccine subset you'd have a pro vs anti set.
In my experience, trying to model a catch-all "neutral" class that contains everything that is not explicitly "pro" or "anti" is quite difficult because there is so much noise. Especially with simpler models such as Naive Bayes, I have found the cascaded approach to work quite well.
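The cascaded approach can be sketched as two small classifiers chained together. This is only an illustration under stated assumptions: `TinyNaiveBayes` is a hypothetical hand-rolled multinomial Naive Bayes with add-one smoothing, the six training tweets are made up, and the first stage treats the question's "neither" class as the non-vaccine side of the split.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_cascade(texts, labels):
    """Stage 1: vaccine vs non-vaccine. Stage 2: pro vs anti, on vaccine tweets only."""
    stage1_labels = ["non-vaccine" if l == "neither" else "vaccine" for l in labels]
    stage1 = TinyNaiveBayes().fit(texts, stage1_labels)
    stance = [(t, l) for t, l in zip(texts, labels) if l != "neither"]
    stage2 = TinyNaiveBayes().fit([t for t, _ in stance], [l for _, l in stance])
    return stage1, stage2


def predict_cascade(stage1, stage2, text):
    if stage1.predict(text) == "non-vaccine":
        return "neither"
    return stage2.predict(text)


# Made-up toy data, far too small for real use.
texts = [
    "vaccines save lives get your kids vaccinated",
    "the mmr vaccine is safe and effective",
    "vaccines cause harm do not vaccinate",
    "the mmr vaccine is dangerous and toxic",
    "measles outbreak reported in county",
    "health officials confirm new measles cases",
]
labels = ["pro", "pro", "anti", "anti", "neither", "neither"]

stage1, stage2 = train_cascade(texts, labels)
print(predict_cascade(stage1, stage2, "the vaccine is safe"))
```

The design point is that the second stage never sees the noisy catch-all class, so it only has to learn the pro/anti distinction.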
From your experience, if I have to create a sentiment analysis classification model (using NLTK), what would be a good training data size? For instance, if my training data is going to contain tweets, and I intend to classify them as positive, negative and neutral, how many tweets should I ideally have per category to get a reasonable model working?
I understand that it depends on many factors, such as the quality of the data, but what might be a good number to get started with?
That's a really hard question to answer for people who are not familiar with the exact data, its labelling and the application you want to use it for. But as a ballpark estimate, I would say start with 1,000 examples of each and go from there.
Can anybody please explain multitask learning in a simple and intuitive way? Maybe a real-world problem would be useful. These days I mostly see people using it for natural language processing tasks.
Let's say you've built a sentiment classifier for a few different domains. Say, movies, music DVDs, and electronics. These are easy to build high quality classifiers for, because there is tons of training data that you've scraped from Amazon. Along with each classifier, you also build a similarity detector that will tell you, for a given piece of text, how similar it is to the dataset that classifier was trained on.
Now you want to find the sentiment of some text from an unknown domain, or one in which there isn't such a great dataset to train on. Well, how about we take a similarity-weighted combination of the classifications from the three high quality classifiers we already have? If we are trying to classify a dish washer review (there is no giant corpus of dish washer reviews, unfortunately), it's probably most similar to electronics, and so the electronics classifier will be given the most weight. On the other hand, if we are trying to classify a review of a TV show, the movies classifier will probably do the best job.
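The similarity-weighted combination described above can be sketched in a few lines. This is a minimal sketch, assuming each domain classifier outputs a probability that the text is positive and each similarity detector outputs a non-negative score. The function name `combine_by_similarity` and all the numbers are made up for illustration.

```python
def combine_by_similarity(predictions, similarities):
    """Weighted average of per-domain predictions, weighted by domain similarity.

    predictions:  {domain: probability that the text is positive}
    similarities: {domain: non-negative similarity of the text to that domain's data}
    """
    total = sum(similarities[domain] for domain in predictions)
    if total == 0:
        raise ValueError("text is not similar to any known domain")
    weighted = sum(similarities[d] * predictions[d] for d in predictions)
    return weighted / total


# Hypothetical scores for a dish washer review: it looks most like electronics,
# so the electronics classifier dominates the combined result.
predictions = {"movies": 0.3, "music": 0.4, "electronics": 0.9}
similarities = {"movies": 0.1, "music": 0.1, "electronics": 0.8}

combined = combine_by_similarity(predictions, similarities)
print(round(combined, 2))
```

With these made-up numbers the result lands close to the electronics classifier's own prediction, which is the behaviour the dish washer example describes.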