Acoustic Model number of hours - cmusphinx

I want to create a model to recognize Arabic letters. I know how to create the language model and the dictionary files, but I am stuck at the acoustic model. I recorded WAV files for each letter, but during training it warns that the number of training hours is too small, although the training continues. When I try to use it, the model doesn't recognize anything (it gives null).
I want to know how I should record the WAV files: should I keep repeating, for example, the letter Alif 100 times in one WAV file, or should I record multiple WAV files of the same letter?
Your help is highly appreciated.

I want to know how I should record the WAV files: should I keep repeating, for example, the letter Alif 100 times in one WAV file, or should I record multiple WAV files of the same letter?
It's better to have multiple files with continuous words, not with individual letters. Letters are hard to recognize.
When I try to use it, the model doesn't recognize anything (it gives null).
There might be different issues here (wrong audio format, etc.). You can share your database on the CMUSphinx forums via Dropbox to get help with this issue.
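A quick way to rule out the "wrong audio format" case is to check every recording before training. Below is a minimal Python sketch using the standard wave module; it assumes the common CMUSphinx training format of 16 kHz, 16-bit, mono WAV and a wav/ directory layout, so adjust both to whatever your sphinx_train.cfg actually specifies:

    import glob
    import wave

    # Assumed format for typical CMUSphinx training (adjust to your config)
    EXPECTED_RATE = 16000    # Hz
    EXPECTED_CHANNELS = 1    # mono
    EXPECTED_SAMPWIDTH = 2   # bytes, i.e. 16-bit samples

    for path in glob.glob("wav/**/*.wav", recursive=True):
        with wave.open(path, "rb") as w:
            ok = (w.getframerate() == EXPECTED_RATE
                  and w.getnchannels() == EXPECTED_CHANNELS
                  and w.getsampwidth() == EXPECTED_SAMPWIDTH)
            if not ok:
                print(f"{path}: {w.getframerate()} Hz, {w.getnchannels()} ch, "
                      f"{w.getsampwidth() * 8}-bit  <-- does not match the expected format")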

Related

How to parse screenshots in Power Automate

I have a somewhat complicated process in Power Automate. I'm trying to parse user-uploaded screenshots and categorize them into different variables. At first it seemed that the obvious choice would be to build and train an AI model, but the issue is that the data in the screenshots can vary (i.e. some images contain more rows, some don't contain the relevant data, and the data can be located in different regions of the screenshot).
Some examples of images which a user can upload are as follows: (i) Samsung 1 Metrics, (ii) Samsung 2 Metrics, (iii) iPhone metrics
My attempt was to perform OCR on the uploaded screenshot and then do string parsing, so I tried the following flow: Flow Diagram, and specifically the substring parsing as:
Substring parsing
Basically, I'm performing OCR on the screenshot and then searching for a substring that corresponds to the values I'm interested in. I'm unsure if this is the best way to do it, as it isn't dynamic (i.e. I have to offset the substring index by a fixed number of characters). Any advice is greatly appreciated.
I believe you should be able to train a custom Form Processing model to extract the information you need. You can use two different collections in your training dataset so that the model can recognize both the Samsung and iPhone layouts.
All you'll need is 5 samples for each collection and you should be good to go.
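If you do stay with the OCR plus string-parsing route from the question, anchoring on the label text rather than on a fixed character offset is usually more robust to extra rows and shifted layouts. The actual flow would use Power Automate expressions, but here is the idea as a hedged Python sketch with made-up labels and values:

    import re

    # Illustrative OCR output; the labels and values are made up
    ocr_text = """
    Screen time  3h 24m
    Notifications  87
    Times opened  45
    """

    def value_after(label: str, text: str):
        # Anchor on the label itself instead of a fixed offset,
        # so extra rows or shifted layouts don't break the parsing.
        match = re.search(rf"{re.escape(label)}\s*[:\-]?\s*(.+)", text)
        return match.group(1).strip() if match else None

    print(value_after("Screen time", ocr_text))    # -> 3h 24m
    print(value_after("Notifications", ocr_text))  # -> 87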

Text/Image processing in Python

Brief introduction:
I'm trying to extract certain pieces of text from an image that contains a lot of text.
Just thinking about it, there should be at least two ways to handle this problem:
One way is to first segment the image by text areas: for example, train a neural network with a bunch of sample images that contain the sample texts, let the trained model locate the corresponding text areas in the real image, crop those areas out of the image and save them, and then use, for instance, pytesseract to convert the cropped images to strings.
The other way is to reverse the process: first convert the image into strings, then train the neural network with sample real texts, and let the trained model find the corresponding texts in the text converted from the images.
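To make the second approach concrete, here is a minimal sketch assuming pytesseract and Pillow are installed and the Tesseract binary is on the system; the keyword filter is only a stand-in for whatever text classifier you end up training:

    import pytesseract
    from PIL import Image

    # Keywords standing in for a real text classifier (e.g. an NLP model)
    POLITICS_KEYWORDS = {"election", "parliament", "policy", "minister"}

    def extract_matching_paragraphs(image_path: str):
        # OCR the whole screenshot into a single string
        text = pytesseract.image_to_string(Image.open(image_path))
        # Split into rough paragraphs and keep the ones mentioning the topic
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        return [p for p in paragraphs
                if any(word in p.lower() for word in POLITICS_KEYWORDS)]

    for paragraph in extract_matching_paragraphs("screenshot.png"):
        print(paragraph, "\n---")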
So, my questions are listed below:
Can this problem be solved without training a neural network? Would that be more efficient than a NN in terms of run time and accuracy of the results?
Of the two methods I described above, which one is better in terms of run time and accuracy of the results?
Any other suggestions from experience?
Additional background information if needed:
So, I have a number of groups of screenshots of different web pages, each of which has a lot of text on it, and I want to extract certain paragraphs from that large volume of text. The paragraphs I want to extract express similar things but in different contexts.
For example, on a large mixed online forum platform, many comments are made on different things: some on mountain landscapes, some on politics, some on science, and so on. Since such a platform cannot have only one page, there must be hundreds of pages where countless users post their comments. Now I want to extract the comments on politics specifically from the entire forum, i.e. from all the pages the platform has. So I would use Python + Selenium to scrape the pages and save the screenshots. Now we are back to the questions asked above: what to do next?
Update:
Just a passing thought: a NN trained on images that contain text probably cannot give a very accurate location of the wanted text, since the NN may only be looking at arrangements of pixels rather than at the words, let alone the meaning, that make up the sentences or paragraphs. So maybe the second method, text processing, is better in this case? (Something like NLP?)
So, you decided not to parse the text, but to save it as an image and then detect the text from that image.
Text -> Image -> Text
That is the worst-case scenario for parsing web pages.
When dealing with OCR you should expect many problems, such as:
High CPU consumption;
Different fonts;
Hidden elements (like 'See full text' links);
And the main one: you can't OCR with 100% accuracy.
Trying to create a generic parser that crawls only the required text from any given page, without any "garbage", is an almost utopian idea.
As far as I know, the closest thing to this is 'HTML readability' technology (browsers like Safari and Firefox use it). But I can't say how well it would work with forums; forums are a very particular page format.
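Concretely, since the screenshots are produced with Selenium anyway, the page text can be pulled straight from the rendered HTML instead of from pixels. A minimal sketch assuming BeautifulSoup is available; the URL and the .comment-body selector are hypothetical and depend on the real forum's markup:

    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("https://example-forum.invalid/page/1")  # placeholder URL

    # Parse the rendered HTML instead of OCR-ing a screenshot of it
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # ".comment-body" is a hypothetical selector; inspect the real forum's markup
    comments = [node.get_text(" ", strip=True)
                for node in soup.select(".comment-body")]

    politics = [c for c in comments if "politic" in c.lower()]
    driver.quit()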

Comparing voice input with existing audio sources

I'm currently working on a script that would compare audio input with existing audio sources and return a match, if any.
The idea is that the voice input would not be convertible to text. The inputs would be sounds such as a dog ("woof") or a cat ("meow").
In the end, I would like the script to conclude whether the input was a cat or dog sound, or none of the two.
I understand that it would require pre-processing the sound input (low-pass filtering, noise reduction, etc.), then doing a spectrum analysis of the sound before comparing it to the existing spectrum analyses in the DB, but I don't know where to start.
Are there any libraries for this kind of small project that could help?
How do I compare spectrum analyses?
How does spectrum-analysis comparison take into account the possibility that two different people could make the same meow sound? Does it allow for a match up to a specific percentage?
Thanks for any guidance regarding this matter.
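For a starting point, here is a minimal sketch of the pipeline described above, assuming librosa is available. It summarises each clip as a mean MFCC vector and picks the nearest labelled reference; the reference paths and the distance threshold are placeholders, and a real system would use a proper classifier:

    import numpy as np
    import librosa

    def mfcc_signature(path: str) -> np.ndarray:
        # Load the clip and summarise it as the mean of its MFCC frames
        y, sr = librosa.load(path, sr=22050, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return mfcc.mean(axis=1)

    # Labelled reference clips (paths are placeholders)
    references = {
        "dog": mfcc_signature("refs/dog_woof.wav"),
        "cat": mfcc_signature("refs/cat_meow.wav"),
    }

    def classify(path: str, threshold: float = 50.0) -> str:
        sig = mfcc_signature(path)
        # Euclidean distance to each reference; smaller means more similar
        distances = {label: np.linalg.norm(sig - ref) for label, ref in references.items()}
        label, dist = min(distances.items(), key=lambda kv: kv[1])
        return label if dist < threshold else "none"

    print(classify("input.wav"))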

Which features can I try to extract out of mp3 files to classify them?

I am planning to build a music genre classifier working with MP3 files, and I want to test and see which features work best for this. I have seen a paper that used MFCCs (Mel-frequency cepstral coefficients) for this, but as a beginner in machine learning, this method felt complicated. I also saw some that converted MP3 files into spectrograms and analysed those, but with no success. What I am looking for is a few easy-to-extract features for classifying MP3 files. Do any other methods exist besides the two I just listed?
There are some papers on this; you can easily find them with a Google search.
But the simplest features would be the beat speed (tempo), the proportion of high to low frequencies, and so on.
All of this can be extracted using the FFT (Fast Fourier Transform). But I am afraid this may not be so easy if you haven't done it before...
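For a rough idea, here is a sketch of those two features with librosa (which needs an audio backend such as ffmpeg to decode MP3); the 1 kHz split point for the high/low proportion is an arbitrary choice:

    import numpy as np
    import librosa

    def simple_features(path: str):
        y, sr = librosa.load(path, mono=True)

        # Beat speed: estimated tempo in beats per minute
        tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
        tempo = float(np.atleast_1d(tempo)[0])

        # Proportion of spectral energy above vs. below an arbitrary 1 kHz split
        spectrum = np.abs(librosa.stft(y))
        freqs = librosa.fft_frequencies(sr=sr)
        high = spectrum[freqs >= 1000].sum()
        low = spectrum[freqs < 1000].sum()

        return {"tempo_bpm": tempo, "high_low_ratio": float(high / (low + 1e-9))}

    print(simple_features("song.mp3"))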

Search different audio files for equal short samples

Consider multiple (at least two) different audio files, like several different mixes or remixes. Naively I would say it must be possible to detect samples, especially the vocals, that are almost equal in two or more of the files; of course only if the vocal samples aren't modified, stretched, pitched, or reverbed too much, etc.
So what kind of algorithm or technique could this be done with? Let's say the user tries to set time markers in all files as precisely as possible, describing the data windows to compare that contain the presumably equal sounds, vocals, etc.
I know that no direct approach that tries to compare the raw WAV data in any way is useful. But even with the frequency-domain data (e.g. from an FFT), I would have to use a comparison algorithm that shifts the comparison windows along the time axis, since I cannot assume that the samples I want to find are time-synchronized across all files.
Thanks in advance for any suggestions.
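For reference, a rough numpy/scipy sketch of that shifted-window idea: take the user-marked window from one file's spectrogram and slide it along the other file's spectrogram, scoring normalised correlation at each offset. It assumes both files have already been decoded to mono arrays at the same sample rate, and it is only an illustration, not a robust matcher:

    import numpy as np
    from scipy.signal import spectrogram

    def power_spec(y, sr):
        # Power spectrogram; rows are frequency bins, columns are time frames
        _, _, S = spectrogram(y, fs=sr, nperseg=1024, noverlap=512)
        return S

    def best_offset(window_frames: np.ndarray, target_frames: np.ndarray):
        # Slide the window over the target, scoring normalised correlation per offset
        w = window_frames.ravel()
        w = (w - w.mean()) / (w.std() + 1e-9)
        n = window_frames.shape[1]
        scores = []
        for start in range(target_frames.shape[1] - n + 1):
            t = target_frames[:, start:start + n].ravel()
            t = (t - t.mean()) / (t.std() + 1e-9)
            scores.append(float(np.dot(w, t)) / w.size)
        best = int(np.argmax(scores))
        return best, scores[best]

    # Usage sketch: spec_a / spec_b come from power_spec() of the two decoded files,
    # and frames 200..260 of file A are the user-marked window.
    # offset, score = best_offset(spec_a[:, 200:260], spec_b)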
Hi, this is possible!
You can use a technique called LSH (locality-sensitive hashing); it is very robust.
Another way to do this is to do a spectrogram analysis of your audio files:
Construct the song database:
1. Record your full song.
2. Transform the sound into a spectrogram.
3. Slice your spectrogram into chunks and take the three or four strongest frequencies from each chunk.
4. Store all of these points.
Match the song:
1. Record a short sample.
2. Transform the sound into a spectrogram in the same way.
3. Slice your spectrogram into chunks and take the three or four strongest frequencies from each chunk.
4. Compare the collected frequencies with your song database.
5. Your match is the song with the highest number of hits!
You can see how it's done here (a rough Python sketch of the idea also follows below):
http://translate.google.com/translate?hl=EN&sl=pt&u=http://ederwander.wordpress.com/2011/05/09/audio-fingerprint-em-python/
ederwander
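For a rough idea of steps 2-4 in Python, here is a minimal sketch with numpy and scipy: it keeps the strongest few frequency bins per spectrogram chunk as the "fingerprint" and counts how many of the sample's peaks appear in each stored song. It is only an illustration of the idea; real fingerprinting systems hash peak pairs, handle time offsets, and so on:

    import numpy as np
    from scipy.signal import spectrogram

    def peak_fingerprint(y: np.ndarray, sr: int, peaks_per_chunk: int = 3) -> set:
        # Spectrogram: rows are frequency bins, columns are time chunks
        _, _, S = spectrogram(y, fs=sr, nperseg=2048, noverlap=1024)
        fingerprint = set()
        for chunk in S.T:
            # Keep the indices of the strongest few frequency bins in this chunk
            top_bins = np.argsort(chunk)[-peaks_per_chunk:]
            fingerprint.update(int(b) for b in top_bins)
        return fingerprint

    def match(sample_fp: set, database: dict) -> str:
        # The best match is the song sharing the most peak bins with the sample
        hits = {name: len(sample_fp & fp) for name, fp in database.items()}
        return max(hits, key=hits.get)

    # Usage sketch: the audio arrays come from whatever decoder you use
    # database = {"song_a": peak_fingerprint(song_a_audio, sr),
    #             "song_b": peak_fingerprint(song_b_audio, sr)}
    # best = match(peak_fingerprint(sample_audio, sr), database)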
