Pause Interval for Final Result IBM Watson Speech to Text - speech-to-text

I was wondering whether it is possible to tweak the pause interval the engine uses to determine final results -- in effect trading more frequent final results for less contextual accuracy, or fewer final results for more contextual accuracy. Thank you in advance for any help!

Why was the above question voted down? It's a reasonable thing to ask.
The answer is no, it is currently not possible. A final result is produced after 0.5 seconds of non-speech audio is detected following some speech.
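For context, the WebSocket interface does let you request interim (non-final) results, which arrive more frequently even though the 0.5-second end-of-utterance behaviour itself is fixed. A minimal sketch of the start message follows (Python is used only to build the JSON; connection and authentication details are omitted, and the endpoint varies by region):

    # Sketch of the "start" message for the Watson Speech to Text WebSocket interface.
    # interim_results only adds intermediate (non-final) hypotheses; it does not change
    # the ~0.5 s of silence that triggers a final result.
    import json

    start_message = {
        "action": "start",
        "content-type": "audio/l16;rate=16000",  # match your audio encoding
        "interim_results": True,                  # stream partial hypotheses as they form
    }

    # In a real client this JSON is the first frame sent over the authenticated
    # WebSocket recognize connection, followed by binary audio frames and a
    # {"action": "stop"} message.
    print(json.dumps(start_message))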

Related

Speech rate detection in python

I need to detect the speech rate (the speed of spoken words) in an audio file. Most of the available libraries, including pyAudioAnalysis, provide sampling rate, silence detection, or even emotion detection.
What I need is to know how fast the speaker is speaking. Can anyone suggest a technique or some code, please?
I have worked with speech to text, but there are two main problems:
Not all the words produced by the engine are correct.
There can be long pauses in between the text, which do not help with detecting the speech rate.
I was working with the PRAAT software, and there is a Python interface for it (https://github.com/YannickJadoul/Parselmouth). A detailed explanation of the procedure is given here.
There is an option for detecting the speech rate with this script (https://sites.google.com/site/speechrate/Home/praat-script-syllable-nuclei-v2), and using Parselmouth we can run the script from Python. If you are fine with using the PRAAT software directly, there is a step-by-step analysis here: https://sites.google.com/site/speechrate/Home/tutorial.
The script returns the number of syllables, number of pauses, duration, speech rate, articulation rate, and ASD (speaking time / number of syllables).
Reference paper: https://www.researchgate.net/publication/24274554_Praat_script_to_detect_syllable_nuclei_and_measure_speech_rate_automatically
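A minimal sketch of driving that script from Python with Parselmouth is shown below. The local file names and the argument list are assumptions; check the header of the script version you download for the exact parameters it expects.

    # Sketch: running the syllable-nuclei Praat script through Parselmouth.
    # The script path and argument order are assumptions -- consult the form at
    # the top of your copy of the script for the real parameter list.
    from parselmouth.praat import run_file

    result = run_file(
        "syllable_nuclei_v2.praat",  # local copy of the Praat script
        -25.0,    # silence threshold (dB)
        2.0,      # minimum dip between peaks (dB)
        0.3,      # minimum pause duration (s)
        "yes",    # keep Soundfiles and Textgrids
        "audio.wav",
    )
    # The script reports syllable count, pauses, duration, speech rate,
    # articulation rate and ASD; whether it prints them or returns objects
    # depends on the script version.
    print(result)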
Also check https://github.com/Shahabks/myprosody; that could work as well.
Hope this helps.

Finding the frequency per second from an audio file

I am currently making a game similar to Audiosurf. I am trying to find the frequency of an audio file (like .mp3 or .wav) at every second; based on that value I will build the level. I have been doing a lot of research on this topic. I have a way to get the samples within the audio file, and I am using the Unity engine to make this game. I am thinking about breaking the samples into one-second chunks (using the sample rate), then doing an FFT on each of those and finding the highest frequency within each. Am I on the right path? Can anyone offer any suggestions, or correct me if I am not? Any help would be appreciated.
You are on the right path with the FFT part and splitting your samples into bins. Here is a library for that: http://www.fftw.org/
Where it gets hairy is picking your frequency. Let me tell you off the bat: just throw away the highest frequency in the spectrum; it's part of the static. Maybe you could use the lowest frequency to catch the bassline, but the bass drums and even atmospheric sound effects will likely interfere there.
Even if you do find some heuristic that lets you pick "the frequency" at a given moment in the song, it will most likely have little correlation to the music itself. You are really better off reworking your idea to use the frequency spectrum at each moment, not just a single frequency.
EDIT: The Fourier transform will provide you with an array of complex numbers, one per bin; the magnitude of each number gives the amplitude for that bin and its angle gives the phase.
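A rough sketch of the per-second spectrum idea, in Python/NumPy rather than Unity/C#, just to show the shape of it:

    # Split the signal into 1-second windows, FFT each, and keep the magnitude spectrum.
    import numpy as np

    def per_second_spectra(samples, sample_rate):
        """Return a list of (frequencies, magnitudes) pairs, one per second of audio."""
        spectra = []
        for start in range(0, len(samples) - sample_rate + 1, sample_rate):
            window = samples[start:start + sample_rate]
            mags = np.abs(np.fft.rfft(window))   # magnitude = amplitude per bin
            freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
            spectra.append((freqs, mags))
        return spectra

    # Example with a synthetic 3-second, 440 Hz tone at 44.1 kHz:
    sr = 44100
    t = np.arange(3 * sr) / sr
    signal = np.sin(2 * np.pi * 440 * t)
    for i, (freqs, mags) in enumerate(per_second_spectra(signal, sr)):
        print(f"second {i}: dominant bin at {freqs[np.argmax(mags)]:.1f} Hz")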

Changing duration of Guitar pluck in PCM data

Folks,
I am struggling with a simple concept related to the duration of play of PCM data. I would appreciate your feedback.
The application I am developing plays guitar notes from a music sheet.
I have implemented Jaffe-Smith Algorithm for guitar plucking.
https://ccrma.stanford.edu/~jos/Mohonk05/Extended_Karplus_Strong_EKS_Algorithm.html.
Let's say I compute samples for note A (440 Hz) for one second.
At a sample rate of 11025, I will be storing 11025 samples that can be sent to the computer speakers as PCM audio.
Computing samples for all of the unique notes on the guitar takes quite some time, so I am thinking I will pre-compute them, save them as binary data, and simply load them when the application runs.
So far so good.
Now, let's say I want to play a song (a list of various notes). Let's say the song needs to be played at 100 beats per minute. Let's say I have to play note A for one beat or 0.6 seconds (60/100).
Recalculating samples for 0.6 seconds may take quite some time.
Can I simply play (11025 * 0.6) samples? Will this create any side effect?
Is there a better way to achieve what I am trying to do?
Thank you in advance for your help.
Regards,
Peter
What you're basically trying to do is create a synthesized guitar, yes? I might suggest that you go with the sampler route instead.
By sample, I mean a small clip of audio (not a single sample in the sense of ADC or DAC).
Basically, you can flatten what you need into 4 parts:
Attack
Decay
Sustain
Release
These four parts work in that order, and are generally referred to as an ADSR envelope. The attack of the note is the initial sound. For a guitar, you are going to hear a pluck and the start of a pitch. The decay is going to be the sample of the string as it starts to fade away. The sustain is a sample repeated over and over again until you release the key. The release sample is what is played when you release the key. For a guitar, you might hear a sample of lightly putting fingers back on the string to stop their vibration.
Now, you could generate all of these samples in real time, but that will likely be very CPU intensive.
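A rough sketch of assembling a note of arbitrary length from pre-recorded ADSR clips might look like this (the clips below are silent placeholder arrays, just to show the call shape):

    # Sketch: assembling a note of arbitrary duration from pre-recorded ADSR clips.
    # attack/decay/sustain/release stand in for short PCM sample clips.
    import numpy as np

    def assemble_note(attack, decay, sustain, release, duration_s, sample_rate=11025):
        """Concatenate attack + decay, loop the sustain clip to fill the
        remaining time, then append the release clip."""
        target = int(duration_s * sample_rate)
        head = np.concatenate([attack, decay])
        remaining = max(target - len(head) - len(release), 0)
        if remaining > 0:
            reps = int(np.ceil(remaining / len(sustain)))
            body = np.tile(sustain, reps)[:remaining]
        else:
            body = np.empty(0, dtype=head.dtype)
        return np.concatenate([head, body, release])

    sr = 11025
    attack = np.zeros(int(0.02 * sr))    # placeholder clips (silence)
    decay = np.zeros(int(0.10 * sr))
    sustain = np.zeros(int(0.25 * sr))
    release = np.zeros(int(0.05 * sr))
    note = assemble_note(attack, decay, sustain, release, duration_s=0.6, sample_rate=sr)
    print(len(note), "samples =", len(note) / sr, "seconds")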
Regarding your question: "Can I simply play (11025 * 0.6) samples?" Yes, at a sample rate of 11025, that will be 0.6 seconds of audio. Also keep in mind though that you should be sending a continuous stream of data to the sound card, filling any empty spots with 0 (for signed PCM).
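One side effect worth noting: cutting the precomputed one-second buffer at an arbitrary sample can stop the waveform mid-cycle and produce an audible click, so a very short fade-out at the end is a cheap safeguard. A sketch, assuming the note is already a NumPy array of float samples:

    # Truncate a precomputed 1-second note to 0.6 s and apply a ~10 ms fade-out
    # so the cut does not end mid-cycle with an audible click.
    import numpy as np

    def truncate_with_fade(note, duration_s, sample_rate=11025, fade_ms=10):
        out = note[: int(duration_s * sample_rate)].copy()
        fade_len = min(int(sample_rate * fade_ms / 1000), len(out))
        out[-fade_len:] *= np.linspace(1.0, 0.0, fade_len)  # linear fade to silence
        return out

    one_second_note = np.sin(2 * np.pi * 440 * np.arange(11025) / 11025)  # stand-in for note A
    short_note = truncate_with_fade(one_second_note, 0.6)
    print(len(short_note))  # 6615 samples = 0.6 s at 11025 Hz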

Twitter Subjectivity Training Sets

I need a reliable and accurate method to filter tweets as subjective or objective. In other words I need to build a filter in something like Weka using a training set.
Are there any training sets available which could be used as a subjective/objective classifier for Twitter messages or other domains which may be transferable?
For research and non-profit purposes, SentiWordNet gives you exactly what you want. A commercial license is available too.
SentiWordNet : http://sentiwordnet.isti.cnr.it/
Sample Java Code: http://sentiwordnet.isti.cnr.it/code/SWN3.java
Related Paper: http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf
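If you would rather stay in Python than use the Java sample, NLTK ships a SentiWordNet corpus reader; a minimal sketch (requires nltk.download('wordnet') and nltk.download('sentiwordnet') first):

    # Look up SentiWordNet scores for a word via NLTK's corpus reader.
    from nltk.corpus import sentiwordnet as swn

    for synset in swn.senti_synsets("awesome"):
        # obj_score() near 1.0 suggests an objective sense; pos/neg scores suggest subjectivity
        print(synset, synset.pos_score(), synset.neg_score(), synset.obj_score())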
The other approach I would try:
Example
Tweet 1: #xyz u should see the dark knight. Its awesme.
1) First, do a dictionary lookup of the words for their meanings.
"u" and "awesme" will not return anything.
2) Then check against known abbreviations/shorthands and substitute matches with their expansions
(Some resources: netlingo http://www.netlingo.com/acronyms.php or smsdictionary http://www.smsdictionary.co.uk/abbreviations)
Now the original tweet will look like:
Tweet 1: #xyz you should see the dark knight. Its awesme.
3) Then feed the remaining unknown words into a spell checker and substitute the best match (not always ideal, and error-prone for short words)
Related Link:
Looking for Java spell checker library
Now the original tweet will look like:
Tweet 1: #xyz you should see the dark knight. Its awesome.
4) Split the tweet into words, feed them into SWN3, and aggregate the results (a rough sketch of steps 2-4 follows below).
The problems with this approach are that:
a) Negations have to be handled outside SWN3.
b) Information in emoticons and exaggerated punctuation will be lost, or it has to be handled separately.
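A toy Python sketch of the pipeline above; the abbreviation map and word scores are placeholders, since in practice the scores would come from SWN3 and the spelling step from a real spell-checker library:

    # Rough sketch of steps 2-4. The abbreviation map and sentiment scores are toy
    # placeholders; real scores come from SentiWordNet and spelling fixes from a
    # proper spell-checker.
    ABBREVIATIONS = {"u": "you", "gr8": "great"}      # step 2: expand shorthand
    SENTIMENT = {"awesome": 0.875, "dark": -0.125}    # stand-in for SWN3 lookups

    def normalize(token):
        token = token.lower().strip(".,!?#@")
        return ABBREVIATIONS.get(token, token)

    def subjectivity_score(tweet):
        """Sum word-level scores; a clearly non-zero total suggests a subjective tweet."""
        tokens = [normalize(t) for t in tweet.split()]
        # (a spell-correction pass would go here, before the lexicon lookup)
        return sum(SENTIMENT.get(t, 0.0) for t in tokens)

    print(subjectivity_score("#xyz u should see the dark knight. Its awesome."))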
There is sentiment training data at CMU somewhere; I can't remember the link. CMU has done a lot of work on Twitter and sentiment analysis:
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
Carnegie Mellon Study of Twitter Sentiments Yields Results Similar to Public Opinion Polls
I wrote an English vs. not-English Naive Bayes classifier for Twitter and made an example dev/test set, and it was 98% accurate. I think that sort of thing is always pretty good if you are just trying to understand the problem, but a package like SentiWordNet might give you a head start.
The problem is defining what makes a tweet subjective or objective! It's important to understand that machine learning is less about the algorithm and more about the quality of the data.
You mention that 75% accuracy is all you need... but what about recall? If you provide the right training data you might be able to get that accuracy, at the cost of lower recall.
The DynamicLMClassifier in LingPipe works pretty well.
http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html

Questions about Filters for Note Onset Detection?

Forgive me if I come across as ignorant, but I would like to ask some questions regarding the use of filter algorithms for note onset detection.
Is 'Detection Function' the same as using Filters on the audio signal? Or generally, what is the difference between Detection Function, Filtering (pre-processing the signal), and Peak-Picking?
I've constantly heard about the low-pass (or high-pass) filter, but I am confused. I read that it attenuates frequencies above (or below) a certain cutoff. However, I am working in the time domain for calculating note onsets (that is, using the change in signal amplitude/energy), so I am not sure how I can apply low-pass filtering there. Are there any other good filters for note onset detection?
What is the difference between spectral and phase energy? (I have an idea that spectral refers to the spectrogram or frequencies, but I do not know what phase is.)
I am having difficulties working with dynamic thresholding. Any suggestions for a good algorithm? For example, I have the following signal:
As shown in the image above, there are note onsets that I have missed. A brief description of my algorithm: I calculate and record the energy/amplitude changes that occur in the audio signal. Then I take the maximum 'energy change' and, based on a sensitivity setting, take a percentage of it and set that as the threshold. This is where the problem of dealing with varying degrees of amplitude/energy comes in: if I set the sensitivity too low, I get 'ghost' onsets, and if I set it too high, I miss some onsets. Any suggestions to improve the algorithm (or a new algorithm to replace it)?
I am sure that it is difficult to have 100% accuracy but I need to have a better algorithm for note onset detection compared with what I have now. I would appreciate all the help I can get. Thank you very much!
One way is to detect sudden increases in the amplitude envelope. One way of calculating the amplitude envelope is to rectify the input signal (i.e. take the absolute value) and then low-pass filter it. Check out http://www.musicdsp.org for time-domain filter examples and envelope followers.
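A small sketch of that rectify-then-low-pass envelope, with a naive fixed-threshold onset pick added only to show the shape of the idea (the cutoff and threshold values are placeholders to tune, and this does not solve the dynamic-thresholding problem by itself):

    # Rectify + low-pass envelope follower, with a naive fixed-threshold onset pick.
    import numpy as np
    from scipy.signal import butter, filtfilt

    def amplitude_envelope(signal, sample_rate, cutoff_hz=10.0):
        rectified = np.abs(signal)                       # full-wave rectification
        b, a = butter(2, cutoff_hz / (sample_rate / 2))  # 2nd-order low-pass
        return filtfilt(b, a, rectified)

    def pick_onsets(envelope, threshold=0.1):
        above = envelope > threshold
        # an onset is where the envelope first crosses the threshold upward
        return np.where(above[1:] & ~above[:-1])[0] + 1

    # Toy example: two 0.1 s bursts of a 440 Hz tone separated by silence.
    sr = 11025
    t = np.arange(sr // 10) / sr
    burst = np.sin(2 * np.pi * 440 * t)
    signal = np.concatenate([np.zeros(sr // 4), burst, np.zeros(sr // 2), burst])
    env = amplitude_envelope(signal, sr)
    print(pick_onsets(env))  # expect two onset indices, one per burst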
