Algorithm to change speech pitch - audio

I'm looking for a way to raise the pitch of recorded speech audio.
I'd like to change the pitch only at the end of the utterance, to create a sort of "upspeak".
What are the typical algorithms to do this?
Thanks.

PSOLA (Pitch Synchronous Overlap and Add) is a digital signal processing technique used for speech processing and more specifically speech synthesis. It can be used to modify the pitch and duration of a speech signal.
Example code:
https://github.com/joaocarvalhoopen/Pitch_Shifter_using_PSOLA_algorithm
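
For the "raise the pitch only at the end" effect, one quick way to prototype is to pitch-shift just the tail of the clip and crossfade it back in. The sketch below uses librosa's phase-vocoder based pitch_shift as a stand-in rather than PSOLA itself; the file name, tail length, and shift amount are assumptions to tune by ear.

import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None, mono=True)
tail_s, fade_s, semitones = 0.6, 0.05, 3        # assumed tail length, fade, shift

# Split off the tail and shift only that segment upward.
n_tail = int(tail_s * sr)
head, tail = y[:-n_tail], y[-n_tail:]
tail_up = librosa.effects.pitch_shift(tail, sr=sr, n_steps=semitones)

# Short linear crossfade so the pitch change does not start with a click.
n_fade = int(fade_s * sr)
fade = np.linspace(0.0, 1.0, n_fade)
tail_up[:n_fade] = (1.0 - fade) * tail[:n_fade] + fade * tail_up[:n_fade]

sf.write("speech_upspeak.wav", np.concatenate([head, tail_up]), sr)

A gradual pitch glide over the tail (instead of one fixed shift) sounds more like natural upspeak; PSOLA, as in the linked repository, gives finer control over that contour.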

Related

Does converting from mulaw to linear impact audio quality?

I want to change the audio encoding from mulaw to linear in order to use a linear speech recognition model from Google.
I'm using a telephony channel, so the audio is encoded as mulaw, 8 bits, 8000 Hz.
When I use Google's mulaw model, there are issues recognizing some short single words: they are not recognized at all and the API returns None.
I was wondering whether it is good practice to change the encoding to linear or FLAC?
I already did it, but I cannot really measure the degree of improvement.
It is always best practice to use either LINEAR16 for headerless audio data or FLAC for headered audio data; both are lossless codecs. It is good practice to set the sampling rate to 16000 Hz; otherwise, set sample_rate_hertz to match the native sample rate of the audio source (instead of re-sampling). Since the Google Speech-to-Text API provides various ways to improve audio quality, you can use word-level confidence to measure the accuracy of the response.
Ideally the audio would have been recorded with a lossless codec like LINEAR16 or FLAC to begin with. But once you have it in a format like mulaw, transcoding it before sending it to Google Speech-to-Text is not helpful.
Consider using model=phone_call and use_enhanced=true for better telephony quality.
For quick experimentation you can use the Speech-to-Text UI: https://cloud.google.com/speech-to-text/docs/ui-overview.
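
As an illustration of the suggestions above (keeping the mulaw audio as-is while enabling the telephony model, the enhanced model, and word-level confidence), here is a minimal sketch using the google-cloud-speech Python client; the file name and language code are assumptions.

from google.cloud import speech

client = speech.SpeechClient()

# Headerless 8 kHz mulaw bytes from the telephony channel (path is a placeholder).
with open("call_8k_mulaw.raw", "rb") as f:
    content = f.read()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.MULAW,  # no transcoding
    sample_rate_hertz=8000,       # match the native telephony rate
    language_code="en-US",
    model="phone_call",           # telephony-tuned model
    use_enhanced=True,            # enhanced phone_call model
    enable_word_confidence=True,  # per-word confidence in the response
)
audio = speech.RecognitionAudio(content=content)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    alt = result.alternatives[0]
    print(alt.transcript, alt.confidence)

Comparing the returned confidence values with and without the enhanced model gives a rough, if imperfect, way to measure the improvement the question asks about.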

Speech rate detection in python

I need to detect the speech rate (the speed of spoken words) in an audio file. Most of the available libraries, including pyAudioAnalysis, provide the sampling rate, silence detection, or even emotion detection.
What I need is to know how fast the speaker is speaking. Can anyone suggest some code or a technique, please?
I worked with speech-to-text, but there are two main problems:
Not all the words produced by the engine are correct.
There can be long pauses between the words, which throws off the speech-rate estimate.
I was working with the PRAAT software, and there is a Python interface for it (https://github.com/YannickJadoul/Parselmouth). A detailed explanation of the procedure is given here.
There is a Praat script for detecting speech rate (https://sites.google.com/site/speechrate/Home/praat-script-syllable-nuclei-v2), and using Parselmouth we can run that script from Python, as in the sketch below. If you are OK with using the PRAAT software directly, here is a step-by-step tutorial: https://sites.google.com/site/speechrate/Home/tutorial.
The script returns the number of syllables, number of pauses, duration, speech rate, articulation rate, and ASD (speaking_time / number_of_syllables).
Reference paper: https://www.researchgate.net/publication/24274554_Praat_script_to_detect_syllable_nuclei_and_measure_speech_rate_automatically
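
A minimal sketch of running the syllable-nuclei script through Parselmouth. The script path, audio file name, and parameter values below are placeholders, and the exact argument list must match the form defined in the version of the script you download; the return value bundles the created objects with the captured text output, depending on the Parselmouth version.

import parselmouth
from parselmouth.praat import run_file

# Run the downloaded Praat script on one recording and capture its printed output.
result = run_file(
    "praat-script-syllable-nuclei-v2.praat",  # path to the downloaded script
    -25,              # silence threshold (dB)
    2,                # minimum dip between peaks (dB)
    0.3,              # minimum pause duration (s)
    "yes",            # keep intermediate objects (script-dependent flag)
    "recording.wav",  # audio file to analyse
    capture_output=True,
)
print(result)  # captured script output: syllables, pauses, speech rate, articulation rate, ASD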
Also check https://github.com/Shahabks/myprosody; that could work as well.
Hope this helps.

Is it possible to decompose audio into MIDI, as accurately as possible, given the SoundFont that was used?

If I know the SoundFont that a MIDI-to-audio track has used, can I theoretically reverse the audio back into its (most likely) MIDI components? If so, what would be one of the best approaches to doing this?
The end goal is to try encoding audio (even voice samples) into MIDI such that I can reproduce the original audio in MIDI format better than, say, BearFileConverter, and hopefully with better results than just bandpass filters or FFT.
And no, this is not for lossy audio compression or sheet transcription; it is mostly out of curiosity.
For monophonic music only, with no background sound, and if your SoundFont synthesis engine and your recording sample rates are exactly matched (synchronized to 1 ppm or better, no additional effects, both using a known A440 reference frequency, known intonation, etc.), then you can try cross-correlating your recorded audio against a set of synthesized waveform samples at each MIDI pitch from your a priori known font, to create a timeline of statistical likelihoods for each MIDI note. Find the local maxima across your pitch range, threshold, and peak-pick to find the most likely MIDI note onset times (a minimal sketch is given at the end of this answer).
Another possibility is sliding sound fingerprinting, but at an even higher computational cost.
This fails in real life due to imperfectly matched sample rates plus added noise, speaker and room acoustic effects, multi-path reverb, etc. You might also get false positives for note waveforms that are very similar to their own overtones. Voice samples vary even more from any template.
Forget bandpass filters or looking for FFT magnitude peaks, as this works reliably only for close to pure sinewaves, which very few musical instruments or interesting fonts sound like (or are as boring as).
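
A minimal sketch of the cross-correlation idea, assuming the SoundFont has already been rendered to one template waveform per MIDI pitch (for example with an external synthesizer); the template set, sample alignment, and thresholding strategy are all placeholders.

import numpy as np
from scipy.signal import correlate

def note_likelihoods(recording, templates):
    """Normalized cross-correlation of the recording against each pitch template.

    Returns an array of shape (num_pitches, num_samples); row k is the
    likelihood timeline for the k-th MIDI pitch template."""
    rec = recording / (np.linalg.norm(recording) + 1e-12)
    rows = []
    for tpl in templates:  # one synthesized waveform per candidate MIDI pitch
        tpl = tpl / (np.linalg.norm(tpl) + 1e-12)
        rows.append(correlate(rec, tpl, mode="same"))
    return np.vstack(rows)

# likelihoods = note_likelihoods(recorded_audio, pitch_templates)
# Threshold each row, then peak-pick local maxima along time to get
# candidate (pitch, onset) pairs for the MIDI reconstruction.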

How does Google Speech to Text work?

I would like to know how Google converts speech to text in their Speech Recognition API.
Have they stored almost all sounds and matched them at a particular frequency level, or do they have some different audio encoding/decoding algorithm that analyses the voice for different sound patterns like "A", "The", "B", "V", "D", "Hello", etc.?
It would also be great if someone could share how the audio is encoded and how stored audio can be separated into its different sounds. For example:
Given music containing guitar, drums, and voice, I would like to split it into 3 outputs, with the guitar, the drums, and the voice each separated out, and then further decode the voice to text.
Any documentation link or research paper suitable for university work would be great.
Thanks
The Google speech recognizer is described here. To understand it, you probably need to read the textbook Automatic Speech Recognition: A Deep Learning Approach first.
Separation of guitar and drums is usually implemented with Non-Negative Matrix Factorization.
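
As a rough illustration of NMF-based separation, here is a minimal sketch that factorizes a magnitude spectrogram into spectral templates and time activations; the file name and number of components are assumptions, and grouping the components into instruments still has to be done afterwards (by hand or with clustering).

import numpy as np
import librosa
from sklearn.decomposition import NMF

# Load the mixture and compute a non-negative magnitude spectrogram.
y, sr = librosa.load("mix.wav", sr=None, mono=True)
S = np.abs(librosa.stft(y))

# Factorize S ~ W @ H: columns of W are spectral templates, rows of H their activations over time.
model = NMF(n_components=16, init="nndsvd", max_iter=400)
W = model.fit_transform(S)
H = model.components_

# Reconstruct the contribution of one component (index 0 as an example);
# applying it as a soft mask to the complex STFT would give an audible estimate.
component_0 = np.outer(W[:, 0], H[0])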

How do I recognize a unique sound in a noisy environment?

I am developing an app to detect when elderly people are unable to unlock their rooms using IC cards in their daycare center.
The room doors have an electronic circuit that emits a beep to signal that the user failed to unlock the room. My goal is to detect this beep signal.
I have searched a lot and found some possibilities:
1. Clip the beep sound, use it as a template signal, and compare it with the test signal (the complete human-door interaction audio clip) using convolution, matched filters, DTW, or similar, to measure their similarity. What do you recommend, and how would I implement it?
2. Analyze the FFT of the beep sound to see if it occupies a frequency band different from that of the background noise. I do not understand how to do this exactly.
3. Check whether the beep sound forms a peak in the frequency spectrum that is absent in the background noise. If so, implement a filter at that frequency. I clipped the beep sound and got the spectrogram shown in the attached figure (spectrogram of the beep sound), but I cannot interpret it; could you give me a detailed explanation of the spectrogram?
What is your recommendation? If you have another efficient method for beep detection, please explain.
There is no need to calculate the full spectrum. If you know the frequency of the beep, you can just do a single-point DFT and continuously check the level at that frequency. If you detect a rising and a falling edge within a given interval, it must be the beep sound.
You might want to have a look at the Goertzel algorithm; it is an algorithm for continuous single-point DFT calculation. A sketch is given below.
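
A minimal Goertzel sketch for tracking the level of one target frequency over short blocks; the beep frequency, block size, and detection threshold are assumptions to be tuned against real recordings.

import numpy as np

def goertzel_power(block, sample_rate, target_freq):
    """Power of `block` at `target_freq`, computed with the Goertzel recurrence."""
    n = len(block)
    k = int(round(n * target_freq / sample_rate))   # nearest DFT bin
    coeff = 2.0 * np.cos(2.0 * np.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in block:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

# Slide over the microphone stream in short blocks and flag a beep when the
# level at the beep frequency rises above a threshold and falls back within
# the expected beep duration. (stream_blocks is a hypothetical helper.)
# for block in stream_blocks(samples, block_size=256):
#     level = goertzel_power(block, 8000, 2000)   # 2 kHz beep at 8 kHz sampling (assumed)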
