I am aware that speech rate is controlled by the prosody element in SSML. However, when I try to set the rate for an individual word within a sentence that already has a rate set, it either has no effect, sets the whole sentence to the reduced rate, or plays the remainder of the sentence at the reduced rate.
I tried both of the following:
<prosody pitch="+18.00%" rate="1.05" volume="-15.00%">apple apple orange are you <prosody rate="-50.00%"><emphasis level="strong">special</emphasis></prosody> person<break time="80ms" /></prosody>
<prosody pitch="+18.00%" rate="1.05" volume="-15.00%">apple apple orange are you </prosody><prosody rate="0.50"><emphasis level="strong">special</emphasis></prosody><prosody rate="1.05"> person<break time="80ms" /></prosody>
A workaround I can think of is to break this into three separate utterances, the first and third using the same prosody and the second using the reduced rate, then stitch the audio together.
But I want to know: can this be accomplished within SSML?
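For reference, a minimal sketch of that stitching workaround in Python, assuming a hypothetical synthesize(ssml) wrapper around the TTS engine (returning WAV bytes) and using pydub for concatenation:

```python
# Sketch of the three-part stitching workaround. synthesize() is a
# hypothetical wrapper around your TTS engine returning WAV bytes.
import io
from pydub import AudioSegment

def synthesize(ssml: str) -> bytes:
    raise NotImplementedError("call your TTS engine here")

parts = [
    '<speak><prosody pitch="+18.00%" rate="1.05" volume="-15.00%">'
    'apple apple orange are you</prosody></speak>',
    '<speak><prosody rate="0.50"><emphasis level="strong">special'
    '</emphasis></prosody></speak>',
    '<speak><prosody pitch="+18.00%" rate="1.05" volume="-15.00%">'
    'person<break time="80ms"/></prosody></speak>',
]

clips = [AudioSegment.from_file(io.BytesIO(synthesize(p)), format="wav")
         for p in parts]
stitched = sum(clips[1:], clips[0])   # concatenate the three clips
stitched.export("stitched.wav", format="wav")
```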
What is the right approach for multi-label text information extraction/classification
I have texts that describe a caregiver/patient visit; here is a made-up example:
Mr *** visits the clinic on 02/2/2018 complaining about pain in the
lower back for several days, No pathological findings in the x-ray or
in the blood tests. I suggest Mr *** 5 resting days.
Now, the text can be a full paragraph where the only information I care about is "lower back pain" and "resting days". I have 300-400 different labels, but the number of labeled samples is only around 1000-1500 in total. When I label a text I also mark the relevant words that create the "label"; here that would be ['pain', 'lower', 'back'].
When I just use a look-up for those words (or for the other 300-400 labels) in other texts, I manage to label a larger amount of text. But if the words are written in a different pattern, such as "ache in the lower back" or "lowerback pain", and I have never added that pattern to the look-up table for "lower back pain", I won't find it.
Because I can have a large paragraph in which the only information I need is just 3-4 words, DL/ML models fail to learn from that amount of data with such a high number of labels. I am wondering if there is a way to use the look-up table as a feature in the training phase, or whether I should try other approaches.
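One cheap way to make the look-up robust to such variants, before reaching for a full model, is fuzzy matching. A minimal sketch using rapidfuzz, where the pattern table and threshold are illustrative assumptions; the resulting hits could also be fed to a classifier as binary features:

```python
# Fuzzy look-up that tolerates variants like "lowerback pain".
# LABEL_PATTERNS and the threshold are illustrative assumptions.
from rapidfuzz import fuzz

LABEL_PATTERNS = {
    "lower back pain": ["lower back pain", "pain in the lower back",
                        "ache in the lower back"],
    "resting days": ["resting days", "rest days"],
}

def match_labels(text: str, threshold: float = 85.0) -> list[str]:
    text = text.lower()
    hits = []
    for label, patterns in LABEL_PATTERNS.items():
        # partial_ratio scores the best-matching substring, so
        # "lowerback pain" still scores high against the patterns
        if any(fuzz.partial_ratio(p, text) >= threshold for p in patterns):
            hits.append(label)
    return hits

print(match_labels("Mr X complains about lowerback pain for several days."))
# -> ['lower back pain']
```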
Imagine you have different OCR tools to read text from images, but none of them gives you 100% accurate output. Combined, however, the results could come very close to the ground truth. What would be the best technique to "fuse" the texts together to get a good result?
Example:
Actual text
§ 5.1: The contractor is obliged to announce the delay by 01.01.2019 at the latest. The identification-number to be used is OZ-771LS.
OCR tool 1
5 5.1 The contractor is obliged to announce the delay by O1.O1.2019 at the latest. The identification-number to be used is OZ77lLS.
OCR tool 2
§5.1: The contract or is obliged to announce theedelay by 01.O1. 2O19 at the latest. The identification number to be used is O7-771LS
OCR tool 3
§ 5.1: The contractor is oblige to do announced he delay by 01.01.2019 at the latest. T he identification-number ti be used is OZ-771LS.
What could be a promising algorithm to fuse OCR 1, 2 and 3 to get the actual text?
My first idea was to create a "tumbling window" of arbitrary length, compare the words in the window, and take the word that 2 out of 3 tools predict for every position.
For example with window size 3:
[5 5.1 The]
[§5.1: The contract]
[§ 5.1: The]
As you can see, the algorithm fails here because all three tools have different candidates for position one (5, §5.1: and §).
Of course it would be possible to add tricks like Levenshtein distance to allow some deviation, but I fear this would not be robust enough.
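A more principled baseline is what the ASR community calls ROVER: align the hypotheses token by token, then take a majority vote per position. A minimal standard-library sketch, where using the longest output as the alignment pivot and the purely token-level alignment are simplifications (ties, as in the position-one example above, fall back to the first vote seen):

```python
# ROVER-style fusion: align each OCR output to a pivot, vote per token.
from collections import Counter
from difflib import SequenceMatcher

def align(ref: list[str], hyp: list[str]) -> list:
    """Project hyp's tokens onto ref's positions; unmatched -> None."""
    out = [None] * len(ref)
    sm = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("equal", "replace"):
            for k in range(min(i2 - i1, j2 - j1)):
                out[i1 + k] = hyp[j1 + k]
    return out

def fuse(outputs: list[str]) -> str:
    tokens = [o.split() for o in outputs]
    pivot = max(tokens, key=len)               # longest output as pivot
    aligned = [align(pivot, t) for t in tokens]
    fused = []
    for column in zip(*aligned):               # one column per pivot token
        votes = Counter(t for t in column if t is not None)
        fused.append(votes.most_common(1)[0][0])
    return " ".join(fused)

print(fuse([
    "5 5.1 The contractor is obliged",
    "§5.1: The contract or is obliged",
    "§ 5.1: The contractor is oblige",
]))
# -> "5 5.1 The contractor is obliged"
```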
I have a radio spectrum (converted from the time domain with FFT). For each bin (frequency), I have 100 samples, taken a couple of seconds apart. These samples are power (e.g. -47.5 dBm).
I am testing for normality using the technique seen here. Presumably channels (man-made radio signals) will be "less random" than the noise floor, which is (supposed to be) Gaussian noise.
When I run normaltest on the 100 samples for each frequency, it returns p < 0.055 the majority of the time (which, according to the reference above, means "probably not normal"). This includes many, many frequencies that are part of the noise floor.
Why doesn't this test work well with my setup?
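One premise worth checking with synthetic data is whether the quantity being tested can be Gaussian at all: for ideal receiver noise the voltage is Gaussian, but the power in each FFT bin is then exponentially distributed, and converting to dB skews it again. A quick sketch, where the noise model is an assumption and not your data:

```python
# Sanity check: normaltest on synthetic noise power, linear vs dB.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# complex Gaussian noise in one FFT bin -> power is exponential
volts = rng.normal(size=100) + 1j * rng.normal(size=100)
power = np.abs(volts) ** 2
power_db = 10 * np.log10(power)           # the dBm-style values under test

print(stats.normaltest(power).pvalue)     # exponential: expect a tiny p
print(stats.normaltest(power_db).pvalue)  # log-exponential: still skewed
```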
In the video The Sound of Hydrogen (original here), the sound is created using the NIST Atomic Spectra Database and then importing this edited data into Mathematica to modulate a sine wave. I was wondering how he turned the data from the website into the values shown in the video (3:47, top of the page), because it is nothing like what is initially seen on the website.
Short answer: It's different because in the tutorial the sampling rate is 8 kHz while it's probably higher in the original video.
Long answer:
First of all, note how the Rydberg formula provides the resonance frequencies of hydrogen as $\nu_{nm} = c R \left(\frac1{n^2}-\frac1{m^2}\right)$, where $c$ is the speed of light and $R$ is the Rydberg constant. The highest frequency is $\nu_{1\infty}\approx 3000$ THz, while for $n,m\to\infty$ there is basically no lower limit, though if you restrict yourself to the Lyman series ($n=1$) and the Balmer series ($n=2$), the lower limit is $\nu_{23}\approx 400$ THz. These are electromagnetic frequencies corresponding to light, though not entirely within the visible spectrum (430–790 THz); there is some IR and lots of UV in there which you cannot see. "minutephysics" now simply treats these frequencies as sound frequencies that are remapped to the human hearing range (ca. 20–20000 Hz).
But as the video stated, not all these frequencies resonate with the same strength; the data at http://nist.gov/pml/data/asd.cfm also includes the amplitudes. For the frequency $\nu_{nm}$, let's call the intensity $I_{nm}$ (intensity is amplitude squared; I wonder if the video treated that correctly). Then your signal is simply
$f(t) = \sum\limits_{n=1}^N \sum\limits_{m=n+1}^M I_{nm}\sin\left(2\pi\,\alpha(\nu_{nm})\,t+\phi_{nm}\right)$
where $\alpha$ denotes the frequency rescaling (probably something linear like $\alpha(\nu) = (20 + (\nu-400\cdot10^{12})\cdot\frac{20000-20}{(3000-400)\cdot 10^{12}})$ Hz) and the optional phase $\phi_{nm}$ is probably equal to zero.
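A minimal sketch of that sum in Python/NumPy, with unit intensities $I_{nm}=1$ and zero phases as simplifying assumptions (the video weights each line by the NIST intensity):

```python
# Rydberg lines remapped to the audible range and summed as sines.
# I_nm = 1 and phi_nm = 0 are simplifying assumptions.
import numpy as np
from scipy.io import wavfile

c, R = 2.998e8, 1.097e7      # speed of light (m/s), Rydberg constant (1/m)
fs = 8000                    # the tutorial's sampling rate (Hz)

nu = np.array([c * R * (1 / n**2 - 1 / m**2)
               for n in range(1, 5) for m in range(n + 1, 20)])
nu = nu[nu >= 400e12]        # keep the Lyman/Balmer range discussed above

# alpha: linear remap of [400 THz, 3000 THz] onto [20 Hz, 20 kHz]
f_audio = 20 + (nu - 400e12) * (20000 - 20) / ((3000 - 400) * 1e12)

t = np.arange(3 * fs) / fs   # three seconds of audio
signal = np.sin(2 * np.pi * f_audio[:, None] * t).sum(axis=0)
signal /= np.abs(signal).max()
# note: components mapped above fs/2 = 4 kHz alias at this rate,
# which is exactly the sampling-rate effect discussed below
wavfile.write("hydrogen.wav", fs, signal.astype(np.float32))
```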
Why does it sound slightly different? Probably because the actual video used a higher sampling rate than the 8 kHz used in the tutorial video.
As a follow-up to my previous question: I want my smartphone application to detect a certain musical note, and I only need to know whether the incoming sound is that musical note or not, with a certain amount of fuzziness to allow the note to be off-key by x cents.
Given that, is one method superior to the others in speed and accuracy? That is, knowing that the note you are looking for is, say, a C#3, how best to tell whether that note is present or not? I'm assuming that looking for a single note would be easier than separating out all the waveforms and then examining the results for the fundamental frequency.
In the responses to my original question, one respondent suggested that autocorrelation might work well if you know that the notes are within a certain range. I wonder if autocorrelation would work even better if you only have to check for the presence or absence of a certain note (+/- x cents).
Those methods being:
Kiss FFT
FFTW
Discrete Wavelet Transform
autocorrelation
zero crossing analysis
octave-spaced filters
Any thoughts would be appreciated.
As you describe it, you just need to determine if a particular pitch is present. A very simple (fast) detector would just record the equivalent of one period of the waveform, then record another period and correlate them, like an oversimplified (single-lag) autocorrelation. If there's a high match, you know the waveform being recorded is repeating at around the same period, or a harmonic of it.
For instance, to detect 1 kHz, record 1 ms of audio (48 samples at 48 kHz), then record another 1 ms, and compare them (correlate = multiply all samples and sum). If they line up (correlation above some threshold), then you're listening to 1 kHz, 2 kHz, 3 kHz, or some other multiple. Doing several periods would give you more confidence on the match.
A true autocorrelation would tell you which harmonic, specifically, if that's important to you.
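Here is a minimal sketch of that single-lag detector, with the 0.8 threshold and the two-window frame handling as illustrative assumptions:

```python
# Single-lag detector: correlate two adjacent one-period windows.
import numpy as np

def note_present(samples: np.ndarray, freq_hz: float, fs: int = 48000,
                 threshold: float = 0.8) -> bool:
    period = int(round(fs / freq_hz))         # e.g. 48 samples for 1 kHz
    a, b = samples[:period], samples[period:2 * period]
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    if denom == 0:
        return False                          # silence
    return np.dot(a, b) / denom >= threshold  # normalized correlation

fs = 48000
t = np.arange(2 * 48) / fs                    # two 1 ms windows
print(note_present(np.sin(2 * np.pi * 1000 * t), 1000))  # True
print(note_present(np.sin(2 * np.pi * 1300 * t), 1000))  # False
```

Note that, as stated above, this also fires on 2 kHz, 3 kHz, and other multiples; correlating over several periods, or a true multi-lag autocorrelation, would narrow that down.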