Azure TTS neural voice audio file is created abnormally with a size of 1 byte - azure

Azure TTS standard voice audio files are generated normally. However, for a neural voice, the audio file is generated abnormally with a size of 1 byte. The code is below.
C# code:
public static async Task SynthesizeAudioAsync()
{
    var config = SpeechConfig.FromSubscription("xxxxxxxxxKey", "xxxxxxxRegion");
    using var synthesizer = new SpeechSynthesizer(config, null);
    var ssml = File.ReadAllText("C:/ssml.xml");
    var result = await synthesizer.SpeakSsmlAsync(ssml);
    using var stream = AudioDataStream.FromResult(result);
    await stream.SaveToWaveFileAsync("C:/file.wav");
}
ssml.xml - The file below, set to a standard voice, works fine.
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-GB-George-Apollo">
When you're on the motorway, it's a good idea to use a sat-nav.
</voice>
</speak>
ssml.xml - However, the following file, set to a neural voice, does not work, and an empty audio file is created.
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AriaNeural">
When you're on the motorway, it's a good idea to use a sat-nav.
</voice>
</speak>

Looking at the behavior you have described, the Speech service has returned no audio bytes due to some issue.
I have checked the SSML file at my end and it works completely fine, i.e. there are no issues with the SSML.
As a next step toward a solution, I would recommend adding error-handling code to get a better picture of the error and act accordingly:
var config = SpeechConfig.FromSubscription("xxxxxxxxxKey", "xxxxxxxRegion");
using var synthesizer = new SpeechSynthesizer(config, null);
var ssml = File.ReadAllText("C:/ssml.xml");
var result = await synthesizer.SpeakSsmlAsync(ssml);
if (result.Reason == ResultReason.SynthesizingAudioCompleted)
{
    Console.WriteLine("No error");
    using var stream = AudioDataStream.FromResult(result);
    await stream.SaveToWaveFileAsync("C:/file.wav");
}
else if (result.Reason == ResultReason.Canceled)
{
    var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
    Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");
    if (cancellation.Reason == CancellationReason.Error)
    {
        Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
        Console.WriteLine($"CANCELED: ErrorDetails=[{cancellation.ErrorDetails}]");
    }
}
The above modification will print a friendly error message in the console app.
Note: If you are not using a console app, you will have to adjust the code accordingly.
Sample output: the console will show the cancellation reason and error details; the exact error you see may be different.

Related

Azure Text to Speech (Cognitive Services) in web app - how to stop it from outputting audio?

I'm using Azure Cognitive Services for Text to Speech in a web app.
I return the bytes to the browser and it works great; however, on the server (or local machine) the speechSynthesizer.SpeakTextAsync(inp) line outputs the audio to the speaker.
Is there a way to turn this off, since this runs on a web server? (And even if I ignore it, there's the delay while it outputs the audio before sending back the data.)
Here's my code ...
var speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);
speechConfig.SpeechSynthesisVoiceName = "fa-IR-FaridNeural";
speechConfig.OutputFormat = OutputFormat.Detailed;
using (var speechSynthesizer = new SpeechSynthesizer(speechConfig))
{
    // todo - how to disable it saying it here?
    var speechSynthesisResult = await speechSynthesizer.SpeakTextAsync(inp);
    return Convert.ToBase64String(speechSynthesisResult.AudioData);
}
What you can do is add an AudioConfig to the speechSynthesizer.
In this AudioConfig object you can specify a file path to a .wav file which already exists on the server.
Whenever you run SpeakTextAsync, the data will be redirected to that .wav file instead of the speaker.
You can then read this audio file and perform your logic later.
Just add the following code before creating the speechSynthesizer object.
var audioconfig = AudioConfig.FromWavFileOutput(filepath);
Here filepath is the location of the .wav file, as a string.
Complete code:
string filepath = "<file path>";
var speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);
var audioconfig = AudioConfig.FromWavFileOutput(filepath);
speechConfig.SpeechSynthesisVoiceName = "fa-IR-FaridNeural";
speechConfig.OutputFormat = OutputFormat.Detailed;
using (var speechSynthesizer = new SpeechSynthesizer(speechConfig, audioconfig))
{
    // the audioconfig above redirects the synthesized audio to the .wav file instead of the speaker
    var speechSynthesisResult = await speechSynthesizer.SpeakTextAsync(inp);
    return Convert.ToBase64String(speechSynthesisResult.AudioData);
}

How can I handle embedded text tracks on Chromecast

I'm trying to play a live stream, which is an .m3u8 file that uses embedded text tracks (I mean the text track is wrapped with the video segments in the same .ts file). It works well on a videoElement, since I can access the text track with videoElement.textTracks. However, when I cast it to Chromecast, the text tracks do not show up.
Here is code snippet:
Sender:
const englishSubtitle = new chrome.cast.media.Track('1', // track ID
    window.chrome.cast.media.TrackType.TEXT);
englishSubtitle.subtype = chrome.cast.media.TextTrackType.CAPTIONS;
englishSubtitle.name = 'English';
englishSubtitle.language = 'en-US';
englishSubtitle.customData = null;
englishSubtitle.trackContentId = '1';
englishSubtitle.trackContentType = 'text/vtt';
mediaInfo.tracks = [englishSubtitle];
const loadRequest = new chrome.cast.media.LoadRequest(mediaInfo);
Receiver:
const textTracksManager = this.playerManager.getTextTracksManager();
const tracks = textTracksManager.getTracks();
textTracksManager.setActiveByIds([tracks[0].trackId]);
BTW: This configuration works well when casting VOD content, but we use out-of-band text tracks in VOD (I mean we use a separate .vtt file URL).
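Not an answer, but for reference: the receiver snippet above is typically run inside a PLAYER_LOAD_COMPLETE listener, so that getTracks() is only called after the loaded media's track list has been populated. A minimal CAF receiver sketch of that wiring (the empty-list guard is an illustrative addition, not taken from the question):
const context = cast.framework.CastReceiverContext.getInstance();
const playerManager = context.getPlayerManager();

playerManager.addEventListener(
    cast.framework.events.EventType.PLAYER_LOAD_COMPLETE,
    () => {
        const textTracksManager = playerManager.getTextTracksManager();
        const tracks = textTracksManager.getTracks();
        if (tracks && tracks.length > 0) {
            // activate the first text track, as in the snippet above
            textTracksManager.setActiveByIds([tracks[0].trackId]);
        }
    }
);

context.start();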

Posting MP4 Videos to Instagram always returns error code 2207026

I posted an MP4 video to Instagram using the Facebook Graph API for Content Publishing with the code below, but most of the time I'm getting back error code 2207026 when checking the status of the video.
The code is below:
let mediaContainerUrl = `https://graph.facebook.com/${igUserId}/media`;
let videoURL = `URL Of MP4 Video Hosted on Firebase Storage`;
let containerParams = new URLSearchParams();
containerParams.append('caption', body);
containerParams.append('video_url', videoURL);
containerParams.append('access_token', targetAccessToken);
let mediaContainerResponse = await axios.post(mediaContainerUrl, containerParams);
let { id: mediaContainerId } = mediaContainerResponse.data; //Upload is done here
I then checked the status of the uploaded video using the code below:
let mediaContainerStatusEndpoint = `https://graph.facebook.com/${mediaContainerId}?fields=status_code,status&access_token=${targetAccessToken}`;
let { data: mediaContainerStatus } = await axios.get(mediaContainerStatusEndpoint);
let { status_code, status } = mediaContainerStatus; //Unfortunately, status is most times coming back as `2207026` even for valid videos
So, is there a reason I keep getting back status code 2207026? Status code 2207026 means invalid video format, but the video format seems valid to me.
Please note that the video is an MP4 and also under 60 seconds.
Any ideas on this would be greatly appreciated.
Thanks
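For context, the status check above is normally repeated, since video containers stay IN_PROGRESS for a while, and the container is only published once status_code reports FINISHED. Below is a minimal sketch of that polling-then-publish flow using the same axios setup; the /media_publish call, the retry limit, and the helper name are illustrative assumptions based on the standard Content Publishing flow, not taken from the question:
// Poll the container status, then publish it once processing has finished.
async function waitAndPublish(igUserId, mediaContainerId, targetAccessToken) {
    const statusEndpoint = `https://graph.facebook.com/${mediaContainerId}?fields=status_code,status&access_token=${targetAccessToken}`;

    let finished = false;
    for (let attempt = 0; attempt < 10 && !finished; attempt++) {
        const { data } = await axios.get(statusEndpoint);
        if (data.status_code === 'FINISHED') {
            finished = true;
        } else if (data.status_code === 'ERROR' || data.status_code === 'EXPIRED') {
            throw new Error(`Container failed: ${data.status}`);
        } else {
            // still IN_PROGRESS; wait 10 seconds before checking again
            await new Promise((resolve) => setTimeout(resolve, 10000));
        }
    }
    if (!finished) throw new Error('Container did not finish processing in time');

    // Publish the finished container to the Instagram account.
    const publishParams = new URLSearchParams();
    publishParams.append('creation_id', mediaContainerId);
    publishParams.append('access_token', targetAccessToken);
    const publishResponse = await axios.post(
        `https://graph.facebook.com/${igUserId}/media_publish`,
        publishParams
    );
    return publishResponse.data; // { id: <published media id> }
}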

Unable to stream microphone audio to Google Speech to Text with NodeJS

I am developing a simple web-based Speech to Text project with NodeJS, ws (WebSocket), and Google's Speech to Text API.
However, I have had no luck getting a transcript back from Google's Speech to Text API.
Below is my server-side code (server.js):
ws.on('message', function (message) {
    if (typeof message === 'string') {
        if (message == "connected") {
            console.log(`Web browser connected postback.`);
        }
    }
    else {
        if (recognizeStream !== null) {
            const buffer = new Int16Array(message, 0, Math.floor(message.byteLength / 2));
            recognizeStream.write(buffer);
        }
    }
});
Below is my client-side code (ws.js):
function recorderProcess(e) {
    var floatSamples = e.inputBuffer.getChannelData(0);
    const ConversionFactor = 2 ** (16 - 1) - 1;
    var floatSamples16 = Int16Array.from(floatSamples.map(n => n * ConversionFactor));
    ws.send(floatSamples16);
}

function successCallback(stream) {
    window.stream = stream;
    var audioContext = window.AudioContext;
    var context = new audioContext();
    var audioInput = context.createMediaStreamSource(stream);
    var recorder = context.createScriptProcessor(2048, 1, 1);
    recorder.onaudioprocess = recorderProcess;
    audioInput.connect(recorder);
    recorder.connect(context.destination);
}
When I run the project and open http://localhost/ in my browser, I try speaking some sentences into the microphone. Unfortunately, no transcription is returned, and no error messages appear in the NodeJS console.
When I check the status in the Google Cloud Console, it only displays a 499 code in the dashboard.
Many thanks for helping!
I think the issue could be related to the stream processing. Maybe some streaming process is stopped before the end of an operation. My suggestion is to review the callbacks in the JavaScript code in order to find some "broken" promises.
Also, maybe it's obvious, but there is a different doc for audio longer than a minute:
https://cloud.google.com/speech-to-text/docs/async-recognize
CANCELLED - The operation was cancelled, typically by the caller.
HTTP Mapping: 499 Client Closed Request
Given that error message, this could also be related to the asynchronous and multithreaded features of Node.js.
Hope this works!
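For completeness, the server code above writes to recognizeStream but never shows how that stream is created. Here is a minimal sketch of the usual setup with @google-cloud/speech; the config values are assumptions, and in particular sampleRateHertz has to match what the browser actually captures, which is often 44100 or 48000 rather than 16000:
const speech = require('@google-cloud/speech');
const client = new speech.SpeechClient();

const request = {
    config: {
        encoding: 'LINEAR16',
        sampleRateHertz: 48000, // must match the AudioContext's real sample rate
        languageCode: 'en-US',
    },
    interimResults: true,
};

// Duplex stream: write raw audio chunks in, transcription results come out.
const recognizeStream = client
    .streamingRecognize(request)
    .on('error', (err) => console.error('Speech-to-Text error:', err))
    .on('data', (data) => {
        const result = data.results[0];
        if (result && result.alternatives[0]) {
            console.log(`Transcript: ${result.alternatives[0].transcript}`);
        }
    });

// In the ws 'message' handler: recognizeStream.write(audioChunk);
// and recognizeStream.end() once the browser stops sending audio.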

Dialogflow, detection intent from audio

I'm trying to send an audio file to the Dialogflow API for intent detection. I already have an agent working quite well, but only with text. I'm trying to add the audio feature, but with no luck.
I'm using the example (Java) provided on this page:
https://cloud.google.com/dialogflow-enterprise/docs/detect-intent-audio#detect-intent-text-java
This is my code:
public DetectIntentResponse detectIntentAudio(String projectId, byte[] bytes, String sessionId,
        String languageCode) throws Exception {
    // Set the session name using the sessionId (UUID) and projectID (my-project-id)
    SessionName session = SessionName.of(projectId, sessionId);
    System.out.println("Session Path: " + session.toString());

    // Note: hard coding audioEncoding and sampleRateHertz for simplicity.
    // Audio encoding of the audio content sent in the query request.
    AudioEncoding audioEncoding = AudioEncoding.AUDIO_ENCODING_LINEAR_16;
    int sampleRateHertz = 16000;

    // Instructs the speech recognizer how to process the audio content.
    InputAudioConfig inputAudioConfig = InputAudioConfig.newBuilder()
        .setAudioEncoding(audioEncoding)     // audioEncoding = AudioEncoding.AUDIO_ENCODING_LINEAR_16
        .setLanguageCode(languageCode)       // languageCode = "en-US"
        .setSampleRateHertz(sampleRateHertz) // sampleRateHertz = 16000
        .build();

    // Build the query with the InputAudioConfig
    QueryInput queryInput = QueryInput.newBuilder().setAudioConfig(inputAudioConfig).build();

    // Read the bytes from the audio file
    byte[] inputAudio = Files.readAllBytes(Paths.get("/home/rmg/Audio/book_a_room.wav"));
    byte[] encodedAudio = Base64.encodeBase64(inputAudio);

    // Build the DetectIntentRequest
    DetectIntentRequest request = DetectIntentRequest.newBuilder()
        .setSession("projects/" + projectId + "/agent/sessions/" + sessionId)
        .setQueryInput(queryInput)
        .setInputAudio(ByteString.copyFrom(encodedAudio))
        .build();

    // Performs the detect intent request
    DetectIntentResponse response = sessionsClient.detectIntent(request);

    // Display the query result
    QueryResult queryResult = response.getQueryResult();
    System.out.println("====================");
    System.out.format("Query Text: '%s'\n", queryResult.getQueryText());
    System.out.format("Detected Intent: %s (confidence: %f)\n",
        queryResult.getIntent().getDisplayName(), queryResult.getIntentDetectionConfidence());
    System.out.format("Fulfillment Text: '%s'\n", queryResult.getFulfillmentText());
    return response;
}
I have tried several formats, wav (PCM 16-bit, at several sample rates) and FLAC, and I have also converted the bytes to base64 in two different ways, as described here (by code or console):
https://dialogflow.com/docs/reference/text-to-speech
I have even tested with the .wav provided in that example, creating a new intent in my agent called "book a room" with that training phrase. It works using text and audio from the Dialogflow console, but from my code it only works with text, not audio... and I'm sending the same wav they provide! (code above)
I always receive the same response (QueryResult).
I need a clue or something; I'm totally stuck here. No logs, no errors in the response... but it does not work.
Thanks
I wrote to Dialogflow support and they replied with a working piece of code. It is basically the same as posted above; the only difference is the base64 encoding, which is not necessary.
So I removed:
byte[] encodedAudio = Base64.encodeBase64(inputAudio);
(And used inputAudio directly)
Now it is working as expected...
