C++ Microsoft SAPI: How to set Windows text-to-speech output to a memory buffer? - visual-c++

I have been trying to figure out how to "speak" text into a memory buffer using Windows SAPI 5.1, but so far without success, even though it seems it should be quite simple.
There is an example of streaming the synthesized speech into a .wav file, but no example of streaming it to a memory buffer.
In the end I need the synthesized speech in a char* array in 16 kHz 16-bit little-endian PCM format. Currently I create a temporary .wav file, redirect the speech output there, and then read it back, but that seems to be a rather stupid solution.
Anyone knows how to do that?
Thanks!

Look at ISpStream::SetBaseStream. Here's a little helper:
inline HRESULT SPCreateStreamOnHGlobal(
    HGLOBAL hGlobal,             // Memory handle for the stream object
    BOOL fDeleteOnRelease,       // Whether to free the memory when the object is released
    const WAVEFORMATEX * pwfex,  // WAVEFORMATEX for the stream
    ISpStream ** ppStream)       // Address of variable to receive the ISpStream pointer
{
    HRESULT hr;
    IStream * pMemStream;
    *ppStream = NULL;
    hr = ::CreateStreamOnHGlobal(hGlobal, fDeleteOnRelease, &pMemStream);
    if (SUCCEEDED(hr))
    {
        hr = ::CoCreateInstance(CLSID_SpStream, NULL, CLSCTX_ALL, __uuidof(*ppStream), (void **)ppStream);
        if (SUCCEEDED(hr))
        {
            hr = (*ppStream)->SetBaseStream(pMemStream, SPDFID_WaveFormatEx, pwfex);
            if (FAILED(hr))
            {
                (*ppStream)->Release();
                *ppStream = NULL;
            }
        }
        pMemStream->Release();
    }
    return hr;
}
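For completeness, here is a minimal usage sketch (untested; it assumes the usual SAPI/ATL headers, an already-initialized ISpVoice named cpVoice, and the helper above): speak into an HGLOBAL-backed memory stream, then read the raw 16 kHz 16-bit PCM back out of the same ISpStream, which is itself an IStream.
// Hypothetical usage sketch, not production code.
CSpStreamFormat fmt;
fmt.AssignFormat(SPSF_16kHz16BitMono);            // 16 kHz, 16-bit, mono PCM

CComPtr<ISpStream> cpStream;
HRESULT hr = SPCreateStreamOnHGlobal(NULL, TRUE, fmt.WaveFormatExPtr(), &cpStream);
if (SUCCEEDED(hr))
    hr = cpVoice->SetOutput(cpStream, TRUE);
if (SUCCEEDED(hr))
    hr = cpVoice->Speak(L"Hello from memory", SPF_DEFAULT, NULL);

// Rewind and pull the samples out: ISpStream derives from IStream.
STATSTG stat = {};
cpStream->Stat(&stat, STATFLAG_NONAME);
LARGE_INTEGER zero = {};
cpStream->Seek(zero, STREAM_SEEK_SET, NULL);

std::vector<char> pcm((size_t)stat.cbSize.QuadPart);
ULONG read = 0;
cpStream->Read(pcm.data(), (ULONG)pcm.size(), &read);  // 16-bit little-endian PCM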

I accomplished it using ISpStream. Use the SetBaseStream method of the ISpStream to bind it to an IStream, then set the output of the ISpVoice to that ISpStream.
Here is my working solution if anybody wants it :
https://github.com/itsyash/MS-SAPI-demo

Do you know how to create a memory-mapped file? You could see if the ISpStream will bind to it.

Related

What is the optimal way to fetch graphics card name and info in a Windows C++ app?

I'm trying to pull some information from client machines, formatted almost exactly as "dxdiag.exe" shows it.
I realize there should be an API for this type of functionality, but I've searched and searched and can't figure out which library or header file I need to include to access it. Any help is greatly appreciated.
For Windows Vista or later, use DXGI.
#include <dxgi.h>
#include <wrl/client.h>

#pragma comment(lib, "dxgi.lib")

using Microsoft::WRL::ComPtr;

ComPtr<IDXGIFactory1> dxgiFactory;
HRESULT hr = CreateDXGIFactory1(IID_PPV_ARGS(dxgiFactory.ReleaseAndGetAddressOf()));
if (FAILED(hr))
{
    // ... error handling
}

ComPtr<IDXGIAdapter1> adapter;
for (UINT adapterIndex = 0;
     SUCCEEDED(dxgiFactory->EnumAdapters1(
         adapterIndex,
         adapter.ReleaseAndGetAddressOf()));
     adapterIndex++)
{
    DXGI_ADAPTER_DESC1 desc = {};
    hr = adapter->GetDesc1(&desc);
    if (FAILED(hr))
    {
        // ... error handling
    }

    if (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE)
    {
        // Don't select the Basic Render Driver adapter.
        continue;
    }

    // desc.VendorId: vendor ID (VID)
    // desc.DeviceId: device ID (PID)
    // desc.Description: adapter name string seen in dxdiag
}
You can also look at the source code for DirectX Capabilities Viewer and the sample SystemInfoUWP.
I used Microsoft::WRL::ComPtr as a C++ smart-pointer for COM.

Retrieving D3D12 Video Decode Status

I am implementing an H.264 video decoder using the Direct3D 12 APIs. While I am very new to Direct3D, I do have experience with other graphics APIs and H.264. I have been struggling to find decent examples of D3D12 video decoding, but I have got as far as (seemingly) successfully submitting my work to the decoder.
However, I am at a loss as to how to query the decode status. From piecing together the documentation and some other code I found, I think it's something like this (mapping the result into the status struct), but I get an invalid argument error. Could anyone point me in the right direction? Any good examples of D3D12 video decoding would also be a great resource.
// decode_commands is an ID3D12VideoDecodeCommandList
// query_heap is an ID3D12QueryHeap
// device is an ID3D12Device

// Make query for decode stats.
decode_commands->EndQuery(query_heap.Get(), D3D12_QUERY_TYPE::D3D12_QUERY_TYPE_VIDEO_DECODE_STATISTICS, 0);

// Create buffer for query result.
ComPtr<ID3D12Resource> query_result;
D3D12_RESOURCE_DESC query_result_description = CD3DX12_RESOURCE_DESC::Buffer(sizeof(D3D12_QUERY_DATA_VIDEO_DECODE_STATISTICS));
HRESULT create_query_result = device->CreateCommittedResource(
    &CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE::D3D12_HEAP_TYPE_DEFAULT),
    D3D12_HEAP_FLAGS::D3D12_HEAP_FLAG_NONE,
    &query_result_description,
    D3D12_RESOURCE_STATES::D3D12_RESOURCE_STATE_COPY_DEST,
    nullptr,
    IID_PPV_ARGS(&query_result));
if (FAILED(create_query_result)) {
    log("Failed to create query result");
    return false;
}

// Resolve query.
decode_commands->ResolveQueryData(query_heap.Get(), D3D12_QUERY_TYPE::D3D12_QUERY_TYPE_VIDEO_DECODE_STATISTICS, 0, 1, query_result.Get(), 0);

// Get stats from query result.
D3D12_RANGE range;
range.Begin = 0;
range.End = sizeof(D3D12_QUERY_DATA_VIDEO_DECODE_STATISTICS);
D3D12_QUERY_DATA_VIDEO_DECODE_STATISTICS stats;
HRESULT map = query_result->Map(0, &range, reinterpret_cast<void**>(&stats));
if (FAILED(map)) {
    log("Failed to map query result");
    return false;
}
Perhaps you should begin with a working sample that uses the D3D9 API:
H264Dxva2Decoder
Then write a D3D12 decoder on top of that; that's what I would do.
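For the mapping error specifically, two things in the snippet look off: Map expects a pointer that receives the mapped address (not the struct itself reinterpreted), and the destination buffer lives in a DEFAULT heap, which is not CPU-mappable. A rough, untested sketch of the readback path, reusing device, decode_commands and query_heap from the question: resolve the query into a buffer created in a READBACK heap, execute and fence the command list, and only then map it and copy the struct out.
// Hypothetical sketch: resolve into a READBACK-heap buffer so the CPU can map it.
ComPtr<ID3D12Resource> readback;
CD3DX12_HEAP_PROPERTIES heap_props(D3D12_HEAP_TYPE_READBACK);
CD3DX12_RESOURCE_DESC buffer_desc = CD3DX12_RESOURCE_DESC::Buffer(sizeof(D3D12_QUERY_DATA_VIDEO_DECODE_STATISTICS));
HRESULT hr = device->CreateCommittedResource(
    &heap_props,
    D3D12_HEAP_FLAG_NONE,
    &buffer_desc,
    D3D12_RESOURCE_STATE_COPY_DEST,   // readback buffers are created in COPY_DEST
    nullptr,
    IID_PPV_ARGS(&readback));

decode_commands->EndQuery(query_heap.Get(), D3D12_QUERY_TYPE_VIDEO_DECODE_STATISTICS, 0);
decode_commands->ResolveQueryData(query_heap.Get(), D3D12_QUERY_TYPE_VIDEO_DECODE_STATISTICS, 0, 1, readback.Get(), 0);

// ... close the command list, execute it, and wait on a fence here ...

D3D12_QUERY_DATA_VIDEO_DECODE_STATISTICS stats = {};
D3D12_RANGE read_range = { 0, sizeof(stats) };
void* mapped = nullptr;
hr = readback->Map(0, &read_range, &mapped);
if (SUCCEEDED(hr)) {
    memcpy(&stats, mapped, sizeof(stats));
    D3D12_RANGE no_write = { 0, 0 };  // nothing was written by the CPU
    readback->Unmap(0, &no_write);
}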

Decode streaming audio with gstreamer 1.0 and access the waveform data?

The gstreamer version in use is 1.8.1.
Currently I have code that receives a gstreamer-encoded stream and plays it through my sound card. I want to modify it to instead give my application access to the raw, uncompressed audio data. This should result in an array of integer sound samples; if I were to plot them I would see the audio waveform (e.g. a pure tone would be a nice sine wave), and if I were to append the most recent array to the previous one received by a callback I wouldn't see any discontinuity.
This is the current playback code:
https://github.com/lucasw/audio_common/blob/master/audio_play/src/audio_play.cpp
I think I need to change the alsasink to an appsink and set up a callback that will get the latest chunk of audio after it has passed through the decoder. This is adapted from https://github.com/jojva/gst-plugins-base/blob/master/tests/examples/app/appsink-src.c :
_sink = gst_element_factory_make("appsink", "sink");
g_object_set (G_OBJECT (_sink), "emit-signals", TRUE,
              "sync", FALSE, NULL);
g_signal_connect (_sink, "new-sample",
                  G_CALLBACK (on_new_sample_from_sink), this);
And then there is the callback:
static GstFlowReturn
on_new_sample_from_sink (GstElement * elt, gpointer data)
{
  RosGstProcess *client = reinterpret_cast<RosGstProcess*>(data);
  GstSample *sample;
  GstBuffer *app_buffer, *buffer;
  GstElement *source;

  /* get the sample from appsink */
  sample = gst_app_sink_pull_sample (GST_APP_SINK (elt));
  buffer = gst_sample_get_buffer (sample);

  /* make a copy */
  app_buffer = gst_buffer_copy (buffer);

  /* we don't need the appsink sample anymore */
  gst_sample_unref (sample);

  /* get source and push new buffer */
  source = gst_bin_get_by_name (GST_BIN (client->_sink), "app_source");
  return gst_app_src_push_buffer (GST_APP_SRC (source), app_buffer);
}
Can I get at the data in that callback? What am I supposed to do with the GstFlowReturn? If that is passing data on to another pipeline element, I don't want to do that; I'd rather get the data there and be done.
https://github.com/lucasw/audio_common/blob/appsink/audio_process/src/audio_process.cpp
Is the gpointer data passed to that callback exactly what I want (cast to a gint16 array?), or if not, how do I convert and access it?
The GstFlowReturn is merely a return value for the underlying base classes. If you return an error there, the pipeline probably stops because... well, there was a critical error.
The cb_need_data events are triggered by your appsrc element. This can be used as a throttling mechanism if needed. Since you probably use the appsrc in pure push mode (as soon as something arrives at the appsink you push it to the appsrc), you can ignore these. You also explicitly disable these events on the appsrc element. (Or do you still use it?)
The data format in the buffer depends on the caps that the decoder and the appsink agreed on. That is usually the decoder's preferred format. You may have some control over this format depending on the decoder, or you can convert it to your preferred format. It may be worthwhile to check the format; Float32 is not that uncommon.
I kind of forgot what your actual question was, I'm afraid...
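If you want to guarantee what the mapped bytes mean, one option (a sketch, assuming an audioconvert/audioresample element sits before the appsink so the conversion can actually happen; the rate and channel count here are example values) is to force caps on the appsink so the buffers arrive as interleaved signed 16-bit samples:
/* Hypothetical sketch: pin the appsink to interleaved S16LE so the mapped
 * buffer can be read directly as an array of gint16. */
GstCaps *caps = gst_caps_new_simple ("audio/x-raw",
    "format",   G_TYPE_STRING, "S16LE",
    "layout",   G_TYPE_STRING, "interleaved",
    "channels", G_TYPE_INT,    1,
    "rate",     G_TYPE_INT,    16000,
    NULL);
gst_app_sink_set_caps (GST_APP_SINK (_sink), caps);
gst_caps_unref (caps);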
I can interpret the data out of the modified callback below (there is a script that plots it to the screen); it looks like it is signed 16-bit samples packed into the uint8 array.
I'm still not clear about the proper return value for the callback; there is a cb_need_data callback set up elsewhere in the code that is getting triggered all the time with this code.
static GstFlowReturn
on_new_sample_from_sink (GstElement * elt, gpointer data)
{
  RosGstProcess *client = reinterpret_cast<RosGstProcess*>(data);
  GstSample *sample;
  GstBuffer *buffer;

  /* get the sample from appsink */
  sample = gst_app_sink_pull_sample (GST_APP_SINK (elt));
  buffer = gst_sample_get_buffer (sample);

  GstMapInfo map;
  if (gst_buffer_map (buffer, &map, GST_MAP_READ))
  {
    audio_common_msgs::AudioData msg;
    msg.data.resize(map.size);
    // TODO(lucasw) copy this more efficiently
    for (size_t i = 0; i < map.size; ++i)
    {
      msg.data[i] = map.data[i];
    }
    gst_buffer_unmap (buffer, &map);
    client->_pub.publish(msg);
  }

  /* release the sample pulled from the appsink */
  gst_sample_unref (sample);

  /* the "new-sample" signal expects a GstFlowReturn */
  return GST_FLOW_OK;
}
https://github.com/lucasw/audio_common/tree/appsink

SAPI 5 TTS Events

I'm writing to ask for some advice on a particular problem regarding the SAPI engine. I have an application that can speak both to the speakers and to a WAV file. I also need to be notified of certain events, i.e. word boundary and end of input.
m_cpVoice->SetNotifyWindowMessage(m_hWnd, TTS_MSG, 0, 0);
hr = m_cpVoice->SetInterest(SPFEI_ALL_EVENTS, SPFEI_ALL_EVENTS);
Just for testing I added all events! When the engine speaks to the speakers, all events are triggered and sent to the m_hWnd window, but when I set the output to the WAV file, none of them are sent:
CSpStreamFormat fmt;
CComPtr<ISpStreamFormat> pOld;
m_cpVoice->GetOutputStream(&pOld);
fmt.AssignFormat(pOld);
SPBindToFile(file, SPFM_CREATE_ALWAYS, &m_wavStream, &fmt.FormatId(), fmt.WaveFormatExPtr());
m_cpVoice->SetOutput(m_wavStream, false);
m_cpVoice->Speak(L"Test", SPF_ASYNC, 0);
Where file is a path passed as an argument.
This code is actually taken from the TTS samples found in the SAPI SDK. The part setting the format seems a little bit obscure...
Can you help me find the problem? Or does anyone know a better way to write TTS output to a WAV file? I cannot use managed code; it needs to be the C++ version...
Thank you very much for your help.
EDIT 1
This seems to be a threading problem. Searching in the spuihelp.h file, which contains the SPBindToFile helper, I found that it uses the CoCreateInstance() function to create the stream. Maybe this is where the ISpVoice object loses its ability to send events to its creation thread.
What do you think about that?
I adopted an on-the-fly solution that I think should be acceptable in most cases. In fact, when you write speech to a file, the main event you care about is the "stop" event.
So... take a look at the class definition:
#define TTS_WAV_SAVED_MSG 5000
#define TTS_WAV_ERROR_MSG 5001

class CSpeech {
public:
    CSpeech(HWND);              // needed for the notifications
    ...
private:
    HWND m_hWnd;
    CComPtr<ISpVoice> m_cpVoice;
    ...
    std::thread* m_thread;
    void WriteToWave();
    void SpeakToWave(LPCWSTR, LPCWSTR);
};
I implemented the SpeakToWave method as follows:
// Global variables (***)
LPCWSTR tMsg;
LPCWSTR tFile;
long tRate;
HWND tHwnd;
ISpObjectToken* pToken;

void CSpeech::SpeakToWave(LPCWSTR file, LPCWSTR msg) {
    // Using, for example, wcscpy_s:
    // tMsg  <- msg;
    // tFile <- file;
    tHwnd = m_hWnd;
    m_cpVoice->GetRate(&tRate);
    m_cpVoice->GetVoice(&pToken);
    if (m_thread == NULL)
        m_thread = new std::thread(&CSpeech::WriteToWave, this);
}
And now... take a look at the WriteToWave() method:
void CSpeech::WriteToWave() {
    // Create a new ISpVoice that exists only in this
    // new thread, so we need to:
    //
    //   CoInitialize(...) and...
    //   CoCreateInstance(...)
    //
    // Now set up the voice, i.e.
    //   rate from the global tRate,
    //   voice token from the global pToken,
    //   output format and...
    //   bind the stream using tFile, as I did in the
    //   code listed in my question
    cpVoice->Speak(tMsg, SPF_PURGEBEFORESPEAK, 0);
    ...
Now, because we did not use the SPF_ASYNC flag, the call is blocking, but because we are on a separate thread the main thread can continue. After the Speak() method has finished, the new thread can continue as follows:
    ...
    if (/* Speak went ok */)
        ::PostMessage(tHwnd, TTS_WAV_SAVED_MSG, 0, 0);
    else
        ::PostMessage(tHwnd, TTS_WAV_ERROR_MSG, 0, 0);
}
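To make the comments above concrete, here is a rough sketch of what the worker body could look like (hypothetical and untested, reusing the globals and the SPBindToFile call from the question; the output format is an assumption):
void CSpeech::WriteToWave() {
    ::CoInitialize(NULL);
    CComPtr<ISpVoice> cpVoice;
    HRESULT hr = cpVoice.CoCreateInstance(CLSID_SpVoice);
    if (SUCCEEDED(hr)) {
        cpVoice->SetRate(tRate);
        cpVoice->SetVoice(pToken);

        // Bind the output stream to the target file, as in the question.
        CSpStreamFormat fmt;
        fmt.AssignFormat(SPSF_22kHz16BitMono);   // assumed output format
        CComPtr<ISpStream> cpStream;
        hr = SPBindToFile(tFile, SPFM_CREATE_ALWAYS, &cpStream,
                          &fmt.FormatId(), fmt.WaveFormatExPtr());
        if (SUCCEEDED(hr)) {
            cpVoice->SetOutput(cpStream, FALSE);
            hr = cpVoice->Speak(tMsg, SPF_PURGEBEFORESPEAK, 0);  // blocking: no SPF_ASYNC
            cpStream->Close();
        }
    }
    ::PostMessage(tHwnd, SUCCEEDED(hr) ? TTS_WAV_SAVED_MSG : TTS_WAV_ERROR_MSG, 0, 0);
    ::CoUninitialize();
}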
(***) OK! using global variables is not quite cool :) but I was going fast. Maybe using a thread with the std::reference_wrapper to pass parameters would be more elegant!
Obviously, when you receive one of the TTS messages you need to clean up the thread before the next call! This can be done with a CSpeech::CleanThread() method like this:
void CSpeech::CleanThread() {
    m_thread->join();   // I prefer to be sure the thread has finished!
    delete m_thread;
    m_thread = NULL;
}
What do you think about this solution? Too complex?

Libav and xaudio2 - audio not playing

I am trying to get audio playing with libav using XAudio2. The XAudio2 code I am using works with an older ffmpeg using avcodec_decode_audio2, but that has been deprecated in favour of avcodec_decode_audio4. I have tried following various libav examples, but can't seem to get the audio to play. Video plays fine (or rather it just plays too fast right now, as I haven't written any sync code yet).
First the audio is initialized with no errors, then the video, and then each packet is handled:
while (1) {
    // is this packet from the video or audio stream?
    if (packet.stream_index == player.v_id) {
        add_video_to_queue(&packet);
    } else if (packet.stream_index == player.a_id) {
        add_sound_to_queue(&packet);
    } else {
        av_free_packet(&packet);
    }
}
Then in add_sound_to_queue:
int add_sound_to_queue(AVPacket * packet) {
    AVFrame *decoded_frame = NULL;
    int done = AVCODEC_MAX_AUDIO_FRAME_SIZE;
    int got_frame = 0;
    if (!decoded_frame) {
        if (!(decoded_frame = avcodec_alloc_frame())) {
            printf("[ADD_SOUND_TO_QUEUE] Out of memory\n");
            return -1;
        }
    } else {
        avcodec_get_frame_defaults(decoded_frame);
    }
    if (avcodec_decode_audio4(player.av_acodecctx, decoded_frame, &got_frame, packet) < 0) {
        printf("[ADD_SOUND_TO_QUEUE] Error in decoding audio\n");
        av_free_packet(packet);
        //continue;
        return -1;
    }
    if (got_frame) {
        int data_size;
        if (packet->size > done) {
            data_size = done;
        } else {
            data_size = packet->size;
        }
        BYTE * snd = (BYTE *)malloc( data_size * sizeof(BYTE));
        XMemCpy(snd,
                AudioBytes,
                data_size * sizeof(BYTE)
        );
        XMemSet(&g_SoundBuffer, 0, sizeof(XAUDIO2_BUFFER));
        g_SoundBuffer.AudioBytes = data_size;
        g_SoundBuffer.pAudioData = snd;
        g_SoundBuffer.pContext = (VOID*)snd;
        XAUDIO2_VOICE_STATE state;
        while( g_pSourceVoice->GetState( &state ), state.BuffersQueued > 60 ) {
            WaitForSingleObject( XAudio2_Notifier.hBufferEndEvent, INFINITE );
        }
        g_pSourceVoice->SubmitSourceBuffer( &g_SoundBuffer );
    }
    return 0;
}
I can't seem to figure out the problem. I have added error messages in init, opening video, codec handling, etc. As mentioned before, the XAudio2 code works with an older ffmpeg, so maybe I have missed something with avcodec_decode_audio4?
If this snippet of code isn't enough I can post the whole code; these are just the places in the code where I think the problem would be :(
I don't see you accessing decoded_frame anywhere after decoding. How do you expect to get the data out otherwise?
BYTE * snd = (BYTE *)malloc( data_size * sizeof(BYTE));
This also looks very fishy, given that data_size is derived from the packet size. The packet size is the size of the compressed data, it has very little to do with the size of the decoded PCM frame.
The decoded data is located in decoded_frame->extended_data, which is an array of pointers to data planes; see the AVFrame documentation for details. The size of the decoded data is determined by decoded_frame->nb_samples. Note that with recent Libav versions, many decoders return planar audio, so different channels live in different data buffers. For many use cases you need to convert that to interleaved format, where there's just one buffer with all the channels. Use libavresample for that.
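For the interleaved case, a rough sketch of how the copy could look (untested; it reuses the names from the question and assumes a non-planar sample format, otherwise convert with libavresample first):
/* Hypothetical sketch: size the XAudio2 buffer from the decoded frame,
 * not from the compressed packet. */
int data_size = av_samples_get_buffer_size(NULL,
                                           player.av_acodecctx->channels,
                                           decoded_frame->nb_samples,
                                           (enum AVSampleFormat)decoded_frame->format,
                                           1);
BYTE *snd = (BYTE *)malloc(data_size);
memcpy(snd, decoded_frame->extended_data[0], data_size);  /* decoded PCM, not AudioBytes */
g_SoundBuffer.AudioBytes = data_size;
g_SoundBuffer.pAudioData = snd;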
