I stream audio from the microphone of an Android client to a Node.js server (built by me), which forwards the audio to a DialogFlow agent.
The streaming code of the Node.js server is based on this snippet: https://cloud.google.com/dialogflow/docs/detect-intent-stream#streaming_detect_intent
The Node.js server first receives the intermediate results (with the words spotted by the automatic speech recognition, ASR) and then the final result from DialogFlow (with the NLU analysis). The connection between the Android client and the Node.js server uses the websocket protocol.
The problem I am facing is that when I stream the audio of very short utterances (in my case the Italian words "sì"/"no"), the final result from DialogFlow sometimes arrives with a delay close to 10 seconds.
I measure this delay from the moment the first partial ASR result arrives to the moment the final DialogFlow result arrives.
So my experiments are carried out like this:
1. I open the microphone and start speaking (I just say "sì" or "no").
2. The first intermediate result arrives (the timer starts).
3. The final result from DialogFlow arrives (the timer stops).
The intermediate results of the speech recognition arrive almost immediately, that is, with very low delay.
As for the delay of the final DialogFlow result: about half the time it arrives within 4 seconds, which is acceptable; otherwise it takes longer (in the worst case as much as 10 or 11 seconds). In the latter case the delay is not acceptable, as it makes the user experience far too slow.
Again, I must stress that this problem occurs only with extremely short sentences. With sentences made of words of more than just a couple of syllables, the delay of DialogFlow is always negligible, and everything is fine.
I also tried the English models, but sentences consisting of "yes" and "no" suffer from the same delay. I also ran some tests with models other than the default one (in particular the command_and_search model, which is described as "Best for short queries such as voice commands or voice search"), but without luck.
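For context, my streaming code is essentially the linked snippet reduced to the parts that matter here (using the @google-cloud/dialogflow Node.js client; credentials, projectId/sessionId, and the websocket plumbing are omitted):

```javascript
const dialogflow = require('@google-cloud/dialogflow');

const sessionClient = new dialogflow.SessionsClient();
const sessionPath = sessionClient.projectAgentSessionPath(projectId, sessionId);

const stream = sessionClient.streamingDetectIntent()
  .on('data', (data) => {
    if (data.recognitionResult) {
      // Intermediate ASR result: arrives almost immediately.
      // The timer starts on the first of these.
      console.log(`ASR: ${data.recognitionResult.transcript}`);
    } else if (data.queryResult) {
      // Final DialogFlow result with the NLU analysis:
      // this is the one that is sometimes ~10 s late. The timer stops here.
      console.log(`Intent: ${data.queryResult.intent.displayName}`);
    }
  });

// The first message carries the audio configuration.
stream.write({
  session: sessionPath,
  queryInput: {
    audioConfig: {
      audioEncoding: 'AUDIO_ENCODING_LINEAR_16',
      sampleRateHertz: 16000,
      languageCode: 'it-IT',        // also tested 'en-US'
      model: 'command_and_search',  // also tested the default model
    },
  },
});

// Then each audio chunk received over the websocket is forwarded:
// stream.write({ inputAudio: chunk });
```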
I understand that the use of such short sentences as "yes" and "no" is
not the most frequent use case for a DialogFlow agent, however I
believe that in certain circumstances it is useful (sometimes
necessary) to use them.
So I ask if anyone has experienced this problem and knows how to overcome
it.
Related
I am making a bot on Dialogflow with a fulfillment. Given DialogFlow's strict 5-second window for the webhook response, I am getting [empty response] back.
I want to overcome this issue, but my web service needs more than 9 seconds to execute.
I am considering redesigning the conversation flow so that we keep streaming audio until the response has been processed.
Example:
User Question: xx xxx xxx xxxx xxxxx?
Response: a) we play fixed audio to keep the user engaged for a few seconds until the backend finds the response text; b) we receive the answer from the web service and save it in the session to present afterwards.
How can I achieve this and how can I handle the Timeout issue?
You're on the right track, but there are a number of other things to consider.
First, however, keep in mind that anything that is trying to "avoid" the 5 second timeout already indicates some issues with the design. Waiting 10 seconds for a reply is a pretty long time with something as interactive as voice! Even 5 seconds, which is the timeout, is a long time. (And there is no way to change this timeout.)
So the first thing you may want to do is consider if there is a better/faster way to do what you want.
If not, the rough approach would be something like this:
1. Get the request from the user.
2. Track a unique identifier, either tied to the user or tied to the session. You'll be using this as a key into some kind of database or data store.
3. Start the API call as part of an asynchronous request or in another thread.
4. Reply immediately that you're working on it, in a way that will lead the user to send another request. (See below for this issue.) You'll want to make sure that the ID is maintained as part of this session - so you'll need to save it as part of the Session data.

At this point, you're basically doing two things in parallel. When the API call completes, it needs to save the result in the datastore against the identifier. (It can't save it in the session itself - that response was already sent back to the Assistant.) Meanwhile, you're waiting for the next request from the user. When it comes in:

5. Check to see if you have a response saved for this session yet.
6. If not, go back to step 4. (You may want to track how many times you get here and give up at some point.)
7. If you do have the result, reply to the user with the information.
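A minimal sketch of that flow in a Node.js webhook - the `db` helper and `callSlowApi` are hypothetical stand-ins for whatever datastore and backend you use:

```javascript
const db = require('./datastore');        // hypothetical key-value helper
const { callSlowApi } = require('./api'); // hypothetical slow backend call

async function handleRequest(req, res) {
  const sessionId = req.body.session;     // step 2: the unique identifier
  const saved = await db.get(sessionId);

  if (saved) {
    // Step 7: the result arrived while the user was waiting.
    return res.json({ fulfillmentText: saved });
  }

  if (!(await db.get(sessionId + ':pending'))) {
    // Step 3: fire off the slow call without awaiting it; when it
    // completes, it saves its result against the identifier.
    await db.set(sessionId + ':pending', true);
    callSlowApi(req.body).then((result) => db.set(sessionId, result));
  }

  // Steps 4-6: reply immediately in a way that prompts another request
  // (hold music, a follow-up question, ...), then check again next time.
  res.json({ fulfillmentText: "Still working on it - one moment..." });
}
```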
There is an issue with how you reply in step 4, since you want to do something that will guarantee you another request from the person expecting an answer. There are a few possible approaches:
The most straightforward way would be to send back a Media response to play a few seconds of "hold music". This has the advantage that, when the music stops, it will send an event to Dialogflow which you can capture as an Intent and then continue with step 5.
But there are some problems:
Not all versions of the Assistant support the Media response. You will need to check to confirm the feature is supported before you use it and, if not, use another approach (see below).
The media player that is presented on some Assistants allows the user to stop playback, or in some situations will not correctly send an event when the audio stops. So you may never get another request in this session.
Another approach involves some more advanced conversation design tricks, so may not always be suitable for your conversation. Your response can say that you're looking up the results but then ask the user a question - possibly one that is related to other information that you will need. With their reply, you can collect this information (if you need it) and then see if you have a result yet.
In some conversations - this works really well. For example, if you're looking up flights to somewhere, while you're looking that up you might ask them if they will need a hotel or rental car, which you might ask about anyway.
Other conversations, however, don't easily have such questions. In these cases, you may need to ask something that isn't relevant while you stall for time.
I am creating a chatbot which has an intent with a payment link. On triggering this intent, my webhook fulfillment calls a third-party API which takes approximately 20 seconds to respond. But in that time my response times out, as Google limits it to 5 seconds.
Can you please suggest what approach I should follow? I just want to wait approximately 20 seconds before responding.
Thanks.
One option is to keep the conversation alive using events (generated by the webhook) which trigger dedicated intents.
When a payment must be performed, the webhook starts a background process to deal with the third-party payment API and sleeps for 4-5 seconds, after which it generates an event (setFollowupEvent PAYMENT_IN_PROGRESS). This event is associated with a DialogFlow intent which fires as soon as the event is sent back to the platform.
At this point you have another incoming webhook call: check the status of the payment; if it is still in progress (likely after 5 seconds), sleep another 4-5 seconds and send another event (setFollowupEvent PAYMENT_IN_PROGRESS_2), which produces the same workflow.
There is a limit to how many times you can do this (I think a max of 3), so you need to cater for the case where the payment does not terminate in time (a fallback scenario).
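As a rough sketch of what each round looks like (Dialogflow ES webhook JSON; `getPaymentStatus` and the Express plumbing are just assumptions):

```javascript
// Hypothetical Express webhook: keep the conversation alive with
// followup events while the payment completes in the background.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

app.post('/webhook', async (req, res) => {
  const status = await getPaymentStatus(req.body.session); // hypothetical

  if (status === 'in_progress') {
    await sleep(4000); // stay under the 5-second webhook deadline
    // Fires the intent whose "Events" list contains this event name,
    // which triggers the next webhook call and repeats the check.
    return res.json({
      followupEventInput: { name: 'PAYMENT_IN_PROGRESS', languageCode: 'en-US' },
    });
  }

  res.json({ fulfillmentText: 'Your payment went through.' });
});
```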
A smart option could be to keep engaging the user with the conversation, not always easy, depending on what your chatbot is about.
Hope it helps.
The short answer is that you can't.
The longer answer is that you need to think about this as a conversation. If you asked someone a question, and didn't get any response from them for 20 seconds - that would be pretty uncomfortable, wouldn't it?
Instead, we have come up with ways to compensate for that silence. In a physical conversation, people may engage with you and ask you questions to fill the time. If you're on the phone, they may play hold music. Or we may end the conversation for now and tell them later when there is a result.
When building an Action, we have similar parallels that may work better or worse based on our exact needs.
Engaging in conversation
One approach is that when we get the request from the user, we do two things:
1. Start a task that will execute the query and save the results in a separate "answer database", indexed against the user, a session ID, or some other temporary ID we can generate and use later.
2. While the query is running, reply to the user saying we're working on it, and ask them another question.
Then, when they reply with their answer to this other question, we can check whether we have an answer for them in the database. If we do, we reply with it. If not, we repeat step 2 until we do.
This approach works well if we either have other questions to ask, or if we're in a good place to "make small talk". Picture booking an airline reservation - while we look up flights, we may want to ask if they prefer window or aisle seats (which we'd need to ask eventually anyway), or make small talk by asking if they're traveling for business or leisure.
Using "hold music"
A variant of this allows us to play some hold music while we're processing the answer.
Instead of asking a question in step 2 above, we reply with a Media Response that plays 20 seconds or so of music. When the music completes, our fulfillment webhook will be sent a MEDIA_STATUS event and we can either return the information from the answer database, or say we're still working on it and play more music.
This is less conversational, but may work better if we don't actually have anything to say in the meantime.
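A rough sketch of that loop using the actions-on-google Node.js client library (the intent name, hold-music URL, and `answerDb` helper are placeholders):

```javascript
const { dialogflow, MediaObject, Suggestions } = require('actions-on-google');
const app = dialogflow();

// Dialogflow intent wired to the actions_intent_MEDIA_STATUS event,
// which fires when the hold music finishes playing.
app.intent('media status', async (conv) => {
  const answer = await answerDb.get(conv.id); // hypothetical answer database
  if (answer) {
    return conv.close(answer);
  }
  // Still no result: say so and queue up more hold music.
  conv.ask('Still looking that up for you.');
  conv.ask(new MediaObject({
    name: 'Hold music',
    url: 'https://example.com/hold-music.mp3', // placeholder URL
  }));
  conv.ask(new Suggestions('Cancel')); // chips are required with media on screens
});
```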
Sending a notification
If the response may take a very long time, then it may just be best to let them do other things and to send a notification or a text or email when you have a result. These cases, however, require the user to have registered with you in some way and are probably more appropriate if you have a long-standing relationship with the user.
Summary
You should be returning results as quickly as you can to keep it feeling like a conversation. When you can't, consider other means, just like how we would consider what it would be like if we were talking to another person.
I have a use case where a mobile app records a long series of commands. Each command is a short, single word (or number). They can happen quickly one right after the other, but the use case does not care if it takes several seconds to get results back from the Cognitive server. It is currently being implemented as discrete asynchronous requests rather than streaming (seems to be more reliable for us).
Since results are coming back async, I see no easy way to map the result back to its corresponding request (and ultimately the app command). Can I embed a unique ID somewhere that will get passed back to me? Is there some other option?
Are you using the SDK?
If you do recognizeOnce, you get the result for that audio as the return value of the call (synchronously).
If you do continuous recognition, there is currently no way to tag the audio segment.
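Since you are making discrete requests anyway, you can carry your own ID through the closure around each recognizeOnce call; a sketch with the JavaScript Speech SDK (the `commandId` plumbing is your own, not part of the SDK):

```javascript
const sdk = require('microsoft-cognitiveservices-speech-sdk');
const fs = require('fs');

const speechConfig = sdk.SpeechConfig.fromSubscription('<key>', '<region>');

// commandId is our own correlation token - the SDK never sees it; the
// closure carries it back alongside the recognition result.
function recognizeCommand(commandId, wavPath) {
  const audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync(wavPath));
  const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

  return new Promise((resolve, reject) => {
    recognizer.recognizeOnceAsync(
      (result) => { recognizer.close(); resolve({ commandId, text: result.text }); },
      (err) => { recognizer.close(); reject(err); }
    );
  });
}

// Each short command gets its own request, so results can't get crossed:
// recognizeCommand('cmd-42', 'yes.wav').then(({ commandId, text }) => ...);
```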
I have built a standalone app version of a project that until now was just a VST/audiounit. I am providing audio support via rtaudio.
I would like to add MIDI support using rtmidi but it's not clear to me how to synchronise the audio and MIDI parts.
In VST/audiounit land, I am used to MIDI events that have a timestamp indicating their offset in samples from the start of the audio block.
rtmidi provides a delta time in seconds since the previous event, but I am not sure how I should grab those events and how I can work out their time in relation to the current sample in the audio thread.
How do plugin hosts do this?
I can understand how events can be sample accurate on playback, but it's not clear how they could be sample accurate when using realtime input.
rtaudio gives me a callback function. I will run at a low block size (32 samples). I guess I will pass a pointer to an rtmidi instance as the userData argument of the callback and then call midiin->getMessage( &message ); inside the audio callback, but I am not sure whether that is sensible from a threading point of view.
Many thanks for any tips you can give me
In your case, you don't need to worry about it. Your program should send the MIDI events to the plugin with a timestamp of zero as soon as they arrive. I think you have perhaps misunderstood the idea behind what it means to be "sample accurate".
As @Brad noted in his comment to your question, MIDI is indeed very slow. But that's only part of the problem... when you are working in a block-based environment, incoming MIDI events cannot be processed by the plugin until the start of a block. When computers were slower and block sizes of 512 (or god forbid, >1024) were common, this introduced a non-trivial amount of latency which made the arrangement sound less "tight". Therefore sequencers came up with a clever way to get around this problem. Since the MIDI events are already known ahead of time, they can be sent to the instrument one block early with an offset in sample frames. The plugin then receives these events at the start of the block, and knows not to start actually processing them until N samples have passed. This is what "sample accurate" means in sequencers.
However, if you are dealing with live input from a keyboard or some sort of other MIDI device, there is no way to "schedule" these events. In fact, by the time you receive them, the clock is already ticking! Therefore these events should just be sent to the plugin at the start of the very next block with an offset of 0. Sequencers such as Ableton Live, which allow a plugin to simultaneously receive both pre-sequenced and live events, simply send any live events with an offset of 0 frames.
Since you are using a very small block size, the worst-case scenario is a latency of about 0.7 ms (32 samples at a 44.1 kHz sample rate), which isn't too bad at all. In the case of rtmidi, the timestamp does not represent an offset which you need to schedule around, but rather the time at which the event was captured. But since you only intend to receive live events (you aren't writing a sequencer, are you?), you can simply pass any incoming MIDI to the plugin right away.
I'm writing an app that uses SMS for communication.
I have chosen to subscribe to an sms-gateway, which provides me with an API for doing so.
The API has functions for sending as well as pulling new messages. It does not, however, have any kind of push functionality.
In order to make my queries as efficient as possible, I'm seeking data on how long people wait before they answer a text message - as a probability distribution.
Extra info:
The application is interactive (as can be), so I suppose the times will be pretty similar to real life human-human communication.
I don't believe differences in personal style will have a big impact on the right times and frequencies to query, so average data should be fine.
Update
I'm impressed and honored by the many great answers received. I have concluded that my best shot will be a few adaptable heuristics, including exponential (or maybe polynomial) backoff.
All along I will be gathering statistics for later analysis; maybe something will show up. I think I will get a head start on the algorithm for generating poll frequencies from a probability distribution. That'll be fun.
Many thanks again.
In the absence of any real data, the best solution may be to write the code so that the application adjusts the wait time based on current history of response times.
Basic Idea as follows:
Step 1: Set an initial frequency of pulling once every x seconds.
Step 2: Pull messages at the above frequency for a duration y.
Step 3: If you discover that messages are always waiting for you when you pull, decrease x; otherwise increase x.
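A sketch of that basic loop (the gateway API and the thresholds are placeholders):

```javascript
// Hypothetical adaptive poller: shrink the interval when messages were
// waiting for us, grow it when a pull comes back empty.
const MIN_MS = 2000, MAX_MS = 60000;
let intervalMs = 10000; // the initial "x"

async function pollLoop(gateway) {
  for (;;) {
    const messages = await gateway.pullMessages(); // hypothetical gateway API
    if (messages.length > 0) {
      handleMessages(messages);                        // hypothetical
      intervalMs = Math.max(MIN_MS, intervalMs / 2);   // step 3: decrease x
    } else {
      intervalMs = Math.min(MAX_MS, intervalMs * 1.5); // step 3: increase x
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```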
Several design considerations:
Adjust forever or stop after some time
You can repeat steps 2 and 3 forever, in which case the application dynamically adjusts itself to SMS patterns. Alternatively, you can stop after some time to reduce application overhead.
Adjustment criteria: per customer or across all customers
You can choose to do the adjustment in step 3 on a per-customer basis or across all customers.
I believe Gmail's SMTP service works along the same lines.
Well, I would suggest finding some statistics on daily SMS/text messaging usage by geographic location and age group and coming up with a daily average. It won't be an exact measurement for everyone, though.
Good question.
Consider that people might have multiple tasks, and that answering a text message is one of those tasks. If each of those tasks takes an amount of time that is exponentially distributed, the time to get around to answering the text message is the sum of those task completion times, and the sum of n iid exponential random variables has a Gamma distribution.
The number of tasks ahead of the text reply also has a discrete distribution - let's say it's Poisson. I don't have the time to derive the resulting distribution, but simulating it using @RISK, I get either a Weibull or Gamma distribution.
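For reference, the standard result behind the first step (assuming the task times are iid Exponential(λ)):

```latex
% n iid exponential task times sum to a Gamma random variable:
%   X_i \sim \mathrm{Exp}(\lambda), \quad T = X_1 + \dots + X_n
%   \;\Rightarrow\; T \sim \mathrm{Gamma}(n, \lambda)
f_T(t) = \frac{\lambda^{n}\, t^{\,n-1}\, e^{-\lambda t}}{\Gamma(n)}, \qquad t > 0
```

Letting n itself be Poisson-distributed turns T into a compound distribution with no simple closed form, which is why simulating it (as with the @RISK runs above) is the practical route.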
SMS is a store-and-forward messaging service, so you have to add in the delay that can be introduced by the various SMSCs (Short Message Service Centers) along the way. If you are connecting to one of the big aggregation houses (Sybase, TNS, mBlox, etc.) or commercial bulk SMS providers (Clickatell, etc.), then you need to allow for the message to traverse their network as well as the carrier's network. If you are using a smaller shop, most likely they are using a GSM modem (or modems), and there is a throughput limit on the messages they can receive and process (as well as push out).
All that said, if you are using a direct connection or one of the big guys, MO (mobile-originated) messages coming to you as a CP (content provider) should take less than 5 seconds. Add to that the time it takes the mobile subscriber to reply.
From anecdotal evidence from services I've worked on before, where the mobile subscriber needs to provide a simple reply, it usually comes within 10 seconds or not at all.
If you are polling for specific replies, I would poll at 5 and 10 seconds and then apply an exponential backoff.
All of this is from a North American point of view. Europe will be fairly close, but places like Africa and Asia will be a bit slower, as the networks there are slower (unless you are connected directly to the operator, and even then some of them are slow).
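A sketch of that 5 s / 10 s / exponential-backoff schedule (`checkForReply` is a stand-in for the gateway's pull API; the two-minute cutoff is arbitrary):

```javascript
// Poll at 5 s and 10 s after sending, then back off exponentially.
async function waitForReply(messageId, checkForReply) {
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  let delay = 5000;   // first two polls land at t = 5 s and t = 10 s
  let elapsed = 0;

  while (elapsed < 120000) { // give up after ~2 minutes
    await sleep(delay);
    elapsed += delay;
    const reply = await checkForReply(messageId);
    if (reply) return reply;
    if (elapsed >= 10000) delay *= 2; // exponential backoff from here on
  }
  return null; // "within 10 seconds or not at all"
}
```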