I have looked into tons of different resources on real-time audio processing, but not many fit my particular use case. I would like to add filters to microphone input and send the result, rather than to an output device, to something like a Discord, Zoom, or Skype call.
A basic example would be joining a Zoom call with an alien/robot voice.
Voice modulators that I aim to emulate, such as VoiceMod (not open source), create their own input source that you would select in Discord, but I have not seen this done anywhere else. I doubt this is something I could use the Web Audio API for. Is this something that requires a server in order to buffer, filter, and redirect audio?
In my application, I need to record a conversation between people, and there's no room in the physical workflow to take a 20-second sample of each person's voice to train the recognizer, nor to ask each person to read a canned passphrase for training. But without doing that, as far as I can tell, there's no way to get speaker identification.
Is there any way to just record, say, 5 people speaking and have the recognizer automatically classify returned text as belonging to one of the 5 distinct people, without previous training?
(For what it's worth, IBM Watson can do this, although it doesn't do it very accurately, in my testing.)
If I understand your question correctly, Conversation Transcription should be a solution for your scenario: if you don't generate user profiles beforehand, it will show the speakers as Speaker[x] and increment that label for each new speaker.
User voice samples are optional. Without this input, the transcription will still show the different speakers, but as "Speaker1", "Speaker2", etc., instead of recognizing them as pre-enrolled specific speakers.
You can get started with the real-time conversation transcription quickstart.
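For reference, a rough, untested sketch of that quickstart flow with the Speech SDK for Java is below. The class and method names (ConversationTranscriber, getSpeakerId, the placeholder key, region, and file name) are assumptions based on the quickstart and may differ between SDK versions, since the feature was in preview; check them against the version you install.

    import java.util.concurrent.Semaphore;

    import com.microsoft.cognitiveservices.speech.SpeechConfig;
    import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
    import com.microsoft.cognitiveservices.speech.transcription.ConversationTranscriber;

    public class DiarizationSketch {
        public static void main(String[] args) throws Exception {
            // Placeholders: substitute your own subscription key, region and recording.
            SpeechConfig config = SpeechConfig.fromSubscription("<subscription-key>", "<region>");
            AudioConfig audio = AudioConfig.fromWavFileInput("conversation.wav");

            ConversationTranscriber transcriber = new ConversationTranscriber(config, audio);
            Semaphore done = new Semaphore(0);

            // Without pre-enrolled voice profiles, each transcribed segment is tagged
            // with a generic speaker label ("Speaker1", "Speaker2", ...).
            transcriber.transcribed.addEventListener((sender, event) ->
                    System.out.println(event.getResult().getSpeakerId()
                            + ": " + event.getResult().getText()));
            transcriber.sessionStopped.addEventListener((sender, event) -> done.release());

            transcriber.startTranscribingAsync().get();
            done.acquire();
            transcriber.stopTranscribingAsync().get();
        }
    }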
Microsoft Conversation Transcription, which is in preview, currently targets microphone array devices, so the input should be recorded by a microphone array. If your recordings come from a common microphone, it may not work and you may need special configuration. You can also try batch diarization, which supports offline transcription and currently diarizes 2 speakers; support for more than 2 speakers should arrive very soon, probably this month.
I have built a loopstation in JSyn. It allows you to record and play back samples. By playing multiple samples you can layer up sounds (e.g. one percussion sample, one melody, etc.).
JSyn allows me to connect each of the sample players directly to my LineOut, where they are mixed automatically. Now I would like to record the sound, just as the user hears it, to a .wav file, but I am not sure what to connect the recorder's input port to.
What is the smartest way to connect the audio output of all samples to the WaveRecorder?
In other words: in the Programmer's Guide there is an example for this, but I am not sure how to create the "finalMix" used there.
Rather than using multiple LineOuts, just use one LineOut.
You can mix all of your signals together using a chain of MultiplyAdd units.
http://www.softsynth.com/jsyn/docs/javadocs/com/jsyn/unitgen/MultiplyAdd.html
Or you can use a Mixer unit.
http://www.softsynth.com/jsyn/docs/javadocs/com/jsyn/unitgen/MixerStereoRamped.html
Then connect the mix to your WaveRecorder and to your single LineOut.
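A minimal sketch of that wiring might look like the following. It assumes a stereo mixer with four inputs acting as the "finalMix" and leaves the sample-player connections as comments, since those depend on the unit names in your own loopstation:

    import java.io.File;

    import com.jsyn.JSyn;
    import com.jsyn.Synthesizer;
    import com.jsyn.unitgen.LineOut;
    import com.jsyn.unitgen.MixerStereo;
    import com.jsyn.util.WaveRecorder;

    public class LoopMixSketch {
        public static void main(String[] args) throws Exception {
            Synthesizer synth = JSyn.createSynthesizer();

            // One mixer acts as the "finalMix"; size it to the number of sample players.
            MixerStereo finalMix = new MixerStereo(4);
            LineOut lineOut = new LineOut();
            WaveRecorder recorder = new WaveRecorder(synth, new File("session.wav"));

            synth.add(finalMix);
            synth.add(lineOut);

            // Connect each sample player to its own mixer input instead of to the LineOut,
            // e.g. for player i: player.output.connect(0, finalMix.input, i);

            // The mixed signal feeds both the speakers and the recorder.
            finalMix.output.connect(0, lineOut.input, 0);
            finalMix.output.connect(1, lineOut.input, 1);
            finalMix.output.connect(0, recorder.getInput(), 0);
            // Connect output part 1 to the recorder as well if it was created with two channels.

            synth.start();
            recorder.start();
            lineOut.start();

            // ... queue and play samples here ...

            recorder.stop();
            recorder.close();
            synth.stop();
        }
    }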
I'm doing research for a project that's about to start.
We will be supplied with hundreds of 30-second video files that the end user can select (via various filters); we then want to play them back as if they were one video.
It seems that Media Source Extensions with MPEG-DASH is the way to go.
I feel like it could possibly be solved in the following way, but I'd like to ask anyone who has done similar things whether this sounds right.
My theory:
Create an MPD for each video (via MP4Box or a similar tool)
The user makes selections (each of which has an MPD)
Read each MPD and get its <Period> elements (most likely only one in each)
Create a new MPD file and insert all the <Period> elements into it in order (a rough sketch of this merge step follows below).
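To make the merge step concrete, here is a rough, untested sketch using the standard Java DOM API. It assumes each selected manifest is a local file and that the <Period> elements can simply be appended in order; timing attributes would still need to be fixed up afterwards:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class MpdMerge {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();

            // The first selected manifest becomes the skeleton of the combined MPD.
            Document combined = builder.parse(new File(args[0]));

            // Append the <Period> elements of every other manifest, in selection order.
            for (int i = 1; i < args.length; i++) {
                Document source = builder.parse(new File(args[i]));
                NodeList periods = source.getElementsByTagNameNS("*", "Period");
                for (int p = 0; p < periods.getLength(); p++) {
                    // importNode copies the Period (with its AdaptationSets) into the target document.
                    combined.getDocumentElement()
                            .appendChild(combined.importNode(periods.item(p), true));
                }
            }

            // Each Period would still need a correct start attribute, and
            // mediaPresentationDuration on the root should be updated to the new total.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.transform(new DOMSource(combined), new StreamResult(new File("combined.mpd")));
        }
    }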
Caveats
I imagine this may be problematic if the videos were all different sizes, formats, etc., but in this case we can assume consistency.
So my question to anyone with MPEG-DASH / MPD experience: does this sound right, or is there a better way to achieve this?
Sounds right; multi-period is the only feasible way, in my opinion.
Ideally you would encode all the videos with the same settings to give the end user a consistent experience. However, from a technical point of view it shouldn't be a problem if the quality or even the aspect ratio changes from one period to another. You'll need a player that supports multi-period playback, such as dash.js or Bitmovin.
I'm using DirectShow filters and getting a clicking sound after a few minutes of streaming; it sounds like a mouse click. If I do not use a reference clock, the issue goes away, but then audio-video sync does not work and lip syncing isn't correct.
When using VLC player, it works fine.
Update:
Thanks for your quick reply. I changed the implementation of the source filter, but still no success.
Previously the graph was prepared as below:
Push Source -> ACM Wrapper -> DC-DSP Filter (Amplify filter) -> Render
I checked using GraphEdit that the DC-DSP filter can be placed before the decoder, so I changed the graph as below:
Push Source -> DC-DSP Filter (Amplify filter) -> ACM Wrapper -> Render
I checked the timestamps; audio and video are working and lip sync is OK.
Is there any way to change the priority of audio in a DirectShow filter, so that in case of any delay DirectShow does not drop audio? In my case I think the filter is dropping audio, not video; fixing that may help resolve this issue.
Synchronization is achieved by proper time stamping of payload data. There is no RTSP streaming in the stock filters, so you are using some third-party filter, which presumably has the time-stamping issue.
To add to this, there is a "rate matching" issue when the rate of the data source does not match the clock of your audio renderer. There is an attempt to compensate for it, but once again it matters exactly how the source filter implements this.
I am writing my own MIDI parser and everything seems to be going nicely.
I am testing against some of the files I see in the wild. I noticed that a MIDI track never appears to have more than one note on at once (i.e. never produces more than one tone at a time). Is this by design? Can a MIDI track require more than one note to play at once?
(I am not referring to the number of simultaneous tracks, I am referring to the number of tones in a single track.)
The MIDI files I have tested look like this:
ON_NOTE71:ON_NOTE75:ON_NOTE79
ON_NOTE71:OFF_NOTE71:ON_NOTE75:OFF_NOTE75:ON_NOTE79:OFF_NOTE79
Can it look like this?
ON_NOTE71:ON_NOTE73:OFF_NOTE73:OFF_NOTE71
How do I detect this alternative structure?
Yes. Playing more than one note at once is known as polyphony. Different MIDI specifications define support for different levels of polyphony.
See http://www.midi.org/techspecs/gm.php
The number of notes that can play at once is a hardware implementation detail. Your software should allow for any number of notes to be playing simultaneously. I suggest keeping a table of which notes are currently on so that you can send a note off for each one when playback is stopped. Ideally the table should hold a count for each note that is increased on a note on and decreased on a note off. That way, if a certain pitch has two note on events pending, you can send two note off events. You can't know how the device you're communicating with will handle successive note on events for the same pitch, so it's safest to send an equal number of note off events.
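As one way to realize that suggestion, here is a small sketch of such a table; the NoteOffSender interface is just a hypothetical stand-in for whatever output your parser drives:

    import java.util.HashMap;
    import java.util.Map;

    public class ActiveNoteTable {
        // Pending note-on count per (channel, pitch); the key packs both values.
        private final Map<Integer, Integer> activeCounts = new HashMap<>();

        private static int key(int channel, int note) {
            return (channel << 8) | note;
        }

        public void noteOn(int channel, int note) {
            activeCounts.merge(key(channel, note), 1, Integer::sum);
        }

        public void noteOff(int channel, int note) {
            // Decrement, removing the entry when the count reaches zero.
            activeCounts.computeIfPresent(key(channel, note), (k, v) -> v > 1 ? v - 1 : null);
        }

        // On stop, emit one note-off per pending note-on so nothing is left hanging.
        public void allNotesOff(NoteOffSender sender) {
            activeCounts.forEach((k, count) -> {
                int channel = k >> 8;
                int note = k & 0xFF;
                for (int i = 0; i < count; i++) {
                    sender.sendNoteOff(channel, note);
                }
            });
            activeCounts.clear();
        }

        // Hypothetical callback for whatever MIDI output the parser drives.
        public interface NoteOffSender {
            void sendNoteOff(int channel, int note);
        }
    }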
Yes. Both controllers and software can produce such events.