How to test traffic/media quality for any H.264 call
Including which parameters to test or look into for the media-quality-based tests.
VoIP quality is largely subjective. Many contributing factors, such as packet loss, jitter, and delay, determine the quality of your service. In any case, QoS measurement methods can be divided into two broad categories:
Subjective methods like MOS
Objective methods like R-Factor and E-Model
If you want to read more about these topics, you can refer to a book such as Speech Quality of VoIP: Assessment and Prediction.
Also, if you want a quick assessment, there are some tools out there that can help you. For example, Homer can calculate an objective call-quality score from RTCP data. VoIPmonitor is also a great tool.
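If you want a rough feel for what such tools are computing, here is a small sketch of the simplified E-model: an approximate R-factor from one-way delay and packet loss, followed by the standard R-to-MOS mapping. The constants are the commonly quoted simplified defaults, not a full ITU-T G.107 implementation:

    # Rough sketch only: simplified R-factor and the standard R -> MOS mapping.
    def r_factor(delay_ms, loss_pct, codec_ie=0.0, bpl=25.1):
        # Delay impairment Id (simplified) and effective equipment impairment Ie-eff
        id_ = 0.024 * delay_ms + 0.11 * (delay_ms - 177.3) * (delay_ms > 177.3)
        ie_eff = codec_ie + (95 - codec_ie) * loss_pct / (loss_pct + bpl)
        return 93.2 - id_ - ie_eff

    def mos_from_r(r):
        # E-model mapping from R-factor to an estimated MOS (1.0 .. 4.5)
        if r < 0:
            return 1.0
        if r > 100:
            return 4.5
        return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

    r = r_factor(delay_ms=150, loss_pct=1.0)   # e.g. 150 ms one-way delay, 1% loss
    print(round(r, 1), "->", round(mos_from_r(r), 2))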
Related
I am using the command line tool aubiopitch to analyze voice recordings. My goal is to determine the fundamental frequency of the voice recorded. I know, of course, that the frequency varies – that's why I want to calculate an "average" in Hz over a 30-second recording.
My question: aubio uses different methods to determine the pitch of a recording: Schmitt trigger, harmonic comb, yin, yinfft, etc. Which one of those would be my preferred choice when dealing with pure human voice recordings (no background music, atmosphere, etc.)?
I would recommend using yinfast or yinfft (default). For a discussion of the algorithms, their parameters, and their performance, see Chapter 3 of this document.
Note that the median is better suited than the average in this case.
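If it helps, here is a small sketch of doing this with aubio's Python bindings rather than the command-line tool; the file name and the confidence threshold are just placeholders:

    import numpy as np
    import aubio

    win_s, hop_s = 2048, 512
    src = aubio.source("voice.wav", 0, hop_s)      # 0 = use the file's own sample rate
    samplerate = src.samplerate

    pitch_o = aubio.pitch("yinfft", win_s, hop_s, samplerate)
    pitch_o.set_unit("Hz")
    pitch_o.set_tolerance(0.8)

    pitches = []
    while True:
        samples, read = src()
        f0 = pitch_o(samples)[0]
        # keep only voiced frames the detector is reasonably confident about
        if f0 > 0 and pitch_o.get_confidence() > 0.8:
            pitches.append(f0)
        if read < hop_s:
            break

    print("median f0: %.1f Hz" % np.median(pitches))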
CREPE is good and outperforms many others, since it uses a neural network for pitch prediction. It might be unstable in unseen conditions, though, and might not be very easy to plug in, since it requires TensorFlow.
For a more traditional and lightweight solution, you can try REAPER.
I'm a beginner with TensorFlow and Python, and I'm trying to build an app that automatically detects some key moments (yellow/red cards, goals, etc.) in a football (soccer) match.
I'm starting to understand how to do video analysis by training the program on a dataset I built myself, downloading images from the web and tagging them. To improve the analysis, I was wondering if anyone had suggestions for tutorials on how to also train my app on audio files, so that the program can detect pitch variations in the video's audio and combine video and audio analysis for better results.
Thank you in advance
Since you are new to Python and to TensorFlow, I recommend you focus on just audio for now, especially since it's a strong indicator of events of importance in a football match (red/yellow cards, nasty fouls, goals, strong chances, good plays, etc.).
Very simply, without using much ML at all, you can use the average volume of a time period to infer significance. If you want to get a little more sophisticated, you can consider speech-to-text libraries to look for keywords in commentator speech.
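As a very rough sketch of the "average volume" idea (the file name and window length are placeholders; crowd and commentator volume usually spikes around key moments):

    import numpy as np
    from scipy.io import wavfile

    rate, samples = wavfile.read("match_audio.wav")   # audio extracted from the video
    if samples.ndim > 1:
        samples = samples.mean(axis=1)                # mix down to mono
    samples = samples.astype(np.float64)

    window = 5 * rate                                 # 5-second windows
    n = len(samples) // window
    rms = np.array([np.sqrt(np.mean(samples[i*window:(i+1)*window] ** 2))
                    for i in range(n)])

    # flag windows that are much louder than the match average
    threshold = rms.mean() + 2 * rms.std()
    for i, level in enumerate(rms):
        if level > threshold:
            print("possible key moment around %d:%02d" % (i * 5 // 60, i * 5 % 60))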
Using video to try to determine when something important is happening is much, much more challenging.
This page can help you get started with audio signal processing in Python.
https://bastibe.de/2012-11-02-real-time-signal-processing-in-python.html
I am currently conducting research on American pop songs using Spotify's audio features (e.g., danceability, tempo, and valence...). But I couldn't find any documentation with details about how the features are measured. I know there's a brief description of the features, but it doesn't say anything about the exact measurement. Could you let me know where I can find it?
Thanks.
The Echo Nest was a music data analysis platform acquired by Spotify, and its technology currently powers Spotify's recommendation tools.
The Audio Features API endpoint returns a more "high level" analysis of songs, whereas the Audio Analysis endpoint returns more "low level", technical data.
Essentially, "high level" features are more explicit and use clearer semantics (plain English) so that they are easily understood by a layperson ("danceability", for instance), but it all comes from the low-level analysis, really.
Here you have some documentation, if you wish to dive deeper into the matter:
http://docs.echonest.com.s3-website-us-east-1.amazonaws.com/_static/AnalyzeDocumentation.pdf
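Both endpoints are reachable through the Web API. A minimal sketch with the third-party spotipy client (you would need your own client ID and secret; the track ID below is just an example) might look like this:

    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
        client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

    track_id = "3n3Ppam7vgaVa1iaRUc9Lp"               # example track ID

    # "High level" descriptors: danceability, valence, tempo, ...
    features = sp.audio_features([track_id])[0]
    print(features["danceability"], features["valence"], features["tempo"])

    # "Low level" technical data: bars, beats, segments, timbre vectors, ...
    analysis = sp.audio_analysis(track_id)
    print(len(analysis["segments"]), "segments")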
Assuming each participant agrees to the recording and transcription of the Skype call, is there a way to transcribe the meeting (either live, offline, or both) so that it produces a text transcript where each utterance is correctly attributed to its speaker? The transcript could then be fed to any variety of search or NLP algorithms.
The top 3 Google search hits of "automatically transcribe Skype" refer to apps which make manual transcription easier:
(1) http://www.dummies.com/how-to/content/how-to-convert-skype-audio-to-text-with-transcribe.html
(2) http://ask.metafilter.com/231400/How-to-record-and-transcribe-Skype-conversation
(3) https://www.ttetranscripts.com/blog/how-to-record-and-transcribe-your-skype-conversations
While it would be trivial to record the audio and send it to a speech-to-text engine, I doubt the result would be very high quality, because the best results usually come from speaker-dependent models (otherwise we wouldn't have to take time to train Dragon NaturallySpeaking).
But before we can apply speaker-dependent transcription models, we need to know which segment of the audio belongs to which speaker. There are two ways this is solved:
There is an easy way to retrieve all the audio that came from each participant, e.g. you just record all the audio from each speaker's microphone during the call, and you don't have to do any segmentation.
In case the first option isn't feasible or is prohibitive in some way, we have to use a speaker diarization algorithm, which segments the audio into N clusters/speakers (most algorithms can be told how many speakers are in the audio, but some can figure this out on their own). For a real-time transcript as the call goes on, I imagine we'd need some fancy real-time speaker diarization algorithm.
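For illustration, a rough sketch of that second option using the third-party pyannote.audio library (the pretrained pipeline name, access token, and file name are placeholders) could look like this:

    from pyannote.audio import Pipeline

    # placeholder model name and token; a pretrained diarization pipeline is assumed
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                        use_auth_token="HF_TOKEN")

    diarization = pipeline("skype_call.wav")          # the recorded call

    # each track is a time range with an anonymous label (SPEAKER_00, SPEAKER_01, ...)
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print("%6.1fs - %6.1fs  %s" % (turn.start, turn.end, speaker))
        # each segment could then be fed to that speaker's speech-to-text model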
In any case, once the segmentation is solved, each participant has their trained speaker model, which is then applied to their portions of the audio. At the end of the day, everyone gets a nice conversation transcript, and later on we can do fancy things like topic analysis, or maybe Big Brother wants to sift through everyone's project meetings without having to listen to hours of audio.
My question is, what would be a way to implement this in practice?
I have recently started going through sound card drivers in Linux (ALSA).
Can a link or reference be suggested where I can learn the basics of audio, like:
Sampling rate, bit size, etc.
I want to know exactly how samples are stored in audio files on a computer, and the reverse of this: how samples (numbers) are played back.
The Audacity tutorial is a good place to start, and there is another introduction that covers similar ground. The PureData tutorial at flossmanuals is also a good starting point. Wikipedia is a good source once you have the basics down.
Audio is input into a computer via an analog-to-digital converter (ADC). Digital audio is output via a digital-to-analog converter (DAC).
Sample rate is the number of times per second at which the analog signal is measured and stored digitally. You can think of the sample rate as the time resolution of an audio signal. Bit size is the number of bits used to store each sample. You can think of it as analogous to the color depth of an image pixel.
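To make the storage side concrete, here is a small sketch using only numpy and Python's standard wave module: it writes one second of a 440 Hz tone as 16-bit samples at 44.1 kHz, then reads the raw bytes back into numbers (the file name and values are arbitrary):

    import wave
    import numpy as np

    rate = 44100                        # sample rate: measurements per second
    bits = 16                           # bit size: each sample stored as a 16-bit integer
    t = np.arange(rate) / rate          # one second's worth of sample times
    samples = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

    # numbers -> file (what eventually feeds the DAC on playback)
    with wave.open("tone.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(bits // 8)
        w.setframerate(rate)
        w.writeframes(samples.tobytes())

    # file -> numbers (the same kind of values an ADC would have produced)
    with wave.open("tone.wav", "rb") as w:
        raw = w.readframes(w.getnframes())
    print(np.frombuffer(raw, dtype=np.int16)[:8])     # first few stored samples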
David Cottle's SuperCollider book also has a great introduction to digital audio.
I was in the same situation, and certainly this kind of information is out there but you need to do some research first. This is what I have found:
Digital audio processing is a branch of DSP (Digital Signal Processing).
"DSP is one of the most powerful technologies that will shape science and engineering in the twenty-first century. Revolutionary changes have already been made in a broad range of fields: communications, medical imaging, radar & sonar, high fidelity music reproduction, and oil prospecting, to name just a few. Each of these areas has developed a deep DSP technology, with its own algorithms, mathematics, and specialized techniques…"
This quote was taken from a very helpful guide that covers every topic in depth, called "The Scientist and Engineer's Guide to Digital Signal Processing". And though you are not asking about DSP specifically, there's a chapter that covers all digital-audio-related topics with very good explanations.
You can find it in Chapter 22, Audio Processing, which covers all of these topics:
Human Hearing: how sound is perceived by our ears; this is the basis of how sound is then generated artificially.
Timbre: explains the properties of sound, like loudness, pitch and timbre.
Sound Quality vs. Data Rate: once you know the previous concepts, we start to translate them to the electronic side.
High Fidelity Audio: gives you a picture of how sound is then processed digitally.
Companding: here you can find how sound is processed and compressed for telecommunications.
Speech Synthesis and Recognition: more processes applied to the sound, like filters, synthesis, etc.
Nonlinear Audio Processing: this is more advanced but understandable, for sound treatment and other topics.
The book explains the basics of sound in the real world, in case you want to take a look, and then it explains how sound is processed in the computer, including what you are asking about.
But there are other, more specific topics that can be found on Wikipedia; for example, the "Digital audio" page explains this topic in detail and can be used as a reference for further research. Right at the beginning you can find links to sample rate, sound waves, digital formats, standards, bit depth, telecommunications, etc. There are a few things you might need to study more, like the Nyquist–Shannon sampling theorem, Fourier transforms, complex numbers, and so on, but these are only used in very specific and advanced topics that you might not review or use; I mention them just in case you are interested. You can find information in both the DSP guide book and Wikipedia, although you will need to study some math.
I've been using Python to develop and study these subjects with code, since it has a lot of useful libraries, like numpy, sounddevice, scipy, etc. Then you can start playing with sound. On YouTube you can find lots of videos that guide you on how to do this; I've found synthesis, filters, and voice recognition, and you can create WAV files with just code, which is great. I've also seen projects in C/C++, JavaScript, and other languages, so it might help to keep learning and coding fun things.
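As a tiny example of the kind of experiment mentioned above (assuming numpy and the sounddevice library are installed):

    import numpy as np
    import sounddevice as sd

    rate = 48000
    t = np.arange(2 * rate) / rate                    # two seconds of sample times
    tone = 0.3 * np.sin(2 * np.pi * 220 * t)          # a 220 Hz sine wave

    sd.play(tone.astype(np.float32), rate)            # the DAC turns the numbers into sound
    sd.wait()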
There are a few other references across the internet, but you might need to know what you are looking for; this book and the Wikipedia page would be the best starting points for me, since they give you the basics and explain every topic in depth. Then, depending on the goal you want to achieve, you can start looking for more information.