What information do I get from Spotify's Audio Analysis?

When using Spotify's API to analyse a track (https://developer.spotify.com/web-api/console/get-audio-analysis-track/), it returns a bunch of numbers and strings.
Does anybody know what these numbers mean and how to interpret them?

You should take a look at Spotify's documentation for the audio analysis:
https://developer.spotify.com/web-api/get-audio-analysis/
If you look at the "track" element, you can see it returns a number of useful stats, such as the tempo, key, mode (minor/major), and loudness of the song. In the "segments" elements you can also get a more detailed pitch and timbral (tonal) analysis for parts of the song.
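For a concrete picture, here is a minimal sketch of fetching and reading that analysis in Python; the track ID and OAuth token are placeholders you would substitute with your own:

    import requests

    TRACK_ID = "3n3Ppam7vgaVa1iaRUc9Lp"  # placeholder track ID
    TOKEN = "YOUR_ACCESS_TOKEN"          # placeholder OAuth token

    resp = requests.get(
        f"https://api.spotify.com/v1/audio-analysis/{TRACK_ID}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    analysis = resp.json()

    # Track-level stats: overall tempo (BPM), key (0=C, 1=C#, ...),
    # mode (1=major, 0=minor), and loudness (dB).
    track = analysis["track"]
    print(track["tempo"], track["key"], track["mode"], track["loudness"])

    # Per-segment detail: a 12-dimensional "pitches" (chroma) vector and
    # 12 timbre coefficients for each short slice of the song.
    segment = analysis["segments"][0]
    print(segment["pitches"], segment["timbre"])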

Related

Is there a way to get timestamps of speaker switch times using Google Cloud's speech to text service?

I know there is a way to get words delineated by speaker using the Google Cloud Speech-to-Text API. I'm looking for a way to get the timestamps of when the speaker changes in a longer file. I know that Descript must do something like this under the hood, and that is what I am trying to replicate. My desired end result is to be able to split an audio file with multiple speakers into clips of each speaker, in the order in which they occurred.
I know I could probably extract timestamps for each word and then iterate through the results, recording a timestamp whenever the previous word's speaker differs from the current word's speaker. This seems very tedious for a long audio file, and I'm not sure how accurate it is.
Google "Speech to text" - phone model does what you are looking at by giving result end times for each identified speaker.
Check more here https://cloud.google.com/speech-to-text/docs/phone-model
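The word-by-word approach from the question is also less tedious than it sounds. A minimal sketch, assuming a recent google-cloud-speech client (v2.x, where word timestamps are timedeltas) and a hypothetical Cloud Storage path:

    from google.cloud import speech

    client = speech.SpeechClient()

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        diarization_config=speech.SpeakerDiarizationConfig(
            enable_speaker_diarization=True,
            min_speaker_count=2,
            max_speaker_count=2,
        ),
    )
    # Hypothetical bucket path; long files must go through long_running_recognize.
    audio = speech.RecognitionAudio(uri="gs://your-bucket/recording.wav")

    response = client.long_running_recognize(config=config, audio=audio).result()

    # With diarization on, the last result carries the full word list with
    # speaker tags; record a timestamp whenever the tag changes.
    current_speaker = None
    for word in response.results[-1].alternatives[0].words:
        if word.speaker_tag != current_speaker:
            print(f"speaker {word.speaker_tag} starts at {word.start_time.total_seconds():.2f}s")
            current_speaker = word.speaker_tag

The printed switch times are exactly the cut points needed to split the file into per-speaker clips.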

Is there a way to get Mel-frequency cepstrum coefficients of a track from the Spotify API?

I am looking to get the MFCCs (Mel-frequency cepstrum coefficients) of a Spotify track. My main aim is to identify the genre of a track, and the algorithm I'm studying right now uses MFCCs to extract features of a track.
I think there might be two ways to do this:
Spotify's API has an endpoint called https://api.spotify.com/v1/audio-analysis/{id}. This is what the output looks like for a track. Maybe there is a way to get MFCCs from this output?
Get the raw audio of the track from an API endpoint and then use a (different) library to compute the MFCCs from it (see the sketch after this question).
Or is there any other method I can try?
Thanks :)
Edit:
The output of the audio-analysis API for a track given here contains a key called "tmfccrack". Is this related to the MFCCs?
I found out that you can get the genre of a Spotify track by getting the genre of the corresponding artist through the Spotify API. That gets me what I want for now, but I think I should keep the question open, because it asks for the MFCCs of a track and not just the genre.
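For what it's worth, Spotify's analysis exposes its own 12-dimensional timbre vectors rather than MFCCs, so the second option needs actual audio. A minimal sketch of computing MFCCs yourself with librosa, assuming you have a local clip of the track (e.g., a downloaded preview):

    import librosa

    # Load a local audio clip of the track (e.g., a 30-second preview).
    y, sr = librosa.load("preview.mp3", sr=22050)

    # 13 MFCCs per frame; averaging over time gives one feature vector
    # per track, which can then feed a genre classifier.
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    track_vector = mfccs.mean(axis=1)
    print(track_vector.shape)  # (13,)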

Bing Speech to Text API returning very wrong text

I am trying the "Bing Speech To Text API" in audio files that contains a real conversations between a person that answer customers in a call-center, and a customer that calls the call center to solve his doubts. Thus, these audios have two persons talking, and sometimes have long silence period when the customer is waiting an answer from support. These audios have 5 to 10 minutes long.
My doubt is:
What is the best aproach to translate audios like that to text, using Microsoft Cognitive Services?
What APIs do I have to use, besides Bing Speech To Text?
Do I have to cut or convert the audios before sending them to Bing Speech To Text?
I am asking that because the Bing Speech to text API is returning an text very very very very very different from the audio content. It is impossible to use or undertand. But, of course, I think I am doing some mistake.
Please, could you explain to me the best strategy to work with audio files like this?
I would be very glad for any help.
Best Regads,
I ran into this problem with conversations as well. Make sure that the transcription mode is set to "conversation" instead of "interactive."
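In the old Bing Speech REST API, the recognition mode was part of the URL path. A rough sketch of the mode selection, assuming a 16 kHz mono PCM WAV; the file name and subscription key are placeholders:

    import requests

    mode = "conversation"  # instead of "interactive"
    url = f"https://speech.platform.bing.com/speech/recognition/{mode}/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": "YOUR_KEY",  # placeholder
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    }
    with open("call.wav", "rb") as f:
        resp = requests.post(url, params={"language": "en-US"}, headers=headers, data=f)
    print(resp.json())

Note that this REST endpoint historically accepted only short utterances (on the order of 15 seconds), so 5-to-10-minute calls would need to be cut into chunks or sent over the streaming protocol.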

Methods for determining acoustical similarity (but not fingerprinting)

I'm looking for methods that work in practice for determining some kind of acoustic similarity between different songs.
Most of the methods I've seen so far (MFCC etc.) actually seem to aim only at finding identical songs (i.e. fingerprinting, for music recognition rather than recommendation), while most recommendation systems seem to work on network data (co-listened songs) and tags.
Most MPEG-7 audio descriptors also seem to be along these lines. Plus, most of them are defined at the "extract this and that" level, but nobody seems to actually make use of these features to compute some song similarity, let alone an efficient search for similar items.
Tools such as http://gjay.sourceforge.net/ and http://imms.luminal.org/ seem to use some simple spectral analysis, file system location, tags, plus user input such as the "color" and rating manually assigned by the user, or how often the song was listened to or skipped.
So: which audio features are reasonably fast to compute for a common music collection, and can be used to generate interesting playlists and find similar songs? Ideally, I'd like to feed in an existing playlist and get out a number of songs that would match it.
So I'm really interested in acoustic similarity, not so much identification/fingerprinting. Actually, I'd just want to remove identical songs from the results, because I don't want them twice.
And I'm also not looking for query by humming. I don't even have a microphone attached.
Oh, and I'm not looking for an online service. First of all, I don't want to send all my data to Apple etc.; secondly, I want to get recommendations only from the songs I own (I don't want to buy additional music right now, while I haven't explored all of my music; I haven't even converted all my CDs to mp3 yet); and thirdly, my music taste is not mainstream, and I don't want the system to recommend Mariah Carey all the time.
Plus, of course, I'm really interested in which techniques work well and which don't... Thank you for any recommendations of relevant literature and methods.
Only one application has ever done this really well: MusicIP Mixer.
http://www.spicefly.com/article.php?page=musicip-software
It hasn't been updated for about ten years (and even then the interface was a bit clunky), it requires a very old version of Java, and it doesn't work with all file formats - but it was and still is cross-platform and free. It does everything you're asking: it generates acoustic fingerprints for every mp3/ogg/flac/m3u in your collection, saves them to a tag on the song, and, given one or more songs, generates a playlist similar to those songs. It only uses the acoustics of the songs, so it's just as likely to add an unreleased track that only you have on your own hard drive as a famous song.
I love it, but every time I update my operating system or buy a new computer, it takes forever to get it working again.
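For anyone who wants a rough do-it-yourself baseline rather than a packaged tool, summarizing each song by MFCC statistics and ranking by cosine similarity is a common starting point. A sketch, with placeholder file names:

    import numpy as np
    import librosa

    def feature_vector(path):
        """Summarize a song by the mean and standard deviation of its MFCCs."""
        y, sr = librosa.load(path, sr=22050, duration=60)  # first minute is plenty
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Rank a collection against a seed song; scores near 1.0 flag duplicates,
    # which can simply be dropped from the resulting playlist.
    seed = feature_vector("seed.mp3")
    for path in ["a.mp3", "b.mp3", "c.mp3"]:
        print(path, round(cosine_similarity(seed, feature_vector(path)), 3))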

What elements are involved with songs

I want to store and write songs. Are songs all just pitch? If I stored only the pitch of each part of the song, could I apply each pitch to a simple tone (a "bing" sound) and play it back to replicate the song?
I'm very confused.
At minimum you will require a sequence of notes, each of which has a pitch and a duration. This can be improved with chords and other types of polyphony, dynamics (volume or loudness), timbre, etc.
You should look into MIDI technology and its related file formats for ideas about such a system, and as a possible means of playing your songs on a computer.
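To make that concrete, here is a minimal sketch using the mido Python library (one choice among many MIDI libraries): each note pairs a pitch with a duration, and the saved file can be played by any MIDI synthesizer.

    from mido import Message, MidiFile, MidiTrack

    # A melody is more than pitches: each note pairs a MIDI pitch with a
    # duration in ticks. Velocity stands in for dynamics (loudness).
    melody = [(60, 480), (62, 480), (64, 960)]  # C4, D4, E4

    mid = MidiFile()
    track = MidiTrack()
    mid.tracks.append(track)

    for pitch, duration in melody:
        track.append(Message('note_on', note=pitch, velocity=64, time=0))
        track.append(Message('note_off', note=pitch, velocity=64, time=duration))

    mid.save('song.mid')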
