I need to extract musical features (pitch, duration, rhythm, loudness, and note start time) from a polyphonic MIDI file that has two staves, treble and bass (the bass may also contain chords). I'm using the jMusic API to extract these details. My approach is to go through each score, into its parts, then phrases, and finally notes, extracting the details.
With this approach, all the treble notes are read first and then the bass notes, but chords are not captured (only a single note of each chord is taken), and I cannot tell at which point the bass notes begin.
So I tried to get the note onsets (the start time at which each note begins), since the treble and bass notes at the start of the piece should start at the same time. But I cannot extract the note onset using the jMusic API; it shows 0.0 for every note.
Is there any way I can identify the voice (treble or bass) of a note? And also all the notes of a chord? How are the voice and note onset of each note stored in MIDI? Does this differ between MIDI files?
Any insight is greatly appreciated. Thanks in advance
You might want to have a look at this question: Actual note duration from MIDI duration, where a possible approach to extracting notes from a MIDI file is discussed.
Consider that a MIDI file can be split across multiple tracks (a "type 1" MIDI file).
Once you have identified the notes, identifying chords can still be tricky. Say you have three notes, C, E, and G, happening "at the same time" (i.e. identified as sounding at the same point in a measure). When are they to be considered the C major chord? When they are:
- played on the same channel
- played by the same instrument (even if on different channels)
- played on the same channel even if they appear on different tracks
The MIDI file format is very simple (maybe even too simple!). I suggest you have a look at its description here: http://duskblue.org/proj/toymidi/midiformat.pdf
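Once you have absolute onset times, grouping simultaneous note-on events recovers chords. Below is a hedged sketch in Python, independent of jMusic, over a hypothetical list of already-parsed (onset_tick, channel, pitch) events; the grouping key follows the "same channel" criterion from the list above:

```python
from collections import defaultdict

def group_chords(note_events):
    """Group note-on events into chords by (onset, channel).

    note_events: iterable of (onset_tick, channel, pitch) tuples.
    Returns a dict mapping (onset_tick, channel) -> sorted list of pitches;
    entries with more than one pitch are chords.
    """
    groups = defaultdict(list)
    for onset, channel, pitch in note_events:
        groups[(onset, channel)].append(pitch)
    return {key: sorted(pitches) for key, pitches in groups.items()}

# Hypothetical events: treble melody on channel 0, a bass chord on channel 1.
events = [
    (0, 0, 72),                           # treble C5
    (0, 1, 48), (0, 1, 52), (0, 1, 55),   # bass C major chord (C3, E3, G3)
    (480, 0, 74),                         # treble D5
]
chords = group_chords(events)
```

With a channel-based key, the bass chord at tick 0 comes out as one three-note group while the treble note stays separate, which also answers the "which voice is this note" question when the two staves sit on different channels.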
I'm reading through Mark Owen's "Practical Signal Processing" and an exercise in the second chapter says to "Create a vector containing samples of several seconds of a 400Hz sine wave at 32kHz" (question 2.3).
Since the book doesn't endorse any one technology, I'm trying to do this in Supercollider with:
"Pbind(\freq, Pseq([400,400,400,400,400,400,400,400,400,400,]), \dur, 0.15;).play;"
but I have two problems: how do I remove the gap between the notes during pattern playback and how do I generate the tones in the pattern at a particular sample rate?
Thank you!
It sounds like you're working at the "wrong" level. Using Pbind is very high-level, specifying patterns of musical events, whereas the author presumably wants you to think about the maths involved in generating individual audio data samples.
Since it's an exercise for the reader I won't give a complete answer, but: SuperCollider has a sin() operator as do many other languages. You can generate a list of values and then apply sin() e.g. via
sin([0,1,2,3,4,5])
or
sin((0..100))
Those are simple examples; they don't get the frequency or sample rate or duration you specify.
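To make the maths concrete without spoiling the SuperCollider exercise, here is the underlying formula sketched in Python: sample n of a sine at frequency f and sample rate fs is sin(2*pi*f*n/fs), so "several seconds at 32 kHz" is simply a long list of those values.

```python
import math

def sine_samples(freq_hz, sample_rate_hz, duration_s):
    """Return samples of a sine wave: x[n] = sin(2*pi*f*n/fs)."""
    n_samples = int(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

# A 400 Hz tone, 3 seconds at 32 kHz, as in the exercise.
x = sine_samples(400.0, 32000.0, 3.0)
```

At 32 kHz a 400 Hz cycle is exactly 80 samples long, so sample 20 (a quarter period) hits the peak value 1.0; translating this list-comprehension back into the `sin((0..n))` style above is the exercise.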
The question doesn't seem to ask you to play back the result, but if you wanted to do that you could do it by loading your calculated audio into a Buffer:
x = sin((0..1000));
b = Buffer.sendCollection(s, x);
b.play
I'm working on a project to transcribe lecture videos. We are currently just using humans to do the transcriptions as we believe it is easier to transcribe than editing ASR, especially for technical subjects (not the point of my question, though I'd love any input on this). From our experiences we've found that after about 10 minutes of transcribing we get anxious or lose focus. Thus we have been splitting videos into ~5-7 minute chunks based on logical breaks in the lecture content. However, we've found that the start of a lecture (at least for the class we are piloting) often has more talking than later on, which often has time where the students are talking among themselves about a question. I was thinking that we could do signal processing to determine the rough amount of speaking throughout the video. The idea is to break the video into segments containing roughly the same amount of lecturing, as opposed to segments that are the same length.
I've done a little research into this, but everything seems to be a bit overkill for what I'm trying to do. The videos for this course, though we'd like to generalize, contain basically just the lecturer with some occasional feedback and distant student voices. So can I just simply look at the waveform and roughly use the spots containing audio over some threshold to determine when the lecturer is speaking? Or is an ML approach really necessary to quantify the lecturer's speaking?
Hope that made sense, I can clarify anything if necessary.
Appreciate the help as I have no experience with signal processing.
Although there are machine learning methods that are very good at discriminating voice from other sounds, you don't seem to require that sort of accuracy for your application. A simple level-based method similar to the one you proposed should be good enough to get you an estimate of speaking time.
Level-Based Sound Detection
Goal
Given an audio sample, discriminate the portions with a high amount of sounds from the portions that consist of background noise. This can then be easily used to estimate the amount of speech in a sound file.
Overview of Method
Rather than looking at raw levels in the signal, we will first convert it to a sliding-window RMS. This gives a simple measure of how much audio energy there is at any given point of the audio sample. By analyzing the RMS signal we can automatically determine a threshold for distinguishing between background noise and speech.
Worked Example
I will be working this example in MATLAB because it makes the math easy to do and lets me create illustrations.
Source Audio
I am using President Kennedy's "We choose to go to the moon" speech. I'm using the audio file from Wikipedia, and just extracting the left channel.
imported = importdata('moon.ogg');
audio = imported.data(:,1);                 % left channel only
plot((1:length(audio))/imported.fs, audio); % plot against time in seconds
title('Raw Audio Signal');
xlabel('Time (s)');
Generating RMS Signal
Although you could technically implement an overlapping per-sample sliding window, it is simpler to avoid the overlapping and you'll get very similar results. I broke the signal into one-second chunks, and stored the RMS values in a new array with one entry per second of audio.
audioRMS = [];
for i = 1:imported.fs:(length(audio)-imported.fs)
audioRMS = [audioRMS; rms(audio(i:(i+imported.fs)))];
end
plot(1:length(audioRMS), audioRMS);
title('Audio RMS Signal');
xlabel('Time (s)');
This results in a much smaller array, full of positive values representing the amount of audio energy or "loudness" per second.
Picking a Threshold
The next step is to determine how "loud" is "loud enough." You can get an idea of the distribution of noise levels with a histogram:
histogram(audioRMS, 50);
I suspect that the lower shelf is the general background noise of the crowd and recording environment. The next shelf is probably the quieter applause. The rest is speech and loud crowd reactions, which will be indistinguishable to this method. For your application, the loudest areas will almost always be speech.
The minimum value in my RMS signal is .0233, and as a rough guess I'm going to use 3 times that value as my criterion for noise. That seems like it will cut off the whole lower shelf and most of the next one.
A simple check against that threshold gives a count of 972 seconds of speech:
>> sum(audioRMS > 3*min(audioRMS))
ans =
972
To test how well it actually worked, we can listen to the audio that was eliminated.
speech = audioRMS > 3*min(audioRMS);  % logical mask of "loud" seconds
clippedAudio = [];
for i = 1:length(speech)
    if(~speech(i))
        clippedAudio = [clippedAudio; audio(((i-1)*imported.fs+1):i*imported.fs)];
    end
end
>> sound(clippedAudio, imported.fs);
Listening to this gives a bit over a minute of background crowd noise and sub-second clips of portions of words, due to the one-second windows used in the analysis. No significant lengths of speech are clipped. Doing the opposite gives audio that is mostly the speech, with clicks heard as portions are skipped. The louder applause breaks also make it through.
This means that for this speech, the threshold of three times the minimum RMS worked very well. You'll probably need to fiddle with that ratio to get good automatic results for your recording environment, but it seems like a good place to start.
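The same method translates directly to other environments. Here is a hedged NumPy sketch of the pipeline (non-overlapping one-second RMS windows, thresholded at three times the minimum RMS), demonstrated on a synthetic signal rather than a real recording:

```python
import numpy as np

def speech_mask(audio, fs, ratio=3.0):
    """Per-second RMS of `audio`, thresholded at `ratio` * minimum RMS.

    Returns (rms, mask) where mask[i] is True for "loud" second i.
    """
    n_seconds = len(audio) // fs
    chunks = audio[:n_seconds * fs].reshape(n_seconds, fs)
    rms = np.sqrt(np.mean(chunks ** 2, axis=1))
    return rms, rms > ratio * rms.min()

# Synthetic example: 10 s of quiet noise, with loud "speech" in seconds 2-6.
fs = 8000
rng = np.random.default_rng(0)
audio = 0.01 * rng.standard_normal(10 * fs)
audio[2 * fs:7 * fs] += 0.5 * np.sin(2 * np.pi * 200 * np.arange(5 * fs) / fs)

rms, mask = speech_mask(audio, fs)
```

On this toy signal `mask.sum()` counts exactly the five loud seconds; on real lecture audio you would sum the mask over candidate segments to balance speaking time across them.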
I'm looking for methods that work in practise for determining some kind of acoustical similarity between different songs.
Most of the methods I've seen so far (MFCC etc.) seem to aim at finding identical songs only (i.e. fingerprinting for music recognition, not recommendation), while most recommendation systems seem to work on network data (co-listened songs) and tags.
Most MPEG-7 audio descriptors also seem to be along this line. Plus, most of them are defined at the "extract this and that" level, but nobody seems to actually make use of these features for computing song similarity, let alone an efficient search for similar items...
Tools such as http://gjay.sourceforge.net/ and http://imms.luminal.org/ seem to use some simple spectral analysis, file system location, tags, plus user input such as the "color" and rating manually assigned by the user or how often the song was listened and skipped.
So: which audio features are reasonably fast to compute for a common music collection, and can be used to generate interesting playlists and find similar songs? Ideally, I'd like to feed in an existing playlist, and get out a number of songs that would match this playlist.
So I'm really interested in acoustic similarity, not so much identification/fingerprinting. Actually, I'd just want to remove identical songs from the result, because I don't want them twice.
And I'm also not looking for query by humming. I don't even have a microphone attached.
Oh, and I'm not looking for an online service. First of all, I don't want to send all my data to Apple etc.; secondly, I want recommendations only from the songs I own (I don't want to buy additional music right now, while I haven't explored all of my music; I haven't even converted all my CDs to mp3 yet); and thirdly, my music taste is not mainstream, so I don't want the system to recommend Mariah Carey all the time.
Plus of course, I'm really interested in what techniques work well, and which don't... Thank you for any recommendations of relevant literature and methods.
Only one application has ever done this really well. MusicIP mixer.
http://www.spicefly.com/article.php?page=musicip-software
It hasn't been updated for about ten years (and even then the interface was a bit clunky), it requires a very old version of Java, and it doesn't work with all file formats, but it was and still is cross-platform and free. It does everything you're asking: it generates acoustic fingerprints for every mp3/ogg/flac/m3u in your collection, saves them to a tag on the song, and, given one or more songs, generates a playlist similar to those songs. It only uses the acoustics of the songs, so it's just as likely to add an unreleased track that only you have on your own hard drive as a famous song.
I love it, but every time I update my operating system / buy a new computer it takes forever to get it working again.
I'm new to audio filters, so please excuse me if I'm saying something wrong.
I'd like to write code that can split audio stored as PCM samples into two or three frequency bands, do some manipulation (like modifying their levels) or analysis on them, and then reconstruct audio samples from the output.
As far as I have read, I could use an FFT-IFFT pair and manipulate the complex spectrum, or use a time-domain filter bank such as the one used by the MP2 audio encoding format. Maybe a filter bank is the better choice; at least I read somewhere that it can be more CPU-friendly in real-time streaming environments. However, I'm having a hard time understanding the mathematics behind filter banks, and I'm trying to find some source code (preferably in Java or C/C++) on this topic, so far with no success.
Can somebody provide me tips or links which can get me closer to an example filter bank?
Using an FFT to split an audio signal into a few bands is overkill.
What you need is one or two Linkwitz-Riley filters. These filters split a signal into a high- and a low-frequency part.
A nice property of this filter is that if you add the low- and high-frequency parts together, you get almost the original signal back. There will be a little bit of phase shift, but the ear will not be able to hear it.
If you need more than two bands, you can chain the filters. For example, if you want to separate the signal at 100 Hz and 2000 Hz, it would look somewhat like this in pseudo-code:
low = linkwitz-riley-low (100, input-samples)
temp = linkwitz-riley-high (100, input-samples)
mids = linkwitz-riley-low (2000, temp)
highs = linkwitz-riley-high (2000, temp);
and so on..
After splitting the signal you can, for example, amplify the three output bands (low, mids and highs) and later add them together to get your processed signal.
The filter sections themselves can be implemented using IIR filters. A Google search for "Linkwitz-Riley digital IIR" should give lots of good hits.
http://en.wikipedia.org/wiki/Linkwitz-Riley_filter
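For a concrete starting point, here is a minimal sketch in Python, assuming a 4th-order Linkwitz-Riley section built as two cascaded 2nd-order Butterworth biquads (coefficients from the well-known RBJ audio-EQ cookbook); treat it as an illustration of the structure, not production DSP code:

```python
import math

def biquad_coeffs(freq_hz, fs, kind):
    """2nd-order Butterworth low/highpass coefficients (RBJ cookbook, Q = 1/sqrt(2))."""
    w = 2 * math.pi * freq_hz / fs
    alpha = math.sin(w) / (2 * (1 / math.sqrt(2)))
    cw = math.cos(w)
    if kind == 'low':
        b0, b1, b2 = (1 - cw) / 2, 1 - cw, (1 - cw) / 2
    else:  # 'high'
        b0, b1, b2 = (1 + cw) / 2, -(1 + cw), (1 + cw) / 2
    a0, a1, a2 = 1 + alpha, -2 * cw, 1 - alpha
    return (b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0)

def biquad(samples, coeffs):
    """Direct-form I biquad filter over a list of samples."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x1, x2, y1, y2 = x, x1, y, y1
        out.append(y)
    return out

def linkwitz_riley(samples, freq_hz, fs, kind):
    """LR4 section = the same Butterworth biquad applied twice in cascade."""
    c = biquad_coeffs(freq_hz, fs, kind)
    return biquad(biquad(samples, c), c)

# Split a test signal (50 Hz + 5 kHz tones) at a 1 kHz crossover.
fs = 44100
sig = [math.sin(2 * math.pi * 50 * n / fs) + math.sin(2 * math.pi * 5000 * n / fs)
       for n in range(fs)]
lows = linkwitz_riley(sig, 1000.0, fs, 'low')
highs = linkwitz_riley(sig, 1000.0, fs, 'high')
```

Summing `lows` and `highs` gets back (nearly) the original, as described above, and the earlier crossover pseudo-code maps onto one split at 100 Hz followed by a second split of the high output at 2000 Hz.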
You should look up wavelets, especially Daubechies wavelets. They will let you do the trick, they're FIR filters and they're really short.
Update
Downvoting with no explanation isn't cool. Additionally, I'm right. Wavelets are filter banks and their job is to do precisely what is described in the question. IMHO, that is. I've done it many times myself.
There's a lot of filter source code to be found here
I am a bit stuck here, as I can't seem to find algorithms for distinguishing whether a sound produced is a chord or a single note. I am working with the guitar.
Currently, what I am experimenting with is getting the top 5 frequencies with the highest amplitudes, and then determining whether they are harmonics of the fundamental (the one with the highest amplitude) or not. I am working on the theory that single notes contain more harmonics than chords, but I am unsure whether this is the case.
Another thing I am considering is adding in the various amplitude values of the harmonics, as well as comparing the notes comprising the 'supposed chord' to the result of the FFT.
Can you help me out here? It would be really appreciated. Currently, I am only working on Major and Minor chords first.
Thank you very much!
Chord recognition is still a research topic. A good solution might require some fairly sophisticated AI pattern matching techniques. The International Society for Music Information Retrieval seems to run an annual contest on automatic transcription type problems. You can look up the conference and research papers on what has been tried, and how well it works.
Also note that the fundamental pitch is not necessarily the frequency with the highest FFT amplitude result. With a guitar, it very often is not.
You need to think about it in terms of the way we hear sound. Looking for the top 5 frequencies isn't going to do you any good.
You need to look at all frequencies whose amplitude is within (maximum amplitude)/sqrt(2) of the peak, i.e. within about 3 dB of the strongest component, to determine the chord/not-chord aspect of the signal.
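As a hedged sketch of that 3 dB criterion, using synthetic sine tones rather than real guitar (which will have far messier spectra with strong harmonics): take the FFT magnitudes and count the bins whose amplitude is at least 1/sqrt(2) of the strongest one.

```python
import numpy as np

def strong_peak_count(samples, ratio=1 / np.sqrt(2)):
    """Count FFT bins whose magnitude is within `ratio` of the strongest bin."""
    mags = np.abs(np.fft.rfft(samples))
    return int(np.sum(mags >= ratio * mags.max()))

fs, n = 4000, 4000                       # 1 s of audio, so bin k = k Hz exactly
t = np.arange(n) / fs
single = np.sin(2 * np.pi * 220 * t)     # one note: A3
chord = (np.sin(2 * np.pi * 220 * t)     # A major triad, frequencies rounded
         + np.sin(2 * np.pi * 277 * t)   # to integer bins: A3, ~C#4, ~E4
         + np.sin(2 * np.pi * 330 * t))
```

On these clean tones the count is 1 for the single note and 3 for the triad; on a real recording you would window the signal and merge adjacent bins belonging to one smeared peak before counting.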