Disclaimer: Complete beginner with neural networks & audio representation. Please bear with me.
I have this idea for my bachelor's thesis (MIR) that involves applying a beat-like time-based pattern to constrain where a CNN-based acoustic model finds onsets/offsets. The problem is that I'm having a hard time figuring out how to implement this concept.
The initial plan was to just feed both the spectrogram and the pattern into the CNN and hope it learns to use them together, but I don't know what format the pattern should be in. I know CNNs are best at processing image-like input, but the pattern starts out as something time-based (beats per minute/second). Can such a value be represented as an image-like input that lines up with the spectrogram? If so, in what format? Or should I handle this problem in a different way? Thank you in advance!
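To make it concrete, the representation I had in mind would be something like a frame-aligned beat-activation row that could be stacked with (or fed alongside) the spectrogram. This is only a sketch of the idea; the tempo, sample rate, hop length, and frame count below are made-up placeholder values:

import numpy as np

# placeholder values, purely for illustration
sr = 22050          # audio sample rate
hop_length = 512    # hop length used when computing the spectrogram
n_frames = 400      # number of spectrogram frames
bpm = 120           # tempo of the beat-like pattern

# time (in seconds) at the start of each spectrogram frame
frame_times = np.arange(n_frames) * hop_length / sr

# beat times for a steady pulse at `bpm`
beat_times = np.arange(0.0, frame_times[-1], 60.0 / bpm)

# rasterize: a 1-D row with a 1 at the first frame at or after each beat
beat_activation = np.zeros(n_frames, dtype=np.float32)
beat_activation[np.searchsorted(frame_times, beat_times)] = 1.0

# beat_activation now shares the spectrogram's time axis, so it could be stacked
# onto the spectrogram as an extra row or fed to the network as a second input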
Related
I would like to know a couple of things to clear up my confusion. I want to work on a medical neuroimaging MRI dataset from the ADNI database.
Each Alzheimer's Disease (AD) MRI scan has multiple slices.
Do I have to separate each scan into its slices and label each slice as AD, or combine all the slices into a single volume and label that for classification?
Most medical neuroimages come in DICOM, NIfTI (.nii), or similar formats. Is it mandatory to convert them to PNG or JPG for a CNN model, or can I keep them in NIfTI (.nii) format?
I have read several existing papers on neuroimaging for Alzheimer's disease but did not find answers to the questions above. I even emailed the authors of one paper; the reply was that they could not help with this as they are very busy, with their sincere apologies.
It would be very helpful if anyone could answer these questions and clear up my confusion.
Thank you.
You can train with NIfTI, using, for example, TorchIO. There's no need to separate each slice; you can use the 3D volume as is.
You can find some examples in the documentation.
Disclaimer: I'm the main developer of TorchIO.
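To give an idea of what that looks like, here is a minimal sketch; the file name, label encoding, and choice of transforms are placeholders you would adapt to your ADNI data:

import torch
import torchio as tio

# one Subject per scan: the full 3D NIfTI volume plus its diagnosis label
subject = tio.Subject(
    mri=tio.ScalarImage('scan_001.nii'),  # placeholder path to an ADNI volume
    diagnosis=1,                          # placeholder label, e.g. 1 = AD, 0 = control
)

preprocess = tio.Compose([
    tio.ToCanonical(),             # reorient to a standard orientation
    tio.Resample(1),               # resample to 1 mm isotropic spacing
    tio.RescaleIntensity((0, 1)),  # simple intensity normalization
])

dataset = tio.SubjectsDataset([subject], transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=1)

for batch in loader:
    volume = batch['mri'][tio.DATA]  # tensor of shape (batch, channels, x, y, z)
    label = batch['diagnosis']
    # `volume` can now be fed to a 3D CNN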
I am trying to understand the various spectrograms used for audio analysis. I want to split an audio file into 10-second chunks, generate a spectrogram for each chunk, and train a CNN model on those images to classify them as good or bad.
I have looked at linear, log, mel, etc., and read somewhere that a mel-based spectrogram is best for this, but without any properly verifiable source. I used the following simple code to generate a mel spectrogram.
import librosa
import librosa.display
import numpy as np
y, sr = librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')  # load audio
S = librosa.feature.melspectrogram(y=y, sr=sr)  # mel-scaled power spectrogram
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))  # plot on a dB scale
My question is: which spectrogram best represents the features of an audio file for training a CNN? I have used linear spectrograms, but for some audio files the linear spectrograms look almost identical.
To add to what has been stated, I recommend reading through A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging by Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler.
For their data, they achieved nearly identical classification accuracy between simple STFTs and mel spectrograms. So mel spectrograms seem to be the clear winner for dimensionality reduction if you don't mind the preprocessing. The authors also found, as jonner mentions, that log-scaling (essentially converting amplitude to a dB scale) improves accuracy. You can easily do this with Librosa (using your code) like this:
import librosa
y, sr = librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.core.power_to_db(S)  # log-scale the mel spectrogram (amplitude in dB)
As for normalization after dB-scaling, that seems hit or miss depending on your data. In the paper above, the authors found nearly no difference between various normalization techniques for their data.
One last thing that should be mentioned is a somewhat newer method called Per-Channel Energy Normalization (PCEN). I recommend reading Per-Channel Energy Normalization: Why and How by Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Unfortunately, there are some parameters that need adjusting depending on the data, but in many cases it seems to do as well as, or better than, log-mel spectrograms. You can implement it in Librosa like this:
import librosa
y, sr = librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_pcen = librosa.pcen(S)  # per-channel energy normalization of the mel spectrogram
Although, as I mentioned, there are parameters within pcen that need adjusting! Librosa's documentation on pcen should get you started if you are interested.
Log-scaled mel spectrograms are the current "standard" for use with convolutional neural networks. They were the most commonly used representation in the Audio Event Detection and Audio Scene Classification literature between 2015 and 2018.
To be more invariant to amplitude changes, normalization is usually applied, either to entire clips or to the windows being classified. Mean/std normalization generally works fine.
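As a rough sketch of what per-clip mean/std normalization looks like on a log-mel spectrogram (the array shape and names are just illustrative dummy data):

import numpy as np

def normalize_clip(S_db):
    # zero-mean, unit-variance normalization over the whole clip
    return (S_db - S_db.mean()) / (S_db.std() + 1e-8)

# S_db would be a log-scaled mel spectrogram, e.g. of shape (n_mels, n_frames)
S_db = np.random.randn(128, 431).astype(np.float32)  # dummy data for illustration
S_norm = normalize_clip(S_db)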
But from the perspective of a CNN, there is relatively little difference between the various spectrogram variants, so changing the representation is unlikely to fix your issue if two or more spectrograms look basically the same.
What type of neural net architecture would one use to map sounds to other sounds? Neural nets are great at learning to go from sequences to other sequences, so sound augmentation/generation seems like it would be a very popular application of them (but unfortunately, it's not - I could only find a fairly old Magenta project dealing with it and maybe two other blog posts).
Assuming I have a sufficiently large dataset of input/output sound pairs of the same length, how would I format the data? Perhaps train a CNN on spectrograms (something like CycleGAN or pix2pix), or maybe use the raw data from the WAV file with an LSTM? Is there some other architecture nobody has heard about that's good for sound? Help me out please!
To anyone else doing a similar thing: the answer is to use fast Fourier transforms to get the data into a manageable state, and then people usually use RNNs or LSTMs to work with that data, not CNNs.
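A minimal sketch of that pipeline (the waveform here is a synthetic sine standing in for real audio, and the STFT and LSTM sizes are arbitrary placeholders):

import numpy as np
import librosa
import torch
import torch.nn as nn

# toy waveform: one second of a 440 Hz sine at 22050 Hz, standing in for real audio
sr = 22050
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

# STFT magnitudes: one column of frequency bins per time frame
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))   # shape (513, n_frames)

# reshape to (batch, time, features) for a recurrent network
frames = torch.tensor(stft.T).unsqueeze(0)                   # shape (1, n_frames, 513)

# an LSTM then processes the sequence of STFT frames
lstm = nn.LSTM(input_size=513, hidden_size=256, batch_first=True)
out, _ = lstm(frames)   # (1, n_frames, 256): one output vector per frame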
I am new to computer vision, and I am now doing some research on object detection. I have read papers about Faster R-CNN and R-FCN, and also read about YOLO. It seems the biggest problem is speed? And all of them use image data only. Are there any models that combine text and image data? That is, can we use information from text to help detection when the training data is small? For example, when the training data is small, the model cannot tell dogs and cats apart clearly, but it could tell there is a bone near the object, and from the text it learns that an object near a bone is most likely a dog, so the model can now tell what the object is. Does this kind of algorithm exist? I haven't found any; I hope you can help me. Thanks a lot.
It seems you have mostly referred to research on deep networks for object detection. Prior to the success of deep networks, researchers were looking into the possibility of using text together with image features to implement ideas similar to yours. You might want to refer to papers from ACM Multimedia and IEEE TMM, especially those before 2014.
The problem was that those approaches could not perform as well as the simplest of the deep networks that use only images. There is some work on combining both images and text, such as this paper. I am sure at least some researchers are already working on this.
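To make the idea of combining the two modalities concrete, here is a toy late-fusion sketch (not taken from any of the papers above; the feature dimensions and class count are arbitrary placeholders): image features from a CNN backbone and a text embedding are simply concatenated and fed to a small classifier head.

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    # toy model: concatenate image features with a text embedding, then classify
    def __init__(self, img_dim=2048, txt_dim=300, n_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, img_feat, txt_emb):
        return self.head(torch.cat([img_feat, txt_emb], dim=-1))

# dummy inputs standing in for CNN features and a text embedding
model = LateFusionClassifier()
img_feat = torch.randn(4, 2048)    # e.g. pooled backbone features for 4 regions
txt_emb = torch.randn(4, 300)      # e.g. an embedding of nearby/contextual text
logits = model(img_feat, txt_emb)  # (4, 10) class scores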
I am trying to solve a timeline detection problem using text classification. As a newbie I am confused about how to go about this. Is this a classification problem? That is, can I use the years (timelines) as outcomes and solve it as a classification problem?
You should be able to solve this as a classification problem as you suggest. An option could be to find or build a corpus consisting of texts tagged with the period in which they're set, and train a classification algorithm on this data set.
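As a rough sketch of that first option (the example texts and period labels are made up, and the model choice is just one reasonable default):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny made-up corpus: each text is tagged with the period it is set in
texts = [
    "The knight rode to the castle at dawn.",
    "She streamed the match on her phone during the commute.",
    "The telegraph operator relayed news of the railway opening.",
]
periods = ["medieval", "2010s", "19th century"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, periods)

print(model.predict(["He sent a telegram about the new locomotive."]))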
Another option could be to train a word space model on such a data set, and generate vectors for different periods of time (e.g. the 50s, 60s etc.). You could then create a document vector for the text you wish to classify, and find which of these time vectors yields the best match.
Might not work, but it could be interesting to see what results you get.
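A sketch of that second option, assuming you already have pretrained word embeddings loaded into a plain dict (the corpus layout and helper names here are hypothetical):

import numpy as np

# `embeddings`: dict mapping a word to its vector (e.g. loaded from GloVe)
# `period_texts`: dict mapping a period label to a list of texts from that period

def doc_vector(text, embeddings, dim=300):
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def best_period(text, period_texts, embeddings):
    # one vector per period: the mean of its documents' vectors
    period_vecs = {
        period: np.mean([doc_vector(t, embeddings) for t in texts], axis=0)
        for period, texts in period_texts.items()
    }
    query = doc_vector(text, embeddings)
    return max(period_vecs, key=lambda p: cosine(query, period_vecs[p]))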
Hope this helps!