How to handle long audio clips in machine learning? [closed]

What do people do when handling long audio clips (2-5 min, 44.1 kHz) in machine learning tasks such as music classification?
Are there any methods besides downsampling that would help reduce the dimensionality of audio data?

Usually you extract frequency-domain features such as a spectrogram or MFCCs and then classify those. They contain far fewer values than the raw audio, so they are easier to analyze.
You can find some visualizations of spectrograms and MFCCs here (related to speech, but the idea scales to music):
https://www.kaggle.com/davids1992/speech-visualization-and-exploration
Note that pooling layers also reduce the dimensionality of the data inside a CNN.
So read up on spectral analysis. You rarely work with raw waveforms directly, although models that do are starting to appear, like WaveNet:
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
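As a concrete illustration, here is a minimal sketch of extracting a log-mel spectrogram and MFCCs with librosa; the file path and all parameter values are placeholders, not anything from the question:

```python
import librosa
import numpy as np

# Load the clip; resampling from 44.1 kHz to 22.05 kHz already
# halves the number of samples.
y, sr = librosa.load("song.wav", sr=22050)  # "song.wav" is a placeholder path

# Mel spectrogram: (n_mels, n_frames) instead of millions of raw samples.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

# MFCCs: an even more compact summary, commonly 13-40 coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

print(y.shape, log_mel.shape, mfcc.shape)
```

For a multi-minute clip the log-mel matrix is already several times smaller than the raw sample array, the MFCC matrix smaller still, and pooling inside the CNN then shrinks it further.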

Related

Feeding an image to stacked resnet blocks to create an embedding [closed]

Do you have any code example or paper that refers to something like the following diagram?
I want to know why we would stack multiple ResNet blocks as opposed to multiple plain convolutional blocks as in more traditional architectures. Any code sample, or a pointer to one, would be really helpful.
Also, how can I extend that to something like the following, where each ResNet block contains a self-attention module?
Applying self-attention to the outputs of ResNet blocks at the very high resolution of the input image may lead to memory issues: the memory requirements of self-attention blocks grow quadratically with the input size (i.e., the resolution). This is why, e.g., Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He, "Non-Local Neural Networks" (CVPR 2018), introduced self-attention only at a very deep layer of the architecture, once the feature map had been substantially sub-sampled.
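For illustration, here is a minimal PyTorch sketch of that placement idea: stacked residual blocks with a single self-attention block inserted only after the feature map has been downsampled. This is my own simplification, not the paper's code, and all module names and layer sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic 3x3 residual block with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

class SelfAttention2d(nn.Module):
    """Non-local-style self-attention over all spatial positions.
    The attention matrix is (H*W) x (H*W), so memory grows
    quadratically with resolution: apply only on small feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.k(x).flatten(2)                   # (B, C//8, HW)
        v = self.v(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out  # residual connection around the attention

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1),   # 1/2 resolution
    ResidualBlock(64),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), # 1/4 resolution
    ResidualBlock(128),
    SelfAttention2d(128),  # attention only here, on the sub-sampled map
    ResidualBlock(128),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 10),    # 10 classes, placeholder
)

x = torch.randn(2, 3, 64, 64)
print(model(x).shape)  # torch.Size([2, 10])
```

At the 1/4-resolution stage a 64x64 input yields a 256x256 attention matrix, which is affordable; at full resolution it would be 4096x4096, which is exactly the quadratic blow-up the answer warns about.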

What is the difference between the different GloVe models? [closed]

https://nlp.stanford.edu/projects/glove/
I'm trying to use GloVe for summarizing music reviews, but I'm wondering which version is best for my project. Will "glove.840B.300d.zip" give me more accurate text summarization since it was trained on far more tokens? Or is Wikipedia 2014 + Gigaword 5 perhaps more representative than Common Crawl? Thanks!
Unfortunately I don't think anyone can give you a better answer for this than:
"try several options and see which one works best"
I've seen work that used the Wikipedia 2014 + Gigaword 100d vectors to produce SOTA results for reading comprehension. Without experimentation, it's difficult to say conclusively which corpus is closer to your music-review data, or what the impact of higher-dimensional word embeddings will be.
This is just rough advice, but I would suggest trying them in this order:
100d from Wikipedia+Gigaword
300d from Wikipedia+Gigaword
300d from Common Crawl
You might as well start with the lower-dimensional embeddings while prototyping, then experiment with larger embeddings to see if you get a performance boost.
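If it helps with prototyping, here is a minimal sketch of loading one of the GloVe text files into a dictionary; the file name matches the Wikipedia+Gigaword 100d download, and switching versions is just a path change:

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe .txt file into {word: vector}, one entry per line."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# 100d Wikipedia+Gigaword file from https://nlp.stanford.edu/projects/glove/
glove = load_glove("glove.6B.100d.txt")
print(glove["music"].shape)  # (100,)
```

One caveat: the Common Crawl 840B file reportedly contains a few tokens with embedded spaces, so splitting from the right (e.g., line.rsplit(" ", 300)) is safer for that file.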
And in the spirit of promoting other groups' work, I would definitely say you should look at the ELMo vectors from AllenNLP:
http://allennlp.org/elmo
They look very promising!

Is conv2d or conv1d more accurate in image classification? [closed]

I have run an image-classification program and it was working well. I ran the code with both Conv1D and Conv2D and got an accuracy of 0.854 in both cases.
Can I know the exact differences between these two in detail?
Conv1d is a convolution filter of one dimension (imagine it as a one-dimensional array). Conv2d is a filter with two dimensions (like a 2D array), and it is more suitable for data like images, where it can retain more spatial information for each data point because it is applied to all of its neighbors. Look up how a convolution kernel works to understand why this is better for image-like data. For non-image data I guess it will not make a significant difference whether you use 1D or 2D convolutions.
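To make the difference concrete, here is a small PyTorch sketch; the shapes and layer sizes are arbitrary examples:

```python
import torch
import torch.nn as nn

# A 32x32 single-channel image, batch of one.
img = torch.randn(1, 1, 32, 32)

# Conv2d slides a 3x3 window over both height and width,
# so each output value mixes a pixel with all 8 spatial neighbors.
conv2d = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
print(conv2d(img).shape)  # torch.Size([1, 4, 32, 32])

# Conv1d only slides along one axis; to feed it the image must be
# flattened, which discards the 2D neighborhood structure.
conv1d = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
flat = img.reshape(1, 1, 32 * 32)
print(conv1d(flat).shape)  # torch.Size([1, 4, 1024])
```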
Note: this site is for programming problems; you might get better answers to a question like this on Data Science Stack Exchange.

Can I match speaker with pitch, timbre and volume? [closed]

I want to build a speaker recognition system. I don't want to use deep learning, as it will perhaps require a lot of data. Can I implement it using the audio features mentioned in the title (pitch, timbre, volume), or others?
In any case, you will need training data if you want to "recognize" speakers. A classical approach is based on MFCC extraction followed by classification with k-means (or, more elaborately, GMMs).
An overview of the full LIUM speaker diarization system (which is more sophisticated) is also worth looking at.
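As a hedged sketch of that classical MFCC + GMM pipeline (librosa plus scikit-learn; the file names are placeholders, and a real system would add normalization, a universal background model, and far more enrollment data per speaker):

```python
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path):
    """Return per-frame MFCC vectors of shape (n_frames, 13) for one recording."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

# One GMM per enrolled speaker, fit on that speaker's MFCC frames.
# "alice.wav" / "bob.wav" are placeholder enrollment recordings.
speakers = {"alice": "alice.wav", "bob": "bob.wav"}
models = {
    name: GaussianMixture(n_components=8, covariance_type="diag")
          .fit(mfcc_frames(path))
    for name, path in speakers.items()
}

def identify(path):
    """Pick the speaker whose GMM gives the highest mean log-likelihood."""
    frames = mfcc_frames(path)
    return max(models, key=lambda name: models[name].score(frames))

print(identify("unknown.wav"))  # placeholder test file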

Best architecture for object recognition [closed]

I'm evaluating the options of using HTM (hierarchical temporal memory) and CNN (convolutional neural network) for object recognition. Which architecture (model) would be most appropriate in this case?
Convolutional neural networks and their variants are the best tools for object recognition.
You can try AlexNet, VGGNet, or ResNet, together with techniques such as batch normalization and dropout.
In these cases, always prefer pretrained models and transfer learning first. You can check out the implementations of Inception V3 and other models for object detection on the TensorFlow website and use them as a starting point for transfer learning in your project.
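As a minimal transfer-learning sketch with Keras (the class count, input size, and hyperparameters are placeholder assumptions):

```python
import tensorflow as tf

# Pretrained Inception V3 backbone without its ImageNet classifier head.
base = tf.keras.applications.InceptionV3(weights="imagenet",
                                         include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = False  # freeze the pretrained features

# New head for a hypothetical 5-class recognition task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Freezing the backbone means only the small new head is trained, which is exactly why transfer learning works well when you have limited data.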
