Audio classification with Keras: presence of human voice - audio

I'd like to create an audio classification system with Keras that simply determines whether a given sample contains human voice or not. Nothing else. This would be my first machine learning attempt.
This audio preprocessor exists. It claims not to be done, but it's been forked a few times:
https://github.com/drscotthawley/audio-classifier-keras-cnn
I don't understand how this one would work, but I'm ready to give it a try:
https://github.com/keunwoochoi/kapre
But let's say I got one of those to work, would the rest of the process be similar to image classification? Basically, I've never fully understood when to use Softmax and when to use ReLu. Would this be similar with sound as it would with images once I've got the data mapped as a tensor?

Sounds can be seen as a 1D image and be worked with with 1D convolutions.
Often, dilated convolutions may do a good work, see Wave Nets
Sounds can also be seen as sequences and be worked with RNN layers (but maybe they're too bulky in amount of data for that)
For your case, you need only one output with a 'sigmoid' activation at the end and a 'binary_crossentropy' loss.
Result = 0 -> no voice
Result = 1 -> there's voice
When to use 'softmax'?
The softmax function is good for multiclass problems (not your case) where you want only one class as a result. All the results of a softmax function will sum 1. It's intended to be like a probability of each class.
It's mainly used at the final layer, because you only get classes as the final result.
It's good for cases when only one class is correct. And in this case, it goes well with the loss categorical_crossentropy.
Relu and other activations in the middle of the model
These are not very ruled. There are lots of possibilities. I often see relu in image convolutional models.
Important things to know are they "ranges". What are the limits of their outputs?
Sigmoid: from 0 to 1 -- at the end of the model this will be the best option for your presence/abscence classification. Also good for models that want many possible classes together.
Tanh: from -1 to 1
Relu: from 0 to limitless (it simply cuts negative values)
Softmax: from 0 to 1, but making sure the sum of all values is 1. Good at the end of models that want only 1 class among many classes.

Oftentimes it is useful to preprocess the audio to a spectrogram:
Using this as input, you can use classical image classification approaches (like convolutional neural networks). In your case you could divide the input audio in frames of around 20ms-100ms (depending on the time resolution you need) and convert those frames to spectograms. Convolutional networks can also be combined with recurrent units to take a larger time context into account.
It is also possible to train neural networks on raw waveforms using 1D Convolutions. However research has shown that preprocessing approaches using a frequency transformation achieve better results in general.

Related

Training/Predicting with CNN / ResNet on all classes each iteration - concatenation of input data + Hungarian algorithm

So I've got a simple pytorch example of how to train a ResNet CNN to learn MNIST labeling from this link:
https://zablo.net/blog/post/using-resnet-for-mnist-in-pytorch-tutorial/index.html
It's working great, but I want to hack it a bit so that it does 2 things. First, instead of predicting digits, it predicts animal shapes/colors for a project I'm working on. That's already working quite well already and am happy with it.
Second, I'd like to hack the training (and possibly layers) so that predictions is done in parallel on multiple images at a time. In the MNIST example, basically prediction (or output) would be done for an image that has 10 digits at a time concatenated by me. For clarity, each 10-image input will have the digits 0-9 appearing only once each. The key here is that each of the 10 digit gets a unique class/label from the CNN/ResNet and each class gets assigned exactly once. And that digits that have high confidence will prevent other digits with lower confidence from using that label (a Hungarian algorithm type of approach).
So in my use case I want to train on concatenated images (not single images) as in Fig A below and force the classifier to learn to predict the best unique label for each of the concatenated images and do this all at once. Such an approach should outperform single image classification - and it's particularly useful for my animal classification because otherwise the CNN can sometimes return the same ID for multiple animals which is impossible in my application.
I can already predict in series as in Fig B below. And indeed looking at the confidence of each prediction I am able to implement a Hungarian-algorithm like approach post-prediction to assign the best (most confident) unique IDs in each batch of 4 animals. But this doesn't always work and I'm wondering if ResNet can try and learn the greedy Hungarian assignment as well.
In particular, it's not clear that implementing A simply requires augmenting the data input and labels in the training set will do it automatically - because I don't know how to penalize or dissalow returning the same label twice for each group of images. So for now I can generate these training datasets like this:
print (train_loader.dataset.data.shape)
print (train_loader.dataset.targets.shape)
torch.Size([60000, 28, 28])
torch.Size([60000])
And I guess I would want the targets to be [60000, 10]. And each input image would be [1, 28, 28, 10]? But I'm not sure what the correct approach would be.
Any advice or available links?
I think this is a specific type of training, but I forgot the name.

Keras regression - Should my first/last layer have an activation function?

I keep seeing examples floating around the internet where the input and/or output layer have either no activation function, a linear activation function, or None. What I'm confused about is when to use one, and how to know if you should? I also am confused about what the number of nodes should be for the input layer.
Right now I have a regression problem, I'm trying to predict a real value based on an array of inputs (about 54). Should I be using relu in my activation function for the input layer? Should I have linear as my output activation? My data is linearly scaled from 0 to 1 for each feature independently as they're different units. I was also unsure of the number of nodes I should use for my input layer as I see some examples pick an arbitrary number not related to their input shape, and other examples saying to specifically set it to the number of inputs, or number of inputs plus one for a bias. But none of the examples so far have explained their reasoning behind their choices.
Since my model isn't performing very well, I thought asking what the architecture should be could help me fine tune it more.

Image Classification using Tensorflow

I am doing transfer-learning/retraining using Tensorflow Inception V3 model. I have 6 labels. A given image can be one single type only, i.e, no multiple class detection is needed. I have three queries:
Which activation function is best for my case? Presently retrain.py file provided by tensorflow uses softmax? What are other methods available? (like sigmoid etc)
Which Optimiser function I should use? (GradientDescent, Adam.. etc)
I want to identify out-of-scope images, i.e. if users inputs a random image, my algorithm should say that it does not belong to the described classes. Presently with 6 classes, it gives one class as a sure output but I do not want that. What are possible solutions for this?
Also, what are the other parameters that we may tweak in tensorflow. My baseline accuracy is 94% and I am looking for something close to 99%.
Since you're doing single label classification, softmax is the best loss function for this, as it maps your final layer logit values to a probability distribution. Sigmoid is used when it's multilabel classification.
It's always better to use a momentum based optimizer compared to vanilla gradient descent. There's a bunch of such modified optimizers like Adam or RMSProp. Experiment with them to see what works best. Adam is probably going to give you the best performance.
You can add an extra label no_class, so your task will now be a 6+1 label classification. You can feed in some random images with no_class as the label. However the distribution of your random images must match the test image distribution, else it won't generalise.

Where do the input filters come from in conv-neural nets (MNIST Example)

I am a newby to the convolutional neural nets... so this may be an ignorant question.
I have followed many examples and tutorials now on the MNIST example in TensforFlow. In the CNN examples, all authors talk bout using the 'input filters' to run in the CNN. But no one that I can find mentions WHERE they come from. Can anyone answer where these come from? Or are they magically obtained from the input images.
Thanks! Chris
This is an image that one professor uses, be he does not exaplain if he made them or TensorFlow auto-extracts these somehow.
Disclaimer: I am not an expert, more of an enthusiast.
To cut a long story short: filters are the CNN equivalent of weights, and all a neural network essentially does is learning their optimal values.
Which it does by iterating through a training dataset, making predictions, comparing them to the label/value already assigned to each training unit (usually an image in case of a CNN) and adjusting weights to minimize the error function (the difference between the predicted value and the actual value).
Initial values of filters/weights do not matter that much, so although they might affect the speed of convergence to a small degree, I believe they are often assigned random values.
It is the job of the neural network to figure out the optimal weights, not of the person implementing it.

Neural Network Always Produces Same/Similar Outputs for Any Input [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 10 months ago.
Improve this question
I have a problem where I am trying to create a neural network for Tic-Tac-Toe. However, for some reason, training the neural network causes it to produce nearly the same output for any given input.
I did take a look at Artificial neural networks benchmark, but my network implementation is built for neurons with the same activation function for each neuron, i.e. no constant neurons.
To make sure the problem wasn't just due to my choice of training set (1218 board states and moves generated by a genetic algorithm), I tried to train the network to reproduce XOR. The logistic activation function was used. Instead of using the derivative, I multiplied the error by output*(1-output) as some sources suggested that this was equivalent to using the derivative. I can put the Haskell source on HPaste, but it's a little embarrassing to look at. The network has 3 layers: the first layer has 2 inputs and 4 outputs, the second has 4 inputs and 1 output, and the third has 1 output. Increasing to 4 neurons in the second layer didn't help, and neither did increasing to 8 outputs in the first layer.
I then calculated errors, network output, bias updates, and the weight updates by hand based on http://hebb.mit.edu/courses/9.641/2002/lectures/lecture04.pdf to make sure there wasn't an error in those parts of the code (there wasn't, but I will probably do it again just to make sure). Because I am using batch training, I did not multiply by x in equation (4) there. I am adding the weight change, though http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-2.html suggests to subtract it instead.
The problem persisted, even in this simplified network. For example, these are the results after 500 epochs of batch training and of incremental training.
Input |Target|Output (Batch) |Output(Incremental)
[1.0,1.0]|[0.0] |[0.5003781562785173]|[0.5009731800870864]
[1.0,0.0]|[1.0] |[0.5003740346965251]|[0.5006347214672715]
[0.0,1.0]|[1.0] |[0.5003734471544522]|[0.500589332376345]
[0.0,0.0]|[0.0] |[0.5003674110937019]|[0.500095157458231]
Subtracting instead of adding produces the same problem, except everything is 0.99 something instead of 0.50 something. 5000 epochs produces the same result, except the batch-trained network returns exactly 0.5 for each case. (Heck, even 10,000 epochs didn't work for batch training.)
Is there anything in general that could produce this behavior?
Also, I looked at the intermediate errors for incremental training, and the although the inputs of the hidden/input layers varied, the error for the output neuron was always +/-0.12. For batch training, the errors were increasing, but extremely slowly and the errors were all extremely small (x10^-7). Different initial random weights and biases made no difference, either.
Note that this is a school project, so hints/guides would be more helpful. Although reinventing the wheel and making my own network (in a language I don't know well!) was a horrible idea, I felt it would be more appropriate for a school project (so I know what's going on...in theory, at least. There doesn't seem to be a computer science teacher at my school).
EDIT: Two layers, an input layer of 2 inputs to 8 outputs, and an output layer of 8 inputs to 1 output, produces much the same results: 0.5+/-0.2 (or so) for each training case. I'm also playing around with pyBrain, seeing if any network structure there will work.
Edit 2: I am using a learning rate of 0.1. Sorry for forgetting about that.
Edit 3: Pybrain's "trainUntilConvergence" doesn't get me a fully trained network, either, but 20000 epochs does, with 16 neurons in the hidden layer. 10000 epochs and 4 neurons, not so much, but close. So, in Haskell, with the input layer having 2 inputs & 2 outputs, hidden layer with 2 inputs and 8 outputs, and output layer with 8 inputs and 1 output...I get the same problem with 10000 epochs. And with 20000 epochs.
Edit 4: I ran the network by hand again based on the MIT PDF above, and the values match, so the code should be correct unless I am misunderstanding those equations.
Some of my source code is at http://hpaste.org/42453/neural_network__not_working; I'm working on cleaning my code somewhat and putting it in a Github (rather than a private Bitbucket) repository.
All of the relevant source code is now at https://github.com/l33tnerd/hsann.
I've had similar problems, but was able to solve by changing these:
Scale down the problem to manageable size. I first tried too many inputs, with too many hidden layer units. Once I scaled down the problem, I could see if the solution to the smaller problem was working. This also works because when it's scaled down, the times to compute the weights drop down significantly, so I can try many different things without waiting.
Make sure you have enough hidden units. This was a major problem for me. I had about 900 inputs connecting to ~10 units in the hidden layer. This was way too small to quickly converge. But also became very slow if I added additional units. Scaling down the number of inputs helped a lot.
Change the activation function and its parameters. I was using tanh at first. I tried other functions: sigmoid, normalized sigmoid, Gaussian, etc.. I also found that changing the function parameters to make the functions steeper or shallower affected how quickly the network converged.
Change learning algorithm parameters. Try different learning rates (0.01 to 0.9). Also try different momentum parameters, if your algo supports it (0.1 to 0.9).
Hope this helps those who find this thread on Google!
So I realise this is extremely late for the original post, but I came across this because I was having a similar problem and none of the reasons posted here cover what was wrong in my case.
I was working on a simple regression problem, but every time I trained the network it would converge to a point where it was giving me the same output (or sometimes a few different outputs) for each input. I played with the learning rate, the number of hidden layers/nodes, the optimization algorithm etc but it made no difference. Even when I looked at a ridiculously simple example, trying to predict the output (1d) of two different inputs (1d):
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
class net(nn.Module):
def __init__(self, obs_size, hidden_size):
super(net, self).__init__()
self.fc = nn.Linear(obs_size, hidden_size)
self.out = nn.Linear(hidden_size, 1)
def forward(self, obs):
h = F.relu(self.fc(obs))
return self.out(h)
inputs = np.array([[0.5],[0.9]])
targets = torch.tensor([3.0, 2.0], dtype=torch.float32)
network = net(1,5)
optimizer = torch.optim.Adam(network.parameters(), lr=0.001)
for i in range(10000):
out = network(torch.tensor(inputs, dtype=torch.float32))
loss = F.mse_loss(out, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("Loss: %f outputs: %f, %f"%(loss.data.numpy(), out.data.numpy()[0], out.data.numpy()[1]))
but STILL it was always outputting the average value of the outputs for both inputs. It turns out the reason is that the dimensions of my outputs and targets were not the same: the targets were Size[2], and the outputs were Size[2,1], and for some reason PyTorch was broadcasting the outputs to be Size[2,2] in the MSE loss, which completely messes everything up. Once I changed:
targets = torch.tensor([3.0, 2.0], dtype=torch.float32)
to
targets = torch.tensor([[3.0], [2.0]], dtype=torch.float32)
It worked as it should. This was obviously done with PyTorch, but I suspect maybe other libraries broadcast variables in the same way.
For me it was happening exactly like in your case, the output of neural network was always the same no matter the training & number of layers etc.
Turns out my back-propagation algorithm had a problem. At one place I was multiplying by -1 where it wasn't required.
There could be another problem like this. The question is how to debug it?
Steps to debug:
Step1 : Write the algorithm such that it can take variable number of input layers and variable number of input & output nodes.
Step2 : Reduce the hidden layers to 0. Reduce input to 2 nodes, output to 1 node.
Step3 : Now train for binary-OR-Operation.
Step4 : If it converges correctly, go to Step 8.
Step5 : If it doesn't converge, train it only for 1 training sample
Step6 : Print all the forward and prognostication variables (weights, node-outputs, deltas etc)
Step7 : Take pen&paper and calculate all the variables manually.
Step8 : Cross verify the values with algorithm.
Step9 : If you don't find any problem with 0 hidden layers. Increase hidden layer size to 1. Repeat step 5,6,7,8
It sounds like a lot of work, but it works very well IMHO.
I know, that for the original post this is far, too late but maybe I can help someone with this, as I faced the same problem.
For me the problem was, that my input data had missing values in important columns, where the training/test data were not missing. I replaced these values with zero values and voilĂ , suddenly the results were plausible. So maybe check your data, maybe it si misrepresented
It's hard to tell without seeing a code sample but it is possible occure for a net because its number of hidden neron.with incresing in number of neron and number of hiden layer it is not possible to train a net with small set of training data.until it is possible to make a net with smaller layer and nerons it is amiss to use a larger net.therefore perhaps your problem solved with attention to this matters.
I haven't tested it with the XOR problem in the question, but for my original dataset based on Tic-Tac-Toe, I believe that I have gotten the network to train somewhat (I only ran 1000 epochs, which wasn't enough): the quickpropagation network can win/tie over half of its games; backpropagation can get about 41%. The problems came down to implementation errors (small ones) and not understanding the difference between the error derivative (which is per-weight) and the error for each neuron, which I did not pick up on in my research. #darkcanuck's answer about training the bias similarly to a weight would probably have helped, though I didn't implement it. I also rewrote my code in Python so that I could more easily hack with it. Therefore, although I haven't gotten the network to match the minimax algorithm's efficiency, I believe that I have managed to solve the problem.
I faced a similar issue earlier when my data was not properly normalized. Once I normalized the data everything ran correctly.
Recently, I faced this issue again and after debugging, I found that there can be another reason for neural networks giving the same output. If you have a neural network that has a weight decay term such as that in the RSNNS package, make sure that your decay term is not so large that all weights go to essentially 0.
I was using the caret package for in R. Initially, I was using a decay hyperparameter = 0.01. When I looked at the diagnostics, I saw that the RMSE was being calculated for each fold (of cross validation), but the Rsquared was always NA. In this case all predictions were coming out to the same value.
Once I reduced the decay to a much lower value (1E-5 and lower), I got the expected results.
I hope this helps.
I was running into the same problem with my model when number of layers is large. I was using a learning rate of 0.0001. When I lower the learning rate to 0.0000001 the problem seems solved. I think algorithms stuck on local minumums when learning rate is too low
It's hard to tell without seeing a code sample, but a bias bug can have that effect (e.g. forgetting to add the bias to the input), so I would take a closer look at that part of the code.
Based on your comments, I'd agree with #finnw that you have a bias problem. You should treat the bias as a constant "1" (or -1 if you prefer) input to each neuron. Each neuron will also have its own weight for the bias, so a neuron's output should be the sum of the weighted inputs, plus the bias times its weight, passed through the activation function. Bias weights are updated during training just like the other weights.
Fausett's "Fundamentals of Neural Networks" (p.300) has an XOR example using binary inputs and a network with 2 inputs, 1 hidden layer of 4 neurons and one output neuron. Weights are randomly initialized between +0.5 and -0.5. With a learning rate of 0.02 the example network converges after about 3000 epochs. You should be able to get a result in the same ballpark if you get the bias problems (and any other bugs) ironed out.
Also note that you cannot solve the XOR problem without a hidden layer in your network.
I encountered a similar issue, I found out that it was a problem with how my weights were being generated.
I was using:
w = numpy.random.rand(layers[i], layers[i+1])
This generated a random weight between 0 and 1.
The problem was solved when I used randn() instead:
w = numpy.random.randn(layers[i], layers[i+1])
This generates negative weights, which helped my outputs become more varied.
I ran into this exact issue. I was predicting 6 rows of data with 1200+ columns using nnet.
Each column would return a different prediction but all of the rows in that column would be the same value.
I got around this by increasing the size parameter significantly. I increased it from 1-5 to 11+.
I have also heard that decreasing your decay rate can help.
I've had similar problems with machine learning algorithms and when I looked at the code I found random generators that were not really random. If you do not use a new random seed (such Unix time for example, see http://en.wikipedia.org/wiki/Unix_time) then it is possible to get the exact same results over and over again.

Resources