How does ReLU work with a zero-centered output domain? - keras

In the problem I am trying to solve, my output domain is zero-centered, between -1 and 1. When looking up activation functions, I noticed that ReLU outputs only non-negative values, which basically means that your output is all positive or all negative.
This can be mapped back to the appropriate domain through inverse normalization, but ReLU is designed to determine the "strength" of a neuron in a single direction, whereas in my problem I need to determine the strength of a neuron in one of two directions. If I use tanh, I have to worry about vanishing/exploding gradients, but if I use ReLU, my output will always be "biased" towards positive or negative values, because really small values would have to be mapped to a positive domain and large values to a negative domain, or vice versa.
Other info: I've used ReLU and it works well, but I fear it works for the wrong reasons. I say this because, for either the positive or negative half of the domain, approaching smaller values seems to mean a stronger connection up to a point, after which the neuron is not activated at all. Yes, the network can technically work (probably harder than it needs to) to keep the entire domain of training outputs in the positive space, but if a value happens to exceed the bounds of the training set, it will be cut off entirely, when in reality it should be even more active.
What is the appropriate way to deal with zero-centered output domains?

I think you have to use the sign function. It's zero-centered and has -1 and 1 as its outputs.
Sign function:
https://helloacm.com/wp-content/uploads/2016/10/math-sgn-function-in-cpp.jpg
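For reference, a minimal sketch of the sign function in NumPy (note that its gradient is zero almost everywhere, which limits its use as a trainable activation):

import numpy as np

x = np.array([-2.5, -0.1, 0.0, 0.3, 4.0])
print(np.sign(x))  # [-1. -1.  0.  1.  1.]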

You could go with variations of ReLU whose outputs have a mean closer to zero, or equal to zero (ELU, CELU, PReLU and others), each with its own interesting traits. Furthermore, this would help with the dying-neuron problem in ReLU.
Anyway, I'm not aware of any hard research proving the usefulness of one over the other; it is still experimental and really problem-dependent from what I recall (please correct me if I'm wrong).
And you should really check whether the activation function is actually problematic in your case; it might be totally fine to go with ReLU.
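For reference, swapping in one of these variants is a one-line change in Keras; a minimal sketch (layer size is a placeholder):

from tensorflow import keras

# ELU outputs negative values for negative inputs, pulling the mean
# activation toward zero; swapping it in is a one-line change:
hidden = keras.layers.Dense(64, activation="elu")
# PReLU instead learns the slope of the negative part as a parameter,
# so it is added as its own layer:
prelu = keras.layers.PReLU()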

First, you don't have to put an activation function after the last layer in your neural network. An activation function is needed between layers to introduce non-linearity, so it's not required in the last layer.
You're free to experiment with various options:
Use tanh. Vanishing/exploding gradients are sometimes not a problem in practice, depending on the network architecture and whether you initialize the weights properly.
Do nothing. The NN should be trained to output values between -1 and 1 for "typical" inputs. You can clip the value in the application layer.
Clip the output in the network. E.g. out = tf.clip_by_value(out, -1.0, 1.0)
Be creative and try your other ideas.
In the end, ML is a process of trial and error. Try different things and find something that works for you. Good luck.
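A minimal sketch of the tanh and in-network clipping options in Keras (the layer sizes and input shape are placeholders):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    # Option: tanh squashes the output into (-1, 1) directly:
    # keras.layers.Dense(1, activation="tanh"),
    # Option: linear output, clipped to [-1, 1] inside the network:
    keras.layers.Dense(1, activation="linear"),
    keras.layers.Lambda(lambda t: tf.clip_by_value(t, -1.0, 1.0)),
])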

Related

Keras loss functions: how to round?

I'm trying to recognize turning points in sequences: the points after which some process behaves differently. I use a Keras model to do this. The input is a sequence (always the same length), and the output should be 0 before the turning point and 1 after it.
I want the loss function to depend on the distance between the actual turning point and the predicted turning point.
I tried rounding (to obtain the label 0 or 1), followed by summing the total number of 1's to get the "index" of the turning point. The assumption here is that the model gives just one turning point, as the (synthetically produced) data also has just one. What I tried:
from keras import backend as K

def dist_loss(yTrue, yPred):
    turningPointTrue = K.sum(yTrue)
    turningPointPred = K.sum(K.round(yPred))
    return K.abs(turningPointTrue - turningPointPred)
This does not work, the following error is given:
ValueError: An operation has None for gradient. Please make sure
that all of your ops have a gradient defined (i.e. are
differentiable). Common ops without gradient: K.argmax, K.round,
K.eval.
I think this means that K.round(yPred) gives a single value instead of a vector/tensor. Does anyone know how to solve this issue?
The round operation has no defined gradient, so it cannot be used inside a loss function: training a neural network requires computing the gradient of the loss with respect to the weights, which implies that every part of the network and loss must be differentiable (or have a differentiable approximation available).
In your case you should try to find a differentiable approximation to round, though unfortunately I don't know of one offhand. One example of such an approximation is the softmax function as a smooth stand-in for the max function.
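One possible direction (my own sketch, not a standard recipe): replace round with a steep sigmoid, which behaves like a step on values in [0, 1] but keeps a nonzero gradient everywhere; the sharpness constant is a hypothetical knob to tune:

from keras import backend as K

def soft_round(x, sharpness=10.0):
    # A steep sigmoid centred at 0.5 acts like round() on values in [0, 1]
    # while staying differentiable; sharpness is a hypothetical parameter.
    return K.sigmoid(sharpness * (x - 0.5))

def dist_loss(yTrue, yPred):
    turningPointTrue = K.sum(yTrue)
    turningPointPred = K.sum(soft_round(yPred))
    return K.abs(turningPointTrue - turningPointPred)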

Questions about standardizing and scaling

I am trying to generate a model that uses several physico-chemical properties of a molecule (including number of atoms, number of rings, volume, etc.) to predict a numeric value Y. I would like to use PLS regression, and I understand that standardization is very important here. I am programming in Python, using scikit-learn. The type and range of the features vary: some are int64 while others are float, and some features generally have small (positive or negative) values while others have very large values. I have tried various scalers (e.g. standard scaler, normalizer, min-max scaler), yet the R2/Q2 are still low. I have a few questions:
Is it possible that by scaling, some of the very important features lose their significance, and thus contribute less to explaining the variance of the response variable?
If yes, and if I identify some important features (by expert knowledge), is it OK to scale all features except those? Or to scale only the important features?
Some of the features, although not always correlated, have values in a similar range (e.g. 100-400) compared to others (e.g. -1 to 10). Is it possible to scale only a specific group of features that share the same range?
The whole idea of scaling is to make models more robust to analysis in feature space. For example, if you have two features recorded as 5 kg and 5000 g, we know both are the same, but algorithms that are sensitive to distances in feature space, such as KNN and PCA, will be weighted towards the second feature, so scaling must be done for those algorithms.
Now, coming to your questions:
Scaling doesn't affect the significance of features. As explained above, it helps in better analysis of the data.
No, you should not; the reason is explained above.
If you want to include domain knowledge in your model, you can use it as prior information; for a linear model, this is the same as regularization. If you think you have many useless features, you can use L1 regularization, which creates a sparsity effect on the feature space, which is nothing but assigning 0 weight to the useless features.
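As a rough sketch of that last point, with scikit-learn's Lasso (X, y and the alpha value are placeholders):

from sklearn.linear_model import Lasso

# L1 regularization drives coefficients of uninformative features to exactly 0.
# alpha is a hypothetical sparsity knob, tuned by cross-validation.
model = Lasso(alpha=0.1).fit(X, y)  # X, y: placeholder training data
print(model.coef_)  # zero entries mark the features the model dropped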
One more point: some methods, such as tree-based models, don't need scaling at all. In the end, it mostly depends on the model you choose.
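To make the scaling advice concrete for the asker's PLS setup, a minimal sketch with scikit-learn; the number of components and the train/test splits are placeholders:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression

# Putting the scaler inside the pipeline ensures it is fit only on
# training data, e.g. within each cross-validation fold.
model = make_pipeline(StandardScaler(), PLSRegression(n_components=5))
model.fit(X_train, y_train)          # placeholder training data
print(model.score(X_test, y_test))   # R2 on held-out data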
Lose significance? Yes. Contribute less? No.
No, it's not OK. It's either all or nothing.
No. The idea of scaling is not to decrease / increase significance / effect of a variable. It's to transform all variables to a common scale that can be interpreted.

Audio signal source separation with neural network

What I am trying to do is separate the audio sources and extract their pitch from the raw signal.
I modeled this process myself, as represented below:
Each source oscillates in its normal modes, so its component peaks often occur at integer multiples of a fundamental frequency; these are known as harmonics. The sources are then resonated and finally combined linearly.
As seen above, I've got many hints in the frequency-response pattern of audio signals, but almost no idea how to 'separate' them. I've tried countless models of my own. This is one of them (a rough sketch of the first two steps follows the list):
FFT the PCM
Get peak frequency bins and amplitudes.
Calculate pitch candidate frequency bins.
For each pitch candidate, analyze all the peaks using a recurrent neural network and find the appropriate combination of peaks.
Separate analyzed pitch candidates.
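For concreteness, a minimal sketch of steps 1-2 (FFT and peak picking) with NumPy/SciPy; the frame length and prominence threshold are guesses to tune:

import numpy as np
from scipy.signal import find_peaks

def peak_bins(pcm, sample_rate, n_fft=4096):
    # Magnitude spectrum of one windowed frame of PCM samples.
    frame = pcm[:n_fft] * np.hanning(n_fft)
    spectrum = np.abs(np.fft.rfft(frame))
    # Pick prominent peaks; the prominence threshold is a guess to tune.
    peaks, _ = find_peaks(spectrum, prominence=spectrum.max() * 0.05)
    freqs = peaks * sample_rate / n_fft  # bin k corresponds to k * fs / n_fft
    return freqs, spectrum[peaks]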
Unfortunately, none of them has successfully separated the signal so far.
I would welcome any advice on solving this kind of problem, especially on modeling source separation the way I have above.
Because no one has really attempted to answer this, and because you've tagged it neural-network, I'm going to address the suitability of a neural network for this kind of problem. As the question is somewhat non-technical, this answer will also be "high level".
Neural networks require some sort of sample set from which to learn. In order to "teach" a neural net to solve this problem, you would essentially need a working set of known solutions to train from. Do you have this? If so, read on. If not, a neural net is probably not what you are seeking. You stated that you have "many hints" but no real solution, which leads me to believe you probably don't have sample sets. If you can get them, great; otherwise you might be out of luck.
Supposing now that you have a sample set of Raw Signal samples and corresponding Source 1 and Source 2 outputs... Well, now you're going to need a method for deciding on a topology. Assuming you don't know a lot about how neural nets work (and don't want to), and assuming you also don't know the exact degree of complexity of the problem, I would probably recommend the open source NEAT package to get you started. I am not affiliated in any way with this project, but I have used it, and it allows you to (relatively) intelligently evolve neural network topologies to fit the problem.
Now, in terms of how a neural net would solve this specific problem: the first thing that comes to mind is that all audio signals are essentially time series. That is to say, the information they convey is dependent on and related to the data at previous timesteps (e.g. the detection of some waveform cannot be done from a single time point; it requires information about previous timesteps as well). Again, there are a million ways of solving this problem, but since I'm already recommending NEAT, I'd suggest you take a look at the C++ NEAT Time Series mod.
If you're going down this route, you'll probably be wanting to use some sort of sliding window to provide information about the recent past at each time step. For a quick and dirty intro to sliding windows, check out this SO question:
Time Series Prediction via Neural Networks
The size of the sliding window can be important, especially if you're not using recurrent neural nets. Recurrent networks allow neural nets to remember previous time steps (at the cost of performance; NEAT is already recurrent, so that choice is made for you here). You will probably want the sliding-window length (i.e. the number of past timesteps provided at each time step) to be roughly equal to a conservative guess of the largest number of previous timesteps required to gain enough information to split your waveform.
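As an illustration only (in NumPy, an assumption; NEAT itself would consume these windows through its own API), building sliding-window inputs from a 1-D signal could look like this, where window_len is the conservative guess described above:

import numpy as np

def sliding_windows(signal, window_len):
    # Row t holds samples [t, t + window_len); each row is the network's
    # input at the time step that ends that window.
    n = len(signal) - window_len + 1
    return np.stack([signal[t:t + window_len] for t in range(n)])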
I'd say this is probably enough information to get you started.
When it comes to deciding how to provide the neural net with the data, you'll first want to normalise the input signals (consider a sigmoid function) and experiment with different transfer functions (sigmoid would probably be a good starting point).
I would imagine you'll want two output neurons providing normalised amplitude (which you would denormalise via the inverse of the sigmoid function), representing Source 1 and Source 2 respectively. The fitness value (the way you judge each tested network's ability to solve the problem) would be something along the lines of the negative of the RMS error of the output signal against the actual known signal (i.e. tested against the samples I referred to earlier, which you will need to procure).
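For concreteness, a minimal sketch of that fitness measure in NumPy, assuming the signals are plain arrays of denormalised amplitudes:

import numpy as np

def fitness(predicted, actual):
    # Negative RMS error: the closer the network's output signal is to the
    # known source signal, the higher (less negative) the fitness.
    diff = np.asarray(predicted) - np.asarray(actual)
    return -np.sqrt(np.mean(diff ** 2))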
Suffice to say, this will not be a trivial operation, but it could work if you have enough samples to train the network against. What is a good number of samples? As a rule of thumb, roughly a number large enough that a simple polynomial function of order N (where N is the number of neurons in the neural network required to solve the problem) cannot fit all of the samples accurately. This is basically to ensure you are not simply overfitting the problem, which is a serious issue with neural networks.
I hope this has been helpful! Best of luck.
Additional note: your work to date wouldn't have been in vain if you go down this route. A neural network is likely to benefit from additional "help" in the form of FFTs and other signal-modelling "inputs", so you might want to consider taking the signal processing you have already done, organising it into an analog, continuous representation, and feeding it as an input alongside the raw signal.

Statistics Dummy Variable as Dependent Variable Regression

I have a bunch of independent variables (height, weight, etc.) that I want to regress a dummy variable on. For instance, if I wanted to regress diabetes (0 if the patient doesn't have diabetes, 1 if the patient does) and figure out the effect of an increase of 1 pound of weight on the probability of having diabetes, how would I do that? I'm sure there are multiple ways of doing it, but I've just never heard of a model that does this. I thought it was the probit model, but I'm not sure. Any thoughts?
The problem you are describing is known as logistic regression; a web search for that should turn up a lot of hits. Most commonly, the response is some function of a linear combination of inputs, but more generally, the response could be a nonlinear function of inputs.
The dependence of the response on an input (e.g. weight) is interesting but not exactly well-posed, since the change in the probability of the response varies over the range of the input variable: the change is very small for very large or very small values of the input, and reaches some maximum in between.
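To make this concrete, a minimal sketch using scikit-learn's LogisticRegression; X, y, and the column index of weight are hypothetical placeholders:

import numpy as np
from sklearn.linear_model import LogisticRegression

# X holds columns such as height and weight; y is the 0/1 diabetes label.
model = LogisticRegression().fit(X, y)

p = model.predict_proba(X)[:, 1]   # fitted probability for each patient
beta_weight = model.coef_[0][1]    # assuming weight is column 1 of X
# For logistic regression, dp/d(weight) = beta * p * (1 - p), which is
# largest near p = 0.5 and tiny at the extremes, as described above:
marginal_effect = beta_weight * p * (1 - p)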

How do I interpret an incorrect result?

I have been using libsvm. It produces some good results (95% on positives, 94% on negatives). When I examine the cases it gets wrong, however, I am confused about why it got them wrong. How do I determine what it is doing wrong? (More importantly, how do I explain it to my boss?) Some of the test inputs it gets wrong are visually very close to some of the test inputs it gets right.
About my problem: I am looking at images, 32x32 pixels, 8-bit greyscale. I am evaluating different feature detectors and using them as a dense representation (i.e. at every pixel) of the image. Hence, my feature length is often 1024; some of the feature detectors have multiple outputs, and sometimes I use not every pixel but every 3rd or 5th, etc. It is a binary classification task, looking for figures in the image; for example, I am trying to find a square, with various letters as negatives. The SVM does well, but sometimes it will classify a T as a square, and I don't know why. If I'm using probabilities, the probability is sometimes quite high. What can I do to get insight into what it is doing and why?
