Why does the sigmoid function outperform tanh and softmax in this case? - keras

The sigmoid function gives better results than tanh or softmax for the neural network below.
If I change the activation function from sigmoid to tanh or softmax, the error increases and the accuracy decreases, although I have learned that tanh and softmax are supposed to be better than sigmoid. Could someone help me understand this?
The datasets I used are Iris and the Pima Indians Diabetes Database. I used TensorFlow 1.5 and Keras 2.2.4.
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
import numpy as np
dataset = np.genfromtxt('diabetes.csv', dtype=float, delimiter=',')
X = dataset[1:, 0:8]
Y = dataset[1:, 8]
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=0.2, random_state=42)
model = Sequential()
model.add(Dense(10, input_dim=8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(xtrain, ytrain, epochs=50, batch_size=20)
print(model.metrics_names)
print(model.evaluate(xtest, ytest))

The value range of tanh is between -1 and 1, but that is not necessarily a problem. By learning suitable weights and a bias, a tanh unit can still fit the target range [0, 1]. Therefore both sigmoid and tanh can be used here; only softmax is impossible, for the reason mentioned in the other answer (a single-unit softmax always outputs a constant 1). See the code below:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# inputs in [0, 1] with binary 0/1 targets, fit with a single tanh unit
X = np.hstack((np.linspace(0, 0.45, num=50), np.linspace(0.55, 1, num=50)))
Y = (X > 0.5).astype('float')
model = Sequential()
model.add(Dense(1, input_dim=1, activation='tanh'))
model.compile(loss='binary_crossentropy', optimizer='SGD', metrics=['accuracy'])
model.fit(X, Y, epochs=100)
print(model.evaluate(X, Y, verbose=False))
Whenever someone says you should always prefer foo over bar in machine learning, it is probably an oversimplification. There are anti-patterns one can warn people about, things that never work, like the softmax in the example above. If the rest were that simple, AutoML would be a very boring field of research ;) . PS: I am not actually working on AutoML.

The softmax activation function is generally used for categorical outputs. This is because softmax squashes the outputs into the range (0, 1) such that they always sum to 1. If your output layer has only one unit/neuron, it will therefore always output a constant 1.
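As a quick sanity check (a minimal NumPy sketch, not part of the original answer), softmax over a single-element vector always evaluates to exactly 1:
import numpy as np
def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)
# three "samples", each with a single output unit: softmax collapses to 1
print(softmax(np.array([[0.3], [-2.0], [5.7]])))  # -> [[1.] [1.] [1.]]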
Tanh, the hyperbolic tangent, is a logistic-type function that maps its outputs to the range (-1, 1). Tanh can be used for binary classification between two classes; when using tanh, remember to label the data accordingly, i.e. with targets in [-1, 1].
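For example (an illustrative one-liner, not from the original answer), 0/1 labels can be mapped onto tanh's output range like this:
import numpy as np
y = np.array([0, 1, 1, 0])
y_tanh = 2 * y - 1  # -> [-1, 1, 1, -1], matching tanh's (-1, 1) range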
The sigmoid is another logistic function like tanh, but it maps any real input to the range (0, 1). This makes sigmoid a good choice for predicting a probability for something.
So, all in all, the output activation function is usually not a knob you turn for model performance; it depends on the task and the network architecture you are working with.

Related

Minimizing and maximizing the loss

I would like to train an autoencoder in such a way that the reconstruction error will be low on some observations, and high on the others.
from keras.models import Sequential
from keras.layers import Dense
import keras.backend as K
# mean absolute error written as a custom Keras loss
def l1Loss(y_true, y_pred):
    return K.mean(K.abs(y_true - y_pred))
model = Sequential()
model.add(Dense(5, input_dim=10, activation='relu'))
model.add(Dense(10, activation='sigmoid'))
model.compile(optimizer='adam', loss=l1Loss)
for i in range(1000):
    model.train_on_batch(x_good, x_good)  # minimize the reconstruction error on these
    model.train_on_batch(x_bad, x_bad, ???)  # need to maximize this part, so that mse(x_bad, x_bad_reconstructed) is high
I saw something about replacing ??? with sample_weight=-np.ones(batch_size), but I have no idea whether this fits my goal.
If you set the sample weights to negative numbers, then minimizing the weighted loss will in fact maximize the (unweighted) reconstruction error on those samples.
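A minimal sketch of that idea (with hypothetical stand-in data; the model mirrors the one in the question, using the built-in 'mae' loss instead of the custom l1Loss):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
# hypothetical batches standing in for the real observations
x_good = np.random.rand(32, 10)
x_bad = np.random.rand(32, 10)
model = Sequential()
model.add(Dense(5, input_dim=10, activation='relu'))
model.add(Dense(10, activation='sigmoid'))
model.compile(optimizer='adam', loss='mae')
for i in range(1000):
    model.train_on_batch(x_good, x_good)  # positive weights: push reconstruction error down
    model.train_on_batch(x_bad, x_bad, sample_weight=-np.ones(len(x_bad)))  # negative weights: push it up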

How do I calculate the accuracy of my ANN in this case

I am running the following code. I want to calculate the accuracy of my ANN for the test data. I am using the Windows platform, Python 3.5.
import numpy
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
dataset=pd.read_csv('main.csv')
dataset=dataset.fillna(0)
X=dataset.iloc[:, 0:6].values
#X = X[numpy.logical_not(numpy.isnan(X))]
y=dataset.iloc[:, 6:8].values
#y = y[numpy.logical_not(numpy.isnan(y))]
#regr = LinearRegression()
#regr.fit(numpy.transpose(numpy.matrix(X)), numpy.transpose(numpy.matrix(y)))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=0.24,random_state=0)
# create model
model = Sequential()
model.add(Dense(4, input_dim=6, kernel_initializer='normal', activation='relu'))
model.add(Dense(4, kernel_initializer='normal', activation='relu'))
model.add(Dense(2, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, batch_size=5, epochs=5)
y_pred=model.predict(X_test)
Now, I want to calculate the accuracy of y_pred. Any help will be appreciated.
The above code is self-explanatory. I am currently using only 5 epochs, just for experimenting.
Keras already implements metrics such as accuracy, so you just need to change the model.compile line to:
model.compile(loss='mean_squared_error', optimizer='adam',
              metrics=['accuracy'])
Then training and validation accuracy (in the [0, 1] range) will be shown in the progress bar during training, and you can also compute the accuracy with model.evaluate, which returns a tuple of loss and metrics (accuracy in this case).
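For example (a minimal usage sketch; with a single compiled metric, evaluate returns the loss followed by the accuracy):
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print('test loss: %.4f, test accuracy: %.4f' % (loss, acc))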
Besides the suggestion of using Keras, you can compute the accuracy using scikit-learn as follows:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
For more information, check the documentation: sklearn.metrics.accuracy_score
Although in a narrow technical sense both answers already provided are correct, there is a more general issue with your question which affects the essence of it: are you in a regression or a classification context?
If you are in a regression context (as implied by your loss='mean_squared_error' and the linear activation in your output layer), then the simple augmentation of model compilation
model.compile(loss='mean_squared_error', optimizer='adam',
              metrics=['accuracy'])
will, as Matias says, provide the accuracy. Nevertheless, accuracy is meaningless in a regression setting; see the answer & discussion here for more details.
If you are in a classification context (as implied by your wish to calculate the accuracy, which is meaningful only in classification), then your loss function should not be the MSE but the cross-entropy, and the activation of your last layer should not be linear.
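For illustration, a classification setup would look roughly like this (a hedged sketch, not your original architecture; it assumes your two output columns are one-hot encoded class labels):
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(4, input_dim=6, activation='relu'))
model.add(Dense(4, activation='relu'))
# softmax output for two mutually exclusive classes, one-hot encoded labels
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])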
To compute the accuracy you can also use the model.evaluate function, which returns the loss together with any compiled metrics.

How to use LSTM to predict values from a different range than it was trained on

I am trying to train a simple LSTM to predict the next number in a sequence (1, 2, 3, 4, 5 --> 6).
from keras.models import Sequential
from keras.layers import LSTM, Dense
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
xs = [[[(j+i)/100] for j in range(5)] for i in range(100)]
ys = [(i+5)/100 for i in range(100)]
x_train, x_test, y_train, y_test = train_test_split(xs, ys)
model = Sequential()
model.add(LSTM(1, input_shape=(5,1), return_sequences=True))
model.add(LSTM(1, return_sequences=False))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam', metrics=['accuracy'])
training = model.fit(x_train, y_train, epochs=200)
new_xs = np.array(xs)*5
new_ys = np.array(ys)*5
pred = model.predict(new_xs)
plt.scatter(range(len(pred)), pred, c='r')
plt.scatter(range(len(new_ys)), new_ys, c='b')
In order for the net to learn anything I had to normalize the training data (I divided it by 100). It did indeed work for data from the range it was trained on.
I want it to be able to predict numbers from outside the range it was trained on, but as soon as it leaves that range, the predictions start to diverge:
When I increased the number of units in both LSTM layers to 30 it looks a little better, but it is still diverging:
Is LSTM capable of learning that task without adding an infinite number of units?
I was looking for an answer to the same question, and I came across the following paper from 2019 and the corresponding Git repo. In particular, see section 5.3 in the paper. It seems like ABBA-LSTM is the solution, though it depends on the time series problem you're trying to solve.

How to build an LSTM RNN network for binary classification?

I am trying to build a deep learning network for binary classification using an LSTM-based RNN.
Here is what I have tried using Python:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM
import numpy as np
train = np.loadtxt("TrainDatasetFinal.txt", delimiter=",")
test = np.loadtxt("testDatasetFinal.txt", delimiter=",")
y_train = train[:,7]
y_test = test[:,7]
train_spec = train[:,6]
test_spec = test[:,6]
model = Sequential()
model.add(Embedding(8, 256, input_length=1))
model.add(LSTM(output_dim=128, activation='sigmoid',
               inner_activation='hard_sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(train_spec, y_train, batch_size=2000, nb_epoch=11)
score = model.evaluate(test_spec, y_test, batch_size=2000)
Here is a sample from the dataset
(Patient Number, time in milliseconds, accelerometer x-axis, y-axis, z-axis, magnitude, spectrogram, label (0 or 1))
1,15,70,39,-970,947321,596768455815000,0
1,31,70,39,-970,947321,612882670787000,0
1,46,60,49,-960,927601,602179976392000,0
1,62,60,49,-960,927601,808020878060000,0
1,78,50,39,-960,925621,726154800929000,0
I believe that my problem is in these lines, but I cannot recognize the error:
model.add(Embedding(8, 256, input_length=1))
model.add(LSTM(output_dim=128, activation='sigmoid',
               inner_activation='hard_sigmoid'))
and this is the error I got:
InvalidArgumentError (see above for traceback): indices[0,0] = -2147483648 is not in [0, 8)
Is the sample from your dataset provided above the data you are trying to feed into the model? If so, there is a problem, because your data is 2-dimensional, but for an RNN you need a 3-dimensional input tensor: a feature dimension, a batch-size dimension and a time dimension. It looks like you are missing a proper time dimension. You should not have a column with 15, 31, 46, ... (time in milliseconds); this should be shaped into its own dimension, so your input data looks like a "cube". Otherwise, you don't need a temporal model at all.
Furthermore, you should standardize your input, since your features have vastly different orders of magnitude. Moreover, a batch size of 2000 is almost certainly too large. Are you trying to express that your whole training set has 2000 samples? In that case, you may not have enough training data for the model you are building.
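To illustrate the required input shape (a hedged sketch with made-up dimensions, not your actual preprocessing), an LSTM expects a tensor of shape (samples, timesteps, features), e.g. windows of consecutive sensor readings per patient:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
# hypothetical: 100 windows of 50 consecutive samples, 5 standardized sensor features each
n_windows, timesteps, n_features = 100, 50, 5
X = np.random.rand(n_windows, timesteps, n_features)  # shape (100, 50, 5)
y = np.random.randint(0, 2, size=(n_windows, 1))      # one 0/1 label per window
model = Sequential()
model.add(LSTM(128, input_shape=(timesteps, n_features)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X, y, batch_size=32, epochs=5)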

Keras and Sklearn logreg returning different results

I'm comparing the results of a logistic regressor written in Keras to the default sklearn LogisticRegression. My input is one-dimensional. My output has two classes and I'm interested in the probability that the output belongs to class 1.
I'm expecting the results to be almost identical, but they are not even close.
Here is how I generate my random data. Note that X_train and X_test are still vectors; I'm just using capital letters because I'm used to it. Also, there is no need for scaling in this case.
import numpy as np
X = np.linspace(0, 1, 10000)
y = np.random.sample(X.shape)
y = np.where(y < X, 1, 0)
Here's the cumulative sum of y plotted over X. Doing a regression here is not rocket science.
I do a standard train-test-split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
Next, I train a default logistic regressor:
from sklearn.linear_model import LogisticRegression
sk_lr = LogisticRegression()
sk_lr.fit(X_train, y_train)
sklearn_logreg_result = sk_lr.predict_proba(X_test)[:,1]
And a logistic regressor that I write in Keras:
from keras.models import Sequential
from keras.layers import Dense
keras_lr = Sequential()
keras_lr.add(Dense(1, activation='sigmoid', input_dim=1))
keras_lr.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])
_ = keras_lr.fit(X_train, y_train, verbose=0)
keras_lr_result = keras_lr.predict(X_test)[:,0]
And a hand-made solution:
pearson_corr = np.corrcoef(X_train.reshape(X_train.shape[0],), y_train)[0,1]
b = pearson_corr * np.std(y_train) / np.std(X_train)
a = np.mean(y_train) - b * np.mean(X_train)
handmade_result = (a + b * X_test)[:,0]
I expect all three to deliver similar results, but here is what happens. This is a reliability diagram using 100 bins.
I have played around with loss functions and other parameters, but the Keras logreg stays roughly like this. What might be causing the problem here?
Edit: Using binary cross-entropy is not the solution here, as shown by this plot (note that the input data has changed between the two plots).
While both implementations are a form of logistic regression, there are quite a few differences. Although both solutions converge to a comparable minimum (0.75/0.76 accuracy), they are not identical:
Optimizer - Keras uses vanilla SGD, whereas sklearn's LogisticRegression is based on liblinear, which implements a trust-region Newton method.
Regularization - sklearn has built-in L2 regularization (the Keras model as written has none).
Weights - the weights are randomly initialized and probably sampled from a different distribution.
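As a rough sketch of narrowing those differences (hedged: the exact mapping of sklearn's C onto a Keras penalty depends on how the loss is averaged, and this is not a drop-in fix), one could add L2 regularization, use log loss, and train long enough for SGD to converge:
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2
# approximate sklearn's default C=1.0 as a per-weight L2 penalty on the mean loss
n_samples = X_train.shape[0]
keras_lr = Sequential()
keras_lr.add(Dense(1, activation='sigmoid', input_dim=1,
                   kernel_regularizer=l2(1.0 / (2 * n_samples))))
keras_lr.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
keras_lr.fit(X_train, y_train, epochs=500, verbose=0)  # many more passes than fit's default of 1 epoch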
