I'm currently trying to use the scikit-learn package for its neural network functionality. I have a complex problem to solve with it, but to start out I am just trying a couple of basic tests to familiarize myself with it. I have gotten it to do something, but it isn't producing meaningful results. My code:
import sklearn.neural_network.multilayer_perceptron as nnet
import numpy
def generateTargetDataset(expression="%s", generateRange=(-100,100), s=1000):
    expression = expression.replace("x", "%s")
    x = numpy.random.rand(s,)
    y = numpy.zeros((s,), dtype="float")
    numpy.multiply(x, abs(generateRange[1]-generateRange[0]), x)
    numpy.subtract(x, min(generateRange), x)
    for z in range(0, numpy.size(x)):
        y[z] = eval(expression % (x[z]))
    x = x.reshape(-1, 1)
    outTuple = (x, y)
    return(outTuple)
print("New Net + Training")
QuadRegressor = nnet.MLPRegressor(hidden_layer_sizes=(10), warm_start=True, verbose=True, learning_rate_init=0.00001, max_iter=10000, algorithm="sgd", tol=0.000001)
data = generateTargetDataset(expression="x**2", s=10000, generateRange=(-1,1))
QuadRegressor.fit(data[0], data[1])
print("Net Trained")
xt = numpy.random.rand(10000, 1)
yr = QuadRegressor.predict(xt)
yr = yr.reshape(-1, 1)
xt = xt.reshape(-1, 1)
numpy.multiply(xt, 100, xt)
numpy.multiply(yr, 10000, yr)
numpy.around(yr, 2, out=yr)
numpy.around(xt, 2, out=xt)
out = numpy.concatenate((xt, yr), axis=1)
numpy.set_printoptions(precision=4)
numpy.savetxt(fname="C:\\SCRATCHDIR\\numpydump.csv", X=out, delimiter=",")
I don't understand how to post the data it gives me, but it spits out between 7000 and 10000 for all inputs between 0 and 100. It seems to be correctly mapped very close to the top of the range, but for inputs close to 0, it just returns something near 7000.
EDIT: I forgot to add this. The network has the same behavior if I remove the dummy training to y=x, but I read somewhere that sometimes you can help a network along by training it to a different but closer function and then using that already weighted network as a starting ground. It didn't work but I just hadn't taken that bit out yet.
My recommendation is to reduce the number of neurons per layer, and increase the training dataset size. Right now, you have a lot of parameters to train in your network, and a small training set (~10K). However, the main point of my answer is that sklearn probably isn't a great choice for your end application.
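For what it's worth, a minimal sketch of that first suggestion, reusing the question's own helper and parameter names (algorithm="sgd" follows the parameter name used in the question's sklearn version; the layer size and sample count are illustrative, not tuned values):
# Smaller hidden layer, larger training set (sizes are illustrative)
QuadRegressor = nnet.MLPRegressor(hidden_layer_sizes=(5,), learning_rate_init=0.00001,
                                  max_iter=10000, algorithm="sgd", tol=0.000001)
data = generateTargetDataset(expression="x**2", s=100000, generateRange=(-1, 1))
QuadRegressor.fit(data[0], data[1])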
So you have a complex problem you want to solve with neural networks?
"I have a complex problem to solve with it, but to start out I am just trying a couple of basic tests to familiarize myself with it."
According to the official user guide, sklearn's implementation of neural networks isn't designed for large applications and is a lot less flexible than other options for deep learning.
One Python deep learning library I've had good experiences with is keras, a modular, easy-to-use library with GPU support.
Here's a sample I coded up that trains a single perceptron to do quadratic regression.
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD
import numpy as np
import matplotlib.pyplot as plt
# A single sigmoid unit: output = sigmoid(w*x + b), trained with SGD on a mean-squared-error loss
model = Sequential()
model.add(Dense(1, init = 'uniform', input_dim=1))
model.add(Activation('sigmoid'))
model.compile(optimizer = SGD(lr=0.02, decay=1e-6, momentum=0.9, nesterov=True), loss = 'mse')

# Training data: 1000 random points in [0, 1) with quadratic labels
data = np.random.random(1000)
labels = data**2
model.fit(data.reshape((len(data),1)), labels, nb_epoch = 1000, batch_size = 128, verbose = 1)

# Evaluate on 100 sorted test points and plot predictions against the true curve
tdata = np.sort(np.random.random(100))
tlabels = tdata**2
preds = model.predict(tdata.reshape((len(tdata), 1)))

plt.plot(tdata, tlabels)
plt.scatter(tdata, preds)
plt.show()
This outputs a scatter plot of the test data points, along with a plot of the true curve.
As you can see, the results are reasonable. In general, neural networks are hard to train, and I had to do some parameter tuning before I got this example working.
It looks like you're using Windows. This question may be helpful for installing Keras on Windows.
I want to implement a Keras model to predict the similarity between two sentences from word embeddings, as follows (I have included my full script at the end):
Load word embedding models, e.g., Word2Vec and fastText.
Generate samples (X1 and X2) by computing the average word vectors for all words in a sentence. If two or more models are used, calculate the arithmetic mean of all embeddings (Frustratingly Easy Meta-Embedding -- Computing Meta-Embeddings by Averaging Source Word Embeddings).
Concatenate X1 and X2 into one array before feeding them to the network.
Compile (and evaluate) the Keras model.
The entire script is as follows:
import numpy as np
from gensim.models import Word2Vec
from keras.layers import Dense
from keras.models import Sequential
from sklearn.model_selection import train_test_split
def encoder_vector(v: str, model: Word2Vec) -> np.array:
    wv_dim = model.vector_size
    if v in model.wv:
        return model.wv[v]
    else:
        return np.zeros(wv_dim)

def encoder_words_avg(words: list[str], model: Word2Vec) -> np.array:
    dim = model.vector_size
    words = [word for word in words if word in model.wv]
    if len(words) >= 1:
        return np.mean(model.wv[words], axis=0)
    else:
        return np.zeros(dim)
def load_samples(mappings, w2v_model, fast_model):
    dim = w2v_model.vector_size
    num = len(mappings)
    X1 = np.zeros((num, dim))
    X2 = np.zeros((num, dim))
    y = np.zeros((num, 1))
    for i in range(num):
        mapping = mappings[i].split("|")
        sentence_1, sentence_2 = mapping[1:]
        e = np.zeros((2, dim))
        # Compute meta-embedding by averaging all embeddings.
        e[0, :] = encoder_words_avg(words=sentence_1.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_1.split(), model=fast_model)
        X1[i] = e.mean(axis=0)
        e[0, :] = encoder_words_avg(words=sentence_2.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_2.split(), model=fast_model)
        X2[i] = e.mean(axis=0)
        y[i] = 0.0 if mapping[0].startswith("-") else 1.0
    return X1, X2, y
def baseline_model(X_train, X_test, y_train, y_test):
    model = Sequential()
    model.add(
        Dense(
            200,
            input_shape=(X_train.shape[1],),
            activation="relu",
            kernel_initializer="he_uniform",
        )
    )
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, batch_size=8, epochs=14)

    # Evaluate the trained model, using the train and test data
    _, train_acc = model.evaluate(X_train, y_train, verbose=0)
    _, test_acc = model.evaluate(X_test, y_test, verbose=0)
    print("Train: %.3f, Test: %.3f\n" % (train_acc, test_acc))

    return model
def main():
    w2v_model = Word2Vec.load("")
    fast_model = Word2Vec.load("")

    mappings = [
        "1|boiled chicken egg|hen egg whole boiled",
        "2|tomato|tomato substance",
        "3|sweet potatoes|potato chip",
        "-1|watering plants|cornsalad plant",
        "-2|butter|butane",
        "-3|olive plant|black olives",
    ]

    X1, X2, y = load_samples(mappings, w2v_model=w2v_model, fast_model=fast_model)

    # Concatenate both arrays into one before feeding to the network.
    X = np.concatenate([X1, X2], axis=1)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    model = baseline_model(X_train, X_test, y_train, y_test)
    model.summary()
The above script seems to work, but the prediction result is very poor even when using only Word2Vec (which makes me think there could be an issue with the Keras model...). Any ideas on how to improve the outcome? Am I doing something wrong?
Thank you.
It's unclear what you're intending to predict.
Do you want your Keras NN to report the same value as the precise cosine-similarity calculation between the two text summary vectors would report? If so, why not just... do the calculation? It's not something I'd necessarily expect a neural architecture to approximate better.
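If that direct calculation is what you want, here is a minimal numpy sketch, applied to the X1/X2 arrays returned by your load_samples (no training involved):
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity of two vectors; returns 0.0 if either vector is all zeros.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# One similarity score per sentence pair.
sims = [cosine_similarity(v1, v2) for v1, v2 in zip(X1, X2)]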
Alternatively, if your tiny 6-pair dataset is the target:
Your existing 'gold standard' answers don't seem obviously correct to me. Superficially, 'olive plant' & 'black olives' seem nearly as 'similar' as 'tomato' & 'tomato substance'. Similarly, 'watering plants' & 'cornsalad plant' seem about as similar as 'sweet potatoes' & 'potato chip'.
A mere 6 examples (maybe 5 after the train/test split?) is inadequate to usefully train a larger neural classifier, and to the extent the classifier might be easily trained (indeed, 'overfit') to the 5 training examples, it won't necessarily have learned anything generalizable to the one hold-out example (which uses vectors quite far from the training texts). (With such a paucity of training data, and testing on inputs that might be arbitrarily different from the training data, "very poor" performance would be expected. Neural nets require lots of varied training examples!)
Finally, the strategy of creating combined-embeddings-by-averaging, as investigated by your linked paper, is another atypical practice that seems fishy to me. Even if it could offer some benefits, there's no reason to mix that atypical, somewhat non-intuitive extra practice into your experiment before even having things work with a more typical and simple baseline approach, for comparison, to be sure the extra 'meta'/averaging is worth the complication.
The paper itself doesn't really show any advantage over concatenation, which has a stronger theoretical basis (preserving each model's full independent spaces) than averaging, except by a tiny amount in 1 of 6 tests. Further, the average of GloVe & CBOW performs the same as or worse than GloVe alone on 3 of their 6 evaluations, and only minimally better on the other 3. That implies to me the outperformance might be mainly random jitter introduced by the extra steps, and the averaging is, at best, a cheap option to consider for a tiny boost, not a generally better approach.
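To make the comparison concrete, here is a small numpy sketch of the two combination strategies (w2v_vec and fast_vec are stand-ins for the two per-sentence embeddings in your script; the dimensionality is assumed):
import numpy as np

dim = 300  # assumed embedding dimensionality
w2v_vec = np.random.rand(dim)   # stand-in for a Word2Vec sentence vector
fast_vec = np.random.rand(dim)  # stand-in for a fastText sentence vector

avg_vec = (w2v_vec + fast_vec) / 2.0              # averaging: dimensionality stays at 300
concat_vec = np.concatenate([w2v_vec, fast_vec])  # concatenation: dimensionality doubles to 600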
The paper also leaves many natural related questions unaddressed:
Is averaging better than, say, just picking a random half of each model's dimensions for concatenation? That'd be even cheaper!
Might some of the slight lift in some tasks be due not to the averaging, but to the other transformations they've applied: the l2-normalization applied to each source model, or across the whole of each dimension for the GloVe model? (It's unclear if this model-postprocessing was only applied before dual-model averaging, or also to GloVe in its solo evaluation.) See the short numpy sketch after this list.
There's other work suggesting post-training transformations of word-vector spaces may improve performance on downstream tasks – see for example 'All But The Top' – so which steps, exactly, get which advantages is important to distinguish.
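As referenced in the normalization question above, the two variants differ only in the axis they act on; a rough numpy illustration with a hypothetical vocabulary-by-dimension matrix (this shows the axis difference, not exactly what the paper does):
import numpy as np

embeddings = np.random.rand(10000, 300)  # hypothetical vocabulary x dimension matrix

# Per-vector l2-normalization: each word vector is scaled to unit length.
per_vector = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Per-dimension l2-normalization: each dimension is scaled to unit norm across the vocabulary.
per_dimension = embeddings / np.linalg.norm(embeddings, axis=0, keepdims=True)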
How to increase the inputs of each layer in the neural network by a specific scale?
I am working on a neural network with Keras and TensorFlow.
I'd like to implement a particular behaviour in the network: during training, I want to exclude a specific range of input values for each layer. For example:
Say the input to layer one lies in the range [-2, 2], and I want to make sure no input falls in [0, 0.5]. So I'd like to add 0.5 to every input whose value is in [0, 0.5].
How can I do that during the training process?
Thank you very much
You might try a Lambda layer. An example implementation is below; I hope this helps.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
def myClippingFunction(x):
    # True wherever the input lies in the closed interval [0, 0.5]
    y = tf.math.logical_and(tf.math.greater_equal(x, [[0]]), tf.math.less_equal(x, [[0.5]]))
    # Shift those values up by 0.5; leave everything else unchanged
    z = tf.where(y, x + 0.5, x)
    return z
#create simple model
inputA = layers.Input((1,))
x = layers.Lambda(myClippingFunction)(inputA)
myModel = keras.Model(inputs=inputA, outputs=x)
x_data = np.array([[-0.2],[0.6]])
myModel.predict(x_data)
I am using the Keras functional API to build a classifier, and I am using the training flag in the dropout layer to enable dropout when predicting new instances (in order to get an estimate of the uncertainty). To get the expected response one needs to repeat this prediction several times, with Keras randomly activating links in the dense layer, which is of course computationally expensive. Therefore, I would also like to have the option to not use dropout at the prediction phase, i.e., to use all the network links. Does anyone know how I can do this? Below is a sample of what I am doing. I looked at whether predict has any relevant parameter, but it does not seem to (?). I could technically train the same model without the training flag at the dropout layer, but I do not want to do this (or rather, I want a cleaner solution than maintaining two different models).
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.utils import to_categorical
from keras.layers import Dense
from keras.layers import Dropout
import numpy as np
import keras
# generate a 2d classification sample dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
trainy = to_categorical(trainy)
testy = to_categorical(testy)
inputlayer = keras.layers.Input((2,))
d = keras.layers.Dense(500, activation = 'relu')(inputlayer)
d1 = keras.layers.Dropout(rate = .3)(d,training = True)
out = keras.layers.Dense(2, activation = 'softmax')(d1)
model = keras.Model(inputs = inputlayer, outputs = out)
model.compile(loss = 'categorical_crossentropy',metrics = ['accuracy'],optimizer='adam')
model.fit(x = trainX, y = trainy, validation_data=(testX, testy),epochs=1000, verbose=1)
# another prediction on a specific sample
print(model.predict(testX[0:1,:]))
# another prediction on the same sample
print(model.predict(testX[0:1,:]))
Running the above example I get the following output:
[[0.9230819 0.07691813]]
[[0.8222245 0.17777553]]
which is as expected, different class probabilities for the same input, since there is a random (de)activation of the links from the dropout layer.
Any suggestions on how I can enable/disable dropout at the prediction phase with the functional API?
Sure, you do not need to set the training flag when building the Dropout layer. After training your model you define this function:
from keras import backend as K

mc_func = K.function([model.input, K.learning_phase()],
                     [model.output])
Then you call mc_func with your input and flag 1 to enable dropout, or 0 to disable it:
stochastic_pred = mc_func([some_input, 1])
deterministic_pred = mc_func([some_input, 0])
I'm new to Keras and have been experimenting with various things such as BatchNormalization, but it is not working at all. When the BatchNormalization line is commented out the model converges to around 0.04 loss or better, but with it in place it converges to 0.71 and gets stuck around there. I'm not sure what's wrong.
from sklearn import preprocessing
from sklearn.datasets import load_boston
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers.normalization import BatchNormalization
import keras.optimizers
boston = load_boston()
x = boston.data
y = boston.target
normx = preprocessing.scale(x)
normy = preprocessing.scale(y)
# doesn't construct the output layer
def layer_looper(inputs, number_of_loops, neurons):
    inputs_copy = inputs
    for i in range(number_of_loops):
        inputs_copy = Dense(neurons, activation='relu')(inputs_copy)
        inputs_copy = BatchNormalization()(inputs_copy)
    return inputs_copy
inputs = Input(shape = (13,))
x = layer_looper(inputs, 40, 20)
predictions = Dense(1, activation='linear')(x)
model = Model(inputs=inputs, outputs=predictions)
opti = keras.optimizers.Adam(lr=0.0001)
model.compile(loss='mean_absolute_error', optimizer=opti, metrics=['acc'])
print(model.summary())
model.fit(normx, normy, epochs=5000, verbose=2, batch_size=128)
I have tried experimenting with batch sizes and the optimizer but it doesn't seem very effective. Am I doing something wrong?
I've increased the learning rate to 0.01 and it seems like the network is able to learn something (I get Epoch 1000/5000 - 0s - loss: 0.2330).
I think it's worth noting the following from the abstract of the original Batch Normalization paper:
"Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer (...)"
That hints at using an increased learning rate (something you might want to experiment with).
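Concretely, relative to the code in the question, that is just a change to the optimizer line (0.01 is what happened to work here, not a general recommendation):
import keras.optimizers

opti = keras.optimizers.Adam(lr=0.01)  # increased from the question's 0.0001
model.compile(loss='mean_absolute_error', optimizer=opti, metrics=['acc'])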
Be aware that since it works like regularization, BatchNorm should make your training loss worse - it's supposed to prevent overfitting and thus close the gap between the train and test/valid errors.
I'm comparing the results of a logistic regressor written in Keras to the default sklearn LogisticRegression. My input is one-dimensional. My output has two classes and I'm interested in the probability that the output belongs to class 1.
I'm expecting the results to be almost identical, but they are not even close.
Here is how I generate my random data. Note that X_train, X_test are still vectors, I'm just using capital letters because I'm used to it. Also there is no need for scaling in this case.
import numpy as np

X = np.linspace(0, 1, 10000)
y = np.random.sample(X.shape)
y = np.where(y < X, 1, 0)
Here's cumsum of y plotted over X. Doing a regression here is not rocket science.
I do a standard train-test split:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train = X_train.reshape(-1,1)
X_test = X_test.reshape(-1,1)
Next, I train a default logistic regressor:
from sklearn.linear_model import LogisticRegression
sk_lr = LogisticRegression()
sk_lr.fit(X_train, y_train)
sklearn_logreg_result = sk_lr.predict_proba(X_test)[:,1]
And a logistic regressor that I write in Keras:
from keras.models import Sequential
from keras.layers import Dense
keras_lr = Sequential()
keras_lr.add(Dense(1, activation='sigmoid', input_dim=1))
keras_lr.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])
_ = keras_lr.fit(X_train, y_train, verbose=0)
keras_lr_result = keras_lr.predict(X_test)[:,0]
And a hand-made solution:
pearson_corr = np.corrcoef(X_train.reshape(X_train.shape[0],), y_train)[0,1]
b = pearson_corr * np.std(y_train) / np.std(X_train)
a = np.mean(y_train) - b * np.mean(X_train)
handmade_result = (a + b * X_test)[:,0]
I expect all three to deliver similar results, but here is what happens. This is a reliability diagram using 100 bins.
I have played around with loss functions and other parameters, but the Keras logreg stays roughly like this. What might be causing the problem here?
edit: Using binary crossentropy is not the solution here, as shown by this plot (note that the input data has changed between the two plots).
While both implementations are a form of logistic regression, there are quite a few differences. Although both solutions converge to a comparable minimum (0.75/0.76 accuracy), they are not identical.
Optimizer - Keras uses vanilla SGD, whereas sklearn's LogisticRegression is based on liblinear, which implements a trust-region Newton method.
Regularization - sklearn has built-in L2 regularization.
Weights - The weights are randomly initialized and probably sampled from a different distribution.
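If the goal is to bring the Keras model closer to sklearn's behaviour, here is a hedged sketch of the corresponding changes (the regularization strength and epoch count are illustrative, not exact sklearn equivalents; as the question's edit notes, changing the loss alone is not enough):
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2

keras_lr = Sequential()
# Explicit L2 penalty to mirror sklearn's built-in regularization (strength is illustrative),
# and zero-initialized weights to remove the random-initialization difference.
keras_lr.add(Dense(1, activation='sigmoid', input_dim=1,
                   kernel_regularizer=l2(0.01),
                   kernel_initializer='zeros'))
# Log loss is what LogisticRegression optimizes; plain SGD still differs from liblinear's
# trust-region Newton solver, so more epochs help it approach the same minimum.
keras_lr.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
keras_lr.fit(X_train, y_train, epochs=100, verbose=0)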