I have one linear regression model: y ~ x1 + x2 (1)
and now let x3 = x1 + x2 and x4 = x1 - x2 to form a new regression y ~ x3 + x4 (2).
Would the predictions of (1) and (2) be the same?
If I add L1 regularization to both models, would the predictions of (1) and (2) still be the same?
That's not a code question, it's a substitution problem: any linear combination of x3 and x4 is also a linear combination of x1 and x2, for example
y = (x1 + x2) + (x1 - x2) = 2x1 ...
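A minimal numerical check of this (a sketch assuming scikit-learn; the data are made up): without regularization the two designs span the same column space, so the fitted values agree, while the L1 penalty is applied to different coefficients, so the regularized predictions can differ.

import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # columns x1, x2
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Reparameterized design: x3 = x1 + x2, x4 = x1 - x2
Z = np.column_stack([X[:, 0] + X[:, 1], X[:, 0] - X[:, 1]])

ols1 = LinearRegression().fit(X, y)
ols2 = LinearRegression().fit(Z, y)
print(np.allclose(ols1.predict(X), ols2.predict(Z)))      # True: same predictions

lasso1 = Lasso(alpha=0.5).fit(X, y)
lasso2 = Lasso(alpha=0.5).fit(Z, y)
print(np.allclose(lasso1.predict(X), lasso2.predict(Z)))  # typically False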
I created three Convolutional Autoencoders with the same architecture to extract features from some images related to different types of trees.
My code is something like:
model1 = myAutoencoder()
model2 = myAutoencoder()
model3 = myAutoencoder()
opt = keras.optimizers.Adam(learning_rate=0.001)
loss = keras.losses.MeanSquaredError()
# note: the same optimizer instance is shared by all three models
model1.compile(optimizer=opt, loss=loss)
model2.compile(optimizer=opt, loss=loss)
model3.compile(optimizer=opt, loss=loss)
Then I train:
# X1, X2, X3 are tensors of 64x64 RGB images, for example of shape (100, 64, 64, 3)
model1.fit(X1, X1)
model2.fit(X2, X2)
model3.fit(X3, X3)
However, only the first model is learning, while the second and third are stuck with the same loss as in the figure:
[figure: training loss of the three models]
Interestingly, if I swap the positions of let's say model1 and model2, like this:
model2.fit(X2, X2)
model1.fit(X1, X1)
model3.fit(X3, X3)
then only model 2 is learning and models 1 and 3 are stuck. I cannot figure out why...
edit: The actual training that I am doing is this:
def scheduler(epoch, lr):
    if epoch < 50:
        return lr
    else:
        return lr * np.math.exp(-0.1)
model2.fit(X2, X2, epochs=100, callbacks=[LearningRateScheduler(scheduler)])
model1.fit(X1, X1, epochs=100, callbacks=[LearningRateScheduler(scheduler)])
model3.fit(X3, X3, epochs=100, callbacks=[LearningRateScheduler(scheduler)])
I figured out that if I delete the callbacks, the learning process is "normal". Is there a reason why the callbacks interfere between models?
I started watching a tutorial on PyTorch and I am learning the concept of logistic regression.
I tried it using some stock data I had. I have inputs, which contains two features, trade_quantity and trade_value, and targets, which holds the corresponding stock prices.
inputs = torch.tensor([[182723838.00, 2375432.00],
[185968153.00, 2415558.00],
[181970093.00, 2369140.00],
[221676832.00, 2811589.00],
[339785916.00, 4291782.00],
[225855390.00, 2821301.00],
[151430199.00, 1889032.00],
[122645372.00, 1552998.00],
[129015052.00, 1617158.00],
[121207837.00, 1532166.00],
[139554705.00, 1789392.00]])
targets = torch.tensor([[76.90],
[76.90],
[76.90],
[80.70],
[78.95],
[79.60],
[80.05],
[78.90],
[79.40],
[78.95],
[77.80]])
I defined the model function, the loss as the mean square error, and tried to run it a few times to get some predictions. Here's the code:
def model(x):
    return x @ w.t() + b

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

preds = model(inputs)
loss = mse(preds, targets)
loss.backward()

with torch.no_grad():
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
    w.grad.zero_()
    b.grad.zero_()
I am using Jupyter for this and ran the last part of the code a few times, after which the predictions come as:
tensor([[inf],
[inf],
[inf],
[inf],
[inf],
[inf],
[inf],
[inf],
[inf],
[inf],
[inf]], grad_fn=<AddBackward0>)
If I run it a few more times, the predictions become nan. Can you please tell me why this is happening?
To me, this looks more like linear regression than logistic regression. You are trying to fit a linear model to your data. That is different from a binary classification task, where you would need a special kind of activation function (a sigmoid, for instance) so that the output is squashed between 0 and 1.
In this particular instance you want to solve a 2D linear problem: the input x has shape (batch, 2) (the two features being x1 = trade_quantity and x2 = trade_value) and the target has shape (batch, 1) (y being the stock price).
So the objective is to find the best w and b (weight matrix and bias column) such that x @ w.t() + b is as close to y as possible, according to your criterion, the mean squared error.
I would recommend normalizing your data so it stays in the [0, 1] range. You can do so using the minimum and maximum of the inputs and targets:
inputs_min, inputs_max = inputs.min(axis=0).values, inputs.max(axis=0).values
targets_min, targets_max = targets.min(axis=0).values, targets.max(axis=0).values
Then apply the transformation:
x = (inputs - inputs_min)/(inputs_max - inputs_min)
y = (targets - targets_min)/(targets_max - targets_min)
Try changing your learning rate and have it run for multiple epochs.
lr = 1e-2

for epoch in range(100):
    preds = model(x)
    loss = mse(preds, y)
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
I use a (1, 2) randomly initialized matrix for w (and a (1,) matrix for b):
w = torch.rand(1, 2)
w.requires_grad = True
b = torch.rand(1)
b.requires_grad = True
And observed the train loss decreasing over 100 epochs.
To find the right hyperparameters, it's better to have a validation set. This set is normalized with the min and max from the train set and is used at the end of each epoch to evaluate performance on data that is 'unknown' to the model. The same goes for your test set, if you have one.
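A minimal sketch of that validation idea (the 8/3 split is hypothetical; it reuses model, mse, w, b and lr from above, and the scaling statistics come from the training portion only):

n_train = 8
x_train, x_val = inputs[:n_train], inputs[n_train:]
y_train, y_val = targets[:n_train], targets[n_train:]

x_min, x_max = x_train.min(axis=0).values, x_train.max(axis=0).values
y_min, y_max = y_train.min(axis=0).values, y_train.max(axis=0).values

x_tr = (x_train - x_min) / (x_max - x_min)
x_va = (x_val - x_min) / (x_max - x_min)   # train statistics, not val statistics
y_tr = (y_train - y_min) / (y_max - y_min)
y_va = (y_val - y_min) / (y_max - y_min)

for epoch in range(100):
    preds = model(x_tr)
    loss = mse(preds, y_tr)
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
        val_loss = mse(model(x_va), y_va)  # evaluated on data 'unknown' to the model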
Now I'm trying to build some predictors with several explanatory variables.
I want to ask you whether my method is wrong.
I have one score (which will be Y here) and 4 features (X1, X2, X3, and X4) with 100 observations.
I investigated the relationship between Y and each of X1, X2, X3, and X4 individually, using Pearson's correlation coefficient and simple linear regression (beta).
Using these relationships, I built a predictor of Y as a weighted sum of the features.
I fit a step-wise linear regression model and used its weights (this gave a higher correlation coefficient and beta).
Due to my lack of knowledge, I want to know whether this is a valid approach and to get some tips (other approaches when the X's have different units).
Thank you.
KHW.
Thank you for your advice.
Actually, I know that the measures X probably have linear relationships with Y. My approach was to investigate Pearson's correlation coefficient and the regression beta for X1, X2, X3, and X4 individually. I found linear relationships between the features and Y, but I wanted to increase predictability (r and beta here) by combining the features. Hence I built Y' = B(0) + B(1) * X1 + B(2) * X2 + ... + B(4) * X4, where Y' is the estimated Y and B(0) is an intercept.
First, I made Y' = B(0) + B(1)* X1 + B(4) * X4 using step-wise linear regression.
Second, I made Y' = B(0) + B(1) * X1 + B(2) * X2 + ... + B(4) * X4 using multiple linear regression.
They fitted quite well, but I worry that, since choosing the coefficients by regression requires Y, I cannot really claim that the features predict Y.
Could k-fold cross-validation be a way to validate this?
Thank you
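As a sketch of the k-fold idea raised above (assuming scikit-learn; X and Y below are random stand-ins for the 100 observations of the four features and the score):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 4)   # stand-in for X1..X4
Y = np.random.rand(100)      # stand-in for the score

# 5-fold CV: coefficients are fit on 4/5 of the data and the held-out
# fold is predicted, so the score reflects out-of-sample predictability.
scores = cross_val_score(LinearRegression(), X, Y, cv=5, scoring='r2')
print(scores.mean(), scores.std())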
mean_sqr = tf.reduce_mean(tf.pow(y_ - y, 2))
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
gradients, variables = zip(*optimizer.compute_gradients(mean_sqr))
opt = optimizer.apply_gradients(list(zip(gradients, variables)))
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for j in range(TRAINING_EPOCHS):
    sess.run(opt, feed_dict={x: batch_xs, y_: batch_xs})
I don't clearly understand what compute_gradients returns. Does it return sum(dy/dx) for the x values assigned by batch_xs, with apply_gradients then performing an update such as
theta <- theta - LEARNING_RATE * 1/m * gradients?
Or does it already return the average of the gradients over the batch, i.e. sum(dy/dx) * 1/m, where m is the batch size?
compute_gradients(a, b) returns d[ sum a ] / db. So in your case it returns d mean_sqr / d theta, where theta is the set of all variables. There is no "dx" in this equation: you are not computing gradients with respect to the inputs. So what happens to the batch dimension? You remove it yourself in the definition of mean_sqr:
mean_sqr = tf.reduce_mean(tf.pow(y_ - y, 2))
thus (I am assuming y is 1D for simplicity)
d[ mean_sqr ] / d theta = d[ 1/M SUM_i=1^M (pred(x_i) - y_i)^2 ] / d theta
                        = 1/M SUM_i=1^M d[ (pred(x_i) - y_i)^2 ] / d theta
So you are in control of whether it sums over the batch, takes the mean, or does something else: if you defined mean_sqr with reduce_sum instead of reduce_mean, the gradients would be summed over the batch, and so on.
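A small sketch of that point (assuming TensorFlow 1.x, as in the question; the toy graph and batch are made up): the gradient of a reduce_mean loss is 1/M times the gradient of the corresponding reduce_sum loss.

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1])
y_ = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable([[1.0]])
y = tf.matmul(x, w)

loss_mean = tf.reduce_mean(tf.pow(y_ - y, 2))
loss_sum = tf.reduce_sum(tf.pow(y_ - y, 2))

g_mean = tf.gradients(loss_mean, [w])[0]
g_sum = tf.gradients(loss_sum, [w])[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.ones((4, 1)), y_: np.zeros((4, 1))}
    gm, gs = sess.run([g_mean, g_sum], feed_dict=feed)
    print(gm, gs)  # gs is 4 * gm for this batch of size 4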
On the other hand, apply_gradients simply "applies the gradients"; the exact update rule is optimizer dependent. For GradientDescentOptimizer it would be
theta <- theta - learning_rate * gradients(theta)
For Adam, which you are using, the update rule is of course more complex.
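For reference, a rough sketch of the standard Adam update (textbook form with the usual default hyperparameters; TensorFlow's implementation differs in some details):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v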
Note, however, that tf.gradients behaves more like "backprop" than like a true gradient in the mathematical sense: it follows the graph dependencies and does not recognise dependencies that run in the "opposite" direction.
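A tiny illustration of that direction point (again assuming TensorFlow 1.x): gradients only flow along existing graph dependencies.

import tensorflow as tf

a = tf.constant(3.0)
b = 2.0 * a
print(tf.gradients(b, [a]))  # a gradient tensor: db/da = 2
print(tf.gradients(a, [b]))  # [None]: a does not depend on b in the graph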
I am using a Siamese architecture in my model for a classification task: deciding whether two inputs are similar.
in1 = Input(shape=(None,), dtype='int32', name='in1')
x1 = Embedding(output_dim=dim, input_dim=n_symbols, input_length=None,
               weights=[embedding_weights], name='x1')(in1)
in2 = Input(shape=(None,), dtype='int32', name='in2')
x2 = Embedding(output_dim=dim, input_dim=n_symbols, input_length=None,
               weights=[embedding_weights], name='x2')(in2)
l = Bidirectional(LSTM(units=100, return_sequences=False))
y1 = l(x1)
y2 = l(x2)
y = concatenate([y1, y2])
out = Dense(1, activation='sigmoid')(y)
model = Model(inputs=[in1, in2], outputs=[out])
It works correctly, as the number of weights to be trained remains the same even when I use a single input. What confused me, though, was the TensorBoard visualization of the model.
[figure: TensorBoard graph of the model]
Shouldn't both x1 and x2 map to the same bidirectional node?
Also, what do the 18 and 32 tensors signify?