I've come across the following error:
AssertionError: dimension mismatch
I've trained a linear regression model using PySpark's LinearRegressionWithSGD.
However, when I try to make a prediction on the training set, I get a "dimension mismatch" error.
Worth mentioning:
Data was scaled using StandardScaler, but the predicted value was not.
As can be seen in the code, the features used for training were generated by PCA.
Some code:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

pca_transformed = pca_model.transform(data_std)
X = pca_transformed.map(lambda x: (x[0], x[1]))
data = train_votes.zip(pca_transformed)
labeled_data = data.map(lambda x: LabeledPoint(x[0], x[1:]))
linear_regression_model = LinearRegressionWithSGD.train(labeled_data, iterations=10)
The prediction is the source of the error, and these are the variations I tried:
pred = linear_regression_model.predict(pca_transformed.collect())
pred = linear_regression_model.predict([pca_transformed.collect()])
pred = linear_regression_model.predict(X.collect())
pred = linear_regression_model.predict([X.collect()])
The regression weights:
DenseVector([1.8509, 81435.7615])
The vectors used:
pca_transformed.take(1)
[DenseVector([-0.1745, -1.8936])]
X.take(1)
[(-0.17449817243564397, -1.8935926689554488)]
labeled_data.take(1)
[LabeledPoint(22221.0, [-0.174498172436,-1.89359266896])]
This worked:
pred = linear_regression_model.predict(pca_transformed)
pca_transformed is of type RDD.
The predict function handles RDDs and local vectors differently:
def predict(self, x):
    """
    Predict the value of the dependent variable given a vector or
    an RDD of vectors containing values for the independent variables.
    """
    if isinstance(x, RDD):
        return x.map(self.predict)

    x = _convert_to_vector(x)
    return self.weights.dot(x) + self.intercept
When a plain Python list is passed instead (like the .collect() results in the question above), a dimension mismatch can occur. As the source shows, if x is not an RDD it is converted to a vector with _convert_to_vector. The problem is that when you pass in a list of vectors, the dot product against the weights will not work unless you take x[0], i.e. a single vector rather than the enclosing list.
Here is the error reproduced:
j = _convert_to_vector(pca_transformed.take(1))
linear_regression_model.weights.dot(j) + linear_regression_model.intercept
This works just fine:
j = _convert_to_vector(pca_transformed.take(1))
linear_regression_model.weights.dot(j[0]) + linear_regression_model.intercept
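To summarize, here are a couple of ways to call predict that avoid the mismatch. This is a minimal sketch based on the variables above, assuming pca_transformed and linear_regression_model are defined as in the question:

# Option 1: pass the RDD directly; predict() maps itself over every vector.
pred_rdd = linear_regression_model.predict(pca_transformed)

# Option 2: predict a single local vector (note the [0], which unwraps the
# one-element list returned by take(1)).
single_pred = linear_regression_model.predict(pca_transformed.take(1)[0])

# Option 3: collect to the driver and predict vector by vector.
local_preds = [linear_regression_model.predict(v) for v in pca_transformed.collect()]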
I am studying GNNs, and to code along I am using the PyTorch Geometric introduction code from the PyTorch Geometric tutorial:
import torch_geometric
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root="tutorial1",name= "Cora")
data = dataset[0]
print(data)
Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
##############(I omitted my neural network and train(), which are not related to my question)########
def test():
    model.eval()
    logits, accs = model(), []
    for _, mask in data('train_mask', 'val_mask', 'test_mask'):
        pred = logits[mask].max(1)[1]
        acc = pred.eq(data.y[mask]).sum().item() / mask.sum().item()
        accs.append(acc)
    return accs
What I am curious about is this line:
for _, mask in data('train_mask', 'val_mask', 'test_mask'):
because I don't understand what data('train_mask', 'val_mask', 'test_mask') is. The result is
<generator object Data.__call__ at 0x7f617c8498d0>
So I don't get what it is. I read some documentation on generators, but how can I see what its elements are?
The data object you retrieve from the Planetoid dataset is a single graph. You have the following attributes:
x: the node features, hence its dimension is the number of nodes (2708) times the feature dimension (1433).
edge_index: the edge list.
y: the "ground truth"/class labels, or in this specific case the classification of the papers. Hence its shape is the number of nodes.
The three masks train_mask, val_mask and test_mask: if I access them via data.train_mask, I get a boolean tensor whose length is the number of nodes. This is the "default split" of the dataset. The masks should be disjoint, and where a mask is True the respective node is in that set.
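As for the generator itself: in the PyTorch Geometric version used in that tutorial, calling the data object with attribute names yields (name, tensor) pairs for the attributes that exist, which is why the loop unpacks _, mask. A quick way to see the elements (a small sketch, assuming the same data object loaded from Planetoid as above):

# Materialize the generator: it yields (attribute name, tensor) pairs.
for name, mask in data('train_mask', 'val_mask', 'test_mask'):
    print(name, mask.shape, int(mask.sum()))

# Or simply turn it into a list to look at everything at once:
print(list(data('train_mask', 'val_mask', 'test_mask')))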
I am trying to predict the cost of perfumes, but I get an error on the line answer = clf.predict(result).
# (Assuming enc is a OneHotEncoder, as the answer below describes.)
from sklearn import tree
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
z = []  # features: model, volume
y = []  # target: cost

cursor.execute('SELECT * FROM info')
info = cursor.fetchall()
for line in info:
    z.append(line[0:2])
    y.append(line[2])

enc.fit(z)
x = enc.transform(z).toarray()

clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)

new = input('enter the Model, Volume of your perfume to see the cost: example = Midnighto,50 ').split(',')
enc.fit([new])
result = enc.transform([new]).toarray()
answer = clf.predict(result)
print(answer)
You don't have to fit enc again on your new input; only transform it with the encoder already fitted on X. (By refitting you are one-hot encoding the new sample while taking into consideration only the single value it contains, whereas you need the encoder to know all the possible categories of each feature in your X data.) So delete the following row:
enc.fit([new])
After that, please check that X and result have the same number of features; you can use the shape attribute for this.
Furthermore, I recommend using separate training and test data to see whether your model is overfitting. Then you can apply your own predict.
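A minimal sketch of the corrected prediction step, assuming enc, z, y, clf and new are the objects defined in the question:

# Fit the encoder once, on the full training features, so it knows every category.
enc.fit(z)
x = enc.transform(z).toarray()
clf = clf.fit(x, y)

# For a new sample, only transform it; do NOT refit the encoder.
result = enc.transform([new]).toarray()

# Sanity check: the encoded new sample must have the same number of features.
assert result.shape[1] == x.shape[1]

answer = clf.predict(result)
print(answer)

If the new model name can be a category the encoder has never seen during fitting, you may also need to construct it as OneHotEncoder(handle_unknown='ignore').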
I am using a Keras neural net to identify the category to which the data belongs.
self.model.compile(loss='categorical_crossentropy',
                   optimizer=keras.optimizers.Adam(lr=0.001, decay=0.0001),
                   metrics=[categorical_accuracy])
Fit function:
history = self.model.fit(self.X,
                         {'output': self.Y},
                         validation_split=0.3,
                         epochs=400,
                         batch_size=32)
I am interested in finding out which labels are getting categorized wrongly during the validation step. It seems like a good way to understand what is happening under the hood.
You can use model.predict_classes(validation_data) to get the predicted classes for your validation data, and compare these predictions with the actual labels to find out where the model was wrong. Something like this:
predictions = model.predict_classes(validation_data)
wrong = np.where(predictions != Y_validation)
If you are interested in looking 'under the hood', I'd suggest to use
model.predict(validation_data_x)
to see the scores for each class, for each observation of the validation set.
This should shed some light on which categories the model is not so good at classifying. The way to predict the final class is
scores = model.predict(validation_data_x)
preds = np.argmax(scores, axis=1)
Be sure to use the proper axis for np.argmax (I'm assuming your class axis is 1, i.e. scores has shape (observations, classes)). Then use preds to compare with the real classes.
Also, as another exploration, if you want to see the overall accuracy on this dataset, use
model.evaluate(x=validation_data_x, y=validation_data_y)
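Since validation_split=0.3 in fit takes the last 30% of self.X and self.Y (before any shuffling) as the validation set, you can also reconstruct that split yourself and see exactly which samples are misclassified. A sketch, assuming self.X and self.Y are NumPy arrays and self.Y is one-hot encoded, as the categorical_crossentropy setup above suggests:

import numpy as np

# Keras takes the validation set from the end of the data, before shuffling.
n_val = int(0.3 * len(self.X))
x_val = self.X[-n_val:]
y_val = np.argmax(self.Y[-n_val:], axis=1)   # back from one-hot to class ids

scores = self.model.predict(x_val)
preds = np.argmax(scores, axis=1)

wrong_idx = np.where(preds != y_val)[0]
print("misclassified validation samples:", wrong_idx)
print("predicted:", preds[wrong_idx])
print("actual:   ", y_val[wrong_idx])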
I ended up creating a metric which prints the "worst performing category id + score" on each iteration. Ideas from link
import tensorflow as tf
import numpy as np


class MaxIoU(object):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes

    def max_iou(self, y_true, y_pred):
        # Wraps the np_max_iou method and uses it as a TensorFlow op.
        # Takes numpy arrays as its arguments and returns numpy arrays as
        # its outputs.
        return tf.py_func(self.np_max_iou, [y_true, y_pred], tf.float32)

    def np_max_iou(self, y_true, y_pred):
        # Compute the confusion matrix to get the number of true positives,
        # false positives, and false negatives.
        # Convert predictions and target from categorical to integer format.
        target = np.argmax(y_true, axis=-1).ravel()
        predicted = np.argmax(y_pred, axis=-1).ravel()

        # Trick from torchnet for bincounting 2 arrays together
        # https://github.com/pytorch/tnt/blob/master/torchnet/meter/confusionmeter.py
        x = predicted + self.num_classes * target
        bincount_2d = np.bincount(x.astype(np.int32), minlength=self.num_classes**2)
        assert bincount_2d.size == self.num_classes**2
        conf = bincount_2d.reshape((self.num_classes, self.num_classes))

        # Compute the IoU and mean IoU from the confusion matrix
        true_positive = np.diag(conf)
        false_positive = np.sum(conf, 0) - true_positive
        false_negative = np.sum(conf, 1) - true_positive

        # Just in case we get a division by 0, ignore/hide the error and set the value to 0
        with np.errstate(divide='ignore', invalid='ignore'):
            iou = false_positive / (true_positive + false_positive + false_negative)
        iou[np.isnan(iou)] = 0

        return np.max(iou).astype(np.float32) + np.argmax(iou).astype(np.float32)
usage:
custom_metric = MaxIoU(len(catagories))
self.model.compile(loss='categorical_crossentropy',
                   optimizer=keras.optimizers.Adam(lr=0.001, decay=0.0001),
                   metrics=[categorical_accuracy, custom_metric.max_iou])
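Since the metric packs argmax(iou) + max(iou) into a single float and the score part lies between 0 and 1, both pieces can be recovered from the value Keras logs. A small decoding sketch with a made-up example value (it only becomes ambiguous in the edge case where the score is exactly 1.0):

value = 7.4321                 # hypothetical value printed for max_iou during training
category_id = int(value)       # integer part: the worst performing category id
score = value - category_id    # fractional part: that category's score
print(category_id, score)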
I am confused about the parameter cv in RidgeCV of sklearn.linear_model.
Indeed, I already have my data split into a training set and a validation set, and the documentation of RidgeCV says the parameter cv can be an iterable yielding train/test splits. So I wrote the following:
m = linear_model.RidgeCV(cv=zip(x_validation, y_validation))
m.fit(x_train, y_train)
But it does not work.
Python throws the following error
IndexError: arrays used as indices must be of integer (or boolean) type
What is wrong with my understanding of the parameter cv, and is there an easy way to use my own, already split validation set?
It seems the parameter cv expects an iterable of (training indices, validation indices) pairs, so one solution is:
x = np.concatenate((x_train, x_validation))
y = np.concatenate((y_train, y_validation))
train_fraction = 0.9
train_indices = np.arange(int(train_fraction * x.shape[0]))
validation_indices = np.arange(int(train_fraction * x.shape[0]), x.shape[0])
m = linear_model.RidgeCV(cv=[(train_indices, validation_indices)])
m.fit(x, y)
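Alternatively (not from the answer above, just a suggestion), scikit-learn's PredefinedSplit is meant for exactly this case of a pre-defined validation set and can be passed as cv:

import numpy as np
from sklearn import linear_model
from sklearn.model_selection import PredefinedSplit

x = np.concatenate((x_train, x_validation))
y = np.concatenate((y_train, y_validation))

# -1 marks samples that always stay in the training set,
# 0 marks samples that form the (single) validation fold.
test_fold = np.concatenate((np.full(len(x_train), -1, dtype=int),
                            np.zeros(len(x_validation), dtype=int)))
ps = PredefinedSplit(test_fold)

m = linear_model.RidgeCV(cv=ps)
m.fit(x, y)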
I am new to Theano. I am trying to implement a simple linear regression, but my program throws the following error:
TypeError: ('Bad input argument to theano function with name "/home/akhan/Theano-Project/uog/theano_application/linear_regression.py:36" at index 0(0-based)', 'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?')
Here is my code:
import theano
from theano import tensor as T
import numpy as np
import matplotlib.pyplot as plt
x_points=np.zeros((9,3),float)
x_points[:,0] = 1
x_points[:,1] = np.arange(1,10,1)
x_points[:,2] = np.arange(1,10,1)
y_points = np.arange(3,30,3) + 1
X = T.vector('X')
Y = T.scalar('Y')
W = theano.shared(
    value=np.zeros(
        (3, 1),
        dtype=theano.config.floatX
    ),
    name='W',
    borrow=True
)
out = T.dot(X, W)
predict = theano.function(inputs=[X], outputs=out)
y = predict(X) # y = T.dot(X, W) work fine
cost = T.mean(T.sqr(y-Y))
gradient=T.grad(cost=cost,wrt=W)
updates = [[W,W-gradient*0.01]]
train = theano.function(inputs=[X,Y], outputs=cost, updates=updates, allow_input_downcast=True)
for i in np.arange(x_points.shape[0]):
    print "iteration" + str(i)
    train(x_points[i, :], y_points[i])
sample = np.arange(x_points.shape[0])+1
y_p = np.dot(x_points,W.get_value())
plt.plot(sample,y_p,'r-',sample,y_points,'ro')
plt.show()
What is the explanation behind this error? (I didn't get it from the error message.) Thanks in advance.
There's an important distinction in Theano between defining a computation graph and a function which uses such a graph to compute a result.
When you define
out = T.dot(X, W)
predict = theano.function(inputs=[X], outputs=out)
you first set up a computation graph for out in terms of X and W. Note that X is a purely symbolic variable; it doesn't have any value, but the definition of out tells Theano, "given a value for X, this is how to compute out".
On the other hand, predict is a theano.function which takes the computation graph for out and actual numeric values for X to produce a numeric output. What you pass into a theano.function when you call it always has to have an actual numeric value. So it simply makes no sense to do
y = predict(X)
because X is a symbolic variable and doesn't have an actual value.
The reason you want to do this is so that you can use y to further build your computation graph. But there is no need to use predict for this: the computation graph for predict is already available in the variable out defined earlier. So you can simply remove the line defining y altogether and then define your cost as
cost = T.mean(T.sqr(out - Y))
The rest of the code will then work unmodified.
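For clarity, here is a sketch of how the relevant part of the question's script looks after that change (same variables as above; predict is kept only for producing numeric predictions at the end):

out = T.dot(X, W)
predict = theano.function(inputs=[X], outputs=out)

# Build the cost from the symbolic graph `out`, not from predict(X).
cost = T.mean(T.sqr(out - Y))
gradient = T.grad(cost=cost, wrt=W)
updates = [[W, W - gradient * 0.01]]
train = theano.function(inputs=[X, Y], outputs=cost, updates=updates,
                        allow_input_downcast=True)

for i in np.arange(x_points.shape[0]):
    train(x_points[i, :], y_points[i])

# predict is only ever called with actual numeric data:
y_p = np.array([predict(x_points[i, :]) for i in np.arange(x_points.shape[0])])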