I am confused about the parameter cv in RidgeCV of sklearn.linear_model
Indeed, I already have my data splitted into a training set and validation set, and the documentation of RidgeCV says the parameter cv can be an iterable yielding train/test splits. So I write the following:
m = linear_model.RidgeCV(cv=zip(x_validation, y_validation))
m.fit(x_train, y_train)
But it does not work.
Python throws the following error
IndexError: arrays used as indices must be of integer (or boolean) type
What is wrong with my understanding of the parameter cv and is there an easy manner to use my own and already splitted validation set?
it seems the parameter cv expects a list of indices for use as training set and list of indices for use as validation set, so a solution
x = np.concatenate(x_train, x_validation)
y = np.concatenate(y_train, y_validation)
train_fraction = 0.9
train_indices = range(int(train_fraction * x.shape[0]))
validation_indices = range(int(train_fraction * x.shape[0]), x.shape[0])
m = linear_model.RidgeCV(cv=zip(train_indices, validation_indices))
m.fit(x, y)
Related
For data with the shape (num_samples,features), MinMaxScaler from sklearn.preprocessing can be used to normalize it easily.
However, when using the same method for time series data with the shape (num_samples, time_steps,features), sklearn will give an error.
from sklearn.preprocessing import MinMaxScaler
import numpy as np
#Making artifical time data
x1 = np.linspace(0,3,4).reshape(-1,1)
x2 = np.linspace(10,13,4).reshape(-1,1)
X1 = np.concatenate((x1*0.1,x2*0.1),axis=1)
X2 = np.concatenate((x1,x2),axis=1)
X = np.stack((X1,X2))
#Trying to normalize
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X) <--- error here
ValueError: Found array with dim 3. MinMaxScaler expected <= 2.
This post suggests something like
(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())
Yet, it only works for data with only 1 feature. Since my data has more than 1 feature, this method doesn't work.
How to normalize time series data with multiple features?
To normalize a 3D tensor of shape (n_samples, timesteps, n_features) use the following:
(timeseries-timeseries.min(axis=2))/(timeseries.max(axis=2)-timeseries.min(axis=2))
Using the argument axis=2 will return the result of the tensor operation performed along the 3rd dimension i.e., the feature axis. Thus each feature will be normalized independently.
i'm dealing with Isolation Forest for APT classification and it gets encouraging results.
Now i'm trying to improve the performance with feature selection, and i consider PCA approach.
df = pd.read_csv("features_labeled_PULITOIsolationFor.csv")
apt_list = list(set(df.apt))[1:] ## Il primo è nan quindi lo scarto
for apt_name in apt_list:
out_file = open("test.txt","a")
out_file.write(apt_name + "\n" + "\n")
out_file.close()
df_apt17 = df[df["apt"] == apt_name]
df_other = df[df["apt"] != apt_name]
nf = 150
pca = PCA(n_components=nf)
df_apt17 = pca.fit_transform(df_apt17.drop("apt", 1))
print(df_apt17.shape)
kf = KFold(n_splits=10, random_state=1, shuffle=True)
df_apt17 = df_apt17.reset_index()
df_other = df_other.reset_index()
After this, i perform Cross Validation dividing the apt in question in Kfolder and i use df_other dataframe for testing folder merged with some elements of current APT.
However, although PCA seems to work since the reduction of features (seen by .shape on dataframe), it gives me error on the reset_index() function:
df_apt17 = df_apt17.reset_index()
AttributeError: 'numpy.ndarray' object has no attribute 'reset_index'
how can i deal with this problem?
Thanks to everyone
By reading the docs in sklearn (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) it say that the method pca.fit_transform() return an array.
Arrays don't have reset_index() method, only pandas.DataFrame.
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters: X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features
is the number of features.
y : Ignored.
Returns: X_new : array-like, shape (n_samples, n_components)
If you need to use reset_index(), you need transform it back to a pandas.DataFrame.
I've come across the following error:
AssertionError: dimension mismatch
I've trained a linear regression model using PySpark's LinearRegressionWithSGD.
However when I try to make a prediction on the training set, I get "dimension mismatch" error.
Worth mentioning:
Data was scaled using StandardScaler, but the predicted value was not.
As can be seen in code the features used for training were generated by PCA.
Some code:
pca_transformed = pca_model.transform(data_std)
X = pca_transformed.map(lambda x: (x[0], x[1]))
data = train_votes.zip(pca_transformed)
labeled_data = data.map(lambda x : LabeledPoint(x[0], x[1:]))
linear_regression_model = LinearRegressionWithSGD.train(labeled_data, iterations=10)
The prediction is the source of the error, and these are the variations I tried:
pred = linear_regression_model.predict(pca_transformed.collect())
pred = linear_regression_model.predict([pca_transformed.collect()])
pred = linear_regression_model.predict(X.collect())
pred = linear_regression_model.predict([X.collect()])
The regression weights:
DenseVector([1.8509, 81435.7615])
The vectors used:
pca_transformed.take(1)
[DenseVector([-0.1745, -1.8936])]
X.take(1)
[(-0.17449817243564397, -1.8935926689554488)]
labeled_data.take(1)
[LabeledPoint(22221.0, [-0.174498172436,-1.89359266896])]
This worked:
pred = linear_regression_model.predict(pca_transformed)
pca_transformed is of type RDD.
The function handles RDD's and arrays differently:
def predict(self, x):
"""
Predict the value of the dependent variable given a vector or
an RDD of vectors containing values for the independent variables.
"""
if isinstance(x, RDD):
return x.map(self.predict)
x = _convert_to_vector(x)
return self.weights.dot(x) + self.intercept
When a simple array is used, there might be a dimension mismatch issue (like the error in the question above).
As can be seen, if x is not an RDD, it's being converted to a vector. The thing is the dot product will not work unless you take x[0].
Here is the error reproduced:
j = _convert_to_vector(pca_transformed.take(1))
linear_regression_model.weights.dot(j) + linear_regression_model.intercept
This works just fine:
j = _convert_to_vector(pca_transformed.take(1))
linear_regression_model.weights.dot(j[0]) + linear_regression_model.intercept
I am currently trying to visualize the output of an intermediate layer in Keras 1.0 (which I could do with Keras 0.3) but it does not work anymore.
x = model.input
y = model.layers[3].output
f = theano.function([x], y)
But I get the following error:
MissingInputError: ("An input of the graph, used to compute DimShuffle{x,x,x,x}(keras_learning_phase), was not provided and not given a value.Use the Theano flag exception_verbosity='high',for more information on this error.", keras_learning_phase)
Prior to Keras 1.0, with my graph model, I could just do:
x = graph.inputs['input'].input
y = graph.nodes[layer].get_output(train=False)
f = theano.function([x], y, allow_input_downcast=True)
So I suspect it to come from the "train=False" parameter which I don't know how to set in the new version.
Thank you for your help
Try:
In the import statements first give
from keras import backend as K
from theano import function
then
f = K.function([model.layers[0].input, K.learning_phase()],
[model.layers[3].output])
# output in test mode = 0
layer_output = get_3rd_layer_output([X_test, 0])[0]
# output in train mode = 1
layer_output = get_3rd_layer_output([X_train, 1])[0]
This was just answered by François Chollet on github:
Your model apparently has a different behavior in training and test mode, and so needs to know what mode it should be using.
Use
iterate = K.function([input_img, K.learning_phase()], [loss, grads])
and pass 1 or 0 as value for the learning phase, based on whether you want the model in training mode or test mode.
https://github.com/fchollet/keras/issues/2417
I am new to theano. I am trying to implement simple linear regression but my program throws following error:
TypeError: ('Bad input argument to theano function with name "/home/akhan/Theano-Project/uog/theano_application/linear_regression.py:36" at index 0(0-based)', 'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?')
Here is my code:
import theano
from theano import tensor as T
import numpy as np
import matplotlib.pyplot as plt
x_points=np.zeros((9,3),float)
x_points[:,0] = 1
x_points[:,1] = np.arange(1,10,1)
x_points[:,2] = np.arange(1,10,1)
y_points = np.arange(3,30,3) + 1
X = T.vector('X')
Y = T.scalar('Y')
W = theano.shared(
value=np.zeros(
(3,1),
dtype=theano.config.floatX
),
name='W',
borrow=True
)
out = T.dot(X, W)
predict = theano.function(inputs=[X], outputs=out)
y = predict(X) # y = T.dot(X, W) work fine
cost = T.mean(T.sqr(y-Y))
gradient=T.grad(cost=cost,wrt=W)
updates = [[W,W-gradient*0.01]]
train = theano.function(inputs=[X,Y], outputs=cost, updates=updates, allow_input_downcast=True)
for i in np.arange(x_points.shape[0]):
print "iteration" + str(i)
train(x_points[i,:],y_points[i])
sample = np.arange(x_points.shape[0])+1
y_p = np.dot(x_points,W.get_value())
plt.plot(sample,y_p,'r-',sample,y_points,'ro')
plt.show()
What is the explanation behind this error? (didn't got from the error message). Thanks in Advance.
There's an important distinction in Theano between defining a computation graph and a function which uses such a graph to compute a result.
When you define
out = T.dot(X, W)
predict = theano.function(inputs=[X], outputs=out)
you first set up a computation graph for out in terms of X and W. Note that X is a purely symbolic variable, it doesn't have any value, but the definition for out tells Theano, "given a value for X, this is how to compute out".
On the other hand, predict is a theano.function which takes the computation graph for out and actual numeric values for X to produce a numeric output. What you pass into a theano.function when you call it always has to have an actual numeric value. So it simply makes no sense to do
y = predict(X)
because X is a symbolic variable and doesn't have an actual value.
The reason you want to do this is so that you can use y to further build your computation graph. But there is no need to use predict for this: the computation graph for predict is already available in the variable out defined earlier. So you can simply remove the line defining y altogether and then define your cost as
cost = T.mean(T.sqr(out - Y))
The rest of the code will then work unmodified.