How to add residual values to my data frame? - statistics

I have residual values in my model results, and I would like to transform them and add them to my data frame, that is, to obtain the results with low residual values. I'm using an FE (fixed-effects) model. I've tried to add them, but my method doesn't work. Here are my R details:
fixed = plm(sp ~ lag(debt) + lag(I(debt^2)) + outgp + bcour + gvex + vlexp + vlimp, data = bdata, index = c("country", "year"), model = "within")
My method for adding the residual values:
bdata$resid = fixed$resid
Thanks for your help.

Related

ValueError: shapes (5,14) and (16,) not aligned: 14 (dim 1) != 16 (dim 0)

I am working on the housing dataset, and when trying to fit the linear regression model I get the error mentioned in the title. The complete code is below.
I am not sure where the code is going wrong. I pasted it as-is from the reference book.
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:\t", lin_reg.predict(some_data_prepared))
ERROR: ValueError: shapes (5,14) and (16,) not aligned: 14 (dim 1) != 16 (dim 0)
What am I doing wrong here?
Explanation
Hi, I guess you are reading and following the Hands-On Machine Learning with Scikit-Learn and TensorFlow book. The same problem occurred to me.
In the following part of the code you select the first 5 instances from the data set. One of the attributes in the data set, ocean_proximity, is an object, and for the linear regression model to be able to operate on it, it must be translated into numbers, which in the book is done with one-hot encoding.
One-hot encoding works by analyzing all the categories that can be assigned to the attribute, in this case 5 ('<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND'), and then creating a vector of that length for each instance, zeroing every element except the one for that instance's category, which is set to 1. For example:
If ocean_proximity equals '<1H OCEAN' the conversion would be [1, 0, 0, 0, 0]
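A minimal illustrative sketch of this idea (not the book's exact pipeline; it assumes scikit-learn's OneHotEncoder and made-up variable names):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

ocean_proximity = np.array([['<1H OCEAN'], ['INLAND'], ['NEAR OCEAN'],
                            ['NEAR BAY'], ['ISLAND']])
encoder = OneHotEncoder()
print(encoder.fit_transform(ocean_proximity).toarray())  # one 0/1 column per category
print(encoder.categories_)                               # the categories the encoder learned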
In the following piece of code you select the first five instances of the data set, but this does not guarantee that all the categories of "ocean_proximity" will appear. It could happen that only 3 of them appear, or just 1. Therefore, if you apply one-hot encoding to those five selected rows and only 3 categories appear (for example just 'INLAND', 'ISLAND' and 'NEAR BAY'), the vectors created by the one-hot encoding will be of length 3.
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
The error is just telling you that, since the one-hot conversion of some_data created vectors of length smaller than 5, some_data_prepared has 14 columns in total, which is fewer than the 16 columns in housing_prepared, so the model is unable to predict the prices.
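As a quick check, you can compare the shapes directly (a small diagnostic sketch reusing the variables from the question):
print(housing_prepared.shape)    # e.g. (16512, 16)
print(some_data_prepared.shape)  # e.g. (5, 14) -> one-hot columns are missing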
If you transform both some_data_prepared and housing_prepared into dataframes and then call .head() you will see the problem.
pd.DataFrame(some_data_prepared).head()
pd.DataFrame(housing_prepared).head()
Solution
To solve the problem you must create the columns missing from some_data_prepared by creating a zeroed NumPy array of shape [5, x] (where 5 is the number of rows and x the number of missing columns) and concatenating it to some_data_prepared so that its shape matches that of the housing_prepared data set.
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.fit_transform(some_data)
dummy_array = np.zeros((5,1))
some_data_prepared = np.c_[some_data_prepared, dummy_array]
predictions = lin_reg.predict(some_data_prepared)
print("Predictions: ", predictions)
print("Labels: ", some_labels.values)
The issue is that some_data is missing category values (of ocean_proximity in this case) that are present in housing_prepared.
housing_prepared.shape gives (16512, 16), but some_data_prepared.shape gives (5,14), so add zeros for the missing columns:
dummy_array = np.zeros((5,2))
some_data_prepared = np.c_[some_data_prepared,dummy_array]
The 2 in np.zeros is the difference in the number of columns.
I at first encountered the same issue on this piece of code. After exploring the issues of the handson-ml repository, I think I have understood the subtlety that is causing the error here.
My guess is that (as in my case) closing the notebook caused what was in memory (the trained pipeline in particular) to be lost. In my case, I could get the result and avoid the error by rerunning the notebook from the beginning.
More importantly, from a theoretical viewpoint, you should never call fit() or fit_transform() on data that is not training data (e.g. on some_data). Here, running fit_transform(some_data) and then stacking the dummy array onto some_data_prepared works, but it forces the preprocessing to be fitted again on some_data rather than on housing_prepared, which is not what you want.
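For completeness, a minimal sketch of that recommended flow, reusing the variables from the question (fit everything once on the full training data, then only transform new data):
# Fit the preprocessing pipeline and the model on the full training data only.
housing_prepared = full_pipeline.fit_transform(housing)
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
# For new data, only call transform(), never fit_transform(), so the one-hot
# columns match those learned from the full training set.
some_data = housing.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))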

How to use Principal Component Analysis while predicting?

Suppose my original dataset has 8 features and I apply PCA with n_components = 3 (I am using sklearn.decomposition.PCA). Then I train my model using those 3 PCA components (which are now my new features).
Do I need to apply PCA while predicting as well?
Do I need to do that even if I am predicting only one data point?
What confuses me is that when I do prediction, every data point is a row in a 2D matrix (consisting of all the data points that I want to predict). So if I apply PCA on just one data point, the corresponding row vector will be converted to a zero vector.
If you fitted your model on the first three components of the PCA, you have to transform any new data appropriately. For example, consider this code taken from here:
pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
In the code, they first fit the PCA on the training data. Then they transform both the training and the test sets, and then they apply the model (in their case, an SVM) to the transformed data.
Even if your X_test consists of only one data point, you can still use PCA. Just pass your data as a 2D matrix. For example, if your data point is [1,2,0,5], then X_test = [[1,2,0,5]], that is, a 2D matrix with one row.
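For instance, a minimal sketch (hypothetical sample values, assuming pca and clf were fitted as above and the new sample has the same number of original features as X_train):
import numpy as np

x_new = np.array([[1, 2, 0, 5]])   # one sample written as a 2D array with a single row
x_new_pca = pca.transform(x_new)   # project with the PCA fitted on the training data
print(clf.predict(x_new_pca))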

How to train an SVM on my edge images using Java code

I have a set of images on which I performed edge detection using OpenCV 3.1. The edges are stored in OpenCV Mat objects. Can someone help me with Java SVM training and testing code for that set of images?
Following the discussion in the comments, I am providing you with an example project which I built for Android Studio a while back.
It was used to classify images based on the Lab color space.
//1.a Assign the parameters for SVM training here
double nu = 0.999D;
double gamma = 0.4D;
double epsilon = 0.01D;
double coef0 = 0;
//kernel types are Linear(0), Poly(1), RBF(2), Sigmoid(3), Chi2(4), Inter(5)
//For Poly(1) set degree and gamma
double degree = 2;
int kernel_type = 4; // Chi2 kernel
//1.b Create an SVM object
SVM B_channel_svm = SVM.create();
B_channel_svm.setType(104); // 104 = SVM.NU_SVR
B_channel_svm.setNu(nu);
B_channel_svm.setCoef0(coef0);
B_channel_svm.setKernel(kernel_type);
B_channel_svm.setDegree(degree);
B_channel_svm.setGamma(gamma);
B_channel_svm.setTermCriteria(new TermCriteria(2, 10, epsilon));
// Repeat Step 1.b for the number of SVMs.
//2. Train the SVM
// Note: training_data - If your image has n rows and m columns, you have to make a matrix of size (n*m, o), where o is the number of labels.
// Note: Label_data is same as above, n rows and m columns, make a matrix of size (n*m, o) where o is the number of labels.
// Note: Very Important - Train the SVM for the entire data as training input and the specific column of the Label_data as the Label. Here, I train the data using B, G and R channels and hence, the name B_channel_SVM. I make 3 different SVM objects separately but you can do this by creating only one object also.
B_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(0));
G_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(1));
R_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(2));
// Now after training we "predict" the outcome for a sample from the trained SVM. But first, lets prepare the Test data.
// As above for the training data, make a matrix of (n*m, o) and use the columns to predict. So, since I created 3 different SVMs, I will input three separate matrices for the three SVMs of size (n*m, 1).
//3. Predict the testing data outcome using the trained SVM.
B_channel_svm.predict(scene_ml_input, predicted_final_B, StatModel.RAW_OUTPUT);
G_channel_svm.predict(scene_ml_input, predicted_final_G, StatModel.RAW_OUTPUT);
R_channel_svm.predict(scene_ml_input, predicted_final_R, StatModel.RAW_OUTPUT);
//4. Here, predicted_final_B/G/R are matrices which give you the final value, i.e. the label (0, 1, 2, etc.), for the input data (the edge profile in your case)
Now, I hope you have an idea of how the SVM works. You basically need to do these steps:
Step 1: Identify the labels - in your case, gestures from the edge profiles.
Step 2: Assign values to the labels - for example, if you are trying to classify haptic gestures: Open Hand = 1, Closed Hand/Fist = 2, Thumbs up = 3, and so on.
Step 3: Prepare the training data (edge profiles) and labels (1, 2, 3, etc.) according to the process above.
Step 4: Prepare the data for prediction in the same format and predict using the trained SVMs.
Very important for SVM on OpenCV: normalize your data and make sure all your matrices are of the same type (CvType).
Hope it helps. Feel free to ask questions if you have any doubts, and post what you have tried. I could solve the problem for you if you send me some images, but then you won't learn anything, right? ;)

scikit-learn: Is there a way to provide an object as input to the predict function of a classifier?

I am planning to use an SGDClassifier in production. The idea is to train the classifier on some training data, use cPickle to dump it to a .pkl file, and reuse it later in a script. However, there are certain high-cardinality fields which are categorical in nature and are translated to a one-hot matrix representation, which creates around 5000 features. Now the input that I get for predict will have only one of these features set, and all the rest will be zeros. It will of course also include the other numerical features apart from this. From the docs, it appears that the predict function expects an array of arrays as input. Is there any way I can transform my input to the format expected by the predict function without having to store the fields every time I train the model?
Update
So, let us say my input contains 3 fields:
{
  rate: 10,     // numeric
  flagged: 0,   // binary
  host: 'somehost.com'  // keeping this categorical
}
host can have around 5000 different values. I loaded the file into a pandas dataframe and used the get_dummies function to transform the host field into around 5000 new binary fields.
Then I trained my model and stored it using cPickle.
Now, when I need to use the predict function, I only have the 3 fields (shown above) as input. However, as per my understanding, predict will expect an array of vectors, and each vector is supposed to have those 5000 fields.
For the entry that I need to predict, I know only one field for that entry which will be the value of host itself.
For example, if my input is
{
  rate: 5,
  flagged: 1,
  host: 'new_host.com'
}
I know that the fields expected by the predict should be:
{
  rate: 5,
  flagged: 1,
  new_host: 1
}
But if I translate it to vector format, I don't know at which index to place the new_host field. Also, I don't know in advance what the other hosts are (unless I store them somewhere during the training phase).
I hope I am making some sense. Let me know if I am doing it the wrong way.
I don't know which index to place the new_host field
A good approach that has worked for me is to build a pipeline which you then use for training and prediction. This way you do not have to concern yourself with the column index of whatever output is produced by your transformation:
# in training
pipl = Pipeline(steps=[('binarizer', LabelBinarizer()),
                       ('clf', SGDClassifier())])
model = pipl.fit(X, Y)
pickle.dump(model, mf)
# in production
model = pickle.load(mf)
y = model.predict(X)
As X, Y inputs you need to pass an array-like object. Make sure the input has the same structure for both training and test, e.g.
X = [[data.get('rate'), data.get('flagged'), data.get('host')]]
Y = [[y-cols]] # your example doesn't specify what is Y in your data
More flexible: Pandas DataFrame + Pipeline
What also works nicely is to use a Pandas DataFrame in combination with sklearn-pandas as it allows you to use different transformations on different column names. E.g.
df = pd.DataFrame.from_dict(data)
mapper = DataFrameMapper([
    ('host', sklearn.preprocessing.LabelBinarizer()),
    ('rate', sklearn.preprocessing.StandardScaler())
])
pipl = Pipeline(steps=[('mapper', mapper),
                       ('clf', SGDClassifier())])
X = df[x-cols]
y = df[y-col(s)]
pipl.fit(X, y)
Note that x-cols and y-col(s) are the list of the feature and target columns respectively.
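At prediction time, a minimal sketch (hypothetical record; it assumes the pipeline above was fitted on the training DataFrame and, in production, loaded from the pickle as in the first snippet) could look like:
import pandas as pd

# One incoming record with only the raw fields. The fitted mapper inside the
# pipeline expands 'host' into the same one-hot columns it learned in training,
# so you never have to track column indices yourself.
new_record = pd.DataFrame([{'rate': 5, 'flagged': 1, 'host': 'somehost.com'}])
prediction = pipl.predict(new_record)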
You should use a scikit-learn transformer instead of get_dummies. In this case, LabelBinarizer makes sense. Seeing as LabelBinarizer doesn't work in a pipeline, this is one way to do what you want:
binarizer = LabelBinarizer()
# fitting LabelBinarizer means it remembers all the columns it's seen
one_hot_data = binarizer.fit_transform(X_train[:, categorical_col])
# replace string column with one-hot representation
X_train = np.concatenate([np.delete(X_train, categorical_col, axis=1),
                          one_hot_data], axis=1)
clf = SGDClassifier()
clf.fit(X_train, y)
pickle.dump({'clf': clf, 'binarizer': binarizer}, f)
then at prediction time:
estimators = pickle.load(f)
clf = estimators['clf']
binarizer = estimators['binarizer']
one_hot_data = binarizer.transform(X_test[:, categorical_col])
X_test = np.concatenate([np.delete(X_test, categorical_col, axis=1),
                         one_hot_data], axis=1)
clf.predict(X_test)

Scikit-Learn Linear Regression: how to get the coefficients' respective features?

I'm trying to perform feature selection by evaluating my regression's coefficient outputs and selecting the features with the highest-magnitude coefficients. The problem is, I don't know how to get the respective features, as only coefficients are returned from the coef_ attribute. The documentation says:
Estimated coefficients for the linear regression problem. If multiple
targets are passed during the fit (y 2D), this is a 2D array of
shape (n_targets, n_features), while if only one target is passed,
this is a 1D array of length n_features.
I am calling regression.fit(A, B), where A is a 2-D array with the tf-idf value for each feature in a document. Example format:
"feature1" "feature2"
"Doc1" .44 .22
"Doc2" .11 .6
"Doc3" .22 .2
B are my target values for the data, which are just numbers 1-100 associated with each document:
"Doc1" 50
"Doc2" 11
"Doc3" 99
Using regression.coef_, I get a list of coefficients, but not their corresponding features! How can I get the features? I'm guessing I need to modify the structure of my B targets, but I don't know how.
What I found to work was:
# X = your independent variables (e.g. the DataFrame of features)
coefficients = pd.concat([pd.DataFrame(X.columns), pd.DataFrame(np.transpose(logistic.coef_))], axis=1)
The assumption you stated, that the order of regression.coef_ is the same as the column order in the TRAIN set, holds true in my experience (it works with the underlying data and also checks out against correlations between X and y).
You can do that by creating a data frame:
cdf = pd.DataFrame(regression.coef_, X.columns, columns=['Coefficients'])
print(cdf)
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(logistic.coef_)})
I suppose you are working on a feature selection task. Using regression.coef_ does give you the coefficients in the same order as the features, i.e. regression.coef_[0] corresponds to "feature1" and regression.coef_[1] corresponds to "feature2". This should be what you desire.
I would in turn recommend the tree models from sklearn, which can also be used for feature selection. To be specific, check out here.
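For illustration, a minimal sketch of such a tree-based selection (assumed variable names; it uses SelectFromModel with a random forest, which may differ from the exact example linked above):
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# A is the tf-idf feature DataFrame and B the 1-100 targets from the question.
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=0))
selector.fit(A, B)
print(A.columns[selector.get_support()])  # names of the selected features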
Coefficients and features in zip
print(list(zip(X_train.columns.tolist(),logreg.coef_[0])))
Coefficients and features in DataFrame
pd.DataFrame({"Feature":X_train.columns.tolist(),"Coefficients":logreg.coef_[0]})
This is the easiest and most intuitive way:
pd.DataFrame(logisticRegr.coef_, columns=x_train.columns)
or the same but transposing index and columns
pd.DataFrame(logisticRegr.coef_, columns=x_train.columns).T
Suppose your training data X variable is 'df_X'; then you can zip the columns and coefficients into a dictionary and feed it into a pandas DataFrame to get the mapping:
pd.DataFrame(dict(zip(df_X.columns,model.coef_[0])),index=[0]).T
Try putting them in a Series with the data column names as the index:
coeffs = pd.Series(model.coef_[0], index=X.columns.values)
coeffs.sort_values(ascending = False)
