I have run PCA on a dataset with 840 features, and it takes 460 components to explain 70% of the variance. I have fitted the PCA model and obtained the transformed PCA features.
What is the next step to get a dataset that contains only these features?
I am doing this in PySpark.
I intend to apply segmentation to the new, reduced-feature dataset.
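In case it helps, here is a minimal PySpark sketch of that next step, assuming a DataFrame df with an id column and the 840 input columns listed in feature_cols (those names are placeholders): fit PCA, project, and keep only the projected vector for the segmentation step.
from pyspark.ml.feature import PCA, VectorAssembler

# Pack the original 840 columns into a single vector column
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled_df = assembler.transform(df)

# Fit PCA with k=460 components and project every row onto them
pca = PCA(k=460, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(assembled_df)

# Keep only the identifier and the 460-dimensional PCA vector;
# this reduced DataFrame is what goes into the clustering/segmentation step
reduced_df = pca_model.transform(assembled_df).select("id", "pca_features")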
I am training a Logistic Regression classifier in sklearn and using RFECV to reduce the number of features. I have 10,000 data items with 3000 features each. Using RFECV, I get 106 features. The code for this is shown below:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV

clf = LogisticRegression(solver='lbfgs', max_iter=10000)
# Eliminate 10% of the remaining features at each iteration
rfecv = RFECV(clf, step=0.1, verbose=True)
rfecv = rfecv.fit(X_train, y_train)
# Keep only the selected columns, then refit the classifier on them
X_train = X_train[:, rfecv.support_]
clf.fit(X_train, y_train)
And my stats (accuracy, precision, recall and F1 score) all improve (a bit) with 106 vs. 3000 features. However, I did another test to see if I could zero out some of these 106 coefficients. So I just set all the coefficients whose absolute value is below a certain threshold to 0 and I see this:
As I increase the threshold, I zero out more weights (the percentage of zero weights is shown by the diamond points). The maximum absolute value of the coefficients is 1.62, so a threshold of 1.7 zeroes 100% of them, which is why the 1.6 and 1.7 lines make sense.
But the stats stay fairly steady as I zero more weights, until the threshold reaches about 1.2, by which point I have already zeroed out 93% of the coefficients. I expected a more gradual decrease in the stats from 0 up to 1.6, but instead there is a sharp change only around 1.2-1.3.
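For reference, a rough sketch of the thresholding experiment described above; it assumes a held-out X_test / y_test with the same selected columns, and the threshold grid is illustrative rather than the exact one used.
import numpy as np
from sklearn.metrics import f1_score

original_coefs = clf.coef_.copy()
X_test_sel = X_test[:, rfecv.support_]  # same 106 columns as the reduced X_train

for threshold in np.arange(0.0, 1.8, 0.1):
    # Zero every coefficient whose absolute value falls below the threshold
    clf.coef_ = np.where(np.abs(original_coefs) < threshold, 0.0, original_coefs)
    zeroed = 100 * np.mean(clf.coef_ == 0)
    score = f1_score(y_test, clf.predict(X_test_sel))
    print(f"threshold={threshold:.1f}  zeroed={zeroed:.0f}%  F1={score:.3f}")

clf.coef_ = original_coefs  # restore the fitted weights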
So am I doing something wrong with sklearn and the way I use RFECV? Or is this something about logistic regression that I'm not understanding? Or is it simply that, for this dataset, I can predict the class just as well with 3000, 100, or even just 5 features?
I have this dataset in which the positive class consists of component failures for a specific component of the APS system.
I am doing Predictive Maintenance using Microsoft Azure Machine Learning Studio.
As you can see from the pictures below, I am using four algorithms: Logistic Regression, Random Forest, Decision Tree and SVM. The output dataset in the Score Model node contains 16k rows. However, when I look at the output of the Evaluate Model node, the confusion matrix contains only 160 observations for Logistic Regression, while Random Forest shows the correct number, 16k. I have the same problem (only 160 observations) with the Decision Tree and SVM models, and it repeats in other experiments, for example after feature selection or normalization: some Evaluate Model nodes do not use all the rows of the test dataset, while others do.
How can I fix this? I am interested in the true number of false positives and false negatives.
The output metrics shown are based on the validation set (e.g. “validation metric”, “val-accuracy”). All the metrics that are computed and displayed refer to the validation set, not to the original training set; evaluating on data already used to train the model would inflate its apparent performance.
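As a small illustration of why the distinction matters (the dataset and model here are arbitrary placeholders, not the ones from the question):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# The training score is optimistic because the model has already seen those rows;
# the validation score is the one worth reporting.
print("train accuracy:", model.score(X_train, y_train))
print("val accuracy:  ", model.score(X_val, y_val))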
I have this protein dataset on which I need to perform RFE. There are 100 examples with binary class labels (sick = 1, healthy = 0) and 9847 features per example. To reduce the dimensionality I am running RFECV with a LogisticRegression estimator and 5-fold CV. This is the code:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

model = LogisticRegression()
# Drop one feature per iteration, scored with stratified 5-fold CV
rfecv = RFECV(estimator=model, step=1, cv=StratifiedKFold(5), n_jobs=-1)
rfecv.fit(X_train, y_train)
print("Number of features selected: %d" % rfecv.n_features_)
Number of features selected: 9847
I then plot the number of features vs the CV scores:
import matplotlib.pyplot as plt

plt.figure()
plt.xlabel("feature count")
plt.ylabel("CV accuracy")
# grid_scores_ holds one mean CV score per number of selected features
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
What I think is happening (and this is what I need an expert for) is that the first peak shows the optimal number of features. After that the curve drops and only starts to climb again because of overfitting, not really separating classes but individual examples. Could this be the case? And if so, how can I obtain these features (i.e. the ones at that first peak)? rfecv.support_ only gives me the ones where the highest accuracy was reached (meaning: all of them).
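If the first peak really is the sweet spot, one way to recover those features is to rerun plain RFE with n_features_to_select fixed at the peak position read off the plot; a minimal sketch (the value 50 below is only a placeholder for whatever the peak turns out to be):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

n_at_peak = 50  # placeholder: the feature count at the first peak of the CV curve

rfe = RFE(estimator=LogisticRegression(), n_features_to_select=n_at_peak, step=1)
rfe.fit(X_train, y_train)

peak_mask = rfe.support_            # boolean mask over the original 9847 columns
X_train_peak = X_train[:, peak_mask]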
And while I am at it: how would I choose the best estimator for the RFE? Is it just trial and error, going through all possible classifiers, or is there some logic as to why I would use logistic regression over a linear SVC, for example?
One approach I use for feature relevance is a RandomForest or ExtraTrees (extremely randomized trees) model.
You can also use:
rfecv.n_features_
to see how many features it selected, and:
rfecv.ranking_
to see the ranking of each feature (selected features are ranked 1). Another algorithm you can use is PCA to reduce the dimensionality of your dataset.
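As a rough sketch of that tree-based route (the estimator settings and the cut-off of 100 features are illustrative only):
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Rank the features by impurity-based importance and keep the top 100
order = np.argsort(forest.feature_importances_)[::-1]
top_features = order[:100]
X_train_reduced = X_train[:, top_features]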
I'm trying to use Caffe to reproduce the SGDClassifier and LogisticRegression linear models from sklearn. As we all know, in Caffe one "InnerProduct" layer plus one "SoftmaxWithLoss" layer represents a (multinomial) logistic regression, y = softmax(Wx + b).
I'm using the digits dataset from the sklearn datasets package, with 5/6 of the data-label pairs as the training set and the remaining 1/6 as the test set. However, SGDClassifier() and LogisticRegression() reach nearly 90% accuracy, while the accuracy of the two-layer network cannot exceed 30% after training. Is this because of the parameter settings or something else? The gap between them seems far too large.
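For context, a minimal sketch of the sklearn side of that comparison (the split and solver settings are my assumptions, not the exact code used):
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
# 5/6 of the data-label pairs for training, the remaining 1/6 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/6, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
sgd = SGDClassifier(loss="log_loss", max_iter=1000).fit(X_train, y_train)  # older sklearn: loss="log"

print("LogisticRegression accuracy:", logreg.score(X_test, y_test))
print("SGDClassifier accuracy:", sgd.score(X_test, y_test))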
I have a new question about scikit for you.
Classification problem, logistic regression as estimator.
I have my X dataset, with my features.
I want to run my algorithm through cross-validation, and I see two ways to do it. The first is to split my dataset manually into 5 subsets and iterate 5 times, each time leaving a different subset out for testing. I obtain my scores, but what I want now is the average of the coefficients, to plug into the estimator and predict on a new dataset. I read somewhere on Stack Overflow that it's possible to pass coefficients to the scikit-learn logistic regression estimator.
The other way is to use cross_val_score:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
lrmodel = LogisticRegression(penalty='l2', C=1)
cross_val_score(lrmodel, Xf, y, cv=5, scoring='neg_log_loss', verbose=0)  # older sklearn: scoring='log_loss'
gives me the cross-entropy from a cross-validation estimate. But what if I now want to take the average coefficients and use the estimator to predict on my new, still-unlabelled dataset?
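For what it's worth, a rough sketch of the manual route: fit one model per fold, average coef_ and intercept_, and write them onto a fresh estimator (X_new stands for the unlabelled dataset; whether averaging fold coefficients is statistically sound is a separate question):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

coefs, intercepts = [], []
for train_idx, _ in StratifiedKFold(n_splits=5).split(Xf, y):
    fold_model = LogisticRegression(penalty='l2', C=1).fit(Xf[train_idx], y[train_idx])
    coefs.append(fold_model.coef_)
    intercepts.append(fold_model.intercept_)

# Build an estimator carrying the averaged weights
avg_model = LogisticRegression(penalty='l2', C=1)
avg_model.classes_ = fold_model.classes_
avg_model.coef_ = np.mean(coefs, axis=0)
avg_model.intercept_ = np.mean(intercepts, axis=0)

predictions = avg_model.predict(X_new)  # X_new: the new, unlabelled dataset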
Thank you!