How do I build an SVM classifier in Weka that considers only certain features in the data set?

I am new to WEKA and I am working on an assignment whose data set currently has 128 different features. I am told to build an SVM classifier on the data, and the classifier should:
1. consider only the top 10 "chi-square" features in the data;
2. select those top 10 features fresh in each fold of 10-fold cross validation.
I have already obtained the top 10 features using the ChiSquaredAttributeEval evaluator. How do I go about building an SVM with points 1 and 2 in mind?
Edit: How do I show the WEKA 10-fold cross validation workflow diagram for this classifier as well?

Use the following classifier setup to encapsulate the attribute selection step and the training of your support vector machine on the reduced dataset; because the filter lives inside the classifier, it is re-applied on each training fold when you run 10-fold cross validation, which satisfies point 2:
meta.FilteredClassifier
|
+- filter: supervised.attribute.AttributeSelection (ChiSquaredAttributeEval + search method)
|
+- classifier: functions.SMO
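To make the behaviour concrete, here is a minimal scikit-learn analogue of that setup (a sketch only, not the Weka API; X and y stand in for your own feature matrix and labels): a top-10 chi-square selector and a linear SVM wrapped in one pipeline, so cross_val_score refits the selector on each training fold, just as Weka's FilteredClassifier re-applies its filter inside every fold.

# Sketch: scikit-learn analogue of FilteredClassifier(AttributeSelection + SMO).
# X (n_samples x 128, non-negative values as chi2 requires) and y are assumed placeholders.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=10)),  # top 10 chi-square features
    ("svm", SVC(kernel="linear")),                   # SVM trained on the reduced data
])

# 10-fold cross validation: the selector is refitted on each training fold,
# so the "top 10" can differ from fold to fold.
scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())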

Related

Evaluate Model Node in Azure ML Studio does not take all the rows of the dataset in confusion matrix

I have this dataset in which the positive class consists of component failures for a specific component of the APS system.
I am doing Predictive Maintenance using Microsoft Azure Machine Learning Studio.
As you can see from the pictures below, I am using 4 algorithms: Logistic Regression, Random Forest, Decision Tree and SVM. You can see that the output dataset of the Score Model node consists of 16k rows. However, when I look at the output of the Evaluate Model node, the confusion matrix contains only 160 observations for Logistic Regression, while Random Forest has the correct number, 16k. I have the same problem, only 160 observations, for the Decision Tree and SVM models. The same problem is repeated in other experiments, for example after feature selection, normalization, etc.: some Evaluate Model nodes do not use all the rows of the test dataset, while others do.
How can I fix this problem? I am interested in the real number of false positives and false negatives.
The output metrics shown are based on the validation set (e.g. "validation metric", "val-accuracy"). All the metrics computed and displayed are calculated over the validation set only, not over the original training set; otherwise we would inflate the model's performance by evaluating on data already used to train it.

Random forest sklearn- OOB score

What is the difference between including oob_score=True and not including oob_score in RandomForestClassifier in sklearn in Python? The out-of-bag (OOB) error is the average error for each sample, calculated using predictions from the trees that do not contain that sample in their respective bootstrap sample, right? So how does including the parameter oob_score=True affect the calculation of the average error?
For each tree, only a share of the data is selected for building the tree, i.e. for training. The remaining samples are the out-of-bag samples. These out-of-bag samples can be used during training to compute a test-like accuracy. If you activate the option, the oob_score_ attribute (and the per-sample OOB predictions: oob_decision_function_ for a classifier, oob_prediction_ for a regressor) will be computed.
The trained model does not change whether or not you activate the option. Obviously, due to the random nature of RF, the model will not be exactly the same if you fit it twice, but that has nothing to do with the oob_score option.
Unfortunately, the scikit-learn option does not allow you to set the OOB ratio, i.e. the percentage of samples used to build each tree. This is possible in other libraries (e.g. C++ Shark, http://image.diku.dk/shark/sphinx_pages/build/html/rest_sources/tutorials/algorithms/rf.html).
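As a minimal illustration (using sklearn's built-in iris data purely as placeholder input), the only difference is that the fitted forest exposes its OOB estimate:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # placeholder data, just for illustration

# Same forest twice; only the bookkeeping of OOB predictions differs.
rf_plain = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

print(rf_oob.oob_score_)                    # accuracy estimated from out-of-bag samples
print(rf_oob.oob_decision_function_.shape)  # per-sample OOB class probabilities
# rf_plain has no oob_score_ attribute; everything else about the model is unchanged.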

Feature selection on a keras model

I was trying to find the features that contribute most to the output of my regression model. Following is my code.
import numpy as np
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE

seed = 7
np.random.seed(seed)

# baseline_model, X_set and Y_set are defined elsewhere in my script
estimators = []
estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, epochs=3, batch_size=20)))
pipeline = Pipeline(estimators)
rfe = RFE(estimator=pipeline, n_features_to_select=5)
fit = rfe.fit(X_set, Y_set)
But I get the following runtime error when running.
RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
How can I overcome this issue and select the best features for my model? If that is not possible, can I use algorithms like LogisticRegression(), which are supported by RFE in scikit-learn, to find the best features for my dataset?
I assume your Keras model is some kind of a neural network. And with NN in general it is kind of hard to see which input features are relevant and which are not. The reason for this is that each input feature has multiple coefficients that are linked to it - each corresponding to one node of the first hidden layer. Adding additional hidden layers makes it even more complicated to determine how big of an impact the input feature has on the final prediction.
On the other hand, for linear models it is very straightforward since each feature x_i has a corresponding weight/coefficient w_i and its magnitude directly determines how big of an impact it has in prediction (assuming that features are scaled of course).
The RFE estimator (Recursive Feature Elimination) assumes that your prediction model has an attribute coef_ (linear models) or feature_importances_ (tree models) that has the length of the input features and represents their relevance (in absolute terms).
My suggestion:
1. Feature selection:
(a) Run the RFE on any linear / tree model to reduce the number of features to some desired number n_features_to_select.
(b) Use regularized linear models like lasso / elastic net that enforce sparsity. The problem here is that you cannot directly set the actual number of selected features.
(c) Use any other feature selection technique from here.
2. Neural network: use only the features from (1) for your neural network (a sketch of option (a) follows below).
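A rough sketch of option (a), under stated assumptions (a plain LinearRegression as the stand-in selector, and X_set / Y_set / baseline_model as in the question, with baseline_model assumed to build a network that accepts 5 inputs): the selection step runs outside Keras, and only the reduced matrix is fed to the network.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# 1) Feature selection with a model that exposes coef_ (here: ordinary least squares).
selector = RFE(estimator=LinearRegression(), n_features_to_select=5)
selector.fit(X_set, Y_set)

X_reduced = selector.transform(X_set)   # keep only the 5 selected columns
print(selector.support_)                # boolean mask of the selected features

# 2) Train the Keras model on the reduced feature set only.
keras_model = baseline_model()          # assumed to build a net with 5 inputs
keras_model.fit(X_reduced, Y_set, epochs=3, batch_size=20)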
Suggestion:
Perform the RFE algorithm with a sklearn-based estimator to observe feature importance, then use the most important observed features to train your Keras-based algorithm.
To your question: standardization is not required for logistic regression.

Azure Machine Learning Decision Tree output

Is there any way to get the output of the Boosted Decision Tree module in ML Studio, so I can analyze the learned tree as I would in Weka?
Update: visualization of decision trees is available now! Right-click on the output node of the "Train Model" module and select "Visualize".
My old answer:
I'm sorry; visualization of decision trees isn't available yet. (I really want it too! You can upvote this feature request at http://feedback.azure.com/forums/257792-machine-learning/suggestions/7419469-show-variable-importance-after-experiment-runs, but they are currently working on it.)
Just FYI, you can currently see what the model builds for linear algorithms by right-clicking on the "Train Model" module output node and selecting "Visualize". It will show the initial parameter values and the feature weights. But for non-linear algorithms like decision trees, that visibility is still forthcoming.
Yes. I don't know your exact structure, but you should have your dataset and the algorithm going into a Train Model module, and then feed the results of the Train Model, together with the other half of the dataset (if you used Split), into a Score Model module. You can see the scored labels and scored probabilities there when you press Visualize.
Your experiment should look roughly like this: connect the boosted decision tree and the dataset to a Train Model module, and you can then see the results in the Score Model module.

feature selection and cross validation

I want to train a regression model, and to do so I use random forest models. However, I also need to do feature selection because I have so many features in my dataset, and I'm afraid that if I used all the features I would overfit. To assess the performance of my model I also perform 5-fold cross validation. Which of the following two approaches is right, and why?
1- Should I split the data into two halves, do feature selection on the first half, and then use these selected features to do 5-fold cross validation (CV) on the remaining half (in this case the 5 CV folds will use exactly the same selected features)?
2- Or do the following procedure:
1- split the data into 4/5 for training and 1/5 for testing
2- split this training data (the 4/5 of the full data) into two halves:
a) on the first half, train the model and use the trained model to do feature selection.
b) use the selected features from the first part to train the model on the second half of the training dataset (this will be our final trained model).
3- test the performance of the model on the remaining 1/5 of the data (which is never used in the training phase)
4- repeat the previous steps 5 times, each time randomly (without replacement) splitting the data into 4/5 for training and 1/5 for testing
My only concern is that in the second procedure we will have 5 models, and the features of the final model will be the union of the top features of these five models, so I'm not sure whether the performance of the 5-fold CV reflects the final performance of the final model, especially since the final model has different features than each model in the 5-fold CV (because it is the union of the selected features of each of those models).
Cross validation should always be the outermost loop in any machine learning algorithm.
So, split the data into 5 sets. For every set you choose as your test set (1/5), fit the model after doing feature selection on the training set (the remaining 4/5). Repeat this for all the CV folds - here you have 5 folds.
Once the CV procedure is complete, you have an estimate of your model's accuracy, which is simply the average of the individual CV folds' accuracies.
As far as the final set of features for training the model on the complete data is concerned, do the following to select them:
-- Each time you do a CV on a fold as outlined above, vote for the features that you selected in that particular fold. At the end of the 5-fold CV, select a particular number of features that have the top votes.
Use that selected set of features to do one final feature selection procedure, then train the model on the complete data (all 5 folds combined) and move the model to production.
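A minimal sketch of that outer-CV-with-per-fold-selection idea, under assumed names (X and y as the full data set, importance-based selection with a RandomForestRegressor as the stand-in selector, and 10 kept features as an arbitrary choice):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import r2_score

kf = KFold(n_splits=5, shuffle=True, random_state=0)
votes = np.zeros(X.shape[1])   # X, y: the full data set (assumed names)
scores = []

for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    # Feature selection happens inside the fold, on the training part only.
    selector = SelectFromModel(RandomForestRegressor(n_estimators=200, random_state=0),
                               max_features=10, threshold=-np.inf).fit(X_tr, y_tr)
    mask = selector.get_support()
    votes += mask                                  # vote for the features picked in this fold

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_tr[:, mask], y_tr)
    scores.append(r2_score(y_te, model.predict(X_te[:, mask])))

print("CV estimate:", np.mean(scores))

# Final feature set: e.g. the 10 features with the most votes, then refit on all the data.
final_mask = np.argsort(votes)[::-1][:10]
final_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:, final_mask], y)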
Do the CV on the full data (split it into 5 parts and use a different combination of parts for every split), do your feature selection on the CV splits, and then run your RF on the output of the selection.
Why: because CV checks your model under different data splits so that it doesn't overfit. Since the feature selection can be viewed as part of your model, you have to check it for overfitting too.
After you have validated your model with CV, fit it on your whole data and use the transform of this single final model.
Also, if you are worried about overfitting you should limit the RF in either depth or number of trees. CV is mostly used as a tool in the development process of a model; for the final model, all of the data is used.
