What is the main difference between Weka's RandomForest and scikit-learn's RandomForestClassifier? Are there any equivalent parameters between the two?

I'm considering using Weka for my data processing, so I'm testing Weka's algorithms on well-known datasets such as Kaggle's Titanic dataset. However, I noticed some accuracy differences when using the same classifier (Random Forest) on the same data. I want to know the equivalent hyperparameters between Weka's and scikit-learn's implementations of the random forest algorithm, so I can configure the algorithm in Weka the same way I used to in Python.
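As a starting point, here is a rough mapping, written as scikit-learn code with the corresponding Weka 3.8 flags in comments. Treat it as an approximation: the two implementations differ internally (bagging details, tie breaking), flag names vary across Weka versions, and the differing defaults for the number of features tried per split are a common source of the accuracy gap described above.

from sklearn.ensemble import RandomForestClassifier

# Approximate correspondence to weka.classifiers.trees.RandomForest (Weka 3.8)
clf = RandomForestClassifier(
    n_estimators=100,     # Weka: -I <numIterations>  (number of trees)
    max_features="sqrt",  # Weka: -K <numFeatures>; Weka's default (-K 0)
                          #   uses int(log2(n_features) + 1), not sqrt
    max_depth=None,       # Weka: -depth <maxDepth>   (0 = unlimited)
    min_samples_leaf=1,   # Weka: -M on the underlying RandomTree
    random_state=1,       # Weka: -S <seed>
)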

Related

Is there a comparable SKLearn RFClassifier argument to H2o's 'stopping_rounds'?

I'm translating a random forest built with H2O and R into one built with scikit-learn's RandomForestClassifier in Python. H2O's random forest model has an argument 'stopping_rounds'. Is there a way to do this in Python using the scikit-learn RandomForestClassifier model? I've looked through the documentation and couldn't find anything, so I'm afraid I might have to hard-code this.
No, I don't believe scikit-learn's algorithms have any automatic early-stopping mechanism, which is what stopping_rounds provides in H2O. You will have to figure out the optimal number of trees manually.
Per the scikit-learn random forest classifier docs, tree growth can be cut short via the min_impurity_split (deprecated) and min_impurity_decrease arguments. That isn't the same functionality as H2O's stopping_rounds, but it might be what you're looking for.
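If you want something closer to H2O's behaviour, a common workaround is to grow the forest incrementally with warm_start=True and monitor a validation score yourself. A minimal sketch, where the step size and patience logic are my own stand-ins rather than any library feature:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# With warm_start=True, raising n_estimators and refitting adds trees
# instead of retraining from scratch.
clf = RandomForestClassifier(warm_start=True, random_state=0)
best_score, best_n, patience, stalled = -1.0, 0, 3, 0
for n in range(25, 501, 25):
    clf.set_params(n_estimators=n)
    clf.fit(X_tr, y_tr)
    score = clf.score(X_val, y_val)
    if score > best_score:
        best_score, best_n, stalled = score, n, 0
    else:
        stalled += 1
        if stalled >= patience:   # no improvement for 3 rounds: stop
            break
print(f"best n_estimators: {best_n} (validation accuracy {best_score:.3f})")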

How to use feature selection and dimensionality reduction in Unsupervised learning?

I've been working on classifying emails from two authors. I was successful doing this with supervised learning, using TF-IDF vectorization of the text, PCA, and SelectPercentile feature selection, all via the scikit-learn package.
Now I want to try the same task with the unsupervised KMeans algorithm, clustering the emails into two groups. I have created a dataset in which each data point is a single line in a Python list. Since I'm a newbie to unsupervised learning, I wanted to ask whether I can apply the same dimensionality-reduction tools I used in the supervised case (TF-IDF, PCA and SelectPercentile). If not, what are their counterparts? I am using scikit-learn to code it up.
I looked around on Stack Overflow but couldn't find a satisfactory answer, and I'm really stuck at this point. Please help!
The following dimensionality-reduction techniques can be applied in unsupervised learning:
PCA: principal component analysis
Exact PCA
Incremental PCA
Approximate PCA
Kernel PCA
SparsePCA and MiniBatchSparsePCA
Random projections
Gaussian random projection
Sparse random projection
Feature agglomeration
Standard Scaler (strictly a preprocessing step often applied before PCA, not a reduction technique in itself)
These are some of the approaches that can be used for dimensionality reduction of large datasets in unsupervised learning. Note that TF-IDF itself needs no labels and carries over unchanged, whereas SelectPercentile scores features against class labels and so has no direct unsupervised equivalent. You can read more in the scikit-learn documentation on unsupervised dimensionality reduction.
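For text specifically, here is a sketch of a typical unsupervised pipeline: TruncatedSVD is the usual stand-in for PCA on sparse TF-IDF matrices (the TF-IDF plus SVD combination is exactly LSA), followed by KMeans. The document strings are placeholders:

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = ["first email text ...", "second email text ...", "third email ..."]

# TF-IDF needs no labels; TruncatedSVD handles sparse input, unlike PCA.
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),  # kept tiny for this toy example
    Normalizer(copy=False),        # cosine-like geometry for KMeans
)
X = lsa.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)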

Is it possible to obtain class probabilities using GradientBoostedTrees with spark mllib?

I am currently working with Spark MLlib.
I have created a text classifier using the gradient boosting algorithm with the class GradientBoostedTrees.
Currently I obtain predictions for new elements, but I would like the class probabilities (the value of the output before the hard decision).
In other MLlib algorithms, such as logistic regression, you can remove the threshold from the classifier to obtain class probabilities, but I cannot find a way to do the same with GradientBoostedTrees.
As far as I know it's not currently possible for gradient-boosted trees, but it is possible with random forest.
You can see this link, where I have explained a procedure: Predicting probabilities of classes in case of Gradient Boosting Trees in Spark using the tree output.
To implement the predicted probabilities and thresholds, one needs to write a program using the trees from the
print(model.toDebugString())
output. I worked through how the trees make their predictions, which is fairly simple to reproduce outside Spark.
It seems that in Spark MLlib it is not possible to obtain the class probabilities.
You can only obtain the final classification decision.
That's a pity, because that information would be very useful (classifying a sample as positive with 99.99% probability is not the same as doing so with 51%), and it is not difficult to obtain once the model has been trained.
An alternative is to use different software such as XGBoost: https://github.com/dmlc/xgboost

Transforming CountVectorizer with entropy (log-entropy) / sklearn

I would like to try out some variations on Latent Semantic Analysis (LSA) with scikit-learn. Besides pure frequency counts from CountVectorizer() and the weighted result of TfidfTransformer(), I'd like to test weighting by entropy (and log-entropy), which was used in the original LSA papers and is reported to perform very well.
Any suggestions on how to proceed? I know Gensim has an implementation (LogEntropyModel()) but would prefer to stick with scikit-learn.
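For reference, the log-entropy scheme from the LSA literature (and Gensim's LogEntropyModel) multiplies a local log(1 + tf) weight by a global per-term entropy weight. Here is a scikit-learn-style transformer sketch of that formula; this is my own code implementing that definition, not an existing scikit-learn class:

import numpy as np
import scipy.sparse as sp
from sklearn.base import BaseEstimator, TransformerMixin

class LogEntropyTransformer(BaseEstimator, TransformerMixin):
    """Log-entropy weighting for CountVectorizer output:
    weight_dj = log(1 + tf_dj) * g_j, with global weight
    g_j = 1 + sum_d p_dj * log(p_dj) / log(n_docs), p_dj = tf_dj / gf_j."""

    def fit(self, X, y=None):
        X = sp.csc_matrix(X, dtype=float)
        n_docs = X.shape[0]
        gf = np.asarray(X.sum(axis=0)).ravel()   # global term frequencies
        gf[gf == 0] = 1.0                        # guard empty columns
        P = X.multiply(1.0 / gf)                 # p_dj, still sparse
        logP = P.copy()
        logP.data = np.log(logP.data)            # only nonzero p, so 0*log 0 never occurs
        ent = np.asarray(P.multiply(logP).sum(axis=0)).ravel()
        denom = np.log(n_docs) if n_docs > 1 else 1.0
        self.global_weights_ = 1.0 + ent / denom
        return self

    def transform(self, X):
        X = sp.csr_matrix(X, dtype=float)
        X.data = np.log1p(X.data)                # local weight log(1 + tf)
        return X.multiply(self.global_weights_).tocsr()

Dropped between CountVectorizer() and TruncatedSVD in a pipeline, this reproduces the log-entropy LSA setup while staying within scikit-learn.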

unary class text classification in weka?

I have a training dataset (text) for a particular category (say cancer). I want to train an SVM classifier for this class in Weka. But when I try to do this by creating a folder 'cancer', putting all the training files into that folder, and running the code, I get the following error:
weka.classifiers.functions.SMO: Cannot handle unary class!
What I want is for the classifier to report the class name correctly when it finds a document related to 'cancer', and to say something like 'unknown' when I feed it a non-cancer document.
What should I do to get this behavior?
The SMO algorithm in Weka only does binary classification between two classes. Sequential Minimal Optimization is a specific algorithm for solving an SVM, and Weka's SMO is a basic implementation of it. If you have some examples that are cancer and some that are not, then that is a binary problem; perhaps you haven't labeled them correctly.
However, if you are using training data which is all examples of cancer and you want it to tell you whether a future example fits the pattern or not, then you are attempting to do one-class SVM, aka outlier detection.
LibSVM in Weka can handle one-class SVM. Unlike Weka's SMO implementation, LibSVM is a standalone program that has been interfaced into Weka and incorporates many different SVM variants. This post on the Wekalist explains how to use LibSVM for this in Weka.
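For readers coming from scikit-learn (the comparison this page opened with), the analogous one-class setup there is OneClassSVM. A minimal sketch with placeholder documents; this illustrates the concept rather than the Weka/LibSVM configuration itself:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Train only on in-class ("cancer") documents.
cancer_docs = ["cancer training text one", "cancer training text two"]
vec = TfidfVectorizer()
X = vec.fit_transform(cancer_docs)

ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)

# predict() returns +1 for "looks like the training class", -1 for outlier.
new_doc = vec.transform(["some unseen document"])
print("cancer" if ocsvm.predict(new_doc)[0] == 1 else "unknown")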
