Can anyone explain the difference between the RandomForestClassifier and ExtraTreesClassifier in scikit learn. I've spent a good bit of time reading the paper:
P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006
It seems these are the difference for ET:
1) When choosing variables at a split, samples are drawn from the entire training set instead of a bootstrap sample of the training set.
2) Splits are chosen completely at random from the range of values in the sample at each split.
The result from these two things are many more "leaves".
Yes both conclusions are correct, although the Random Forest implementation in scikit-learn makes it possible to enable or disable the bootstrap resampling.
In practice, RFs are often more compact than ETs. ETs are generally cheaper to train from a computational point of view but can grow much bigger. ETs can sometime generalize better than RFs but it's hard to guess when it's the case without trying both first (and tuning n_estimators, max_features and min_samples_split by cross-validated grid search).
ExtraTrees classifier always tests random splits over fraction of features (in contrast to RandomForest, which tests all possible splits over fraction of features)
The main difference between random forests and extra trees (usually called extreme random forests) lies in the fact that, instead of computing the locally optimal feature/split combination (for the random forest), for each feature under consideration, a random value is selected for the split (for the extra trees). Here is a good resource to know more about their difference in more detail Random forest vs extra tree.
Related
I am using gensim Word2Vec model to train word embeddings. My code is:
w2v_model = Word2Vec(min_count=20,
window=2,
vector_size=50,
sample=6e-5,
alpha=0.03,
min_alpha=0.0007,
negative=20,
workers=cores-1)
w2v_model.build_vocab(sentences, progress_per=10000)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=50, report_delay=1)
I wonder whether I can access the negative and positive word samples during the process?
Thanks in advance.
Deep inside the training loops, for each individual 'center' word in the training texts that is to be predicted – a micro-training-example for the shallow neural-net – a different set of negative words will be chosen.
Those negative-words will be used for just that one set of forward/backward neural-net nudges, then discarded when training moves to the next word.
There's no way to access them other than changing that core code – which is actually written in Cython, & re-compiled into a native library after any changes. (It's a bit harder to tinker with than pure Python code.)
You can see where the exact choice-of-negative samples happens in the source code for one of the modes (CBOW w/ negative-sampling) here:
https://github.com/RaRe-Technologies/gensim/blob/91175ddc7e3d6f3a2af245c20af21ec3bf5e360f/gensim/models/word2vec_inner.pyx#L427
If you just need a representative set of negative-words, you could copy these steps in your own code.
If you want to know (& potentially log?) the negative words chosen for every positive prediction, I suspect that's a misguided idea:
Meaningful analysis of this algorithm's behavior won't depend on either individual micro-examples, nor the arbitrarily-random negative words chosen over all training. The interesting properties only arise from the tug-of-war happening across the interplay of all training.
As this is very deep in the training loops, even the most-efficient extra-steps, as a function of the negative-words, would slow things down a lot. Or, in the case of logging, result in 20x (for window=20) more logged-negative-words than your original training corpus. For the kinds of large corpora where this algorithm works well, such a slowdown/log could be onerous; for tiny toy-sized examples, this algorithm won't be working interestingly at all.
So the mere question, if you truly want a peek at all the (random, arbitrary) negative words during the process, suggests you may be going down a questionable path.
It'd be easier for me to imagine just wanting to see a representative set of the negatively-sampled words - because any 10, or 10,000, or 1,000,000 such randomly-chosen words are as good as any other, and the algorithm (on adequately-sized data) is robust against usual variance in which negative-words are actually chosen. And for that, you could just run the same sampling-process outside the training.
Separately: those are odd non-default choices for alpha & min_alpha - values that usually don't need any tweaking, and if tweaked should really only be done so with a conscious plan, driven by quantitaive evaluations comparing the results of alternate values. But, those specific odd unmotivated values are pretty common in some of the worst online tutorials. So beware where you're learning about word2vec!
I'm trying to apply some binary text classification but I don't feel that having millions of >1k length vectors is a good idea. So, which alternatives are there for the basic BOW model?
I think there are quite a few different approaches, based on what exactly you are aiming for in your prediction task (processing speed over accuracy, variance in your text data distribution, etc.).
Without any further information on your current implementation, I think the following avenues offer ways for improvement in your approach:
Using sparse data representations. This might be a very obvious point, but choosing the right data structure to represent your input vectors can already save you a great deal of pain. Sklearn offers a variety of options, and detail them in their great user guide. Specifically, I would point out that you could either use scipy.sparse matrices, or alternatively represent something with sklearn's DictVectorizer.
Limit your vocabulary. There might be some words that you can easily ignore when building your BoW representation. I'm again assuming that you're working with some implementation similar to sklearn's CountVectorizer, which already offers a great number of possibilities. The most obvious option are stopwords, which can simply be dropped from your vocabulary entirely, but of course you can also limit it further by using pre-processing steps such as lemmatization/stemming, lowercasing, etc. CountVectorizer specifically also allows you to control the minimum and maximum document frequency (don't confuse this with corpus frequency), which again should limit the size of your vocabulary.
I am using sklearn's random forests module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features tend to make the model less accurate.
I suspect that splitting might only be done across one dimension which means that features which are actually more important get less attention when building trees. Could this be the reason?
Yes, the additional features you have added might not have good predictive power and as random forest takes random subset of features to build individual trees, the original 50 features might have got missed out. To test this hypothesis, you can plot variable importance using sklearn.
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c
There are plenty of illustrations of overfitting, but for instance, this 2d plot represents the different functions that would have been learned for a binary classification task. Because the function on the right has too many parameters, it learns wrongs data patterns that don't generalize properly.
I'm vectorizing words on a few different corpora with Gensim and am getting results that are making me rethink how Word2Vec functions. My understanding was that Word2Vec was deterministic, and that the position of a word in a vector space would not change from training to training. If "My cat is running" and "your dog can't be running" are the two sentences in the corpus, then the value of "running" (or its stem) seems necessarily fixed.
However, I've found that that value indeed does vary across models, and words keep changing where they are on a vector space when I train the model. The differences are not always hugely meaningful, but they do indicate the existence of some random process. What am I missing here?
This is well-covered in the Gensim FAQ, which I quote here:
Q11: I've trained my Word2Vec/Doc2Vec/etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)
Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization
during training. (For example, the training windows are randomly
truncated as an efficient way of weighting nearer words higher. The
negative examples in the default negative-sampling mode are chosen
randomly. And the downsampling of highly-frequent words, as controlled
by the sample parameter, is driven by random choices. These
behaviors were all defined in the original Word2Vec paper's algorithm
description.)
Even when all this randomness comes from a
pseudorandom-number-generator that's been seeded to give a
reproducible stream of random numbers (which gensim does by default),
the usual case of multi-threaded training can further change the exact
training-order of text examples, and thus the final model state.
(Further, in Python 3.x, the hashing of strings is randomized each
re-launch of the Python interpreter - changing the iteration ordering
of vocabulary dicts from run to run, and thus making even the same
string-of-random-number-draws pick different words in different
launches.)
So, it is to be expected that models vary from run to run, even
trained on the same data. There's no single "right place" for any
word-vector or doc-vector to wind up: just positions that are at
progressively more-useful distances & directions from other vectors
co-trained inside the same model. (In general, only vectors that were
trained together in an interleaved session of contrasting uses become
comparable in their coordinates.)
Suitable training parameters should yield models that are roughly as
useful, from run-to-run, as each other. Testing and evaluation
processes should be tolerant of any shifts in vector positions, and of
small "jitter" in the overall utility of models, that arises from the
inherent algorithm randomness. (If the observed quality from
run-to-run varies a lot, there may be other problems: too little data,
poorly-tuned parameters, or errors/weaknesses in the evaluation
method.)
You can try to force determinism, by using workers=1 to limit
training to a single thread – and, if in Python 3.x, using the
PYTHONHASHSEED environment variable to disable its usual string hash
randomization. But training will be much slower than with more
threads. And, you'd be obscuring the inherent
randomness/approximateness of the underlying algorithms, in a way that
might make results more fragile and dependent on the luck of a
particular setup. It's better to tolerate a little jitter, and use
excessive jitter as an indicator of problems elsewhere in the data or
model setup – rather than impose a superficial determinism.
While I don't know any implementation details of Word2Vec in gensim, I do know that, in general, Word2Vec is trained by a simple neural network with an embedding layer as the first layer. The weight matrix of this embedding layer contains the word vectors that we are interested in.
This being said, it is in general also quite common to initialize the weights of a neural network randomly. So there you have the origin of your randomness.
But how can the results be different, regardless of different (random) starting conditions?
A well trained model will assign similar vectors to words that have similar meaning. This similarity is measured by the cosine of the angle between the two vectors. Mathematically speaking, if v and w are the vectors of two very similar words then
dot(v, w) / (len(v) * len(w)) # this formula gives you the cosine of the angle between v and w
will be close to 1.
Also, it will allow you to do arithmetics like the famous
king - man + woman = queen
For illustration purposes imagine 2D-vectors. Would these arithmetical properties get lost if you e.g. rotate everything by some angle around the origin? With a little mathematical background I can assure you: No, they won't!
So, your assumption
If "My cat is running" and "your dog can't be running" are the two
sentences in the corpus, then the value of "running" (or its stem)
seems necessarily fixed.
is wrong. The value of "running" is not fixed at all. What is (somehow) fixed, however, is the similarity (cosine) and arithmetical relationship to other words.
I have a multi-class text classification/categorization problem. I have a set of ground truth data with K different mutually exclusive classes. This is an unbalanced problem in two respects. First, some classes are a lot more frequent than others. Second, some classes are of more interest to us than others (those generally positively correlate with their relative frequency, although there are some classes of interest that are fairly rare).
My goal is to develop a single classifier or a collection of them to be able to classify the k << K classes of interest with high precision (at least 80%) while maintaining reasonable recall (what's "reasonable" is a bit vague).
Features that I use are mostly typical unigram-/bigram-based ones plus some binary features coming from metadata of the incoming documents that are being classified (e.g. whether them were submitted via email or though a webform).
Because of the unbalanced data, I am leaning toward developing binary classifiers for each of the important classes, instead of a single one like a multi-class SVM.
What ML learning algorithms (binary or not) implemented in scikit-learn allow for training tuned to precision (versus for example recall or F1) and what options do I need to set for that?
What data analysis tools in scikit-learn can be used for feature selection to narrow down the features that might be the most relevant to the precision-oriented classification of a particular class?
This is not really a "big data" problem: K is about 100, k is about 15, the total number of samples available to me for training and testing is about 100,000.
Thx
Given that k is small, I would just do this manually. For each desired class, train your individual (one vs the rest) classifier, take look at the precision-recall curve, and then choose the threshold that gives the desired precision.