How to do Text classification using word2vec - python-3.x

I want to perform text classification using word2vec.
I got vectors of words.
ls = []
sentences = lines.split(".")
for i in sentences:
ls.append(i.split())
model = Word2Vec(ls, min_count=1, size = 4)
words = list(model.wv.vocab)
print(words)
vectors = []
for word in words:
vectors.append(model[word].tolist())
data = np.array(vectors)
data
output:
array([[ 0.00933912, 0.07960335, -0.04559333, 0.10600036],
[ 0.10576613, 0.07267512, -0.10718666, -0.00804013],
[ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
[-0.09893986, 0.01500741, -0.04796079, -0.04447284],
[ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
[ 0.09352681, -0.03864434, -0.01743148, 0.11251986],.....])
How can i perform classification (product & non product)?

You already have the array of word vectors using model.wv.syn0. If you print it, you can see an array with each corresponding vector of a word.
You can see an example here using Python3:
import pandas as pd
import os
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression
#Reading a csv file with text data
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())
train = []
#getting only the first 4 columns of the file
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
train.extend(sentences)
# Create an array of tokens using nltk
tokens = [nl.word_tokenize(sentences) for sentences in train]
Now it's time to use the vector model, in this example we will calculate the LogisticRegression.
# method 1 - using tokens in Word2Vec class itself so you don't need to train again with train method
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)
# method 2 - creating an object 'model' of Word2Vec and building vocabulary for training our model
model = gensim.models.Word2vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....
# The two datasets must be the same size
max_dataset_size = len(model.wv.syn0)
Y_dataset = []
# get the last number of each file. In this case is the department number
# this will be the 0 or 1, or another kind of classification. ( to use words you need to extract them differently, this way is to numbers)
with open("dbSubset.csv", "r") as f:
for line in f:
lastchar = line.strip()[-1]
if lastchar.isdigit():
result = int(lastchar)
Y_dataset.append(result)
else:
result = 40
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(model.wv.syn0, Y_dataset[:max_dataset_size])
# Prediction of the first 15 samples of all features
predict = clf.predict(model.wv.syn0[:15, :])
# Calculating the score of the predictions
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)
You can also calculate the similarity of words belonging to your created model dictionary:
print("\n\nSimilarity value : ",model.wv.similarity('women','men'))
You can find more functions to use here.

Your question is rather broad but I will try to give you a first approach to classify text documents.
First of all, I would decide how I want to represent each document as one vector. So you need a method that takes a list of vectors (of words) and returns one single vector. You want to avoid that the length of the document influences what this vector represents. You could for example choose the mean.
def document_vector(array_of_word_vectors):
return array_of_word_vectors.mean(axis=0)
where array_of_word_vectors is for example data in your code.
Now you can either play a bit around with distances (for example cosine distance would a nice first choice) and see how far certain documents are from each other or - and that's probably the approach that brings faster results - you can use the document vectors to build a training set for a classification algorithm of your choice from scikit learn, for example Logistic Regression.
The document vectors will become your matrix X and your vector y is an array of 1 and 0, depending on the binary category that you want the documents to be classified into.

Related

BERT with WMD distance for sentence similarity

I have tried to calculate the similarity between the two sentences using BERT and word mover distance (WMD). I am unable to find the correct formula for WMD in python. Also tried the WMD python library but it uses the word2vec model for embedding. Kindly help to solve the below problem to get the similarity score using WMD.
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
sentence_obama = sentence_obama.lower().split()
sentence_president = sentence_president.lower().split()
#Importing bert for creating an embedding
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')
#creating an embedding of both sentences
sentence_embeddings1 = model.encode(sentence_obama)
sentence_embeddings2 = model.encode(sentence_president)
distance = WMD(sentence_embeddings1, sentence_embeddings2)
print(distance)
Generally speaking, Word Mover Distance (based on Earth Mover Distance) requires a representation which each feature is associated with weight (or density). For examples bag-of-word representation of sentences with histogram of words.
Intuitively, EMD measures the cost of moving wights (dirt) in a histogram representation of features knowing the ground distance between each feature. With words as features, word vectors provide a distance measure between words, and then EMD can become WMD with word-histograms.
There are two issues with using WMD on BERT embeddings:
BERT embeddings provide contextual representation of sub-words and the sentence (representation of of a subword changes in different context).
There is no measure of density or weight on words and sub-words other than the attention mask on tokens.
The most simple and effective sentence similarity measure with BERT is based on the distance between [CLS] vectors of two sentences (the first vectors in the last hidden layers: the sentence vectors).
With all that said, I will try to find alternative ways to use WMD using pyemd module as in this Gensim implementation of WMD.
To measure which solution actually works, I will evaluate different solutions on this sentence similarity dataset in English.
import datasets
dataset = datasets.load_dataset('stsb_multi_mt', 'en')
Instead of sentence_transformers module, I use the main huggingface transformers. For simplicity I will use the following function to get tokens and sentence emebdedding for a given string:
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
def encode(sent):
inp = tokenizer(sent, return_tensors='pt')
out = model(**inp)
out = out.last_hidden_state[0].detach().numpy()
return out
Do not forget to import these modules as well:
import numpy as np
from pyemd import emd
from scipy.spatial.distance import cdist
from scipy.stats import spearmanr
We use cdist to measure vector distances, and Spearman's rank-order correlation (spearmanr) to compare our predicted similarity measure with the human judgments.
true_scores = []
pred_cls_scores = []
for item in tqdm(dataset['test']):
sent1 = encode(item['sentence1'])
sent2 = encode(item['sentence2'])
true_scores.append(item['similarity_score'])
pred_cls_scores.append(cdist(sent1[:1], sent2[:1])[0, 0])
spearmanr(true_scores, pred_cls_scores)
# SpearmanrResult(correlation=-0.737203146420342, pvalue=1.0236865615739037e-236)
Spearman's rho=0.737 is quite high!
The original post proposes to represent sentences with vectors of words based on white-space tokenization, run WMD over such representation. Here is an implementation of WMD based on EMD module similar to Gensim:
def wmdistance(sent1, sent2):
words1 = sent1.split()
words2 = sent2.split()
embs1 = np.array([encode(word)[0] for word in words1])
embs2 = np.array([encode(word)[0] for word in words2])
vocab_freq = Counter(words1 + words2)
vocab_indices = {w:idx for idx, w in enumerate(vocab_freq)}
sent1_indices = [vocab_indices[w] for w in words1]
sent2_indices = [vocab_indices[w] for w in words2]
vocab_len = len(vocab_freq)
# Compute distance matrix.
distance_matrix = np.zeros((vocab_len, vocab_len), dtype=np.double)
distance_matrix[np.ix_(sent1_indices, sent2_indices)] = cdist(embs1, embs2)
if abs((distance_matrix).sum()) < 1e-8:
# `emd` gets stuck if the distance matrix contains only zeros.
logger.info('The distance matrix is all zeros. Aborting (returning inf).')
return float('inf')
def nbow(sent):
d = np.zeros(vocab_len, dtype=np.double)
nbow = [(vocab_indices[w], vocab_freq[w]) for w in sent]
doc_len = len(sent)
for idx, freq in nbow:
d[idx] = freq / float(doc_len) # Normalized word frequencies.
return d
# Compute nBOW representation of documents. This is what pyemd expects on input.
d1 = nbow(words1)
d2 = nbow(words2)
# Compute WMD.
return emd(d1, d2, distance_matrix)
The spearman correlations are positive but not as high as the standard solution above.
pred_wmd_scores = []
for item in tqdm(dataset['test']):
pred_wmd_scores.append(wmdistance(item['sentence1'], item['sentence2']))
spearmanr(true_scores, pred_wmd_scores)
# SpearmanrResult(correlation=-0.4279390535806689, pvalue=1.6453234927014767e-62)
Perhaps, rho=0.428 is not too low for word-vector representations but it is quite low.
There are also other alternative ways to use EMD on [CLS] vectors. In order to run EMD, we need ground distances between features of the vector. So, one alternative solution is to map embeddings onto a new vector space which [CLS] vectors express weight of more meaningful features. For example, we can create a list of sentence vectors as components of the vector space. Then map the sentence vectors onto the component space, where each sentence is represented with a vector of component weight. The distance between components is measurable in the original embedding space:
def emdistance(embs1, embs2, components):
distance_matrix = cdist(components, components, metric='cosine')
sent_vec1 = 1-cdist(components, embs1[:1], metric='cosine')[:, 0]
sent_vec2 = 1-cdist(components, embs2[:1], metric='cosine')[:, 0]
return emd(sent_vec1, sent_vec2, distance_matrix)
Perhaps it is possible for some applications to find defining sentences as components, here I just sample 20 random sentences to test this:
n = 20
indices = np.arange(len(dataset['train']))
np.random.shuffle(indices)
random_sentences = [dataset['train'][int(idx)]['sentence1'] for idx in indices[:n]]
random_components = np.array([encode(sent)[0] for sent in random_sentences])
pred_emd_scores = []
for item in tqdm(dataset['test']):
sent1 = encode(item['sentence1'])
sent2 = encode(item['sentence2'])
pred_emd_scores.append(emdistance(sent1, sent2, random_components))
spearmanr(true_scores, pred_emd_scores)
#SpearmanrResult(correlation=-0.5347151444976767, pvalue=8.092612264709952e-103)
With 20 random sentences as components still rho=0.534 is a better score than bag of word rho=0.428.

Is there a way to add a 'sentiment' column after applying CountVectorizer or TfIdfTransformer to a dataframe?

I am working with app store reviews to classify them as class "0" or class "1" based on the text in the review and the sentiment the review carries.
In my classification steps I apply the following methods to my dataframe:
def get_sentiment(s):
vs = analyzer.polarity_scores(s)
if vs['compound'] >= 0.5:
return 1
elif vs['compound'] <= -0.5:
return -1
else:
return 0
df['sentiment'] = df['review'].apply(get_sentiment)
For simplicity sake, the data has already been labeled as either class '0' or '1', but I am training the model for the classification of new instances that have not been labeled yet. In short, the data I'm working with has already been labeled. They are in the classification column.
Then in my train test split method do the following:
msg_train, msg_test, label_train, label_test = train_test_split(df.drop('classification', axis=1), df['classification'], test_size=0.3, random_state=42)
So the dataframe for the X parameter has review and sentiment, and for the y parameter I only have the classification that I am training my model on.
Since the normalization is repetitive, I am running a pipeline like so for simplicity:
pipeline1 = Pipeline([
('bow', CountVectorizer(analyzer=clean_review)),
('tfidf', TfidfTransformer()),
('classifier', MultinomialNB())
])
Where the clean_review function is as follows:
def clean_review(sentence):
no_punc = [c for c in sentence if c not in string.punctuation]
no_punc = ''.join(no_punc)
no_stopwords = [w.lower() for w in no_punc.split() if w not in stopwords_set]
stemmed_words = [ps.stem(w) for w in no_stopwords]
return stemmed_words
Where stopwords_set is the collection of english stopwords from the nltk library, and ps is from the PortStemmer module in the nltk library (for word stemming).
I get the following error: ValueError: Found input variables with inconsistent numbers of samples: [2, 505]
When I searched this error before, I saw that the likely issue could've been that there is a mismatch in the number of records for each attribute. I've found this not to be the case. All the records that I am using have values for every column.
Can someone else help me interpret what this error could mean?
My end goal is to have a dataframe that has the CountVectorizer and TfIdfTransformer applied to the text, but also retain the column for the sentiment of each review.
I would then like to be able to train the MultinomialNB classifier on this dataframe and apply this model to other tasks.
I'm not sure on what the error is due to since I don't know what the size of your dataframe should be. I would need more information. On which line is the error thrown?
Regarding the fact that you want to retain the sentiment column, you could apply CountVectorizer and TfIdfTransformer (by the way you could skip a step and directly apply TfidfVectorizer) only on the text data and then have another transformer in the pipeline which adds the original sentiment column before you feed the dataframe to the classifier.

Retraining pre-trained word embeddings in Python using Gensim

I want to retrain pre-trained word embeddings in Python using Gensim. The pre-trained embeddings I want to use is Google's Word2Vec in the file GoogleNews-vectors-negative300.bin.
Following Gensim's word2vec tutorial, "it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there."
Therefore I can't use the KeyedVectors and for training a model the tutorial suggests to use:
model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)
(https://rare-technologies.com/word2vec-tutorial/)
However, when I try this:
from gensim.models import Word2Vec
model = Word2Vec.load('data/GoogleNews-vectors-negative300.bin')
I get an error message:
1330 # Because of loading from S3 load can't be used (missing readline in smart_open)
1331 if sys.version_info > (3, 0):
-> 1332 return _pickle.load(f, encoding='latin1')
1333 else:
1334 return _pickle.loads(f.read())
UnpicklingError: invalid load key, '3'.
I didn't find a way to convert the binary google new file into a text file properly, and even if so I'm not sure whether that would solve my problem.
Does anyone have a solution to this problem or knows about a different way to retrain pre-trained word embeddings?
The Word2Vec.load() method can only load full models in gensim's native format (based on Python object-pickling) – not any other binary/text formats.
And, as per the documentation's note that "it’s not possible to resume training with models generated by the C tool", there's simply not enough information in the GoogleNews raw-vectors files to reconstruct the full working model that was used to train them. (That would require both some internal model-weights, not saved in that file, and word-frequency-information for controlling sampling, also not saved in that file.)
The best you could do is create a new Word2Vec model, then patch some/all of the GoogleNews vectors into it before doing your own training. This is an error-prone process with no real best-practices and many caveats about the interpretation of final results. (For example, if you bring in all the vectors, but then only re-train a subset using only your own corpus & word-frequencies, the more training you do – making the word-vectors better fit your corpus – the less such re-trained words will have any useful comparability to retained untrained words.)
Essentially, if you can look at the gensim Word2Vec source & work-out how to patch-together such a frankenstein-model, it may be appropriate. But there's no built-in support or handy off-the-shelf recipes that make it easy, because it's an inherently murky process.
I have already answered it here .
Save the google news model as text file in wor2vec format using gensim.
Refer this answer to save it as text file
Then try this code .
import os
import pickle
import numpy as np
import gensim
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.callbacks import CallbackAny2Vec
import operator
os.mkdir("model_dir")
# class EpochSaver(CallbackAny2Vec):
# '''Callback to save model after each epoch.'''
# def __init__(self, path_prefix):
# self.path_prefix = path_prefix
# self.epoch = 0
# def on_epoch_end(self, model):
# list_of_existing_files = os.listdir(".")
# output_path = 'model_dir/{}_epoch{}.model'.format(self.path_prefix, self.epoch)
# try:
# model.save(output_path)
# except:
# model.wv.save_word2vec_format('model_dir/model_{}.bin'.format(self.epoch), binary=True)
# print("number of epochs completed = {}".format(self.epoch))
# self.epoch += 1
# list_of_total_files = os.listdir(".")
# saver = EpochSaver("my_finetuned")
# function to load vectors from existing model.
# I am loading glove vectors from a text file, benefit of doing this is that I get complete vocab of glove as well.
# If you are using a previous word2vec model I would recommed save that in txt format.
# In case you decide not to do it, you can tweak the function to get vectors for words in your vocab only.
def load_vectors(token2id, path, limit=None):
embed_shape = (len(token2id), 300)
freqs = np.zeros((len(token2id)), dtype='f')
vectors = np.zeros(embed_shape, dtype='f')
i = 0
with open(path, encoding="utf8", errors='ignore') as f:
for o in f:
token, *vector = o.split(' ')
token = str.lower(token)
if len(o) <= 100:
continue
if limit is not None and i > limit:
break
vectors[token2id[token]] = np.array(vector, 'f')
i += 1
return vectors
# path of text file of your word vectors.
embedding_name = "word2vec.txt"
data = "<training data(new line separated tect file)>"
# Dictionary to store a unique id for each token in vocab( in my case vocab contains both my vocab and glove vocab)
token2id = {}
# This dictionary will contain all the words and their frequencies.
vocab_freq_dict = {}
# Populating vocab_freq_dict and token2id from my data.
id_ = 0
training_examples = []
file = open("{}".format(data),'r', encoding="utf-8")
for line in file.readlines():
words = line.strip().split(" ")
training_examples.append(words)
for word in words:
if word not in vocab_freq_dict:
vocab_freq_dict.update({word:0})
vocab_freq_dict[word] += 1
if word not in token2id:
token2id.update({word:id_})
id_ += 1
# Populating vocab_freq_dict and token2id from glove vocab.
max_id = max(token2id.items(), key=operator.itemgetter(1))[0]
max_token_id = token2id[max_id]
with open(embedding_name, encoding="utf8", errors='ignore') as f:
for o in f:
token, *vector = o.split(' ')
token = str.lower(token)
if len(o) <= 100:
continue
if token not in token2id:
max_token_id += 1
token2id.update({token:max_token_id})
vocab_freq_dict.update({token:1})
with open("vocab_freq_dict","wb") as vocab_file:
pickle.dump(vocab_freq_dict, vocab_file)
with open("token2id", "wb") as token2id_file:
pickle.dump(token2id, token2id_file)
# converting vectors to keyedvectors format for gensim
vectors = load_vectors(token2id, embedding_name)
vec = KeyedVectors(300)
vec.add(list(token2id.keys()), vectors, replace=True)
# setting vectors(numpy_array) to None to release memory
vectors = None
params = dict(min_count=1,workers=14,iter=6,size=300)
model = Word2Vec(**params)
# using build from vocab to build the vocab
model.build_vocab_from_freq(vocab_freq_dict)
# using token2id to create idxmap
idxmap = np.array([token2id[w] for w in model.wv.index2entity])
# Setting hidden weights(syn0 = between input layer and hidden layer) = your vectors arranged accoring to ids
model.wv.vectors[:] = vec.vectors[idxmap]
# Setting hidden weights(syn0 = between hidden layer and output layer) = your vectors arranged accoring to ids
model.trainables.syn1neg[:] = vec.vectors[idxmap]
model.train(training_examples, total_examples=len(training_examples), epochs=model.epochs)
output_path = 'model_dir/final_model.model'
model.save(output_path)

Cosine Similarity score in scikit learn for two different vectorization technique is same

I am recently working on an assignment where the task is to use 20_newgroups dataset and use 3 different vectorization technique (Bag of words, TF, TFIDF) to represent documents in vector format and then trying to analyze the difference between average cosine similarity between each class in 20_Newsgroups data set. So here is what I am trying to do in python. I am reading data and passing it to sklearn.feature_extraction.text.CountVectorizer class's fit() and transform() function for Bag of Words technique and TfidfVectorizer for TFIDF technique.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity,cosine_distances
import numpy
import math
import csv
===============================================================================================================================================
categories = ['alt.atheism','comp.graphics','comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware','comp.sys.mac.hardware', 'comp.windows.x','misc.forsale','rec.autos','rec.motorcycles','rec.sport.baseball','rec.sport.hockey',
'sci.crypt','sci.electronics','sci.med','sci.space','soc.religion.christian','talk.politics.guns',
'talk.politics.mideast','talk.politics.misc','talk.religion.misc']
twenty_newsgroup = fetch_20newsgroups(subset='all',remove=('headers', 'footers', 'quotes'),shuffle=True, random_state=42)
dataset_groups = []
for group in range(0,20):
category = []
category.append(categories[group])
dataset_groups.append(fetch_20newsgroups(subset='all',remove=('headers','footers','quotes'),shuffle=True,random_state=42,categories=category))
===============================================================================================================================================
bag_of_word_vect = CountVectorizer(stop_words='english',analyzer='word') #,min_df = 0.09
bag_of_word_vect = bag_of_word_vect.fit(twenty_newsgroup.data,twenty_newsgroup.target)
datamatrix_bow_groups = []
for group in dataset_groups:
datamatrix_bow_groups.append(bag_of_word_vect.transform(group.data))
similarity_matrix = []
for i in range(0,20):
means = []
for j in range(i,20):
result_of_group_ij = cosine_similarity(datamatrix_bow_groups[i], datamatrix_bow_groups[j])
means.append(numpy.mean(result_of_group_ij))
similarity_matrix.append(means)
===============================================================================================================================================
tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=False) #,sublinear_tf=True
tf_vectorizer = tf_vectorizer.fit(twenty_newsgroup.data)
datamatrix_tf_groups = []
for group in dataset_groups:
datamatrix_tf_groups.append(tf_vectorizer.transform(group.data))
similarity_matrix = []
for i in range(0,20):
means = []
for j in range(i,20):
result_of_group_ij = cosine_similarity(datamatrix_tf_groups[i], datamatrix_tf_groups[j])
means.append(numpy.mean(result_of_group_ij))
similarity_matrix.append(means)
Both should technically give different similarity_matrix but they are yeilding the same. More precisiosly tf_vectorizer should create similarity_matrix which have values more closed to 1.
The problem here is, Vector created by both technique for the same document of the same class for example (alt.atheism) is different and it should be. but when I calculating a similarity score between documents of one class and another class, Cosine similarity scorer giving me same value. If we understand theoretically then TFIDF is representing a document in a more finer sense in vector space so cosine value should be more near to 1 then what I get from BAG OF WORD technique right? But it is giving same similarity score. I tried by printing values of matrices created by BOW & TFIDF technique. It would a great help if somebody can give me a good reason to resolve this issue or strong argument in support what is happening?
I am new to this platform so please ignore any mistakes and let me know if you need more info.
Thanks & Regards,
Darshan Sonagara
The problem is this line in your code.
tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=False) #,sublinear_tf=True
You have set use_idf to False. This means the inverse document frequency is not calculated.So only the term frequency is calculated. Basicaly you are using the TfidfVectorizer like a CountVectorizer. Hence the output of both is the same: resulting in the same cosine distances.
using tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=True) Will result in a cosine similarity matrix for tfidf that is different from the countvectorizer.

How to find key trees/features from a trained random forest?

I am using Scikit-Learn Random Forest Classifier and trying to extract the meaningful trees/features in order to better understand the prediction results.
I found this method which seems relevant in the documention (http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.get_params), but couldn't find an example how to use it.
I am also hoping to visualize those trees if possible, any relevant code would be great.
Thank you!
I think you're looking for Forest.feature_importances_. This allows you to see what the relative importance of each input feature is to your final model. Here's a simple example.
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier
#Lets set up a training dataset. We'll make 100 entries, each with 19 features and
#each row classified as either 0 and 1. We'll control the first 3 features to artificially
#set the first 3 features of rows classified as "1" to a set value, so that we know these are the "important" features. If we do it right, the model should point out these three as important.
#The rest of the features will just be noise.
train_data = [] ##must be all floats.
for x in range(100):
line = []
if random.random()>0.5:
line.append(1.0)
#Let's add 3 features that we know indicate a row classified as "1".
line.append(.77)
line.append(.33)
line.append(.55)
for x in range(16):#fill in the rest with noise
line.append(random.random())
else:
#this is a "0" row, so fill it with noise.
line.append(0.0)
for x in range(19):
line.append(random.random())
train_data.append(line)
train_data = np.array(train_data)
# Create the random forest object which will include all the parameters
# for the fit. Make sure to set compute_importances=True
Forest = RandomForestClassifier(n_estimators = 100, compute_importances=True)
# Fit the training data to the training output and create the decision
# trees. This tells the model that the first column in our data is the classification,
# and the rest of the columns are the features.
Forest = Forest.fit(train_data[0::,1::],train_data[0::,0])
#now you can see the importance of each feature in Forest.feature_importances_
# these values will all add up to one. Let's call the "important" ones the ones that are above average.
important_features = []
for x,i in enumerate(Forest.feature_importances_):
if i>np.average(Forest.feature_importances_):
important_features.append(str(x))
print 'Most important features:',', '.join(important_features)
#we see that the model correctly detected that the first three features are the most important, just as we expected!
To get the relative feature importances, read the relevant section of the documentation along with the code of the linked examples in that same section.
The trees themselves are stored in the estimators_ attribute of the random forest instance (only after the call to the fit method). Now to extract a "key tree" one would first require you to define what it is and what you are expecting to do with it.
You could rank the individual trees by computing there score on held out test set but I don't know what expect to get out of that.
Do you want to prune the forest to make it faster to predict by reducing the number of trees without decreasing the aggregate forest accuracy?
Here is how I visualize the tree:
First make the model after you have done all of the preprocessing, splitting, etc:
# max number of trees = 100
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
Make predictions:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
Then make the plot of importances. The variable dataset is the name of the original dataframe.
# get importances from RF
importances = classifier.feature_importances_
# then sort them descending
indices = np.argsort(importances)
# get the features from the original data set
features = dataset.columns[0:26]
# plot them with a horizontal bar chart
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
This yields a plot as below:

Resources