Getting Feature Names - scikit-learn

Suppose I have 4 feature names ['year2000', 'year2001', 'year2002', 'year2003'], used during training with a Decision Tree classifier.
How can I obtain the names of the important features from feature_importances_, since it directly gives me numbers rather than feature names?

Suppose you put the feature names in a NumPy array (a plain Python list can't be indexed with an array of indices, hence np.array):
feature_names = np.array(['year2000', 'year2001', 'year2002', 'year2003'])
Then the problem is just to get the indices of the features with the top k importances
feature_importances = clf.feature_importances_
k = 3
top_k_idx = feature_importances.argsort()[-k:][::-1]
print(feature_names[top_k_idx])
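Not part of the original question or answer: a minimal, self-contained sketch of the same idea with made-up data, just to show the pieces fitting together (the feature names and data here are purely illustrative):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)                     # 100 samples, 4 illustrative features
y = (X[:, 2] > 0.5).astype(int)          # make the third feature informative

feature_names = np.array(['year2000', 'year2001', 'year2002', 'year2003'])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

k = 3
top_k_idx = clf.feature_importances_.argsort()[-k:][::-1]   # indices, most important first
print(feature_names[top_k_idx])
print(clf.feature_importances_[top_k_idx])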

Related

How to extract the hidden layer features in H2ODeepLearningEstimator?

I found H2O has the function h2o.deepfeatures in R to pull the hidden layer features
https://www.rdocumentation.org/packages/h2o/versions/3.20.0.8/topics/h2o.deepfeatures
train_features <- h2o.deepfeatures(model_nn, train, layer=3)
But I couldn't find any example in Python. Can anyone provide some sample code?
Most Python/R API functions are wrappers around REST calls. See http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/model/model_base.html#ModelBase.deepfeatures
So, to convert an R example to a Python one, move the model to be the object the method is called on, and all other args shuffle along. I.e. the example from the manual becomes (with dots in variable names changed to underscores):
prostate_hex = ...
prostate_dl = ...
prostate_deepfeatures_layer1 = prostate_dl.deepfeatures(prostate_hex, 1)
prostate_deepfeatures_layer2 = prostate_dl.deepfeatures(prostate_hex, 2)
Sometimes the function name will change slightly (e.g. h2o.importFile() vs. h2o.import_file()), so you need to hunt for it at http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html
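For a fuller picture, here is a hedged, self-contained Python sketch; the dataset URL, predictor columns, and model settings below are illustrative assumptions rather than part of the original answer:
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

# Illustrative dataset: H2O's public prostate demo CSV; substitute your own frame.
prostate_hex = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
prostate_hex["CAPSULE"] = prostate_hex["CAPSULE"].asfactor()

# Small deep learning model with illustrative settings.
prostate_dl = H2ODeepLearningEstimator(hidden=[10, 10, 10], epochs=5)
prostate_dl.train(x=prostate_hex.names[2:], y="CAPSULE", training_frame=prostate_hex)

# The model is the object; the frame and layer are the remaining arguments.
layer_features = prostate_dl.deepfeatures(prostate_hex, 1)
print(layer_features.head())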

Get feature names for dataframe.corr

I am using the cancer data set from sklearn and I need to find the correlations between features. I am able to find the correlated columns, but I am not able to present them in a "nice" way, so that they can be used as input for DataFrame.drop.
Here is my code:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
corr = df.corr()
# filter to find correlations above 0.6
corr_triu = corr.where(~np.tril(np.ones(corr.shape)).astype(bool))
corr_triu = corr_triu.stack()
corr_result = corr_triu[corr_triu > 0.6]
print(corr_result)
df.drop(columns=[?])
IIUC, you want the columns that correlate with some other column in the dataset, i.e. drop the columns that don't appear in corr_result. So you'll want to get the unique variables from each level of the index of corr_result. There may be repeats, so take care of that as well, for example with sets:
corr_result.index = corr_result.index.remove_unused_levels()
corr_vars = set()
corr_vars.update(corr_result.index.unique(level=0))
corr_vars.update(corr_result.index.unique(level=1))
all_vars = set(df.columns)
df.drop(columns=all_vars - corr_vars)
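A slightly more compact variant (not from the original answer) reads the names straight off the two levels of the MultiIndex; df and corr_result are the objects from the question and answer above:
corr_vars = set(corr_result.index.get_level_values(0)) | set(corr_result.index.get_level_values(1))
df_reduced = df.drop(columns=list(set(df.columns) - corr_vars))
print(df_reduced.columns)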

Finding the top three relevant categories and their corresponding probabilities

With the script below, I find the highest probability and its corresponding category in a multi-class text classification problem. How do I find the top 3 predicted probabilities and their corresponding categories efficiently, without using loops?
probabilities = classifier.predict_proba(X_test)
max_probabilities = probabilities.max(axis=1)
order=np.argsort(probabilities, axis=1)
classification=(classifier.classes_[order[:, -1:]])
print(accuracy_score(classification,y_test))
Thanks in advance.
(I have around 50 categories; I want to extract the top 3 most relevant categories among the 50 for each of my narrations and display them in a dataframe.)
You've done most of the hard work here, just missing a bit of numpy-fu to finish it off. Your line
order = np.argsort(probabilities, axis=1)
contains the indices of the sorted probabilities, i.e. [[lowest_prob_class_1, ..., highest_prob_class_1], ...] for each of your samples, which you have used to get your classification with order[:, -1:], the index of the highest-probability class. So to get the top three classes we can just make a simple change:
top_3_classes = classifier.classes_[order[:, -3:]]
Then to get the corresponding probabilities we can use
top_3_probabilities = probabilities[np.repeat(np.arange(order.shape[0]), 3),
                                    order[:, -3:].flatten()].reshape(order.shape[0], 3)
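As a side note (not part of the original answer), the same fancy-indexing step can be written more compactly with np.take_along_axis; a self-contained sketch with stand-in data and classifier:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in data and model; substitute your own X_test and fitted classifier.
X, y = load_iris(return_X_y=True)
classifier = LogisticRegression(max_iter=1000).fit(X, y)
probabilities = classifier.predict_proba(X)

order = np.argsort(probabilities, axis=1)
top_3_idx = order[:, -3:][:, ::-1]                    # highest probability first
top_3_classes = classifier.classes_[top_3_idx]        # (n_samples, 3) array of labels
top_3_probs = np.take_along_axis(probabilities, top_3_idx, axis=1)
print(top_3_classes[:5])
print(top_3_probs[:5])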

How to merge NaiveBayesClassifier objects in NLTK

I am working on a project using the NLTK toolkit. With the hardware I have, I am only able to train the classifier on a small data set. So I divided the data into smaller chunks, trained a classifier on each, and stored each of these individual objects in a pickle file.
Now, for testing, I need the whole thing as one object to get better results. So my question is: how can I combine these objects into one?
objs = []
while True:
    try:
        f = open(picklename, "rb")
        objs.extend(pickle.load(f))
        f.close()
    except EOFError:
        break
Doing this does not work, and it gives the error TypeError: 'NaiveBayesClassifier' object is not iterable.
NaiveBayesClassifier code :
classifier = nltk.NaiveBayesClassifier.train(training_set)
I am not sure about the exact format of your data, but you cannot simply merge different classifiers. The Naive Bayes classifier stores a probability distribution based on the data it was trained on, and you cannot merge probability distributions without access to the original data.
If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html
an instance of the classifier stores:
self._label_probdist = label_probdist
self._feature_probdist = feature_probdist
These are calculated in the train method using relative frequency counts, e.g. P(L1) = (# of L1 in the training set) / (# of labels in the training set). To combine the two, you would want to get (# of L1 in Train 1 + Train 2) / (# of labels in Train 1 + Train 2).
However, the Naive Bayes procedure isn't too hard to implement from scratch, especially if you follow the train source code in the link above. Here is an outline, using the NaiveBayes source code:
Store FreqDist objects for the labels and features for each subset of the data.
from collections import defaultdict
from nltk.probability import FreqDist

label_freqdist = FreqDist()
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)
fnames = set()

# Count up how many times each feature value occurred, given
# the label and featurename.
for featureset, label in labeled_featuresets:
    label_freqdist[label] += 1
    for fname, fval in featureset.items():
        # Increment freq(fval|label, fname)
        feature_freqdist[label, fname][fval] += 1
        # Record that fname can take the value fval.
        feature_values[fname].add(fval)
        # Keep a list of all feature names.
        fnames.add(fname)

# If a feature didn't have a value given for an instance, then
# we assume that it gets the implicit value 'None.' This loop
# counts up the number of 'missing' feature values for each
# (label, fname) pair, and increments the count of the fval
# 'None' by that amount.
for label in label_freqdist:
    num_samples = label_freqdist[label]
    for fname in fnames:
        count = feature_freqdist[label, fname].N()
        # Only add a None key when necessary, i.e. if there are
        # any samples with feature 'fname' missing.
        if num_samples - count > 0:
            feature_freqdist[label, fname][None] += num_samples - count
            feature_values[fname].add(None)

# Use pickle to store label_freqdist, feature_freqdist, feature_values
Combine those using their built-in 'add' method. This will allow you to get the relative frequency across all the data.
import pickle

all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)

for file in train_labels:
    f = open(file, "rb")
    all_label_freqdist += pickle.load(f)
    f.close()
# Combine the default dicts for features similarly
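The "similarly" step is left implicit above; here is one possible sketch, under the assumption (not stated in the original answer) that each pickle file stores a (label_freqdist, feature_freqdist, feature_values) tuple, with train_labels being the list of pickle file paths:
import pickle
from collections import defaultdict
from nltk.probability import FreqDist

all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)

for file in train_labels:
    with open(file, "rb") as f:
        # Assumed pickle layout: (label_freqdist, feature_freqdist, feature_values)
        label_fd, feature_fd, feature_vals = pickle.load(f)
    all_label_freqdist += label_fd
    for key, fd in feature_fd.items():
        all_feature_freqdist[key] += fd      # FreqDist supports in-place addition
    for fname, vals in feature_vals.items():
        all_feature_values[fname] |= vals    # union of observed feature values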
Use the 'estimator' to create a probability distribution.
from nltk.probability import ELEProbDist
from nltk.classify import NaiveBayesClassifier

estimator = ELEProbDist  # the estimator is the class itself; it is called as estimator(freqdist, bins=...)
label_probdist = estimator(all_label_freqdist)
# Create the P(fval|label, fname) distribution
feature_probdist = {}
for (label, fname), freqdist in all_feature_freqdist.items():
    probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
    feature_probdist[label, fname] = probdist
classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
The classifier will now combine the counts across all the data and produce what you need.

Scikit-Learn Linear Regression: how to get the coefficients' respective features?

I'm trying to perform feature selection by evaluating my regression's coefficient outputs, and selecting the features with the highest-magnitude coefficients. The problem is, I don't know how to get the respective features, as only coefficients are returned from the coef_ attribute. The documentation says:
Estimated coefficients for the linear regression problem. If multiple
targets are passed during the fit (y 2D), this is a 2D array of
shape (n_targets, n_features), while if only one target is passed,
this is a 1D array of length n_features.
I am passing A and B into regression.fit(A, B), where A is a 2-D array with a tf-idf value for each feature in a document. Example format:
"feature1" "feature2"
"Doc1" .44 .22
"Doc2" .11 .6
"Doc3" .22 .2
B are my target values for the data, which are just numbers 1-100 associated with each document:
"Doc1" 50
"Doc2" 11
"Doc3" 99
Using regression.coef_, I get a list of coefficients, but not their corresponding features! How can I get the features? I'm guessing I need to modify the structure of my B targets, but I don't know how.
What I found to work was:
X = your independent variables
coefficients = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(logistic.coef_))], axis = 1)
The assumption you stated, that the order of regression.coef_ is the same as the column order of the TRAIN set, has held true in my experience (it works with the underlying data and also checks out against correlations between X and y).
You can do that by creating a data frame:
cdf = pd.DataFrame(regression.coef_, X.columns, columns=['Coefficients'])
print(cdf)
coefficients = pd.DataFrame({"Feature": X.columns, "Coefficients": np.transpose(logistic.coef_).ravel()})
I suppose you are working on some feature selection task. Using regression.coef_ does get you the coefficients corresponding to the features, i.e. regression.coef_[0] corresponds to "feature1" and regression.coef_[1] corresponds to "feature2". This should be what you want.
I would in turn recommend tree models from sklearn, which can also be used for feature selection; to be specific, check out sklearn's feature selection docs.
Coefficients and features in zip
print(list(zip(X_train.columns.tolist(),logreg.coef_[0])))
Coefficients and features in DataFrame
pd.DataFrame({"Feature":X_train.columns.tolist(),"Coefficients":logreg.coef_[0]})
This is the easiest and most intuitive way:
pd.DataFrame(logisticRegr.coef_, columns=x_train.columns)
or the same but transposing index and columns
pd.DataFrame(logisticRegr.coef_, columns=x_train.columns).T
Suppose your training data X variable is 'df_X'; then you can map the columns and coefficients into a dictionary and feed it into a pandas DataFrame to get the mapping:
pd.DataFrame(dict(zip(df_X.columns,model.coef_[0])),index=[0]).T
Try putting them in a Series with the data's column names as the index:
coeffs = pd.Series(model.coef_[0], index=X.columns.values)
coeffs.sort_values(ascending = False)
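For completeness (not one of the original answers), a minimal end-to-end sketch with made-up data, showing that coef_ follows the column order of the training frame:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data standing in for the tf-idf matrix A and targets B.
X = pd.DataFrame({"feature1": [0.44, 0.11, 0.22],
                  "feature2": [0.22, 0.60, 0.20]},
                 index=["Doc1", "Doc2", "Doc3"])
B = pd.Series([50, 11, 99], index=X.index)

regression = LinearRegression().fit(X, B)

# coef_ is aligned with X.columns, so pairing them is straightforward.
coeffs = pd.Series(regression.coef_, index=X.columns)
print(coeffs.sort_values(ascending=False))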
