I used pipeline and grid_search to select the best parameters and then used these parameters to fit the best pipeline ('best_pipe'). However, since the feature selection step (SelectKBest) is inside the pipeline, no fit has been applied to SelectKBest directly.
I need to know the feature names of the 'k' selected features. Any ideas how to retrieve them? Thank you in advance.
from sklearn import (cross_validation, feature_selection, pipeline,
                     preprocessing, linear_model, grid_search)

folds = 5
split = cross_validation.StratifiedKFold(target, n_folds=folds, shuffle=False, random_state=0)

scores = []

for k, (train, test) in enumerate(split):
    X_train, X_test, y_train, y_test = X.ix[train], X.ix[test], y.ix[train], y.ix[test]

    top_feat = feature_selection.SelectKBest()

    pipe = pipeline.Pipeline([('scaler', preprocessing.StandardScaler()),
                              ('feat', top_feat),
                              ('clf', linear_model.LogisticRegression())])

    K = [40, 60, 80, 100]
    C = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    penalty = ['l1', 'l2']

    param_grid = [{'feat__k': K,
                   'clf__C': C,
                   'clf__penalty': penalty}]

    scoring = 'precision'

    gs = grid_search.GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scoring)
    gs.fit(X_train, y_train)

    best_score = gs.best_score_
    scores.append(best_score)

    print "Fold: {} {} {:.4f}".format(k+1, scoring, best_score)
    print gs.best_params_

best_pipe = pipeline.Pipeline([('scale', preprocessing.StandardScaler()),
                               ('feat', feature_selection.SelectKBest(k=80)),
                               ('clf', linear_model.LogisticRegression(C=.0001, penalty='l2'))])

best_pipe.fit(X_train, y_train)
best_pipe.predict(X_test)
You can access the feature selector by name in best_pipe:
features = best_pipe.named_steps['feat']
Then you can call transform() on an index array to get the names of the selected columns:
X.columns[features.transform(np.arange(len(X.columns)))]
The output here will be the eighty column names selected in the pipeline.
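If you are on a newer scikit-learn where transform() insists on 2D input, the same trick still works after a reshape. A small sketch, using the variable names from the answer above:

import numpy as np

# Pass the column indices as a single "sample"; the selector keeps only the k chosen columns.
idx = np.arange(len(X.columns)).reshape(1, -1)
selected_columns = X.columns[features.transform(idx)[0]]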
Jake's answer totally works. But depending on what feature selector you're using, there's another option that I think is cleaner. This one worked for me:
X.columns[features.get_support()]
It gave me an identical answer to Jake's answer. And you can see more about it in the docs, but get_support returns an array of true/false values for whether or not the column was used. Also, it's worth noting that X must be of identical shape to the training data used on the feature selector.
This could be an instructive alternative: I encountered a similar need to what the OP asked. If one wants to get the k best features' indices directly from GridSearchCV:
finalFeatureIndices = gs.best_estimator_.named_steps["feat"].get_support(indices=True)
And via index manipulation, you can get your finalFeatureList:
finalFeatureList = [initialFeatureList[i] for i in finalFeatureIndices]
I am trying to apply a logistic regression model to text data.
I vectorized my data with TF-IDF:
vectorizer = TfidfVectorizer(max_features=1500)
x = vectorizer.fit_transform(df['text_column'])
vectorizer_df = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names())
df.drop('text_column', axis=1, inplace=True)
result = pd.concat([df, vectorizer_df], axis=1)
I split my data:
x = result.drop('target', 1)
y = result['target']
and finally:
x_raw_train, x_raw_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
I build a classifier:
classifier = Pipeline([('clf', LogisticRegression(solver="liblinear"))])
classifier.fit(x_raw_train, y_train)
And I get this error:
ValueError: y should be a 1d array, got an array of shape (74216, 2) instead.
This is strange, because when I set max_features=1000 it works well, but with max_features=1500 I get the error.
Can someone help me, please?
Basically, the text_column column in df contains at least one occurrence of the word target. This word becomes a column name when you convert the TF-IDF feature matrix to a dataframe with the parameter columns=vectorizer.get_feature_names(). Lastly, when you concatenate df with vectorizer_df, you add both target columns into the final dataframe.
Therefore, result['target'] will return two columns instead of one as there are effectively two target columns in the result dataframe. This will naturally lead to a ValueError, because, as specified in the error description, you need a 1d target array to fit your estimator, whereas your target array has two columns.
The reason you encounter this only with a higher max_features threshold is simply that the word target doesn't make the cut at the lower threshold, which allows the process to run as it should.
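A quick way to confirm this diagnosis on the result dataframe from the question (a small sketch, not part of the original answer):

print((result.columns == 'target').sum())   # 2 -> there are two 'target' columns
print(result['target'].shape)               # (74216, 2) instead of (74216,)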
Unless you have a reason to vectorize separately, the best solution for this is to combine all your steps in a pipeline. It's as simple as:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1500)),
    ('clf', LogisticRegression(solver="liblinear")),
])
pipeline.fit(x_train.text_column, y_train.target)
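For completeness, here is a minimal sketch of the full workflow this implies: split the raw dataframe before vectorizing, so there is never a separate vectorizer_df and hence no chance of a duplicate 'target' column. Variable names are illustrative, not from the original post.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Keep the text as a single column and the target as a 1d series.
x_train, x_test, y_train, y_test = train_test_split(
    df['text_column'], df['target'], test_size=0.3, random_state=0)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1500)),
    ('clf', LogisticRegression(solver="liblinear")),
])

pipeline.fit(x_train, y_train)
print(pipeline.score(x_test, y_test))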
I am trying to predict the cost of perfumes, but I get an error on the line "answer = (clf.predict(result))".
# z, y and enc were initialized earlier (not shown in the post); z and y are lists
# and enc is the categorical encoder (presumably a OneHotEncoder).
cursor.execute('SELECT * FROM info')
info = cursor.fetchall()

for line in info:
    z.append(line[0:2])   # first two columns: model and volume
    y.append(line[2])     # third column: cost

enc.fit(z)
x = enc.transform(z).toarray()

result = []
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)

new = input('enter the Model, Volume of your perfume to see the cost: example = Midnighto,50 ').split(',')

enc.fit([new])
result = enc.transform([new]).toarray()
answer = (clf.predict(result))
print(answer)
You don't have to fit enc again on your new input; only transform it with the encoder already fitted on X. (What you are doing now is one-hot encoding the new sample while considering only the single value it contains for each feature, whereas you have to consider all the possible categories of each feature present in your X data.) So delete this line:
enc.fit([new])
After that, please check that X and result have the same number of features. You can use the shape attribute for that.
Furthermore, I recommend you use separate training and test data to see whether your model is overfitted or not. Then you can make your own predictions.
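A minimal sketch of the suggested fix, assuming z holds the [model, volume] pairs and y the costs as in the question, and that enc is a OneHotEncoder (that type is an assumption, since the encoder setup was not shown):

from sklearn import preprocessing, tree

enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
x = enc.fit_transform(z).toarray()          # fit once, on the training categories

clf = tree.DecisionTreeClassifier()
clf.fit(x, y)

new = ['Midnighto', '50']
result = enc.transform([new]).toarray()     # transform only; no second fit
print(result.shape[1] == x.shape[1])        # same number of features as the training matrix
print(clf.predict(result))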
I'd like to do something like this:
Skf = sklearn.model_selection.StratifiedKFold(n_splits = 5, shuffle = True)
ALPHA,BETA = Skf.split(data_X, data_Y)
and then:
for train_index, test_index in ALPHA,BETA
However, it isn't working. Why, and how can I get around this problem?
My idea is that I want to use the same split a few times at different parts of my code... I don't know how to "store" the split.
Yes, you can. You can specify the seed used by the random number generator, so that you obtain the same split over different runs. Just specify the random_state parameter!
SEED = 42
Skf = sklearn.model_selection.StratifiedKFold(n_splits=5,
                                              shuffle=True,
                                              random_state=SEED)
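If you specifically want to reuse the identical folds in several places without re-running split(), another option (a sketch, not part of the original answer) is to materialize the generator into a list and iterate over that list wherever you need it:

splits = list(Skf.split(data_X, data_Y))   # each element is a (train_index, test_index) pair

for train_index, test_index in splits:
    ...  # first use of the folds

for train_index, test_index in splits:
    ...  # later reuse of exactly the same folds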
I'm working with Isolation Forest for APT classification and it gives encouraging results.
Now I'm trying to improve performance with feature selection, and I am considering a PCA approach.
df = pd.read_csv("features_labeled_PULITOIsolationFor.csv")

apt_list = list(set(df.apt))[1:]  # the first element is NaN, so I discard it

for apt_name in apt_list:
    out_file = open("test.txt", "a")
    out_file.write(apt_name + "\n" + "\n")
    out_file.close()

    df_apt17 = df[df["apt"] == apt_name]
    df_other = df[df["apt"] != apt_name]

    nf = 150
    pca = PCA(n_components=nf)
    df_apt17 = pca.fit_transform(df_apt17.drop("apt", 1))
    print(df_apt17.shape)

    kf = KFold(n_splits=10, random_state=1, shuffle=True)

    df_apt17 = df_apt17.reset_index()
    df_other = df_other.reset_index()
After this, I perform cross-validation, dividing the APT in question into k folds and using the df_other dataframe, merged with some elements of the current APT, as the test fold.
However, although PCA seems to work, given the reduced number of features (seen with .shape), I get an error on the reset_index() call:
df_apt17 = df_apt17.reset_index()
AttributeError: 'numpy.ndarray' object has no attribute 'reset_index'
How can I deal with this problem?
Thanks, everyone.
Reading the scikit-learn docs (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), you can see that the method pca.fit_transform() returns an array.
Arrays don't have a reset_index() method; only a pandas.DataFrame does.
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters: X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features
is the number of features.
y : Ignored.
Returns: X_new : array-like, shape (n_samples, n_components)
If you need to use reset_index(), you have to convert the array back into a pandas.DataFrame.
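A minimal sketch of that fix, replacing the fit_transform / reset_index lines from the question (variable names as in the question):

import pandas as pd

# Wrap the PCA output in a DataFrame before calling DataFrame-only methods such as reset_index().
reduced = pca.fit_transform(df_apt17.drop("apt", axis=1))
df_apt17 = pd.DataFrame(reduced, index=df_apt17.index)   # keep the original row index
df_apt17 = df_apt17.reset_index()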
I am trying to do a grid search using a SVM classifier.
Consider my data and target, which have been parsed from a file and loaded into numpy arrays.
I then preprocess them.
# Transform the data to have zero mean and unit variance.
zeroMeanUnitVarianceScaler = preprocessing.StandardScaler().fit(data)
zeroMeanUnitVarianceScaler.transform(data)
scaledData = data
# Transform the target to have range [-1, 1].
scaledTarget = np.empty([161L,], dtype=int)
for i in range(len(target)):
    if(target[i] == 'Malignant'):
        scaledTarget[i] = 1
    if(target[i] == 'Benign'):
        scaledTarget[i] = -1
I now try to set up my grid and fit the scaled data to targets.
# Generate parameters for parameter grid.
CValues = np.logspace(-3, 3, 7)
GammaValues = np.logspace(-3, 3, 7)
kernelValues = ('poly', 'sigmoid')
# kernelValues = ('linear', 'rbf', 'sigmoid')
degreeValues = np.array([0, 1, 2, 3, 4])
coef0Values = np.logspace(-3, 3, 7)
# Generate the parameter grid.
paramGrid = dict(C=CValues, gamma=GammaValues, kernel=kernelValues,
                 coef0=coef0Values)
# Create and train a SVM classifier using the parameter grid and with
# stratified shuffle split.
stratifiedShuffleSplit = StratifiedShuffleSplit(n_splits=10, test_size=0.25,
                                                train_size=None, random_state=0)
clf = GridSearchCV(estimator=svm.SVC(), param_grid=paramGrid,
                   cv=stratifiedShuffleSplit, n_jobs=1)
clf.fit(scaledData, scaledTarget)
If I uncomment the line kernelValues = ('linear', 'rbf', 'sigmoid'), then the code runs in approximately 50 seconds on my 16 GB i7-4950 3.6 GHz machine running Windows 10.
However, if I try to run the code as is with 'poly' as a possible kernel value, then the code hangs forever. For example, I ran it yesterday overnight and it did not return anything when I got back in the office today.
Interestingly enough, if I try to create a SVM classifier with a poly kernel, it returns a result immediately
clf = svm.SVC(kernel='poly',degree=2)
clf.fit(data, target)
It only hangs when I run the grid search code above. I have not tried other cv methods to see if that changes anything.
Is this a bug in scikit-learn? Am I doing things properly? On a side note, is my method of doing grid search/cross-validation using GridSearchCV and StratifiedShuffleSplit sensible? It seems to me the most brute-force (i.e. time-consuming) but robust method.
Thank you!