Process training and test sets in GridSearchCV - scikit-learn

I want to evaluate different classifiers on the link-prediction task when they are combined with node embedding algorithms. More specifically, I want to evaluate whether node embeddings can improve the accuracy of different classifiers at predicting new links between nodes.
My idea is the following:
I create a dataset containing both positive and negative samples (real links and non-existing links)
I split the dataset into a Development Set (DS) and an Evaluation Set (ES).
I use the DS to perform grid-search cross-validation (CV) to find the best model.
I train the best model on the entire DS, and then I evaluate its performance on the ES.
The problem is the following: I cannot run the node embedding algorithms on the entire dataset because, in that case, the embeddings given to the classifier would already encode the graph topology of the ES links. Therefore, I need to compute node embeddings separately for the training and test splits generated during the grid-search CV, but how can I do that with the sklearn.model_selection.GridSearchCV class?
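One way to set this up is to wrap the embedding step in a custom scikit-learn transformer whose fit() learns the node embeddings only from the training pairs of the current fold, and to place it in a Pipeline that GridSearchCV cross-validates as a whole; GridSearchCV then refits the transformer on each fold's training split, so the held-out links never influence the embeddings. The following is only a sketch: compute_node_embeddings() is a hypothetical stand-in for the embedding algorithm (assumed to return a dict mapping node id to vector), and X is assumed to be an array of (source, target) node pairs with y marking link existence.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

class EdgeEmbedder(BaseEstimator, TransformerMixin):
    """Learns node embeddings from the positive links of the current training split."""
    def __init__(self, dimensions=64):
        self.dimensions = dimensions

    def fit(self, X, y):
        # Build the graph only from the links in this fold's training split, so no
        # information about held-out links leaks into the embeddings.
        train_edges = X[np.asarray(y) == 1]
        self.embeddings_ = compute_node_embeddings(train_edges, self.dimensions)  # hypothetical
        return self

    def transform(self, X):
        # Represent each candidate link as the element-wise (Hadamard) product of the
        # two node vectors; nodes unseen during fit fall back to a zero vector.
        zero = np.zeros(self.dimensions)
        return np.array([self.embeddings_.get(u, zero) * self.embeddings_.get(v, zero)
                         for u, v in X])

pipe = Pipeline([('embed', EdgeEmbedder()), ('clf', LogisticRegression(max_iter=1000))])
param_grid = {'embed__dimensions': [32, 64], 'clf__C': [0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc')
# search.fit(X_ds, y_ds)  # X_ds: (source, target) pairs of the DS, y_ds: 0/1 link labels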

Related

How to deal with the unlabeled nodes in Pytorch Geometric?

I have my own dataset, and it contains two classes, let's say 0 and 1. In addition, there is a large portion of nodes whose class is unlabeled. My goal is to predict these unlabeled nodes using a GCN, but I am confused about how to deal with these unlabeled nodes in PyTorch Geometric.
As far as I can tell, I could label the nodes with 3 classes: 0, 1 and unknown. But if I do it this way, it means I am trying to classify the dataset into three classes, which is not what I want, since unknown is not a real class.
Another way to deal with these nodes is to ignore them and simply run PyG on the labeled nodes only. But then the unlabeled nodes (which still have features) seem useless in the dataset.
That very much depends on your use case and the data!
Case 1 - Graph Autoencoder
For this case let's assume the task is to find similar tweets. A way of doing this is to train a Graph Autoencoder (see example).
This approach is completely unsupervised and thus does not need any data to be labeled.
The resulting model should be able to generate an embedding for each node (in this case each tweet) so that the distance between similar tweets is lower than between non-similar ones (measured, e.g., by cosine distance).
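A minimal sketch of such a graph autoencoder in PyTorch Geometric might look like this (assuming data is a Data object with node features x and an edge_index; the layer sizes are arbitrary):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GAE, GCNConv

class Encoder(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, 2 * out_channels)
        self.conv2 = GCNConv(2 * out_channels, out_channels)

    def forward(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

model = GAE(Encoder(in_channels=data.num_node_features, out_channels=64))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    optimizer.zero_grad()
    z = model.encode(data.x, data.edge_index)      # one embedding per node (tweet)
    loss = model.recon_loss(z, data.edge_index)    # unsupervised edge-reconstruction loss
    loss.backward()
    optimizer.step()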
Case 2 - Semi-Supervised GCN
Another case would be to classify tweets as advertisement vs. non-advertisement. Since the idea behind GCNs is to train in a semi-supervised manner, it is no problem to have labels for only some of the tweets.
To tell PyG which nodes have labels and should be used for training, you can define a train_mask. All nodes with missing labels still technically need a y value, which can be set to -1.
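For illustration, a minimal training step with such a mask could look like the following (assuming data is a torch_geometric.data.Data object in which unlabeled nodes carry y = -1, and model is any node-classification GNN):

import torch.nn.functional as F

# Unlabeled nodes still take part in message passing; only labeled nodes enter the loss.
train_mask = data.y != -1

out = model(data.x, data.edge_index)                         # logits for every node
loss = F.cross_entropy(out[train_mask], data.y[train_mask])  # loss over labeled nodes only
loss.backward()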

Is it possible to customise tf.estimator for Unsupervised Learning with self-defined evaluation graph not having loss?

I am implementing an item2vec model, based on the idea of word2vec, with the tf.estimator API for product recommendation.
There is no problem implementing the training part with tf.estimator. The process is the same as word2vec, and I treat each transaction as a sentence. The only difference is how the training input is generated: (target_item, context_item) pairs. After training the pseudo-classification problem, I can use the trained embedding vector of each item to measure the relationships between items.
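For example, a transaction could be expanded into skip-gram-style pairs along these lines (a simple illustration, not the exact code of the question):

from itertools import permutations

def transaction_to_pairs(transaction):
    # Every ordered pair of distinct items in the same transaction becomes one
    # (target_item, context_item) training example, mirroring word2vec's skip-gram pairs.
    return list(permutations(transaction, 2))

transaction_to_pairs(["milk", "bread", "butter"])
# [('milk', 'bread'), ('milk', 'butter'), ('bread', 'milk'), ('bread', 'butter'), ...]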
The problem is the evaluation part: it is not a typical supervised-learning evaluation, i.e., feeding eval data through the same graph to obtain predictions and accuracy.
The evaluation input data I would like to use is in a totally different format from the training input data.
Format of the eval input data: (target_item, {context_item1, context_item2, ...}). With this, I can obtain the top_k nearest items for the context items and then check whether the target_item is in the collection of these nearest items, which gives me a hit ratio.
However, tf.estimator.EstimatorSpec() with mode = ModeKeys.EVAL requires a loss as input. Does that mean evaluation can only reuse part of the training graph? What can I do if I don't have a loss function for evaluation in my case, since the evaluation no longer goes through the classification?
Many thanks.
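One possible workaround (a sketch only; item_embeddings, context_items and the params keys are hypothetical names) is to keep an EVAL branch in the model_fn, expose the hit ratio through eval_metric_ops, and pass a constant loss purely to satisfy the EstimatorSpec signature:

import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Embedding table shared with the training graph (restored from the checkpoint).
    item_embeddings = tf.get_variable(
        "item_embeddings", [params["num_items"], params["embedding_dim"]])

    if mode == tf.estimator.ModeKeys.EVAL:
        # features["context_items"]: [batch, num_context]; labels: [batch] target item ids.
        context_vec = tf.reduce_mean(
            tf.nn.embedding_lookup(item_embeddings, features["context_items"]), axis=1)
        scores = tf.matmul(context_vec, item_embeddings, transpose_b=True)  # [batch, num_items]
        _, top_k = tf.nn.top_k(scores, k=params["k"])                       # [batch, k]
        hits = tf.reduce_any(
            tf.equal(top_k, tf.expand_dims(tf.cast(labels, tf.int32), axis=1)), axis=1)
        hit_ratio = tf.metrics.mean(tf.cast(hits, tf.float32))
        # EstimatorSpec insists on a loss in EVAL mode; a constant keeps it satisfied
        # when hit_ratio is the metric that actually matters.
        return tf.estimator.EstimatorSpec(
            mode, loss=tf.constant(0.0), eval_metric_ops={"hit_ratio": hit_ratio})

    # ... the usual TRAIN branch with the word2vec-style loss goes here ...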

Feature selection on a keras model

I am trying to find the features that contribute most to the output of my regression model. The following is my code.
import numpy as np
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE

seed = 7
np.random.seed(seed)
estimators = []
estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, epochs=3, batch_size=20)))
pipeline = Pipeline(estimators)
rfe = RFE(estimator=pipeline, n_features_to_select=5)
fit = rfe.fit(X_set, Y_set)
But I get the following runtime error when running.
RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
How can I overcome this issue and select the best features for my model? If that is not possible, can I use algorithms like LogisticRegression(), which RFE in scikit-learn does support, to find the best features for my dataset?
I assume your Keras model is some kind of neural network, and with NNs in general it is hard to see which input features are relevant and which are not. The reason is that each input feature has multiple coefficients linked to it, one for each node of the first hidden layer. Adding additional hidden layers makes it even harder to determine how big an impact an input feature has on the final prediction.
On the other hand, for linear models it is very straightforward, since each feature x_i has a corresponding weight/coefficient w_i whose magnitude directly determines how big an impact it has on the prediction (assuming, of course, that the features are scaled).
The RFE (recursive feature elimination) estimator assumes that your prediction model exposes an attribute coef_ (linear models) or feature_importances_ (tree models) whose length equals the number of input features and whose values represent their relevance (in absolute terms).
My suggestion:
Feature selection:
(Option a) Run RFE on any linear / tree model to reduce the number of features to some desired number n_features_to_select.
(Option b) Use regularized linear models like lasso / elastic net that enforce sparsity. The drawback here is that you cannot directly set the actual number of selected features.
(Option c) Use any other feature selection technique from here.
Neural network: use only the features selected in step 1 for your neural network.
Suggestion:
Perform the RFE algorithm with a sklearn-based estimator to observe feature importance. Then use the most important features to train your Keras-based model.
To your question: standardization is not required for logistic regression.
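A possible sketch of that workflow, reusing X_set, Y_set and baseline_model from the question (a plain LinearRegression stands in for the feature-ranking step, since the task is regression):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from keras.wrappers.scikit_learn import KerasRegressor

# Step 1: rank features with an estimator that exposes coef_.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X_set, Y_set)
X_selected = X_set[:, rfe.support_]   # keep only the 5 selected columns

# Step 2: train the Keras model on the selected features only
# (baseline_model must define an input layer matching the 5 selected features).
mlp = KerasRegressor(build_fn=baseline_model, epochs=3, batch_size=20)
mlp.fit(X_selected, Y_set)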

Confusion Matrix - Not changing with predictive models (Sklearn)

I have 3 predictive models and I am evaluating their performance with a confusion matrix.
I am getting the same results for the confusion matrix for each of the 3 models.
I expect that the different models would perform differently and produce different confusion matrices. I am new to predictive modelling, so I suspect I am making a rookie mistake. The full script I am using is sitting in a Jupyter notebook on GitHub here.
A screenshot of the code for the 3 models is below
Can someone point out what is going wrong?
Cheers
Mike
As mentioned: make predictions on the test data. But keep in mind that your targets are skewed! So use StratifiedKFold or something similar.
Also, I suspect your data may be a bit corrupted; when all models show exactly the same result, there may be a bigger mistake underneath.
A few questions/pieces of advice:
1. Did you scale your data?
2. Did you use one-hot encoding?
3. Don't use decision trees; use random forests / XGBoost instead. It is easy to overfit with a single DT.
4. Don't use more than 2 hidden layers in the NN, because it is easy to overfit there too. Start with 2. Also, your architecture (30, 30, 30) for 2 target classes seems odd.
5. If you do want to use more than 2 hidden layers, switch to Keras or TF; you will find many features there that help prevent overfitting.
That is simply because you are using the same training data to make predictions. Since your models are already trained on the same data that you are making the predictions on, they return the same results (and ultimately the same confusion matrix). You need to split your dataset into training and test sets, then train your classifiers on the training set and make predictions on the test set.
You can use train_test_split in sklearn to split your dataset into training and test sets.
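A minimal sketch of that split-and-evaluate flow (X, y and model are placeholders for the data and the classifiers from the notebook):

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)  # stratify preserves the class ratio

model.fit(X_train, y_train)              # repeat for each of the 3 models
y_pred = model.predict(X_test)           # predict on data the model has not seen
print(confusion_matrix(y_test, y_pred))  # now the matrices can differ between models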

advanced feature extraction for cross-validation using sklearn

Given a dataset with 1000 samples, suppose I preprocess the data to obtain 10000 rows, so that each original row leads to 10 new samples. In addition, when training my model I would like to be able to perform cross-validation.
The scoring function I have uses the original data to compute the score, so I would like the cross-validation scoring to work on the original data as well, rather than on the generated data. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to split the data correctly according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, say, GridSearchCV.
Implement a custom scorer that operates on the original data to score the model for a given set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection with a competition currently running on Kaggle.
Maybe you can use stratified cross-validation (e.g. StratifiedKFold or StratifiedShuffleSplit) on the expanded samples, use the original sample idx as the stratification info, and combine this with a custom score function that ignores the non-original samples during model evaluation.
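A rough sketch of that idea with a manual CV loop (the array names are hypothetical: X_exp/y_exp are the 10000 expanded rows, orig_idx maps each expanded row to its original sample, is_original flags the 1000 untouched rows, and accuracy_score stands in for the custom score function):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
# Stratify on the original sample index so every fold sees rows from each original sample.
for train_idx, test_idx in skf.split(X_exp, orig_idx):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_exp[train_idx], y_exp[train_idx])

    # Score only on the rows of the test fold that correspond to original samples.
    eval_rows = test_idx[is_original[test_idx]]
    scores.append(accuracy_score(y_exp[eval_rows], clf.predict(X_exp[eval_rows])))

print(np.mean(scores))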
