Using a different test/train split for each target - scikit-learn

I plan on using a data set which contains 3 target values of interest. Ultimately I will be trying classification methods on a binary target and also plan on using regression methods for two separate continuous targets.
Is it bad practice to do a different train/test split for each target variable?
Otherwise, I am not sure how to split the data in a way that will allow me to predict each target separately.

If these are effectively 3 different models trained and evaluated separately, then for the purpose of scientifically evaluating each model's performance it does not matter whether you use a different train/test split for each model, since no information leaks from one model to another. However, if you plan to compare the results of the 3 models, or to combine the 3 scores into some aggregate metric, you should use the same train/test split so that all 3 models are trained and evaluated on the same rows. Otherwise each model's performance will depend to some extent on which rows happened to land in its particular test set, and the combined score will therefore be partly a function of how you split the data.
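For the shared-split case, one simple pattern is to split the row indices once and reuse them for every target. Below is a minimal scikit-learn sketch; the synthetic data and model choices are stand-ins, not a recommendation:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))           # stand-in feature matrix
y_binary = (X[:, 0] > 0).astype(int)     # stand-in binary target
y_reg1 = X[:, 1] + rng.normal(size=500)  # stand-in continuous target 1
y_reg2 = X[:, 2] + rng.normal(size=500)  # stand-in continuous target 2

# Split the row indices once, then reuse them for every target.
idx_train, idx_test = train_test_split(np.arange(len(X)), test_size=0.2, random_state=42)

clf = LogisticRegression().fit(X[idx_train], y_binary[idx_train])
reg1 = LinearRegression().fit(X[idx_train], y_reg1[idx_train])
reg2 = LinearRegression().fit(X[idx_train], y_reg2[idx_train])

print("accuracy:", clf.score(X[idx_test], y_binary[idx_test]))
print("R^2 target 1:", reg1.score(X[idx_test], y_reg1[idx_test]))
print("R^2 target 2:", reg2.score(X[idx_test], y_reg2[idx_test]))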

Related

Using GAN to Model Posteriors with PyTorch

I have a 5-dimensional dataset and I'm interested in using a neural network to model the posterior distributions from which the data was drawn. I decided to implement a GAN to do this, and have been familiarizing myself with PyTorch.
I'm wondering how one should go about restricting what values the generator can produce for the parameters. For one of the parameters, the values must be nonnegative real values. For another case, the values must be nonnegative integer values. For the other three cases, the parameters can take on any real value.
My first idea was to control this through the transfer function applied to the nodes in the output layer of my neural network. But all of the PyTorch examples I've seen so far apply the same transfer function to all of the output nodes, which is not what I want to do. Is there a way to apply a different transfer function to each output node? Or is there maybe a better way to approach this problem?
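For what it's worth, the first idea described above can be sketched by leaving the output layer linear and applying a different transformation to each slice of the output inside forward(). The class name, dimensions, and the choice of softplus below are placeholders; the nonnegative-integer constraint is only relaxed to a continuous value here, since rounding is not differentiable and would typically be applied at sampling time:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    # Sketch only: plain linear output, with per-slice constraints in forward().
    def __init__(self, latent_dim=16, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 5),
        )

    def forward(self, z):
        raw = self.body(z)
        nonneg_real = F.softplus(raw[:, 0:1])    # parameter 1: nonnegative real
        nonneg_count = F.softplus(raw[:, 1:2])   # parameter 2: continuous relaxation, round when sampling
        unconstrained = raw[:, 2:5]              # parameters 3-5: any real value
        return torch.cat([nonneg_real, nonneg_count, unconstrained], dim=1)

g = Generator()
print(g(torch.randn(4, 16)).shape)  # torch.Size([4, 5])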

Average the same model over multiple runs or take an ensemble of a multitude of the same model?

Currently I'm averaging predictions using the same model: I re-initialize the model, fit it to a different split of the same data sample, and then average over all of the prediction accuracies.
Seems there are multiple possible ways to do this.
1. Average together the predictions, such as in this question (a sketch of this option follows after this question).
2. Average all of the final model weights together, such as here.
3. Build an average ensemble (but using the same model for all of the input models), or go a step further and make it a weighted-average ensemble.
4. Stack ensembles to create a model that learns which models are the best predictors.
While 1 and 2 deal with the same type of model (e.g. Keras models), in 3 and 4 I can use multiple different models. But is it a good approach to use 3 and 4 instead of 1 and 2 by simply making all the models in the ensemble the same (though trained on different training sets)? Furthermore, the ensemble approach allows for different types of models. It seems that 3 and 4 could be used instead of 1 or 2, since they are more general; for example, using 3 to find a weighted average of N copies of the same model. If so, would stacking ensembles (in 4) be better than just weighting them (in 3), that is, creating a higher-level model that learns which of the lower-level models make better predictions?
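As a concrete illustration of option 1 (averaging the predictions of the same model, re-initialized and fit on different splits), here is a minimal scikit-learn sketch; the MLPClassifier, synthetic data, and number of runs are placeholders:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

probas = []
for seed in range(5):
    # re-initialize the same model and fit it on a different random split
    X_tr, _, y_tr, _ = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=seed)
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    model.fit(X_tr, y_tr)
    probas.append(model.predict_proba(X_test))

avg_proba = np.mean(probas, axis=0)      # option 1: average the predicted probabilities
ensemble_pred = avg_proba.argmax(axis=1)
print("ensemble accuracy:", (ensemble_pred == y_test).mean())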

combine multiple spacy textcat_multilabel models into a single textcat_multilabel model

Problem: I have millions of records that need to be transformed using a bunch of spacy textcat_multilabel models.
# pseudocode for the current approach
import spacy

for model in models:                            # paths/names of the textcat_multilabel models
    nlp = spacy.load(model)
    for groups_of_records in records:           # millions of records, in batches
        new_data = nlp.pipe(groups_of_records)  # data is processed in bulk
        # process data
        bulk_create_records(new_data)
My current loop is as follows:
load a model
loop through records / transform data using model / save
As you can imagine, the more records I process and the more models I include, the longer this entire process takes. The idea is to build a single model and process my data only once, instead of (n * num_of_models) times.
Question: is there a way to combine multiple textcat_multilabel models created from the same spacy config, into a single textcat_multilabel model?
There is no basic feature to just combine models, but there are a couple of ways you can do this.
One is to source all your components into the same pipeline. This is very easy to do; see the double NER project for an example, and the sketch at the end of this answer. The disadvantage is that this might not save you much processing time, since separately trained models will still have their own tok2vec layers.
You could combine your training data and train one big model. But if your models are actually separate that would almost certainly cause a reduction in accuracy.
If speed is the primary concern, you could train each of your textcats separately while freezing your tok2vec. That would result in decreased accuracy, though maybe not too bad, and it would allow you to then combine the textcat models in the same pipeline while removing a bunch of tok2vec processing. (This is probably the method I've listed with the best balance of implementation complexity, speed advantage, and accuracy sacrificed.)
One thing that I don't think has been tested is that you could try training separate textcat models at the same time with separate sets of labels by manually specifying the labels to each component in their configs. I am not completely sure that would work but you could try it.
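For the first option (sourcing trained components into one pipeline), a minimal sketch looks like the following; "model_a" and "model_b" are placeholder names for your separately trained textcat_multilabel pipelines:

import spacy

nlp = spacy.load("model_a")    # placeholder: first trained pipeline
other = spacy.load("model_b")  # placeholder: second trained pipeline

# Pull the trained textcat_multilabel component out of the second pipeline and
# add it to the first under a new name.
nlp.add_pipe("textcat_multilabel", source=other, name="textcat_multilabel_b")

for doc in nlp.pipe(["some text to classify"]):
    print(doc.cats)  # scores from both components end up in doc.cats

Note that both components write into doc.cats, so if the label sets overlap the later component's scores will overwrite the earlier ones.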

Neural network regression evaluation based on target range

I am currently fitting a neural network to predict a continuous target ranging from 1 to 10. However, the samples are not evenly distributed over the data set: samples with targets in the 1-3 range are quite underrepresented (they account for only around 5% of the data). Yet they are of particular interest, since the low end of the target range is the critical range.
Is there any way to know how my model predicts these low range samples in particular? I know that when doing multiclass classification I can examine the recall to get a taste of how well the model performs on a certain class. For classification use cases I can also set the class weight parameter in Keras to account for class imbalances, but this is obviously not possible for regression.
So far I have used typical metrics like MAE, MSE, and RMSE and get satisfying results. I would, however, like to know how the model performs on the "critical" samples.
From my point of view, I would compute the test measurements (classification performance, MSE, RMSE) on the whole test set, i.e. over the whole range of values (1-10). Then I would do the same separately for the specific range that you consider critical (say, 1-3) and compare how much the two populations diverge. You can even perform some statistics on the significance of the difference between the two populations (Wilcoxon tests etc.).
Maybe this link could be useful for your comparisons. Since you are doing regression, you can compare MSE and RMSE in the same way.
What you need to do is find identifiers for these critical samples; row indices are often used for this. Once you have predicted all of your samples, use those stored indices to pick the critical samples out of your predictions and run whatever metric you like over that filtered subset. I hope this answers your question.
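As a minimal sketch of that filtering step, assuming you have arrays of test targets and predictions (the synthetic values below are stand-ins for your own data):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
y_test = rng.uniform(1, 10, size=200)              # stand-in targets on the 1-10 scale
y_pred = y_test + rng.normal(scale=0.8, size=200)  # stand-in model predictions

critical = (y_test >= 1) & (y_test <= 3)           # the under-represented low range

for name, mask in [("overall", np.ones_like(critical, dtype=bool)), ("critical 1-3", critical)]:
    mae = mean_absolute_error(y_test[mask], y_pred[mask])
    rmse = np.sqrt(mean_squared_error(y_test[mask], y_pred[mask]))
    print(f"{name}: MAE={mae:.3f}, RMSE={rmse:.3f}, n={mask.sum()}")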

Train multiple models with various measures and accumulate predictions

So I have been playing around with Azure ML lately, and I have one dataset with multiple values I want to predict. Each of them uses a different algorithm, and when I try to train multiple models within one experiment it says “train model can only predict one value”, and there are not enough input ports on the Train Model module to take in multiple label columns even if I were to use the same algorithm for each measure. I tried launching the column selector and making rules, but I get the same error as mentioned. How do I predict multiple values and later put the predicted columns together for the web service output so I don’t have to have multiple APIs?
What you would want to do is to train each model and save them as already trained models.
So create a new experiment, train your models, and save them by right-clicking on each model; they will then show up in the left nav bar in the Studio. Now you can drag your saved models onto the canvas and have them score predictions, eventually joining them into the same output, as I have done in my example through the “Add columns” module. I made this example for Ronaldo (Real Madrid CF player) on how he will perform in a match after a training day. You can see my demo at http://ronaldoinform.azurewebsites.net
For a more detailed explanation of how to save the models and train multiple values, you can check out Raymond Langaeian's (MSFT) answer in the comment section at this link:
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-convert-training-experiment-to-scoring-experiment/
You have to train a model for each variable that you are going to predict, then add all of those predicted columns together into a single output for the web service.
The algorithms available in Azure ML are only capable of predicting a single variable at a time based on the inputs they are given.
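Outside of the Azure ML Studio canvas, the same pattern (one trained model per target, with the predicted columns joined into a single output, analogous to the “Add columns” module) can be sketched in scikit-learn; all data and column names below are placeholders:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y_goals = X[:, 0] + rng.normal(size=300)    # stand-in continuous target
y_scores_first = (X[:, 1] > 0).astype(int)  # stand-in binary target

reg = LinearRegression().fit(X, y_goals)
clf = LogisticRegression().fit(X, y_scores_first)

# join the per-target predictions into one output table
output = pd.DataFrame({
    "predicted_goals": reg.predict(X),
    "predicted_scores_first": clf.predict(X),
})
print(output.head())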
