Multi task learning in tensorflow with different learning rates - python-3.x

I want to implement a multi task learning framework in tensorflow. I'm currently implementing something like this (pseudo code)
define_reg_cost()
define_clas_cost()
reg_optimizer.(reg_learning_rate).min(reg_cost)
clas_optimizer.(clas_learning_rate).min(clas_cost)
for num_iterations:
reg_optimizer.run()
clas_optimizer.run()
This dosen't seem the right approach to me.
I have two loss functions and I want to learn the weights to minimize both losses with different learning rates in the same optimizer.
I basically want the gradient optimizer to do updates like this:
weight_updated = weight - learning_rate_1 * d(loss1)/dw - learning_rate_2 * d(loss2)/dw
Any tips on how to do this (possibly using optimizer.apply_gradients())?

Late but maybe helpfull for others:
The solution you have shown would be exactly the right one.
If you take a look at: this post and:
this stack overflow answer
If you were using something like GradientDescentOptimizer, these would
be very similar operations. That's because taking gradients is a
linear operation, and the gradient of a sum is the same as the sum of
the gradients.
But, ADAM does something special: regardless of the scale of your
loss, it scales the gradients so that they're always on the order of
your learning rate. If you multiplied your loss by 1000, it wouldn't
affect ADAM, because the change it would be normalized away.
using two distinct optimisers has a lot of advantages.
You also could split the update in gradient calculation and update and apply the lerningrate directly to the diffrent parts of gradients.
see here

Related

When and Whether should we normalize the ground-truth labels in the multi-task regression models?

I am trying a multi-task regression model. However, the ground-truth labels of different tasks are on different scales. Therefore, I wonder whether it is necessary to normalize the targets. Otherwise, the MSE of some large-scale tasks will be extremely bigger. The figure below is part of my overall targets. You can certainly find that columns like ASA_m2_c have much higher values than some others.
First, I have already tried some weighted loss techniques to balance the concentration of my model when it does gradient backpropagation. The result shows it didn't perform well.
Secondly, I have seen tremendous discussions regarding normalizing the input data, but hardly discovered any particular talking about normalizing the labels. It's partly because most of the people's problems are classification type and a single task. I do know pytorch provides a convenient approach to normalize the vision dataset by transform.normalize, which is still operated on the input rather than the labels.
Similar questions: https://forums.fast.ai/t/normalizing-your-dataset/49799
https://discuss.pytorch.org/t/ground-truth-label-normalization/26981/19
PyTorch - How should you normalize individual instances
Moreover, I think it might be helpful to provide some details of my model architecture. The input is first fed into a feature extractor and then several generators use the shared output representation from that extractor to predict different targets.
I've been working on a Multi-Task Learning problem where one head has an output of ~500 and another between 0 and 1.
I've tried Uncertainty Weighting but in vain. So I'd be grateful if you could give me a little clue about your studies.(If there is any progress)
Thanks.

How to handle shared data between samples and batches in Keras

I'm using Keras for timeseries prediction and I want to create a model that is based on the self-attention mechanism that will not use any RNNs. For each sample we look at the last x timesteps of samples to predict the next sample.
In other words I want to feed the network (num_batches, num_samples, timesteps, features) and get (num_batches, predictions).
There is 1 problems with this.
There is a lot of unnecessary duplication of data where sample n has basically the same timesteps and features as sample n+1, only shifted 1 to the left.
How would you handle this assuming you dataset is very large?
I am not very familiar with this, but if your issue is "I have too many replicated data" I think you can solve your problem devising a generator for your data, and then pass the generator as input for the Keras/TensorFlow fit function (according to TensorFlow APIs specification, it is stated that it supports generators as input).
If your question is related to the logic behind the model, I do not see the issue. It is like that you have a sliding window, for each window you predict one value, and then you move the window by a certain amount (in your case, one). Could you argue a little more about your concern?

Using SVM to perform classification on multi-dimensional time series datasets

I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would just be to reshape the dataset so that it is 2-dimensional, however I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data throw in the raw data and see if it works. If you have an idea what properties of the data may be beneficial for classification, try to work it in a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
If you have lots and lots and lots and lots of data (say an internet full of good examples) Deep Learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to try and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.

How to get started with Tensorflow

I am pretty new to Tensorflow, and I am currently learning it through given website https://www.tensorflow.org/get_started/get_started
It is said in the manual that:
We've created a model, but we don't know how good it is yet. To evaluate the model on training data, we need a y placeholder to provide the desired values, and we need to write a loss function.
A loss function measures how far apart the current model is from the provided data. We'll use a standard loss model for linear regression, which sums the squares of the deltas between the current model and the provided data. linear_model - y creates a vector where each element is the corresponding example's error delta. We call tf.square to square that error. Then, we sum all the squared errors to create a single scalar that abstracts the error of all examples using tf.reduce_sum:"
q1."we don't know how good it is yet.", I didn't understand this
quote as the simple model created is a simple slope equation and on
what it should train for?, as the model is a simple slope. Is it
require an perfect slope or what? why am I training that model and
for what?
q2.what is a loss function? Is loss function is used to determine the
accuracy of the model? Why is it required?
q3. I didn't understand " 'sums the squares of the deltas' between
the current model and the provided data."
q4.I didn't understood this part of code,"squared_deltas =
tf.square(linear_model - y)
this is the code:
y = tf.placeholder(tf.float32)
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
print(sess.run(loss, {x:[1,2,3,4], y:[0,-1,-2,-3]}))
this may be simple questions, but I am a beginner to Tensorflow and having a hard time understanding it.
1) So you're kind of right about "Why should we train for a simple problem" but this is just an introduction piece. With any machine learning task you need to evaluate your model to see how good it is. In this case you are just trying to train to find the coefficients for the line of best fit.
2) A loss function in any machine learning context represents your error with your model. This usually means a function of your "distance" of your calculated value to the ground truth value. Think of it as an internal evaluation score. You want to minimise your loss so the gradients and parameter changes are based on your loss.
3/4) Your question here is more to do with least square regression. It's a statistical method to create lines of best fit between points. The deltas represent the differences between your calculated values and the truth values. The aim is to minimise the area of the squares and hence minise the error and have a better line of best fit.
What you are doing in this Tensorflow example is creating a machine learning model that will learn the coefficients for the line of best fit automatically using a least squares based system.
Pretty much all of your question have to-do with the loss function.
The loss function is a function that determines how far apart your output are from the expected (correct) output.
It has two usages:
Help the algorithm determine if the tweaking of the weight is helping going in the good or bad direction
Determinate the accuracy (~the number of time your system guesses the correct answer)
The loss function is the sum of the deltas witch is: the addition of the diff (delta) between the expected output and the actual output.
I think It's squared to magnifies the error the algorithm makes.

find important features for classification

I'm trying to classify some EEG data using a logistic regression model (this seems to give the best classification of my data). The data I have is from a multichannel EEG setup so in essence I have a matrix of 63 x 116 x 50 (that is channels x time points x number of trials (there are two trial types of 50), I have reshaped this to a long vector, one for each trial.
What I would like to do is after the classification to see which features were the most useful in classifying the trials. How can I do that and is it possible to test the significance of these features? e.g. to say that the classification was drive mainly by N-features and these are feature x to z. So I could for instance say that channel 10 at time point 90-95 was significant or important for the classification.
So is this possible or am I asking the wrong question?
any comments or paper references are much appreciated.
Scikit-learn includes quite a few methods for feature ranking, among them:
Univariate feature selection (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html)
Recursive feature elimination (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_digits.html)
Randomized Logistic Regression/stability selection (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLogisticRegression.html)
(see more at http://scikit-learn.org/stable/modules/feature_selection.html)
Among those, I definitely recommend giving Randomized Logistic Regression a shot. In my experience, it consistently outperforms other methods and is very stable.
Paper on this: http://arxiv.org/pdf/0809.2932v2.pdf
Edit:
I have written a series of blog posts on different feature selection methods and their pros and cons, which are probably useful for answering this question in more detail:
http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/
http://blog.datadive.net/selecting-good-features-part-ii-linear-models-and-regularization/
http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/

Resources