I'm playing with Keras + TF. I have a model composed of 4 LSTM layers + 2 dense layers.
I have 3 features, which are 3 sine sequences, and a target, which is the product of the 3 sine sequences.
The LSTM layers are configured with 30 backlog time-steps.
When I train the RNN on 80% of the data and then ask it to predict that same learned data (80% of the total), I obtain a very good prediction.
Next I take the last 20% of the data, split it into 10 sub-parts, and loop: predict(part_x[0]), fit(part_x[0], part_y[0]), predict(part_x[1]), fit(part_x[1], part_y[1])... But the prediction quality drops dramatically.
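In code, the loop looks roughly like this (a minimal, self-contained sketch; the toy model and random data here are placeholders, my real model has 4 LSTM + 2 dense layers):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy stand-in for my real model (the real one has 4 LSTM + 2 dense layers).
model = keras.Sequential([
    layers.LSTM(16, input_shape=(30, 3)),  # 30 backlog time-steps, 3 sine features
    layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")

# The last 20% of the data split into 10 sub-parts (random placeholders here).
part_x = [np.random.rand(20, 30, 3) for _ in range(10)]
part_y = [np.random.rand(20, 1) for _ in range(10)]

for i in range(10):
    preds = model.predict(part_x[i])           # first predict the i-th block
    model.fit(part_x[i], part_y[i], epochs=1)  # then fine-tune on that block
```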
Is it correct to expect that a predict(x[i]) / fit(x[i], y[i]) loop should produce a decent outcome for every x[i+1] block?
Yet another question: is it possible to train an RNN with 4 features and predict with only 3? If yes, how can I "blind" the unavailable features during the prediction phase?
TIA
Roberto C.
I'm trying to train a multilabel text classification model using BERT. Each piece of text can belong to 0 or more of a total of 485 classes. My model consists of a dropout layer and a linear layer added on top of the pooled output from the bert-base-uncased model from Hugging Face. The loss function I'm using is the BCEWithLogitsLoss in PyTorch.
I have millions of labeled observations to train on. But the training data are highly imbalanced: some labels appear in fewer than 10 observations while others appear in more than 100K! I'd like to get "good" recall.
My first attempt at training without adjusting for data imbalance produced a micro recall rate of 70% (good enough) but a macro recall rate of 45% (not good enough). These numbers indicate that the model isn't performing well on underrepresented classes.
How can I effectively adjust for the data imbalance during training to improve the macro recall rate? I see we can provide label weights to the BCEWithLogitsLoss loss function. But given the very high imbalance in my data, leading to weights in the range of 1 to 1M, can I actually get the model to converge? In my initial experiments the weighted loss oscillates during training.
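For reference, this is roughly what I mean by weighting (a sketch; the class counts and the clamp value of 100 are placeholders, not my real numbers):

```python
import torch
import torch.nn as nn

# Per-class positive counts (placeholders; my real data has 485 classes
# with counts from under 10 to over 100K).
n_obs = 1_000_000
label_counts = torch.tensor([10.0, 500.0, 100_000.0])

# Weight positives by the negative/positive ratio, clamped to avoid the
# extreme 1-to-1M range that seems to destabilize training.
pos_weight = ((n_obs - label_counts) / label_counts).clamp(max=100.0)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 3)                     # batch of 8, 3 classes
targets = torch.randint(0, 2, (8, 3)).float()  # multi-hot labels
loss = criterion(logits, targets)
```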
Alternatively, is there a better approach than using BERT + dropout + linear layer for this type of task?
In your case it might help to balance the labels in the training data. You have a lot of data, so you can afford to lose some of it through balancing. But before you do this, I recommend reading this answer about balancing classes in training data.
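For illustration, naive per-label downsampling could look like the sketch below (the cap and the label matrix are made up; multi-label balancing is subtle, which is exactly why the linked answer is worth reading first):

```python
import numpy as np

# Hypothetical multi-hot label matrix: 100K observations, 485 classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(100_000, 485))

cap = 10_000  # assumed maximum number of examples kept per label
keep = set()
for c in range(labels.shape[1]):
    positives = np.flatnonzero(labels[:, c] == 1)
    keep.update(rng.permutation(positives)[:cap].tolist())

balanced_idx = np.array(sorted(keep))  # indices of the retained observations
```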
If you really only care about recall, you could try tuning your model to maximize recall.
I am trying to train a CNN on audio from football games to predict highlights. The data consists of MFCC spectrograms (https://librosa.org/doc/main/generated/librosa.feature.mfcc.html) of duration t = 1 s, extracted at a rate of 10 Hz. These MFCC spectrograms (~3000 per game, t = 300 s of labelled footage) are all labelled: 1 if the second corresponds to a highlight, 0 if it corresponds to a lowlight. They are all 32x40 matrices: 40 rows for the number of MFC coefficients (see the librosa doc) and 32 columns for 32 samples per second.
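For reference, the patches are produced roughly like this (a sketch; the file path is a placeholder, and sr/hop_length are chosen so that 32 frames span one second):

```python
import librosa

# Placeholder path; sr=16000 with hop_length=500 gives 32 frames per second.
y, sr = librosa.load("PSGvMU_audio.wav", sr=16000)

# 40 MFC coefficients per frame -> mfcc has shape (40, n_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, hop_length=500)

# Cut into non-overlapping 1-second patches of shape (40, 32).
patches = [mfcc[:, i:i + 32] for i in range(0, mfcc.shape[1] - 31, 32)]
```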
I am training a CNN on this data. Here's its architecture:
CNN architecture
I have a balanced set taken from the PSGvMU game, composed of 50% highlight / 50% lowlight MFCC spectrograms. This set is split into an 80% balanced training dataset and a 20% balanced validation dataset.
I am training my model for 10 epochs, with batch size 32 and the Adam optimizer with lr=0.001. Here are the training epochs:
Training epochs accuracies and validation accuracies
Every time I test my model on new MFCC spectrograms, the predictions (between 0 and 1) have a very high mean (~0.99+), and the optimal classification threshold (calculated as argmax(accuracy | threshold)) is often also very high, typically around 0.99-0.999.
Accuracy as function of classification threshold graph
The problem is that I need to know the true labels to get a good classification threshold and hence good results.
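Concretely, my threshold search is just this (a sketch with placeholder predictions and labels):

```python
import numpy as np

# Placeholder predictions and true labels for 1000 test spectrograms.
rng = np.random.default_rng(0)
probs = rng.random(1000)
labels = rng.integers(0, 2, 1000)

# argmax(accuracy | threshold): requires the true labels, which is the problem.
thresholds = np.linspace(0.0, 1.0, 1001)
accuracies = [((probs >= t).astype(int) == labels).mean() for t in thresholds]
best_threshold = thresholds[int(np.argmax(accuracies))]
```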
What do you think about my approach? Is there something wrong with my model? Or am I just lacking data / overfitting a lot?
I am new to deep learning.
I have a dataset with 1001 values of human upper-body pose. The model I trained for it has 4 conv layers and 2 fully connected layers with ReLU and dropout. This is the result I got after 200 iterations. Does anyone have an idea why the training-loss curve drops so sharply?
I think I probably need more data. Since my dataset consists of numerical values, what do you think is the best data-augmentation method to use here?
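For example, would simple jittering like the sketch below be a reasonable choice? (The array shape and noise scale are made up.)

```python
import numpy as np

# Hypothetical pose data: 1001 samples, 16 joints, (x, y) coordinates.
poses = np.random.rand(1001, 16, 2)

# Jitter each sample with small Gaussian noise to create new examples.
noise = np.random.normal(0.0, 0.01, size=poses.shape)
augmented = np.concatenate([poses, poses + noise], axis=0)
```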
I would like to predict multiple timesteps into the future. My current NN outputs a sparse classification of 0, 1 or 2.
The sparse classification is output via a softmax Dense layer with 3 neurons, corresponding to the three categories mentioned above.
How would I shape the output layer (the softmaxed Dense layer) so that it predicts two timesteps into the future while keeping the number of sparse categorical classes at 3? Right now, if I set that Dense layer to 6 neurons (3 classes * 2 timesteps), I get an output that amounts to a sparse categorical classification with 6 classes and 1 timestep.
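Would something like this sketch be the right direction? (The input shape and layer sizes are placeholders, not my real network.)

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder input: 30 past timesteps with 8 features each.
inputs = keras.Input(shape=(30, 8))
x = layers.LSTM(64)(inputs)
x = layers.Dense(2 * 3)(x)            # 2 future timesteps * 3 classes
x = layers.Reshape((2, 3))(x)         # separate the timestep and class axes
outputs = layers.Softmax(axis=-1)(x)  # softmax over the 3 classes per timestep

model = keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```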
I got a little bit lost while studying loss functions for multi-task learning.
For instance, in binary classification with only one task, e.g. classifying emails as spam or not, the probabilities of the two labels (spam / not spam) sum to 1 when using a softmax activation + softmax cross-entropy loss. How does that apply to multi-task learning?
Let's consider the case with 5 tasks, each of which is a binary problem. Is the softmax applied to each task independently (e.g. for task 1: probability of label 1 = 0.7 and label 2 = 0.3; for task 2: probability of label 1 = 0.2 and label 2 = 0.8; and so on), or does it consider the tasks jointly (e.g. if label 1 of task 1 has a probability of 0.80, all other labels of all other tasks sum to 0.20)?
Some notes:
Nitpicking: you should not use softmax for binary classification, but rather a regular sigmoid (which is the 2-class special case of softmax), followed by a log-loss (likewise the 2-class special case of cross-entropy).
For multi-task learning that involves classification, you would probably use multiple binary classifications. Say you have an image and you want outputs saying whether it contains pedestrians, cars, and road signs. This is not multi-class classification, as an image can contain all of the above. So instead you define your output as 3 nodes and compute a binary classification for each node. This is done in one multi-task NN instead of running 3 different NNs, on the assumption that all 3 classification problems can benefit from the same latent layer or embedding created in that one NN.
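A minimal sketch of that 3-node setup (layer sizes and data are made up for illustration):

```python
import torch
import torch.nn as nn

# Shared latent representation feeding 3 independent binary outputs
# (pedestrian, car, road sign); sizes are arbitrary for illustration.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
head = nn.Linear(128, 3)              # one logit per task
criterion = nn.BCEWithLogitsLoss()    # sigmoid + log-loss, per node

images = torch.randn(4, 3, 32, 32)    # placeholder batch of images
targets = torch.tensor([[1., 0., 1.], # an image can carry several labels at once
                        [0., 0., 0.],
                        [1., 1., 1.],
                        [0., 1., 0.]])
loss = criterion(head(backbone(images)), targets)
```

Since each node gets its own sigmoid, the three probabilities are independent and need not sum to 1.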
Primarily, the loss function can be different for different tasks in multi-task classification (note that this is not multi-label classification).
For example, task 1 can be binary classification, task 2 can be next-sentence prediction, and so on. Since the different tasks involve learning different loss functions, the first part of your assumption holds: softmax is applied only to the labels of the first task, while learning the first task.