I am trying to implement the AntisymmetricRNN described in this paper: https://arxiv.org/abs/1902.09689.
Working in Keras, I guess I have to implement my own layer, so I have read https://keras.io/layers/writing-your-own-keras-layers/. Instead of starting from a plain layer as explained there, I reckon the best approach would probably be to extend one of the existing RNN classes, but Keras has
RNN
SimpleRNNCell
SimpleRNN
The documentation isn't verbose enough for someone at my level to understand what these classes do, and consequently I am having a hard time figuring out what my starting point should be.
Any help, both in terms of where to start and what to actually look out for, and all sorts of suggestions are greatly appreciated. Thank you.
In Keras, all recurrent layers are RNN layers with a certain Cell.
The definition is RNN(cell=someCell)
So the LSTM layer follows the same principle: an LSTM(units=...) layer is equivalent to an RNN(cell=LSTMCell(units=...), ...) layer.
That said, to implement your own recurrent layer (as long as it doesn't break the step-by-step recurrent flow or skip steps), you need to implement your own cell. You can study what is happening in the LSTMCell code, compare it with the paper, and adjust the weights and formulas to your needs.
So you will have your own RNN(cell=yourCell).
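For reference, here is a minimal sketch of what such a cell could look like for the AntisymmetricRNN (Chang et al., 2019). It implements the Euler-discretized update from the paper, h_t = h_{t-1} + eps * tanh((W - W^T - gamma*I) h_{t-1} + V x_t + b); the class name, default hyperparameter values, and variable names are my own choices, not a reference implementation:

import tensorflow as tf
from tensorflow import keras

class AntisymmetricRNNCell(keras.layers.Layer):
    # Sketch of the (ungated) AntisymmetricRNN cell.
    def __init__(self, units, eps=0.01, gamma=0.01, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.eps = eps        # Euler step size
        self.gamma = gamma    # diffusion constant
        self.state_size = units
        self.output_size = units

    def build(self, input_shape):
        # V maps the input; W parameterizes the antisymmetric recurrent matrix
        self.V = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="glorot_uniform", name="V")
        self.W = self.add_weight(shape=(self.units, self.units),
                                 initializer="glorot_uniform", name="W")
        self.b = self.add_weight(shape=(self.units,),
                                 initializer="zeros", name="b")

    def call(self, inputs, states):
        h = states[0]
        # Antisymmetric matrix (W - W^T) minus the diffusion term gamma*I
        A = self.W - tf.transpose(self.W) - self.gamma * tf.eye(self.units)
        h_new = h + self.eps * tf.tanh(
            tf.matmul(h, A) + tf.matmul(inputs, self.V) + self.b)
        return h_new, [h_new]

layer = keras.layers.RNN(AntisymmetricRNNCell(64))  # drop-in recurrent layer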
I am currently an undergraduate, and I am working on a CNN model to recognize Telugu characters.
This question has two parts:
I have Telugu character images of shape (32,32,1), and I want to train my CNN model to recognize the characters. What should my model architecture be, and how do I decide on the architecture, the number of parameters, and the hidden layers? I know my case is essentially the same as handwritten digit recognition, but I want to know how to decide those parameters. Is there any common practice in building such an architecture?
The operation Conv2D(32, (5,5)) means 32 filters of size 5x5 are applied to the input. My question is: are these filters all the same or different, and if different, what kind of filters are initialized and who decides them?
I tried searching the internet, but everywhere I go the answer I get is that the Conv2D operation applies filters to the input and does the convolution operation.
To decide which model architecture would be best, you need to experiment; that's the only way. Since you want to classify, a VGG-style architecture would be a good starting point, I believe. You need to experiment with the number of parameters, as it depends on your problem. You can use Keras Tuner for that: https://keras.io/keras_tuner/
For kernel initialization: as far as I know, convolutional layers in Keras use Glorot uniform initialization by default, but you can change that with the kernel_initializer parameter. Long story short, convolutional layers are initialized from a distribution, and as training goes on the filters change the values inside them, which is the learning process. https://keras.io/api/layers/initializers
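To make that concrete, here is a small sketch (the layer sizes are just placeholders matching your example) showing the default and how to override it:

from tensorflow import keras

# Default kernel_initializer is Glorot uniform
conv_default = keras.layers.Conv2D(32, (5, 5), input_shape=(32, 32, 1))

# Override it, e.g. with He normal, via kernel_initializer
conv_he = keras.layers.Conv2D(
    32, (5, 5),
    kernel_initializer=keras.initializers.HeNormal(),
)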
Edit: I forgot to mention that while I suggest a VGG-style architecture, you should downsize the model a lot. Your input shape is small, so if your model is too deep, it will overfit really quickly.
I was not able to understand one thing: when people say "fine-tuning of BERT", what does it actually mean?
1. Are we retraining the entire model again with new data?
2. Or are we just training the top few transformer layers with new data?
3. Or are we training the entire model, but using the pretrained weights as initial weights?
4. Or are there already a few ANN layers on top of the transformer layers, and only those get trained while the transformer weights are kept frozen?
I tried Google, but I am only getting more confused. If someone can help me with this, I'd appreciate it.
Thanks in advance!
I remember reading a Twitter poll in a similar context, and it seems that most people tend to accept your suggestion 3 (or variants thereof) as the standard definition.
However, this obviously does not speak for every single work, but I think it's fairly safe to say that 1 is usually not included when talking about fine-tuning. Unless you have vast amounts of (labeled) task-specific data, that step would be referred to as pre-training a model.
2. and 4. could be considered fine-tuning as well, but from personal/anecdotal experience, allowing all parameters to change during fine-tuning has provided significantly better results. Depending on your use case, this is also fairly simple to experiment with, since freezing layers is trivial in libraries such as Huggingface transformers.
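For example, a minimal sketch of freezing the encoder in Huggingface transformers (the model name and label count are placeholders):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Freeze the BERT encoder; only the classification head stays trainable
for param in model.bert.parameters():
    param.requires_grad = False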
In either case, I would really consider them as variants of 3., since you're implicitly assuming that we start from pre-trained weights in these scenarios (correct me if I'm wrong).
Therefore, trying my best at a concise definition would be:
Fine-tuning refers to the step of training any number of parameters/layers with task-specific and labeled data, from a previous model checkpoint that has generally been trained on large amounts of text data with unsupervised MLM (masked language modeling).
I am trying to solve the problem of sequence completion. Let's suppose we have the ground truth sequence (1, 2, 4, 7, 6, 8, 10, 12, 18, 20).
The input to our model is an incomplete sequence, i.e., (1, 2, 4, _, _, _, 10, 12, 18, 20). From this incomplete sequence, we want to predict the original (ground truth) sequence. Which deep learning models can be used to solve this problem?
Is this the problem of encoder-decoder LSTM architecture?
Note: we have thousands of complete sequences to train and test the model.
Any help is appreciated.
This is not exactly a sequence-to-sequence problem; it is a sequence labeling problem. I would suggest either stacked bidirectional LSTM layers followed by a classifier, or Transformer layers followed by a classifier.
Encoder-decoder architectures require plenty of data to train properly and are particularly useful if the target sequence can be of arbitrary length, only vaguely depending on the source sequence length. One would eventually learn to do the job given enough data, but sequence labeling is a more straightforward problem.
With sequence labeling, you can set a custom mask over the output, so the model will only predict the missing numbers. An encoder-decoder model would need to learn to copy most of the input first.
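A minimal sketch of the sequence labeling approach in Keras, treating the values as a vocabulary and reserving 0 as the missing-value token (the sizes here are assumptions based on your example, not requirements):

from tensorflow import keras

VOCAB = 21      # assumed: values 1..20, with 0 reserved for "missing"
SEQ_LEN = 10    # assumed fixed sequence length from the example

model = keras.Sequential([
    keras.layers.Input(shape=(SEQ_LEN,)),
    keras.layers.Embedding(VOCAB, 32),  # 0 = missing-value token
    keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True)),
    keras.layers.Dense(VOCAB, activation="softmax"),  # per-position classifier
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Train on (incomplete_sequence, complete_sequence) pairs; at inference,
# read off the predictions only at the masked positions.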
In your sequence completion task, are you trying to predict the next items in a sequence, or to learn only the missing values?
Training a neural network with missing data is an issue in its own right.
If you're using Keras and an LSTM-type NN to solve your problem, you should consider masking; you can refer to this Stack Overflow thread for more details: Multivariate LSTM with missing values
Regarding predicting the missing values, why not try auto-encoders?
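For instance, a minimal denoising-autoencoder sketch (assuming, as an illustration, that missing entries are encoded as 0 and sequences have a fixed length of 10):

from tensorflow import keras

inputs = keras.layers.Input(shape=(10,))
h = keras.layers.Dense(64, activation="relu")(inputs)
h = keras.layers.Dense(16, activation="relu")(h)   # bottleneck
h = keras.layers.Dense(64, activation="relu")(h)
outputs = keras.layers.Dense(10)(h)                # reconstruct the full sequence
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# Fit on (incomplete_sequence, complete_sequence) pairs.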
I want to get a deep understanding of how Keras layers work in a model: what each layer is doing in the model, etc. I followed the Keras documentation, but the information isn't enough. If any of you knows a place to get more knowledge, let me know. Thanks in advance.
Keras layers are the widely used CNN, DNN, and RNN layers. There is at least one research paper for each of them, and there is a lot of educational material out there. If you are really curious, you could look at Keras' code. Some links for you:
https://github.com/keras-team/keras/tree/master/keras/layers
http://cs231n.github.io/convolutional-networks/
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence
http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
I am a newbie to convolutional neural nets... so this may be an ignorant question.
I have followed many examples and tutorials on the MNIST example in TensorFlow. In the CNN examples, all authors talk about using the 'input filters' to run in the CNN, but no one that I can find mentions WHERE these come from. Can anyone answer where they come from? Or are they magically obtained from the input images?
Thanks! Chris
This is an image that one professor uses, but he does not explain whether he made the filters himself or whether TensorFlow auto-extracts them somehow.
Disclaimer: I am not an expert, more of an enthusiast.
To cut a long story short: filters are the CNN equivalent of weights, and all a neural network essentially does is learn their optimal values.
It does this by iterating through a training dataset, making predictions, comparing them to the label/value already assigned to each training unit (usually an image, in the case of a CNN), and adjusting the weights to minimize the error function (the difference between the predicted value and the actual value).
The initial values of the filters/weights do not matter that much; although they might affect the speed of convergence to a small degree, I believe they are often assigned random values.
It is the job of the neural network to figure out the optimal weights, not of the person implementing it.
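To illustrate (a small sketch; the input shape is just an example), you can inspect a freshly built Conv2D layer and see that its filters are simply randomly initialized weights that training will later adjust:

from tensorflow import keras

layer = keras.layers.Conv2D(32, (5, 5))
layer.build((None, 28, 28, 1))  # assumed input: 28x28 grayscale images

kernel, bias = layer.get_weights()
print(kernel.shape)  # (5, 5, 1, 32): 32 different 5x5 filters, randomly initialized
print(bias.shape)    # (32,)
# Backpropagation then updates these values to minimize the loss.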