I am trying to use UNSW-NB15 to train a model. After the model is trained, I would like to use it on live network data. I began building this with a supervised LSTM, but started wondering about how to handle the live network data and the need for a pipeline that preprocesses it into a form similar to the UNSW-NB15 dataset. That seemed impractical, since it would most likely mean manually working through the data for each network data source.

I am now thinking that an unsupervised model may be better for my purposes. I would still like to use an LSTM, but I'm finding very little information on creating an unsupervised LSTM model in Keras. I read a paper suggesting BINGO (binary information gain optimization) or NEO (nonparametric entropy optimization) to train the LSTM, but I am not certain how this can be done in Keras; I am unable to find such functions there (I will keep searching Python libraries, though). Any suggestions?
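To make what I mean by an unsupervised LSTM concrete, this is roughly the shape of model I have in mind: an LSTM autoencoder that is trained to reconstruct its own input, so no labels are needed. The window size, feature count, and layer widths below are just placeholders, not anything tied to UNSW-NB15.

```python
# Sketch of an LSTM autoencoder in Keras: the model reconstructs its own input,
# so training needs no labels. Reconstruction error on live traffic could then
# serve as an anomaly score. All shapes below are placeholders.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 10, 42   # e.g. flow records grouped into windows of 10

model = keras.Sequential([
    keras.Input(shape=(timesteps, n_features)),
    layers.LSTM(64, return_sequences=False),           # encoder: compress the window
    layers.RepeatVector(timesteps),                     # repeat the code for each timestep
    layers.LSTM(64, return_sequences=True),             # decoder: unroll back into a sequence
    layers.TimeDistributed(layers.Dense(n_features)),   # reconstruct each timestep
])
model.compile(optimizer="adam", loss="mse")

# X is unlabeled training data of shape (samples, timesteps, n_features);
# the input is also the target, which is what makes this "unsupervised".
X = np.random.rand(256, timesteps, n_features).astype("float32")  # placeholder data
model.fit(X, X, epochs=5, batch_size=32)
```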
I am still researching.
I am applying deep learning algorithms to the speech commands dataset.
I am curious whether the audio needs to be normalized before turning the clips into spectrograms, or whether any other feature engineering is needed.
I've gone through some notebooks on GitHub that use this dataset and haven't found any clues, but since we are using neural networks I think some normalization is needed.
I have never worked with audio data, so I am not very experienced with it.
Yes, normalizing data is recommended for neural network training.
Good explanation here - https://stats.stackexchange.com/q/458579/131706.
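To make that concrete, here is a minimal sketch (plain NumPy, with a placeholder random clip) of one common recipe: peak-normalize each clip, compute a log-magnitude spectrogram, and then standardize the resulting features before feeding them to the network. In practice you would likely use librosa or tf.signal for the spectrogram step.

```python
import numpy as np

def peak_normalize(waveform, eps=1e-9):
    """Scale a 1-D clip so its peak amplitude is 1 (removes per-clip volume differences)."""
    return waveform / (np.max(np.abs(waveform)) + eps)

def log_spectrogram(waveform, frame_len=256, hop=128, eps=1e-9):
    """Plain framed-FFT magnitude spectrogram, log-compressed.
    A library such as librosa or tf.signal would normally do this step."""
    frames = [waveform[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(waveform) - frame_len, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
    return np.log(spec + eps)

clip = np.random.randn(16000).astype("float32")   # placeholder 1-second clip at 16 kHz
features = log_spectrogram(peak_normalize(clip))
# Finally, standardize the features (e.g. with the training-set mean/std) before
# feeding them to the network -- that is the normalization the linked answer discusses.
```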
I have a neural network with five inputs for a classification task. Two of those five inputs are very important and have a direct relationship to the classification task, so I need to prioritize those two inputs within the network and give less priority to the other three. Is there a way to make the neural network do this?
If training works well, the NN should automatically pick up what is most important for your classification. That's the entire point of a NN (or ML in general): you don't have to manually tell it what is more important and what is not. After training, you can verify that the model did indeed learn the correct order of importance among the features.
You can use any model explanation technique for this; ELI5, SHAP, or LIME are some examples. All of these will tell you whether the features you know are important are actually important to the network.
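As a rough sketch (assuming a trained Keras model `model` and NumPy feature matrices `X_train` / `X_test`, which are placeholders here), checking global feature importance with SHAP could look like this:

```python
import numpy as np
import shap

# KernelExplainer is model-agnostic: it only needs a prediction function and
# a small background sample to estimate feature attributions.
background = X_train[np.random.choice(len(X_train), 100, replace=False)]
explainer = shap.KernelExplainer(model.predict, background)

# Attributions for a handful of test rows; averaging |SHAP values| per feature
# gives a global importance ranking you can compare with your prior knowledge.
vals = np.asarray(explainer.shap_values(X_test[:50]))
mean_importance = np.abs(vals).reshape(-1, X_test.shape[1]).mean(axis=0)
print(mean_importance)   # one score per input feature
```

If the two features you expect to matter come out on top, the network has already picked them up on its own.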
You probably shouldn't try to manually incorporate such biases into the network (unless you have a very good reason for doing so, like incorporating spatial information of images via CNNs). Trust the learning xD
BERT pre-training of the base model is done with a language modeling approach: we mask a certain percentage of tokens in a sentence and make the model predict those masked tokens. Then, I think, to do downstream tasks, we add a newly initialized layer and fine-tune the model.
However, suppose we have a gigantic dataset for sentence classification. Theoretically, can we initialize the BERT base architecture from scratch, train both the additional downstream task-specific layer and the base model weights from scratch with this sentence classification dataset only, and still achieve a good result?
Thanks.
BERT can be viewed as a language encoder, which is trained on a humongous amount of data to learn the language well. As we know, the original BERT model was trained on the entire English Wikipedia and BooksCorpus, which together sum to 3,300M words. BERT-base has 109M model parameters. So, if you think you have enough data to train BERT, then the answer to your question is yes.
However, when you said "still achieve a good result", I assume you are comparing against the original BERT model. In that case, the answer lies in the size of the training data.
I am wondering why do you prefer to train BERT from scratch instead of fine-tuning it? Is it because you are afraid of the domain adaptation issue? If not, pre-trained BERT is perhaps a better starting point.
Please note, if you want to train BERT from scratch, you may consider a smaller architecture. You may find the following papers useful.
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
I can help with this.
First of all, MLM and NSP (the original pre-training objectives from the NAACL 2019 paper) are meant to train language encoders with prior language knowledge, like a primary school student who has read many books in the general domain. Before BERT, many neural networks were trained from scratch, from a clean slate where the model doesn't know anything. This is like a newborn baby.
So my question is: "is it a good idea to start teaching a newborn baby when you can begin with a primary school student?" My answer is no. This is supported by the numerous state-of-the-art results achieved by pre-trained models, compared to the old approach of training a neural network from scratch.
As someone who works in the field, I can assure you that it is a much better idea to fine-tune a pre-trained model. It doesn't matter whether you have 200k data points or 1M. In fact, more fine-tuning data will only make the downstream results better if you use the right hyperparameters.
Though I recommend a learning rate between 2e-6 and 5e-5 for sentence classification tasks, you can explore. If your dataset is very, very domain-specific, it's up to you to fine-tune with a higher learning rate, which will move the model further away from its "pre-trained" knowledge.
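Just to make that concrete, a fine-tuning setup with the Hugging Face transformers library would look roughly like this (a sketch only; `train_dataset` / `eval_dataset` are assumed to be your already-tokenized sentence classification splits):

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

args = TrainingArguments(
    output_dir="bert-finetuned",
    learning_rate=2e-5,              # inside the 2e-6 ~ 5e-5 range mentioned above
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# train_dataset / eval_dataset are assumed to exist: datasets already mapped
# through the tokenizer, with a "labels" column for the classification target.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
```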
And also, regarding your question on
can we initialize the BERT base architecture from scratch, train both the additional downstream task-specific layer and the base model weights from scratch with this sentence classification dataset only, and still achieve a good result?
I'm negative about this idea. Even though you have a dataset with 200k instances, BERT is pre-trained on 3,300M words. BERT is too big to be trained effectively with only 200k instances (both size-wise and architecture-wise). If you want to train a neural network from scratch, I'd recommend you look into LSTMs or RNNs.
I'm not saying I recommend LSTMs. Just fine-tune BERT. 200k is not even that big anyway.
Best of luck with your NLP studies :)
I have a question about the NLP tagger called SENNA, which was developed by Collobert and his colleagues based on their paper "Natural Language Processing (Almost) from Scratch".
Does SENNA (its code is available at this address: http://ronan.collobert.com/senna/download.html) contain any code for training the neural network?
Or does it only use the information obtained by training the network (i.e., the network is trained beforehand and the training code is not in SENNA)?
Yes, SENNA contains code for training the neural network. Take a look at SENNA_nn.h and SENNA_nn.c for reference; they show the implementation of the different layers mentioned in the paper.
No, I don't think so. The functions #illuminatus refers to are there because they are needed at inference time. AFAIK, you have to write the learning functions yourself.