Using LSTM for large text

I have a dataset for detecting fake news that I got from Kaggle (https://www.kaggle.com/c/fake-news/data).
I want to use an LSTM for the classification.
The mean length of a single article is about 750 words. I have tried removing punctuation, stop words, and numbers, but preprocessing the text is still taking a very long time.
I'd like a method to feed large texts into an LSTM using Keras. What should I do to reduce computation time without losing much accuracy?

There are some things you could try to speed things up:
1. Use the cuDNN version of LSTM
It is usually faster; keras.layers.CuDNNLSTM is what you are after.
2. Use Conv1D to create features
You can use a 1-dimensional convolution, with kernel_size specifying how many words should be taken into account and strides specifying the step of the moving window. With kernel_size=3, strides=3, and padding="same", the sequence length is reduced by a factor of three.
You may stack more convolutional layers.
On top of that you can still employ an LSTM normally (see the sketch after this list).
3. Drop the LSTM altogether
You may go with 1D convolutions and pooling for classification; RNNs are not the only way.
On the upside: you will not encounter vanishing gradients (these can also be mitigated somewhat by a bidirectional LSTM).
On the downside: you will lose the strict ordering dependence between words, though that shouldn't be much of a problem for binary classification (which I suppose is your goal).
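A minimal sketch of how points 1 and 2 fit together with an LSTM on top (the vocabulary size, embedding dimension, and filter counts below are illustrative assumptions, not values from the question; on a GPU you could swap the LSTM for CuDNNLSTM):

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, LSTM, Dense

vocab_size = 20000   # assumed vocabulary size
max_len = 750        # roughly the mean article length from the question

model = Sequential([
    Embedding(vocab_size, 128, input_length=max_len),
    # kernel_size=3 with strides=3 cuts the sequence length by a factor of 3
    Conv1D(64, kernel_size=3, strides=3, padding="same", activation="relu"),
    Conv1D(64, kernel_size=3, strides=3, padding="same", activation="relu"),
    # the LSTM now sees ~84 steps instead of 750
    LSTM(64),
    Dense(1, activation="sigmoid"),  # binary output: fake vs. real
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])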

Related

Conv2D filters and CNN architecture

I am currently an undergraduate working on a CNN model to recognize Telugu characters.
This question has two parts:
I have Telugu character images of shape (32,32,1) and I want to train my CNN model to recognize the characters. What should my model architecture be, and how do I decide the architecture, the number of parameters, and the hidden layers? I know that my case is essentially the same as handwritten digit recognition, but I want to know how to decide those parameters. Is there any common practice in building such an architecture?
The operation Conv2D(32, (5,5)) means 32 filters of size 5x5 are applied to the input. My question is: are these filters all the same or different, and if they are different, what kind of filters are initialized and who decides them?
I tried searching the internet, but everywhere I go the answer I get is that the Conv2D operation applies filters to the input and performs the convolution.
To decide which model architecture would be best, you need to experiment; that's the only way. Since you want to classify images, the VGG architecture would be a good starting point, I believe. You need to experiment with the number of parameters, as it depends on your problem. You can use Keras Tuner for that: https://keras.io/keras_tuner/
For kernel initialization: as far as I know, convolutional layers in Keras use Glorot uniform initialization by default, but you can change that with the kernel_initializer parameter. Long story short, convolutional layers are initialized from a distribution function, and as training goes on the filters change their values, which is the learning process. https://keras.io/api/layers/initializers
Edit: I suggested the VGG architecture, but you should downsize it considerably. Your input shape is small, so if your model is too deep you will overfit very quickly.
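A minimal sketch of such a downsized VGG-style network for (32, 32, 1) inputs (the filter counts, dense width, and num_classes below are illustrative assumptions; kernel_initializer is shown explicitly only to make the default visible):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_classes = 50  # assumption: set this to your number of Telugu characters

model = Sequential([
    # two small convolution blocks instead of the full VGG depth
    Conv2D(32, (3, 3), activation='relu', padding='same',
           kernel_initializer='glorot_uniform',  # the Keras default
           input_shape=(32, 32, 1)),
    Conv2D(32, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])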

How should I change the model if accuracy is very low?

I am new to deep learning and I want to build an image classifier using a CNN (Keras). I have built a model with 2 convolution layers (filters = 32, kernel = 3x3) followed by a MaxPooling layer (2x2), this repeated 2 times, and finally 2 fully connected layers. I am getting an accuracy of 50%. My question is: how do we choose the model to begin with? For instance, how do we decide whether there should be 2 convolution layers followed by a MaxPooling layer, or 1 convolution and 1 MaxPooling layer? Also, how do we choose the number of filters in each convolution layer and the kernel size?
If my model is not working, how do I decide what changes to make?
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
# block 1: two 3x3 convolutions, then downsample
model.add(Conv2D(32, (3, 3), input_shape=(280, 280, 3), activation='relu'))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
#model.add(Dropout(0.25))
# block 2: two 64-filter convolutions, then downsample
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
#model.add(Dropout(0.25))
# classifier head
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(5, activation='softmax'))
I am getting an accuracy of 50% after 5 epochs. What changes should I make to my model?
Let us first start with the more straightforward part. Knowing the number of input and output layers and the number of their neurons is the easiest part. Every network has a single input layer and a single output layer. The number of neurons in the input layer equals the number of input variables in the data being processed. The number of neurons in the output layer equals the number of outputs associated with each input.
But the challenge is knowing the number of hidden layers and their neurons.
The answer is you cannot analytically calculate the number of layers or the number of nodes to use per layer in an artificial neural network to address a specific real-world predictive modeling problem.
The number of layers and the number of nodes in each layer are model hyperparameters that you must specify and learn.
You must discover the answer using a robust test harness and controlled experiments. Regardless of the heuristics you might encounter, all answers will come back to the need for careful experimentation to see what works best for your specific dataset.
Again, the filter size is one such hyperparameter that you should specify before training your network.
For an image recognition problem, if you think that a large region of pixels is necessary for the network to recognize the object, use large filters (such as 11x11 or 9x9). If you think that what differentiates objects is small, local features, use small filters (3x3 or 5x5).
These are some tips, but no hard rules exist.
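A minimal sketch of such a controlled experiment, here varying only the kernel size (the candidate sizes, epoch count, and the x_train/y_train/x_val/y_val arrays are assumptions; older Keras versions log 'val_acc' instead of 'val_accuracy'):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def make_model(kernel_size):
    model = Sequential([
        Conv2D(32, kernel_size, activation='relu', input_shape=(280, 280, 3)),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(5, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

results = {}
for k in [(3, 3), (5, 5), (9, 9)]:  # candidate kernel sizes to compare
    history = make_model(k).fit(x_train, y_train,
                                validation_data=(x_val, y_val),
                                epochs=5, verbose=0)
    results[k] = max(history.history['val_accuracy'])
print(results)  # pick the kernel size with the best validation accuracy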
There are many tricks to increase the accuracy of your deep learning model. Kindly refer to this link: Improve deep learning model performance.
Hope this will help you.

TimeDistributed Layers vs. ConvLSTM-2D

Could anyone explain the differences between TimeDistributed layers (the Keras wrapper) and ConvLSTM2D (convolutional LSTM): their purposes, usage, etc.?
Both apply to a sequence of data.
TimeDistributed is a very straightforward layer wrapper that applies a layer (usually a Dense layer) independently at each time step. You need it when you want to change the shape of the output tensor, especially the feature dimension, while leaving the sample and time-step dimensions untouched.
ConvLSTM2D is much more complex. You need to understand CNN and RNN layers first; LSTM is one of the most popular RNNs. An LSTM is applied to a sequence of tensors, as in NLP and time series, where the input at each time step is 1-dimensional. A CNN, the convolutional part, is usually used to learn from images, which are 2-dimensional but have no sequence (time) dimension. Combined, ConvLSTM2D is used to learn from images in a sequence, such as video.
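A small shape demonstration of the difference (all sizes below are made up):

from keras.models import Sequential
from keras.layers import TimeDistributed, Dense, ConvLSTM2D

# TimeDistributed: the same Dense layer is applied at each of 10 time steps,
# changing only the feature dimension (8 -> 4).
td = Sequential([TimeDistributed(Dense(4), input_shape=(10, 8))])
print(td.output_shape)  # (None, 10, 4)

# ConvLSTM2D: consumes a sequence of 10 images of shape (32, 32, 1) with
# convolutional recurrence; by default only the last state is returned.
cl = Sequential([ConvLSTM2D(16, (3, 3), padding='same',
                            input_shape=(10, 32, 32, 1))])
print(cl.output_shape)  # (None, 32, 32, 16)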

Is there a relation between the number of LSTM units and the length of the sequence to be trained?

I have programmed a Keras neural network to train on sequences. Does the choice of the number of LSTM units in Keras depend on the length of the sequence?
There isn't a set way of determining how many units you should have based on your input.
More units are a way of making the model more complex. Generally speaking, if the look-back period for your neural network is longer, then you have more features to train on, which means a more complex model would be better suited to learning your data.
Personally, I like to use the number of timesteps in each sample as my number of units, and I decrease this number as I move deeper into the network.
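A minimal sketch of that heuristic (the timesteps, features, and layer widths are illustrative assumptions):

from keras.models import Sequential
from keras.layers import LSTM, Dense

timesteps, features = 50, 8  # assumed shape of each sample

model = Sequential([
    # first layer: units equal to the number of timesteps per sample
    LSTM(timesteps, return_sequences=True, input_shape=(timesteps, features)),
    # shrink the width as we move deeper into the network
    LSTM(timesteps // 2),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')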
I encountered this problem when I designed a sports betting prediction engine with an LSTM RNN.
There is a rule of thumb that helps for supervised learning problems; please check this link.
But in my opinion, there is still no exact method or formula for calculating the number of neurons per layer and the number of hidden layers from the training dataset.

Audio classification with Keras: presence of human voice

I'd like to create an audio classification system with Keras that simply determines whether a given sample contains human voice or not. Nothing else. This would be my first machine learning attempt.
This audio preprocessor exists. It claims to be unfinished, but it has been forked a few times:
https://github.com/drscotthawley/audio-classifier-keras-cnn
I don't understand how this one would work, but I'm ready to give it a try:
https://github.com/keunwoochoi/kapre
But let's say I got one of those to work: would the rest of the process be similar to image classification? Basically, I've never fully understood when to use softmax and when to use ReLU. Would this be similar with sound as with images once I've got the data mapped as a tensor?
Sound can be treated as a 1D image and processed with 1D convolutions.
Dilated convolutions often do a good job here; see WaveNet.
Sound can also be treated as a sequence and processed with RNN layers (though raw audio may contain too much data for that to be practical).
For your case, you need only one output with a 'sigmoid' activation at the end and a 'binary_crossentropy' loss.
Result = 0 -> no voice
Result = 1 -> there's voice
When to use 'softmax'?
The softmax function is good for multiclass problems (not your case) where you want exactly one class as the result. All the outputs of a softmax function sum to 1; it is intended to behave like a probability for each class.
It is mainly used in the final layer, because that is where you get classes as the final result.
It is good for cases where only one class is correct, and in that case it goes well with the categorical_crossentropy loss.
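A tiny numeric illustration of that behavior (the logits are made up):

import numpy as np

logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax computed by hand
print(probs)        # approx. [0.659 0.242 0.099]
print(probs.sum())  # 1.0, so the outputs behave like class probabilities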
ReLU and other activations in the middle of the model
These are not governed by strict rules; there are lots of possibilities. I often see relu in image convolutional models.
The important thing to know is their ranges: what are the limits of their outputs?
Sigmoid: from 0 to 1 -- at the end of the model this is the best option for your presence/absence classification. It is also good for models where many classes can be present together.
Tanh: from -1 to 1.
ReLU: from 0 upward, unbounded (it simply zeroes out negative values).
Softmax: from 0 to 1, but ensuring the sum of all values is 1. Good at the end of models that want exactly one class among many.
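A minimal sketch of such a binary voice/no-voice classifier on raw audio using 1D convolutions (the frame length, filter counts, and strides are illustrative assumptions):

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dense

frame_len = 16000  # assumed: 1 second of 16 kHz audio per sample

model = Sequential([
    Conv1D(16, 9, strides=4, activation='relu', input_shape=(frame_len, 1)),
    MaxPooling1D(4),
    Conv1D(32, 9, strides=4, activation='relu'),
    GlobalAveragePooling1D(),
    # single sigmoid output: near 0 means no voice, near 1 means voice
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])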
Oftentimes it is useful to preprocess the audio into a spectrogram.
Using this as input, you can apply classical image classification approaches (such as convolutional neural networks). In your case you could divide the input audio into frames of around 20ms-100ms (depending on the time resolution you need) and convert those frames to spectrograms. Convolutional networks can also be combined with recurrent units to take a larger time context into account.
It is also possible to train neural networks on raw waveforms using 1D convolutions. However, research has shown that preprocessing approaches using a frequency transformation generally achieve better results.
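A minimal sketch of the spectrogram preprocessing with librosa (assumed to be installed; the file name, sample rate, and window sizes are illustrative):

import numpy as np
import librosa

audio, sr = librosa.load('clip.wav', sr=16000)  # 'clip.wav' is a placeholder

# ~32 ms windows with 50% overlap, mapped to a mel-scaled spectrogram
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=512, hop_length=256)
log_mel = librosa.power_to_db(mel)  # log scaling is common for audio

# shape (n_mels, frames): add batch and channel axes for a convolutional model
x = log_mel[np.newaxis, ..., np.newaxis]
print(x.shape)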
