Action selection for DQN with PyTorch

I'm a newbie to DQN and am trying to understand its code. I am using the code below for epsilon-greedy action selection, but I am not sure how it works:
 
if sample > eps_threshold:
    with torch.no_grad():
        # t.max(1) returns the largest column value of each row.
        # The second element of the max result is the index of where the max
        # element was found, so we pick the action with the larger expected reward.
        return policy_net(state).max(1)[1].view(1, 1)
else:
    return torch.tensor([[random.randrange(n_actions)]], device=device, dtype=torch.long)
Could you please explain what the indices in max(1)[1] are, and what view(1, 1) and its arguments do? Also, why has with torch.no_grad(): been used?

When you train a model, torch has to store all the tensors involved in computing the output in a graph, so that it can later make a backward pass during training. This is computationally expensive. Since you don't have to train the network after selecting the action (your only goal here is to pick one using the current weights), it's better to use torch.no_grad(). Note that without that part the code would still work the same way, just a bit slower.
About the max(1)[1] part: I'm not completely sure about the input and output shapes, given that there's only a small portion of code here, but I'd guess the model takes a batch of states as input and outputs a Q-value for each action. Then, for each of these outputs, you have to take the action that gives you the highest value, so you need a max over each row. That's done by specifying dim=1 (the axis, as torch calls it), which runs over the columns (the actions in this case, since each row holds the Q-values for one state). max(1) returns a tuple of (values, indices), so [1] picks the indices, i.e. the best action per row, and .view(1, 1) reshapes that result into a 1×1 tensor, matching the shape of the tensor returned in the random branch.
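To make this concrete, here is a minimal sketch with a made-up Q-value tensor standing in for policy_net(state) (one state, four actions):

import torch

# Made-up Q-values for a batch of one state and four actions.
q_values = torch.tensor([[0.1, 0.5, 0.2, 0.3]])

with torch.no_grad():                  # no autograd graph is recorded in here
    values, indices = q_values.max(1)  # max over dim 1, i.e. over the actions
    print(values)                      # tensor([0.5000]) - best Q-value per row
    print(indices)                     # tensor([1])      - column where it was found
    action = indices.view(1, 1)        # reshape to a (1, 1) tensor
    print(action)                      # tensor([[1]])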

Related

Ensuring that optimization does not find the trivial solution by setting weights to 0

I am trying to train a neural network which takes an input (input_t0) and an initial hidden state (call it s_t0) and produces a new hidden state (s_t1) by transforming the input via a series of transformations (neural network layers). At the next time step, a transformed input (input_t1) and the hidden state from the previous time step (s_t1) are passed to the same model. This process keeps repeating for a couple of steps.
The goal of optimization is to ensure that the distance between s_t0 and s_t1 is small through self-supervision, as s_t1 is supposed to be a transformed version of s_t0. In other words, I want s_t1 to only carry the new information in the new input. My intuition tells me that taking the norm of the weights and ensuring the norm does not go to zero (is this even possible?) would be one way to achieve this. However, I'm afraid it won't necessarily be the best thing to do, as it might not encourage the model to update the state vector with new information.
Currently, the way I train the model is by taking the absolute distance between s_t0 and s_t1 via loss = torch.abs(s_t1 - s_t0).mean(dim=1). Then I call loss.backward() and optimizer.step(), which changes the weights. Note that the reason I use abs() is that the hidden states are produced after applying ReLU, so they only hold positive values.
However, I noticed that optimization quickly finds the trivial solution by setting the weights to 0. This causes both s_t0 and s_t1 to get smaller and smaller until their difference is 0, which satisfies the constraint but does not yield the behavior I expect. Is there a way to ensure the weights do not go to zero during optimization? What is the best way to achieve this? Would I be able to somehow use mutual information for this?
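For reference, a minimal sketch of the training step described above, with a hypothetical model (all names and shapes are made up; note the extra .mean() reducing the per-sample loss to a scalar so that backward() can be called):

import torch
import torch.nn as nn

# Hypothetical state-update model: new state from input and previous state.
class StateUpdater(nn.Module):
    def __init__(self, input_dim, state_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim + state_dim, state_dim)

    def forward(self, x, s):
        # ReLU keeps the hidden state non-negative, as in the question.
        return torch.relu(self.fc(torch.cat([x, s], dim=1)))

model = StateUpdater(input_dim=8, state_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_t0 = torch.randn(4, 8)   # batch of 4 inputs
s_t0 = torch.rand(4, 16)   # initial hidden state

s_t1 = model(x_t0, s_t0)
loss = torch.abs(s_t1 - s_t0).mean(dim=1).mean()  # reduced to a scalar
optimizer.zero_grad()
loss.backward()
optimizer.step()  # nothing here prevents the weights from collapsing to 0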

Relationship between memory cell and time step in LSTM

I'm studying the LSTM model. Does one memory cell of a hidden layer in an LSTM correspond to one timestep?
Example code: model.add(LSTM(128, input_shape=(4, 1)))
When implementing LSTMs in Keras, you can set the number of memory cells, as in the example code, regardless of the number of time steps. In the example it is 128.
But a typical LSTM diagram shows a 1:1 correspondence between the number of time steps and the number of memory cells. What is the correct answer?
In an LSTM, we supply the input in the following shape:
[samples, timesteps, features]
samples is the number of training examples you want to feed at a time.
timesteps is how many past values you want to use. Say you set timesteps=3: then the values at t, t-1 and t-2 are used to predict the value at t+1.
features is how many dimensions you want to supply at each time step.
The LSTM does have memory cells, but I am explaining the code part so as not to confuse you.
I hope this helps.
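As a minimal runnable sketch of that input shape, reusing the LSTM(128, input_shape=(4, 1)) layer from the question (the random data is just for illustration):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# 100 samples, 4 timesteps, 1 feature each: [samples, timesteps, features]
X = np.random.rand(100, 4, 1)
y = np.random.rand(100, 1)

model = Sequential()
model.add(LSTM(128, input_shape=(4, 1)))  # 128 units (memory cells), not 128 timesteps
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=1, verbose=0)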
As I understand it, a timestep is the length of the sequence processed in each pass (i.e. the window size). Depending on the return_sequences=True/False parameter, the layer returns either one output per step of the processed data or a single output at the end.
Concerning the memory cell: "A part of a NN that preserves some state across time steps is called a memory cell." That makes me consider a memory cell to be, probably, a container that holds temporal state for the variables in the window series until it is updated during further backpropagation (when stateful=True). A diagram of a memory cell and the logic of how it works makes this much easier to see, and the time_steps part of the input shape is what backpropagation unrolls over.

Understanding Input Sequences of Unlimited Length for RNNs in Keras

I have been looking into a Keras implementation of a certain deep learning architecture when I came across a technicality that I could not grasp. In the code, the model has two inputs: the first is the normal input that goes through the graph (word_ids in the sample code below), while the second is the length of that input, which seems to be involved nowhere other than in the inputs argument of the keras Model instance (sequence_lengths in the sample code below).
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense

word_ids = Input(batch_shape=(None, None), dtype='int32')
word_embeddings = Embedding(input_dim=embeddings.shape[0],
                            output_dim=embeddings.shape[1],
                            mask_zero=True,
                            weights=[embeddings])(word_ids)
x = Bidirectional(LSTM(units=64, return_sequences=True))(word_embeddings)
x = Dense(64, activation='tanh')(x)
x = Dense(10)(x)
sequence_lengths = Input(batch_shape=(None, 1), dtype='int32')
model = Model(inputs=[word_ids, sequence_lengths], outputs=[x])
I think this is done to make the network accept a sequence of any length. My questions are as follows:
Is what I think correct?
If yes, then I feel like there is a bit of magic going on under the hood. Any suggestions on how to wrap one's head around this?
Does this mean that, using this method, one doesn't need to pad one's sequences (neither in training nor in prediction), and that keras will somehow know how to pad them automatically?
Do you need to pass sequence_lengths as an input?
No, it's absolutely not necessary to pass the sequence lengths as an input, whether you're working with fixed or variable length sequences.
I honestly don't understand why that model in the code uses this input if it's not sent to any of the model layers to be processed.
Is this really the complete model?
Why would one pass the sequence lengths as an input?
Well, maybe they want to perform some custom calculations with those. It might be an interesting option, but none of these calculations are present (or shown) in the code you posted. This model is doing absolutely nothing with this input.
How to work with variable sequence length?
For that, you've got two options:
Pad the sequences, as you mentioned, to a fixed size, and add Masking layers to the input (or use the mask_zero=True option in the embedding layer).
Use the length dimension as None. This is done with one of these:
batch_shape=(batch_size, None)
input_shape=(None,)
PS: these shapes are for Embedding layers. An input that goes directly into recurrent networks would have an additional last dimension for input features
When using the second option (length = None), you should process each batch separately, because you cannot put sequences with different lengths in the same numpy array. But there is no limitation in the model itself, and no padding is necessary in this case; see the sketch below.
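A minimal sketch of that per-batch workflow, using a self-contained single-input model (since the sequence_lengths input above is unused) and made-up data:

import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense

ids = Input(batch_shape=(None, None), dtype='int32')  # variable length
h = Embedding(input_dim=1000, output_dim=16, mask_zero=True)(ids)
h = LSTM(32, return_sequences=True)(h)
out = Dense(10)(h)
m = Model(ids, out)
m.compile(optimizer='adam', loss='mse')

# Each batch is rectangular, but lengths differ across batches;
# no padding across batches is needed.
for length in (17, 53):
    x_batch = np.random.randint(1, 1000, size=(8, length))
    y_batch = np.random.rand(8, length, 10)
    m.train_on_batch(x_batch, y_batch)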
How to work with "unlimited" length
The only way to work with unlimited length is using stateful=True.
In this case, every batch you pass will not be seen as "another group of sequences", but "additional steps of the previous batch".
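For completeness, a minimal sketch of the stateful=True idea (shapes are hypothetical; stateful layers require a fixed batch size via batch_input_shape):

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, stateful=True, batch_input_shape=(8, None, 5)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

# Each successive batch is treated as additional steps of the same 8 sequences:
# model.train_on_batch(steps_0_to_99, y0)
# model.train_on_batch(steps_100_to_199, y1)  # continuation, not new sequences
# model.reset_states()                        # call when the sequences really end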

Why is sklearn.feature_selection.RFECV giving different results for each run

I tried to do feature selection with RFECV, but it gives different results each time. Does cross-validation divide the sample X into random chunks or into sequential deterministic chunks?
Also, why is the score different for grid_scores_ and score(X, y)? Why are the scores sometimes negative?
Does cross-validation divide the sample X into random chunks or into sequential deterministic chunks?
CV divides the data into deterministic chunks by default. You can change this behaviour by setting the shuffle parameter to True.
However, RFECV uses sklearn.model_selection.StratifiedKFold if the y is binary or multiclass.
This means that it will split the data such that each fold has the same (or nearly the same) ratio of classes. In order to do this, the exact data in each fold can change slightly in different iterations of CV. However, this should not cause major changes in the data.
If you pass a CV iterator using the cv parameter, you can fix the splits by specifying a random state. The random state controls the random decisions made by the algorithm, so using the same random state each time ensures the same behaviour; see the sketch below.
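A minimal sketch of fixing the splits with an explicit CV iterator (made-up data via make_classification; the estimator choice is just for illustration):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# A fixed random_state makes the folds, and therefore the RFECV
# results, reproducible across runs.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=cv)
selector.fit(X, y)
print(selector.n_features_)  # same number of selected features every run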
Also, why is the score different for grid_scores_ and score(X,y)?
grid_scores_ is an array of cross-validation scores, one per subset of features evaluated, ordered by increasing number of features: the first score is the score for the smallest subset considered, and each subsequent entry adds step more features, up to the score for all features. step is 1 by default.
score(X, y) selects the optimal number of features and returns the score of the final estimator, fitted on those features, computed on the data you pass in rather than by cross-validation, which is why it differs from grid_scores_.
why are the scores sometimes negative?
This depends on the estimator and scorer you are using. If you have not set a scorer, RFECV will use the default score function for the estimator. For classifiers this is generally accuracy, which is non-negative, but for regressors the default is R², which can be negative when the model fits worse than simply predicting the mean.

Keras Lambda Layer for Custom Loss

I am attempting to implement a Lambda layer that will produce a custom loss function. In the layer, I need to be able to compare every element in a batch to every other element in the batch in order to calculate the cost. Ideally, I want code that looks something like this:
for el_1 in zip(y_pred, y_true):
    for el_2 in zip(y_pred, y_true):
        if el_1[1] == el_2[1]:
            # Perform a calculation
        else:
            # Perform a different calculation
When I try this, I get:
TypeError: TensorType does not support iteration.
I am using Keras version 2.0.2 with a Theano version 0.9.0 backend. I understand that I need to use Keras tensor functions in order to do this, but I can't figure out any tensor functions that do what I want.
Also, I am having difficulty understanding precisely what my Lambda function should return. Is it a tensor of the total cost for each sample, or is it just a total cost for the batch?
I have been beating my head against this for days. Any help is deeply appreciated.
A tensor in Keras commonly has at least 2 dimensions: the batch dimension and the neuron/unit/node/... dimension. A dense layer with 128 units trained with a batch size of 64 would therefore yield a tensor of shape (64, 128).
Your Lambda layer processes tensors as any other layer does: plugging it in after your dense layer from before will give it a tensor of shape (64, 128) to process. Processing a tensor works similarly to calculations on numpy arrays (or any other vector processing library, really): you specify one operation to broadcast over all elements in the data structure.
For example, if your custom cost is the difference for each value in the batch, you would implement it like so (note that the Keras class is called Lambda, and a Lambda with multiple inputs receives them as a list):
from keras.layers import Lambda
cost_layer = Lambda(lambda x: x[0] - x[1])  # called as cost_layer([a, b])
The - operation is broadcast over a and b and will return a suitable result, provided the dimensions match. The takeaway is that you can really only specify one operation for every value. If you want to do more complex tasks, for example computations based on the values themselves, you need a single operation that takes two expressions and applies the correct one accordingly, i.e. the switch operation.
The syntax for K.switch is
K.switch(condition, then_expression, else_expression)
For example, if you want to subtract both values when a != b but add them when they are equal, you would write (using K.not_equal for an elementwise comparison of the two tensors):
import keras.backend as K
from keras.layers import Lambda
cost_layer = Lambda(lambda x: K.switch(K.not_equal(x[0], x[1]), x[0] - x[1], x[0] + x[1]))
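To sanity-check the switch behaviour with concrete values (this assumes the Theano backend mentioned in the question, where K.switch works elementwise):

import numpy as np
import keras.backend as K

a = K.variable(np.array([1.0, 2.0, 3.0]))
b = K.variable(np.array([1.0, 5.0, 3.0]))

out = K.switch(K.not_equal(a, b), a - b, a + b)
print(K.eval(out))  # [ 2. -3.  6.] : equal -> sum, different -> difference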
