Getting the query, key and value matrices from PyTorch with self_attn.in_proj_weight

We have implemented a transformer based on the tutorial here.
We need to access the weights of the query, key and value matrices and were planning on doing this with model.state_dict(). However, the model stores these matrices concatenated in a single shared matrix:
model.state_dict()['transformer_encoder.layers.0.self_attn.in_proj_weight']
We would assume that they are concatenated in the order query, key, value. If so, we can just split the tensor manually. However, we were unable to verify in the PyTorch documentation whether this is the actual order. Is there an easy way to verify whether this is the case? Or any other way to get the query, key and value matrices individually for this transformer model?

The implementation of MultiheadAttention in the PyTorch codebase performs a simple check:
if not self._qkv_same_embed_dim:
    self.q_proj_weight = Parameter(torch.empty((embed_dim, embed_dim), **factory_kwargs))
    self.k_proj_weight = Parameter(torch.empty((embed_dim, self.kdim), **factory_kwargs))
    self.v_proj_weight = Parameter(torch.empty((embed_dim, self.vdim), **factory_kwargs))
    self.register_parameter('in_proj_weight', None)
else:
    self.in_proj_weight = Parameter(torch.empty((3 * embed_dim, embed_dim), **factory_kwargs))
    self.register_parameter('q_proj_weight', None)
    self.register_parameter('k_proj_weight', None)
    self.register_parameter('v_proj_weight', None)
where,
self._qkv_same_embed_dim = self.kdim == embed_dim and self.vdim == embed_dim
Here, kdim, embed_dim, vdim all have their usual meanings as per the function definition, check here.
This is an implementation detail that is abstracted away from the user. But as you mentioned, to get access to the Q, K, V matrices when self._qkv_same_embed_dim is True, you can extract this Tensor and call the method _in_projection_packed that is available in the nn.functional API (source).
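For illustration, a minimal sketch of that route (note that _in_projection_packed is a private helper, so its signature may change between PyTorch versions; embed_dim and the dummy input x below are assumptions):
import torch
import torch.nn.functional as F

sd = model.state_dict()
w = sd['transformer_encoder.layers.0.self_attn.in_proj_weight']  # shape (3 * embed_dim, embed_dim)
b = sd['transformer_encoder.layers.0.self_attn.in_proj_bias']    # shape (3 * embed_dim,)

x = torch.randn(10, embed_dim)  # dummy sequence of token embeddings
# projects x with the packed weights and returns the query, key and value projections (in that order)
q_proj, k_proj, v_proj = F._in_projection_packed(x, x, x, w, b)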
You can check all the provided links to these function implementations for your reference.
TLDR
You can use the torch.split function to split the projection weights into the query, key and value matrices, like this:
in_proj_weight = model.state_dict()['transformer_encoder.layers.0.self_attn.in_proj_weight']
q, k, v = torch.split(in_proj_weight, [embed_dim, embed_dim, embed_dim])
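As a quick sanity check (a sketch, assuming kdim == vdim == embed_dim), each slice should be a square (embed_dim, embed_dim) matrix, and chunk gives the same split:
assert q.shape == k.shape == v.shape == (embed_dim, embed_dim)
# equivalent: split the packed weight into three equal chunks along dim 0
q, k, v = in_proj_weight.chunk(3, dim=0)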
Hope this helps.

Related

How to update PyTorch model parameters (tensors) after averaging them?

I'm currently working on a distributed federated learning infrastructure and am trying to implement it in PyTorch. For this I also need federated averaging, which averages the parameters retrieved from all the nodes and then passes them on to the next training round.
The gathering of the parameters looks like this:
def RPC_get_parameters(data, model):
    """
    Get parameters from nodes
    """
    with torch.no_grad():
        for parameters in model.parameters():
            # store parameters in dict
            return {"parameters": parameters}
The averaging function which happens at the central server looks like this:
# stores results from RPC_get_parameters() in results
results = client.get_results(task_id=task.get("id"))

# averaging of returned parameters
global_sum = 0
global_count = 0
for output in results:
    global_sum += output["parameters"]
    global_count += len(global_sum)

averaged_parameters = global_sum / global_count

new_params = {'averaged_parameters': averaged_parameters}
Now my question is, how do you update all the parameters (tensors) in PyTorch from this? I tried a few things and they usually returned errors like "ValueError: can't optimize a non-leaf Tensor" when inserting new_params into the optimizer where model.parameters() would usually go: optimizerD = optim.SGD(new_params, lr=0.01, momentum=0.5). So how do I actually update the model so that it uses the averaged parameters?
Thank you!
https://github.com/simontkl/torch-vantage6/blob/fed_avg-w/local_dp/v6-ppsdg-py/master.py
I think the most convenient way to work with parameters (outside the SGD context) is using the state_dict of the model.
from collections import OrderedDict

new_params = OrderedDict()
n = len(clients)  # number of clients

for client_model in clients:
    sd = client_model.state_dict()  # get current parameters of one client
    for k, v in sd.items():
        new_params[k] = new_params.get(k, 0) + v / n
After that, new_params is a state_dict (you can load it using .load_state_dict()) holding the average weights of the clients.
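To actually use the averaged weights, load them back into the global model and build the optimizer from that model's parameters rather than from the averaged tensors themselves (a sketch; global_model is an assumed model with the same architecture as the clients):
import torch.optim as optim

global_model.load_state_dict(new_params)  # copy the averaged weights into the model
# optimize the model's (leaf) parameters, not the averaged tensors directly
optimizerD = optim.SGD(global_model.parameters(), lr=0.01, momentum=0.5)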

Is it possible to modify a cp_sat model after construction?

I have a model for finding a particular class of integer numbers (the "Keith numbers"), which works well, but is quite slow as it requires constructing a new model many times. Is there a way to update a model, in particular to change the coefficients in a constraint? In other words, can I change the model to match a different mat without reconstructing the whole thing?
def _construct_model(self, mat):
    model = cp_model.CpModel()
    digit = [model.NewIntVar(0, 9, f'digit[{i}]') for i in range(self.k)]
    # Creates the constraint.
    model.Add(sum([mat[i] * digit[i] for i in range(self.k)]) == 0)
    model.Add(digit[0] != 0)
    return model, digit
Yes, but you are on your own.
You can access the underlying cp_model_proto protobuf from the model, and modify it directly.
We currently have no plan to add a modification API on top of the cp_model API.
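As an unsupported sketch of what that could look like for the model above (assuming the sum constraint is the first one added, and new_mat is hypothetical and holds the new coefficient vector):
proto = model.Proto()                 # the underlying CpModelProto
linear = proto.constraints[0].linear  # the sum(mat[i] * digit[i]) == 0 constraint
del linear.coeffs[:]                  # drop the old coefficients
linear.coeffs.extend(new_mat)         # install the new ones; the variables stay the same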

Feeding list of tf.truncated_normal() or list of dictionaries into a Tensorflow Model

I am new to TensorFlow and I am trying to learn how to use the tool efficiently.
I expand on the question below, but here is the TL;DR:
I am wondering what is the best way to feed the following weights and biases into my model with feed_dict:
def generate_initial_population(my_population_size):
    my_weights = []
    my_biases = []
    for _ in range(my_population_size):
        my_weights.append({
            'h1': tf.Variable(tf.truncated_normal([n_inputs, n_hidden_1])),
            'h2': tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2])),
            'out': tf.Variable(tf.truncated_normal([n_hidden_2, n_class]))
        })
        my_biases.append({
            'b1': tf.Variable(tf.truncated_normal([n_hidden_1])),
            'b2': tf.Variable(tf.truncated_normal([n_hidden_2])),
            'out': tf.Variable(tf.truncated_normal([n_class]))
        })
    return my_weights, my_biases
weights, biases = generate_initial_population(population_size)
I cannot simply use feed_dict={weights_ph: weights} because it will generate errors, and I do not know how to deal with this problem efficiently.
Examining the code at the end might help with understanding what I am talking about.
I am wondering if there is any way I could feed a list containing tf.truncated_normal tensors to my model.
I get the error ValueError: setting an array element with a sequence., because I believe it is trying to convert the list to an np.array but has issues with the dimensions.
I have found an easy workaround where I figure out the values of all the tensors first with session.run and then feed those values into my model.
I am just not sure whether this is the right solution, since I would be inclined to believe it is slower because you have to run the session twice.
This solution also does not work, however, if my original list does not have a regular shape,
like [[1, [1, 2]]], or when my truncated_normals do not have the same shapes.
I was thinking I would just feed my oddly shaped list into my model and then use tf.gather to get the specific indices I want to work on.
Since I cannot do that, is my solution the proper way to deal with this: simply evaluating the truncated_normals first, feeding those values into the model, and then reshaping the list inside the model if needed?
I am also having a very similar problem because I wanted to feed a list of dictionaries into the model as well. Is the proper way of dealing with that to extract the data from the dictionaries and then just feed in each value from each key separately?
I am trying to learn, and I could not find this information elsewhere.
Here is a code snippet I designed to fail, to explain what I mean:
import tensorflow as tf

list_ph = tf.placeholder(dtype=tf.float32)
index_ph = tf.placeholder(dtype=tf.int32)

def model(my_list, index):
    value = tf.gather(my_list, index, axis=0)
    return value

my_model = model(list_ph, index_ph)

with tf.Session() as sess:
    var_list = []
    truncated_normal = tf.Variable(tf.truncated_normal(shape=[5, 3]))
    for i in range(4):
        var_list.append(truncated_normal)

    # for i in range(4):
    #     var_list.append({i: i*2})

    sess.run(tf.global_variables_initializer())

    # will work, but will not work for dictionaries
    val = sess.run(var_list)

    # will not work, but will work if you feed val
    var = sess.run(my_model, feed_dict={list_ph: var_list, index_ph: 1})
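For reference, the workaround described above (evaluating the tensors first, then feeding the resulting NumPy arrays) would look roughly like this inside the same session; this is a sketch reusing the snippet's names:
    # feed the materialized NumPy arrays (stacked to shape (4, 5, 3)) instead of the Variable objects
    val = sess.run(var_list)
    out = sess.run(my_model, feed_dict={list_ph: val, index_ph: 1})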

Doc2Vec.infer_vector keeps giving different result everytime on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I modified the code in line 10 to determine the best matching document for the given query, and every time I run it, I get a completely different result set. My new code in line 10 of the notebook is:
inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims]
print(rank)
Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The difference is stark, and the results just do not seem to match.
Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?
Looking into the code: infer_vector uses parts of the algorithm that are non-deterministic. Initialization of the word vectors is deterministic (see the code of seeded_vector), but the steps further on, i.e. random sampling of words and negative sampling (updating only a sample of word vectors per iteration), can cause non-deterministic output (thanks @gojomo).
def seeded_vector(self, seed_string):
    """Create one 'random' vector (but deterministic by seed_string)"""
    # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(self.vector_size) - 0.5) / self.vector_size
Set negative=0 to avoid randomization:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)

a = list('test sample')
b = list('testtesttest')

for s in (a, b):
    v1 = model.infer_vector(s)
    for i in range(100):
        v2 = model.infer_vector(s)
        assert np.all(v1 == v2), "Failed on %s" % (''.join(s))

Implementing a complicated activation function in keras

I just read an interesting paper: A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks.
I'd like to try to implement this activation function in Keras. I've implemented custom activations before, e.g. a sinusoidal activation:
from keras import backend as K
from keras.layers import Activation
from keras.utils.generic_utils import get_custom_objects

def sin(x):
    return K.sin(x)

get_custom_objects().update({'sin': Activation(sin)})
However, the activation function in this paper has 3 unique properties:
It doubles the size of the input (the output is 2x the input)
It's parameterized
Its parameters should be regularized
I think once I have a skeleton for dealing with the above 3 issues, I can work out the math myself, but I'll take any help I can get!
Here, we will need one of these two:
A Lambda layer - If your parameters are not trainable (you don't want them to change with backpropagation)
A custom layer - If you need custom trainable parameters.
The Lambda layer:
If your parameters are not trainable, you can define your function for a lambda layer. The function takes one input tensor, and it can return anything you want:
import keras.backend as K

def customFunction(x):
    # x can be either a single tensor or a list of tensors
    # if a list, use the elements x[0], x[1], etc.

    # Perform your calculations here using the keras backend.
    # If you could share which formula exactly you're trying to implement,
    # it's possible to make this answer better and more to the point.

    # dummy example
    alphaReal = K.variable([someValue])
    alphaImag = K.variable([anotherValue])  # or even an array of values

    realPart = alphaReal * K.someFunction(x) + ...
    imagPart = alphaImag * K.someFunction(x) + ...

    # You can return them as two outputs in a list (requires the functional API model),
    # or you can find backend functions that join them together, such as K.stack.
    return [realPart, imagPart]

# I think the separate approach will give you better control of what to do next.
For what you can do, explore the backend functions.
For the parameters, you can define them as keras constants or variables (K.constant or K.variable), either inside or outside the function above, or even turn them into model inputs. See details in this answer.
In your model, you just add a lambda layer that uses that function.
In a Sequential model: model.add(Lambda(customFunction, output_shape=someShape))
In a functional API model: output = Lambda(customFunction, ...)(inputOrListOfInputs)
If you're going to pass more inputs to the function, you'll need the functional model API.
If you're using TensorFlow, the output_shape will be computed automatically; I believe only Theano requires it. (Not sure about CNTK.)
The custom layer:
A custom layer is a new class you create. This approach will only be necessary if you're going to have trainable parameters in your function. (Such as: optimize alpha with backpropagation)
Keras teaches it here.
Basically, you have an __init__ method where you pass the constant parameters, a build method where you create the trainable parameters (weights), a call method that will do the calculations (exactly what would go in the lambda layer if you didn't have trainable parameters), and a compute_output_shape method so you can tell the model what the output shape is.
class CustomLayer(Layer):

    def __init__(self, alphaReal, alphaImag, **kwargs):
        self.alphaReal = alphaReal
        self.alphaImag = alphaImag
        super(CustomLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # weights may or may not depend on the input shape
        # you may use it or not...

        # suppose we want just two trainable values:
        weightShape = (2,)

        # create the weights:
        self.kernel = self.add_weight(name='kernel',
                                      shape=weightShape,
                                      initializer='uniform',
                                      trainable=True)

        super(CustomLayer, self).build(input_shape)  # Be sure to call this somewhere!

    def call(self, x):
        # all the calculations go here:

        # dummy example using the constant inputs
        realPart = self.alphaReal * K.someFunction(x) + ...
        imagPart = self.alphaImag * K.someFunction(x) + ...

        # dummy example taking elements of the trainable weights
        realPart = self.kernel[0] * realPart
        imagPart = self.kernel[1] * imagPart

        # all the comments for the lambda layer above are valid here

        # example returning a list
        return [realPart, imagPart]

    def compute_output_shape(self, input_shape):
        # if you decide to return a list of tensors in the call method,
        # return a list of shapes here, twice the input shape:
        return [input_shape, input_shape]

        # if you stacked your results into a single tensor instead, return a single
        # tuple, maybe with an additional dimension equal to 2:
        # return input_shape + (2,)
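A hypothetical usage sketch for the skeleton above in the functional API (the alpha values and the input shape are made up):
from keras.layers import Input
from keras.models import Model

inp = Input(shape=(10,))
# the layer returns a list, so it yields two output tensors, each shaped like the input
realPart, imagPart = CustomLayer(alphaReal=0.5, alphaImag=0.3)(inp)
model = Model(inputs=inp, outputs=[realPart, imagPart])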
You need to implement a "Layer", not a common activation function.
I think the implementation of PReLU in Keras would be a good example for your task. See PReLU.
A lambda function in the activation worked for me. Maybe not what you want but it's one step more complicated than the simple use of a built-in activation function.
encoder_outputs = Dense(units=latent_vector_len, activation=k.layers.Lambda(lambda z: k.backend.round(k.layers.activations.sigmoid(x=z))), kernel_initializer="lecun_normal")(x)
This code changes the output of a Dense layer from real values to 0/1 (i.e., binary).
Keras throws a warning, but the code still works.

Resources