Constraint on the sum of parameters in a Keras layer

I want to add a custom constraint on the parameters of a layer.
I wrote a custom activation layer with two trainable parameters a and b such that:
activation_fct = a*fct() + b*fct().
I need the sum of the parameters (a+b) to be equal to 1, but I don't know how to write such a constraint.
Can you give me some advice?
Thanks in advance.

Two approaches come to my mind.
The first one is to lock one of the parameters, say b, and make only the other one (a in this case) trainable. Then you can compute b as follows:
b = 1 - a
The second approach could be to make both a and b trainable and transform them via the softmax function. Softmax will make sure that their sum is always 1.
from scipy.special import softmax
a = 0.12
b = 0.3
w1, w2 = softmax([a, b])
print(f'w1: {w1}, w2: {w2}, w1 + w2: {w1 + w2}')
This will produce
w1: 0.45512110762641994, w2: 0.5448788923735801, w1 + w2: 1.0
And once you have w1 and w2, you can use them in the formula above instead of a and b.
activation_fct = w1 * fct() + w2 * fct()
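Note that scipy's softmax is just for illustration here; inside a layer you would apply the softmax to the trainable weights in the graph itself, so that the sum-to-one property holds (and stays differentiable) during training. A two-line sketch with hypothetical values:
import tensorflow as tf

w = tf.Variable([0.12, 0.3])          # raw trainable weights
a, b = tf.unstack(tf.nn.softmax(w))   # a + b == 1 by construction, and differentiable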

You can have a single weight instead of two, and use this custom constraint:
import keras
import keras.backend as K

class Between_0_1(keras.constraints.Constraint):
    def __call__(self, w):
        return K.clip(w, 0, 1)
Then, when building the weights, build only a and use the constraint:
def build(self, input_shape):
    self.a = self.add_weight(name='weight_a',
                             shape=(1,),
                             initializer='uniform',
                             constraint=Between_0_1(),
                             trainable=True)

    # if you want to start at 0.5
    K.set_value(self.a, [0.5])

    self.built = True
In call, b = 1-a:
def call(self, inputs, **kwargs):
    # do stuff
    ....
    return (self.a * something) + ((1 - self.a) * another_thing)
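Putting the pieces together, here is a minimal end-to-end sketch (assuming TF2-style Keras, with relu and tanh as hypothetical stand-ins for the two fct() calls):
import tensorflow as tf
from tensorflow import keras

class Between_0_1(keras.constraints.Constraint):
    def __call__(self, w):
        return tf.clip_by_value(w, 0.0, 1.0)

class ConvexActivation(keras.layers.Layer):
    # computes a * relu(x) + (1 - a) * tanh(x), with a constrained to [0, 1]
    def build(self, input_shape):
        self.a = self.add_weight(name='weight_a', shape=(1,),
                                 initializer=keras.initializers.Constant(0.5),
                                 constraint=Between_0_1(), trainable=True)

    def call(self, inputs):
        return self.a * tf.nn.relu(inputs) + (1.0 - self.a) * tf.nn.tanh(inputs)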
You can also try @MatusDubrava's softmax approach, but in this case your weights need to have shape (2,), and no constraint is needed:
def build(self, input_shape):
    self.w = self.add_weight(name='weights',
                             shape=(2,),
                             initializer='zeros',
                             trainable=True)
    self.built = True

def call(self, inputs, **kwargs):
    w = K.softmax(self.w)
    # do stuff
    ....
    return (w[0] * something) + (w[1] * another_thing)


What is the "correct" or best way in jax to implement a Dense layer where each layer might or might not have a bias?

For example, in jax.experimental.stax there is a Dense layer implemented like this:
def Dense(out_dim, W_init=glorot_normal(), b_init=normal()):
    """Layer constructor function for a dense (fully-connected) layer."""
    def init_fun(rng, input_shape):
        output_shape = input_shape[:-1] + (out_dim,)
        k1, k2 = random.split(rng)
        W, b = W_init(k1, (input_shape[-1], out_dim)), b_init(k2, (out_dim,))
        return output_shape, (W, b)
    def apply_fun(params, inputs, **kwargs):
        W, b = params
        return jnp.dot(inputs, W) + b
    return init_fun, apply_fun
If we implement bias as being allowed to be None, for example, or params as having length 1, there are implications for how grad works.
What is the pattern one should aim for here? jax.jit has a static_argnums that I suppose could be used with some has_bias param, but the book-keeping involved is substantial and I am sure there must be some examples somewhere.
Wouldn't this work?
def Dense(out_dim, W_init=glorot_normal(), b_init=normal()):
    """Layer constructor function for a dense (fully-connected) layer."""
    def init_fun(rng, input_shape):
        output_shape = input_shape[:-1] + (out_dim,)
        k1, k2 = random.split(rng)
        W = W_init(k1, (input_shape[-1], out_dim))
        if b_init:
            b = b_init(k2, (out_dim,))
            return output_shape, (W, b)
        return output_shape, (W,)
    def apply_fun(params, inputs, **kwargs):
        if len(params) == 1:
            W, = params
            return jnp.dot(inputs, W)
        else:
            W, b = params
            return jnp.dot(inputs, W) + b
    return init_fun, apply_fun
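One supporting detail: grad and jit operate on arbitrary pytrees, so a params tuple of length 1 or 2 is handled naturally, and the Python-level len(params) branch is resolved at trace time, so no static_argnums is needed. A self-contained sketch (assuming jax.nn.initializers for the init functions):
import jax.numpy as jnp
from jax import random, grad
from jax.nn.initializers import glorot_normal, normal

def Dense(out_dim, W_init=glorot_normal(), b_init=normal()):
    # the params pytree is (W,) when b_init is None, otherwise (W, b)
    def init_fun(rng, input_shape):
        output_shape = input_shape[:-1] + (out_dim,)
        k1, k2 = random.split(rng)
        W = W_init(k1, (input_shape[-1], out_dim))
        params = (W,) if b_init is None else (W, b_init(k2, (out_dim,)))
        return output_shape, params
    def apply_fun(params, inputs, **kwargs):
        out = jnp.dot(inputs, params[0])
        return out + params[1] if len(params) == 2 else out
    return init_fun, apply_fun

# usage: gradients flow through whichever pytree shape was built
init_fun, apply_fun = Dense(3, b_init=None)
_, params = init_fun(random.PRNGKey(0), (4,))
g = grad(lambda p, x: jnp.sum(apply_fun(p, x)))(params, jnp.ones(4))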

Product feature optimization with constraints

I have trained a LightGBM model on a learning-to-rank dataset. The model predicts the relevance score of a sample, so the higher the prediction the better. Now that the model has been trained, I would like to find the best values of some features that give me the highest prediction score.
So, let's say I have features u,v,w,x,y,z and the features I would like to optimize over are x,y,z.
maximize f(u,v,w,x,y,z) w.r.t features x,y,z where f is a lightgbm model
subject to constraints :
y = Ax + b
z = 4 if y < thresh_a else 4-0.5 if y >= thresh_b else 4-0.3
thresh_m < x <= thresh_n
The numbers are randomly made up but constraints are linear.
The objective function with respect to x is very spiky and non-smooth (plot omitted). I also don't have gradient information, since f is a LightGBM model.
Using Nathan's answer, I wrote the following class:
class ProductOptimization:
    def __init__(self, estimator, features_to_change, row_fixed_values,
                 bnds=None):
        self.estimator = estimator
        self.features_to_change = features_to_change
        self.row_fixed_values = row_fixed_values
        self.bounds = bnds

    def get_sample(self, x):
        new_values = {k: v for k, v in zip(self.features_to_change, x)}
        return self.row_fixed_values.replace({k: {self.row_fixed_values[k].iloc[0]: v}
                                              for k, v in new_values.items()})

    def _call_model(self, x):
        pred = self.estimator.predict(self.get_sample(x))
        return pred[0]

    def constraint1(self, vector):
        x = vector[0]
        y = vector[2]
        return  # some float value

    def constraint2(self, vector):
        x = vector[0]
        y = vector[3]
        return  # some float value

    def optimize_slsqp(self, initial_values):
        con1 = {'type': 'eq', 'fun': self.constraint1}
        con2 = {'type': 'eq', 'fun': self.constraint2}
        cons = [con1, con2]

        result = minimize(fun=self._call_model,
                          x0=np.array(initial_values),
                          method='SLSQP',
                          bounds=self.bounds,
                          constraints=cons)
        return result
The results I get are always around the initial guess, and I think it's because of the non-smoothness of the function and the absence of gradient information, which is important for the SLSQP optimizer. Any advice on how I should deal with this kind of problem?
It's been a good minute since I last wrote some serious code, so I apologize if it's not entirely clear what everything does; please feel free to ask for more explanations.
The imports:
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
from scipy.optimize import minimize
from copy import copy
First I define a new class that allows me to easily redefine values. This class has 5 inputs:
value: this is the 'base' value. In your equation y = Ax + b, it's the b part.
minimum: this is the minimum value this type will evaluate as
maximum: this is the maximum value this type will evaluate as
multipliers: the first tricky one. It's a list of [InputType, multiplier] pairs. In your example y = Ax + b you would have [[x, A]]; if the equation were y = Ax + Bz + Cd it would be [[x, A], [z, B], [d, C]]
relations: the most tricky one. It's a list of four-item entries: the first item is the InputType; the second is min for an upper boundary or max for a lower boundary; the third is the value of the boundary; and the fourth is the output value connected to it
Watch out: if you define your input values too strangely, I'm sure there will be weird behaviour.
class InputType:
    def __init__(self, value=0, minimum=-1e99, maximum=1e99, multipliers=[], relations=[]):
        """
        :param float value: base value
        :param float minimum: value can never be lower than x
        :param float maximum: value can never be higher than y
        :param multipliers: [[InputType, multiplier], [InputType, multiplier]]
        :param relations: [[InputType, min, threshold, output_value], [InputType, max, threshold, output_value]]
        """
        self.val = value
        self.min = minimum
        self.max = maximum
        self.multipliers = multipliers
        self.relations = relations

    def reset_val(self, value):
        self.val = value

    def evaluate(self):
        """
        - relations to other variables are evaluated first; if there are none, the rest is evaluated
        - at most self.max
        - at least self.min
        - self.val + i_x * w_x
          i_x is input i, w_x is multiplier (weight) of i
        """
        for term, min_max, value, output_value in self.relations:
            # check for each term if it falls outside of the expected bounds
            if min_max(term.evaluate(), value) != term.evaluate():
                return self.return_value(output_value)

        output_value = self.val + sum([i[0].evaluate() * i[1] for i in self.multipliers])
        return self.return_value(output_value)

    def return_value(self, output_value):
        return min(self.max, max(self.min, output_value))
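As a quick hypothetical check of the evaluation logic, with A = 5 and b = 2 as in the example equation y = Ax + b:
x_demo = InputType(value=4.0, minimum=4, maximum=6)
y_demo = InputType(value=2.0, multipliers=[[x_demo, 5]])
print(y_demo.evaluate())  # 5*4 + 2 = 22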
Using this, you can fix the input types sent from the optimizer, as shown in _call_model:
class Example:
    def __init__(self, lst_args):
        self.lst_args = lst_args

        self.X = np.random.random((10000, len(lst_args)))
        self.y = self.get_y()
        self.clf = GradientBoostingRegressor()
        self.fit()

    def get_y(self):
        # sum of squares; minimum at x = [0, 0, 0, ...]
        return np.array([self._func(i) for i in self.X])

    def _func(self, i):
        return sum(i * i)

    def fit(self):
        self.clf.fit(self.X, self.y)

    def optimize(self):
        x0 = [0.5 for i in self.lst_args]
        initial_simplex = self._get_simplex(x0, 0.1)
        result = minimize(fun=self._call_model,
                          x0=np.array(x0),
                          method='Nelder-Mead',
                          options={'xatol': 0.1,
                                   'initial_simplex': np.array(initial_simplex)})
        return result

    def _get_simplex(self, x0, step):
        simplex = []
        for i in range(len(x0)):
            point = copy(x0)
            point[i] -= step
            simplex.append(point)

        point2 = copy(x0)
        point2[-1] += step
        simplex.append(point2)
        return simplex

    def _call_model(self, x):
        print(x, type(x))
        for i, value in enumerate(x):
            self.lst_args[i].reset_val(value)
        input_x = np.array([i.evaluate() for i in self.lst_args])
        prediction = self.clf.predict([input_x])
        return prediction[0]
I can define your problem as shown below (be sure to define the inputs in the same order as the final list, otherwise not all the values will get updated correctly in the optimizer!):
A = 5
b = 2
thresh_a = 5
thresh_b = 10
thresh_c = 10.1
thresh_m = 4
thresh_n = 6
u = InputType()
v = InputType()
w = InputType()
x = InputType(minimum=thresh_m, maximum=thresh_n)
y = InputType(value = b, multipliers=([[x, A]]))
z = InputType(relations=[[y, max, thresh_a, 4], [y, min, thresh_b, 3.5], [y, max, thresh_c, 3.7]])
example = Example([u, v, w, x, y, z])
Calling the results:
result = example.optimize()
for i, value in enumerate(result.x):
    example.lst_args[i].reset_val(value)
print(f"final values are at: {[i.evaluate() for i in example.lst_args]}: {result.fun}")

Pytorch autograd: Make gradient of a parameter a function of another parameter

In Pytorch, how can I make the gradient of a parameter a function itself?
Here is a simple code snippet:
import torch
def fun(q):
    def result(w):
        l = w * q
        l.backward()
        return w.grad
    return result

w = torch.tensor((2.), requires_grad=True)
q = torch.tensor((3.), requires_grad=True)
f = fun(q)
print(f(w))
In the code above, how can I make f(w) have gradient with respect to q?
EDIT: Based on the accepted answer I was able to write code that works. Essentially, I am alternating between two optimization steps. For dim == 1 it works; for dim == 2 it does not, and I get the error "RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time."
import torch
class f_class():
    def __init__(self, dim):
        self.dim = dim
        if self.dim == 1:
            self.w = torch.tensor((3.), requires_grad=True)
        elif self.dim == 2:
            self.w = [torch.tensor((3.), requires_grad=True), torch.tensor((5.), requires_grad=True)]
        else:
            raise ValueError("dim 1 or 2")

    def forward(self, x):
        if self.dim == 1:
            return torch.mul(self.w, x)
        elif self.dim == 2:
            return torch.mul(torch.mul(self.w[0], self.w[1]), x)

    def set_w(self, w):
        self.w = w

    def get_w(self):
        return self.w

class g_class():
    def __init__(self):
        self.q = torch.tensor((4.), requires_grad=True)

    def forward(self, f):
        return torch.mul(self.q, f)

    def set_q(self, q):
        self.q = q

    def get_q(self):
        return self.q

def w_new(f, g, dim):
    loss_g = g.forward(f.forward(xd))
    if dim == 1:
        grads = torch.autograd.grad(loss_g, f.get_w(), create_graph=True, only_inputs=True)[0]
        temp = f.get_w().detach() + grads
    else:
        grads = torch.autograd.grad(loss_g, f.get_w(), create_graph=True, only_inputs=True)
        temp = [wi.detach() + gi for wi, gi in zip(f.get_w(), grads)]
    return temp

def q_new(f, g):
    loss_f = 2 * f.forward(xd)
    loss_f.backward()
    temp = g.get_q().detach() + g.get_q().grad
    temp.requires_grad = True
    return temp
dim = 1
xd = torch.tensor((2.))
f = f_class(dim)
g = g_class()
for _ in range(3):
    print(f.get_w(), g.get_q())
    wnew = w_new(f, g, dim)
    f.set_w(wnew)
    print(f.get_w(), g.get_q())

    qnew = q_new(f, g)
    g.set_q(qnew)
    print(f.get_w(), g.get_q())
When computing gradients, if you want to construct a computation graph for the gradient itself you need to specify create_graph=True to autograd.
A potential source of error in your code is using Tensor.backward within f. The problem here is that w.grad and q.grad will be populated with the gradient of l. This means that when you call f(w).backward(), the gradients of both f and l will be added to w.grad and q.grad. In effect you will end up with w.grad being equal to dl/dw + df/dw and similarly for q.grad. One way to get around this is to zero the gradients after f(w) but before .backward(). A better way is to use torch.autograd.grad within f. Using the latter approach, the grad attribute of w and q will not be populated when calling f, only when calling .backward(). This leaves room for things like gradient accumulation during training.
import torch
def fun(q):
    def result(w):
        l = w * q
        # create_graph=True so the returned gradient is itself differentiable
        return torch.autograd.grad(l, w, only_inputs=True, create_graph=True)[0]
    return result

w = torch.tensor((2.), requires_grad=True)
q = torch.tensor((3.), requires_grad=True)
f = fun(q)
f(w).backward()
print('w.grad:', w.grad)
print('q.grad:', q.grad)
which results in
w.grad: None
q.grad: tensor(1.)
Note that w.grad was not populated. This is because f(w) = dl/dw = q is not a function of w, and therefore w is not part of the computation graph. If you're using a standard pytorch optimizer this is fine since None gradients are implicitly assumed to be zero.
If l were instead a non-linear function of w, then w.grad would have been populated after f(w).backward(). For example
import torch
def fun(q):
    def result(w):
        # now dl/dw = 2 * w * q
        l = w**2 * q
        return torch.autograd.grad(l, w, only_inputs=True, create_graph=True)[0]
    return result

w = torch.tensor((2.), requires_grad=True)
q = torch.tensor((3.), requires_grad=True)
f = fun(q)
f(w).backward()
print('w.grad:', w.grad)
print('q.grad:', q.grad)
which results in
w.grad: tensor(6.)
q.grad: tensor(4.)

Has anyone written weldon pooling for keras?

Has the Weldon pooling [1] been implemented in Keras?
I can see that it has been implemented in pytorch by the authors [2] but cannot find a keras equivalent.
[1] T. Durand, N. Thome, and M. Cord. WELDON: Weakly supervised learning of deep convolutional neural networks. In CVPR, 2016.
[2] https://github.com/durandtibo/weldon.resnet.pytorch/tree/master/weldon
Here is one based on the Lua version (there is a PyTorch implementation, but I think it has an error in taking the average of max+min). I'm assuming the Lua version's average of the top max and min values is still correct. I've not tested all the custom-layer aspects, but it's close enough to get something going; comments welcome.
Tony
# imports assume tf.keras; adjust for your Keras version
import tensorflow as tf
from tensorflow.keras.layers import Layer, InputSpec
from tensorflow.python.keras.utils import conv_utils

class WeldonPooling(Layer):
    """Class to implement Weldon selective spatial pooling with negative evidence"""

    #@interfaces.legacy_global_pooling_support
    def __init__(self, kmax, kmin=-1, data_format=None, **kwargs):
        super(WeldonPooling, self).__init__(**kwargs)
        self.data_format = conv_utils.normalize_data_format(data_format)
        self.input_spec = InputSpec(ndim=4)
        self.kmax = kmax
        self.kmin = kmin

    def compute_output_shape(self, input_shape):
        if self.data_format == 'channels_last':
            return (input_shape[0], input_shape[3])
        else:
            return (input_shape[0], input_shape[1])

    def get_config(self):
        config = {'data_format': self.data_format}
        base_config = super(WeldonPooling, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def call(self, inputs):
        # work internally in channels-first layout
        if self.data_format == 'channels_last':
            inputs = tf.transpose(inputs, [0, 3, 1, 2])
        kmax = self.kmax
        kmin = self.kmin
        shape = tf.shape(inputs)
        batch_size = shape[0]
        num_channels = shape[1]
        h = shape[2]
        w = shape[3]
        n = h * w
        view = tf.reshape(inputs, [batch_size, num_channels, n])
        sorted_vals, indices = tf.nn.top_k(view, n, sorted=True)

        # average of the kmax highest activations per channel
        #indices_max = tf.slice(indices, [0, 0, 0], [batch_size, num_channels, kmax])
        output = tf.reduce_sum(tf.slice(sorted_vals, [0, 0, 0], [batch_size, num_channels, kmax]), 2) / kmax

        if kmin > 0:
            # add the average of the kmin lowest activations (negative evidence)
            #indices_min = tf.slice(indices, [0, 0, n - kmin], [batch_size, num_channels, kmin])
            output = output + tf.reduce_sum(tf.slice(sorted_vals, [0, 0, n - kmin], [batch_size, num_channels, kmin]), 2) / kmin

        return tf.reshape(output, [batch_size, num_channels])
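Untested beyond a quick shape check, but a smoke test could look like this (assuming TF 2.x eager execution and channels-last input):
import numpy as np
feature_maps = tf.constant(np.random.rand(2, 7, 7, 16), dtype=tf.float32)
pool = WeldonPooling(kmax=3, kmin=2)
print(pool(feature_maps).shape)  # (2, 16)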

The parameters of the RNN in the Theano tutorial

class RNNSLU(object):
    ''' elman neural net model '''
    def __init__(self, nh, nc, ne, de, cs):
        '''
        nh :: dimension of the hidden layer
        nc :: number of classes
        ne :: number of word embeddings in the vocabulary
        de :: dimension of the word embeddings
        cs :: word window context size
        '''
        # parameters of the model
        self.emb = theano.shared(name='embeddings',
                                 value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                                                  (ne + 1, de))
                                 # add one for padding at the end
                                 .astype(theano.config.floatX))
        self.wx = theano.shared(name='wx',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                                                 (de * cs, nh))
                                .astype(theano.config.floatX))
        self.wh = theano.shared(name='wh',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                                                 (nh, nh))
                                .astype(theano.config.floatX))
        self.w = theano.shared(name='w',
                               value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                                                (nh, nc))
                               .astype(theano.config.floatX))
        self.bh = theano.shared(name='bh',
                                value=numpy.zeros(nh,
                                                  dtype=theano.config.floatX))
        self.b = theano.shared(name='b',
                               value=numpy.zeros(nc,
                                                 dtype=theano.config.floatX))
        self.h0 = theano.shared(name='h0',
                                value=numpy.zeros(nh,
                                                  dtype=theano.config.floatX))
        # bundle
        self.params = [self.emb, self.wx, self.wh, self.w, self.bh, self.b, self.h0]

        def recurrence(x_t, h_tm1):
            h_t = T.nnet.sigmoid(T.dot(x_t, self.wx)
                                 + T.dot(h_tm1, self.wh) + self.bh)
            s_t = T.nnet.softmax(T.dot(h_t, self.w) + self.b)
            return [h_t, s_t]

        [h, s], _ = theano.scan(fn=recurrence,
                                sequences=x,
                                outputs_info=[self.h0, None],
                                n_steps=x.shape[0])
I am following this Theano tutorial about RNNs (http://deeplearning.net/tutorial/rnnslu.html), but I have two questions about it.
First, in this tutorial the recurrence function looks like this:
def recurrence(x_t, h_tm1):
    h_t = T.nnet.sigmoid(T.dot(x_t, self.wx) + T.dot(h_tm1, self.wh) + self.bh)
    s_t = T.nnet.softmax(T.dot(h_t, self.w) + self.b)
    return [h_t, s_t]
I wonder why we do not add h0 in h_t (i.e. h_t = T.nnet.sigmoid(T.dot(x_t, self.wx) + T.dot(h_tm1, self.wh) + self.bh + self.h0))?
Second, why outputs_info=[self.h0, None]? I know outputs_info gives the initialization values, so I would have expected something like outputs_info=[self.bh+self.h0, T.nnet.softmax(T.dot(self.bh+self.h0, self.w_h2y) + self.b_h2y)]
def recurrence(x_t, h_tm1):
    h_t = T.nnet.sigmoid(T.dot(x_t, self.wx)
                         + T.dot(h_tm1, self.wh) + self.bh)
    s_t = T.nnet.softmax(T.dot(h_t, self.w) + self.b)
    return [h_t, s_t]
So, first you ask why we don't use h0 in the recurrence function. Let's break down this part:
h_t = T.nnet.sigmoid(T.dot(x_t, self.wx) + T.dot(h_tm1, self.wh) + self.bh)
What we expect is 3 terms.
The first term is the input layer multiplied by the weighting matrix T.dot(x_t, self.wx).
The second term is the hidden layer multiplied by another weighting matrix (this is what makes it recurrent): T.dot(h_tm1, self.wh). Note that you must have a weighting matrix here; adding self.h0 as you proposed would basically just add another bias.
The third term is the bias of the hidden layer, self.bh.
Now, after every iteration we want to keep track of the hidden layer activations, contained in self.h0. However, self.h0 is meant to contain the CURRENT activations and what we need is the previous activations.
[h, s], _ = theano.scan(fn=recurrence,
                        sequences=x,
                        outputs_info=[self.h0, None],
                        n_steps=x.shape[0])
So, look at the scan function again. You are right that outputs_info=[self.h0, None] initializes the values, but the values are also linked to the outputs. There are two outputs from recurrence(), namely [h_t, s_t].
So what outputs_info also does is that after every iteration, the value of self.h0 is overwritten with the value of h_t (the first returned value). The second element of outputs_info is None because we do not save or initialize the value of s_t anywhere (the second argument of outputs_info is linked to the returned values of the recurrence function in this way).
In the next iteration, the first argument of outputs_info is used again as input, so that h_tm1 has the same value as self.h0. But since we must have an argument for h_tm1, we must initialize that value. Since we don't need to initialize a second argument in outputs_info, we leave the second term as None.
Granted, the theano.scan() function is very confusing at times and I'm new at it too. But, this is what I understood from doing this same tutorial.
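A tiny standalone example may make the outputs_info linkage more concrete (a sketch using a cumulative sum as the recurrence, not part of the tutorial):
import numpy as np
import theano
import theano.tensor as T

x = T.vector('x')
h0 = theano.shared(np.float64(0.0), name='h0')

def recurrence(x_t, h_tm1):
    h_t = h_tm1 + x_t   # plays the role of the hidden state
    s_t = 2 * h_t       # a per-step output, like the softmax above
    return [h_t, s_t]

[h, s], _ = theano.scan(fn=recurrence, sequences=x,
                        outputs_info=[h0, None], n_steps=x.shape[0])
f = theano.function([x], [h, s])
print(f(np.array([1., 2., 3.])))  # h = [1, 3, 6], s = [2, 6, 12]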
