what's the difference between "self-attention mechanism" and "full-connection" layer? - pytorch

I am confused by these two structures. In theory, the outputs of both are connected to all of their inputs, so what makes the 'self-attention mechanism' more powerful than a fully-connected layer?

Ignoring details like normalization, biases, and such, a fully connected layer uses fixed weights:
f(x) = Wx
where W is learned in training and fixed at inference.
A self-attention layer is dynamic: the mixing weights are themselves computed from the input,
attn(x) = softmax((W_q x)(W_k x)^T)
f(x) = attn(x) (W_v x)
so the effective weight matrix changes with every input. Again, this is ignoring a lot of details, and there are many different implementations for different applications, so you should really check a paper for the variant you care about.
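To make the contrast concrete, here is a rough single-head PyTorch sketch (the dimensions and variable names are made up for illustration, not taken from any particular implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

d = 16                      # feature dimension (arbitrary for the sketch)
x = torch.randn(10, d)      # a sequence of 10 token vectors

# Fully connected: one fixed weight matrix, the same for every input.
fc = nn.Linear(d, d, bias=False)
y_fc = fc(x)                # y = x @ W.T; W never depends on x

# Minimal single-head self-attention: the mixing weights are recomputed
# from the input itself via the query/key projections.
Wq = nn.Linear(d, d, bias=False)
Wk = nn.Linear(d, d, bias=False)
Wv = nn.Linear(d, d, bias=False)
scores = Wq(x) @ Wk(x).T / d ** 0.5      # (10, 10) input-dependent weights
attn = F.softmax(scores, dim=-1)
y_attn = attn @ Wv(x)                    # each output mixes all inputs, weighted by attn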

Related

How to perform de-normalization of last layer into labels in Keras, analogous to the preprocessing normalization layer (but inversed)?

It is my understanding that Artificial Neural Networks work best on normalized data, i.e. typically inputs and outputs should ideally have a mean of 0 and a variance of 1 (and even, if possible, a "near-Gaussian", or at least "well-behaved", distribution).
Therefore, I have seen / written quite a few Keras-using scripts where I first do some feature-wise normalization of the predictors and labels. This is a pain, as it means keeping track of a number of mean and std values and applying them correctly later at inference, etc.
I found out recently that there is now out-of-the-box functionality for doing the predictor normalization in Keras in an "adaptable, not trainable" way, which is very convenient, as all the normalization information gets stored and used out-of-the-box in the network object: see https://keras.io/guides/preprocessing_layers/ and https://keras.io/api/layers/preprocessing_layers/numerical/normalization/#normalization-class . This makes usage / bookkeeping much simpler.
My question is: would it make sense / is there a simple way to similarly do in-Keras an "outputs de-normalization", i.e., assuming that the outputs from the network have mean 0 and variance 1, add an adaptable (adaptable not trainable; similar to the preprocessing normalization layer) layer that de-normalize these outputs into the correct mean and variance for each label?
I guess this is quite similar to the preprocessing normalization layer, except that what we would like is the "inverse transformation" of what would be obtained by applying the preprocessing normalization layer on the labels. I.e., when adapting the layer to labels, one gets a layer that "de-normalizes" a 0-mean 1-std distribution into a distribution with feature-wise mean and std corresponding to the labels.
I do not see a way to get this "inverse layer" or "de-normalization layer". Am I missing something, or is there a simple way to do it?
The normalization layer has an invert parameter:
If True, this layer will apply the inverse transformation to its
inputs: it would turn a normalized input back into its original form.
So, in theory you could use:
layer = tf.keras.layers.Normalization(invert=True)
to de-normalize. Currently, this is wrongly implemented and will not work, but it seems the bug is already fixed for the next Keras version.
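Once the fix lands, usage would presumably look something like this (a sketch only; the data is made up, and the invert behavior follows the docstring quoted above):

import tensorflow as tf

# Hypothetical training labels with their own mean/std.
y_train = tf.random.normal((1000, 3)) * 5.0 + 20.0

# Normalize the labels for training ...
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(y_train)
y_train_scaled = norm(y_train)

# ... and de-normalize predictions with an inverted layer adapted on the same labels.
denorm = tf.keras.layers.Normalization(axis=-1, invert=True)
denorm.adapt(y_train)

# e.g. append it after the model so predictions come out on the original scale:
# outputs = denorm(model_outputs)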

Is there a mean-variance normalization layer in PyTorch?

I am new to PyTorch and I would like to add a mean-variance normalization layer to my network that will normalize features to zero mean and unit standard deviation. I got a bit confused reading the documentation, could anyone give me some leads?
As @Ivan commented, the normalization can be done on many levels. However, as you say you want to
normalize features to zero mean and unit standard deviation
I suppose you just want to feed unbiased data to the network. If that's the case, you should treat it as a data preprocessing step rather than a layer of your model, and basically do:
X = (X - torch.mean(X, dim=0))/torch.std(X, dim=0)
As an alternative, you can use torchvision.transforms:
preprocess = torchvision.transforms.Normalize(mean=torch.mean(X, dim=0), std=torch.std(X, dim=0))
X = preprocess(X)
as in this native ResNet example. Note how it is reasonably assumed that future data will always have roughly the same mean and std as the set used for the initial calculation (presumably the training set). For this reason, you should preserve the initially calculated values and use them for preprocessing in any future inference scenario.
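For example, a minimal sketch of that bookkeeping (X_train and X_test are placeholder tensors):

import torch

X_train = torch.randn(1000, 20)   # placeholder training features
X_test = torch.randn(100, 20)     # placeholder inference-time features

# Compute the statistics once, on the training data only.
mean = X_train.mean(dim=0)
std = X_train.std(dim=0)

# Apply the same transformation everywhere, including at inference time.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std   # reuse the training-set statistics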

Use pytorch optimizer to fit a user defined function

I have read many tutorials on how to use PyTorch to make a regression over a data set, using, for instance, a model composed of several linear layers and the MSE loss.
Well, imagine that I know the function F depends on a variable x and some unknown parameters (p_j: j=0,..., P-1), with P relatively small, but that the function is a composition of special functions. So my problem is the classical minimization, knowing the data {x_i, y_i}_{i<=N}:
min_{p_j} Sum_i (F(x_i; {p_j}) - y_i)^2
I would like to know if I can use the PyTorch optimizers, and if so, how I can do it?
Thanks.
In fact, PyTorch experts answered that the function to be minimized must be expressed in terms of torch tensors to let the optimizers compute the gradients. So it is not possible in my case.
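For reference, when F can be written using torch operations (so autograd can differentiate through it), the fit is straightforward; here is a rough sketch with a made-up F and synthetic data:

import torch

# Synthetic data for the sketch: y = F(x; p) + noise, with "true" p = (2.0, 0.5)
x = torch.linspace(0.1, 10.0, 200)
y = 2.0 * torch.special.gammaln(x) + 0.5 * torch.sin(x) + 0.05 * torch.randn(200)

# Unknown parameters p_j, registered as leaf tensors so autograd tracks them.
p = torch.zeros(2, requires_grad=True)

def F(x, p):
    # Example "composition of special functions"; replace with your own,
    # as long as every operation is a torch op.
    return p[0] * torch.special.gammaln(x) + p[1] * torch.sin(x)

opt = torch.optim.Adam([p], lr=0.05)
for step in range(2000):
    opt.zero_grad()
    loss = torch.sum((F(x, p) - y) ** 2)   # Sum_i (F(x_i; p) - y_i)^2
    loss.backward()
    opt.step()

print(p)   # should approach (2.0, 0.5)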

Specify log-normal family in BAMBI model

I'm trying to fit a simple Bayesian regression model to some right-skewed data. Thought I'd try setting family to a log-normal distribution. I'm using pymc3 wrapper BAMBI. Is there a way to build a custom family with a log-normal distribution?
It depends on what you want the mean function of the model to look like.
If you want a model like
Y = exp(b0 + b1*X + e),  with e ~ Normal(0, sigma^2)
then Yes, this is easily achieved by simply log-transforming Y and then estimating the usual linear model with a Normal response. Notice that in this model Y is an exponential function of the predictor X, so when plotting Y against X (both untransformed), the regression line can curve up or down. It also has a multiplicative error term, so that the variance is greater for larger predicted Y values. We can say that such a model has a log link function and a lognormal response.
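In code, that first approach is just a transform plus the usual Gaussian model, roughly like this (a sketch assuming a recent bambi API where the formula is passed to Model(); the data and column names are made up):

import numpy as np
import pandas as pd
import bambi as bmb

# Made-up right-skewed data for the sketch.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.exp(0.5 + 1.2 * x + rng.normal(scale=0.3, size=200))
df = pd.DataFrame({"x": x, "log_y": np.log(y)})

# Fit the usual linear model with a Normal response on the log scale,
# i.e. a lognormal response with a log link on the original scale.
model = bmb.Model("log_y ~ x", df)
results = model.fit()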
But if you want a model like
Y = b0 + b1*X + e,  with e ~ LogNormal(0, sigma^2)
then No, this kind of model is not currently supported by bambi*. This is a model with a lognormal response but an identity link function. The regression of Y on X is a straight line, but the errors have the same lognormal distribution at every point along X, so the variance does not increase for larger predicted Y values. Note that this is an unusual model that I personally have never actually seen used.
* It's possible in theory to roll your own custom Families (although it would require some slight hacking), but the way this is designed in bambi ultimately depends on the families implemented in statsmodels.genmod, which does not currently include lognormal.
Unless I'm misunderstanding something, I think all you need to do is specify link='log' in the fit() call. If your assumption is correct, the exponentiated linear prediction will be normally distributed, and the default error distribution is gaussian, so I don't think you need to build a custom family for this—the default gaussian family with a log link should work fine. But feel free to clarify if this doesn't address your question.

Does this neural network model exist?

I'm looking for a neural network model with specific characteristics. This model may not exist...
I need a network which doesn't use "layers" as traditional artificial neural networks do. Instead, I want [what I believe to be] a more biological model.
This model will house a large cluster of interconnected neurons, like the image below. A few neurons (at the bottom of the diagram) will receive input signals, and a cascade effect will cause successive, connected neurons to possibly fire depending on signal strength and connection weight. This is nothing new, but there are no explicit layers...just more and more distant, indirect connections.
As you can see, I also have the network divided into sections (circles). Each circle represents a semantic domain (a linguistics concept) which is the core information surrounding a concept; essentially a semantic domain is a concept.
Connections between nodes within a section have higher weights than connections between nodes of different sections. So the nodes for "car" are more connected to one another than nodes connecting "English" to "car". Thus, when a neuron in a single section fires (is activated), it is likely that the entire (or most of) the section will also be activated.
All in all, I need output patterns to be used as input for further output, and so on. A cascade effect is what I am after.
I hope this makes sense. Please ask for clarification where needed.
Are there any suitable models in existence that model what I've described, already?
Your neural network resembles one created using evolutionary algorithms, for example a genetic algorithm.
See the following articles for details:
Han - Evolutionary neural networks for anomaly detection based on the behavior of a program
WHITLEY - Genetic Algorithms and Neural Networks
To summarize this type of neural network: neurons and their connections are created using evolutionary techniques, so there is no strict layer structure. Han uses the following technique:
"Genetic Operations:
The crossover operator produces a new descendant by exchanging partial sections between two neural networks. It selects two distinct neural networks randomly and chooses one hidden node as the pivot point. Then, they exchange the connection links and the corresponding weights based on the selected pivot point.
The mutation operator changes a connection link and the corresponding weight of a randomly selected neural network. It performs one of two operations: addition of a new connection or deletion of an existing connection.
The mutation operator selects two nodes of a neural network randomly.
If there is no connection between them, it connects two nodes with random weights.
Otherwise, it removes the connection link and weight information.
"
(See the figure in Whitley's article for an illustration of such a structure.)
@article{Han2005Evolutionary,
  author  = {Sang-Jun Han and Sung-Bae Cho},
  title   = {Evolutionary neural networks for anomaly detection based on the behavior of a program},
  journal = {IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics},
  year    = {2005},
  volume  = {36},
  number  = {3},
  pages   = {559--570},
  month   = {June}
}
@article{whitley1995genetic,
  title     = {Genetic algorithms and neural networks},
  author    = {Whitley, D.},
  journal   = {Genetic algorithms in engineering and computer science},
  pages     = {203--216},
  year      = {1995},
  publisher = {Citeseer}
}
All in all, I need output patterns to be used as input for further output, and so on. A cascade effect is what I am after.
That sounds like a feed-forward net with multiple hidden layers. Don't be scared of the word "layer" here; with multiple hidden layers it would be just like what you have drawn there, something like a 5-5-7-6-7-6-6-5-6-5-structured net (5 inputs, 8 hidden layers with varying numbers of nodes, and 5 outputs).
You can connect the nodes to each other any way you like from one layer to the next. You can leave some unconnected by simply using a constant zero as the weight between them, or, if object-oriented programming is used, by leaving unwanted connections out of the connection phase. Skipping layers might be harder with a standard NN model, but one way could be to use a dummy node for each layer a weight needs to cross: just copying the original output*weight value from node to dummy is the same as skipping a layer, and it keeps the standard NN model intact.
If you want the net just to output some 1's and 0's, a simple step-function can be used as an activation function in each node: 1 for values more than 0.5, 0 otherwise.
I'm not sure if this is what you want, but this way you should be able to build the net you described. However, I have no idea how you are planning to teach your net to produce semantic domains. Why not just let the net learn its own weights? This can be achieved with simple input-output examples and a backpropagation algorithm. If you use a standard model to build your net, the mathematics of the learning won't be any different from any other feed-forward net. Last but not least, you can probably find a library that suits this task with only minor changes, or none at all, to the code.
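For instance, in PyTorch a net with that kind of layer structure is only a few lines (a sketch; the layer sizes just mirror the example above, and the final thresholding is the step function mentioned earlier):

import torch
import torch.nn as nn

# 5 inputs, 8 hidden layers of varying sizes, 5 outputs,
# mirroring the 5-5-7-6-7-6-6-5-6-5 example above.
sizes = [5, 5, 7, 6, 7, 6, 6, 5, 6, 5]
layers = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    layers += [nn.Linear(n_in, n_out), nn.Sigmoid()]
net = nn.Sequential(*layers)

x = torch.rand(1, 5)            # one input pattern
out = (net(x) > 0.5).float()    # step function: 1 for values above 0.5, else 0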
The answers involving genetic algorithms sound fine (especially the one citing Darrell Whitley's work).
Another alternative would be to simply connect nodes randomly. This is done, more or less, in recurrent neural networks.
You could also take a look at LeCun's highly successful convolutional neural networks for an example of an ANN with a lot of layers that is somewhat like what you've described here that was designed for a specific purpose.
Your network also mimics the approach described here:
http://nn.cs.utexas.edu/?fullmer:evolving
although that approach doesn't really let the network learn so much as be replaced. That aspect may be covered here:
http://www.alanturing.net/turing_archive/pages/reference%20articles/connectionism/Turing%27s%20neural%20networks.html
