Patch Merging and Pooling layer difference - conv-neural-network

I want to ask what the difference is between Patch Merging in the Swin Transformer and a pooling layer (e.g. max pooling) in CNNs. Why do they use Patch Merging instead of a pooling layer?
My understanding is that Patch Merging halves the spatial dimensions and increases the channel dimension, so there is no information loss when using Patch Merging, while a pooling layer causes a loss of information from the input feature.
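Concretely, my understanding of the two operations is something like the following minimal PyTorch sketch (the class and variable names are my own, not the official Swin code):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighbouring patches along the channel
    axis (C -> 4C), then project down to 2C with a learned linear layer.
    Nothing is discarded before the projection."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]     # top-left patch of each 2x2 block
        x1 = x[:, 1::2, 0::2, :]     # bottom-left
        x2 = x[:, 0::2, 1::2, :]     # top-right
        x3 = x[:, 1::2, 1::2, :]     # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

# Max pooling, by contrast, keeps only 1 of every 4 values per channel:
pool = nn.MaxPool2d(kernel_size=2)   # (B, C, H, W) -> (B, C, H/2, W/2)
```

Both halve the spatial resolution, but max pooling throws away three out of every four activations, whereas Patch Merging rearranges all of them into channels and lets a learned projection decide what to keep.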

Related

Do pooling layers skew images?

Let's say we are just testing a simple single convolution layer with any kind of pooling. Once we apply that operation, does it skew the image itself?

Where is the suitable location to use an Upsampling layer in a CNN?

I work on a multitask learning convolutional neural network. It performs semantic segmentation and pixel-wise depth estimation. After the decoder, the feature map resolution is (240, 240), which I want to upsample to the input resolution of (480, 480). Can anyone help me understand which of the following locations for the upsampling layer could result in better performance? Does the location of the upsampling layer have any significant impact on the result? If yes, could you please elaborate?
Apply upsampling before the final output layer and use a strided or padded convolution layer to maintain the same resolution as the input image (480, 480)?
Apply upsampling after the final output layer.
In the network I have trained, I used an upsampling layer before the final prediction and then padding in the convolution layer to maintain the desired resolution. I read that upsampling before the final prediction enables the network to learn more spatial information and make finer predictions, since it deals with higher-resolution feature maps, but it increases the computational burden.
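Concretely, the first arrangement I mean is roughly the following PyTorch sketch (the layer sizes and names are only illustrative, not my actual network):

```python
import torch
import torch.nn as nn

class DecoderHead(nn.Module):
    """Upsample the (B, 64, 240, 240) decoder features to (480, 480) first,
    then predict with padded 3x3 convolutions so the outputs keep the
    input resolution."""
    def __init__(self, in_ch=64, num_classes=19):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear',
                                    align_corners=False)
        # padding=1 with a 3x3 kernel keeps H and W unchanged
        self.seg_head = nn.Conv2d(in_ch, num_classes, kernel_size=3, padding=1)
        self.depth_head = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)

    def forward(self, feats):                 # feats: (B, 64, 240, 240)
        x = self.upsample(feats)              # (B, 64, 480, 480)
        return self.seg_head(x), self.depth_head(x)
```

The second option would instead predict at (240, 240) and upsample the segmentation logits and depth map afterwards, which is cheaper but means the final layers never see full-resolution features.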

PyTorch batch normalization in distributed training

I'm wondering how distributed PyTorch handles batch norm. When I add a batch norm layer, will the PyTorch engine use the same allreduce call to sync the data across nodes, or does batch norm only happen on the local node?
Similarly to DataParallel (check the first Warning box). It will compute the norm separately for each node (or, more precisely, each GPU). It will not sync the rolling estimates of the norm either, but it will keep the values from one of the GPUs in the end. So assuming the examples are distributed across your cluster randomly, your BatchNorm will work roughly as expected, except its estimates of the normalization factors will have higher variance due to smaller effective sample sizes.
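If you do want the batch statistics synchronized across processes, one option is torch.nn.SyncBatchNorm; a minimal sketch (assuming the default process group is already initialized and each process drives one GPU) might look like this:

```python
import torch
import torch.nn as nn

# Plain BatchNorm2d inside DDP: each process normalizes with the
# statistics of its own local mini-batch (the behaviour described above).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# To all-reduce the batch statistics across processes instead, convert the
# BatchNorm layers to SyncBatchNorm before wrapping the model in DDP.
# (Assumes torch.distributed.init_process_group() has already been called.)
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = nn.parallel.DistributedDataParallel(model.cuda())
```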

Pruning in Keras

I'm trying to design a neural network using Keras with priority on prediction performance, and I cannot get sufficiently high accuracy by further reducing the number of layers and nodes per layer. I have noticed that a very large portion of my weights are effectively zero (>95%). Is there a way to prune dense layers in the hope of reducing prediction time?
Not a dedicated way :(
There's currently no easy (dedicated) way of doing this with Keras.
A discussion is ongoing at https://groups.google.com/forum/#!topic/keras-users/oEecCWayJrM.
You may also be interested in this paper: https://arxiv.org/pdf/1608.04493v1.pdf.
Take a look at Keras Surgeon:
https://github.com/BenWhetton/keras-surgeon
I have not tried it myself, but the documentation claims that it has functions to remove or insert nodes.
Also, after looking at some papers on pruning, it seems that many researchers create a new model with fewer channels (or fewer layers), and then copy the weights from the original model to the new model.
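A rough sketch of that prune-and-copy idea in tf.keras (the model, unit counts, and the L1-norm ranking here are only illustrative assumptions) could look like:

```python
import numpy as np
import tensorflow as tf

# Hypothetical original model with 256 hidden units.
big = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(100,), name='hidden'),
    tf.keras.layers.Dense(10, activation='softmax', name='out'),
])
# ... train `big`, then rank hidden units, e.g. by the L1 norm of their outgoing weights.
W1, b1 = big.get_layer('hidden').get_weights()    # (100, 256), (256,)
W2, b2 = big.get_layer('out').get_weights()       # (256, 10), (10,)
keep = np.argsort(-np.abs(W2).sum(axis=1))[:64]   # indices of the 64 strongest units

# New, narrower model; copy over only the kept units' weights.
small = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(100,), name='hidden'),
    tf.keras.layers.Dense(10, activation='softmax', name='out'),
])
small.get_layer('hidden').set_weights([W1[:, keep], b1[keep]])
small.get_layer('out').set_weights([W2[keep, :], b2])
```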
See the dedicated tooling for tf.keras: https://www.tensorflow.org/model_optimization/guide/pruning
As the overview suggests, support for latency improvements is a work in progress.
Edit: Keras -> tf.keras based on LucG's suggestion.
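For reference, a minimal sketch of how that tooling is typically used (magnitude pruning via the tensorflow_model_optimization package; the toy model and schedule values are illustrative assumptions):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Hypothetical dense model standing in for the asker's network.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(100,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Wrap the model so low-magnitude weights are progressively zeroed out
# during training, ramping up to 90% sparsity.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.9,
    begin_step=0, end_step=1000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# pruned.fit(x_train, y_train, epochs=5,
#            callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers once training is done; the zeros remain,
# but latency gains still depend on a sparsity-aware runtime.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```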
If you set an individual weight to zero, won't that prevent it from being updated during backpropagation? Shouldn't that weight remain zero from one epoch to the next? That's why you set the initial weights to nonzero values before training. If you want to "remove" an entire node, just set all of the weights on that node's output to zero, and that will prevent that node from having any effect on the output throughout training.

Temporal convolution for NLP

I'm trying to follow Kalchbrenner et al. 2014 (http://nal.co/papers/Kalchbrenner_DCNN_ACL14) (and basically most of the papers from the last two years that applied CNNs to NLP tasks) and implement the CNN model they describe. Unfortunately, although the forward pass seems right, I have a problem with the gradients.
What I'm doing is a full convolution of the input with W per row, per kernel, per input in the forward pass (not rotated, so it's actually a correlation).
Then, for the gradients wrt W, a valid convolution of the inputs with the previous delta per row, per kernel, per input (again, not rotated).
And finally, for the gradients wrt x, another valid convolution of the previous delta with W, again per row, per kernel, per input (no rotation).
This returns the correct size and dimensionality, but the gradient checking is really off when connecting layers. When testing a single conv layer the results are correct; when connecting two conv layers, also correct; but when adding MLP, pooling, etc. it starts looking bad. All other types of layers were also tested separately and are correct, so I'd assume the problem starts with the calculation of the gradient wrt W_conv.
Does anyone have an idea or a useful link to a similar implementation?
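A minimal version of such a finite-difference gradient check, for a toy 1-D "full" convolution-style layer (names and shapes here are illustrative assumptions, not the DCNN from the paper), would be roughly:

```python
import numpy as np

def conv_full(x, w):
    """Toy 'full' 1-D forward pass: y[i+j] += x[i] * w[j]."""
    y = np.zeros(len(x) + len(w) - 1)
    for i in range(len(x)):
        for j in range(len(w)):
            y[i + j] += x[i] * w[j]
    return y

def analytic_grads(x, w, delta):
    """Gradients of the loss wrt x and w, given delta = dL/dy."""
    dx = np.array([np.dot(delta[i:i + len(w)], w) for i in range(len(x))])
    dw = np.array([np.dot(delta[j:j + len(x)], x) for j in range(len(w))])
    return dx, dw

def numeric_grad(f, z, eps=1e-6):
    """Central-difference gradient of scalar function f at z."""
    g = np.zeros_like(z)
    for k in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[k] += eps; zm[k] -= eps
        g[k] = (f(zp) - f(zm)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x, w = rng.standard_normal(7), rng.standard_normal(3)
delta = rng.standard_normal(len(x) + len(w) - 1)          # pretend upstream gradient
loss = lambda x_, w_: np.dot(conv_full(x_, w_), delta)    # L = <y, delta>, so dL/dy = delta

dx, dw = analytic_grads(x, w, delta)
print(np.allclose(dx, numeric_grad(lambda z: loss(z, w), x)))   # expect True
print(np.allclose(dw, numeric_grad(lambda z: loss(x, z), w)))   # expect True
```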
