Why does DETR need an empty class?
It defines a "background" class, which means "no object". Why?
TL;DR
By default, DETR always predicts 100 bounding boxes. The empty class is used as a condition to filter out meaningless bounding boxes.
Full explanation
If you look at the source code, the transformer decoder transforms each query from self.query_embed.weight into the output hs:
hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
Then a linear layer self.class_embed maps hs into the object classes outputs_class, and another head, self.bbox_embed, maps the same hs into the bounding boxes outputs_coord:
outputs_class = self.class_embed(hs)
outputs_coord = self.bbox_embed(hs).sigmoid()
out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
The number of bounding boxes is set to num_queries (by default 100).
detr = DETR(backbone_with_pos_enc, transformer, num_classes=num_classes, num_queries=100)
As you can see, without the empty class DETR would always predict 100 bounding boxes (it always tries to put a box around something 100 times), even when there is only one object in the image.
Now let us consider the example below. There are only two meaningful objects (two birds), but DETR still predicts 100 bounding boxes. Thankfully, the 98 boxes assigned to the "empty class" are discarded (the green box and the blue box below, plus the remaining 96 boxes not shown in the picture). Only the red box and the yellow box, whose output class is "bird", are meaningful and hence kept as the prediction.
That is how DETR makes dynamic object predictions: it can predict any number of objects up to num_queries, but never more. If you want DETR to predict more than 100 objects, say 500, then set num_queries to 500 or above.
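As a minimal post-processing sketch (not DETR's exact PostProcess module; the dummy tensors and the 0.7 threshold are illustrative, but the convention that the last logit is the empty/no-object class matches the discussion above):

import torch

num_queries, num_classes = 100, 91                                 # COCO-style setup, illustrative
out = {
    'pred_logits': torch.randn(1, num_queries, num_classes + 1),   # last index = "empty" class
    'pred_boxes':  torch.rand(1, num_queries, 4),
}

probs = out['pred_logits'].softmax(-1)[0]   # per-query class distribution
scores, labels = probs[:, :-1].max(-1)      # drop the "empty" class before taking the max
keep = scores > 0.7                         # hypothetical confidence threshold

boxes = out['pred_boxes'][0][keep]          # only the meaningful boxes survive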
I think the cross-attention at the first decoder layer will update the class embeddings of the queries based on the learned positional embeddings.
The cross-attention weights used in DETR are computed as:
(query + query_pos) · (key + key_pos)^T
Here query is the class embedding of the queries; at the first layer it is meaningless (initialized as all zeros), but query_pos is learned and represents the rough detection region of each query. After the first layer, the class embeddings are therefore updated mainly based on the similarity between query_pos and key_pos, so they focus mostly on the features around the position of each query.
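A toy sketch of that computation (the shapes and random tensors are illustrative, not DETR's actual module code):

import torch

num_queries, seq_len, d = 100, 850, 256

query     = torch.zeros(num_queries, d)   # class embeddings, all zeros at the first layer
query_pos = torch.randn(num_queries, d)   # learned query positional embeddings
key       = torch.randn(seq_len, d)       # encoder memory features
key_pos   = torch.randn(seq_len, d)       # spatial positional encodings of the memory

# with query == 0, the logits reduce to query_pos · (key + key_pos)^T
logits  = (query + query_pos) @ (key + key_pos).T
weights = logits.softmax(dim=-1)          # [num_queries, seq_len] attention weights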
Related
Just for context, I am working on a NeRF-like model for dynamic scenes that, for a given point in space, outputs how occupied that point is in every frame of the recorded scene. That means that if the recorded scene is a set of 300-frame videos (from different viewpoints), then running inference on a given XYZ coordinate returns 300 scalar values.
Now, here comes the problem. I need to get the normal vector of the surface at a given point in space. This is usually done by computing the gradient of the occupancy with respect to the coordinates.
This is simple when working with models that produce only one scalar value per XYZ coordinate (models for static scenes), but I can't come up with a way of doing it in my current scenario.
What I would usually do is something like
normal_vectors = -torch.autograd.grad(occupancies.sum(), coordinates)[0]
where occupancies is a tensor of size [batch_size, 300] and coordinates is a tensor of size [batch_size, 3]. It produces a tensor of the same size as coordinates, but I need it to produce a tensor of size [batch_size, 300, 3]: for each item in coordinates, 300 occupancy values are produced, and I want the gradient of each one of those with respect to the 3 components of the corresponding coordinate.
The only approach I think may work (untested yet) is something like
normal_vectors = []
for i in range(300):
    grad_i = torch.autograd.grad(occupancies[:, i].sum(), coordinates, retain_graph=True)[0]
    normal_vectors.append(-grad_i)
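Written out in full, what I have in mind is something like this (a sketch only; the small Sequential model is a stand-in for my actual occupancy network):

import torch

batch_size, n_frames = 8, 300
coordinates = torch.randn(batch_size, 3, requires_grad=True)

# stand-in occupancy network: [batch_size, 3] -> [batch_size, 300]
model = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, n_frames)
)
occupancies = model(coordinates)

normal_vectors = []
for i in range(n_frames):
    # retain_graph=True because the same graph is reused for every frame
    grad_i = torch.autograd.grad(occupancies[:, i].sum(), coordinates, retain_graph=True)[0]
    normal_vectors.append(-grad_i)                     # [batch_size, 3] per frame

normal_vectors = torch.stack(normal_vectors, dim=1)    # [batch_size, 300, 3]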
But this will be painfully slow.
Any alternative approach?
Most recent object detection methods rely on a convolutional neural network. They create a feature map by running input data through a feature extraction step. They then add more convolutional layers to output a set of values like so (this set is from YOLO, but other architectures like SSD differ slightly):
pobj: probability of being an object
c1, c2 ... cn: indicating which class the object belongs to
x, y, w, h: bounding box of the object
However, one particular box cannot be multiple objects. As in, wouldn't having a high value for, say, c1 mean that the values for all the others c2 ... cn would be low? So why use separate values for c1, c2 ... cn? Couldn't they all be represented by a single value in the range 0-1, where each class is assigned its own sub-range: say 0-0.2 is c1, 0.2-0.4 is c2, and so on?
This would reduce the dimension of the output from NxNx(5+C) (5 for the objectness probability and the bounding box, plus C, one for each class) to NxNx(5+1) (5 as before and 1 for the class).
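To make the comparison concrete, here is a rough sketch of the two per-cell layouts (N and C are illustrative placeholders):

import numpy as np

N, C = 7, 20                                     # grid size and number of classes, illustrative

current = np.zeros((N, N, 5 + C))                # per cell: pobj, c1..cn, x, y, w, h
cell = current[3, 4]
pobj         = cell[0]                           # probability of being an object
class_scores = cell[1:1 + C]                     # c1 ... cn
x, y, w, h   = cell[1 + C:]                      # bounding box

proposed = np.zeros((N, N, 5 + 1))               # per cell: one single class value instead of C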
Thank you
Short answer: NO! That is almost certainly not an acceptable solution. It sounds like your core question is: why is a single value in the range [0,1] not a sufficient, compact output for object classification? As a clarification, I'd say this doesn't really have to do with single-shot detectors; the outputs of two-stage detectors and nearly all classification networks follow this same structure. As a secondary clarification, I'd say that many one-stage networks also don't output pobj in their original implementations (YOLO is the main one that does; RetinaNet does not, and I believe SSD does not either).
An object's class is a categorical attribute. A standard classification problem assumes that the set of possible classes is flat (i.e. no class is a subclass of any other), mutually exclusive (each example falls into exactly one class), and unrelated (not quite the right term here, but essentially no class is any more or less related to any other class).
This assumed attribute structure is well represented by an orthonormal encoding vector of the same length as the set of possible attributes. A vector [1,0,0,0] is no more similar to [0,1,0,0] than it is to [0,0,0,1] in this space.
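A tiny numerical illustration of that equidistance (just a sketch):

import numpy as np

classes = np.eye(4)                      # one-hot vectors for 4 classes
dots  = classes @ classes.T              # identity matrix: no class is "closer" to another
dists = np.linalg.norm(classes[:, None] - classes[None, :], axis=-1)
print(dots)
print(dists)                             # sqrt(2) everywhere off the diagonal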
(As an aside, a separate branch of ML problems called multilabel classification removes the mutual-exclusivity constraint, so [0,1,1,0] and [0,1,1,1] would both be valid label predictions. In that space, class or label combinations could be construed as more or less related, since they share constituent labels, or "basis vectors", in the orthonormal categorical attribute space. But enough digression.)
A single, continuous variable output for class destroys the assumption that all classes are unrelated. In fact, it assumes that the relation between any two classes is exact and quantifiable! What an assumption! Consider attempting to arrange the classes of, say, the ImageNet classification task along a single dimension. Bus and car should be close, no? Let's say 0.1 and 0.2, respectively, in our 1D embedding range of [0,1]. Zebra must be far away from them, maybe 0.8. But should it be close to zebra fish (0.82)? Is a striped shirt closer to a zebra or a bus? Is the moon more similar to a bicycle or a trumpet? And is a zebra really 5 times more similar to a zebra fish than a bus is to a car? The exercise is immediately, patently absurd. A 1D embedding space for object class is not sufficiently rich to capture the differences between object classes.
Why can't we just place object classes randomly in the continuous range [0,1]? In a theoretical sense nothing is stopping you, but the loss surface of the network would become horrendously, unmanageably non-convex and conventional approaches to training the network would fail. Not to mention that the network would have to encode extremely non-linear activation functions to predict the extremely hard boundaries between neighboring classes in the 1D space, resulting in a very brittle and non-generalizable model.
From here, the nuanced reader might suggest that, in fact, some classes ARE related to one another (i.e. the "unrelated" assumption of the standard classification problem is not really correct). Bus and car are certainly more related than bus and trumpet, no? Without devolving into a critique of the limited usefulness of strict ontological categorization of the world, I'll simply suggest that in many cases there is an information embedding that strikes a middle ground. A vast body of work has been devoted to finding embedding spaces that are compact (relative to the exhaustive enumeration where everything is its own class) but still meaningful. This is the work of principal component analysis and of object appearance embeddings in deep learning.
Depending on the particular problem, you may be able to take advantage of a more nuanced embedding space better suited towards the final task you hope to accomplish. But in general, canonical deep learning tasks such as classification / detection ignore this nuance in the hopes of designing solutions that are "pretty good" generalized over a large range of problem spaces.
For the object classification head, a cross-entropy loss function is usually used, which operates on the probability distribution to compute the difference between the ground truth (a one-hot encoded vector) and the predicted class scores.
On the other hand, you are proposing a different way of encoding the ground-truth class labels, which would then be used with a custom loss function, say an L1/L2 loss. This looks theoretically possible, but it is unlikely to be as good as cross-entropy in terms of model convergence/optimization.
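A rough sketch of the two setups being compared (shapes and targets are purely illustrative):

import torch
import torch.nn.functional as F

logits  = torch.randn(4, 10)                     # predictions for 4 boxes over 10 classes
targets = torch.tensor([2, 7, 0, 9])             # ground-truth class indices

# standard head: C class scores trained with cross-entropy
ce_loss = F.cross_entropy(logits, targets)

# proposed head: a single scalar in [0,1] per box, trained with an L2 loss
scalar_pred   = torch.rand(4)                    # e.g. sigmoid output of a 1-unit head
scalar_target = (targets.float() + 0.5) / 10     # centre of each class's sub-range
l2_loss = F.mse_loss(scalar_pred, scalar_target)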
I assume that some of the data are unlabelled, and that their labels are set to -1.
Using label propagation in scikit-learn then assigns labels to them:
labelpropagation.fit(X_feature, y_class), with X_feature containing the features (color, HOG, GIST, SIFT).
Questions:
Is my understanding right?
Yes. In scikit-learn, unlabelled data is indicated using the label -1. This may involve:
using actual unlabelled data (in the case of semi-supervised learning)
using the test/dev split of the dataset as unlabelled data (to verify model performance)
In either case, the data is appended to the training set with its labels set to -1.
The label propagation algorithm then propagates labels from the labelled nodes to these unlabelled nodes.
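A minimal sketch of that workflow (synthetic features standing in for the color/HOG/GIST/SIFT vectors):

import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
X_feature = rng.normal(size=(100, 16))           # stand-in feature vectors
y_class   = rng.integers(0, 3, size=100)         # true classes 0, 1, 2

y_train = y_class.copy()
y_train[30:] = -1                                # mark 70 samples as unlabelled

labelpropagation = LabelPropagation()
labelpropagation.fit(X_feature, y_train)         # -1 entries receive propagated labels
print(labelpropagation.transduction_[30:40])     # labels assigned to the unlabelled samples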
I'm trying to develop a fully-convolutional neural net to estimate the 2D locations of keypoints in images that contain renders of known 3D models. I've read plenty of literature on this subject (human pose estimation, model based estimation, graph networks for occluded objects with known structure) but no method I've seen thus far allows for estimating an arbitrary number of keypoints of different classes in an image. Every method I've seen is trained to output k heatmaps for k keypoint classes, with one keypoint per heatmap. In my case, I'd like to regress k heatmaps for k keypoint classes, with an arbitrary number of (non-overlapping) points per heatmap.
In this toy example, the network would output heatmaps around each visible location of an upper vertex for each shape. The cubes have 4 vertices on top, the extruded pentagons have 2, and the pyramids just have 1. Sometimes points are offscreen or occluded, and I don't wish to output heatmaps for occluded points.
The architecture is a 6-6 layer U-Net (as in this paper: https://arxiv.org/pdf/1804.09534.pdf). The ground-truth heatmaps are Gaussians centered on each keypoint. When training the network with a batch size of 5 and an L2 loss, the network learns to never make any estimate at all, just outputting blank images. Data types are converted properly and normalized to 0-1 for the input and 0-255 for the output. I'm not sure how to solve this; are there any red flags with my general approach? I'll post code if there's no clear problem in general.
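For reference, my ground-truth heatmaps are generated roughly like this (a sketch; the image size and sigma are placeholders):

import numpy as np

def make_heatmap(keypoints, height=128, width=128, sigma=2.0):
    # Gaussian blob at each visible keypoint, all in one class channel
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for (x, y) in keypoints:                          # only visible, on-screen points
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, blob)           # max keeps non-overlapping blobs intact
    return heatmap * 255.0                            # output range 0-255, as described above

# e.g. the 4 visible top vertices of one cube, in the "cube vertex" channel
target = make_heatmap([(20, 30), (40, 30), (20, 50), (40, 50)])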
I've had an interest in neural networks for a while now and have just started following the deep learning tutorials. I have what I hope is a relatively straightforward question that I am hoping someone can answer.
In the multilayer perceptron tutorial, I am interested in seeing the state of the network at different layers (something similar to what is shown in this paper: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/247). For instance, I am able to write out the weights of the hidden layer using:
W_open = open('mlp_w_pickle.pkl', 'wb')
cPickle.dump(classifier.hiddenLayer.W.get_value(borrow=True), W_open, -1)
When I plot this using the utils.py tile-plotting function, I get a pretty plot [edit: pretty plot removed as I don't have enough rep].
If I wanted to plot the weights at the logRegressionLayer, such that
cPickle.dump(classifier.logRegressionLayer.W.get_value(borrow=True), W_open, -1)
what would I actually have to do? The above doesn't seem to work: it returns a 2D array of shape (500, 10). I understand that the 500 relates to the number of hidden units. The paragraph on the Miscellaneous page:
Plotting the weights is a bit more tricky. We have n_hidden hidden
units, each of them corresponding to a column of the weight matrix. A
column has the same shape as the visible, where the weight
corresponding to the connection with visible unit j is at position j.
Therefore, if we reshape every such column, using numpy.reshape, we
get a filter image that tells us how this hidden unit is influenced by
the input image.
confuses me a little. I am unsure exactly how I would string it together.
Thanks to all - sorry if the question is confusing!
You could plot them just like the weights in the first layer, but they will not necessarily make much sense.
Consider the weights in the first layer of a neural network. If the inputs have size 784 (e.g. MNIST images) and there are 2000 hidden units in the first layer, then the first-layer weights form a matrix of size 784x2000 (or maybe the transpose, depending on how it's implemented). Those weights can be plotted either as 784 patches of size 2000 or, more usually, as 2000 patches of size 784. In the latter case each patch can be plotted as a 28x28 image, which ties directly back to the original inputs and is therefore interpretable.
For your higher-level regression layer, you could plot 10 tiles, each of size 500 (e.g. patches of size 22x23 with some padding to make them rectangular), or 500 patches of size 10. Either might illustrate some patterns that are being found, but it may be difficult to tie those patterns back to the original inputs.
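A minimal sketch of the column-reshaping idea for the hidden layer (the random matrix stands in for the pickled 784x500 weight matrix from the tutorial; matplotlib is used just for display):

import numpy as np
import matplotlib.pyplot as plt

W = np.random.randn(784, 500)                         # stand-in for classifier.hiddenLayer.W.get_value()

fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(W[:, i].reshape(28, 28), cmap='gray')   # each column is one hidden unit's "filter"
    ax.axis('off')
plt.show()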