Why a CNN learns different feature maps

I understand (and please correct me if my understanding is wrong) that the primary purpose of a CNN is to reduce the number of parameters compared with what a fully connected NN would need, and that a CNN achieves this by extracting "features" of images.
A CNN can do this because in a natural image there are small features, such as lines and elementary curves, that occur in an "invariant" fashion and make up the image much like elementary building blocks.
My question is: when we create a layer of feature maps, say 5 of them, by sliding a window of size, say, 5x5 over an image of, say, 100x100 pixels, the filters that produce these feature maps are initialized as random weight matrices and must progressively adjust their weights with gradient descent, right? But then, if we are getting these feature maps by using exactly the same sized windows, sliding in exactly the same way (sharing the same starting point and the same stride value) over exactly the same image, how can these maps learn different features of the image? Won't they all come out the same, say, a line or a curve?
Is it due to the different initial values of the weight matrices? (I.e. some weight matrices are more receptive to learning a certain particular feature than others?)
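The intuition in that last question can be checked numerically. Below is a minimal sketch, not a real CNN: it assumes a toy layer where two 5x5 filters are passed through a ReLU and summed, with a squared-error loss against a random target (the target, the 12x12 image size and the helper names are made up for illustration). If both filters start identical they receive identical gradients, so gradient descent can never pull them apart; with different random starts the gradients differ.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation with stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def filter_gradients(image, kernels, target):
    """Gradients of 0.5 * sum((sum_k relu(conv(image, K_k)) - target)**2) w.r.t. each kernel."""
    pre = [conv2d_valid(image, k) for k in kernels]   # pre-activation feature maps
    output = sum(np.maximum(p, 0.0) for p in pre)     # ReLU, then sum over the maps
    err = output - target
    # Chain rule: the error is masked by the ReLU derivative, then correlated with the image.
    return [conv2d_valid(image, err * (p > 0)) for p in pre]

rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12))
target = rng.normal(size=(8, 8))  # hypothetical training signal

# Case 1: both filters start from the same weights.
same = np.full((5, 5), 0.1)
g_same_a, g_same_b = filter_gradients(image, [same, same.copy()], target)

# Case 2: the filters start from different random weights.
rand_a = rng.normal(scale=0.1, size=(5, 5))
rand_b = rng.normal(scale=0.1, size=(5, 5))
g_rand_a, g_rand_b = filter_gradients(image, [rand_a, rand_b], target)

identical_when_same_init = np.allclose(g_same_a, g_same_b)
differ_when_random_init = not np.allclose(g_rand_a, g_rand_b)
print(identical_when_same_init, differ_when_random_init)
```

So in this toy setting it is the different initial values that let the filters receive different updates and drift toward different features.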

Why do object detection methods have an output value for every class?

Most recent object detection methods rely on a convolutional neural network. They create a feature map by running input data through a feature extraction step. They then add more convolutional layers to output a set of values like so (this set is from YOLO, but other architectures like SSD differ slightly):
pobj: probability of being an object
c1, c2 ... cn: indicating which class the object belongs to
x, y, w, h: bounding box of the object
However, one particular box cannot be multiple objects. As in, wouldn't having a high value for, say, c1 mean that the values for all the others c2 ... cn would be low? So why use different values for c1, c2 ... cn? Couldn't they all be represented by a single value, say 0-1, where each object has a certain range within the 0-1, say 0-0.2 is c1, 0.2-0.4 is c2 and so on...
This would reduce the dimension of the output from NxNx(5+C) (5 for the object probability and bounding box, plus C, one for each class) to NxNx(5+1) (5 as before, plus 1 for the class).
Short answer: NO! That is almost certainly not an acceptable solution. It sounds like your core question is: why is a single value in the range [0,1] not a sufficient, compact output for object classification? As a clarification, I'd say this doesn't really have to do with single-shot detectors; the outputs from 2-stage detectors and almost all classification networks follow this same per-class output vector structure. As a secondary clarification, I'd say that many 1-stage networks also don't output pobj in their original implementations (YOLO is the main one that does, but RetinaNet and, I believe, SSD do not).
An object's class is a categorical attribute. Assumed within a standard classification problem is that the set of possible classes is flat (i.e. no class is a subclass of any other), mutually exclusive (each example falls into only a single class), and unrelated (not quite the right term here but essentially no class is any more or less related to any other class).
This assumed attribute structure is well represented by an orthonormal encoding vector of the same length as the set of possible attributes. A vector [1,0,0,0] is no more similar to [0,1,0,0] than it is to [0,0,0,1] in this space.
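To make the "no more similar" point concrete, here is a small sketch comparing pairwise distances under a one-hot encoding with those under a single-scalar encoding. The scalar layout (each class placed at the centre of its own sub-range of [0,1]) is just one assumed way to realise the scheme from the question.

```python
import numpy as np

num_classes = 4

# One-hot encoding: the rows of the identity matrix.
one_hot = np.eye(num_classes)
one_hot_dist = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=-1)

# Assumed scalar encoding: class k sits at the centre of the k-th bin of [0, 1].
scalar = (np.arange(num_classes) + 0.5) / num_classes
scalar_dist = np.abs(scalar[:, None] - scalar[None, :])

off_diagonal = ~np.eye(num_classes, dtype=bool)
print(one_hot_dist[off_diagonal])  # every pair of distinct classes is equally far apart
print(scalar_dist[0])              # class 0 is "closer" to class 1 than to class 3
```

With one-hot vectors every pair of distinct classes is the same distance apart, whereas the scalar layout forces an ordering and a notion of "nearby" classes that the labels never had.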
(As an aside, a separate branch of ML problems called multilabel classification removes the mutual exclusivity constraint, so [0,1,1,0] and [0,1,1,1] would both be valid label predictions. In this space, class or label combinations COULD be construed as more or less related, since they share constituent labels or "basis vectors" in the orthonormal categorical attribute space. But enough digression.)
A single, continuous variable output for class destroys the assumption that all classes are unrelated. In fact, it assumes that the relation between any two classes is exact and quantifiable! What an assumption! Consider attempting to arrange the classes of, let's say, the ImageNet classification task, along a single dimension. Bus and car should be close, no? Let's say 0.1 and 0.2, respectively, in our 1D embedding range of [0,1]. Zebra must be far away from them, maybe 0.8. But should it be close to zebrafish (0.82)? Is a striped shirt closer to a zebra or a bus? Is the moon more similar to a bicycle or a trumpet? And is a zebra really 5 times more similar to a zebrafish than a bus is to a car? The exercise is immediately, patently absurd. A 1D embedding space for object class is not sufficiently rich to capture the differences between object classes.
Why can't we just place object classes randomly in the continuous range [0,1]? In a theoretical sense nothing is stopping you, but the loss surface of the network would become horrendously, unmanageably non-convex, and conventional approaches to training the network would fail. Not to mention that the network would have to encode extremely non-linear functions to predict the extremely hard boundaries between neighboring classes in the 1D space, resulting in a very brittle and non-generalizable model.
From here, the nuanced reader might suggest that in fact, some classes ARE related to one another (i.e. the unrelated assumption of the standard classification problem is not really correct). Bus and car are certainly more related than bus and trumpet, no? Without devolving into a critique on the limited usefulness of strict ontological categorization of the world, I'll simply suggest that in many cases there is an information embedding that strikes a middle ground. A vast field of work has been devoted to finding embedding spaces that are compact (relative to the exhaustive enumeration of "everything is its own class of 1") but still meaningful. This is the work of principal component analysis and object appearance embedding in deep learning.
Depending on the particular problem, you may be able to take advantage of a more nuanced embedding space better suited towards the final task you hope to accomplish. But in general, canonical deep learning tasks such as classification / detection ignore this nuance in the hopes of designing solutions that are "pretty good" generalized over a large range of problem spaces.
For the object classification head, a cross-entropy loss function is usually used. It operates on probability distributions, computing the difference between the ground truth (a one-hot encoded vector) and the predicted class scores.
On the other hand, you are proposing a different way of encoding the ground-truth class labels, which could be used with a custom loss function, say an L1/L2 loss. This looks theoretically valid, but it might not be as good as cross-entropy in terms of model convergence/optimization.
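For reference, a minimal sketch of the cross-entropy computation mentioned above: softmax over the class scores, then the negative log-probability assigned to the one-hot ground truth. The logits are made-up example values.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def cross_entropy(logits, one_hot_target):
    """Cross-entropy between softmax(logits) and a one-hot target."""
    probs = softmax(logits)
    return -np.sum(one_hot_target * np.log(probs + 1e-12), axis=-1)

logits = np.array([2.0, 0.5, -1.0])  # hypothetical class scores c1, c2, c3
loss_if_c1 = cross_entropy(logits, np.array([1.0, 0.0, 0.0]))
loss_if_c3 = cross_entropy(logits, np.array([0.0, 0.0, 1.0]))
print(loss_if_c1, loss_if_c3)  # the loss is smaller when the true class has the highest score
```

Each class score gets its own gradient signal here, which is part of why this formulation tends to optimise more smoothly than regressing a single scalar.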

How can I build a good approximation of an unknown distribution when only having samples from it in order to draw from it in torch?

Say I just have random samples from the distribution and no other data, e.g. a list of numbers [1, 15, 30, 4, etc.]. What's the best way to estimate the distribution so I can draw more samples from it in PyTorch?
I am currently assuming that all samples come from a Normal distribution and just using the mean and std of the samples to build it and draw from it. The samples, however, could come from any distribution.
import torch
from torch.distributions import Normal
samples = torch.tensor([1., 2., 3., 4., 3., 2., 2., 1.])
Normal(samples.mean(), samples.std()).sample()  # one draw from the fitted normal
If you have enough samples (and preferably the sample dimension is higher than 1), you could model the distribution using a Variational Autoencoder or a Generative Adversarial Network (though I would stick with the first approach as it's simpler).
Basically, after a correct implementation and training, you would get a deterministic decoder able to decode a hidden code you pass to it (say, a vector of size 10 drawn from a normal distribution) into a value from your target distribution.
Note that it might not be reliable at all, though, and it would be even harder if your samples are only 1D.
The best way depends on what you want to achieve. If you don't know the underlying distribution, you need to make assumptions about it and then fit a suitable distribution (that you know how to sample) to your samples. You may start with something simple like a Mixture of Gaussians (several normal distributions with different weightings).
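As a rough sketch of the Mixture of Gaussians idea in PyTorch: the code below fits a 1D mixture by maximising the log-likelihood with Adam, using `torch.distributions.MixtureSameFamily`. The function name, the quantile-based initialisation, the number of components, the optimiser settings and the small floor on the component widths are all assumptions for illustration; EM (e.g. scikit-learn's `GaussianMixture`) would be the more standard way to fit the same model.

```python
import torch
from torch.distributions import Categorical, MixtureSameFamily, Normal

def fit_gaussian_mixture(samples, num_components=2, steps=500, lr=0.05):
    """Fit a 1D Gaussian mixture to `samples` by gradient ascent on the log-likelihood."""
    # Assumed initialisation: means spread over sample quantiles, one shared initial width.
    quantiles = torch.linspace(0.1, 0.9, num_components, dtype=samples.dtype)
    means = torch.quantile(samples, quantiles).detach().clone().requires_grad_(True)
    log_stds = torch.full((num_components,), samples.std().log().item(),
                          dtype=samples.dtype).requires_grad_(True)
    logits = torch.zeros(num_components, dtype=samples.dtype, requires_grad=True)
    # Assumed floor on the widths so that no component collapses onto a single point.
    min_std = 0.05 * samples.std().item()

    optimizer = torch.optim.Adam([means, log_stds, logits], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        mixture = MixtureSameFamily(Categorical(logits=logits),
                                    Normal(means, log_stds.exp() + min_std))
        loss = -mixture.log_prob(samples).mean()  # negative mean log-likelihood
        loss.backward()
        optimizer.step()

    # Return a mixture built from the fitted (detached) parameters.
    return MixtureSameFamily(Categorical(logits=logits.detach()),
                             Normal(means.detach(), log_stds.detach().exp() + min_std))

# Synthetic bimodal data standing in for the unknown distribution.
torch.manual_seed(0)
data = torch.cat([torch.randn(200) * 0.5 - 2.0, torch.randn(300) + 3.0])

mixture = fit_gaussian_mixture(data)
new_samples = mixture.sample((1000,))  # draw fresh samples from the fitted mixture
```

The number of components is itself an assumption about the data; with too few the fit is coarse, with too many it can overfit a small sample.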
Another way is to define a discrete distribution over the values you have. You will give each value the same probability, say p(x)=1/N. When you sample from it, you simply draw a random integer from [0,N) that points to one of your samples.
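The discrete version is only a few lines in PyTorch. This is a minimal sketch; the function name is made up, and the example values are the ones from the question.

```python
import torch

def sample_empirical(samples, num_draws):
    """Draw with replacement from the observed values, each with probability 1/N."""
    indices = torch.randint(low=0, high=samples.shape[0], size=(num_draws,))
    return samples[indices]

samples = torch.tensor([1.0, 15.0, 30.0, 4.0])
draws = sample_empirical(samples, 1000)
```

Note that this can only ever return values that were already observed, so it is best seen as resampling rather than as an estimate of a continuous density.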

Gaussian Mixture Models for pixel clustering

I have a small set of aerial images where the different terrains visible in each image have been labelled by human experts. For example, an image may contain vegetation, river, rocky mountains, farmland etc. Each image may have one or more of these labelled regions. Using this small labelled dataset, I would like to fit a Gaussian mixture model for each of the known terrain types. After this is complete, I would have N GMMs, one for each of the N terrain types that I might encounter in an image.
Now, given a new image, I would like to determine, for each pixel, which terrain it belongs to by assigning the pixel to the most probable GMM.
Is this the correct line of thought? And if yes, how can I go about clustering an image using GMMs?
It's not clustering if you use labeled training data!
You can, however, easily use the labeling function of GMM clustering.
For this, compute the prior probability, mean and covariance matrix of each class from the training data, and invert the covariance matrices. Then classify each pixel of the new image by the maximum probability density (weighted by the prior probabilities) under these multivariate Gaussians.
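A minimal NumPy sketch of that recipe follows, assuming one Gaussian per terrain class and per-pixel feature vectors such as RGB values. The function names, the synthetic data and the small ridge added to the covariance are illustrative assumptions. A full per-class GMM would sum several such weighted densities per class.

```python
import numpy as np

def fit_class_gaussians(pixels, labels):
    """Per class: prior, mean, inverse covariance and log-determinant of the covariance."""
    models = {}
    for c in np.unique(labels):
        x = pixels[labels == c]
        # Small ridge (an assumption) keeps the covariance invertible.
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
        _, logdet = np.linalg.slogdet(cov)
        models[c] = (len(x) / len(pixels), x.mean(axis=0), np.linalg.inv(cov), logdet)
    return models

def classify_pixels(pixels, models):
    """Assign each pixel to the class with the highest prior-weighted Gaussian density."""
    classes = list(models)
    scores = np.empty((len(pixels), len(classes)))
    for k, c in enumerate(classes):
        prior, mean, cov_inv, logdet = models[c]
        d = pixels - mean
        mahalanobis = np.einsum('ni,ij,nj->n', d, cov_inv, d)
        # log(prior * density), dropping the constant that is shared by all classes.
        scores[:, k] = np.log(prior) - 0.5 * (logdet + mahalanobis)
    return np.array(classes)[np.argmax(scores, axis=1)]

# Synthetic labelled pixels: class 0 is "vegetation-like", class 1 is "rock-like".
rng = np.random.default_rng(0)
train_pixels = np.vstack([rng.normal([40, 120, 40], 10, size=(200, 3)),
                          rng.normal([150, 140, 120], 10, size=(100, 3))])
train_labels = np.repeat([0, 1], [200, 100])
models = fit_class_gaussians(train_pixels, train_labels)

# A new 4x5 "image" made of vegetation-like pixels, classified pixel by pixel.
new_image = rng.normal([40, 120, 40], 10, size=(4, 5, 3))
label_map = classify_pixels(new_image.reshape(-1, 3), models).reshape(4, 5)
```

Working in log space avoids underflow from multiplying small densities, and does not change which class wins.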
Intuitively, your thought process is correct. If you already have the labels that makes this a lot easier.
For example, let's pick a very well known, non-parametric algorithm like k-Nearest Neighbors https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
In this algorithm, for each new "pixel" you would find the k labelled pixels closest to the one you are currently evaluating, where closeness is determined by some distance function (usually Euclidean). You would then assign the new pixel the classification label that occurs most frequently among those k neighbors.
I am not sure if you are looking for a specific algorithm recommendation, but KNN would be a very good algorithm to begin testing this type of exercise on. I saw you tagged sklearn; scikit-learn has a very good KNN implementation that I suggest you read up on.
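To show what the algorithm does under the hood, here is a plain NumPy sketch of the nearest-neighbor vote. The function name and synthetic data are made up, and labels are assumed to be non-negative integers. In practice scikit-learn's `KNeighborsClassifier` would be the more convenient choice.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=5):
    """Label each query point by majority vote among its k nearest training points."""
    # Squared Euclidean distances, shape (num_queries, num_train).
    sq_dist = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    neighbor_labels = train_y[nearest]
    # Majority vote per query (labels are assumed to be non-negative integers).
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])

rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal([40, 120, 40], 10, size=(50, 3)),
                     rng.normal([150, 140, 120], 10, size=(50, 3))])
train_y = np.repeat([0, 1], 50)

query_x = np.array([[45.0, 115.0, 38.0], [148.0, 142.0, 125.0]])
predicted = knn_predict(train_x, train_y, query_x, k=5)
```

The brute-force distance matrix is fine for a small labelled set, but for full-size images a tree-based or approximate neighbor search would be needed.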

Plotting Hidden Weights

I've had an interest in neural networks for a while now and have just started following the deep learning tutorials. I have what I hope is a relatively straightforward question that I am hoping someone may answer.
In the multilayer perceptron tutorial, I am interested in seeing the state of the network at different layers (something similar to what is seen in this paper: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/247 ). For instance, I am able to write out the weights of the hidden layer using:
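import cPickle  # Python 2, as in the tutorial
W_open = open('mlp_w_pickle.pkl', 'wb')  # binary mode, since protocol -1 writes a binary pickle
cPickle.dump(classifier.hiddenLayer.W.get_value(borrow=True), W_open, -1)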
When I plot this using the utils.py tile plotting, I get the following pretty plot [edit: pretty plot removed as I don't have enough rep].
If I wanted to plot the weights at the logRegressionLayer, such that
cPickle.dump(classifier.logRegressionLayer.W.get_value(borrow=True), W_open, -1)
what would I actually have to do? The above doesn't seem to work: it returns a 2D array of shape (500,10). I understand that the 500 relates to the number of hidden units. The paragraph on the Miscellaneous page:
Plotting the weights is a bit more tricky. We have n_hidden hidden
units, each of them corresponding to a column of the weight matrix. A
column has the same shape as the visible, where the weight
corresponding to the connection with visible unit j is at position j.
Therefore, if we reshape every such column, using numpy.reshape, we
get a filter image that tells us how this hidden unit is influenced by
the input image.
confuses me a little. I am unsure exactly how I would string it together.
You could plot them just like the weights in the first layer, but they will not necessarily make much sense.
Consider the weights in the first layer of a neural network. If the inputs have size 784 (e.g. MNIST images) and there are 2000 hidden units in the first layer then the first layer weights are a matrix of size 784x2000 (or maybe the transpose depending on how it's implemented). Those weights can be plotted as either 784 patches of size 2000 or, more usually, 2000 patches of size 784. In this latter case each patch can be plotted as a 28x28 image which directly ties back to the original inputs and thus is interpretable.
For your higher-level regression layer, you could plot 10 tiles, each of size 500 (e.g. patches of size 22x23, with some padding to make them rectangular), or 500 patches of size 10. Either might illustrate some patterns that are being found, but it may be difficult to tie those patterns back to the original inputs.
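The reshaping itself could be sketched as below. The helper name is made up, and random matrices stand in for the real weights, using the shapes from the question (784 inputs, 500 hidden units, 10 outputs). Each column becomes one tile, zero-padded when its length does not fill the tile exactly.

```python
import numpy as np

def columns_to_tiles(weights, tile_shape):
    """Reshape each column of `weights` into a 2D tile, zero-padding the tail if needed."""
    n_inputs, n_units = weights.shape
    tile_size = tile_shape[0] * tile_shape[1]
    if tile_size < n_inputs:
        raise ValueError("tile_shape is too small to hold one column")
    padded = np.zeros((tile_size, n_units), dtype=weights.dtype)
    padded[:n_inputs] = weights
    # Transpose so that the first axis indexes units, then fold each column into a tile.
    return padded.T.reshape(n_units, *tile_shape)

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(784, 500))  # stand-in for classifier.hiddenLayer.W
W_logreg = rng.normal(size=(500, 10))   # stand-in for classifier.logRegressionLayer.W

hidden_tiles = columns_to_tiles(W_hidden, (28, 28))  # 500 tiles, one per hidden unit
logreg_tiles = columns_to_tiles(W_logreg, (22, 23))  # 10 tiles, 6 padded zeros each
```

Each tile can then be passed to any image-plotting routine. Only the 28x28 tiles map back onto input pixels, and the 22x23 layout of the output-layer tiles is arbitrary.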

Is the Canny algorithm enough to create a feature descriptor image to give to an SVM?

I retrieve contours from images using the Canny algorithm. Is that enough to build a descriptor image, feed it to an SVM, and find similarities? Or do I necessarily need other features like elongation, perimeter and area?
I ask because, inspired by this example: http://scikit-learn.org/dev/auto_examples/plot_digits_classification.html I gave my images first in greyscale and then as Canny edge maps, and in both cases my confusion matrix was full of zeros, as were the precision, recall, f1-score and support measures.
My advice is:
unless you have a low number of images in your database and/or the recognition is going to be really specific (not a random thing, for example), I would highly recommend applying one or more feature extractors such as SIFT, Fourier Descriptors, Haralick's Features or the Hough Transform to extract more details, which can be summarised in a short vector.
Then you could apply the SVM to those vectors in order to get more accuracy.
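As one illustration of "summarised in a short vector", here is a minimal NumPy sketch of Fourier descriptors. It assumes the contour has already been extracted (e.g. traced from the Canny edge map) as an ordered, roughly uniformly sampled array of (x, y) points. The function name and the normalisation choices are one common variant, not the only one.

```python
import numpy as np

def fourier_descriptor(contour_xy, num_coeffs=8):
    """Short shape descriptor from an ordered closed contour of shape (N, 2)."""
    if len(contour_xy) < 2 * num_coeffs + 1:
        raise ValueError("contour has too few points for the requested descriptor")
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]  # points as complex numbers
    coeffs = np.fft.fft(z)
    # Dropping coeffs[0] removes translation, taking magnitudes removes rotation and
    # the choice of starting point, and dividing by the largest magnitude removes scale.
    low_freq = np.concatenate([coeffs[1:num_coeffs + 1], coeffs[-num_coeffs:]])
    magnitudes = np.abs(low_freq)
    return magnitudes / magnitudes.max()

# An ellipse, and the same ellipse rotated, scaled and shifted.
theta = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
ellipse = np.stack([3.0 * np.cos(theta), np.sin(theta)], axis=1)
c, s = np.cos(0.7), np.sin(0.7)
rotation = np.array([[c, -s], [s, c]])
moved = 2.5 * ellipse @ rotation.T + np.array([10.0, -4.0])

desc_original = fourier_descriptor(ellipse)
desc_moved = fourier_descriptor(moved)
```

Both descriptors come out the same, which is the property that makes such vectors more useful SVM inputs than raw edge images. They could then be stacked into a feature matrix for a classifier such as scikit-learn's `SVC`.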
