Bayesian Network Key Benefit - statistics

I have some trouble understanding the benefits of Bayesian networks 100%.
Am I correct that the key benefit of the network is that one does not need to use the
chain rule of probability in order to calculate joint distributions?
So using the chain rule,

P(A_1, ..., A_n) = P(A_1 | A_2, ..., A_n) · P(A_2 | A_3, ..., A_n) · ... · P(A_n)

leads to the same result as the following (assuming the nodes are structured as a Bayesian network)?

P(A_1, ..., A_n) = P(A_1 | parents(A_1)) · P(A_2 | parents(A_2)) · ... · P(A_n | parents(A_n))
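To make this concrete, consider a hypothetical three-node chain A_1 → A_2 → A_3 (not part of the original question). The full chain rule gives
P(A_1, A_2, A_3) = P(A_1) · P(A_2 | A_1) · P(A_3 | A_1, A_2),
while the network factorization gives
P(A_1, A_2, A_3) = P(A_1) · P(A_2 | A_1) · P(A_3 | A_2).
Both describe the same joint distribution, because the structure encodes that A_3 is conditionally independent of A_1 given A_2; the second form only needs smaller tables.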

The benefit of using the Bayesian network is exactly that we can use the chain rule. This network can be thought of as representing a huge lookup table that tells you the probability of all possible joint events that the network represents. It is because some events are conditionally independent of other events that we don't need to store this huge lookup table but can distribute it to the node level of the network.
If you consider the nodes of a Bayesian network to be stored as a probability lookup table (i.e., storing the probability of observing this event, represented by the node, given the possible values for its parent nodes), this table is fairly small in comparison to the size of the network as a whole. The entire network then just consists of these small tables that are linked by the parent-child relationships. When you perform a calculation to obtain a joint probability (i.e., P(A_1 ... A_n) from above) you can efficiently iterate (using the chain rule) to calculate the probability of seeing the observation given the structure of the network.
Note that it is the structure of the network that provides this saving. In your example, the "parents(A_1)" clause is just a subset of the entire set of nodes. The structure implicitly tells us that A_1 is conditionally independent of the other nodes in the network, given the values of its parents. So we only apply the chain rule to the small set of nodes that can affect the node in question.
This small amount of computation is a modest cost compared to the huge space saving that you obtain by using this structure.
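As a minimal sketch of the lookup-table view, here is a tiny hand-written network in Python (the node names, probabilities, and structure are made up for illustration, not taken from the question):

    # Each node stores a small CPT: P(node = True | parent values),
    # instead of one table over all 2^n joint assignments.
    cpts = {
        "A1": {(): 0.3},                      # P(A1=True), no parents
        "A2": {(True,): 0.8, (False,): 0.1},  # P(A2=True | A1)
        "A3": {(True,): 0.7, (False,): 0.2},  # P(A3=True | A2)
    }
    parents = {"A1": (), "A2": ("A1",), "A3": ("A2",)}

    def joint(assignment):
        """P(A1, A2, A3) via the factorization prod_i P(A_i | parents(A_i))."""
        p = 1.0
        for node in ("A1", "A2", "A3"):       # any topological order works
            parent_vals = tuple(assignment[q] for q in parents[node])
            p_true = cpts[node][parent_vals]
            p *= p_true if assignment[node] else 1.0 - p_true
        return p

    print(joint({"A1": True, "A2": True, "A3": False}))  # 0.3 * 0.8 * 0.3 = 0.072

Three small tables replace a single 8-row joint table; the saving grows exponentially with the number of nodes.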

Related

Are the label outputs of clustering algorithms ordered in a certain way? (python, scikit-learn)

I'm using mean-shift clustering (https://scikit-learn.org/stable/modules/clustering.html#mean-shift), in which the labels of clusters are obtained from this source: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
However, it's not clear how the labels of the clusters (0, 1, ...) are generated. Apparently, label 0 is the cluster with the most elements. Is this a general rule?
How do the other algorithms work? Is the order essentially random, or do the algorithms assign label 0 to the largest cluster?
Thanks!
PS: it's easy to reorder the labels according to this rule; my question is more theoretical.
In many cases, the cluster order depends on the initialization. If you provide the initial values, then this order will be preserved.
If you do not provide such initial values, the order will usually be based on the data order. The first item is likely to belong to the first cluster, for example (excepting noise in some algorithms, such as DBSCAN).
Now quantity (cluster size) has an interesting effect: assuming that your data is randomly ordered (and not, for example, ordered by some synthetic data generation process) then the first element is more likely to belong to the "largest" cluster, so this cluster is most likely to come first even with "random" order.
Now in sklearn's mean-shift (which in my opinion contains an error in the final assignment rule) the authors decided to sort by "intensity" apparently, but I don't remember any such rule in the original papers. https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/cluster/mean_shift_.py#L222
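If you do want labels ordered by cluster size (the reordering the question's PS calls easy), here is a small sketch with NumPy, assuming you already have a labels array from any scikit-learn clusterer (the example array below is made up):

    import numpy as np

    def relabel_by_size(labels):
        """Remap labels so 0 is the largest cluster, 1 the next largest, etc.
        Noise labels (-1, as used by DBSCAN) are left untouched."""
        labels = np.asarray(labels)
        ids, counts = np.unique(labels[labels >= 0], return_counts=True)
        order = ids[np.argsort(-counts)]          # cluster ids by descending size
        mapping = {old: new for new, old in enumerate(order)}
        return np.array([mapping.get(l, l) for l in labels])

    print(relabel_by_size([2, 2, 2, 0, 0, 1, -1]))   # -> [0 0 0 1 1 2 -1]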

How to correct multiple Gaussian variables, given the stability analysis of their structure?

We have multiple Gaussian variables, which could be the locations of 2-d points. Suppose the 2-d points are measured independently.
If we connect the adjacent points, then we will get a structure (graph). Suppose we have a model to compute whether the structure is stable or not.
How can we use the information about stable structure to correct the given measurements? We can anticipate that some positions of the 2-d points may cause an unstable structure. Hence, we can prune these positions and get a better estimate.
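One simple way to read the "prune unstable positions" idea is rejection sampling: draw candidate positions from the measurement Gaussians, keep only configurations the stability model accepts, and average the survivors. A rough sketch in Python, where is_stable() is a placeholder for the (unspecified) stability model:

    import numpy as np

    rng = np.random.default_rng(0)

    def corrected_estimate(means, covs, is_stable, n_samples=10_000):
        """Monte Carlo correction: sample point configurations from the
        measurement Gaussians, discard configurations the (hypothetical)
        stability model rejects, and average the accepted ones."""
        accepted = []
        for _ in range(n_samples):
            config = np.array([rng.multivariate_normal(m, c)
                               for m, c in zip(means, covs)])
            if is_stable(config):              # placeholder stability predicate
                accepted.append(config)
        if not accepted:
            return np.asarray(means)           # fall back to raw measurements
        return np.mean(accepted, axis=0)

Whether this is statistically appropriate depends on how the stability model relates to the true measurement noise; it is only meant to illustrate the pruning idea.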

Clustering of Facebook users with k-means

I got a Facebook list of user IDs from the following page:
Stanford Facebook-Data
If you look at the facebook_combined data, you can see that it is a list of user connections (edges). So, for instance, user 0 is connected to users 1, 2, 3, and so on.
Now my task is to find clusters in the dataset.
In the first step I used Node.js to read the file and save the data in an array like this:
array=[[0,1],[0,2], ...]
In the second step I used a k-means plugin for Node.js to cluster the data:
Cluster-Plugin
But I don't know if the result is right, because now I get clusters of edges and not clusters of users.
UPDATE:
I am trying out a Markov clustering implementation for Node.js. The Markov Plugin, however, needs an adjacency matrix to build clusters. I implemented an algorithm in Java to save the matrix to a file.
Maybe you have another suggestion for how I could get clusters out of edges.
K-means assumes your input data is in an R^d vector space.
In fact, it requires the data to be this way, because it computes means as cluster centers, hence the name k-means.
So if you want to use k-means, then you need
One row per data point (not an edge list)
A fixed-dimensionality data space where the mean is a useful center (usually you should have continuous attributes; on binary data the mean does not make much sense) and where least squares is a meaningful optimization criterion (again, on binary data, least squares does not have strong theoretical support)
On your Facebook data, you could try some embedding, but I'd have doubts about the trustworthiness.
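One way to cluster the users (rather than the edges) directly from the edge list is to build an adjacency matrix and run a graph-aware method such as spectral clustering. A sketch in Python with scikit-learn; the file path, the whitespace-separated "u v" format, and n_clusters=10 are assumptions, not part of the question:

    import numpy as np
    from scipy.sparse import coo_matrix
    from sklearn.cluster import SpectralClustering

    # Build a symmetric adjacency matrix from the edge list (one "u v" pair per line).
    edges = np.loadtxt("facebook_combined.txt", dtype=int)
    n = edges.max() + 1
    rows = np.concatenate([edges[:, 0], edges[:, 1]])
    cols = np.concatenate([edges[:, 1], edges[:, 0]])
    adj = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n)).tocsr()

    # Cluster users (not edges) on the graph structure itself.
    labels = SpectralClustering(n_clusters=10, affinity="precomputed").fit_predict(adj)
    print(labels[:20])   # one cluster id per user id

This sidesteps the "clusters of edges" problem: each row and column of the matrix is a user, so the resulting labels are per user.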

Spark clustering - one RDD multiple clusters, optimal cluster size, controlling set sizes

What's the best way to achieve what follows:
(1)
My input data consists of three columns: Object, Category, Value. I need to cluster Objects based on Value, but the clusters need to be Category-specific, i.e. I need a separate clustering for every Category. It's impractical to split the file and load category-specific data individually.
Initially I thought it was simple (I was already able to cluster Objects for one specific Category) and loaded the data into a pair RDD where the key was the Category value. However, the KMeans train method accepts a single RDD, and I got stuck trying to make an RDD of the values for each key of the original RDD.
(2)
Is there a clustering method that returns the optimal number of clusters, other than starting with a low K and retraining with increasing K until the Within Set Sum of Squared Errors stabilizes?
(3)
Is there a clustering method where the sizes of the cluster sets can be controlled (the goal being to produce more balanced set sizes)?
Thank you.
Why is it impractical to split your data set?
this will not take longer than a single k-means iteration (1 pass over the data set)
it will untangle the multiple problems you have, so some subsets can converge earlier, thus speeding up the overall process.
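For question (1), here is a PySpark sketch of splitting by Category and training one model per Category (the input path, the CSV parsing, and k=3 are assumptions; the question does not say which API language is used):

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="per-category-kmeans")

    # (Category, Value) pairs; the columns are Object, Category, Value.
    pairs = (sc.textFile("input.csv")
               .map(lambda line: line.split(","))
               .map(lambda cols: (cols[1], float(cols[2]))))

    models = {}
    for cat in pairs.keys().distinct().collect():
        # Filter out one Category's values and cluster them on their own.
        values = pairs.filter(lambda kv, c=cat: kv[0] == c).map(lambda kv: [kv[1]])
        models[cat] = KMeans.train(values, 3)   # k chosen arbitrarily here

This does one filtering pass per Category; pre-splitting the file by Category, as suggested above, avoids even that.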
Note that k-means works best on multivariate data. On 1-dimensional data it is much more efficient to sort the data and then do kernel density estimation (or even simply use histograms and let the user decide intuitively). Then you can easily do all these "extras" such as ensuring a minimum cluster size, etc.
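Since Value is 1-dimensional here, a minimal sketch of the sort-plus-density approach (the bandwidth, grid size, and toy data are arbitrary choices):

    import numpy as np
    from scipy.stats import gaussian_kde

    def cluster_1d(values, grid_size=512):
        """Cluster 1-d data by splitting at local minima of a kernel density estimate."""
        values = np.sort(np.asarray(values, dtype=float))
        grid = np.linspace(values[0], values[-1], grid_size)
        density = gaussian_kde(values)(grid)
        # Split wherever the density has a local minimum.
        minima = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
        splits = grid[1:-1][minima]
        return np.searchsorted(splits, values)    # cluster id per (sorted) value

    data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(10, 1, 300)])
    print(np.bincount(cluster_1d(data)))          # two clusters of roughly 200 and 300

Minimum cluster sizes or merge rules can then be enforced directly on the sorted split points.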

Distance dependent Chinese Restaurant Process maybe

I'm new to machine learning and want to implement the distance dependent Chinese Restaurant process in MATLAB for the clustering of audio tracks.
I'm looking to use the dd-CRP on 26 features. I'm guessing the process might go like this
Read in 1st feature vector and assign it a "table"
Read in the 2nd feature vector and compare it to the 1st "table", maybe using the cosine angle (due to the high dimension) of the two vectors; if it agrees within some defined theta, join that table, else start a new one.
Read in next feature and repeat step 2 for the new feature vector for each existing table.
While this is occurring, I will be keeping track of how many tables there are.
I will be running the algorithm over, say, 16 audio tracks. The way the audio will be fed into the algorithm is that the first feature vector will be from the first frame of audio track 1, the second feature vector from the first frame of track 2, etc., as I'm trying to find out which audio tracks like to cluster together most, but I don't want to define how many centroids there are. Obviously I'll have to keep track of which audio track is at which "table".
Does this make sense?
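Written out as code, the procedure described above looks roughly like this (Python; theta and the feature matrix are placeholders, and each "table" is represented here by the vector that founded it):

    import numpy as np

    def assign_tables(features, theta=0.9):
        """Greedy assignment: each vector joins the first existing 'table' whose
        founding vector has cosine similarity >= theta, else it starts a new table."""
        tables = []                    # founding vector of each table
        labels = []
        for x in features:
            for t, rep in enumerate(tables):
                cos = np.dot(x, rep) / (np.linalg.norm(x) * np.linalg.norm(rep))
                if cos >= theta:
                    labels.append(t)
                    break
            else:                      # no table was similar enough
                tables.append(x)
                labels.append(len(tables) - 1)
        return np.array(labels)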
This is not a Chinese Restaurant Process. This is a heuristic algorithm which has some similarity to a Chinese Restaurant Process. In a CRP everything is phrased in terms of priors over the assignments of items to clusters (the tables analogy), and these are combined with a likelihood function for each cluster (which formalises the similarity function you described). Inference is then done by Gibbs Sampling, which means non-deterministically sampling which cluster each track is assigned to in turn given all the other assignments. Variational methods for non-parametrics are still in a very preliminary state.
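For contrast, here is what the CRP prior over table assignments looks like as a sketch (just the prior, with no likelihood and no Gibbs sampling; alpha is the concentration parameter):

    import numpy as np

    rng = np.random.default_rng(0)

    def crp_prior_sample(n_items, alpha=1.0):
        """Sample a partition from the CRP prior: item i joins an existing table
        with probability proportional to its size, or opens a new table with
        probability proportional to alpha."""
        assignments = [0]                      # first item sits at table 0
        for i in range(1, n_items):
            counts = np.bincount(assignments)  # current table sizes
            probs = np.append(counts, alpha) / (i + alpha)
            assignments.append(int(rng.choice(len(probs), p=probs)))
        return assignments

    print(crp_prior_sample(16))

A full (dd-)CRP model combines this prior with a per-table likelihood and resamples each assignment conditioned on all the others, as described above.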
Why do you want to use a CRP? Do you think you'll get something out of it beyond more conventional clustering methods? The bar to entry for the implementation and proper understanding of non-parametrics is pretty high, and they're often of little practical use at the moment because of the constraints on inference I mentioned.
You can use the X-means algorithm, which automatically determines the optimal number of centroids (and hence the number of clusters) based on the Bayesian Information Criterion (BIC). In short, the algorithm looks at how dense each cluster is and how far each cluster is from the others.
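scikit-learn does not ship X-means, but the underlying idea (compare different numbers of clusters by BIC) can be approximated with a Gaussian mixture, which exposes a bic() method. This is a stand-in for X-means, not the algorithm itself, and the toy data is made up:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def pick_k_by_bic(X, k_max=10):
        """Fit mixtures with 1..k_max components and return the k with the lowest BIC."""
        bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
                for k in range(1, k_max + 1)]
        return int(np.argmin(bics)) + 1

    X = np.vstack([np.random.normal(0, 1, (100, 2)), np.random.normal(6, 1, (100, 2))])
    print(pick_k_by_bic(X))    # likely 2 for this toy data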
