Comparing indices for factor analysis - statistics

I am comparing different models in Confirmatory Factor Analysis (CFA) to decide what my optimal number of factors and factor structure should be. The main indices I have been using are chi-square, the Comparative Fit Index (CFI), RMSEA, and the Incremental Fit Index (IFI). I am aware that there are cutoff values available for these indices. However, I am unsure whether there is any way of showing statistically that one of the models is better than another.

Related

Determine the optimal number of biclusters

I have recently performed K-means biclustering on a matrix of absolute correlation coefficient values. However, the biclustering algorithm requires the number of biclusters (k) to be defined as an input. Is there any good method to determine the optimal number of biclusters (k)?
I know that many people use a silhouette score to estimate the optimal number of clusters, but I have only heard of it being used with hierarchical clustering. Can the silhouette score also be applied to biclusters? Is there any other method to define an optimal number of biclusters? Could a mean squared residue score be used for this?
The biclustering algorithm generated biclusters along the diagonal such that a row or column will never belong to more than one bicluster.
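For reference, the mean squared residue mentioned above can be computed per bicluster roughly as follows. This is only a minimal sketch: it assumes each bicluster is available as a dense NumPy submatrix, and the function name `mean_squared_residue` is made up for illustration. It shows how the score is computed, not whether it is a good criterion for choosing k.

```python
import numpy as np

def mean_squared_residue(sub):
    """Mean squared residue of one bicluster given as a 2-D array."""
    sub = np.asarray(sub, dtype=float)
    row_means = sub.mean(axis=1, keepdims=True)   # mean of each row
    col_means = sub.mean(axis=0, keepdims=True)   # mean of each column
    overall = sub.mean()                          # mean of the whole bicluster
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())

# A purely additive pattern (row effect + column effect) has zero residue.
additive = np.add.outer([0.0, 1.0, 2.0], [0.0, 10.0, 20.0, 30.0])
msr_additive = mean_squared_residue(additive)

# A block with no coherent row/column pattern has a positive residue.
noisy = np.array([[1.0, 0.0], [0.0, 1.0]])
msr_noisy = mean_squared_residue(noisy)
```

One possible use would be to average this score over all biclusters for each candidate k and look at how it changes with k, but that is a heuristic and an assumption on my part.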

Desired distribution of weights in word embedding vectors

I am training my own embedding vectors as I'm focused on an academic dataset (WOS); whether the vectors are generated via word2vec or fasttext doesn't particularly matter. Say my vectors are 150 dimensions each. I'm wondering what the desired distribution of weights within a vector ought to be, if you averaged across an entire corpus's vectors?
I did a few experiments while looking at the distributions of a sample of my vectors and came to these conclusions (uncertain as to how absolutely they hold):
If one trains the model for too few epochs, then the vectors don't change significantly from their initial values (easy to see if you start your vectors with weight 0 in every dimension). Thus, if my weight distribution is centered around some point (typically 0), then I've under-trained on my corpus.
If one trains the model on too few documents, or over-trains, then the vectors show significant correlation with each other (I typically visualize a random set of vectors, and you can see stripes where all the vectors have weights that are either positive or negative).
What I imagine is that a single "good" vector has various weights across the entire range of -1 to 1. Any single vector may have significantly more dimensions near -1 or near 1. However, the weight distribution of an entire corpus would balance out vectors that randomly have more values towards one end of the spectrum or the other, so that the weight distribution of the entire corpus is approximately evenly distributed across the entire range. Is this intuition correct?
I'm unfamiliar with any research or folk wisdom about the desirable "weights of the vectors" (by which I assume you mean the individual dimensions).
In general, since the individual dimensions aren't strongly interpretable, I'm not sure you could say much about how any one dimension's values should be distributed. And remember, our intuitions from low-dimensional spaces (2d, 3d, 4d) often don't hold up in high-dimensional spaces.
I've seen two interesting, possibly relevant observations in research:
Some have observed that the raw trained vectors for words with singular meanings tend to have a larger magnitude, and those with many meanings have smaller magnitudes. A plausible explanation for this would be that word-vectors for polysemous word-tokens are being pulled in different directions for the multiple contrasting meanings, and thus wind up "somewhere in the middle" (closer to the origin, and thus of lower magnitude). Note, though, that most word-vector-to-word-vector comparisons ignore the magnitudes, by using cosine-similarity to only compare angles (or largely equivalently, by normalizing all vectors to unit length before comparisons).
A paper "All-but-the-Top: Simple and Effective Postprocessing for Word Representations" by Mu, Bhat, & Viswanath https://arxiv.org/abs/1702.01417v2 has noted that the average of all word-vectors that were trained together tends to be biased in a certain direction from the origin, but that removing that bias (and other commonalities in the vectors) can result in improved vectors for many tasks. In my own personal experiments, I've observed that the magnitude of that bias-from-origin seems correlated with the number of negative samples chosen - and that choosing the extreme (and uncommon) value of just 1 negative sample makes such a bias negligible (but might not be best for overall quality or efficiency/speed of training).
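To make the second observation concrete, here is a rough sketch of that kind of postprocessing: subtract the mean vector, then project out the top few principal directions. It runs on random toy data rather than real embeddings, and the function name and the choice of `n_components=2` are my own assumptions, not values taken from the paper.

```python
import numpy as np

def all_but_the_top(vectors, n_components=2):
    """Subtract the mean vector, then remove the top principal directions."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]
    # Remove each vector's projection onto the top directions.
    return centered - centered @ top.T @ top

rng = np.random.default_rng(0)
# Toy "embeddings": random vectors sharing a common offset from the origin.
toy = rng.normal(size=(500, 20)) + 3.0
processed = all_but_the_top(toy, n_components=2)

mean_norm_before = float(np.linalg.norm(toy.mean(axis=0)))
mean_norm_after = float(np.linalg.norm(processed.mean(axis=0)))
```

Comparing `mean_norm_before` with `mean_norm_after` on your own vectors is one quick way to see how large the bias-from-origin is before deciding whether the postprocessing is worth trying.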
So there may be useful heuristics about vector quality from looking at the relative distributions of vectors, but I'm not sure any would be sensitive to individual dimensions (except insofar as those happen to be the projections of vectors onto a certain axis).

Find top K cosine similar vectors to a given vector efficiently

The problem:
Suppose I have a group D of around 1,000,000 short documents (no more than 50 words each), and I want to let users supply a document from the same group D and get the top K similar documents from D.
My approach:
My first approach was to preprocess the group D by applying simple tf-idf and then, once I have an (extremely sparse) vector for each document, to run a simple nearest neighbours algorithm based on cosine similarity.
Then, at query time, I would just use my static nearest neighbours table, whose size is 1,000,000 x K, without any further calculations.
After applying tf-idf, I got vectors of size ~200,000, which means I now have a very sparse table of size 1,000,000 x 200,000 (which can be stored efficiently in memory using sparse vectors).
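For reference, the precomputation described above amounts to something like the sketch below. It is only an illustration under simplifying assumptions: it uses a tiny dense NumPy array in place of the sparse tf-idf matrix, and the function name and `chunk_size` parameter are made up. Rows are scaled to unit length so that a plain dot product equals cosine similarity, and the work is done in chunks so the full n x n similarity matrix is never held in memory.

```python
import numpy as np

def top_k_cosine(matrix, k, chunk_size=1024):
    """Return, per row, the indices of the k most cosine-similar other rows."""
    X = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)    # unit rows: dot product == cosine
    n = unit.shape[0]
    result = np.empty((n, k), dtype=int)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        sims = unit[start:stop] @ unit.T           # (chunk, n) similarities
        # Exclude each document from its own neighbour list.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # argpartition finds the k largest entries without a full sort.
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        order = np.argsort(-np.take_along_axis(sims, part, axis=1), axis=1)
        result[start:stop] = np.take_along_axis(part, order, axis=1)
    return result

docs = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 3.0, 0.2],
])
neighbours = top_k_cosine(docs, k=1, chunk_size=2)
```

This is still quadratic in the number of documents, so it does not remove the scaling problem by itself; it only replaces the generic neighbour search with chunked matrix products.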
However, calculating the nearest neighbours model has taken more than one day and still hasn't finished.
I tried to lower the vector dimension by using HashingTF instead, which relies on the hashing trick, so that I can fix the dimension to a constant (in my case, I used 2^13 for unified hashing), but I still get the same poor performance.
Some technical information:
I use Spark 2.0 for the tf-idf calculation, and sklearn NearestNeighbors on the collected data.
Is there a more efficient way to achieve this goal?
Thanks in advance.
Edit:
I had an idea to try an LSH-based approximate similarity algorithm like those implemented in Spark as described here, but could not find one that supports the 'cosine' similarity metric.
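For what it's worth, the usual LSH family for cosine similarity is random hyperplane hashing: each bit of a signature records on which side of a random hyperplane the vector falls, and two vectors at angle theta agree on a given bit with probability 1 - theta/pi. Below is a minimal NumPy sketch of the idea. It is not a Spark API, and the function names and `n_bits` value are only illustrative.

```python
import numpy as np
from collections import defaultdict

def signatures(vectors, n_bits=16, seed=0):
    """Sign pattern of each vector against n_bits random hyperplanes."""
    X = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))
    return (X @ planes) >= 0           # boolean array, shape (n, n_bits)

def build_buckets(sigs):
    """Group row indices by identical signature."""
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        buckets[sig.tobytes()].append(idx)
    return buckets

vecs = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],      # same direction as row 0, different length
    [-1.0, -2.0, -3.0],   # opposite direction to row 0
])
sigs = signatures(vecs, n_bits=16)
buckets = build_buckets(sigs)
```

Candidates for a query would then be taken from its bucket (in practice from several hash tables, to improve recall) and re-ranked by exact cosine similarity.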
The algorithm has some requirements on the relation between the number of training instances and the dimension of your vectors, but you can try DIMSUM.
You can find the paper here.

How to calculate the distance between any two elements among more than 10^8 records in order to cluster them using Spark?

I have more than 10^8 records stored in Elasticsearch. Now I want to cluster them by writing a hierarchical algorithm or by using PIC from Spark MLlib.
However, I can't use an efficient algorithm like K-means because every record is stored in the form of
{mainID:[subId1,subId2,subId3,...]}
which obviously is not in Euclidean space.
I need to calculate the distance between every pair of records, which I guess will take a very long time (10^8 * 10^8). I know I can use the Cartesian product in Spark for such a computation, but the result will contain duplicates like (mainID1,mainID2) and (mainID2,mainID1), which is not suitable for PIC.
Does anyone know a better way to cluster these records? Or any method to delete the duplicated ones in the result RDD of cartesian product?
Thanks a lot!
First of all, don't take the full Cartesian product:
select where a.MainID > b.MainID
This doesn't reduce the complexity, but it does save about 2x in generation time.
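In plain Python the same filter looks like the sketch below. The records are made up, and the Jaccard distance is just one assumed way to compare two ID lists, since the question does not say which distance is intended. In Spark, the same predicate could be applied in a `filter` after `cartesian`.

```python
from itertools import combinations, product

# Hypothetical records in the {mainID: [subIds]} form from the question.
records = {
    1: [10, 11, 12],
    2: [11, 12, 13],
    3: [40, 41],
}

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; an assumed distance for ID lists."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

# Full Cartesian product, filtered so each unordered pair appears once.
filtered = [(a, b) for a, b in product(records, repeat=2) if a > b]

# combinations() yields the same unordered pairs without generating the rest.
unique_pairs = list(combinations(sorted(records), 2))

distances = {
    (a, b): jaccard_distance(records[a], records[b]) for a, b in unique_pairs
}
```

Either way you end up with n * (n - 1) / 2 pairs, which is why the filter alone does not make 10^8 records tractable.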
That said, consider your data "shape" and select the clustering algorithm accordingly. K-means, HC, and PIC have three different applications. You know K-means already, I'm sure.
PIC basically finds gaps in the distribution of distances. It's great for well-defined sets (clear boundaries), even when those curl around each other or nest. However, if you have a tendril of connecting points (like a dumbbell with a long, thin bar), PIC will not separate the obvious clusters.
HC is great for such sets, and is a good algorithm in general. Most HC algorithms have an "understanding" of density, and tend to give clusterings that fit human cognition's interpretation. However, HC tends to be slow.
I strongly suggest that you consider a "seeded" algorithm: pick a random subset of your points, perhaps
sqrt(size) * dim
points, where size is the quantity of points (10^8) and dim is the number of dimensions. For instance, your example has 5 dimensions, so take 5*10^4 randomly selected points. Run the first iterations on those alone, which will identify centroids (K-means), eigenvectors (PIC), or initial hierarchy (HC). With those "seeded" values, you can now characterize each of the candidate clusters with 2-3 parameters. Classifying the remaining 10^8 - 5*10^4 points against 3 parameters is a lot faster, being O(size) time instead of O(size^2).
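Here is a rough sketch of the seeded idea for the K-means case. It assumes the records have already been mapped to numeric points in a Euclidean space, which the raw ID lists are not. The function name, the farthest-point initialisation, and the toy data are my own choices for illustration.

```python
import numpy as np

def seeded_kmeans(points, k, n_iter=20, seed=0):
    """Fit k-means on a random subset, then label every point in one pass."""
    X = np.asarray(points, dtype=float)
    size, dim = X.shape
    rng = np.random.default_rng(seed)
    n_seed = min(size, int(np.sqrt(size) * dim))        # sqrt(size) * dim
    sample = X[rng.choice(size, size=n_seed, replace=False)]

    # Farthest-point initialisation on the sample (an assumed choice).
    centroids = [sample[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(sample - c, axis=1) for c in centroids], axis=0)
        centroids.append(sample[np.argmax(d)])
    centroids = np.array(centroids)

    for _ in range(n_iter):                              # Lloyd iterations on the sample only
        dists = np.linalg.norm(sample[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):                      # keep old centroid if a cluster empties
                centroids[j] = sample[labels == j].mean(axis=0)

    # One O(size * k) pass assigns all points to the nearest seeded centroid.
    all_d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, all_d.argmin(axis=1), n_seed

rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(5000, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(5000, 2))
data = np.vstack([blob_a, blob_b])
centroids, labels, n_seed = seeded_kmeans(data, k=2)
```

The quadratic work only touches the sample; the final assignment is a single linear pass over the full data set.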
Does that get you moving toward something useful?

Using trainImplicit for a Recommendation system

Let's say I have a database of users buying products (there are no ratings or anything similar) and I want to recommend other products to them. I am using ALS.trainImplicit, where the training data has the following format:
[Rating(user=2, product=23053, rating=1.0),
Rating(user=2, product=2078, rating=1.0),
Rating(user=3, product=23, rating=1.0)]
So all the ratings in the training dataset are always 1.
Is it normal that the predicted ratings have a minimum value of -0.6 and a maximum of 1.85? I would expect something between 0 and 1.
Yes, it is normal. The implicit version of ALS essentially tries to reconstruct a binary preference matrix P (rather than a matrix of explicit ratings, R). In this case, the "ratings" are treated as confidence levels - a higher rating means higher confidence that the binary preference p(ij) should be reconstructed as 1 instead of 0.
However, ALS essentially solves a (weighted) least squares regression problem to find the user and item factor matrices that reconstruct matrix P. So the predicted values are not guaranteed to be in the range [0, 1] (though in practice they are usually close to that range). It's enough to interpret the predictions as "opaque" values where higher values equate to greater likelihood that the user might purchase that product. That's enough for sorting recommended products by predicted score.
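To illustrate why the scores are unbounded, here is a tiny dense sketch of implicit-feedback ALS in NumPy. This is not Spark's implementation: the function name, the hyperparameters (`rank`, `alpha`, `reg`) and the toy purchase matrix are all assumptions chosen for illustration. Each update is a weighted ridge regression, and a prediction is just the dot product of a user factor and an item factor, so nothing clips it to [0, 1].

```python
import numpy as np

def implicit_als(R, rank=2, alpha=40.0, reg=0.1, n_iter=15, seed=0):
    """Tiny dense implicit-feedback ALS; returns user and item factors."""
    R = np.asarray(R, dtype=float)
    P = (R > 0).astype(float)          # binary preference matrix
    C = 1.0 + alpha * R                # confidence weights
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(n_users, rank))
    Y = rng.normal(scale=0.1, size=(n_items, rank))
    eye = reg * np.eye(rank)

    def solve(fixed, conf, pref):
        # One weighted ridge regression per row of conf/pref.
        out = np.empty((conf.shape[0], rank))
        for i in range(conf.shape[0]):
            w = conf[i][:, None]                       # per-entry weights
            A = fixed.T @ (w * fixed) + eye
            b = fixed.T @ (conf[i] * pref[i])
            out[i] = np.linalg.solve(A, b)
        return out

    for _ in range(n_iter):
        X = solve(Y, C, P)             # update users with items fixed
        Y = solve(X, C.T, P.T)         # update items with users fixed
    return X, Y

# Two groups of users, each buying its own group of products (all "ratings" are 1).
R = np.zeros((6, 6))
R[:3, :3] = 1.0
R[3:, 3:] = 1.0
X, Y = implicit_als(R)
scores = X @ Y.T                       # predicted preferences: plain dot products
observed_mean = float(scores[R > 0].mean())
unobserved_mean = float(scores[R == 0].mean())
```

Only the ordering of `scores` within a user's row matters for recommendation, which is why treating the values as opaque is fine.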
(Note that item-item or user-user similarities are typically computed using cosine similarity between the factor vectors, so these scores will lie in [-1, 1]. That computation is not directly available in Spark, but you can do it yourself.)
