Why does scikit-learn's TruncatedSVD use the 'randomized' algorithm as default?

I used TruncatedSVD on a 30000 by 40000 term-document matrix to reduce it to 3000 dimensions.
When using 'randomized', the explained variance ratio is about 0.5 (n_iter=10).
When using 'arpack', the explained variance ratio is about 0.9.
So the variance ratio of the 'randomized' algorithm is lower than that of 'arpack'.
Why, then, does scikit-learn's TruncatedSVD use the 'randomized' algorithm as the default?

Speed!
According to the docs, sklearn.decomposition.TruncatedSVD can use a randomized algorithm due to Halko, Martinsson, and Tropp (2009). The paper claims that their algorithm is considerably faster.
For a dense matrix, it runs in O(m*n*log(k)) time, whereas the classical algorithm takes O(m*n*k) time, where m and n are the dimensions of the matrix from which you want the k largest components. The randomized algorithm is also easier to parallelize efficiently and makes fewer passes over the data.
Table 7.1 of the paper (on page 45) shows the performance of a few algorithms as a function of matrix size and # of components, and the randomized algorithm is often an order of magnitude faster.
The accuracy of the output is also claimed to be pretty good (Figure 7.5), though there are some modifications and constants that might affect it and I haven't gone through the sklearn code to see what they did/did not do.
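For a rough sense of the trade-off, here is a minimal sketch; a small random sparse matrix stands in for the 30000 x 40000 term-document matrix (the sizes and density are made up), and it times both solvers and prints the explained variance they capture:

import time
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Toy sparse "term-document" matrix standing in for the real one.
X = sparse_random(2000, 3000, density=0.01, format='csr', random_state=0)

for algorithm in ('randomized', 'arpack'):
    svd = TruncatedSVD(n_components=100, algorithm=algorithm, n_iter=10, random_state=0)
    t0 = time.time()
    svd.fit(X)
    print(algorithm,
          'time: %.1fs' % (time.time() - t0),
          'explained variance: %.3f' % svd.explained_variance_ratio_.sum())

Whether the accuracy gap you observed is acceptable for the speedup is then a per-application decision.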

Related

Desired distribution of weights in word embedding vectors

I am training my own embedding vectors as I'm focused on an academic dataset (WOS); whether the vectors are generated via word2vec or fasttext doesn't particularly matter. Say my vectors are 150 dimensions each. I'm wondering what the desired distribution of weights within a vector ought to be, if you averaged across an entire corpus's vectors?
I did a few experiments while looking at the distributions of a sample of my vectors and came to these conclusions (uncertain as to how absolutely they hold):
If one trains the model with too few epochs, the vectors don't change much from their initialized values (easy to see if you initialize your vectors with weight 0 in every dimension). So if my weight distribution is centered around some point (typically 0), I've under-trained on my corpus.
If one trains their model with too few documents/over-trains then the vectors show significant correlation with each other (I typically visualize a random set of vectors and you can see stripes where all the vectors have weights that are either positive or negative).
What I imagine is that a single "good" vector has weights spread across the whole range of -1 to 1, and any single vector may have noticeably more dimensions near -1 or 1. Across an entire corpus, however, vectors that randomly lean towards one end of the spectrum or the other would balance out, so that the corpus-wide weight distribution is approximately uniform. Is this intuition correct?
I'm unfamiliar with any research or folk wisdom about the desirable "weights of the vectors" (by which I assume you mean the individual dimensions).
In general, since the individual dimensions aren't strongly interpretable, I'm not sure you could say much about how any one dimension's values should be distributed. And remember, our intuitions from low-dimensional spaces (2d, 3d, 4d) often don't hold up in high-dimensional spaces.
I've seen two interesting, possibly relevant observations in research:
Some have observed that the raw trained vectors for words with singular meanings tend to have a larger magnitude, while those with many meanings have smaller magnitudes. A plausible explanation is that word-vectors for polysemous word-tokens are pulled in different directions for the multiple contrasting meanings, and thus wind up "somewhere in the middle" (closer to the origin, and thus of lower magnitude). Note, though, that most word-vector-to-word-vector comparisons ignore the magnitudes, by using cosine-similarity to compare only angles (or, largely equivalently, by normalizing all vectors to unit length before comparison).
A paper "All-but-the-Top: Simple and Effective Postprocessing for Word Representations" by Mu, Bhat, & Viswanath https://arxiv.org/abs/1702.01417v2 has noted that the average of all word-vectors that were trained together tends to biased in a certain direction from the origin, but that removing that bias (and other commonalities in the vectors) can result in improved vectors for many tasks. In my own personal experiments, I've observed that the magnitude of that bias-from-origin seems correlated with the number of negative samples chosen - and that choosing the extreme (and uncommon) value of just 1 negative sample makes such a bias negligible (but might not be best for overall quality or efficiency/speed of training).
So there may be useful heuristics about vector quality from looking at the relative distributions of vectors, but I'm not sure any would be sensitive to individual dimensions (except insofar as those happen to be the projections of vectors onto a certain axis).
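If you want to poke at these properties yourself, a minimal numpy sketch (the vectors here are random stand-ins for a real trained embedding matrix) shows the kind of per-dimension and per-vector statistics discussed above, plus the common-mean removal that is the first step of the All-but-the-Top postprocessing:

import numpy as np

# Random stand-in for a matrix of trained word vectors: one 150-d vector per row.
vectors = np.random.normal(scale=0.3, size=(50000, 150))

# Per-dimension distribution across the whole vocabulary.
print('per-dimension means:', vectors.mean(axis=0)[:5])
print('per-dimension stds: ', vectors.std(axis=0)[:5])

# Per-vector magnitudes (ignored by cosine similarity, but interesting to inspect).
norms = np.linalg.norm(vectors, axis=1)
print('mean vector norm:', norms.mean())

# Bias-from-origin: the average of all vectors, and its removal.
# (The full All-but-the-Top method also removes the top few PCA directions.)
common_mean = vectors.mean(axis=0)
debiased = vectors - common_mean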

Interpreting clustering metrics

I'm doing k-means clustering in scikit-learn on 398 samples with 306 features. The feature matrix is sparse, and the number of clusters is 4.
To improve the clustering, I tried two approaches:
After clustering, I used ExtraTreesClassifier() to classify and compute feature importances (with samples labeled by the clustering).
I used PCA to reduce the feature dimension to 2.
I have computed the following metrics (sum of squares, Calinski-Harabasz, Silhouette):

   Method                   Sum of squares   Calinski-Harabasz   Silhouette
1  kmeans                   31.682           401.3               0.879
2  kmeans+top-features      5989230.351      75863584.45         0.977
3  kmeans+PCA               890.5431893      58479.00277         0.993
My questions are:
As far as I know, a smaller sum of squares indicates better clustering, and a Silhouette closer to 1 also indicates better clustering. But, for instance, in the last row both the sum of squares and the Silhouette are larger than in the first row.
How can I choose which approach has better performance?
Never compare sum-of-squares and similar metrics across different projections, transformations or data sets.
To see why, simply multiply every feature by 0.5 - your sum of squares will drop to 0.25 of its former value. So to "improve" your data set, you just need to scale it down to a tiny size...
These metrics must only be used on exactly the same input and parameters. You can't even use sum-of-squares to compare k-means runs with different k, because the larger k will win. All you can do is run multiple random attempts, and then keep the best minimum you found this way.
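To make the scaling argument concrete, here is a small sketch (random data of the same shape as in the question) showing that multiplying every feature by 0.5 shrinks the k-means sum of squares (inertia) to about a quarter without changing the clustering at all:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(398, 306)  # random data, same shape as in the question

inertia = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X).inertia_
inertia_scaled = KMeans(n_clusters=4, n_init=10, random_state=0).fit(0.5 * X).inertia_

print(inertia_scaled / inertia)  # ~0.25: the "improvement" is pure rescaling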
With 306 features you are under the curse of dimensionality. Clustering in 306 dimensions is not meaningful. Therefore I wouldn't select features after clustering.
To get interpretable results, you need to reduce dimensionality. For 398 samples you need a low dimension (2, 3, maybe 4). Your PCA to dimension 2 is good; you can try 3.
Selecting important features before clustering may also be problematic. In any case, are the 2/3/4 "best" features meaningful in your case?

Why does scikit learn return log-density?

The function score_samples from sklearn.neighbors.kde.KernelDensity returns the log of the density. What is the advantage of that over returning the density itself?
I know that the logarithm makes sense for probabilities, which are between 0 and 1 (see this question: Why use log-probability estimates in GaussianNB [scikit-learn]?). But why do the same for densities, which are between 0 and infinity?
Is there a way to estimate log-density directly, or is it just the logarithm taken from the estimated density?
Much of what applies to probabilities also applies to densities, so the answers in Why use log-probability estimates in GaussianNB [scikit-learn]? apply:
As long as the density is everywhere positive, the logarithm is well defined. It has much better numerical resolution and stability as the density tends toward 0. Imagine a Gaussian kernel of a certain width to model your points, and imagine them in a cluster somewhere. As you move away from this dense area, the log density amounts to the negative squared distance to the cluster. The exponential of that will quickly yield very small quantities which you may rightfully not trust anymore.
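A small sketch (synthetic 1-d data, made-up bandwidth) makes the numerical point: the log-density of a point far from the data is still a perfectly usable number, while its exponential underflows to zero:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.normal(loc=0.0, scale=1.0, size=(1000, 1))  # a cluster around 0

kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)

far_point = np.array([[30.0]])  # far away from the cluster
log_density = kde.score_samples(far_point)
print(log_density)          # a large negative, but still informative, number
print(np.exp(log_density))  # underflows to 0.0 - the information is lost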

How to scale input DBSCAN in scikit-learn

Should the input to sklearn.cluster.DBSCAN be pre-processed?
In the example http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py the distances between the input samples X are calculated and normalized:
D = distance.squareform(distance.pdist(X))
S = 1 - (D / np.max(D))
db = DBSCAN(eps=0.95, min_samples=10).fit(S)
In another example for v0.14 (http://jaquesgrobler.github.io/online-sklearn-build/auto_examples/cluster/plot_dbscan.html) some scaling is done:
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
I base my code on the latter example and have the impression clustering works better with this scaling. However, this scaling "standardizes features by removing the mean and scaling to unit variance". I'm trying to find 2d clusters. If my clusters are distributed in a square area, say 100x100, I see no problem with the scaling. However, if they are distributed in a rectangular area, e.g. 800x200, the scaling 'squeezes' my samples and changes the relative distances between them in one dimension. This deteriorates the clustering, doesn't it? Or am I understanding something wrong?
Do I need to apply some preprocessing at all, or can I simply input my 'raw' data?
It depends on what you are trying to do.
If you run DBSCAN on geographic data, and distances are in meters, you probably don't want to normalize anything, but set your epsilon threshold in meters, too.
And yes, a non-uniform scaling in particular does distort distances, while a uniform (non-distorting) scaling is equivalent to just using a different epsilon value.
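A quick sketch of that second point (synthetic blobs, arbitrary parameter values): scaling the data uniformly by 2 and doubling eps gives exactly the same clustering:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

labels = DBSCAN(eps=1.0, min_samples=5).fit(X).labels_
labels_scaled = DBSCAN(eps=2.0, min_samples=5).fit(2 * X).labels_  # uniform scaling, doubled eps

print(np.array_equal(labels, labels_scaled))  # True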
Note that in the first example, apparently a similarity and not a distance matrix is processed. S = 1 - D / np.max(D) is a heuristic to convert a distance matrix into a similarity matrix. Epsilon 0.95 then effectively means at most "0.05 of the maximum dissimilarity observed". An alternate version that should yield the same result is:
D = distance.squareform(distance.pdist(X))   # pairwise distance matrix
S = np.max(D) - D                            # turn distances into (unnormalized) similarities
db = DBSCAN(eps=0.95 * np.max(D), min_samples=10).fit(S)
Whereas in the second example, fit(X) actually processes the raw input data, not a distance matrix. IMHO it is an ugly hack to overload the method this way. It's convenient, but it leads to misunderstandings and maybe even incorrect usage sometimes.
Overall, I would not take sklearn's DBSCAN as a reference. The whole API seems to be heavily driven by classification, not by clustering. Usually you don't "fit" a clustering; you do that for supervised methods only. Plus, sklearn currently does not use indexes for acceleration, and needs O(n^2) memory (which DBSCAN usually would not).
In general, you need to make sure that your distance works. If your distance function doesn't work, no distance-based algorithm will produce the desired results. On some data sets, naive distances such as Euclidean work better when you first normalize your data. On other data sets, you have a good understanding of what distance is (e.g. geographic data; doing a standardization on this obviously does not make sense, nor does Euclidean distance!).
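For the geographic case, one way to keep eps in meaningful units and avoid the overloaded fit(X) altogether is to pass a precomputed distance matrix. A sketch (made-up coordinates, haversine distances scaled to meters) could look like this:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

# Hypothetical geographic points as (latitude, longitude) in degrees.
coords_deg = np.array([[52.52, 13.40], [52.53, 13.41], [48.86, 2.35], [48.85, 2.34]])

# haversine_distances works in radians and returns distances in radians,
# so convert the input and scale by the Earth's radius to get meters.
D = haversine_distances(np.radians(coords_deg)) * 6371000

# With metric='precomputed' there is no ambiguity about what fit() receives,
# and eps is expressed directly in meters (here: 5 km).
labels = DBSCAN(eps=5000, min_samples=2, metric='precomputed').fit(D).labels_
print(labels)  # e.g. [0 0 1 1]: the two Berlin-ish points vs. the two Paris-ish points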

Is there an FFT that uses a logarithmic division of frequency?

Wikipedia's Wavelet article contains this text:
The discrete wavelet transform is also less computationally complex, taking O(N) time as compared to O(N log N) for the fast Fourier transform. This computational advantage is not inherent to the transform, but reflects the choice of a logarithmic division of frequency, in contrast to the equally spaced frequency divisions of the FFT.
Does this imply that there's also an FFT-like algorithm that uses a logarithmic division of frequency instead of linear? Is it also O(N)? This would obviously be preferable for a lot of applications.
Yes. Yes. No.
It is called the Logarithmic Fourier Transform. It runs in O(n) time. However, it is mainly useful for functions that decay slowly with increasing domain/abscissa.
Referring back to the Wikipedia article:
"The main difference is that wavelets are localized in both time and frequency whereas the standard Fourier transform is only localized in frequency."
So if your signal is localized in time (or space, pick your interpretation of the abscissa), then wavelets (or the discrete cosine transform) are a reasonable approach. But if it needs to go on and on and on, then you need the Fourier transform.
Read more about LFT at http://homepages.dias.ie/~ajones/publications/28.pdf
Here is the abstract:
We present an exact and analytical expression for the Fourier transform of a function that has been sampled logarithmically. The procedure is significantly more efficient computationally than the fast Fourier transformation (FFT) for transforming functions or measured responses which decay slowly with increasing abscissa value. We illustrate the proposed method with an example from electromagnetic geophysics, where the scaling is often such that our logarithmic Fourier transform (LFT) should be applied. For the example chosen, we are able to obtain results that agree with those from an FFT to within 0.5 per cent in a time that is a factor of 1.0e2 shorter. Potential applications of our LFT in geophysics include conversion of wide-band electromagnetic frequency responses to transient responses, glacial loading and unloading, aquifer recharge problems, normal mode and earth tide studies in seismology, and impulsive shock wave modelling.
EDIT: After reading up on this, I think this algorithm is not really useful for this question, but I will give a description anyway for other readers.
There is also Filon's algorithm, a method based on Filon's quadrature, which can be found in Numerical Recipes and in this PhD thesis [1].
The timescale is log-spaced, as is the resulting frequency scale.
This algorithm is intended for data/functions which decay to 0 in the observed time interval (which is probably not your case); a typical simple example would be an exponential decay.
Suppose your data is given by points (x_0, y_0), (x_1, y_1), ..., (x_{N-1}, y_{N-1}) and you want to calculate the spectrum A(f), where the frequency f runs log-spaced from, let's say, f_min = 1/x_max to f_max = 1/x_min.
The real part for each frequency f is then calculated as:
Re A(f) = sum_{i=0}^{N-2} (y_{i+1} - y_i)/(x_{i+1} - x_i) * [cos(2*pi*f*x_{i+1}) - cos(2*pi*f*x_i)] / (2*pi*f)^2
The imaginary part is:
Im A(f) = y_0/(2*pi*f) + sum_{i=0}^{N-2} (y_{i+1} - y_i)/(x_{i+1} - x_i) * [sin(2*pi*f*x_{i+1}) - sin(2*pi*f*x_i)] / (2*pi*f)^2
[1] Blochowicz, Thomas: Broadband Dielectric Spectroscopy in Neat and Binary Molecular Glass Formers. University of Bayreuth, 2003, Chapter 3.2.3
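A direct transcription of the two formulas above into Python might look like the sketch below (the function name and the number of frequency points are my own choices; the frequency grid is log-spaced from 1/x_max to 1/x_min as described):

import numpy as np

def log_spaced_spectrum(x, y, n_freqs=100):
    # x, y: sample points (x_0, y_0), ..., (x_{N-1}, y_{N-1})
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    freqs = np.logspace(np.log10(1.0 / x.max()), np.log10(1.0 / x.min()), n_freqs)

    slope = np.diff(y) / np.diff(x)  # (y_{i+1} - y_i) / (x_{i+1} - x_i)
    re = np.empty(n_freqs)
    im = np.empty(n_freqs)
    for j, f in enumerate(freqs):
        w = 2.0 * np.pi * f
        re[j] = np.sum(slope * (np.cos(w * x[1:]) - np.cos(w * x[:-1]))) / w**2
        im[j] = y[0] / w + np.sum(slope * (np.sin(w * x[1:]) - np.sin(w * x[:-1]))) / w**2
    return freqs, re + 1j * im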
To do what you want, you need to measure different time windows, which means lower frequencies get updated least often (inversely proportional to powers of 2).
Check FPPO here:
https://www.rationalacoustics.com/files/FFT_Fundamentals.pdf
This means that higher frequencies update more often; you always average (a moving average is good), but you can also let it move faster. Of course, if you plan on using the inverse FFT, you don't want any of this. Also, better accuracy (smaller bandwidth) at lower frequencies means these need to update much more slowly, like 16k windows (1/3 m/s).
Yes, a low-frequency signal naturally varies slowly, and thus, of course, you need a lot of time to detect it. This is not a problem that math can fix; it's a natural trade-off, and you can't have both high accuracy at low frequencies and a fast response.
I think the link I provided will clarify some of your options... 7 years after you asked the question, unfortunately.
