How to create SparseVector and dense Vector representations
If the DenseVector is:
denseV = np.array([0., 3., 0., 4.])
what will be the SparseVector representation?
Unless I have thoroughly misunderstood your question, the MLlib data types documentation illustrates this quite clearly:
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
Here, the second argument of Vectors.sparse is an array of the indices of the non-zero entries, and the third argument is the array of the values at those indices.
Sparse vectors are useful when most of the values in the vector are zero, while a dense vector is the natural choice when most of the values are non-zero.
If you have to create a sparse vector from the dense vector you specified, use the following syntax:
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
Vector sparseVector = Vectors.sparse(4, new int[] {1, 3}, new double[] {3.0, 4.0});
Dense: use it when most of the positions hold actual (non-zero) data.
Sparse: use it when few positions are filled, i.e. the vector contains many zeroes.
e.g. for the vector {0.0, 3.0, 0.0, 4.0} the two representations are:
val posVector = Vectors.dense(0.0, 3.0, 0.0, 4.0) // every entry is stored
val sparseVector = Vectors.sparse(4, Array(1, 3), Array(3.0, 4.0)) // only the non-zero entries are listed
Syntax: Vectors.sparse(vector size, indices of non-zero entries, values at those indices)
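Since the denseV in the question is a NumPy array, here is the same idea in PySpark; this is only a sketch assuming the pyspark.mllib.linalg API, and both objects below describe the same vector [0.0, 3.0, 0.0, 4.0]:
from pyspark.mllib.linalg import Vectors

# Dense representation: every entry is stored
dense_v = Vectors.dense([0.0, 3.0, 0.0, 4.0])

# Sparse representation: size 4, non-zero values 3.0 and 4.0 at indices 1 and 3
sparse_v = Vectors.sparse(4, [1, 3], [3.0, 4.0])

print(dense_v)             # [0.0,3.0,0.0,4.0]
print(sparse_v)            # (4,[1,3],[3.0,4.0])
print(sparse_v.toArray())  # back to a NumPy array: [0. 3. 0. 4.]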
I found some code that uses the computeSVD() function; here is the code:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([
    Vectors.sparse(5, {1: 1.0, 3: 7.0}),
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)
# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True)
U = svd.U # The U factor is a RowMatrix.
s = svd.s # The singular values are stored in a local dense vector.
V = svd.V # The V factor is a local dense matrix.
What does computeU=True mean in this code?
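As I understand it, the computeU flag controls whether the U factor (the left singular vectors) is computed at all; building U is the expensive part and it is skipped by default. A minimal sketch of the difference, assuming the same mat as above:
# By default computeU is False: only the singular values s and the V factor
# are produced, and no U factor is materialised.
svd_no_u = mat.computeSVD(5)

# With computeU=True, Spark additionally computes U as a distributed RowMatrix,
# so that the original matrix can be reconstructed as U * diag(s) * V^T.
svd_with_u = mat.computeSVD(5, computeU=True)
print(svd_with_u.U.numRows(), svd_with_u.U.numCols())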
I'm testing SVD decomposition with a simple matrix:
A=np.array([[1,2,3],[4,5,6]])
but when I use:
U,D,V=np.linalg.svd(A)
the outputs are U with shape (2,2), D with shape (2,), and V with shape (3,3).
The problem is the shape of V: the SVD should return a 2x3 matrix, since my original matrix is 2x3 and I'm getting 2 singular values, yet it returns a 3x3 matrix. When I take V[:2,:] and form the product:
U.dot(np.diag(D).dot(V[:2,:]))
it returns the original matrix A. What is happening here?
Thank you for reading and for your answers, and sorry for the grammar; I'm a beginner in English.
This is explained in the docstring, but it might take a few readings to get it. The boolean parameter full_matrices determines the shape of the returned arrays. In your case, you want full_matrices=False, e.g.:
In [42]: A = np.array([[1, 2, 3], [4, 5, 6]])
In [43]: U, D, V = np.linalg.svd(A, full_matrices=False)
In [44]: U @ np.diag(D) @ V
Out[44]:
array([[1., 2., 3.],
       [4., 5., 6.]])
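For comparison, here is a quick check of the resulting shapes under both settings; this is just a sketch continuing the same session with the same A:
In [45]: Uf, Df, Vf = np.linalg.svd(A)  # full_matrices=True is the default
In [46]: Uf.shape, Df.shape, Vf.shape
Out[46]: ((2, 2), (2,), (3, 3))
In [47]: U.shape, D.shape, V.shape      # from the full_matrices=False call above
Out[47]: ((2, 2), (2,), (2, 3))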
I need help with creating a function that samples from a random uniform distribution with parameters defined in an ordered dictionary and returns a dictionary with parameter names as keys, using any random seed.
parameter=OrderedDict([('a', (100.0, 0.0)), ('b', (90.0, 5.0))])
NB: (100.0, 0.0) are the mean and standard deviation, respectively.
Expected return: {'a': 105.46565, 'b': 90}
Thanks
Something like this?
from collections import OrderedDict
import random
parameter = OrderedDict([('a', (100.0, 0.0)), ('b', (90.0, 5.0))])
samples = {}
for k, (mu, sigma) in parameter.items():
    samples[k] = random.normalvariate(mu, sigma)
>>> print(samples)
{'a': 100.0, 'b': 89.02621974794464}
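If you want this wrapped in a function that also takes a seed, as the question asks, a minimal sketch could look like the following (draw_samples is just a placeholder name; a dedicated random.Random instance makes the seed reproducible):
from collections import OrderedDict
import random

def draw_samples(parameter, seed=None):
    # One draw per key from a normal distribution with the given (mean, std).
    rng = random.Random(seed)
    return {k: rng.normalvariate(mu, sigma) for k, (mu, sigma) in parameter.items()}

parameter = OrderedDict([('a', (100.0, 0.0)), ('b', (90.0, 5.0))])
print(draw_samples(parameter, seed=42))
# 'a' is always 100.0 because its std is 0.0; 'b' varies with the seed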
I am going through the Manning book for Information retrieval. Currently I am at the part about cosine similarity. One thing is not clear for me.
Let's say I have the tf-idf vectors for the query and a document. I want to compute the cosine similarity between both vectors. When I compute the magnitude of the document vector, do I sum the squares of all the terms in the vector or just the terms in the query?
Here is an example: we have the user query "cat food beef".
Let's say its vector is (0,1,0,1,1) (assume there are only 5 dimensions in the vector, one for each unique word in the query and the document).
We have a document "Beef is delicious".
Its vector is (1,1,1,0,0). We want to find the cosine similarity between the query and the document vectors.
To answer the question directly: the magnitude of the document vector is computed over all of its terms, not just the ones that appear in the query. Cosine similarity is simply a fraction where
the numerator is the dot product of the 2 vectors
the denominator is the product of the magnitudes of the 2 vectors, i.e. their Euclidean lengths, i.e. the square root of the dot product of each vector with itself.
For the numerator, e.g. in numpy:
>>> import numpy as np
>>> y = [1.0, 1.0, 1.0, 0.0, 0.0]
>>> x = [0.0, 1.0, 0.0, 1.0, 1.0]
>>> np.dot(x,y)
1.0
Similarly, if we compute the dot product by multiplying x_i and y_i and summing the individual elements:
>>> x_dot_y = sum([(1.0 * 0.0) + (1.0 * 1.0) + (1.0 * 0.0) + (0.0 * 1.0) + (0.0 * 1.0)])
>>> x_dot_y
1.0
For the denominator, we can compute the magnitude in numpy:
>>> from numpy.linalg import norm
>>> y = [1.0, 1.0, 1.0, 0.0, 0.0]
>>> x = [0.0, 1.0, 0.0, 1.0, 1.0]
>>> norm(x) * norm(y)
2.9999999999999996
Similarly, we can compute the Euclidean lengths with math.sqrt, using np.dot only for the inner products:
>>> import math
# with np.dot
>>> math.sqrt(np.dot(x,x)) * math.sqrt(np.dot(y,y))
2.9999999999999996
So the cosine similarity is:
>>> cos_x_y = np.dot(x,y) / (norm(x) * norm(y))
>>> cos_x_y
0.33333333333333337
You can also get it from the cosine distance function in scipy (cosine similarity = 1 - cosine distance):
>>> from scipy import spatial
>>> 1 - spatial.distance.cosine(x,y)
0.33333333333333337
See also
How to calculate cosine similarity given 2 sentence strings? - Python
Cosine Similarity between 2 Number Lists
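Putting the pieces above together without NumPy or SciPy at all, a plain-Python sketch of the same computation might look like this (cosine_similarity is just an illustrative name):
import math

def cosine_similarity(x, y):
    # Dot product divided by the product of the Euclidean lengths.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([0.0, 1.0, 0.0, 1.0, 1.0], [1.0, 1.0, 1.0, 0.0, 0.0]))  # ~0.333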
I am trying to reduce the dimensionality of the MNIST dataset using PCA. The trick is that I have to preserve a certain percentage of the variance (say 80%) while reducing the dimension. I am using scikit-learn. I am looking at pca.explained_variance_ratio_, but it gives me the same values with the decimal point in different places, like 9.7 or .97 or .097. I also tried pca.get_variance(), but I assume that's not the answer. My question is: how do I ensure that I have reduced the dimension while preserving a certain percentage of the variance?
If you apply PCA without passing the n_components argument, then the explained_variance_ratio_ attribute of the PCA object will give you the information you need. This attribute indicates the fraction of total variance associated with the corresponding eigenvector. Here is an example copied directly from the current stable PCA documentation:
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
>>> print(pca.explained_variance_ratio_)
[ 0.99244... 0.00755...]
In your case, if you apply np.cumsum to the explained_variance_ratio_ attribute, then the number of principal components you need to keep corresponds to the position of the first element in np.cumsum(pca.explained_variance_ratio_) that is greater than or equal to 0.8.
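As a sketch of that approach, together with a shortcut scikit-learn offers where a float n_components between 0 and 1 keeps just enough components to reach that fraction of variance; the toy X below simply stands in for your MNIST matrix:
import numpy as np
from sklearn.decomposition import PCA

# Toy data standing in for the flattened MNIST images
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Option 1: fit with all components, then count how many reach 80% cumulative variance
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= 0.8)) + 1

# Option 2: let PCA choose the number of components for the requested variance
pca_80 = PCA(n_components=0.8, svd_solver='full').fit(X)
print(n_keep, pca_80.n_components_)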