What does 'computeU' mean in Spark's computeSVD() function?

I found some code that uses the computeSVD() function; here it is:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([
    Vectors.sparse(5, {1: 1.0, 3: 7.0}),
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)
# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True)
U = svd.U # The U factor is a RowMatrix.
s = svd.s # The singular values are stored in a local dense vector.
V = svd.V # The V factor is a local dense matrix.
What does computeU=True mean in this code?
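computeU tells Spark whether to materialize the U factor of the decomposition A ≈ U * diag(s) * V^T. It defaults to False because U is a potentially large distributed RowMatrix (one row per row of the input matrix), whereas s and V are small local objects; with computeU=False, svd.U is simply None. A minimal sketch of the difference, continuing from the code above:
svd_no_u = mat.computeSVD(5, computeU=False)
print(svd_no_u.U)                                        # None -- U was not computed
svd_with_u = mat.computeSVD(5, computeU=True)
print(svd_with_u.U.numRows(), svd_with_u.U.numCols())    # U is a distributed RowMatrix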

Related

How to use fit_transform with an array?

Example of array content:
[
[4.9, 3.0, 1.4, 0.2, 0.0, 2.0],
[4.7, 3.2, 1.3, 0.2, 0.0, 2.0],
[4.6, 3.1, 1.5, 0.2, 0.0, 2.0],
...
]
model = TSNE(learning_rate=100)
transformed = model.fit_transform(data)
I'm trying to apply tSNE to a float array, but I get an error. What should I change?
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (149,) + inhomogeneous part.
That ValueError means the rows of your list do not all have the same length, so NumPy cannot build a rectangular 2-D array from them; make sure every row has the same number of columns. Try this example:
from sklearn.manifold import TSNE
import numpy as np
X = np.array([[4.9, 3.0, 1.4, 0.2, 0.0, 2.0], [4.7, 3.2, 1.3, 0.2, 0.0, 2.0]])
model = TSNE(learning_rate=100, perplexity=1)  # perplexity must be smaller than the number of samples (only 2 rows here)
transformed = model.fit_transform(X)
print(transformed)
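If the error persists with your own data, the usual culprit is that the rows of the nested list have different lengths. A quick sketch to check, assuming data is your nested list from above:
row_lengths = {len(row) for row in data}
print(row_lengths)   # more than one value here means the rows are ragged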

difflib: comparing a list of keywords with another list and returning ratio

I am trying to compare a list of words with a whole list of sentences using 'difflib'.
import pandas as pd
from difflib import SequenceMatcher
s1 = ['okay', 'bye', 'what is'] # reference keywords
s2 = ['okay', 'what', 'dont worry', 'what is my name', 'is', 'my', 'name', 'bye'] #actual list
SequenceMatcher(a = s1, b = s2).ratio() # returns 0.36
The above snippet returns 0.36 as a single overall score. But I need a list where each element of the actual list is scored against the reference keywords, with exact matches scoring 1.0. For the example above, the result might look like [1.0, 0.2, 0.0, 0.5, 0.1, 0.0, 0.0, 0.0, 1.0] (the values here are made up): exact match = 1.0, no match = 0.0, and partial matches scored accordingly.
Maybe you're looking for something like this:
[max([SequenceMatcher(None, x, y).ratio() for y in s1]) for x in s2]
>>> [1.0, 0.7272727272727273, 0.2857142857142857, 0.6363636363636364, 0.4444444444444444, 0.4, 0.2857142857142857, 1.0]
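If it helps to see which score belongs to which phrase, the same idea can be written as a small sketch that pairs every item of the actual list with its best ratio against the reference keywords:
from difflib import SequenceMatcher

s1 = ['okay', 'bye', 'what is']  # reference keywords
s2 = ['okay', 'what', 'dont worry', 'what is my name', 'is', 'my', 'name', 'bye']

scores = {x: max(SequenceMatcher(None, x, y).ratio() for y in s1) for x in s2}
print(scores)   # e.g. {'okay': 1.0, 'what': 0.727..., ..., 'bye': 1.0}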

Cosine similarity between query and document in a search engine

I am going through the Manning book on information retrieval. Currently I am at the part about cosine similarity. One thing is not clear to me.
Let's say I have the tf-idf vectors for the query and a document. I want to compute the cosine similarity between both vectors. When I compute the magnitude of the document vector, do I sum the squares of all the terms in the vector or just the terms in the query?
Here is an example: we have the user query "cat food beef".
Let's say its vector is (0, 1, 0, 1, 1) (assume there are only 5 dimensions in the vector, one for each unique word in the query and the document).
We have a document "Beef is delicious"
Its vector is (1,1,1,0,0). We want to find the cosine similarity between the query and the document vectors.
Cosine similarity is simply a fraction where
the numerator is the dot product between the 2 vectors
the denominator is the product of the magnitudes of the 2 vectors,
i.e. the Euclidean length, i.e. the square root of the dot product of a vector with itself.
For the numerator, e.g. in numpy:
>>> import numpy as np
>>> y = [1.0, 1.0, 1.0, 0.0, 0.0]
>>> x = [0.0, 1.0, 0.0, 1.0, 1.0]
>>> np.dot(x,y)
1.0
Equivalently, we can compute the dot product by multiplying each x_i by y_i and summing the individual products:
>>> x_dot_y = sum(x_i * y_i for x_i, y_i in zip(x, y))
>>> x_dot_y
1.0
For the denominator, we can compute the magnitude in numpy:
>>> from numpy.linalg import norm
>>> y = [1.0, 1.0, 1.0, 0.0, 0.0]
>>> x = [0.0, 1.0, 0.0, 1.0, 1.0]
>>> norm(x) * norm(y)
2.9999999999999996
Similarly, we can compute the Euclidean lengths by taking the square root of each vector's dot product with itself:
>>> import math
>>> math.sqrt(np.dot(x,x)) * math.sqrt(np.dot(y,y))
2.9999999999999996
So the cosine similarity is (note that each vector's magnitude is computed over all of its components, not just the ones that appear in the query):
>>> cos_x_y = np.dot(x,y) / (norm(x) * norm(y))
>>> cos_x_y
0.33333333333333337
You can also use the cosine distance function directly from scipy (cosine similarity = 1 - cosine distance):
>>> from scipy import spatial
>>> 1 - spatial.distance.cosine(x,y)
0.33333333333333337
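Putting it all together as a small helper, a sketch using only numpy:
>>> def cosine_similarity(x, y):
...     return np.dot(x, y) / (norm(x) * norm(y))
...
>>> cosine_similarity(x, y)
0.33333333333333337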
See also
How to calculate cosine similarity given 2 sentence strings? - Python
Cosine Similarity between 2 Number Lists

How to convert RDD of dense vector into DataFrame in pyspark?

I have a DenseVector RDD like this
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]
I want to convert this into a DataFrame. I tried it like this:
>>> spark.createDataFrame(frequencyDenseVectors, ['rawfeatures']).collect()
It gives an error like this
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 520, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 360, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 340, in _inferSchema
schema = _infer_schema(first)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 991, in _infer_schema
fields = [StructField(k, _infer_type(v), True) for k, v in items]
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 968, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <type 'numpy.ndarray'>
Old solution:
frequencyVectors.map(lambda vector: DenseVector(vector.toArray()))
Edit 1 - Reproducible code
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import split
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.mllib.linalg import SparseVector, DenseVector
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
sc.setLogLevel('ERROR')
sentenceData = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])
sentenceData = sentenceData.withColumn("sentence", split("sentence", "\s+"))
sentenceData.show()
vectorizer = CountVectorizer(inputCol="sentence", outputCol="rawfeatures").fit(sentenceData)
countVectors = vectorizer.transform(sentenceData).select("label", "rawfeatures")
idf = IDF(inputCol="rawfeatures", outputCol="features")
idfModel = idf.fit(countVectors)
tfidf = idfModel.transform(countVectors).select("label", "features")
frequencyDenseVectors = tfidf.rdd.map(lambda vector: [vector[0],DenseVector(vector[1].toArray())])
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
You cannot convert an RDD[Vector] directly. It should be mapped to an RDD of objects which can be interpreted as structs, for example RDD[Tuple[Vector]]:
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
Otherwise Spark will try to convert the object's __dict__ and use the unsupported NumPy array as a field.
from pyspark.ml.linalg import DenseVector
from pyspark.sql.types import _infer_schema
v = DenseVector([1, 2, 3])
_infer_schema(v)
TypeError Traceback (most recent call last)
...
TypeError: not supported type: <class 'numpy.ndarray'>
vs.
_infer_schema((v, ))
StructType(List(StructField(_1,VectorUDT,true)))
Notes:
In Spark 2.0 you have to use the correct local types:
pyspark.ml.linalg when working with the DataFrame-based pyspark.ml API.
pyspark.mllib.linalg when working with the RDD-based pyspark.mllib API.
These two namespaces are no longer compatible and require explicit conversions (for example How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT).
The code provided in the edit is not equivalent to the one from the original question. You should be aware that tuple and list don't have the same semantics. If you map a vector to a pair, use a tuple and convert directly to a DataFrame:
tfidf.rdd.map(
    lambda row: (row[0], DenseVector(row[1].toArray()))
).toDF()
Using a tuple (product type) would work for a nested structure as well, but I doubt this is what you want:
(tfidf.rdd
    .map(lambda row: (row[0], DenseVector(row[1].toArray())))
    .map(lambda x: (x, ))
    .toDF())
A list anywhere other than the top-level row is interpreted as an ArrayType.
It is much cleaner to use a UDF for conversion (Spark Python: Standard scaler error "Do not support ... SparseVector").
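For example, a rough sketch of such a UDF, assuming you want to densify the ml vector column produced by the IDF stage directly in the DataFrame (the column and variable names are taken from the code above):
from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.sql.functions import udf

# convert the ml SparseVector column to DenseVector without dropping to an RDD
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
dense_tfidf = tfidf.withColumn("features", to_dense("features"))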
I believe the problem here is that createDataFrame does not take a DenseVector as an argument. Try converting the DenseVector into a corresponding collection (i.e. an array or list). In Scala and Java the
toArray()
method is available, so you can convert the DenseVector into an array or list and then try to create the DataFrame.

Sparse Vector vs Dense Vector

How to create SparseVector and dense Vector representations
if the DenseVector is:
denseV = np.array([0., 3., 0., 4.])
What will be the SparseVector representation?
Unless I have thoroughly misunderstood your question, the MLlib data types documentation illustrates this quite clearly:
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
Here the second argument of Vectors.sparse is an array of the indices of the nonzero entries, and the third argument is the array of the values at those indices.
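Since the denseV in the question is built with NumPy from Python, the equivalent PySpark construction would be, as a sketch using pyspark.mllib:
from pyspark.mllib.linalg import Vectors

denseV = Vectors.dense([0.0, 3.0, 0.0, 4.0])
sparseV = Vectors.sparse(4, [1, 3], [3.0, 4.0])   # size, non-zero indices, values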
A sparse vector is used when most of the values in the vector are zero, while a dense vector is used when most of the values are non-zero.
If you have to create a sparse vector from the dense vector you specified, use the following syntax:
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
Vector sparseVector = Vectors.sparse(4, new int[] {1, 3}, new double[] {3.0, 4.0});
Dense: use it when most of the positions in the vector hold actual (non-zero) data.
Sparse: use it when only a few positions are filled (i.e. the vector contains many zeroes).
E.g., for {0.0, 3.0, 0.0, 4.0} the two representations are:
val denseVector = Vectors.dense(0.0, 3.0, 0.0, 4.0) // all data is stored explicitly
val sparseVector = Vectors.sparse(4, Array(1, 3), Array(3.0, 4.0)) // only non-zeros are mentioned
Syntax: Vectors.sparse(size of vector, non-zero indices, values)
