How to convert RDD of dense vector into DataFrame in pyspark? - apache-spark

I have a DenseVector RDD like this
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]
I want to convert this into a Dataframe. I tried like this
>>> spark.createDataFrame(frequencyDenseVectors, ['rawfeatures']).collect()
It gives an error like this
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/", line 520, in createDataFrame
rdd, schema = self._createFromRDD(, schema, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/", line 360, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/", line 340, in _inferSchema
schema = _infer_schema(first)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/", line 991, in _infer_schema
fields = [StructField(k, _infer_type(v), True) for k, v in items]
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/", line 968, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <type 'numpy.ndarray'>
Old solution: .map(lambda vector: DenseVector(vector.toArray()))
Edit 1 - Code Reproducible
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import split
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.mllib.linalg import SparseVector, DenseVector
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
sentenceData = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
], ["label", "sentence"])
sentenceData = sentenceData.withColumn("sentence", split("sentence", "\s+"))
vectorizer = CountVectorizer(inputCol="sentence", outputCol="rawfeatures").fit(sentenceData)
countVectors = vectorizer.transform(sentenceData).select("label", "rawfeatures")
idf = IDF(inputCol="rawfeatures", outputCol="features")
idfModel = idf.fit(countVectors)
tfidf = idfModel.transform(countVectors).select("label", "features")
frequencyDenseVectors = tfidf.rdd.map(lambda vector: [vector[0], DenseVector(vector[1].toArray())])
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])

You cannot convert an RDD[Vector] directly. It should be mapped to an RDD of objects which can be interpreted as structs, for example RDD[Tuple[Vector]]:
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
Otherwise Spark will try to convert the object's __dict__ and end up using an unsupported NumPy array as a field.
from pyspark.ml.linalg import DenseVector
from pyspark.sql.types import _infer_schema
v = DenseVector([1, 2, 3])
_infer_schema(v)  # a bare vector fails
TypeError Traceback (most recent call last)
TypeError: not supported type: <class 'numpy.ndarray'>
_infer_schema((v, ))  # a one-element tuple is inferred as a struct
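The (x, ) in the snippet above is a one-element tuple, which is what lets each record be read as a row with a single field. A minimal plain-Python sketch of that wrapping step, with ordinary lists standing in for the vectors (the names records, rows and not_rows are made up for illustration, and no Spark is involved):

```python
# Plain lists stand in for DenseVector objects; this is only an illustration.
records = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

# (x,) builds a one-element tuple: one "row" holding one field.
rows = [(x,) for x in records]

# Without the trailing comma the parentheses do nothing, so (x) is just x.
not_rows = [(x) for x in records]

print(type(rows[0]).__name__, len(rows[0]))  # tuple 1
print(type(not_rows[0]).__name__)            # list
```

The trailing comma is easy to miss, but it is the difference between a struct-like record and the bare object that schema inference rejects.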
In Spark 2.0 you have to use the correct local types:
pyspark.ml.linalg when working with the DataFrame-based pyspark.ml API.
pyspark.mllib.linalg when working with the RDD-based pyspark.mllib API.
These two namespaces are no longer compatible and require explicit conversions (for example How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT).
The code provided in the edit is not equivalent to the one from the original question. You should be aware that tuple and list don't have the same semantics. If you map each vector to a pair, use a tuple and convert directly to a DataFrame:
tfidf.rdd.map(
    lambda row: (row[0], DenseVector(row[1].toArray()))
).toDF()
Using a tuple (product type) would work for the nested structure as well, but I doubt this is what you want:
(tfidf.rdd
    .map(lambda row: (row[0], DenseVector(row[1].toArray())))
    .map(lambda x: (x, ))
    .toDF())
A list at any place other than the top-level row is interpreted as an ArrayType.
It is much cleaner to use a UDF for the conversion (Spark Python: Standard scaler error "Do not support ... SparseVector").

I believe the problem here is that createDataFrame does not take a DenseVector as an argument. Please try to convert the DenseVector into a corresponding collection (i.e. an array or list). In Scala and Java a method is available to convert the DenseVector into an array or list; after that, try to create the DataFrame.


what does 'computeU' mean in computeSVD() function spark

I found some code that uses the computeSVD() function; here it is:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([
    Vectors.sparse(5, {1: 1.0, 3: 7.0}),
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)
# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True)
U = svd.U # The U factor is a RowMatrix.
s = svd.s # The singular values are stored in a local dense vector.
V = svd.V # The V factor is a local dense matrix.
What does computeU=True mean in this code?

How to use fit_transform with an array?

Example of array content:
[4.9, 3.0, 1.4, 0.2, 0.0, 2.0],
[4.7, 3.2, 1.3, 0.2, 0.0, 2.0],
[4.6, 3.1, 1.5, 0.2, 0.0, 2.0],
model = TSNE(learning_rate=100)
transformed = model.fit_transform(data)
I'm trying to apply tSNE to a float array, but I get an error. What should I change?
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (149,) + inhomogeneous part.
Try this example:
from sklearn.manifold import TSNE
import numpy as np
X = np.array([[4.9, 3.0, 1.4, 0.2, 0.0, 2.0], [4.7, 3.2, 1.3, 0.2, 0.0, 2.0]])
model = TSNE(learning_rate=100)
transformed = model.fit_transform(X)
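The "inhomogeneous shape" message usually means that the rows do not all have the same length, so NumPy cannot build a 2-D array from them. A small plain-Python sketch for locating such rows before calling fit_transform; the data values and the find_ragged_rows helper are hypothetical, not taken from the question:

```python
from collections import Counter

def find_ragged_rows(rows):
    """Return (expected_length, [(index, length), ...]) for rows whose
    length differs from the most common row length."""
    lengths = [len(r) for r in rows]
    expected = Counter(lengths).most_common(1)[0][0]
    bad = [(i, n) for i, n in enumerate(lengths) if n != expected]
    return expected, bad

# Hypothetical data: the third row is missing one value.
data = [
    [4.9, 3.0, 1.4, 0.2, 0.0, 2.0],
    [4.7, 3.2, 1.3, 0.2, 0.0, 2.0],
    [4.6, 3.1, 1.5, 0.2, 0.0],
]

expected, bad = find_ragged_rows(data)
print(expected, bad)  # 6 [(2, 5)]
```

Once the offending rows are fixed or dropped, np.array(data) should yield a regular 2-D float array that TSNE can accept.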

Dense Vector Column to Sparse Vector Column

I have a unique situation where I need to go from a DenseVector to a Sparse Vector Column.
I am trying to implement the SMOTE technique I found here:
But on line 44 I had to change it from min_Array[neigh][0] - min_Array[i][0] to DenseVector(min_Array[neigh][0]) - DenseVector(min_Array[i][0]) due to an error.
Once I have the DenseVector column, I need to convert it back to a SparseVector column to union my data.
I have tried the Following:
df = sc.parallelize([
(1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
(2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
(3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])
list_to_vector_udf = udf(lambda l: Vectors.sparse(l), VectorUDT())
df = df.withColumn('features', list_to_vector_udf(df["features"]))
"int() argument must be a string, a bytes-like object or a number, not 'DenseVector'"
assembler = VectorAssembler(inputCols=['features'],outputCol='features')
df = assembler.transform(df)
"Data type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> of column features is not supported."
It usually doesn't make much sense to convert a dense vector to a sparse vector, since the dense vector has already taken up the memory. If you really need to do this, look at the sparse vector API: it accepts either a list of (index, value) pairs, or the non-zero indices and values passed directly to the constructor. Something like the following:
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.linalg import DenseVector
from pyspark.sql.functions import udf
df = sc.parallelize([
(1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
(2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
(3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])
def to_sparse(dense_vector):
    size = len(dense_vector)
    # keep only the non-zero entries as (index, value) pairs
    pairs = [(i, v) for i, v in enumerate(dense_vector.values.tolist()) if v != 0]
    return Vectors.sparse(size, pairs)
dense_to_sparse_udf = udf(to_sparse, VectorUDT())
df = df.withColumn('features', dense_to_sparse_udf(df["features"]))
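df.show()
+-------+--------------------+
|row_num|            features|
+-------+--------------------+
|      1|(10,[1,2,3,4,5],[...|
|      2|    (10,[9],[100.0])|
|      3|      (10,[1],[1.0])|
+-------+--------------------+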
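The index/value extraction inside to_sparse can be checked without Spark. A minimal sketch on a plain Python list; nonzero_pairs is a hypothetical helper name, and its result is the size and the pairs that would be handed to Vectors.sparse:

```python
def nonzero_pairs(values):
    """Return (size, [(index, value), ...]) keeping only non-zero entries."""
    pairs = [(i, v) for i, v in enumerate(values) if v != 0]
    return len(values), pairs

size, pairs = nonzero_pairs([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])
print(size, pairs)
# 10 [(1, 1.0), (2, 1.0), (3, 2.0), (4, 1.0), (5, 3.0)]
```

This matches the first row of the output above, where indices 1 to 5 are the only non-zero positions.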

Optimization for faster numpy 'where' with boolean condition

I generate a bunch of 5-elements vectors with
def beam(n):
    # For performance considerations, see
    try:
        import numpy.random_intel
        generator = numpy.random_intel.multivariate_normal
    except ModuleNotFoundError:
        import numpy.random
        generator = numpy.random.multivariate_normal
    return generator(
        [0.0, 0.0, 0.0, 0.0, 0.0],  # mean: lost from the listing, zero assumed (as in the timings in the answer)
        [
            [1.0, 0.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 0.0, 0.2]
        ],
        int(n)  # sample count must be an integer; beam(1e5) passes a float
    )
This vector will be multiplied by 5x5 matrices (element wise) and checked for boundaries. I use this:
b = beam(1e5)
bound = 1000
s = (b[:, 0]**2 + b[:, 3]**2) < bound**2
#b[np.where(s)] (equivalent performances)
b[s] # <= returned value from a function
It seems that this operation with 100k elements is quite time consuming (3ms on my machine).
Would there be an obvious (or less obvious) way to speed up this
operation (the where part; the random generation is only there to give an example)?
As your components are uncorrelated one obvious speedup would be to use the univariate normal instead of the multivariate:
>>> from timeit import repeat
>>> import numpy as np
>>> kwds = dict(globals=globals(), number=100)
>>> repeat('np.random.multivariate_normal(np.zeros((5,)), np.diag((1,1,1,1,0.2)), (100,))', **kwds)
[0.01475344318896532, 0.01471381587907672, 0.013099645031616092]
>>> repeat('np.random.normal((0,0,0,0,0), (1,1,1,1,np.sqrt(0.2)), (100, 5))', **kwds)
[0.003930734936147928, 0.004097769036889076, 0.004246715921908617]
Further, as it stands your condition is extremely unlikely to fail. So, just check s.all() and if True do nothing.

Issues with Logistic Regression for multiclass classification using PySpark

I am trying to use Logistic Regression to classify a dataset whose feature column holds SparseVector values:
For full code base and error log, please check my github repo
Case 1: I tried using the pipeline of ML as follow:
# imported library from ML
from pyspark.ml.feature import HashingTF
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
print(type(trainingData)) # for checking only
print(trainingData.take(2)) # to inspect the data
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=maximumIteration, regParam=re
pipeline = Pipeline(stages=[lr])
# Train model
model = pipeline.fit(trainingData)
Got the following error:
<class 'pyspark.sql.dataframe.DataFrame'>
[Row(label=2.0, features=SparseVector(2000, {51: 1.0, 160: 1.0, 341: 1.0, 417: 1.0, 561: 1.0, 656: 1.0, 863: 1.0, 939: 1.0, 1021: 1.0, 1324: 1.0, 1433: 1.0, 1573: 1.0, 1604: 1.0, 1720: 1.0})), Row(label=3.0, features=SparseVector(2000, {24: 1.0, 51: 2.0, 119: 1.0, 167: 1.0, 182: 1.0, 190: 1.0, 195: 1.0, 285: 1.0, 432: 1.0, 539: 1.0, 571: 1.0, 630: 1.0, 638: 1.0, 656: 1.0, 660: 2.0, 751: 1.0, 785: 1.0, 794: 1.0, 801: 1.0, 823: 1.0, 893: 1.0, 900: 1.0, 915: 1.0, 956: 1.0, 966: 1.0, 1025: 1.0, 1029: 1.0, 1035: 1.0, 1038: 1.0, 1093: 1.0, 1115: 2.0, 1147: 1.0, 1206: 1.0, 1252: 1.0, 1261: 1.0, 1262: 1.0, 1268: 1.0, 1304: 1.0, 1351: 1.0, 1378: 1.0, 1423: 1.0, 1437: 1.0, 1441: 1.0, 1530: 1.0, 1534: 1.0, 1556: 1.0, 1562: 1.0, 1604: 1.0, 1711: 1.0, 1737: 1.0, 1750: 1.0, 1776: 1.0, 1858: 1.0, 1865: 1.0, 1923: 1.0, 1926: 1.0, 1959: 1.0, 1999: 1.0}))]
16/08/25 19:14:07 ERROR Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 5 in the input dataset.
Traceback (most recent call last):
File "/home/LR/", line 260, in <module>
accuracy = TrainLRCModel(trainData, testData)
File "/home/LR/", line 211, in TrainLRCModel
model = pipeline.fit(trainingData)
File "/usr/lib/spark/python/lib/", line 69, in fit
File "/usr/lib/spark/python/lib/", line 213, in _fit
File "/usr/lib/spark/python/lib/", line 69, in fit
File "/usr/lib/spark/python/lib/", line 133, in _fit
File "/usr/lib/spark/python/lib/", line 130, in _fit_java
File "/usr/lib/spark/python/lib/", line 813, in __call__
File "/usr/lib/spark/python/lib/", line 45, in deco
File "/usr/lib/spark/python/lib/", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
: org.apache.spark.SparkException: Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 5 in the input dataset.
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at py4j.reflection.MethodInvoker.invoke(
at py4j.reflection.ReflectionEngine.invoke(
at py4j.Gateway.invoke(
at py4j.commands.AbstractCommand.invokeMethod(
at py4j.commands.CallCommand.execute(
Case 2: I searched for an alternative to the above and found that LogisticRegressionWithLBFGS works for multi-class classification. I tried the following:
#imported library
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel, LogisticRegressionWithSGD
print(type(trainingData)) # to check the dataset type
print(trainingData.take(2)) # To see the data
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
Got the following error:
<class 'pyspark.sql.dataframe.DataFrame'>
[Row(label=3.0, features=SparseVector(2000, {24: 1.0, 51: 2.0, 119: 1.0, 167: 1.0, 182: 1.0, 190: 1.0, 195: 1.0, 285: 1.0, 432: 1.0, 539: 1.0, 571: 1.0, 630: 1.0, 638: 1.0, 656: 1.0, 660: 2.0, 751: 1.0, 785: 1.0, 794: 1.0, 801: 1.0, 823: 1.0, 893: 1.0, 900: 1.0, 915: 1.0, 956: 1.0, 966: 1.0, 1025: 1.0, 1029: 1.0, 1035: 1.0, 1038: 1.0, 1093: 1.0, 1115: 2.0, 1147: 1.0, 1206: 1.0, 1252: 1.0, 1261: 1.0, 1262: 1.0, 1268: 1.0, 1304: 1.0, 1351: 1.0, 1378: 1.0, 1423: 1.0, 1437: 1.0, 1441: 1.0, 1530: 1.0, 1534: 1.0, 1556: 1.0, 1562: 1.0, 1604: 1.0, 1711: 1.0, 1737: 1.0, 1750: 1.0, 1776: 1.0, 1858: 1.0, 1865: 1.0, 1923: 1.0, 1926: 1.0, 1959: 1.0, 1999: 1.0})), Row(label=5.0, features=SparseVector(2000, {103: 1.0, 310: 1.0, 601: 1.0, 817: 1.0, 866: 1.0, 940: 1.0, 1023: 1.0, 1118: 1.0, 1339: 1.0, 1447: 1.0, 1634: 1.0, 1776: 1.0}))]
Traceback (most recent call last):
File "/home/LR/", line 260, in <module>
accuracy = TrainLRCModel(trainData, testData)
File "/home/LR/", line 230, in TrainLRCModel
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
File "/usr/lib/spark/python/lib/", line 382, in train
File "/usr/lib/spark/python/lib/", line 206, in _regression_train_wrapper
TypeError: data should be an RDD of LabeledPoint, but got <class 'pyspark.sql.types.Row'>
Case 3: I then converted the dataset into an RDD of LabeledPoint so that I can use LogisticRegressionWithLBFGS, as follows:
#imported libraries
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel, LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
trainingData = trainingData.rdd.map(lambda row: [LabeledPoint(row.label, row.features)])
print('type of trainingData')
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
Got the following error:
<class 'pyspark.sql.dataframe.DataFrame'>
[Row(label=2.0, features=SparseVector(2000, {51: 1.0, 160: 1.0, 341: 1.0, 417: 1.0, 561: 1.0, 656: 1.0, 863: 1.0, 939: 1.0, 1021: 1.0, 1324: 1.0, 1433: 1.0, 1573: 1.0, 1604: 1.0, 1720: 1.0})), Row(label=3.0, features=SparseVector(2000, {24: 1.0, 51: 2.0, 119: 1.0, 167: 1.0, 182: 1.0, 190: 1.0, 195: 1.0, 285: 1.0, 432: 1.0, 539: 1.0, 571: 1.0, 630: 1.0, 638: 1.0, 656: 1.0, 660: 2.0, 751: 1.0, 785: 1.0, 794: 1.0, 801: 1.0, 823: 1.0, 893: 1.0, 900: 1.0, 915: 1.0, 956: 1.0, 966: 1.0, 1025: 1.0, 1029: 1.0, 1035: 1.0, 1038: 1.0, 1093: 1.0, 1115: 2.0, 1147: 1.0, 1206: 1.0, 1252: 1.0, 1261: 1.0, 1262: 1.0, 1268: 1.0, 1304: 1.0, 1351: 1.0, 1378: 1.0, 1423: 1.0, 1437: 1.0, 1441: 1.0, 1530: 1.0, 1534: 1.0, 1556: 1.0, 1562: 1.0, 1604: 1.0, 1711: 1.0, 1737: 1.0, 1750: 1.0, 1776: 1.0, 1858: 1.0, 1865: 1.0, 1923: 1.0, 1926: 1.0, 1959: 1.0, 1999: 1.0}))]
type of trainingData
<class 'pyspark.rdd.PipelinedRDD'>
[[LabeledPoint(2.0, (2000,[51,160,341,417,561,656,863,939,1021,1324,1433,1573,1604,1720],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))], [LabeledPoint(3.0, (2000,[24,51,119,167,182,190,195,285,432,539,571,630,638,656
Traceback (most recent call last):
File "/home/LR/", line 260, in <module>
accuracy = TrainLRCModel(trainData, testData)
File "/home/LR/", line 230, in TrainLRCModel
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
File "/usr/lib/spark/python/lib/", line 381, in train
AttributeError: 'list' object has no attribute 'features'
Can someone please suggest what I am missing? I want to use Logistic Regression in PySpark for multi-class classification.
Currently I am using Spark version 1.6.2 and Python version 2.7.9 on Google Cloud.
Thank you in advance for your kind help.
Case 1: There is nothing strange here; as the error message says, LogisticRegression in the ML package does not support multi-class classification in this Spark version, as clearly stated in the documentation.
Case 2: Here you have switched from ML to MLlib, which does not work with DataFrames but needs its input as an RDD of LabeledPoint (documentation), so again the error message is expected.
Case 3: Here is where things get interesting. First, you should remove the brackets from your map function, i.e. it should be
trainingData = trainingData.rdd.map(lambda row: LabeledPoint(row.label, row.features)) # no brackets after "row:"
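To see why the brackets matter, here is a plain-Python sketch with a namedtuple standing in for LabeledPoint (the Point type and the sample rows are hypothetical). With brackets, every element becomes a one-item list, and a list has no features attribute, which is exactly what the AttributeError in Case 3 complains about:

```python
from collections import namedtuple

# Stand-in for LabeledPoint, for illustration only.
Point = namedtuple("Point", ["label", "features"])
rows = [(2.0, [1.0, 0.0]), (3.0, [0.0, 1.0])]

with_brackets = [[Point(label, feats)] for label, feats in rows]
without_brackets = [Point(label, feats) for label, feats in rows]

print(hasattr(with_brackets[0], "features"))     # False: it is a list
print(hasattr(without_brackets[0], "features"))  # True
```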
Nevertheless, guessing from the code snippets you have provided, most probably you are going to get a different error now:
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
: org.apache.spark.SparkException: Input validation failed.
Here is what is happening (it took me some time to figure it out), using some dummy data (it's always a good idea to provide some sample data with your question):
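# 3-class classification
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
data = sc.parallelize([
    LabeledPoint(3.0, SparseVector(100,[10, 98],[1.0, 1.0])),
    LabeledPoint(1.0, SparseVector(100,[1, 22],[1.0, 1.0])),
    LabeledPoint(2.0, SparseVector(100,[36, 54],[1.0, 1.0]))
])
lrm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3) # throws exception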
: org.apache.spark.SparkException: Input validation failed.
The problem is that your labels must start from 0 (and this is nowhere documented - you have to dig in the Scala source code to see that this is the case!); so, mapping the labels in my dummy data above from (1.0, 2.0, 3.0) to (0.0, 1.0, 2.0), we finally get:
# 3-class classification
data = sc.parallelize([
    LabeledPoint(2.0, SparseVector(100,[10, 98],[1.0, 1.0])),
    LabeledPoint(0.0, SparseVector(100,[1, 22],[1.0, 1.0])),
    LabeledPoint(1.0, SparseVector(100,[36, 54],[1.0, 1.0]))
])
lrm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3) # no error now
Judging from your numClasses=5 argument, as well as from the label=5.0 in one of your printed records, I guess that most probably your code suffers from the same issue. Change your labels so that they run from 0.0 to 4.0 and you should be fine.
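One way to shift arbitrary labels into a 0-based range is to build a lookup from the sorted distinct labels. A minimal plain-Python sketch; the label values and the build_label_index helper are hypothetical, and in Spark the same lookup would be applied inside the map that builds each LabeledPoint:

```python
def build_label_index(labels):
    """Map each distinct label to a float index 0.0 .. k-1 (sorted order)."""
    return {label: float(i) for i, label in enumerate(sorted(set(labels)))}

# Hypothetical labels in the range 1.0 .. 5.0, as in the question.
labels = [3.0, 1.0, 2.0, 5.0, 4.0, 3.0]
index = build_label_index(labels)
remapped = [index[l] for l in labels]

print(remapped)  # [2.0, 0.0, 1.0, 4.0, 3.0, 2.0]
```

Keeping the lookup around also makes it possible to translate predictions back to the original label values afterwards.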
(I suggest that you delete the other identical question you have opened here, for reducing clutter...)
