Spark MLlib RowMatrix from SparseVector - apache-spark

I am trying to create a RowMatrix from an RDD of SparseVectors but am getting the following error:
<console>:37: error: type mismatch;
found : dataRows.type (with underlying type org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.SparseVector])
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.SparseVector <: org.apache.spark.mllib.linalg.Vector (and dataRows.type <: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.SparseVector]), but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
val svd = new RowMatrix(dataRows.persist()).computeSVD(20, computeU = true)
My code is:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg._
import org.apache.spark.{SparkConf, SparkContext}
val DATA_FILE_DIR = "/user/cloudera/data/"
val DATA_FILE_NAME = "dataOct.txt"
val dataRows = sc.textFile(DATA_FILE_DIR.concat(DATA_FILE_NAME)).map(line => Vectors.dense(line.split(" ").map(_.toDouble)).toSparse)
val svd = new RowMatrix(dataRows.persist()).computeSVD(20, computeU = true)
My input data file is approximately 150 rows by 50,000 columns of space-separated integers.
I am running:
Spark: Version 1.5.0-cdh5.5.1
Java: 1.7.0_67

Just provide an explicit type annotation, either for the RDD:
val dataRows: org.apache.spark.rdd.RDD[Vector] = ???
or for the result of the anonymous function:
...
.map(line => Vectors.dense(line.split(" ").map(_.toDouble)).toSparse: Vector)
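Applied to the original pipeline, the ascription to Vector makes the inferred element type RDD[Vector], so the RowMatrix constructor compiles; a minimal sketch of the corrected code, assuming the same space-separated input file:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
// Upcast each row to Vector so the RDD has the element type RowMatrix expects.
val dataRows = sc.textFile(DATA_FILE_DIR.concat(DATA_FILE_NAME))
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)).toSparse: Vector)
val svd = new RowMatrix(dataRows.persist()).computeSVD(20, computeU = true)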

Related

Spark LightGBM Predict dataframe datatype different from printSchema of output datatype

I want to transform one of the columns in my dataframe to string using a UDF.
When I printSchema of my dataframe, that column indeed shows a vector datatype. However, when I use my UDF to transform the vector to string, I get this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(probability)'
due to data type mismatch: argument 1 requires vector type, however, '`probability`' is of
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> type.;;
imports
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import com.microsoft.ml.spark.{LightGBMClassifier,LightGBMClassificationModel}
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
UDF
val vecToString = udf( (xs: Vector) => xs.toArray.mkString(";"))
DataFrame (printSchema)
val inputData = spark.read.parquet(inputDataPath)
val pipelineModel = PipelineModel.load(modelPath)
val predictions = pipelineModel.transform(inputData)
# Selecting only 2 columns from predictions DF:
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = false)
+-----------------------------------------+----------+
|probability |prediction|
+-----------------------------------------+----------+
|[0.2554504562575961,0.7445495437424039] |1.0 |
|[0.7763149003135102,0.22368509968648975] |0.0 |
Convert probability column to string using my UDF
val tmp = predictions
.withColumn("probabilityStr" , vecToString($"probability"))
And this is where the above error occurs.
Also tried:
val vecToString = udf( (xs: Array[Double]) => xs.mkString(";"))
AnalysisException: cannot resolve 'UDF(probability)' due to data type mismatch: argument 1 requires array<double> type, however, '`probability`' is of struct<type:tinyint,size:int,indices:array<int>,values:array<double>> type.;;
When I used a different model (not LightGBM), this works fine. Could it be due to the type of model used?
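No fix is given above, but this error is the usual symptom of mixing the old mllib vector type with the spark.ml vector type that ML pipelines (including mmlspark's LightGBM) produce: the UDF is declared against org.apache.spark.mllib.linalg.Vector, so the analyzer only sees the probability column as its underlying struct. A minimal sketch of the likely fix, under that assumption, is to declare the UDF against the spark.ml type instead:
// Use the spark.ml vector type so the UDF's input matches the pipeline's probability column.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
val vecToString = udf((xs: Vector) => xs.toArray.mkString(";"))
val tmp = predictions.withColumn("probabilityStr", vecToString($"probability"))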

kd tree implementation in PySpark

I am trying to build a kd-tree using PySpark. For this, I am using a
UDF to recursively build the kd-tree from a 2-dimensional list of floats.
Following is the piece of code I am trying:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import *
spark = SparkSession.builder.appName("SRDD").getOrCreate()
sc = spark.sparkContext
# Some sequence of floats
abc = [[0.0769,0.2982],[0.0863,0.30052],[0.0690,0.33337],[0.11975,0.2984],[0.07224,0.3467],[0.1316,0.2999]]
def build_kdtree(points, depth=0):
    n = points.count()
    if n <= 0:
        return None
    axis = depth % 2
    sorted_points = sorted(points, key=lambda point: point[axis])
    return {
        'point': sorted_points[n/2],
        'left': build_kdtree(sorted_points[:n/2], depth+1),
        'right': build_kdtree(sorted_points[n/2 + 1:], depth+1)
    }
#This is how I'm trying to specify the return type of the function
kdtree_schema=StructType([StructField('point',ArrayType(FloatType()),nullable=True),StructField('left',StructType(),nullable=True),StructField('right',StructType(),nullable=True)])
kdtree_schema=StructType([StructField('point',ArrayType(FloatType()),nullable=True),StructField('left',kdtree_schema,nullable=True),StructField('right',kdtree_schema,nullable=True)])
#UDF registration
buildkdtree_udf=udf(build_kdtree, kdtree_schema)
#Function call
pointskdtree=buildkdtree_udf(abc)
However, this throws TypeError: Invalid argument, not a string or column.
I have 2 main questions:
Is my approach to recursively building a kd-tree in Spark correct?
Are the lines where I specify the return type of the UDF as kdtree_schema correct?

Load Data for Machine Learning in Spark [duplicate]

I want to produce libsvm format, so I made a dataframe with the desired layout, but I do not know how to convert it to libsvm format. The format is as shown in the figure. The desired libsvm form is user item:rating. If you know what to do in the current situation, please advise:
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
val fields = line.split(",")
(fields(0).toInt,fields(1).toInt,fields(2).toDouble)
}
val user = ratings.map{ case (user,product,rate) => (user,(product.toInt,rate.toDouble))}
val usergroup = user.groupByKey
val data =usergroup.map{ case(x,iter) => (x,iter.map(_._1).toArray,iter.map(_._2).toArray)}
val data_DF = data.toDF("user","item","rating")
I am using Spark 2.0.
The issue you are facing can be divided into the following:
Converting your ratings (I believe) into LabeledPoint data X.
Saving X in libsvm format.
1. Converting your ratings into LabeledPoint data X
Let's consider the following raw ratings:
val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")
You can handle those raw ratings as a coordinate list matrix (COO).
Spark implements a distributed matrix backed by an RDD of its entries: CoordinateMatrix, where each entry is a tuple of (i: Long, j: Long, value: Double).
Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse (which is usually the case for user/item ratings).
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD
val data: RDD[MatrixEntry] =
sc.parallelize(rawRatings).map {
line => {
val fields = line.split(",")
val i = fields(0).toLong
val j = fields(1).toLong
val value = fields(2).toDouble
MatrixEntry(i, j, value)
}
}
Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:
val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
.toIndexedRowMatrix().rows // Extract indexed rows
.toDF("label", "features") // Convert rows
2. Saving LabeledPoint data in libsvm format
Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
Unfortunately we still can't use the DataFrameWriter directly: while most pipeline components support backward compatibility for loading, DataFrames that contain vector or matrix columns of the old mllib types may need to be migrated to the new spark.ml vector and matrix types.
Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1):
import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)
Now let's save the DataFrame:
convertedVecDF.write.format("libsvm").save("data/foo")
And we can check the file contents:
$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
EDIT:
In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format as below:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
df.write.format("libsvm").save("data/foo")
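To sanity-check the result, the saved directory can be read back with the built-in libsvm data source (a quick check, not part of the original answer):
// Read the libsvm files back; the result has "label" and "features" columns.
val reloaded = spark.read.format("libsvm").load("data/foo")
reloaded.show(false)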
In order to convert an existing DataFrame to a typed Dataset, I suggest the following; use the following case class:
case class LibSvmEntry(
  value: Double,
  features: org.apache.spark.ml.linalg.Vector)
Then you can use the map function to convert it to a LibSVM entry like so:
df.map { r: Row => /* Do your stuff here */ }
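The placeholder can be filled in along these lines; a minimal sketch, assuming the label/features DataFrame built above and that spark.implicits._ is in scope:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Dataset, Row}
// Map each Row of the label/features DataFrame into a typed LibSvmEntry.
val entries: Dataset[LibSvmEntry] = df.map { r: Row =>
  LibSvmEntry(r.getAs[Double]("label"), r.getAs[Vector]("features"))
}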
The libsvm features datatype is a sparse vector, so you can use pyspark.ml.linalg.SparseVector to solve the problem:
from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import udf
a = SparseVector(4, [1, 3], [3.0, 4.0])
def sparsevecfuc(len, index, score):
    """
    args: len int, index array, score array
    """
    return SparseVector(len, index, score)
trans_sparse = udf(sparsevecfuc, VectorUDT())

generating vector from text data for KMeans using spark

I am new to Spark and machine learning. I am trying to cluster, using KMeans, some data like:
1::Hi How are you
2::I am fine, how about you
In the data, the separator is :: and the actual text to cluster is the second column.
After reading the official Spark page and numerous articles I have written the following code, but I am not able to generate the vectors to provide as input to the KMeans.train step.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
val sc = new SparkContext("local", "test")
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
val rawData = sc.textFile("data/mllib/KM.txt").map(line => line.split("::")(1))
val sentenceData = rawData.toDF("sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val clusters = KMeans.train(featurizedData, 2, 10)
I am getting the following error:
<console>:27: error: type mismatch;
found : org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val clusters = KMeans.train(featurizedData, 2, 10)
Please suggest how to process the input data for KMeans.
Thanks in advance.
Finally I got it working after replacing the following code.
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val clusters = KMeans.train(featurizedData, 2, 10)
With
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val kmeans = new KMeans().setK(2).setFeaturesCol("features").setPredictionCol("prediction")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))
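The spark.ml KMeans (imported above in place of the mllib one) runs as a pipeline stage, so the remaining step is to fit and apply the pipeline; a minimal sketch, assuming the sentenceData DataFrame built earlier:
// Fit tokenizer -> hashing TF -> k-means on the sentences, then attach
// a "prediction" column holding each row's cluster assignment.
val model = pipeline.fit(sentenceData)
val clustered = model.transform(sentenceData)
clustered.select("sentence", "prediction").show(false)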

Spark Decision tree with categorical variables

My data has categorical variables (the response and some of the feature variables).
How can I convert it to libsvm format after converting the categorical variables to binary features?
If your data is an RDD, you may call the method saveAsLibSVMFile(rdd, path). It's part of the org.apache.spark.mllib.util.MLUtils package.
For official documentation see: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.util.MLUtils$
Here's a Scala example, assuming you have converted your categorical data to binary features (you can do the same in Python or Java too):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
// Assumes response.txt holds one label per line and features.txt holds
// one space-separated feature row per line, in the same order.
val responseData = sc.textFile("response.txt")
val responseValue = responseData.map(line => line.trim().toDouble)
val featuresData = sc.textFile("features.txt")
val featuresValue = featuresData.map(line => line.trim().split(" ").map(_.toDouble))
val data = responseValue.zip(featuresValue).map {
  case (label, features) => LabeledPoint(label, Vectors.dense(features))
}
MLUtils.saveAsLibSVMFile(data, "data.libsvm")
If you want the PySpark version, I haven't tested this, but it would be something like:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
# Same assumed layout: one label per line in response.txt,
# one space-separated feature row per line in features.txt.
responseData = sc.textFile("response.txt")
responseValue = responseData.map(lambda line: float(line.strip()))
# for clarity you can also extract the lambda into a function
featuresData = sc.textFile("features.txt")
featuresValue = featuresData.map(lambda line: [float(x) for x in line.strip().split(" ")])
data = responseValue.zip(featuresValue).map(
    lambda pair: LabeledPoint(pair[0], Vectors.dense(pair[1])))
MLUtils.saveAsLibSVMFile(data, "data.libsvm")
