Using VectorAssembler in Spark

I have the following DataFrame (assume it already exists as a DataFrame):
val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12)))
  .toDF("a", "b", "c")
I want to combine some of its columns (not all of them) into a single column and turn the result into an RDD[Array[Double]]. I am doing the following:
import org.apache.spark.ml.feature.VectorAssembler
val colSelected = List("a","b")
val assembler = new VectorAssembler()
  .setInputCols(colSelected.toArray)
  .setOutputCol("features")
val output = assembler.transform(df).select("features").rdd
Up to here everything is fine. The output is now an RDD[org.apache.spark.sql.Row]. I am unable to transform it into an RDD[Array[Double]]. Is there a way to do this?
I have tried something like the following but with no success:
output.map { case Row(a: Vector[Double]) => a.getAs[Array[Double]]("features")}

The correct solution (this assumes Spark 2.0+; in 1.x use o.a.s.mllib.linalg.Vector instead):
import org.apache.spark.ml.linalg.Vector
output.map(_.getAs[Vector]("features").toArray)
The ml / mllib Vector created by VectorAssembler is not the same type as scala.collection.Vector.
Row.getAs should be used with the expected type. It doesn't perform any type conversions, and o.a.s.ml(lib).linalg.Vector is not an Array[Double].
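For reference, a minimal end-to-end sketch, assuming the df, colSelected, and assembler defined in the question (the printed values are what the sample data should produce for columns a and b):
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Assemble, keep only the features column, and unwrap each Row into an Array[Double]
val arrays: RDD[Array[Double]] = assembler.transform(df)
  .select("features")
  .rdd
  .map(_.getAs[Vector]("features").toArray)

arrays.collect().foreach(a => println(a.mkString("[", ", ", "]")))
// [1.0, 2.0]
// [3.0, 4.0]
// [5.0, 6.0]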

Related

How to drop original columns in a spark ML transformer

When we run a Spark ML transformer, we provide input and output columns. The transformed dataset contains both kinds of columns, i.e. the original columns and the transformed ones.
e.g.
from pyspark.ml.feature import Imputer
df = spark.createDataFrame([
    (1.0, float("nan")),
    (2.0, float("nan")),
    (float("nan"), 3.0),
    (4.0, 4.0),
    (5.0, 5.0)
], ["a", "b"])
imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)
model.transform(df).columns
This will print out
['a','b','out_a','out_b']
Is it possible to ask the transformer to spit out the transformed column only?
I want this to happen inside the transformer itself; I do not want to remove the columns afterwards using the DataFrame drop method.

Create a vector of features by dropping specific columns in Spark ML

I use VectorAssembler to create a vector of features from >2000 columns so that I can run a PCA on it. I would normally explicitly state which columns need to be included in the feature vector:
val dataset = (spark.createDataFrame(
  Seq((0, 1.2, 1.3, 1.7, 1.9), (1, 2.2, 2.3, 2.7, 2.9), (2, 3.2, 3.3, 3.5, 3.7))
).toDF("id", "f1", "f2", "f3", "f4"))
val assembler = (new VectorAssembler()
  .setInputCols(Array("f2", "f3"))
  .setOutputCol("featureVec"))
But in case of more than 2000 columns how can I specify that all columns except for "id" and "f1" should be included?
Any help appreciated!
One of the easiest ways is to get all the column names, convert them to a Set, subtract the columns you don't need, and use the result as an array:
val datasetColumnsToBeUsed = (dataset.columns.toSet - "id" - "f1").toArray
import org.apache.spark.ml.feature.VectorAssembler
val assembler = (new VectorAssembler()
  .setInputCols(datasetColumnsToBeUsed)
  .setOutputCol("featureVec"))
Another easy way is to use filterNot on the column names:
val columnNames = dataset.columns
val datasetColumnsToBeUsed = columnNames.filterNot(x => Array("id", "f1").contains(x))
And use it with the VectorAssembler exactly as above.
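For completeness, a minimal sketch wiring that filtered array into the assembler (same dataset and datasetColumnsToBeUsed as above; assembled is just an illustrative name):
import org.apache.spark.ml.feature.VectorAssembler

val assembler = (new VectorAssembler()
  .setInputCols(datasetColumnsToBeUsed)
  .setOutputCol("featureVec"))

// assembled now has a single "featureVec" column ready for PCA
val assembled = assembler.transform(dataset).select("featureVec")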

Load Data for Machine Learning in Spark [duplicate]

I want to produce data in libsvm format, so I shaped my DataFrame into the desired layout, but I do not know how to convert it to libsvm format. The desired libsvm layout is user item:rating. What should I do in my current situation?
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
val user = ratings.map { case (user, product, rate) => (user, (product.toInt, rate.toDouble)) }
val usergroup = user.groupByKey
val data = usergroup.map { case (x, iter) => (x, iter.map(_._1).toArray, iter.map(_._2).toArray) }
val data_DF = data.toDF("user", "item", "rating")
I am using Spark 2.0.
The issue you are facing can be divided into the following steps:
Converting your ratings (I believe) into LabeledPoint data X.
Saving X in libsvm format.
1. Converting your ratings into LabeledPoint data X
Let's consider the following raw ratings:
val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")
You can handle those raw ratings as a coordinate list (COO) matrix.
Spark implements a distributed matrix backed by an RDD of its entries: CoordinateMatrix, where each entry is a tuple of (i: Long, j: Long, value: Double).
Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse (which is usually the case for user/item ratings).
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD
val data: RDD[MatrixEntry] =
  sc.parallelize(rawRatings).map { line =>
    val fields = line.split(",")
    val i = fields(0).toLong
    val j = fields(1).toLong
    val value = fields(2).toDouble
    MatrixEntry(i, j, value)
  }
Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:
val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
  .toIndexedRowMatrix().rows // Extract indexed rows
  .toDF("label", "features") // Convert the rows to a DataFrame
2. Saving LabeledPoint data in libsvm format
Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
Unfortunately we still can't use the DataFrameWriter directly: while most pipeline components support backward compatibility for loading, DataFrames from Spark versions prior to 2.0 that contain vector or matrix columns may need to be migrated to the new spark.ml vector and matrix types.
Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1):
import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)
Now let's save the DataFrame:
convertedVecDF.write.format("libsvm").save("data/foo")
And we can check the file contents:
$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
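As an optional sanity check, the saved directory can also be read back with the libsvm data source (a small sketch, assuming the same path as above):
val reloaded = spark.read.format("libsvm").load("data/foo")
reloaded.printSchema()   // roughly: label (double) and features (vector)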
EDIT:
In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save the LabeledPoint data in libsvm format as below:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
df.write.format("libsvm").save("data/foo")
In order to convert an existing DataFrame to a typed Dataset, I suggest using the following case class:
case class LibSvmEntry(
  value: Double,
  features: L.Vector)
Then you can use the map function to convert it to a LibSvmEntry like so:
df.map { r: Row => /* Do your stuff here */ }
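A possible fleshed-out version of that map, as a sketch only: it assumes the label/features DataFrame from the EDIT above, spark.implicits in scope, and that L is meant as an alias for org.apache.spark.ml.linalg (that alias is an assumption, not something stated in the answer):
import org.apache.spark.sql.Row
import org.apache.spark.ml.{linalg => L}
import spark.implicits._

case class LibSvmEntry(value: Double, features: L.Vector)

// Turn each Row of the label/features DataFrame into a typed LibSvmEntry
val typed = df.map { r: Row =>
  LibSvmEntry(r.getAs[Double]("label"), r.getAs[L.Vector]("features"))
}
// typed is a Dataset[LibSvmEntry] with columns "value" and "features"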
The libsvm format stores features as a sparse vector, so you can use pyspark.ml.linalg.SparseVector to solve the problem:
from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import udf

a = SparseVector(4, [1, 3], [3.0, 4.0])

def sparsevecfuc(len, index, score):
    """args: len int, index array, score array"""
    return SparseVector(len, index, score)

trans_sparse = udf(sparsevecfuc, VectorUDT())

Do not discard keys with null values when converting to JSON in PySpark DataFrame

I am creating a column in a DataFrame from several other columns that I want to store as a JSON serialized string. When the serialization to JSON occurs, keys with null values are dropped. Is there a way to keep keys even if the value is null?
Sample program illustrating the issue:
from pyspark.sql import functions as F
df = sc.parallelize([
    (1, 10),
    (2, 20),
    (3, None),
    (4, 40),
]).toDF(['id', 'data'])
df.collect()
#[Row(id=1, data=10),
# Row(id=2, data=20),
# Row(id=3, data=None),
# Row(id=4, data=40)]
df_s = df.select(F.struct('data').alias('struct'))
df_s.collect()
#[Row(struct=Row(data=10)),
# Row(struct=Row(data=20)),
# Row(struct=Row(data=None)),
# Row(struct=Row(data=40))]
df_j = df.select(F.to_json(F.struct('data')).alias('json'))
df_j.collect()
#[Row(json=u'{"data":10}'),
# Row(json=u'{"data":20}'),
# Row(json=u'{}'), <= would like this to be u'{"data":null}'
# Row(json=u'{"data":40}')]
Running Spark 2.1.0
I could not find a Spark-specific solution, so I just wrote a UDF using the Python json package:
import json
from pyspark.sql import types as T

def to_json(data):
    return json.dumps({'data': data})

to_json_udf = F.udf(to_json, T.StringType())
df.select(to_json_udf('data').alias('json')).collect()
# [Row(json=u'{"data": 10}'),
# Row(json=u'{"data": 20}'),
# Row(json=u'{"data": null}'),
# Row(json=u'{"data": 40}')]
Since PySpark 3, one can use the ignoreNullFields option when writing to a JSON file:
spark_dataframe.write.json(output_path, ignoreNullFields=False)
PySpark docs:
https://spark.apache.org/docs/3.1.1/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.json

How do I flatMap a row of arrays into multiple rows?

After parsing some JSON I have a one-column DataFrame of arrays:
scala> val jj =sqlContext.jsonFile("/home/aahu/jj2.json")
res68: org.apache.spark.sql.DataFrame = [r: array<bigint>]
scala> jj.first()
res69: org.apache.spark.sql.Row = [List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)]
I'd like to explode each row out into several rows. How?
edit:
Original json file:
{"r": [0,1,2,3,4,5,6,7,8,9]}
{"r": [0,1,2,3,4,5,6,7,8,9]}
I want an RDD or a DataFrame with 20 rows.
I can't simply use flatMap here; I'm not sure what the appropriate command in Spark is:
scala> jj.flatMap(r => r)
<console>:22: error: type mismatch;
found : org.apache.spark.sql.Row
required: TraversableOnce[?]
jj.flatMap(r => r)
You can use DataFrame.explode to achieve what you desire. Below is what I tried in spark-shell with your sample json data.
import scala.collection.mutable.ArrayBuffer
val jj1 = jj.explode("r", "r1") { list: ArrayBuffer[Long] => list.toList }
val jj2 = jj1.select($"r1")
jj2.collect
You can refer to the API documentation to learn more about DataFrame.explode.
I've tested this with Spark 1.3.1.
Or you can use the Row.getAs function:
import scala.collection.mutable.ArrayBuffer
val elementsRdd = jj.select(jj("r")).map(t => t.getAs[ArrayBuffer[Long]](0)).flatMap(x => x)
elementsRdd.count()
>>>Long = 20
elementsRdd.take(5)
>>>Array[Long] = Array(0, 1, 2, 3, 4)
In Spark 1.3+ you can use the explode function directly on the column of interest:
import org.apache.spark.sql.functions.explode
jj.select(explode($"r"))
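With the two sample JSON records above this should give the 20 rows asked for; a small sketch, assuming jj is the DataFrame loaded earlier and the SQL implicits for $ are in scope:
import org.apache.spark.sql.functions.explode

// One output row per array element
val exploded = jj.select(explode($"r").alias("r"))
exploded.count()   // 20 with the sample data above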

Resources