Create a vector of features by dropping specific columns in Spark ML - apache-spark

I use VectorAssembler to create a vector of features from >2000 columns so that I can run a PCA on it. I would normally explicitly state which columns need to be included in the feature vector:
val dataset = (spark.createDataFrame(
Seq((0, 1.2, 1.3, 1.7, 1.9), (1, 2.2, 2.3, 2.7, 2.9), (2, 3.2, 3.3, 3.5, 3.7))
).toDF("id", "f1", "f2", "f3", "f4"))
val assembler = (new VectorAssembler()
.setInputCols(Array("f2", "f3"))
.setOutputCol("featureVec"))
But in case of more than 2000 columns how can I specify that all columns except for "id" and "f1" should be included?
Any help appreciated!

One of the easiest ways is to get all the column names, convert them to a set, subtract the columns you don't need, and convert the result back to an array. Note that a Set does not guarantee that the original column order is kept:
val datasetColumnsToBeUsed = (dataset.columns.toSet - "id" - "f1").toArray
import org.apache.spark.ml.feature.VectorAssembler
val assembler = (new VectorAssembler()
.setInputCols(Array(datasetColumnsToBeUsed: _*))
.setOutputCol("featureVec"))
Another easy way, which also keeps the original column order, is to filter the column names:
val columnNames = dataset.columns
val datasetColumnsToBeUsed = columnNames.filterNot(x => Array("id", "f1").contains(x))
Then use it as above.
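For PySpark users, the same order-preserving filter can be sketched in plain Python, since `df.columns` is an ordinary list of names. This is only an illustration: `feature_columns` is a hypothetical helper, not part of Spark, and the column list is hard-coded here in place of a real DataFrame.

```python
def feature_columns(all_columns, excluded):
    """Return all_columns minus excluded, keeping the original order."""
    excluded = set(excluded)  # a set makes each membership check cheap
    return [name for name in all_columns if name not in excluded]


# Stand-in for dataset.columns from the question.
columns = ["id", "f1", "f2", "f3", "f4"]

print(feature_columns(columns, ["id", "f1"]))  # ['f2', 'f3', 'f4']
```

The resulting list could then be passed as `inputCols` to a VectorAssembler.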

Related

How to drop original columns in a spark ML transformer

When we run a Spark ML transformer, we provide input and output columns. The transformed dataset contains both kinds of columns, i.e. the original columns and the transformed columns,
e.g.
from pyspark.ml.feature import Imputer
df = spark.createDataFrame([
(1.0, float("nan")),
(2.0, float("nan")),
(float("nan"), 3.0),
(4.0, 4.0),
(5.0, 5.0)
], ["a", "b"])
imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)
model.transform(df).columns
This will print out
['a','b','out_a','out_b']
Is it possible to ask the transformer to output the transformed columns only?
I want this to happen inside the transformer, and I do not want to remove the columns with the drop method of the Spark DataFrame.

Spark Data frame Join: Non matching Records from first Dataframe

Hi all, I have 2 DataFrames and I'm applying a join condition on them.
After the join I want all the rows from the first DataFrame whose name, id, code, and lastname do not match the second DataFrame. I have written the code below.
val df3=df1.join(df2,df1("name") !== df2("name_2") &&
df1("id") !== df2("id_2") &&
df1("code") !== df2("code_2") &&
df1("lastname") !== df2("lastname_2"),"inner")
.drop(df2("id_2"))
.drop(df2("name_2"))
.drop(df2("code_2"))
.drop(df2("lastname_2"))
expected result.
DF1
id,name,code,lastname
1,A,001,p1
2,B,002,p2
3,C,003,p3
DF2
id_2,name_2,code_2,lastname_2
1,A,001,p1
2,B,002,p4
4,D,004,p4
DF3
id,name,code,lastname
3,C,003,p3
Can someone please tell me whether this is the correct way to do this, or should I use a SQL inner query with 'NOT IN'? I am new to Spark and am using the DataFrame methods for the first time, so I am not sure whether this is correct.
I recommend using the Spark API to work with the data:
val df1 =
Seq((1, "20181231"), (2, "20190102"), (3, "20190103"), (4, "20190104"), (5, "20190105")).toDF("id", "date")
val df2 =
Seq((1, "20181231"), (2, "20190102"), (4, "20190104"), (5, "20190105")).toDF("id", "date")
Option 1. You can get all rows of df1 that are not included in the other DataFrame (whole rows are compared):
val df3=df1.except(df2)
Option 2. You can do an anti join on specific fields, for example 'id':
val df3 = df1.as("table1").join(df2.as("table2"), $"table1.id" === $"table2.id", "leftanti")
df3.show()
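The difference between the two options shows up on the DF1/DF2 data from the question. The sketch below models the semantics in plain Python on lists of tuples; it is not Spark code, and `except_rows` and `left_anti` are hypothetical helper names. It assumes `except` compares whole rows and returns distinct results, while a left anti join compares only the chosen key.

```python
df1 = [(1, "A", "001", "p1"), (2, "B", "002", "p2"), (3, "C", "003", "p3")]
df2 = [(1, "A", "001", "p1"), (2, "B", "002", "p4"), (4, "D", "004", "p4")]


def except_rows(left, right):
    """Whole-row difference: distinct left rows that do not appear in right."""
    right_rows = set(right)
    seen = set()
    result = []
    for row in left:
        if row not in right_rows and row not in seen:
            seen.add(row)
            result.append(row)
    return result


def left_anti(left, right, key_index):
    """Keep left rows whose key column has no match in right."""
    right_keys = {row[key_index] for row in right}
    return [row for row in left if row[key_index] not in right_keys]


print(except_rows(df1, df2))   # rows 2 and 3: row 2 differs only in lastname
print(left_anti(df1, df2, 0))  # row 3 only: id 3 is absent from df2
```

On this data only the anti join on id reproduces the expected DF3, because row 2 differs from its counterpart in lastname alone and so survives a whole-row comparison.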

Load Data for Machine Learning in Spark [duplicate]

I want to produce libsvm format, so I shaped the DataFrame into the desired layout, but I do not know how to convert it to libsvm format. The format is as shown in the figure. The libsvm layout I am hoping for is user item:rating. This is what I have so far:
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
val fields = line.split(",")
(fields(0).toInt,fields(1).toInt,fields(2).toDouble)
}
val user = ratings.map{ case (user,product,rate) => (user,(product.toInt,rate.toDouble))}
val usergroup = user.groupByKey
val data =usergroup.map{ case(x,iter) => (x,iter.map(_._1).toArray,iter.map(_._2).toArray)}
val data_DF = data.toDF("user","item","rating")
I am using Spark 2.0.
The issue you are facing can be divided into the following steps:
Converting your ratings (I believe) into LabeledPoint data X.
Saving X in libsvm format.
1. Converting your ratings into LabeledPoint data X
Let's consider the following raw ratings :
val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")
You can handle those raw ratings as a coordinate list matrix (COO).
Spark implements a distributed matrix backed by an RDD of its entries : CoordinateMatrix where each entry is a tuple of (i: Long, j: Long, value: Double).
Note : A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse. (which is usually the case of user/item ratings.)
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD
val data: RDD[MatrixEntry] =
sc.parallelize(rawRatings).map {
line => {
val fields = line.split(",")
val i = fields(0).toLong
val j = fields(1).toLong
val value = fields(2).toDouble
MatrixEntry(i, j, value)
}
}
Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows :
val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
.toIndexedRowMatrix().rows // Extract indexed rows
.toDF("label", "features") // Convert rows
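To make the row extraction concrete, here is a plain-Python sketch of what grouping coordinate entries into sparse rows amounts to. It only models the idea and is not Spark code: `coo_to_sparse_rows` is a hypothetical helper, and the vector size is assumed to be the largest column index plus one.

```python
from collections import defaultdict

raw_ratings = ["0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0",
               "1,3,3.0", "3,3,4.0", "10,3,4.5"]


def coo_to_sparse_rows(lines):
    """Group 'i,j,value' entries by row i into (size, indices, values)."""
    entries = []
    for line in lines:
        i, j, value = line.split(",")
        entries.append((int(i), int(j), float(value)))

    # Assumption: the number of columns is the largest column index plus one.
    num_cols = max(j for _, j, _ in entries) + 1

    cells_by_row = defaultdict(list)
    for i, j, value in entries:
        cells_by_row[i].append((j, value))

    rows = {}
    for i, cells in cells_by_row.items():
        cells.sort()  # sparse vectors expect increasing indices
        rows[i] = (num_cols, [j for j, _ in cells], [v for _, v in cells])
    return rows


rows = coo_to_sparse_rows(raw_ratings)
print(rows[0])   # (4, [1, 3], [1.0, 3.0])
print(rows[10])  # (4, [3], [4.5])
```

Each key plays the role of the "label" column (the user) and each tuple the role of the sparse "features" vector (item index to rating).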
2. Saving LabeledPoint data in libsvm format
Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
Unfortunately we still can't use the DataFrameWriter directly. While most pipeline components support backward compatibility for loading, some existing DataFrames and pipelines from Spark versions prior to 2.0 that contain vector or matrix columns may need to be migrated to the new spark.ml vector and matrix types.
Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1.)
import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)
Now let's save the DataFrame :
convertedVecDF.write.format("libsvm").save("data/foo")
And we can check the file contents:
$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
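The two lines above follow the libsvm convention of `label index:value ...` with 1-based indices, whereas the vectors in the code use 0-based indices. A minimal plain-Python sketch of that mapping (`to_libsvm_line` is a hypothetical helper, not a Spark API):

```python
def to_libsvm_line(label, indices, values):
    """Format one example as 'label index:value ...' with 1-based indices."""
    pairs = " ".join(f"{i + 1}:{v}" for i, v in zip(indices, values))
    return f"{label} {pairs}"


# neg: sparse vector of size 3 with entries at 0-based positions 0 and 2
print(to_libsvm_line(0.0, [0, 2], [1.0, 3.0]))          # 0.0 1:1.0 3:3.0
# pos: dense vector (1.0, 0.0, 3.0), every position written out
print(to_libsvm_line(1.0, [0, 1, 2], [1.0, 0.0, 3.0]))  # 1.0 1:1.0 2:0.0 3:3.0
```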
EDIT:
In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format as below:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
df.write.format("libsvm").save("data/foo")
To convert an existing DataFrame to a typed Dataset, I suggest using the following case class:
// L is presumably an alias for the linalg package, e.g. import org.apache.spark.ml.{linalg => L}
case class LibSvmEntry (
value: Double,
features: L.Vector)
Then you can use the map function to convert each row to a LibSvmEntry like so:
df.map[LibSvmEntry]((r: Row) => /* Do your stuff here*/)
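The features column in libsvm data is a sparse vector, so you can use pyspark.ml.linalg.SparseVector to solve the problem:
from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import udf

a = SparseVector(4, [1, 3], [3.0, 4.0])

def sparsevecfuc(size, index, score):
    """
    args: size int, index array, score array
    """
    return SparseVector(size, index, score)

# UDF that builds a sparse vector column from size, index, and score columns
trans_sparse = udf(sparsevecfuc, VectorUDT())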

Using VectorAssembler in Spark

I have the following DataFrame (assume it is already a DataFrame):
val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12)))
.toDF("a", "b", "c")
and I want to combine some (not all) of the columns into one column and turn it into an RDD of Array[Double]. I am doing the following:
import org.apache.spark.ml.feature.VectorAssembler
val colSelected = List("a","b")
val assembler = new VectorAssembler()
.setInputCols(colSelected.toArray)
.setOutputCol("features")
val output = assembler.transform(df).select("features").rdd
Up to here it works. Now the output is an RDD[spark.sql.Row], and I am unable to transform it into an RDD[Array[Double]]. Is there any way?
I have tried something like the following but with no success:
output.map { case Row(a: Vector[Double]) => a.getAs[Array[Double]]("features")}
The correct solution (this assumes Spark 2.0+, in 1.x use o.a.s.mllib.linalg.Vector):
import org.apache.spark.ml.linalg.Vector
output.map(_.getAs[Vector]("features").toArray)
ml / mllib Vector created by VectorAssembler is not the same as scala.collection.Vector.
Row.getAs should be used with expected type. It doesn't perform any type conversions and o.a.s.ml(lib).linalg.Vector is not an Array[Double].

How to insert the rdd data into a dataframe in pyspark?

Please find the pseudocode below:
source dataframe with 5 columns
creating a target dataframe with schema (6 columns)
for item in source_dataframe:
    # adding a column to the list by checking item.column2
    list = [item.column1, item.column2, newcolumn]
    # creating an rdd out of this list
    # now I need to add this rdd to a target dataframe?????
You could definitely explain your question in a bit more detail or give some sample code. I'm interested in how others will solve this. My proposed solution is this one:
df = (
sc.parallelize([
(134, "2016-07-02 12:01:40"),
(134, "2016-07-02 12:21:23"),
(125, "2016-07-02 13:22:56"),
(125, "2016-07-02 13:27:07")
]).toDF(["itemid", "timestamp"])
)
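# Go through df.rdd: a DataFrame has no map method in Spark 2.x
rdd = df.rdd.map(lambda x: (x[0], x[1], 10))
df2 = rdd.toDF(["itemid", "timestamp", "newCol"])
# Combine Column conditions with & (Python's `and` does not work on Columns)
df3 = df.join(df2, (df.itemid == df2.itemid) & (df.timestamp == df2.timestamp), "inner").drop(df2.itemid).drop(df2.timestamp)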
I'm converting the RDD to a Dataframe. Afterwards I join both Dataframes, which duplicates some columns. So finally I drop those duplicated columns.
