How should I convert an RDD of org.apache.spark.ml.linalg.Vector to Dataset? - apache-spark

I'm struggling to understand how conversion among RDDs, Datasets and DataFrames works.
I'm pretty new to Spark, and I get stuck every time I need to pass from one data model to another (especially from RDDs to Datasets and DataFrames).
Could anyone explain to me the right way to do it?
As an example, I now have an RDD[org.apache.spark.ml.linalg.Vector] and I need to pass it to my machine learning algorithm, for example KMeans (from the Dataset-based Spark ML library). So I need to convert it to a Dataset with a single column named "features" that contains Vector-typed rows. How should I do this?

All you need is an Encoder. Imports:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.ml.linalg
RDD:
val rdd = sc.parallelize(Seq(
linalg.Vectors.dense(1.0, 2.0), linalg.Vectors.sparse(2, Array(), Array())
))
Conversion:
val ds = spark.createDataset(rdd)(ExpressionEncoder(): Encoder[linalg.Vector])
.toDF("features")
ds.show
// +---------+
// | features|
// +---------+
// |[1.0,2.0]|
// |(2,[],[])|
// +---------+
ds.printSchema
// root
// |-- features: vector (nullable = true)
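Since the question asks about feeding the result into KMeans, here is a minimal usage sketch with the dataset built above; the k value and the seed are arbitrary choices for illustration, not part of the original answer:
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans().setK(2).setFeaturesCol("features").setSeed(1L)
val model = kmeans.fit(ds)               // ds has the single "features" column
model.clusterCenters.foreach(println)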

To convert an RDD to a dataframe, the easiest way is to use toDF() in Scala. To use this function, it is necessary to import the implicits, which is done using the SparkSession object. It can be done as follows:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val df = rdd.toDF("features")
toDF() takes an RDD of tuples. When the RDD is built up of common Scala objects they will be implicitly converted, i.e. there is no need to do anything, and when the RDD has multiple columns there is no need to do anything either, since the RDD already contains a tuple. However, in this special case you first need to convert RDD[org.apache.spark.ml.linalg.Vector] to RDD[(org.apache.spark.ml.linalg.Vector)]. Therefore, it is necessary to do a conversion to a tuple as follows:
val df = rdd.map(Tuple1(_)).toDF("features")
The above will convert the RDD to a dataframe with a single column called features.
To convert to a dataset the easiest way is to use a case class. Make sure the case class is defined outside the Main object. First convert the RDD to a dataframe, then do the following:
case class A(features: org.apache.spark.ml.linalg.Vector)
val ds = df.as[A]
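As a quick sanity check (just a sketch), the typed Dataset now gives direct, compile-time access to the field:
ds.map(_.features.size).show()   // the vector length of each row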
To cover all possible conversions: the underlying RDD of a dataframe or dataset can be accessed using .rdd:
val rdd = df.rdd
Instead of converting back and forth between RDDs and dataframes/datasets, it's usually easier to do all the computations using the dataframe API. If there is no suitable function to do what you want, it's usually possible to define a UDF (user defined function). See for example here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html
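For instance, a minimal UDF sketch over the "features" column built above (the function and the output column name "first" are purely illustrative, not from the original answer):
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.Vector

// extract the first element of each feature vector
val firstElement = udf((v: Vector) => v(0))
df.withColumn("first", firstElement($"features")).show()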

Related

When to use map function on spark in transforming values

I'm new to Spark and working with it. Previously I worked with Python and pandas; pandas has a map function which is often used to apply a transformation on columns. I found out that Spark has a map function as well, but until now I haven't used it at all except for extracting values like this: df.select("id").map(r => r.getString(0)).collect.toList
import spark.implicits._
val df3 = df2.map(row => {
  val util = new Util()
  val fullName = row.getString(0) + row.getString(1) + row.getString(2)
  (fullName, row.getString(3), row.getInt(5))
})
val df3Map = df3.toDF("fullName", "id", "salary")
My questions are:
Is it common to use the map function to transform dataframe columns?
Is it common to use map like in the block of code above? (source: sparkbyexamples)
When do people usually use map?
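For comparison, the same transformation can be expressed with the column API instead of a typed map (a sketch; the column names "firstName", "middleName", "lastName", "id" and "salary" are assumptions, since the code above uses positional getters):
import org.apache.spark.sql.functions.concat

val df3 = df2.select(
  concat($"firstName", $"middleName", $"lastName").as("fullName"),
  $"id",
  $"salary"
)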

How to convert from dataframe to RDD and back with a case class [duplicate]

I am trying to convert a dataframe of multiple case classes to an RDD of these multiple case classes. I can't find any solution. This WrappedArray has driven me crazy :P
For example, assuming I am having the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
(randomClass2("a",Seq(randomClass("a1",1.1),randomClass("a2",1.1))),randomClass3("aa","aaa")),
(randomClass2("b",Seq(randomClass("b1",1.2),randomClass("b2",1.2))),randomClass3("bb","bbb")),
(randomClass2("c",Seq(randomClass("c1",3.2),randomClass("c2",1.2))),randomClass3("cc","Ccc"))))
val aDF = anRDD.toDF()
Assuming that I have the aDF, how can I get the anRDD?
I tried something like this just to get the second column but it was giving an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
Spark DataFrame / Dataset[Row] represents data as Row objects, using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar to get back a full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
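A quick check (just a sketch) that the elements come back as the original case classes:
val recovered = aDF.as[(randomClass2, randomClass3)].rdd
recovered.map(_._1.a).collect()   // Array(a, b, c)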
I don't know the Scala API, but have you considered the rdd value?
Maybe something like :
aDR.rdd.map { case r:Row => r.getAs[randomClass3]("_2")}

Appending data to an empty dataframe

I am creating an empty dataframe and later trying to append another dataframe to it. In fact I want to append many dataframes to the initially empty dataframe dynamically, depending on the number of RDDs coming in.
The union() function works fine if I assign the result to a third dataframe.
val df3=df1.union(df2)
But I want to keep appending to the initial (empty) dataframe I created, because I want to store all the RDDs in one dataframe. The code below, however, does not show the right counts. It seems that it simply did not append:
df1.union(df2)
df1.count() // this shows 0 although df2 has some data, which shows up if I assign the union to a third dataframe
If I do the below, I get a reassignment error since df1 is a val. And if I change it to a var, I get a "kafka multithreading not safe" error:
df1=d1.union(df2)
Any idea how to add all the dynamically created dataframes to one initially created data frame?
Not sure if this is what you are looking for!
# Import pyspark functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Define your schema
field = [StructField("Col1",StringType(), True), StructField("Col2", IntegerType(), True)]
schema = StructType(field)
# Your empty data frame
df = spark.createDataFrame(sc.emptyRDD(), schema)
l = []
for i in range(5):
    # Build and append to the list dynamically
    l = l + [([str(i), i])]
    # Create a temporary data frame similar to your original schema
    temp_df = spark.createDataFrame(l, schema)
    # Do the union with the original data frame
    df = df.union(temp_df)
df.show()
DataFrames and other distributed data structures are immutable, therefore methods which operate on them always return a new object. There is no appending, no modification in place, and no ALTER TABLE equivalent.
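In Scala, a common pattern under these constraints is to collect the dynamically created frames and reduce them with union; a minimal sketch, assuming df1 and df2 from the question share the same schema:
import org.apache.spark.sql.DataFrame

val frames: Seq[DataFrame] = Seq(df1, df2)   // add further frames here as they are created
val combined = frames.reduce(_ union _)
combined.count()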
And if I change it to var type, I get kafka multithreading not safe error.
Without the actual code it is impossible to give you a definitive answer, but it is unlikely to be related to the union code.
There are a number of known Spark bugs caused by incorrect internal implementation (SPARK-19185, SPARK-23623, to enumerate just a few).

How to convert Spark DataFrame column of sparse vectors to a column of dense vectors?

I used the following code:
df.withColumn("dense_vector", $"sparse_vector".toDense)
but it gives an error.
I am new to Spark, so this might be obvious and there may be obvious errors in my code line. Please help. Thank you!
Contexts which require an operation like this are relatively rare in Spark. With one or two exceptions, the Spark API expects the common Vector class, not a specific implementation (SparseVector, DenseVector). This is also true in the case of the distributed structures from o.a.s.mllib.linalg.distributed:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val df = Seq[(Long, Vector)](
(1L, Vectors.dense(1, 2, 3)), (2L, Vectors.sparse(3, Array(1), Array(3)))
).toDF("id", "v")
new RowMatrix(df.select("v")
.map(_.getAs[Vector]("v")))
.columnSimilarities(0.9)
.entries
.first
// org.apache.spark.mllib.linalg.distributed.MatrixEntry = MatrixEntry(0,2,1.0)
Nevertheless you could use a UDF like this:
val asDense = udf((v: Vector) => v.toDense)
df.withColumn("vd", asDense($"v")).show
// +---+-------------+-------------+
// | id| v| vd|
// +---+-------------+-------------+
// | 1|[1.0,2.0,3.0]|[1.0,2.0,3.0]|
// | 2|(3,[1],[3.0])|[0.0,3.0,0.0]|
// +---+-------------+-------------+
Just keep in mind that since version 2.0 Spark provides two different and incompatible Vector types:
o.a.s.ml.linalg.Vector
o.a.s.mllib.linalg.Vector
each with a corresponding SQL UDT. See MatchError while accessing vector column in Spark 2.0.
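For completeness, a sketch of the same UDF with the newer o.a.s.ml.linalg types (dfMl here is a hypothetical DataFrame with an ml Vector column "v", not one defined above):
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.sql.functions.udf

val asDenseMl = udf((v: MLVector) => v.toDense)
dfMl.withColumn("vd", asDenseMl($"v")).show()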

Spark SQL - Exact difference between Creating schema implicitly & Programmatically

I am trying to understand the exact difference between creating a schema implicitly and programmatically, and in which particular scenario each method should be used.
On the Databricks site the information is not that elaborate or explanatory.
As we can see, when using the reflection (implicit RDD to DF) way, we can create a case class by choosing specific columns from a text file using the map function.
And in the programmatic style - we load the dataset from a text file (similar to reflection),
create a schemaString (String) = "knowing the file we can specify the columns we need" (similar to the case class in the reflection way),
import the Row API - which will again map to the specific columns and data types used in the schema string (similar to case classes),
and then we create the DataFrame, after which everything is the same.
So what is the exact difference between these two approaches?
http://spark.apache.org/docs/1.5.2/sql-programming-guide.html#inferring-the-schema-using-reflection
http://spark.apache.org/docs/1.5.2/sql-programming-guide.html#programmatically-specifying-the-schema
Please Explain...
The produced schemas are the same, so from that point of view, there's no difference. In both cases, you're supplying a schema for your data, but in one case, you're doing it from a case class, in the other you can use collections, since a schema is built as a StructType(Array[StructField]).
So it's basically a choice between tuples and collections. The way I see it, the biggest difference is that case classes have to be in the code, while programmatically specifying the schema can be done at runtime, so you could, for instance, build a schema based on another DataFrame that you're reading at runtime.
As an example, I wrote a generic tool to "nest" data, reading from CSV, and transforming a set of prefixed field into an array of structs.
Since the tool is generic, and the schema is known only at runtime, I used the programmatic approach.
On the other hand, it's generally easier to code it with reflection, since you don't have to deal with all the StructField objects; when the fields are derived from the Hive metastore, their data types have to be mapped to your Scala types.
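As an illustration of that runtime flexibility, a sketch (the file paths are hypothetical placeholders) that reuses a schema discovered from another DataFrame at runtime:
// obtain a StructType at runtime from an existing DataFrame, no case class needed
val referenceDf = sqlContext.read.json("path/to/reference.json")
val runtimeSchema = referenceDf.schema
// apply that schema when reading another, similarly shaped source
val parsed = sqlContext.read.schema(runtimeSchema).json("path/to/new_data.json")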
Programmatically Specifying the Schema
When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.
Create an RDD of Rows from the original RDD;
Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.
For example:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Import Row.
import org.apache.spark.sql.Row;
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType};
// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
// Register the DataFrames as a table.
peopleDataFrame.registerTempTable("people")
Inferring the Schema Using Reflection
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table. Tables can be used in subsequent SQL statements.
For example:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
