How to store nested custom objects in Spark Dataset? - apache-spark

The question is a follow-up of How to store custom objects in Dataset?
Spark version: 3.0.1
Non-nested custom type is achievable:
import spark.implicits._
import org.apache.spark.sql.{Encoder, Encoders}
class AnObj(val a: Int, val b: String)
implicit val myEncoder: Encoder[AnObj] = Encoders.kryo[AnObj]
val d = spark.createDataset(Seq(new AnObj(1, "a")))
d.printSchema
root
|-- value: binary (nullable = true)
However, if the custom type is nested inside a product type (i.e. case class), it gives an error:
java.lang.UnsupportedOperationException: No Encoder found for InnerObj
import spark.implicits._
import org.apache.spark.sql.{Encoder, Encoders}
class InnerObj(val a: Int, val b: String)
case class MyObj(val i: Int, val j: InnerObj)
implicit val myEncoder: Encoder[InnerObj] = Encoders.kryo[InnerObj]
// error
val d = spark.createDataset(Seq(new MyObj(1, new InnerObj(0, "a"))))
// it gives Runtime error: java.lang.UnsupportedOperationException: No Encoder found for InnerObj
How can we create a Dataset with a nested custom type?

Adding the encoders for both MyObj and InnerObj should make it work.
class InnerObj(val a:Int, val b: String)
case class MyObj(val i: Int, j: InnerObj)
implicit val myEncoder: Encoder[InnerObj] = Encoders.kryo[InnerObj]
implicit val objEncoder: Encoder[MyObj] = Encoders.kryo[MyObj]
The above snippet compiles and runs fine.

Another solution apart from sujesh's:
import spark.implicits._
import org.apache.spark.sql.{Encoder, Encoders}
class InnerObj(val a: Int, val b: String)
case class MyObj[T](val i: Int, val j: T)
implicit val myEncoder: Encoder[MyObj[InnerObj]] = Encoders.kryo[MyObj[InnerObj]]
// works
val d = spark.createDataset(Seq(new MyObj(1, new InnerObj(0, "a"))))
This also shows a difference between the case where the inner type can be deduced from the type parameter and the case where it cannot.
In the former case, a single encoder for the parameterized outer type is enough:
implicit val myEncoder: Encoder[MyObj[InnerObj]] = Encoders.kryo[MyObj[InnerObj]]
In the latter case, define an encoder for each type:
implicit val myEncoder1: Encoder[InnerObj] = Encoders.kryo[InnerObj]
implicit val myEncoder2: Encoder[MyObj] = Encoders.kryo[MyObj]
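One consequence of Kryo-encoding the outer type is that the whole object is stored in a single binary column, so the nested fields can only be reached after deserialization (for example inside map), not through column expressions. The following is a minimal sketch of that, assuming a local Spark 3.x session; the object name NestedKryoSketch and the helper innerStrings are hypothetical, and the encoders are passed explicitly so the example does not depend on implicit resolution.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Same shapes as in the question: a plain class nested in a case class.
class InnerObj(val a: Int, val b: String)
case class MyObj(i: Int, j: InnerObj)

// Hypothetical names, used for illustration only.
object NestedKryoSketch {
  // Builds a Dataset[MyObj] with a Kryo encoder and reads the nested field back.
  def innerStrings(spark: SparkSession): Seq[String] = {
    val ds: Dataset[MyObj] =
      spark.createDataset(Seq(MyObj(1, new InnerObj(0, "a"))))(Encoders.kryo[MyObj])
    ds.printSchema() // one binary column, as in the non-nested case
    // The nested field is only reachable on the deserialized object.
    ds.map((o: MyObj) => o.j.b)(Encoders.STRING).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-kryo-sketch")
      .getOrCreate()
    println(innerStrings(spark))
    spark.stop()
  }
}
```

If the nested fields need to be queryable as columns, turning InnerObj into a case class avoids the Kryo encoder altogether.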

Related

Passing an entire row as an argument to spark udf through spark dataframe - throws AnalysisException

I am trying to pass an entire row to a Spark UDF along with a few other arguments. I am not using Spark SQL but the DataFrame withColumn API, and I get the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) col3#9 missing from col1#7,col2#8,col3#13 in operator !Project [col1#7, col2#8, col3#13, UDF(col3#9, col2, named_struct(col1, col1#7, col2, col2#8, col3, col3#9)) AS contcatenated#17]. Attribute(s) with the same name appear in the operation: col3. Please check if the right attribute(s) are used.;;
The above exception can be replicated using the below code:
addRowUDF() // call invokes
def addRowUDF() {
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().config(new SparkConf().set("spark.master", "local[*]")).appName(this.getClass.getSimpleName).getOrCreate()
import spark.implicits._
val df = Seq(
("a", "b", "c"),
("a1", "b1", "c1")).toDF("col1", "col2", "col3")
execute(df)
}
def execute(df: org.apache.spark.sql.DataFrame) {
import org.apache.spark.sql.Row
def concatFunc(x: Any, y: String, row: Row) = x.toString + ":" + y + ":" + row.mkString(", ")
import org.apache.spark.sql.functions.{ lit, struct, udf }
val combineUdf = udf((x: Any, y: String, row: Row) => concatFunc(x, y, row))
def udf_execute(udf: String, args: org.apache.spark.sql.Column*) = (combineUdf)(args: _*)
val columns = df.columns.map(df(_))
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
val df3 = df2.withColumn("contcatenated", udf_execute("uudf", df2.col("col3"), lit("col2"), struct(columns: _*)))
df3.show(false)
}
The output should be:
+----+----+-----------+----------------------------+
|col1|col2|col3       |contcatenated               |
+----+----+-----------+----------------------------+
|a   |b   |xxxxxxxxxxx|xxxxxxxxxxx:col2:a, b, c    |
|a1  |b1  |xxxxxxxxxxx|xxxxxxxxxxx:col2:a1, b1, c1 |
+----+----+-----------+----------------------------+
That happens because you refer to a column that is no longer in scope. When you call:
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
it shadows the original col3 column, effectively making preceding columns with the same name inaccessible. Even if that weren't the case, let's say after:
val df2 = df.select($"*", lit("xxxxxxxxxxx") as "col3")
the new col3 would be ambiguous, and indistinguishable by name from the one brought in by *.
So to achieve the required output you'll have to use another name:
val df2 = df.withColumn("col3_", lit("xxxxxxxxxxx"))
and then adjust the rest of your code accordingly:
df2.withColumn(
"contcatenated",
udf_execute("uudf", df2.col("col3_") as "col3",
lit("col2"), struct(columns: _*))
).drop("_3")
If the logic is as simple as the one in the example, you can of course just inline things:
df.withColumn(
"contcatenated",
udf_execute("uudf", lit("xxxxxxxxxxx") as "col3",
lit("col2"), struct(columns: _*))
).drop("_3")
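Another way to avoid the stale reference is to compute the concatenated column from the original DataFrame first and only then overwrite col3. The following is a minimal sketch of that ordering, assuming a local Spark session; the object name RowUdfSketch and the helper build are hypothetical, and the first UDF argument is typed as String rather than Any to keep the example simple.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct, udf}

// Hypothetical names, used for illustration only.
object RowUdfSketch {
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("col1", "col2", "col3")

    // UDF taking two strings plus the whole row packed into a struct.
    val combineUdf = udf((x: String, y: String, row: Row) =>
      x + ":" + y + ":" + row.mkString(", "))

    val replacement = lit("xxxxxxxxxxx")
    // Step 1: build the struct while the original col3 is still in scope.
    // Step 2: overwrite col3 afterwards; withColumn keeps its position.
    df.withColumn(
        "contcatenated",
        combineUdf(replacement, lit("col2"), struct(df.columns.map(col): _*)))
      .withColumn("col3", replacement)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-udf-sketch")
      .getOrCreate()
    build(spark).show(false)
    spark.stop()
  }
}
```

Because every column reference is resolved against a DataFrame that still contains it, no attribute goes missing from the plan.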

Group spark type-safe aggregations by multiple keys

In the snippet below the second aggregation fails (not surprising) with:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to spark_test.Record
package spark_test
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{DataFrame, Encoder, Encoders, SparkSession}
import org.scalatest.FunSuite
case class Record(k1: String, k2: String, v: Long) extends Serializable
class MyAggregator extends Aggregator[Record, Long, Long] {
override def zero: Long = 0
override def reduce(b: Long, a: Record): Long = a.v + b
override def merge(b1: Long, b2: Long): Long = b1 + b2
override def finish(reduction: Long): Long = reduction
override def bufferEncoder: Encoder[Long] = Encoders.scalaLong
override def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
class TypeSafeAggTest extends FunSuite {
lazy val spark: SparkSession = {
SparkSession
.builder()
.master("local")
.appName("spark test")
.getOrCreate()
}
test("agg flow") {
import spark.sqlContext.implicits._
val df: DataFrame = Seq(
("a", "b", 1),
("a", "b", 1),
("c", "d", 1)
).toDF("k1", "k2", "v")
val aggregator = new MyAggregator()
.toColumn.name("output")
df.as[Record]
.groupByKey(_.k1)
.agg(aggregator)
.show(truncate = false) // < --- works #######
df.as[Record]
.groupBy($"k1", $"k2")
.agg(aggregator)
.show(truncate = false) // < --- fails runtime #######
}
}
There is a very simple example page in the official docs, but it doesn't cover using type-safe aggregators with grouping (so it's unclear whether such a case is supported).
http://spark.apachecn.org/docs/en/2.2.0/sql-programming-guide.html#type-safe-user-defined-aggregate-functions
Is there a way to group by multiple keys when using Spark type-safe aggregators?
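groupBy returns an untyped RelationalGroupedDataset, so the typed aggregator receives generic Row objects instead of Record, hence the ClassCastException. To keep the aggregation typed, use groupByKey with a tuple key: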
.groupByKey(v=> (v.k1,v.k2))
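Put together with the aggregator from the question, the tuple-key grouping could look like the following minimal sketch. It assumes a local Spark session; the names SumV, MultiKeyAggSketch and sumsByKeys are hypothetical, and the aggregator is written as an object only for brevity.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Record(k1: String, k2: String, v: Long)

// Same summing logic as MyAggregator in the question.
object SumV extends Aggregator[Record, Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, r: Record): Long = acc + r.v
  def merge(a: Long, b: Long): Long = a + b
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Hypothetical names, used for illustration only.
object MultiKeyAggSketch {
  def sumsByKeys(spark: SparkSession): Map[(String, String), Long] = {
    import spark.implicits._
    val ds = Seq(
      Record("a", "b", 1L),
      Record("a", "b", 1L),
      Record("c", "d", 1L)
    ).toDS()

    // The key is a tuple, so the grouping stays typed and the
    // aggregator keeps receiving Record instances.
    ds.groupByKey(r => (r.k1, r.k2))
      .agg(SumV.toColumn.name("output"))
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-key-agg-sketch")
      .getOrCreate()
    println(sumsByKeys(spark))
    spark.stop()
  }
}
```

The result is a Dataset of (key, value) pairs whose key is itself a (k1, k2) tuple, which is why it can be collected straight into a Map.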

value toDF is not a member of org.apache.spark.rdd.RDD[Weather] [duplicate]

This code works in spark-shell,
but it does not compile in the IntelliJ IDE.
This is the error message:
Error:(59, 7) value toDF is not a member of org.apache.spark.rdd.RDD[Weather]
possible cause: maybe a semicolon is missing before `value toDF'?
}.toDF()
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.sql.DataFrame
case class Weather(
date: String,
day_of_week: String,
avg_temp: Double,
max_temp: Double,
min_temp: Double,
rainfall: Double,
daylight_hours: Double,
max_depth_snowfall: Double,
total_snowfall: Double,
solar_radiation: Double,
mean_wind_speed: Double,
max_wind_speed: Double,
max_instantaneous_wind_speed: Double,
avg_humidity: Double,
avg_cloud_cover: Double)
case class Tracffic(date: String, down: Double, up: Double)
case class Predict(describe: String, avg_temp: Double, rainfall: Double, weekend: Double, total_snowfall: Double)
object weather2 {
def main(args : Array[String]): Unit = {
val conf = new SparkConf().setMaster("local").setAppName("weather2")
val sc = new SparkContext(conf)
val weatherCSVTmp = sc.textFile("D:\\shared\\weather.csv")
val weatherHeader = sc.parallelize(Array(weatherCSVTmp.first))
val weatherCSV = weatherCSVTmp.subtract(weatherHeader)
val weatherDF = weatherCSV.map(_.split(",")).map { p =>
Weather(p(0),
p(1),
p(2).trim.toDouble,
p(3).trim.toDouble,
p(4).trim.toDouble,
p(5).trim.toDouble,
p(6).trim.toDouble,
p(7).trim.toDouble,
p(8).trim.toDouble,
p(9).trim.toDouble,
p(10).trim.toDouble,
p(11).trim.toDouble,
p(12).trim.toDouble,
p(13).trim.toDouble,
p(14).trim.toDouble)
}.toDF()//error
val tracfficCSVTmp = sc.textFile("D:\\shared\\tracffic_volume.csv")
val tracfficHeader = sc.parallelize(Array(tracfficCSVTmp.first))
val tracfficCSV = tracfficCSVTmp.subtract(tracfficHeader)
val tracfficDF = tracfficCSV.map(_.split(",")).map { p =>
Tracffic(p(0),
p(1).trim.toDouble,
p(2).trim.toDouble)
}.toDF() //error
val tracfficAndWeatherDF = tracfficDF.join(weatherDF, "date")
val isWeekend = udf((t: String) =>
t match {
case x if x.contains("Sunday") => 1d
case x if x.contains("Saturday") => 1d
case _ => 0d
})
val replacedtracfficAndWeatherDF = tracfficAndWeatherDF.withColumn(
"weekend", isWeekend(tracfficAndWeatherDF("day_of_week"))
).drop("day_of_week")
val va = new VectorAssembler().setInputCols {
Array("avg_temp", "weekend", "rainfall")
}.setOutputCol("input_vec")
val scaler = new StandardScaler().setInputCol(va.getOutputCol).setOutputCol("scaled_vec")
va.explainParams
scaler.explainParams
//down predict
val lr = new LinearRegression().setMaxIter(10).setFeaturesCol(scaler.getOutputCol).setLabelCol("down")
val pipeline = new Pipeline().setStages(Array(va, scaler, lr))
val pipelineModel = pipeline.fit(replacedtracfficAndWeatherDF)
val test = sc.parallelize(Seq(
Predict("Ussally Day", 20.0, 20, 0, 0),
Predict("Weekend", 20.0, 20, 1, 0),
Predict("Cold day", 3.0, 20, 0, 20)
)).toDF //error
val predictedDataDF = pipelineModel.transform(test)
val desAndPred = predictedDataDF.select("describe", "prediction").collect()
desAndPred.foreach {
case Row(describe: String, prediction: Double) =>
println(s"($describe) -> prediction = $prediction")
}
}
}
What is the problem? The libraries are Spark 2.11.x. Could you help me?
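toDF comes from implicit conversions that spark-shell imports automatically but a standalone application has to import itself. Add the following after creating sc and try again: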
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
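On Spark 2.0 and later the same implicits can also be taken from a SparkSession instead of a separately constructed SQLContext. The following is a minimal sketch, assuming a local session and two made-up input lines in place of the CSV file; the names Traffic, ToDfSketch and trafficDF are hypothetical. The case class is kept at the top level, outside the method that calls toDF.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Case classes used with toDF should be defined outside the calling method.
case class Traffic(date: String, down: Double, up: Double)

// Hypothetical names, used for illustration only.
object ToDfSketch {
  def trafficDF(spark: SparkSession): DataFrame = {
    // Brings toDF into scope for RDDs and Seqs of case classes.
    import spark.implicits._

    // Made-up sample lines standing in for the CSV contents.
    val lines = spark.sparkContext.parallelize(Seq(
      "2016-01-01, 10.0, 12.5",
      "2016-01-02, 11.0, 13.5"
    ))

    lines.map(_.split(","))
      .map(p => Traffic(p(0), p(1).trim.toDouble, p(2).trim.toDouble))
      .toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("todf-sketch")
      .getOrCreate()
    trafficDF(spark).printSchema()
    spark.stop()
  }
}
```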

Create an empty DataFrame with specified schema without SparkContext with SparkSession [duplicate]

I want to create an empty DataFrame with a specified schema in Scala. I have tried reading an empty JSON file, but I don't think that's the best practice.
Let's assume you want a data frame with the following schema:
root
|-- k: string (nullable = true)
|-- v: integer (nullable = false)
You simply define a schema for the data frame and use an empty RDD[Row]:
import org.apache.spark.sql.types.{
StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row
val schema = StructType(
StructField("k", StringType, true) ::
StructField("v", IntegerType, false) :: Nil)
// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
spark.createDataFrame(sc.emptyRDD[Row], schema)
PySpark equivalent is almost identical:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
StructField("k", StringType(), True), StructField("v", IntegerType(), False)
])
# or df = sc.parallelize([]).toDF(schema)
# Spark < 2.0
# sqlContext.createDataFrame([], schema)
df = spark.createDataFrame([], schema)
Using implicit encoders (Scala only) with Product types like Tuple:
import spark.implicits._
Seq.empty[(String, Int)].toDF("k", "v")
or case class:
case class KV(k: String, v: Int)
Seq.empty[KV].toDF
or
spark.emptyDataset[KV].toDF
As of Spark 2.0.0, you can do the following.
Case Class
Let's define a Person case class:
scala> case class Person(id: Int, name: String)
defined class Person
Import spark SparkSession implicit Encoders:
scala> import spark.implicits._
import spark.implicits._
And use SparkSession to create an empty Dataset[Person]:
scala> spark.emptyDataset[Person]
res0: org.apache.spark.sql.Dataset[Person] = [id: int, name: string]
Schema DSL
You could also use a Schema "DSL" (see Support functions for DataFrames in org.apache.spark.sql.ColumnName).
scala> val id = $"id".int
id: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,true)
scala> val name = $"name".string
name: org.apache.spark.sql.types.StructField = StructField(name,StringType,true)
scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType
scala> val mySchema = StructType(id :: name :: Nil)
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true))
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], mySchema)
emptyDF: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> emptyDF.printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
Java version to create empty DataSet:
public Dataset<Row> emptyDataSet(){
SparkSession spark = SparkSession.builder().appName("Simple Application")
.config("spark.master", "local").getOrCreate();
Dataset<Row> emptyDataSet = spark.createDataFrame(new ArrayList<>(), getSchema());
return emptyDataSet;
}
public StructType getSchema() {
String schemaString = "column1 column2 column3 column4 column5";
List<StructField> fields = new ArrayList<>();
StructField indexField = DataTypes.createStructField("column0", DataTypes.LongType, true);
fields.add(indexField);
for (String fieldName : schemaString.split(" ")) {
StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
return schema;
}
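import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType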
def createEmptyDataFrame[T: ru.TypeTag] =
hiveContext.createDataFrame(sc.emptyRDD[Row],
ScalaReflection.schemaFor(ru.typeTag[T].tpe).dataType.asInstanceOf[StructType]
)
case class RawData(id: String, firstname: String, lastname: String, age: Int)
val sourceDF = createEmptyDataFrame[RawData]
Here you can create a schema using StructType in Scala and pass an empty RDD, so you will be able to create an empty table.
The following code does this:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.BooleanType
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.types.StringType
//import org.apache.hadoop.hive.serde2.objectinspector.StructField
object EmptyTable extends App {
val conf = new SparkConf;
val sc = new SparkContext(conf)
//create sparksession object
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
//Created schema for three columns
val schema = StructType(
StructField("Emp_ID", LongType, true) ::
StructField("Emp_Name", StringType, false) ::
StructField("Emp_Salary", LongType, false) :: Nil)
//Created Empty RDD
var dataRDD = sc.emptyRDD[Row]
//pass rdd and schema to create dataframe
val newDFSchema = sparkSession.createDataFrame(dataRDD, schema)
newDFSchema.createOrReplaceTempView("tempSchema")
sparkSession.sql("create table Finaltable AS select * from tempSchema")
}
This is helpful for testing purposes.
Seq.empty[String].toDF()
Here is a solution that creates an empty DataFrame in PySpark 2.0.0 or later.
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
I had a special requirement: I already had a DataFrame but, given a certain condition, had to return an empty one, so I returned df.limit(0) instead.
I'd like to add the following syntax which was not yet mentioned:
Seq[(String, Integer)]().toDF("k", "v")
It makes it clear that the () part is for values. It's empty, so the dataframe is empty.
This syntax is also beneficial for adding null values manually. It just works, while other options either don't or are overly verbose.
As of Spark 2.4.3
val df = SparkSession.builder().getOrCreate().emptyDataFrame

Spark group by - Pig conversion

I am trying to achieve something like this in Spark. The following code snippet is from Pig Latin. Is there any way I can do the same thing with Spark?
A = load 'student' AS (name:chararray,age:int,gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
B = GROUP A BY age;
Result:
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Thanks.
It's easy to build a list of names by age with collect_list. To collect complete rows instead, the columns can be wrapped in struct before collecting.
// Input data
val df = {
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
import java.time.LocalDate
val simpleSchema = StructType(
StructField("name", StringType) ::
StructField("age", IntegerType) ::
StructField("gpa", FloatType) :: Nil)
val data = List(
Row("John", 18, 4.0f),
Row("Mary", 19, 3.8f),
Row("Bill", 20, 3.9f),
Row("Joe", 18, 3.8f)
)
spark.createDataFrame(data.asJava, simpleSchema)
}
df.show()
import org.apache.spark.sql.functions.{col, collect_list}
val df2 = df.groupBy(col("age")).agg(collect_list(col("name")))
df2.show()
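For complete rows rather than just names, one option that should work on Spark 2.0 and later is to wrap all columns in struct before calling collect_list, which gives one array of (name, age, gpa) structs per age, close to the Pig GROUP output. The following is a minimal sketch, assuming a local session; the names GroupRowsSketch and groupByAge and the output column name rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Hypothetical names, used for illustration only.
object GroupRowsSketch {
  def groupByAge(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq(
      ("John", 18, 4.0f),
      ("Mary", 19, 3.8f),
      ("Bill", 20, 3.9f),
      ("Joe", 18, 3.8f)
    ).toDF("name", "age", "gpa")

    // Pack every column into one struct, then collect the structs per group.
    df.groupBy(col("age"))
      .agg(collect_list(struct(df.columns.map(col): _*)).as("rows"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-rows-sketch")
      .getOrCreate()
    groupByAge(spark).show(false)
    spark.stop()
  }
}
```

The order of the structs inside each array is not guaranteed, so sort the array afterwards if the order matters.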
