Kryo encoder vs. RowEncoder in Spark Dataset - apache-spark

The purpose of the following examples is to understand the difference between the two encoders in a Spark Dataset.
I can do this:
val df = Seq((1, "a"), (2, "d")).toDF("id", "value")
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
val myStructType = StructType(Seq(StructField("id", IntegerType), StructField("value", StringType)))
implicit val myRowEncoder = RowEncoder(myStructType)
val ds = df.map{case row => row}
ds.show
//+---+-----+
//| id|value|
//+---+-----+
//| 1| a|
//| 2| d|
//+---+-----+
I can also do this:
val df = Seq((1, "a"), (2, "d")).toDF("id", "value")
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
implicit val myKryoEncoder: Encoder[Row] = Encoders.kryo[Row]
val ds = df.map{case row => row}
ds.show
//+--------------------+
//| value|
//+--------------------+
//|[01 00 6F 72 67 2...|
//|[01 00 6F 72 67 2...|
//+--------------------+
The only difference between the two snippets is the encoder: one uses the Kryo encoder, the other uses RowEncoder.
Questions:
What is the difference between the two?
Why does one display encoded values while the other displays human-readable values?
When should we use which?

Encoders.kryo simply creates an encoder that serializes objects of type T using Kryo.
RowEncoder is an object in Scala with apply and other factory methods.
RowEncoder can create an ExpressionEncoder[Row] from a schema.
Internally, apply creates a BoundReference for the Row type and returns an ExpressionEncoder[Row] built from the input schema, a CreateNamedStruct serializer (created with the internal serializerFor method), a deserializer for the schema, and the Row type.
RowEncoder knows about the schema and uses it for serialization and deserialization.
Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.
Kryo is good for efficiently storing large datasets and for network-intensive applications.
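A minimal sketch (assuming a Spark 2.x shell) that makes the difference visible by printing the schema each encoder exposes:
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

val myStructType = StructType(Seq(
  StructField("id", IntegerType),
  StructField("value", StringType)))

// RowEncoder keeps the column structure, so show() can render id and value
println(RowEncoder(myStructType).schema.simpleString)
// struct<id:int,value:string>

// Encoders.kryo collapses everything into a single binary column named "value"
println(Encoders.kryo[Row].schema.simpleString)
// struct<value:binary>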
for more information you can refer to these links:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-RowEncoder.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Encoders.html
https://medium.com/@knoldus/kryo-serialization-in-spark-55b53667e7ab
https://stackoverflow.com/questions/58946987/what-are-the-pros-and-cons-of-java-serialization-vs-kryo-serialization#:~:text=Kryo%20is%20significantly%20faster%20and,in%20advance%20for%20best%20performance.

According to Spark's documentation, Spark SQL does NOT use Kryo or Java serialization (by default).
Kryo is for RDDs and not DataFrames or Datasets, hence the question is a little off-beam afaik.
Does Kryo help in SparkSQL?
This elaborates on custom objects, but...
UPDATED Answer after some free time
Your example is not really what I would call a custom type. These are just structs with primitives. No issue.
Kryo is a serializer; Datasets and DataFrames use Encoders for the columnar advantage.
Kryo is used internally by Spark for shuffling.
A user-defined type like case class Foo(name: String, position: Point) is one that we can handle with a DS or DF, or via Kryo. But what is the point, when Tungsten and Catalyst work by "understanding the structure of the data" and are thus able to optimize? With Kryo you also get a single binary value, and I have found few examples of how to work successfully with it, e.g. a JOIN.
KRYO Example
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import spark.implicits._ // assumes an existing SparkSession named `spark`, e.g. in spark-shell

case class Point(a: Int, b: Int)
case class Foo(name: String, position: Point)

implicit val PointEncoder: Encoder[Point] = Encoders.kryo[Point]
implicit val FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]

val ds = Seq(new Foo("bar", new Point(0, 0))).toDS
ds.show()
returns:
+--------------------+
| value|
+--------------------+
|[01 00 D2 02 6C 6...|
+--------------------+
Encoder for DS using case class Example
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import spark.implicits._ // assumes an existing SparkSession named `spark`; the case-class Encoder is derived automatically

case class Point(a: Int, b: Int)
case class Foo(name: String, position: Point)

val ds = Seq(new Foo("bar", new Point(0, 0))).toDS
ds.show()
returns:
+----+--------+
|name|position|
+----+--------+
| bar| [0, 0]|
+----+--------+
This strikes me as the way to go with Spark, Tungsten, Catalyst.
Now, the more complicated stuff is when an Any is involved, but Any is not a good thing:
val data = Seq(
  ("sublime", Map(
    "good_song" -> "santeria",
    "bad_song" -> "doesn't exist")
  ),
  ("prince_royce", Map(
    "good_song" -> 4,
    "bad_song" -> "back it up")
  )
)
val schema = List(
  ("name", StringType, true),
  ("songs", MapType(StringType, StringType, true), true)
)
val rdd = spark.sparkContext.parallelize(data)
rdd.collect
val df = spark.createDataFrame(rdd)
df.show()
df.printSchema()
returns:
java.lang.UnsupportedOperationException: No Encoder found for Any
Then this example is interesting, as it is a valid custom-object use case: Spark No Encoder found for java.io.Serializable in Map[String, java.io.Serializable]. But I would stay away from such.
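A hedged sketch of the usual workaround for the Any problem above: coerce the mixed map values to a single concrete type (here String) so a standard encoder can be derived.
import spark.implicits._ // assumes an existing SparkSession named `spark`

// All map values are now Strings, so Map[String, String] has a built-in encoder
val cleaned = Seq(
  ("sublime", Map("good_song" -> "santeria", "bad_song" -> "doesn't exist")),
  ("prince_royce", Map("good_song" -> "4", "bad_song" -> "back it up"))
).toDF("name", "songs")

cleaned.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- songs: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: string (valueContainsNull = true)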
Conclusions
Kryo vs Encoder vs Java Serialization in Spark? states that Kryo is for RDDs, but that is for legacy reasons; internally Spark can still use it. Not 100% correct, but actually to the point.
Spark: Dataset Serialization is also an informative link.
Things have evolved, and the spirit is to not use Kryo for Datasets and DataFrames.
Hope this helps.

TL;DR: don't trust the show() method output. The internal structure of your case class is not lost.
I was confused about how the show() method rendered the content of the
tuples as one column named 'value' with binary content. I concluded
(incorrectly) that the dataframe was just a one column binary
blob that no longer adhered to the structure of a Tuple2[Integer,String]
with columns named id and value.
However, when I printed the actual content of the collected data frame
I saw the correct values, column names and types. So I think this is just
an issue with the show() method.
The program below should serve to reproduce my results:
import java.util

import org.apache.spark.sql.{Encoder, Encoders, Row, SparkSession}

object X extends App {
  val sparkSession = SparkSession.builder().appName("tests")
    .master("local")
    .getOrCreate()
  import sparkSession.implicits._

  val df = Seq((1, "a"), (2, "d")).toDF("id", "value")
  implicit val myKryoEncoder: Encoder[Row] = Encoders.kryo[Row]
  val ds = df.map { case row => row }
  ds.show // This shows only one column 'value' with binary content

  // This shows that the schema and values are actually correct. The below will print:
  // row schema: StructType(StructField(id,IntegerType,false),StructField(value,StringType,true))
  // row: [1,a]
  // row schema: StructType(StructField(id,IntegerType,false),StructField(value,StringType,true))
  // row: [2,d]
  val collected: util.List[Row] = ds.collectAsList()
  collected.forEach { row =>
    System.out.println("row schema: " + row.schema)
    System.out.println("row: " + row)
  }
}

Related

Output not in readable format in Spark 2.2.0 Dataset

Following is the code which I am trying to execute with Spark 2.2.0 in the IntelliJ IDE, but the output I am getting doesn't look to be in a readable format.
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example").master("local[2]")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
import scala.reflect.ClassTag

implicit def kryoEncoder[A](implicit ct: ClassTag[A]) =
  org.apache.spark.sql.Encoders.kryo[A](ct)
case class Person(name: String, age: Long)
// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
Output shown :
+--------------------+
| value|
+--------------------+
|[01 00 44 61 74 6...|
+--------------------+
Can anyone explain if I am missing anything here?
Thanks
This is because you're using the Kryo encoder, which is not designed to deserialize objects for show.
In general you should never use the Kryo encoder when more precise encoders are available. It has poorer performance and fewer features. Instead, use the Product encoder:
spark.createDataset(Seq(Person("Andy", 32)))(Encoders.product[Person])
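For completeness, a minimal sketch (assuming a local SparkSession named spark) showing that with the Product encoder the columns stay readable:
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().appName("product-encoder").master("local[2]").getOrCreate()

case class Person(name: String, age: Long)

// Pass the Product encoder explicitly instead of relying on a blanket Kryo implicit
val ds = spark.createDataset(Seq(Person("Andy", 32)))(Encoders.product[Person])
ds.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+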

Group a dataset by a derived column value

I want to group the following dataset by a derived value of the timestamp column, namely the year, so a predictable substring of the timestamp column.
doi timestamp
10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00
10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00
10.1007/s13595-016-0548-3 2015-06-08T17:01:10.000010+01:00
I realise that I can add my own derived column and filter based on this, but is there a way to specify it in a single groupBy statement, without adding an additional column purely for grouping purposes?
If I understand your question correctly, you'll need to extract the year inside the group by clause:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{unix_timestamp, year}

val sc: SparkContext = ??? // I assume you are able to create both your SparkContext and SQLContext yourself
val sqlContext: SQLContext = ???
import sqlContext.implicits._ // needed to use implicits like .toDF

val data = Seq(
  "10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00",
  "10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00",
  "10.1007/s13595-016-0548-3 2015-06-08T17:01:10.000010+01:00")
// data: Seq[String] = List(10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00, 10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00, 10.1007/s13595-016-0548-3 2015-06-08T17:01:10.000010+01:00)

// note the 24-hour HH pattern for the hour field
val df = sc.parallelize(data).map(_.split("\\s+") match {
  case Array(doi, time) => (doi, time)
}).toDF("doi", "timestamp")
  .withColumn("timestamp", unix_timestamp($"timestamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp"))
// df: org.apache.spark.sql.DataFrame = [doi: string, timestamp: timestamp]

df.groupBy(year($"timestamp").as("year")).count.show
// +----+-----+
// |year|count|
// +----+-----+
// |2015| 1|
// |2016| 2|
// +----+-----+
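If the timestamp is kept as a plain string, a hedged alternative is to group directly on a substring expression, which avoids the timestamp conversion altogether:
import org.apache.spark.sql.functions.substring

// assumes `timestamp` is still the raw string column, e.g. "2016-06-09T18:29:46.000046+01:00"
df.groupBy(substring($"timestamp", 1, 4).as("year")).count.show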

How to convert Spark DataFrame column of sparse vectors to a column of dense vectors?

I used the following code:
df.withColumn("dense_vector", $"sparse_vector".toDense)
but it gives an error.
I am new to Spark, so this might be obvious and there may be obvious errors in my code line. Please help. Thank you!
Contexts which require an operation like this are relatively rare in Spark. With one or two exceptions, the Spark API expects the common Vector class, not a specific implementation (SparseVector, DenseVector). This is also true for the distributed structures from o.a.s.mllib.linalg.distributed.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val df = Seq[(Long, Vector)](
  (1L, Vectors.dense(1, 2, 3)), (2L, Vectors.sparse(3, Array(1), Array(3)))
).toDF("id", "v")

new RowMatrix(df.select("v")
    .map(_.getAs[Vector]("v")))
  .columnSimilarities(0.9)
  .entries
  .first
// org.apache.spark.mllib.linalg.distributed.MatrixEntry = MatrixEntry(0,2,1.0)
Nevertheless you could use a UDF like this:
val asDense = udf((v: Vector) => v.toDense)
df.withColumn("vd", asDense($"v")).show
// +---+-------------+-------------+
// | id| v| vd|
// +---+-------------+-------------+
// | 1|[1.0,2.0,3.0]|[1.0,2.0,3.0]|
// | 2|(3,[1],[3.0])|[0.0,3.0,0.0]|
// +---+-------------+-------------+
Just keep in mind that since version 2.0 Spark provides two different and incompatible Vector types:
o.a.s.ml.linalg.Vector
o.a.s.mllib.linalg.Vector
each with a corresponding SQL UDT. See MatchError while accessing vector column in Spark 2.0.
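For the o.a.s.ml side, a hedged sketch of the same UDF against the Spark 2.x ml.linalg types:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._ // assumes an existing SparkSession named `spark`

val dfMl = Seq[(Long, Vector)](
  (1L, Vectors.dense(1, 2, 3)),
  (2L, Vectors.sparse(3, Array(1), Array(3)))
).toDF("id", "v")

// same idea as above, but for the ml.linalg Vector hierarchy
val asDenseMl = udf((v: Vector) => v.toDense)

dfMl.withColumn("vd", asDenseMl($"v")).show()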

Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function

I have a table of two string type columns (username, friend) and for each username, I want to collect all of its friends on one row, concatenated as strings. For example: ('username1', 'friends1, friends2, friends3')
I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?
Before you proceed: this operation is yet another groupByKey. While it has multiple legitimate applications, it is relatively expensive, so be sure to use it only when required.
Not an exactly concise or efficient solution, but you can use the UserDefinedAggregateFunction introduced in Spark 1.5.0:
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

object GroupConcat extends UserDefinedAggregateFunction {
  def inputSchema = new StructType().add("x", StringType)
  def bufferSchema = new StructType().add("buff", ArrayType(StringType))
  def dataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, ArrayBuffer.empty[String])
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0))
      buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
  }

  def evaluate(buffer: Row) = UTF8String.fromString(
    buffer.getSeq[String](0).mkString(","))
}
Example usage:
val df = sc.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")
)).toDF("username", "friend")
df.groupBy($"username").agg(GroupConcat($"friend")).show
## +---------+---------------+
## | username| friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+
You can also create a Python wrapper as shown in Spark: How to map Python with Scala or Java User Defined Functions?
In practice it can be faster to extract the RDD, groupByKey, mkString and rebuild the DataFrame.
You can get a similar effect by combining the collect_list function (Spark >= 1.6.0) with concat_ws:
import org.apache.spark.sql.functions.{collect_list, concat_ws}

df.groupBy($"username")
  .agg(concat_ws(",", collect_list($"friend")).alias("friends"))
You can try the collect_list function
sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A
Or you can regieter a UDF something like
sqlContext.udf.register("myzip",(a:Long,b:Long)=>(a+","+b))
and you can use this function in the query
sqlConttext.sql("select A,collect_list(myzip(B,C)) from tbl group by A")
In Spark 2.4+ this has become simpler with the help of collect_list() and array_join().
Here's a demonstration in PySpark, though the code should be very similar for Scala too:
from pyspark.sql.functions import array_join, collect_list
friends = spark.createDataFrame(
[
('jacques', 'nicolas'),
('jacques', 'georges'),
('jacques', 'francois'),
('bob', 'amelie'),
('bob', 'zoe'),
],
schema=['username', 'friend'],
)
(
friends
.orderBy('friend', ascending=False)
.groupBy('username')
.agg(
array_join(
collect_list('friend'),
delimiter=', ',
).alias('friends')
)
.show(truncate=False)
)
In Spark SQL the solution is likewise:
SELECT
username,
array_join(collect_list(friend), ', ') AS friends
FROM friends
GROUP BY username;
The output:
+--------+--------------------------+
|username|friends |
+--------+--------------------------+
|jacques |nicolas, georges, francois|
|bob |zoe, amelie |
+--------+--------------------------+
This is similar to MySQL's GROUP_CONCAT() and Redshift's LISTAGG().
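A hedged Scala rendering of the same Spark 2.4+ approach (assuming spark.implicits._ is in scope):
import org.apache.spark.sql.functions.{array_join, collect_list}
import spark.implicits._ // assumes an existing SparkSession named `spark`

val friends = Seq(
  ("jacques", "nicolas"),
  ("jacques", "georges"),
  ("jacques", "francois"),
  ("bob", "amelie"),
  ("bob", "zoe")
).toDF("username", "friend")

friends
  .orderBy($"friend".desc)
  .groupBy($"username")
  .agg(array_join(collect_list($"friend"), ", ").as("friends"))
  .show(false)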
Here is a function you can use in PySpark:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def group_concat(col, distinct=False, sep=','):
    if distinct:
        collect = F.collect_set(col.cast(StringType()))
    else:
        collect = F.collect_list(col.cast(StringType()))
    return F.concat_ws(sep, collect)

table.groupby('username').agg(group_concat(table.friend).alias('friends'))
In SQL:
select username, concat_ws(',', collect_list(friends)) as friends
from table
group by username
-- the Spark SQL solution with collect_set
SELECT id, concat_ws(', ', sort_array( collect_set(colors))) as csv_colors
FROM (
VALUES ('A', 'green'),('A','yellow'),('B', 'blue'),('B','green')
) as T (id, colors)
GROUP BY id
One way to do it with pyspark < 1.6, which unfortunately doesn't support user-defined aggregate functions:
byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)
and if you want to make it a dataframe again:
sqlContext.createDataFrame(byUsername, ["username", "friends"])
As of 1.6, you can use collect_list and then join the created list:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
join_ = F.udf(lambda x: ", ".join(x), StringType())
df.groupBy("username").agg(join_(F.collect_list("friend").alias("friends"))
Language: Scala
Spark version: 1.5.2
I had the same issue and also tried to resolve it using UDFs but, unfortunately, this led to more problems later in the code due to type inconsistencies. I was able to work my way around this by first converting the DF to an RDD, then grouping and manipulating the data in the desired way, and then converting the RDD back to a DF, as follows:
val df = sc
.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")))
.toDF("username", "friend")
+---------+-------+
| username| friend|
+---------+-------+
|username1|friend1|
|username1|friend2|
|username2|friend1|
|username2|friend3|
+---------+-------+
val dfGRPD = df
  .map(row => (row.getString(0), row.getString(1)))
  .groupByKey()
  .map { case (username, groupOfFriends) => (username, groupOfFriends.mkString(",")) }
  .toDF("username", "groupOfFriends")
+---------+---------------+
| username| groupOfFriends|
+---------+---------------+
|username1|friend2,friend1|
|username2|friend3,friend1|
+---------+---------------+
Below is Python-based code that achieves the group_concat functionality.
Input Data:
Cust_No,Cust_Cars
1, Toyota
2, BMW
1, Audi
2, Hyundai
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import pyspark.sql.functions as F

spark = SparkSession.builder.master('yarn').getOrCreate()

# UDF to join all list elements with "|"
def combine_cars(car_list, sep='|'):
    return sep.join(car_list)

test_udf = udf(combine_cars, StringType())

# car_list_per_customer: DataFrame with columns Cust_No, Cust_Cars, loaded from the input data above
car_list_per_customer.groupBy("Cust_No") \
    .agg(F.collect_list("Cust_Cars").alias("car_list")) \
    .select("Cust_No", test_udf("car_list").alias("Final_List")) \
    .show(20, False)
Output Data:
Cust_No, Final_List
1, Toyota|Audi
2, BMW|Hyundai
You can also use the Spark SQL function collect_list; afterwards you will need to cast to string and use regexp_replace to replace the special characters.
regexp_replace(regexp_replace(regexp_replace(cast(collect_list((column)) as string), ' ', ''), ',', '|'), '[^A-Z0-9|]', '')
It's an easier way.
The concat_ws() and collect_list() functions can be a good alternative, along with groupBy():
import pyspark.sql.functions as F
df_grp = df.groupby("agg_col").agg(
    F.concat_ws("#;", F.collect_list(df.time)).alias("time"),
    F.concat_ws("#;", F.collect_list(df.status)).alias("status"),
    F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"))
Sample Output
+-------+------------------+----------------+---------------------+
|agg_col|time |status |llamaType |
+-------+------------------+----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+

How to read a nested collection in Spark

I have a Parquet table where one of the columns is
array<struct<col1,col2,..colN>>
I can run queries against this table in Hive using the LATERAL VIEW syntax.
How do I read this table into an RDD, and more importantly, how do I filter, map etc. this nested collection in Spark?
Could not find any references to this in Spark documentation. Thanks in advance for any information!
P.S. I felt it might be helpful to give some stats on the table.
Number of columns in the main table: ~600. Number of rows: ~200m.
Number of "columns" in the nested collection: ~10. Avg number of records in the nested collection: ~35.
There is no magic in the case of a nested collection. Spark will handle an RDD[(String, String)] and an RDD[(String, Seq[String])] the same way.
Reading such a nested collection from Parquet files can be tricky, though.
Let's take an example from the spark-shell (1.3.1):
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class Inner(a: String, b: String)
defined class Inner
scala> case class Outer(key: String, inners: Seq[Inner])
defined class Outer
Write the parquet file:
scala> val outers = sc.parallelize(List(Outer("k1", List(Inner("a", "b")))))
outers: org.apache.spark.rdd.RDD[Outer] = ParallelCollectionRDD[0] at parallelize at <console>:25
scala> outers.toDF.saveAsParquetFile("outers.parquet")
Read the parquet file:
scala> import org.apache.spark.sql.catalyst.expressions.Row
import org.apache.spark.sql.catalyst.expressions.Row
scala> val dataFrame = sqlContext.parquetFile("outers.parquet")
dataFrame: org.apache.spark.sql.DataFrame = [key: string, inners: array<struct<a:string,b:string>>]
scala> val outers = dataFrame.map { row =>
| val key = row.getString(0)
| val inners = row.getAs[Seq[Row]](1).map(r => Inner(r.getString(0), r.getString(1)))
| Outer(key, inners)
| }
outers: org.apache.spark.rdd.RDD[Outer] = MapPartitionsRDD[8] at map at DataFrame.scala:848
The important part is row.getAs[Seq[Row]](1). The internal representation of a nested sequence of structs is ArrayBuffer[Row]; you could use any super-type of it instead of Seq[Row]. The 1 is the column index in the outer row. I used the method getAs here, but there are alternatives in the latest versions of Spark. See the source code of the Row trait.
Now that you have an RDD[Outer], you can apply any wanted transformation or action.
// Filter the outers
outers.filter(_.inners.nonEmpty)
// Filter the inners
outers.map(outer => outer.copy(inners = outer.inners.filter(_.a == "a")))
Note that we used the Spark SQL library only to read the Parquet file. You could, for example, select only the wanted columns directly on the DataFrame before mapping it to an RDD.
dataFrame.select('col1, 'col2).map { row => ... }
I'll give a Python-based answer since that's what I'm using. I think Scala has something similar.
The explode function was added in Spark 1.4.0 to handle nested arrays in DataFrames, according to the Python API docs.
Create a test dataframe:
from pyspark.sql import Row
df = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3]), Row(a=2, intlist=[4,5,6])])
df.show()
## +-+--------------------+
## |a| intlist|
## +-+--------------------+
## |1|ArrayBuffer(1, 2, 3)|
## |2|ArrayBuffer(4, 5, 6)|
## +-+--------------------+
Use explode to flatten the list column:
from pyspark.sql.functions import explode
df.select(df.a, explode(df.intlist)).show()
## +-+---+
## |a|_c0|
## +-+---+
## |1| 1|
## |1| 2|
## |1| 3|
## |2| 4|
## |2| 5|
## |2| 6|
## +-+---+
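A hedged Scala equivalent of the flattening step (assuming spark.implicits._ is in scope):
import org.apache.spark.sql.functions.explode
import spark.implicits._ // assumes an existing SparkSession named `spark`

val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5, 6))).toDF("a", "intlist")

// explode produces one output row per element of the nested array
df.select($"a", explode($"intlist").as("int")).show()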
Another approach would be using pattern matching like this:
val rdd: RDD[(String, List[(String, String)])] = dataFrame.map(_.toSeq.toList match {
  case List(key: String, inners: Seq[Row]) => key -> inners.map(_.toSeq.toList match {
    case List(a: String, b: String) => (a, b)
  }).toList
})
You can pattern match directly on Row but it is likely to fail for a few reasons.
The above answers are all great and tackle this question from different sides; Spark SQL is also a quite useful way to access nested data.
Here's an example of how to use explode() in SQL directly to query a nested collection.
SELECT hholdid, tsp.person_seq_no
FROM ( SELECT hholdid, explode(tsp_ids) as tsp
FROM disc_mrt.unified_fact uf
)
tsp_ids is a nested array of structs, which has many attributes, including person_seq_no, which I'm selecting in the outer query above.
The above was tested in Spark 2.0. I did a small test and it doesn't work in Spark 1.6. This question was asked when Spark 2 wasn't around, so this answer adds nicely to the list of available options to deal with nested structures.
Also have a look at the following JIRA for a Hive-compatible way to query nested data using the LATERAL VIEW OUTER syntax, since Spark 2.2 also supports OUTER explode (e.g. when a nested collection is empty, but you still want to have attributes from a parent record):
SPARK-13721: Add support for LATERAL VIEW OUTER explode()
Notable unresolved JIRA on explode() for SQL access:
SPARK-7549: Support aggregating over nested fields
