Group a dataset by a derived column value - apache-spark

I want to group the following dataset by a derived value of the timestamp column, namely the year, i.e. a predictable substring of the timestamp column.
doi timestamp
10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00
10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00
10.1007/s13595-016-0548-3 2015-06-08T17:01:10.000010+01:00
I realise that I can add my own derived column and filter based on this, but is there a way to specify it in a single groupBy statement, without adding an additional column purely for grouping purposes?

If I understand your question correctly, you'll need to extract the year inside the groupBy clause:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{unix_timestamp, year}

val sc: SparkContext = ???          // assuming you can create your SparkContext and SQLContext yourself
val sqlContext: SQLContext = ???
import sqlContext.implicits._       // needed for implicits such as .toDF

val data = Seq(
  "10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00",
  "10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00",
  "10.1007/s13595-016-0548-3 2015-06-08T17:01:10.000010+01:00")
// data: Seq[String] = List(10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00, 10.1515/cipms-2015-0089 2016-06-09T18:29:46.000046+01:00, 10.1007/s13595-016-0548-3 2015-06-08T17:01:10.000010+01:00)

val df = sc.parallelize(data).map(_.split("\\s+") match {
  case Array(doi, time) => (doi, time)
}).toDF("doi", "timestamp")
  .withColumn("timestamp", unix_timestamp($"timestamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp"))
// df: org.apache.spark.sql.DataFrame = [doi: string, timestamp: timestamp]

df.groupBy(year($"timestamp").as("year")).count.show
// +----+-----+
// |year|count|
// +----+-----+
// |2015| 1|
// |2016| 2|
// +----+-----+
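Since the question mentions that the year is a predictable substring of the timestamp, a shorter variant is to group on that substring directly; a minimal sketch, assuming the raw string column (substring also works on a timestamp column via an implicit cast to string):
import org.apache.spark.sql.functions.substring

// The year is the first four characters of the ISO-8601 string.
df.groupBy(substring($"timestamp", 1, 4).as("year")).count.show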

Related

Kryo encoder v.s. RowEncoder in Spark Dataset

The purpose of the following examples is to understand the difference between the two encoders for Spark Datasets.
I can do this:
val df = Seq((1, "a"), (2, "d")).toDF("id", "value")
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
val myStructType = StructType(Seq(StructField("id", IntegerType), StructField("value", StringType)))
implicit val myRowEncoder = RowEncoder(myStructType)
val ds = df.map{case row => row}
ds.show
//+---+-----+
//| id|value|
//+---+-----+
//| 1| a|
//| 2| d|
//+---+-----+
I can also do this:
val df = Seq((1, "a"), (2, "d")).toDF("id", "value")
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
implicit val myKryoEncoder: Encoder[Row] = Encoders.kryo[Row]
val ds = df.map{case row => row}
ds.show
//+--------------------+
//| value|
//+--------------------+
//|[01 00 6F 72 67 2...|
//|[01 00 6F 72 67 2...|
//+--------------------+
The only difference between the two snippets is that one uses the Kryo encoder and the other uses RowEncoder.
Questions:
What is the difference between the two?
Why does one display encoded binary values while the other displays human-readable values?
When should we use which?
Encoders.kryo simply creates an encoder that serializes objects of type T using Kryo.
RowEncoder is an object in Scala with apply and other factory methods.
RowEncoder can create an ExpressionEncoder[Row] from a schema.
Internally, apply creates a BoundReference for the Row type and returns an ExpressionEncoder[Row] for the input schema, with a CreateNamedStruct serializer (built via the internal serializerFor method) and a matching deserializer for the schema.
RowEncoder knows about the schema and uses it for serialization and deserialization.
Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but it does not support all Serializable types and requires you to register the classes you'll use in the program in advance for best performance.
Kryo is a good fit for efficiently storing large datasets and for network-intensive applications.
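For completeness, that class registration happens on the SparkConf before the context/session is created. A minimal sketch (the Foo and Point case classes are just placeholders for your own types):
import org.apache.spark.SparkConf

// Placeholder classes used only to illustrate registration.
case class Point(a: Int, b: Int)
case class Foo(name: String, position: Point)

val conf = new SparkConf()
  .setAppName("kryo-registration-example")
  // Use Kryo instead of Java serialization for RDD/shuffle data.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front avoids writing full class names into the serialized stream.
  .registerKryoClasses(Array(classOf[Foo], classOf[Point]))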
For more information you can refer to these links:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-RowEncoder.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Encoders.html
https://medium.com/#knoldus/kryo-serialization-in-spark-55b53667e7ab
https://stackoverflow.com/questions/58946987/what-are-the-pros-and-cons-of-java-serialization-vs-kryo-serialization#:~:text=Kryo%20is%20significantly%20faster%20and,in%20advance%20for%20best%20performance.
According to Spark's documentation, Spark SQL does not use Kryo or Java serialization (by default).
Kryo is for RDDs, not DataFrames or Datasets, so the question is a little off-beam as far as I know.
Does Kryo help in SparkSQL?
This elaborates on custom objects, but...
Updated answer after some free time
Your examples are not really what I would call custom types; they are just structs of primitives, so there is no issue.
Kryo is a serializer; Datasets and DataFrames use Encoders for their columnar advantage.
Kryo is used internally by Spark for shuffling.
A user-defined type such as case class Foo(name: String, position: Point) is something we can handle with a Dataset, a DataFrame, or via Kryo. But then what is the point of Tungsten and Catalyst, which work by "understanding the structure of the data" and are thus able to optimize? With Kryo you also get a single binary value, and I have found few examples of how to work with it successfully, e.g. a JOIN.
Kryo example
import org.apache.spark.sql.{Encoder, Encoders}
import spark.implicits._ // assumes an existing SparkSession named `spark`, e.g. in spark-shell
case class Point(a: Int, b: Int)
case class Foo(name: String, position: Point)
implicit val PointEncoder: Encoder[Point] = Encoders.kryo[Point]
implicit val FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
val ds = Seq(new Foo("bar", new Point(0, 0))).toDS
ds.show()
returns:
+--------------------+
| value|
+--------------------+
|[01 00 D2 02 6C 6...|
+--------------------+
Encoder for a Dataset using a case class example
import spark.implicits._ // assumes an existing SparkSession named `spark`, e.g. in spark-shell
case class Point(a: Int, b: Int)
case class Foo(name: String, position: Point)
val ds = Seq(new Foo("bar", new Point(0, 0))).toDS
ds.show()
returns:
+----+--------+
|name|position|
+----+--------+
| bar| [0, 0]|
+----+--------+
This strikes me as the way to go with Spark, Tungsten and Catalyst.
Things get more complicated when an Any is involved, but Any is not a good thing anyway:
import org.apache.spark.sql.types._

val data = Seq(
  ("sublime", Map(
    "good_song" -> "santeria",
    "bad_song" -> "doesn't exist")
  ),
  ("prince_royce", Map(
    "good_song" -> 4,               // an Int mixed with Strings widens the Map's value type to Any
    "bad_song" -> "back it up")
  )
)
val schema = List(
  ("name", StringType, true),
  ("songs", MapType(StringType, StringType, true), true)
)
val rdd = spark.sparkContext.parallelize(data)
rdd.collect
val df = spark.createDataFrame(rdd)  // fails: Spark cannot derive an encoder/schema for Any
df.show()
df.printSchema()
returns:
java.lang.UnsupportedOperationException: No Encoder found for Any.
Then this example is interesting, and is a valid custom-object use case: Spark No Encoder found for java.io.Serializable in Map[String, java.io.Serializable]. But I would stay away from such constructs.
Conclusions
Kryo vs Encoder vs Java Serialization in Spark? states that Kryo is for RDDs, but that refers to the legacy path; internally Spark can still use it. Not 100% correct, but actually to the point.
Spark: Dataset Serialization is also an informative link.
Things have evolved, and the spirit is to not use Kryo for Datasets and DataFrames.
Hope this helps.
TL;DR: don't trust the show() method's output. The internal structure of your case class is not lost.
I was confused about how the show() method rendered the content of the
tuples as one column named 'value' with binary content. I concluded
(incorrectly) that the dataframe was just a one column binary
blob that no longer adhered to the structure of a Tuple2[Integer,String]
with columns named id and value.
However, when I printed the actual content of the collected data frame
I saw the correct values, column names and types. So I think this is just
an issue with the show() method.
The program below should serve to reproduce my results:
import java.util

import org.apache.spark.sql.SparkSession

object X extends App {
  val sparkSession = SparkSession.builder().appName("tests")
    .master("local")
    .getOrCreate()
  import sparkSession.implicits._

  val df = Seq((1, "a"), (2, "d")).toDF("id", "value")

  import org.apache.spark.sql.{Encoder, Encoders, Row}
  implicit val myKryoEncoder: Encoder[Row] = Encoders.kryo[Row]

  val ds = df.map{case row => row}
  ds.show // This shows only one column 'value' with binary content

  // This shows that the schema and values are actually correct. The below will print:
  // row schema:StructType(StructField(id,IntegerType,false),StructField(value,StringType,true))
  // row:[1,a]
  // row schema:StructType(StructField(id,IntegerType,false),StructField(value,StringType,true))
  // row:[2,d]
  val collected: util.List[Row] = ds.collectAsList()
  collected.forEach{ row =>
    System.out.println("row schema:" + row.schema)
    System.out.println("row:" + row)
  }
}
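If readable show() output is all that is needed, the built-in product encoders give you a typed Dataset without Kryo or an explicit RowEncoder. A minimal sketch, assuming the same df and sparkSession.implicits._ in scope as in the program above:
// Assumes: val df = Seq((1, "a"), (2, "d")).toDF("id", "value")
// and import sparkSession.implicits._ as above.
val typedDs = df.as[(Int, String)]   // uses the implicit product encoder, not Kryo
typedDs.show()
// +---+-----+
// | id|value|
// +---+-----+
// |  1|    a|
// |  2|    d|
// +---+-----+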

Dataframe Comparison in Spark: Scala

I have two dataframes in Spark/Scala which have some common columns like salary, bonus, increment etc.
I need to compare the columns of these two dataframes. If a value changes, for example the salary is 3000 in the first dataframe and 5000 in the second, I need to insert 5000-3000=2000 as the salary in a new dataframe; if the salary is 5000 in the first dataframe and 3000 in the second, I need to insert 5000+3000=8000 as the salary; and if the salary is the same in both dataframes, I need to insert the value from the second dataframe.
val columns = df1.schema.fields.map(_.name)
val salaryDifferences = columns.map(col => df1.select(col).except(df2.select(col)))
salaryDifferences.map(diff => {if(diff.count > 0) diff.show})
I tried the query above, but it only gives the column and value wherever there is a difference. I also need to check whether the diff is negative or positive and apply logic based on that. Can anyone give me a hint on how to implement this and insert the records into a third dataframe?
Join the dataframes and use nested when and otherwise clauses.
Also see the comments in the code.
import org.apache.spark.sql.functions._

object SalaryDiff {

  def main(args: Array[String]): Unit = {

    val spark = Constant.getSparkSess   // project-specific helper returning a SparkSession

    import spark.implicits._

    val df1 = List(("1", "5000"), ("2", "3000"), ("3", "5000")).toDF("id", "salary") // First dataframe
    val df2 = List(("1", "3000"), ("2", "5000"), ("3", "5000")).toDF("id", "salary") // Second dataframe

    val df3 = df1 // Your 3rd dataframe
      .join(df2, df1("id") === df2("id")) // Join both dataframes on the id column
      .withColumn("finalSalary",
        when(df1("salary") < df2("salary"), df2("salary") - df1("salary")) // 5000-3000=2000 check
          .otherwise(
            when(df1("salary") > df2("salary"), df1("salary") + df2("salary")) // 5000+3000=8000 check
              .otherwise(df2("salary")))) // otherwise take the salary from the second dataframe
      .drop(df1("salary"))
      .drop(df2("salary"))
      .withColumnRenamed("finalSalary", "salary")

    df3.show()
  }
}
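A slightly tidier variant (a sketch based on the same df1 and df2, not part of the original answer) joins on the shared column name so only one id column survives, and chains the when clauses instead of nesting them:
import org.apache.spark.sql.functions.when

// Sketch: assumes the same df1 and df2 as above. Since the sample salaries are
// strings, the arithmetic results come back as doubles after an implicit cast.
val result = df1.join(df2, Seq("id"))                                       // single id column in the output
  .withColumn("finalSalary",
    when(df1("salary") < df2("salary"), df2("salary") - df1("salary"))      // 5000-3000=2000
      .when(df1("salary") > df2("salary"), df1("salary") + df2("salary"))   // 5000+3000=8000
      .otherwise(df2("salary")))                                            // equal: keep df2's salary
  .drop(df1("salary"))
  .drop(df2("salary"))
  .withColumnRenamed("finalSalary", "salary")

result.show()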

Adding month to DateType based on column value

Assuming a dataframe with a date column and an Int column representing a number of months:
val df = Seq(("2011-11-11",1),("2010-11-11",3),("2012-11-11",5))
.toDF("startDate","monthsToAdd")
.withColumn("startDate",'startDate.cast(DateType))
+----------+-----------+
| startDate|monthsToAdd|
+----------+-----------+
|2011-11-11| 1|
|2010-11-11| 3|
|2012-11-11| 5|
+----------+-----------+
is there a way of creating an endDate column by adding the months to startDate without converting the date column back to string?
So basically same as the add_months function
def add_months(startDate: Column, numMonths: Int)
but passing a column instead of a literal.
You can use a UDF (user-defined function) to achieve this. Below I have created a myUDF function which adds the months to the date and returns the resulting date as a string, and I use this UDF to create a new column via withColumn on the DataFrame.
import java.text.SimpleDateFormat
import java.util.Calendar
import javax.xml.bind.DatatypeConverter

import org.apache.spark.sql.functions._
import sparkSession.sqlContext.implicits._   // assumes an existing SparkSession named `sparkSession`

val df = Seq(("2011-11-11",1),("2010-11-11",3),("2012-11-11",5)).toDF("startDate","monthsToAdd")

val myUDF = udf {
  val simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
  (startDate: String, monthValue: Int) => {
    // Parse the date, add the months, then format back to yyyy-MM-dd
    val calendar = DatatypeConverter.parseDateTime(startDate)
    calendar.add(Calendar.MONTH, monthValue)
    simpleDateFormat.format(calendar.getTime)
  }
}

val newDf = df.withColumn("endDate", myUDF(df("startDate"), df("monthsToAdd")))
newDf.show()
Output:
+----------+-----------+----------+
| startDate|monthsToAdd| endDate|
+----------+-----------+----------+
|2011-11-11| 1|2011-12-11|
|2010-11-11| 3|2011-02-11|
|2012-11-11| 5|2013-04-11|
+----------+-----------+----------+
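For what it's worth, the SQL add_months function accepts column arguments, so the same result can be had without a UDF via expr; a sketch assuming the same df (on Spark 3.x there is also an add_months(Column, Column) overload in the Scala API):
import org.apache.spark.sql.functions.expr

// Note: endDate is a DateType column here, not a string as in the UDF version.
val withEnd = df.withColumn("endDate", expr("add_months(startDate, monthsToAdd)"))
withEnd.show()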

how can i add a timestamp as an extra column to my dataframe

Hi all,
I have an easy question for you all.
I have an RDD, created from Kafka streaming using the createStream method.
Now I want to add a timestamp as a value to this RDD before converting it into a dataframe.
I have tried adding a value to the dataframe using withColumn(), but it returns this error:
val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd =>
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dataframe = sqlContext.read.json(rdd.map(_._2))
val d = dataframe.withColumn("timeStamp_column", dataframe.col("now"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name,
item_name, lat, lon, memberid, productUpccd, tenantid);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15
I came to know that DataFrames cannot be altered because they are immutable, but RDDs are immutable as well.
So what is the best way to do it?
How do I add a value to the RDD (adding a timestamp to an RDD dynamically)?
Try the current_timestamp function.
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())
To add a new column with a constant value such as a timestamp, you can use the lit function:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("timeStamp_column", lit(System.currentTimeMillis))
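Note the difference in when the value is computed; here is a sketch (the column names load_millis and load_ts are just illustrative):
import org.apache.spark.sql.functions.{current_timestamp, lit}

// lit(System.currentTimeMillis) captures a single Long on the driver when the plan
// is built; every row gets the same epoch-millis value.
val withConstant = oldDF.withColumn("load_millis", lit(System.currentTimeMillis))

// current_timestamp() is resolved at query time as a TimestampType column
// (still constant within a single query).
val withQueryTime = oldDF.withColumn("load_ts", current_timestamp())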
This works for me. I usually perform a write after this.
val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())
In Scala/Databricks:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())
I see in the comments that some folks are having trouble converting the timestamp to a string. Here is a way to do that using the Spark 3 datetime format:
import org.apache.spark.sql.functions._

val d = dataframe
  .withColumn("timeStamp_column", date_format(current_timestamp(), "y-M-d'T'H:m:sX"))

How to create a dataframe from a string key=value delimited by ";"

I have a Hive table with the structure:
I need to read the string field, break out the keys and turn them into Hive table columns; the final table should look like this:
Very important: the number of keys in the string is dynamic, and the names of the keys are also dynamic.
One approach would be to read the string with Spark SQL, create a dataframe with a schema based on all the strings, and use the saveAsTable() function to turn the dataframe into the final Hive table, but I do not know how to do this.
Any suggestions?
A naive solution (assuming unique (code, date) combinations and no embedded = or ; in the string) can look like this:
import org.apache.spark.sql.functions.{explode, first, split}

val df = Seq(
  (1, 1, "key1=value11;key2=value12;key3=value13;key4=value14"),
  (1, 2, "key1=value21;key2=value22;key3=value23;key4=value24"),
  (2, 4, "key3=value33;key4=value34;key5=value35")
).toDF("code", "date", "string")

val bits = split($"string", ";")
val kv = split($"pair", "=")

df
  .withColumn("bits", bits)              // Split column by `;`
  .withColumn("pair", explode($"bits"))  // Explode into multiple rows
  .withColumn("key", kv(0))              // Extract key
  .withColumn("val", kv(1))              // Extract value
  // Pivot to wide format
  .groupBy("code", "date")
  .pivot("key")
  .agg(first("val"))
  .show
// +----+----+-------+-------+-------+-------+-------+
// |code|date| key1| key2| key3| key4| key5|
// +----+----+-------+-------+-------+-------+-------+
// | 1| 2|value21|value22|value23|value24| null|
// | 1| 1|value11|value12|value13|value14| null|
// | 2| 4| null| null|value33|value34|value35|
// +----+----+-------+-------+-------+-------+-------+
This can easily be adjusted to handle the case when (code, date) is not unique, and you can process more complex string patterns using a UDF.
Depending on the language you use and the number of columns, you may be better off using an RDD or Dataset. It is also worth considering dropping the full explode / pivot in favor of a UDF:
import org.apache.spark.sql.functions.udf

val parse = udf((text: String) => text.split(";").map(_.split("=")).collect {
  case Array(k, v) => (k, v)
}.toMap)

// Named getKeys so it does not clash with the `keys` list collected below
val getKeys = udf((pairs: Map[String, String]) => pairs.keys.toList)

// Parse strings to Map[String, String]
val withKVs = df.withColumn("kvs", parse($"string"))

val keys = withKVs
  .select(explode(getKeys($"kvs"))).distinct  // Get unique keys
  .as[String]
  .collect.sorted.toList                      // Collect and sort

// Build a list of expressions for the subsequent select
val exprs = keys.map(key => $"kvs".getItem(key).alias(key))

withKVs.select($"code" :: $"date" :: exprs: _*)
In Spark 1.5 you can try:
val keys = withKVs.select($"kvs").rdd
  .flatMap(_.getAs[Map[String, String]]("kvs").keys)
  .distinct
  .collect.sorted.toList
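On more recent Spark versions (2.0+), the built-in str_to_map expression can replace the parse UDF; a sketch assuming the same df as above:
import org.apache.spark.sql.functions.expr

// Builds a MapType(StringType, StringType) column directly from the delimited string.
val withMap = df.withColumn("kvs", expr("str_to_map(string, ';', '=')"))

// Individual keys can then be selected (or pivoted as above).
withMap.select($"code", $"date", $"kvs".getItem("key1").as("key1")).show()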
