Spark Structured Streaming - reading timestamp from file using schema - apache-spark

I am working on a Structured Streaming job.
The data I am reading from files contains a timestamp (in millis), a deviceId and a value reported by that device.
Multiple devices report data.
I am trying to write a job that aggregates (sums) the values sent by all devices into tumbling windows of 1 minute.
The issue I am having is with the timestamp.
When I try to parse "timestamp" as a Long, the window function complains that it expects a "timestamp type".
When I try to parse it as TimestampType, as in the snippet below, I get a scala.MatchError exception (the full exception can be seen below), and I am struggling to figure out why and what the correct way to handle it is.
// Create schema
StructType readSchema = new StructType()
        .add("value", "integer")
        .add("deviceId", "long")
        .add("timestamp", new TimestampType());

// Read data from file
Dataset<Row> inputDataFrame = sparkSession.readStream()
        .schema(readSchema)
        .parquet(path);

Dataset<Row> aggregations = inputDataFrame
        .groupBy(window(inputDataFrame.col("timestamp"), "1 minutes"),
                 inputDataFrame.col("deviceId"))
        .agg(sum("value"));
The exception:
scala.MatchError: org.apache.spark.sql.types.TimestampType@3eeac696 (of class org.apache.spark.sql.types.TimestampType)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1692)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:232)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:242)
at org.apache.spark.sql.streaming.DataStreamReader.parquet(DataStreamReader.scala:450)

Typically, when your timestamp is stored as a long in milliseconds, you would convert it into a timestamp type as shown below:
// Create schema and keep column 'timestamp' as long
StructType readSchema = new StructType()
        .add("value", "integer")
        .add("deviceId", "long")
        .add("timestamp", "long");

// Read data from file
Dataset<Row> inputDataFrame = sparkSession.readStream()
        .schema(readSchema)
        .parquet(path);

// Convert the millisecond epoch into a proper timestamp type
// (requires: import static org.apache.spark.sql.functions.expr;
//  and:      import org.apache.spark.sql.types.DataTypes;)
Dataset<Row> df1 = inputDataFrame.withColumn("new_timestamp",
        expr("timestamp / 1000").cast(DataTypes.TimestampType));

df1.show(false);
+-----+--------+-------------+-----------------------+
|value|deviceId|timestamp |new_timestamp |
+-----+--------+-------------+-----------------------+
|1 |1337 |1618836775397|2021-04-19 14:52:55.397|
+-----+--------+-------------+-----------------------+
df1.printSchema();
root
|-- value: integer (nullable = true)
|-- deviceId: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- new_timestamp: timestamp (nullable = true)
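With the converted column in place, the 1-minute tumbling-window aggregation from the original question can then be expressed against new_timestamp. A minimal sketch, assuming df1 from above; the 10-minute watermark is an assumption, not part of the original post, and should be tuned for your own latency and output-mode requirements:
import static org.apache.spark.sql.functions.*;

Dataset<Row> aggregations = df1
        .withWatermark("new_timestamp", "10 minutes") // assumed watermark, adjust as needed
        .groupBy(window(col("new_timestamp"), "1 minute"),
                 col("deviceId"))
        .agg(sum("value"));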

Related

Convert from long to Timestamp for Insert into DB

Goal:
Read data from a JSON file where the timestamp is a long, and insert it into a table that has a Timestamp column. The problem is that I don't know how to convert the long to a Timestamp for the insert.
Input File Sample:
{"sensor_id":"sensor1","reading_time":1549533263587,"notes":"My Notes for
Sensor1","temperature":24.11,"humidity":42.90}
I want to read this, create a Bean from it, and insert it into a table. Here is my Bean definition:
public class DummyBean {
    private String sensor_id;
    private String notes;
    private Timestamp reading_time;
    private double temperature;
    private double humidity;
    // getters and setters omitted
}
Here is the table I want to insert into:
create table dummy (
id serial not null primary key,
sensor_id varchar(40),
notes varchar(40),
reading_time timestamp with time zone default (current_timestamp at time zone 'UTC'),
temperature decimal(15,2),
humidity decimal(15,2)
);
Here is my Spark app that reads the JSON file and does the insert (append):
SparkSession spark = SparkSession
.builder()
.appName("SparkJDBC2")
.getOrCreate();
// Java Bean used to apply schema to JSON Data
Encoder<DummyBean> dummyEncoder = Encoders.bean(DummyBean.class);
// Read JSON file to DataSet
String jsonPath = "input/dummy.json";
Dataset<DummyBean> readings = spark.read().json(jsonPath).as(dummyEncoder);
// Diagnostics and Sink
readings.printSchema();
readings.show();
// Write to JDBC Sink
String url = "jdbc:postgresql://dbhost:5432/mydb";
String table = "dummy";
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "foo");
connectionProperties.setProperty("password", "bar");
readings.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);
Output and Error Message:
root
|-- humidity: double (nullable = true)
|-- notes: string (nullable = true)
|-- reading_time: long (nullable = true)
|-- sensor_id: string (nullable = true)
|-- temperature: double (nullable = true)
+--------+--------------------+-------------+---------+-----------+
|humidity| notes| reading_time|sensor_id|temperature|
+--------+--------------------+-------------+---------+-----------+
| 42.9|My Notes for Sensor1|1549533263587| sensor1| 24.11|
+--------+--------------------+-------------+---------+-----------+
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column "reading_time" not found in schema Some(StructType(StructField(id,IntegerType,false), StructField(sensor_id,StringType,true), StructField(notes,StringType,true), StructField(temperature,DecimalType(15,2),true), StructField(humidity,DecimalType(15,2),true)));
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4$$anonfun$6.apply(JdbcUtils.scala:147)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4$$anonfun$6.apply(JdbcUtils.scala:147)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4.apply(JdbcUtils.scala:146)
The exception in your post says the "reading_time" column was not found, so please cross-check that the table has the required column on the database side. Also, the timestamp comes in milliseconds, so you need to divide it by 1000 before applying the to_timestamp() function; otherwise you'll get a weird date.
I was able to replicate this below and convert the reading_time:
scala> val readings = Seq((42.9,"My Notes for Sensor1",1549533263587L,"sensor1",24.11)).toDF("humidity","notes","reading_time","sensor_id","temperature")
readings: org.apache.spark.sql.DataFrame = [humidity: double, notes: string ... 3 more fields]
scala> readings.printSchema();
root
|-- humidity: double (nullable = false)
|-- notes: string (nullable = true)
|-- reading_time: long (nullable = false)
|-- sensor_id: string (nullable = true)
|-- temperature: double (nullable = false)
scala> readings.show(false)
+--------+--------------------+-------------+---------+-----------+
|humidity|notes |reading_time |sensor_id|temperature|
+--------+--------------------+-------------+---------+-----------+
|42.9 |My Notes for Sensor1|1549533263587|sensor1 |24.11 |
+--------+--------------------+-------------+---------+-----------+
scala> readings.withColumn("ts", to_timestamp('reading_time/1000)).show(false)
+--------+--------------------+-------------+---------+-----------+-----------------------+
|humidity|notes |reading_time |sensor_id|temperature|ts |
+--------+--------------------+-------------+---------+-----------+-----------------------+
|42.9 |My Notes for Sensor1|1549533263587|sensor1 |24.11 |2019-02-07 04:54:23.587|
+--------+--------------------+-------------+---------+-----------+-----------------------+
scala>
Thanks for your help. Yes, the table was missing the column, so I fixed that.
This is what solved it (Java version):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.to_timestamp;
...
Dataset<Row> readingsRow = readings.withColumn("reading_time", to_timestamp(col("reading_time").$div(1000L)));
// Write to JDBC Sink
String url = "jdbc:postgresql://dbhost:5432/mydb";
String table = "dummy";
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "foo");
connectionProperties.setProperty("password", "bar");
readingsRow.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);
If your date is a String you can use:
String readtime = obj.getString("reading_time");
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ"); // Z for time zone
Date reading_time = sdf.parse(readtime);
or, if it is a long, use:
new Date(json.getLong(milliseconds))
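Since the bean field in the original question is a java.sql.Timestamp rather than a java.util.Date, here is a minimal standalone sketch of the millis-to-Timestamp conversion; the class name and the hard-coded value (taken from the sample record) are only for illustration:
import java.sql.Timestamp;

public class ReadingTimeConversion {
    public static void main(String[] args) {
        // Epoch millis as they appear in the sample JSON record
        long readingTimeMillis = 1549533263587L;
        // java.sql.Timestamp accepts epoch millis directly
        Timestamp readingTime = new Timestamp(readingTimeMillis);
        System.out.println(readingTime); // printed in the JVM's default time zone
    }
}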

spark read orc with specific columns

I have an ORC file; when read with the option below it reads all the columns.
val df= spark.read.orc("/some/path/")
df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
|-- all: string (nullable = true)
|-- next: string (nullable = true)
|-- action: string (nullable = true)
but I want to read only two columns (id, name) from that file. Is there any way to read only two columns (id, name) while loading the ORC file?
Is there any way to read only two columns (id, name) while loading the ORC file?
Yes, all you need is a subsequent select. Spark will take care of the rest for you:
val df = spark.read.orc("/some/path/").select("id", "name")
Spark has a lazy execution model, so you can declare any data transformations in your code without an immediate effect. Spark only starts doing the work after an action is called, and it is smart enough not to do extra work, such as reading columns it does not need.
So you can write it like this:
val inDF: DataFrame = spark.read.orc("/some/path/")
import spark.implicits._
val filteredDF: DataFrame = inDF.select($"id", $"name")
// any additional transformations
// real work starts after this action
val result: Array[Row] = filteredDF.collect()
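For comparison, a minimal Java sketch of the same column pruning (assuming a SparkSession named spark and the same placeholder path); calling explain() is a quick way to verify that the file scan only reads the selected columns:
Dataset<Row> df = spark.read().orc("/some/path/").select("id", "name");
df.explain(); // the scan's ReadSchema should list only id and name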

How to convert DataFrame datatypes to String?

I have a Hive table with Date and Timestamp datatypes. I am creating a DataFrame using the Java code below:
SparkConf conf = new SparkConf(true).setMaster("yarn-cluster").setAppName("SAMPLE_APP");
SparkContext sc = new SparkContext(conf);
HiveContext hc = new HiveContext(sc);
DataFrame df = hc.table("testdb.tbl1");
DataFrame schema:
df.printSchema
root
|-- c_date: date (nullable = true)
|-- c_timestamp: timestamp (nullable = true)
I want to convert these columns to String. How can I achieve this?
I need this because of this issue: Spark csv data validation failed for date and timestamp data types of Hive
You can do the following:
df.withColumn("c_date", df.col("c_date").cast(StringType))
In Scala, we generally cast datatypes like this:
df.select($"date".cast(StringType).as("new_date"))

Why do Spark DataFrames not change their schema and what to do about it?

I'm using Spark 2.1's Structured Streaming to read from a Kafka topic whose contents are binary avro-encoded.
Thus, after setting up the DataFrame:
val messages = spark
.readStream
.format("kafka")
.options(kafkaConf)
.option("subscribe", config.getString("kafka.topic"))
.load()
If I print the schema of this DataFrame (messages.printSchema()), I get the following:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- timestampType: integer (nullable = true)
This question should be orthogonal to the problem of Avro decoding, but let's assume I want to somehow convert the value content of the messages DataFrame into a Dataset[BusinessObject], via a function Array[Byte] => BusinessObject. For the sake of completeness, the function may just be (using avro4s):
case class BusinessObject(userId: String, eventId: String)
def fromAvro(bytes: Array[Byte]): BusinessObject =
AvroInputStream.binary[BusinessObject](
new ByteArrayInputStream(bytes)
).iterator.next
Of course, as miguno says in this related question I cannot just apply the transformation with a DataFrame.map(), because I need to provide an implicit Encoder for such a BusinessObject.
That can be defined as:
implicit val myEncoder : Encoder[BusinessObject] = org.apache.spark.sql.Encoders.kryo[BusinessObject]
Now, perform the map:
val transformedMessages: Dataset[BusinessObject] = messages.map(row => fromAvro(row.getAs[Array[Byte]]("value")))
But if I query the new schema, I get the following:
root
|-- value: binary (nullable = true)
And I think that does not make any sense, as the dataset should use the Product properties of the BusinessObject case-class and get the correct values.
I've seen some examples on Spark SQL using .schema(StructType) in the reader, but I cannot do that, not just because I'm using readStream, but because I actually have to transform the column before being able to operate in such fields.
I am hoping to tell the Spark SQL engine that the transformedMessages Dataset schema is a StructField with the case class' fields.
I would say you get exactly what you ask for. As I already explained today, Encoders.kryo generates a blob with the serialized object. Its internal structure is opaque to the SQL engine and cannot be accessed without deserializing the object. So effectively what your code does is take one serialization format and replace it with another.
Another problem is that you are trying to mix a dynamically typed DataFrame (Dataset[Row]) with statically typed objects. Excluding the UDT API, Spark SQL doesn't work like this. Either you use a statically typed Dataset, or a DataFrame whose object structure is encoded using a struct hierarchy.
The good news is that simple product types like BusinessObject should work just fine without any need for the clumsy Encoders.kryo. Just skip the Kryo encoder definition and be sure to import the implicit encoders:
import spark.implicits._

On Spark DataFrame save to JSON and load back, schema column sequence changes

I am using Spark DataFrames and trying to do de-duplication across two DataFrames of the same schema.
The schema of the DataFrame before saving to JSON is:
root
|-- startTime: long (nullable = false)
|-- name: string (nullable = true)
The schema of the DataFrame after loading from the JSON file is:
root
|-- name: string (nullable = true)
|-- startTime: long (nullable = false)
I save to JSON as:
newDF.write.json(filePath)
and read back as:
existingDF = sqlContext.read.json(filePath)
After doing unionAll
existingDF.unionAll(newDF).distinct()
or except
newDF.except(existingDF)
the de-duplication fails because of the schema change.
Can I avoid this schema conversion?
Is there a way to conserve (or enforce) schema sequence while saving to and loading back from JSON file?
I implemented a workaround to convert the schema back to what I need:
val newSchema = StructType(jsonDF.schema.map {
case StructField(name, dataType, nullable, metadata) if name.equals("startTime") => StructField(name, LongType, nullable = false, metadata)
case y: StructField => y
})
existingDF = sqlContext.createDataFrame(jsonDF.rdd, newSchema).select("startTime", "name")
