Convert from long to Timestamp for Insert into DB - apache-spark

Goal:
Read data from a JSON file where timestamp is a long type, and insert into a table that has a Timestamp type. The problem is that I don't know how to convert the long type to a Timestamp type for the insert.
Input File Sample:
{"sensor_id":"sensor1","reading_time":1549533263587,"notes":"My Notes for
Sensor1","temperature":24.11,"humidity":42.90}
I want to read this, create a Bean from it, and insert into a table. Here is my Bean Definition:
public class DummyBean {
private String sensor_id;
private String notes;
private Timestamp reading_time;
private double temperature;
private double humidity;
// getters and setters omitted; Encoders.bean relies on them being present
}
Here is the table I want to insert into:
create table dummy (
id serial not null primary key,
sensor_id varchar(40),
notes varchar(40),
reading_time timestamp with time zone default (current_timestamp at time zone 'UTC'),
temperature decimal(15,2),
humidity decimal(15,2)
);
Here is my Spark app to read the JSON file and do the insert (append)
SparkSession spark = SparkSession
.builder()
.appName("SparkJDBC2")
.getOrCreate();
// Java Bean used to apply schema to JSON Data
Encoder<DummyBean> dummyEncoder = Encoders.bean(DummyBean.class);
// Read JSON file to DataSet
String jsonPath = "input/dummy.json";
Dataset<DummyBean> readings = spark.read().json(jsonPath).as(dummyEncoder);
// Diagnostics and Sink
readings.printSchema();
readings.show();
// Write to JDBC Sink
String url = "jdbc:postgresql://dbhost:5432/mydb";
String table = "dummy";
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "foo");
connectionProperties.setProperty("password", "bar");
readings.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);
Output and Error Message:
root
|-- humidity: double (nullable = true)
|-- notes: string (nullable = true)
|-- reading_time: long (nullable = true)
|-- sensor_id: string (nullable = true)
|-- temperature: double (nullable = true)
+--------+--------------------+-------------+---------+-----------+
|humidity| notes| reading_time|sensor_id|temperature|
+--------+--------------------+-------------+---------+-----------+
| 42.9|My Notes for Sensor1|1549533263587| sensor1| 24.11|
+--------+--------------------+-------------+---------+-----------+
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column "reading_time" not found in schema Some(StructType(StructField(id,IntegerType,false), StructField(sensor_id,StringType,true), StructField(notes,StringType,true), StructField(temperature,DecimalType(15,2),true), StructField(humidity,DecimalType(15,2),true)));
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4$$anonfun$6.apply(JdbcUtils.scala:147)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4$$anonfun$6.apply(JdbcUtils.scala:147)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4.apply(JdbcUtils.scala:146)

The exception in your post says the "reading_time" column was not found, so please cross-check that the table has the required column on the database side. Also, the timestamp is coming in milliseconds, so you need to divide it by 1000 before applying the to_timestamp() function, otherwise you'll get a weird date.
I was able to replicate this below and convert the reading_time.
scala> val readings = Seq((42.9,"My Notes for Sensor1",1549533263587L,"sensor1",24.11)).toDF("humidity","notes","reading_time","sensor_id","temperature")
readings: org.apache.spark.sql.DataFrame = [humidity: double, notes: string ... 3 more fields]
scala> readings.printSchema();
root
|-- humidity: double (nullable = false)
|-- notes: string (nullable = true)
|-- reading_time: long (nullable = false)
|-- sensor_id: string (nullable = true)
|-- temperature: double (nullable = false)
scala> readings.show(false)
+--------+--------------------+-------------+---------+-----------+
|humidity|notes |reading_time |sensor_id|temperature|
+--------+--------------------+-------------+---------+-----------+
|42.9 |My Notes for Sensor1|1549533263587|sensor1 |24.11 |
+--------+--------------------+-------------+---------+-----------+
scala> readings.withColumn("ts", to_timestamp('reading_time/1000)).show(false)
+--------+--------------------+-------------+---------+-----------+-----------------------+
|humidity|notes |reading_time |sensor_id|temperature|ts |
+--------+--------------------+-------------+---------+-----------+-----------------------+
|42.9 |My Notes for Sensor1|1549533263587|sensor1 |24.11 |2019-02-07 04:54:23.587|
+--------+--------------------+-------------+---------+-----------+-----------------------+
scala>

Thanks for your help. Yes, the table was missing the column, so I fixed that.
This is what solved it (Java version):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.to_timestamp;
...
Dataset<Row> readingsRow = readings.withColumn("reading_time", to_timestamp(col("reading_time").$div(1000L)));
// Write to JDBC Sink
String url = "jdbc:postgresql://dbhost:5432/mydb";
String table = "dummy";
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "foo");
connectionProperties.setProperty("password", "bar");
readingsRow.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);

If your date is a String you can use
String readtime = obj.getString("reading_time");
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ"); // Z for time zone
Date reading_time = sdf.parse(readtime);
or, if it is a long, use
new Date(json.getLong(milliseconds))
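Since the bean field in the original question is a java.sql.Timestamp rather than a java.util.Date, a minimal sketch of the long-to-Timestamp case could look like this (the obj variable and the setter are assumptions, not from the original code):
long millis = obj.getLong("reading_time"); // e.g. 1549533263587
java.sql.Timestamp readingTime = new java.sql.Timestamp(millis); // keeps millisecond precision
bean.setReading_time(readingTime); // assumes DummyBean exposes a setter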

Related

Apache Spark's from_json not working as expected

In my Spark application, I am trying to read incoming JSON data sent through a socket. The data is in string format, e.g. {"deviceId": "1", "temperature":4.5}.
I created a schema as shown below:
StructType dataSchema = new StructType()
.add("deviceId", "string")
.add("temperature", "double");
I wrote the code below to extract the fields and turn them into columns, so I can use them in SQL queries.
Dataset<Row> normalizedStream = stream.select(functions.from_json(new Column("value"),dataSchema)).as("json");
Dataset<Data> test = normalizedStream.select("json.*").as(Encoders.bean(Data.class));
test.printSchema();
Data.class
public class Data {
private String deviceId;
private double temperature;
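// getters and setters omitted here; Encoders.bean relies on them being present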
}
But when I submit the Spark app, the output schema is as below.
root
|-- from_json(value): struct (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- temperature: double (nullable = true)
The from_json function is showing up as the column name.
What I expect is:
root
|-- deviceId: string (nullable = true)
|-- temperature: double (nullable = true)
How can I fix this? Please let me know what I am doing wrong.
The problem is the placement of the alias. Right now you are applying the alias to the select, not to from_json where it is supposed to be.
Because of that, json.* does not work: the renaming is not applied as intended, so no column called json can be found, nor any children inside it.
So, if you move the brackets from this:
...(new Column("value"),dataSchema)).as("json");
to this:
...(new Column("value"),dataSchema).as("json"));
your final data and schema will look as:
+--------+-----------+
|deviceId|temperature|
+--------+-----------+
|1 |4.5 |
+--------+-----------+
root
|-- deviceId: string (nullable = true)
|-- temperature: double (nullable = true)
which is what you intend to do. Hope this helps, good luck!
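Put together, the corrected Java calls from the question would look roughly like this (a sketch; the names come from the question above):
Dataset<Row> normalizedStream = stream.select(
    functions.from_json(new Column("value"), dataSchema).as("json"));
Dataset<Data> test = normalizedStream.select("json.*").as(Encoders.bean(Data.class));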

Spark Structured streaming - reading timestamp from file using schema

I am working on a Structured Streaming job.
The data I am reading from files contains the timestamp (in millis), deviceId and a value reported by that device.
Multiple devices report data.
I am trying to write a job that aggregates (sums) values sent by all devices into tumbling windows of 1 minute.
The issue that I am having is with timestamp.
When I try to parse "timestamp" as a Long, the window function complains that it expects a timestamp type.
When I try to parse it as TimestampType, as in the snippet below, I get a MatchError exception (the full exception can be seen below), and I am struggling to figure out why and what the correct way to handle it is.
// Create schema
StructType readSchema = new StructType().add("value" , "integer")
.add("deviceId", "long")
.add("timestamp", new TimestampType());
// Read data from file
Dataset<Row> inputDataFrame = sparkSession.readStream()
.schema(readSchema)
.parquet(path);
Dataset<Row> aggregations = inputDataFrame.groupBy(window(inputDataFrame.col("timestamp"), "1 minutes"),
inputDataFrame.col("deviceId"))
.agg(sum("value"));
The exception:
scala.MatchError: org.apache.spark.sql.types.TimestampType@3eeac696 (of class org.apache.spark.sql.types.TimestampType)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1692)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:232)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:242)
at org.apache.spark.sql.streaming.DataStreamReader.parquet(DataStreamReader.scala:450)
Typically, when your timestamp is stored in millis as a long, you would convert it into a timestamp type as shown below:
// Create schema and keep column 'timestamp' as long
StructType readSchema = new StructType()
.add("value", "integer")
.add("deviceId", "long")
.add("timestamp", "long");
// Read data from file
Dataset<Row> inputDataFrame = sparkSession.readStream()
.schema(readSchema)
.parquet(path);
// convert timestamp column into a proper timestamp type
Dataset<Row> df1 = inputDataFrame.withColumn("new_timestamp", expr("timestamp/1000").cast(DataTypes.TimestampType));
df1.show(false);
+-----+--------+-------------+-----------------------+
|value|deviceId|timestamp |new_timestamp |
+-----+--------+-------------+-----------------------+
|1 |1337 |1618836775397|2021-04-19 14:52:55.397|
+-----+--------+-------------+-----------------------+
df1.printSchema();
root
|-- value: integer (nullable = true)
|-- deviceId: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- new_timestamp: timestamp (nullable = true)
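From there, the original 1-minute tumbling-window aggregation can run on the converted column instead of the raw long (a sketch, not part of the original answer; it reuses the names from the question):
Dataset<Row> aggregations = df1.groupBy(
        functions.window(df1.col("new_timestamp"), "1 minute"),
        df1.col("deviceId"))
    .agg(functions.sum("value"));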

Spark Cassandra Write UDT With Case-Sensitive Names Fails

The Spark connector write fails with a java.lang.IllegalArgumentException: udtId is not a field defined in this definition error when using case-sensitive field names.
I need the fields in the Cassandra table to maintain case, so I have used quotes to create them.
My Cassandra schema
CREATE TYPE my_keyspace.my_udt (
"udtId" text,
"udtValue" text
);
CREATE TABLE my_keyspace.my_table (
"id" text PRIMARY KEY,
"someCol" text,
"udtCol" list<frozen<my_udt>>
);
My Spark DataFrame schema is
root
|-- id: string (nullable = true)
|-- someCol: string (nullable = true)
|-- udtCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- udtId: string (nullable = true)
| | |-- udtValue: string (nullable = true)
Are there any other options to get this write to work other than defining my UDT with lowercase names? Making them lowercase would force me to invoke case-management code everywhere this is used, and I'd like to avoid that.
Because I couldn't write successfully, I haven't tried reads yet. Is this an issue with reads as well?
You need to upgrade to Spark Cassandra Connector 2.5.0. I can't find the specific commit that fixes it, or the specific Jira that mentions it; I suspect it was fixed in the DataStax version first and then released as part of the merge announced here.
Here is how it works in SCC 2.5.0 + Spark 2.4.6, while it fails with SCC 2.4.2 + Spark 2.4.6:
scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
scala> val data = spark.read.cassandraFormat("my_table", "test").load()
data: org.apache.spark.sql.DataFrame = [id: string, someCol: string ... 1 more field]
scala> val data2 = data.withColumn("id", concat(col("id"), lit("222")))
data2: org.apache.spark.sql.DataFrame = [id: string, someCol: string ... 1 more field]
scala> data2.write.cassandraFormat("my_table", "test").mode("append").save()
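For reference (this is an assumption about the setup, not something stated in the original answer), with Spark 2.4.x the upgraded connector can be pulled in when launching the shell, for example with spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0, using the _2.12 artifact instead if your Spark build is on Scala 2.12.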

spark read orc with specific columns

I have an ORC file; when read with the option below it reads all the columns.
val df = spark.read.orc("/some/path/")
df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
|-- all: string (nullable = true)
|-- next: string (nullable = true)
|-- action: string (nullable = true)
but I want to read only two columns from that file. Is there any way to read only two columns (id, name) while loading the ORC file?
Is there any way to read only two columns (id, name) while loading the ORC file?
Yes, all you need is a subsequent select. Spark will take care of the rest for you:
val df = spark.read.orc("/some/path/").select("id", "name")
Spark has a lazy execution model, so you can declare any data transformation in your code without an immediate effect. Only when an action is called does Spark start doing the work, and Spark is smart enough not to do anything extra.
So you can write it like this:
val inDF: DataFrame = spark.read.orc("/some/path/")
import spark.implicits._
val filteredDF: DataFrame = inDF.select($"id", $"name")
// any additional transformations
// real work starts after this action
val result: Array[Row] = filteredDF.collect()
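As a side note (not part of the original answer), you can confirm that only the two columns are read by inspecting the physical plan; a Java sketch of the same check:
Dataset<Row> filtered = spark.read().orc("/some/path/").select("id", "name");
filtered.explain(); // the file scan's ReadSchema should list only id and name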

JSON Struct to Map[String,String] using sqlContext

I am trying to read JSON data in a Spark Streaming job.
By default, sqlContext.read.json(rdd) converts all map types to struct types.
|-- legal_name: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- last_name: string (nullable = true)
| |-- middle_name: string (nullable = true)
But when I read from a Hive table using sqlContext
val a = sqlContext.sql("select * from student_record")
below is the schema.
|-- leagalname: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Is there any way we can read data using read.json(rdd) and get a Map data type?
Is there any option like
spark.sql.schema.convertStructToMap?
Any help is appreciated.
You need to explicitly define your schema when calling read.json.
You can read about the details in Programmatically Specifying the Schema in the Spark SQL documentation.
For example, in your specific case it would be:
import org.apache.spark.sql.types._
val schema = StructType(List(StructField("legal_name",MapType(StringType,StringType,true))))
That gives you one column, legal_name, which is a map.
Once you have defined your schema, you can call
sqlContext.read.schema(schema).json(rdd) to create your DataFrame from your JSON dataset with the desired schema.
