How to load CSVs with timestamps in custom format? - apache-spark

I have a timestamp field in a csv file that I load to a dataframe using spark csv library. The same piece of code works on my local machine with Spark 2.0 version but throws an error on Azure Hortonworks HDP 3.5 and 3.6.
I have checked and Azure HDInsight 3.5 is also using the same Spark version so I don't think it's a problem with Spark version.
import org.apache.spark.sql.types._
val sourceFile = "C:\\2017\\datetest"
val sourceSchemaStruct = new StructType()
.add("EventDate",DataTypes.TimestampType)
.add("Name",DataTypes.StringType)
val df = spark.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","|")
.option("mode","FAILFAST")
.option("inferSchema","false")
.option("dateFormat","yyyy/MM/dd HH:mm:ss.SSS")
.schema(sourceSchemaStruct)
.load(sourceFile)
The whole exception is as follows:
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:237)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:179)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply$mcJ$sp(UnivocityParser.scala:142)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply(UnivocityParser.scala:142)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply(UnivocityParser.scala:142)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13.apply(UnivocityParser.scala:139)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13.apply(UnivocityParser.scala:135)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$nullSafeDatum(UnivocityParser.scala:179)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9.apply(UnivocityParser.scala:135)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9.apply(UnivocityParser.scala:134)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:215)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:187)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:304)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:304)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
... 27 more
The csv file has only one row as follows:
"EventDate"|"Name"
"2016/12/19 00:43:27.583"|"adam"

TL;DR Use timestampFormat option (not dateFormat).
I've managed to reproduce it in the latest Spark version 2.3.0-SNAPSHOT (built from the master).
// OS shell
$ cat so-43259485.csv
"EventDate"|"Name"
"2016/12/19 00:43:27.583"|"adam"
// spark-shell
scala> spark.version
res1: String = 2.3.0-SNAPSHOT
case class Event(EventDate: java.sql.Timestamp, Name: String)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Event].schema
scala> spark
.read
.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.schema(schema)
.load("so-43259485.csv")
.show(false)
17/04/08 11:03:42 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 7)
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:237)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:167)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply$mcJ$sp(UnivocityParser.scala:146)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply(UnivocityParser.scala:146)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply(UnivocityParser.scala:146)
at scala.util.Try.getOrElse(Try.scala:79)
The corresponding line in the Spark sources is the "root cause" of the issue:
Timestamp.valueOf(s)
Having read the javadoc of Timestamp.valueOf, you can learn that the argument should be:
timestamp in format yyyy-[m]m-[d]d hh:mm:ss[.f...]. The fractional seconds may be omitted. The leading zero for mm and dd may also be omitted.
Note "The fractional seconds may be omitted" so let's cut it off by first loading the EventDate as a String and only after removing the unneeded fractional seconds convert it to Timestamp.
val eventsAsString = spark.read.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.load("so-43259485.csv")
It turns out that for fields of TimestampType type Spark uses timestampFormat option first if defined and only if not uses the code the uses Timestamp.valueOf.
It turns out the fix is just to use timestampFormat option (not dateFormat!).
val df = spark.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","|")
.option("mode","FAILFAST")
.option("inferSchema","false")
.option("timestampFormat","yyyy/MM/dd HH:mm:ss.SSS")
.schema(sourceSchemaStruct)
.load(sourceFile)
scala> df.show(false)
+-----------------------+----+
|EventDate |Name|
+-----------------------+----+
|2016-12-19 00:43:27.583|adam|
+-----------------------+----+
Spark 2.1.0
Use schema inference in CSV using inferSchema option with your custom timestampFormat.
It's important to trigger schema inference using inferSchema for timestampFormat to take effect.
val events = spark.read
.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.option("inferSchema", true)
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
.load("so-43259485.csv")
scala> events.show(false)
+-------------------+----+
|EventDate |Name|
+-------------------+----+
|2016-12-19 00:43:27|adam|
+-------------------+----+
scala> events.printSchema
root
|-- EventDate: timestamp (nullable = true)
|-- Name: string (nullable = true)
"Incorrect" initial version left for learning purposes
val events = eventsAsString
.withColumn("date", split($"EventDate", " ")(0))
.withColumn("date", translate($"date", "/", "-"))
.withColumn("time", split($"EventDate", " ")(1))
.withColumn("time", split($"time", "[.]")(0)) // <-- remove millis part
.withColumn("EventDate", concat($"date", lit(" "), $"time")) // <-- make EventDate right
.select($"EventDate" cast "timestamp", $"Name")
scala> events.printSchema
root
|-- EventDate: timestamp (nullable = true)
|-- Name: string (nullable = true)
events.show(false)
scala> events.show
+-------------------+----+
| EventDate|Name|
+-------------------+----+
|2016-12-19 00:43:27|adam|
+-------------------+----+
Spark 2.2.0
As of Spark 2.2 you can use to_timestamp function to do the string to timestamp conversion.
eventsAsString.select($"EventDate", to_timestamp($"EventDate", "yyyy/MM/dd HH:mm:ss.SSS")).show(false)
scala> eventsAsString.select($"EventDate", to_timestamp($"EventDate", "yyyy/MM/dd HH:mm:ss.SSS")).show(false)
+-----------------------+----------------------------------------------------+
|EventDate |to_timestamp(`EventDate`, 'yyyy/MM/dd HH:mm:ss.SSS')|
+-----------------------+----------------------------------------------------+
|2016/12/19 00:43:27.583|2016-12-19 00:43:27 |
+-----------------------+----------------------------------------------------+

I searched for this issue, and discovered the offical Github issue page https://github.com/databricks/spark-csv/pull/280 which has fixed a related bug for parsing data with custom date format. I reviewed some source codes, and according to the code to find out your issue reason which is set inferSchema with the default value false as below.
inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default
Please change inferSchema with true for your date format yyyy/MM/dd HH:mm:ss.SSS using SimpleDateFormat.

Related

Reading a .txt file with colon (:) in spark 2.4

i am trying to read .txt file in Spark 2.4 and load it to dataframe.
FILE data looks like :-
under a single Manager there is many employee
Manager_21: Employee_575,Employee_2703,
Manager_11: Employee_454,Employee_158,
Manager_4: Employee_1545,Employee_1312
Code i have written in Scala Spark 2.4 :-
val df = spark.read
.format("csv")
.option("header", "true") //first line in file has headers
.option("mode", "DROPMALFORMED")
.load("D:/path/myfile.txt")
df.printSchema()
Unfortunately while printing schema it is visible all Employee under single Manager_21.
root
|-- Manager_21: servant_575: string (nullable = true)
|-- Employee_454: string (nullable = true)
|-- Employee_1312 string (nullable = true)
.......
...... etc
I am not sure if it is possible in spark scala....
Expected Output:
all employee of a manager in same column.
for ex: Manager 21 has 2 employee and all are in same column.
Or How can we see which all employee are under a particular manager.
Manager_21 |Manager_11 |Manager_4
Employee_575 |Employee_454 |Employee_1545
Employee_2703|Employee_158|Employee_1312
is it possible to do some other way..... please suggest
Thanks
Try using spark.read.text then using groupBy and .pivot to get the desired result.
Example:
val df=spark.read.text("<path>")
df.show(10,false)
//+--------------------------------------+
//|value |
//+--------------------------------------+
//|Manager_21: Employee_575,Employee_2703|
//|Manager_11: Employee_454,Employee_158 |
//|Manager_4: Employee_1545,Employee_1312|
//+--------------------------------------+
import org.apache.spark.sql.functions._
df.withColumn("mid",monotonically_increasing_id).
withColumn("col1",split(col("value"),":")(0)).
withColumn("col2",split(split(col("value"),":")(1),",")).
groupBy("mid").
pivot(col("col1")).
agg(min(col("col2"))).
select(max("Manager_11").alias("Manager_11"),max("Manager_21").alias("Manager_21") ,max("Manager_4").alias("Manager_4")).
selectExpr("explode(arrays_zip(Manager_11,Manager_21,Manager_4))").
select("col.*").
show()
//+-------------+-------------+--------------+
//| Manager_11| Manager_21| Manager_4|
//+-------------+-------------+--------------+
//| Employee_454| Employee_575| Employee_1545|
//| Employee_158|Employee_2703| Employee_1312|
//+-------------+-------------+--------------+
UPDATE:
val df=spark.read.text("<path>")
val df1=df.withColumn("mid",monotonically_increasing_id).
withColumn("col1",split(col("value"),":")(0)).
withColumn("col2",split(split(col("value"),":")(1),",")).
groupBy("mid").
pivot(col("col1")).
agg(min(col("col2"))).
select(max("Manager_11").alias("Manager_11"),max("Manager_21").alias("Manager_21") ,max("Manager_4").alias("Manager_4")).
selectExpr("explode(arrays_zip(Manager_11,Manager_21,Manager_4))")
//create temp table
df1.createOrReplaceTempView("tmp_table")
sql("select col.* from tmp_table").show(10,false)
//+-------------+-------------+--------------+
//|Manager_11 |Manager_21 |Manager_4 |
//+-------------+-------------+--------------+
//| Employee_454| Employee_575| Employee_1545|
//|Employee_158 |Employee_2703|Employee_1312 |
//+-------------+-------------+--------------+

Parse Micro/Nano Seconds timestamp in spark-csv Dataframe reader : Inconsistent results

I'm trying to read a csv file which has timestamps till nano seconds.
sample content of file TestTimestamp.csv-
spark- 2.4.0, scala - 2.11.11
/**
* TestTimestamp.csv -
* 101,2019-SEP-23 11.42.35.456789123 AM
*
*/
Tried to read it using timestampFormat = "yyyy-MMM-dd hh.mm.ss.SSSSSSSSS aaa"
val dataSchema = StructType(Array(StructField("ID", DoubleType, true), StructField("Created_TS", TimestampType, true)))
val data = spark.read.format("csv")
.option("header", "false")
.option("inferSchema", "false")
.option("treatEmptyValuesAsNulls", "true")
//.option("nullValue", "")
.option("dateFormat", "yyyy-MMM-dd")
.option("timestampFormat", "yyyy-MMM-dd hh.mm.ss.SSSSSSSSS aaa")
.schema(dataSchema)
.load("C:\\TestData\\Raw\\TetraPak\\Shipments\\TestTimeStamp.csv")
data.select('Created_TS).show
Output which I get is completely wrong date-time. 23rd Sept got changed to 28th September
+--------------------+
| Created_TS|
+--------------------+
|2019-09-28 18:35:...|
+--------------------+
Even if I have Hours in 24 Hour formats like -
"2019-SEP-23 16.42.35.456789123"
and I try to use only first few digits of second fractions by giving timestampFormat = "yyyy-MMM-dd HH.mm.ss.SSS"
similar incorrect result-
val data2 = spark.read.format("csv")
.option("header", "false")
.option("inferSchema", "false")
.option("treatEmptyValuesAsNulls", "true")
//.option("nullValue", "")
.option("dateFormat", "yyyy-MMM-dd")
.option("timestampFormat", "yyyy-MMM-dd hh.mm.ss.SSS")
.schema(dataSchema)
.load("C:\\TestData\\Raw\\TetraPak\\Shipments\\TestTimeStamp.csv")
data2.select('Created_TS).show
+--------------------+
| Created_TS|
+--------------------+
|2019-09-28 23:35:...|
+--------------------+
is there any way to parse such timestamp strings while creating dataframe using csv reader ?
The DataFrameReader uses the SimpleDateFormat for parsing dates:
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
Unfortunately, the SimpleDateFormat does not support nano seconds, so the part of your dates after the last dot will be interpreted as 456789123 milliseconds, which is approx 126 hours. This time is added to your date, this explains the strange results that you see. More details on this topic can be found in this answer.
So the dates have to be parsed in a second step after reading the csv, for example with a udf that uses a DateTimeFormatter:
val dataSchema = StructType(Array(StructField("ID", DoubleType, true), StructField("Created_TS_String", StringType, true)))
var df = spark.read.option("header", false)
.option("inferSchema", "false")
.option("treatEmptyValuesAsNulls", "true")
.schema(dataSchema)
.csv("C:\\TestData\\Raw\\TetraPak\\Shipments\\TestTimeStamp.csv")
val toDate = udf((date: String) => {
val formatter = new DateTimeFormatterBuilder()
.parseCaseInsensitive()
.appendPattern("yyyy-MMM-dd hh.mm.ss.SSSSSSSSS a").toFormatter()
Timestamp.valueOf(LocalDateTime.parse(date, formatter))
})
df = df.withColumn("Created_TS", toDate('Created_TS_String))
Here is the solution inspired by werner's answer about using udfs..-
Input csv -
101,2019-SEP-23 11.42.35.456789123 AM,2019-SEP-23 11.42.35.456789123 AM,2019-SEP-23 11.42.35.456789123 AM
Original Schema with TimestampType columns
val orig_schema = StructType(Array(StructField("ID", DoubleType, true), StructField("Created_TS", TimestampType, true), StructField("Updated_TS", TimestampType, true), StructField("Modified_TS", TimestampType, true)))
Convert all TimestampType to StringType
val dataSchema = StructType(orig_schema.map(x =>
{
x.dataType match {
case TimestampType => StructField(x.name, StringType, x.nullable)
case _ => x
}
}))
toDate function for convert String to Timstamp
//TODO parameterize string formats
def toDate(date: String): java.sql.Timestamp = {
val formatter = new DateTimeFormatterBuilder()
.parseCaseInsensitive()
.appendPattern("yyyy-MMM-dd hh.mm.ss.SSSSSSSSS a").toFormatter()
Timestamp.valueOf(LocalDateTime.parse(date, formatter))
}
// register toDate as udf
val to_timestamp = spark.sqlContext.udf.register("to_timestamp", toDate _)
Create Column Expression to select from raw Dataframe
// Array of Column Name & Types
val nameType: Array[(String, DataType)] = orig_schema.fields.map(f => (f.name, f.dataType))
// Create Column Expression to select from raw Dataframe
val selectExpr = nameType.map(f => {
f._2 match {
case TimestampType => expr(s"CASE WHEN ${f._1} is NULL THEN NULL ELSE to_timestamp(${f._1}) END AS ${f._1}")
case _ => expr(s"${f._1}")
}
})
Read as StringType , Use column selector expression which uses udf to convert string to Timestamp
val data = spark.read.format("csv")
.option("header", "false")
.option("inferSchema", "false")
.option("treatEmptyValuesAsNulls", "true")
//.option("nullValue", "")
.option("dateFormat", "yyyy-MMM-dd")
.option("timestampFormat", "yyyy-MMM-dd hh.mm.ss.SSSSSSSSS aaa")
.schema(dataSchema)
.load("C:\\TestData\\Raw\\TetraPak\\Shipments\\TestTimestamp_new.csv").select(selectExpr: _*)
data.show
Here's desired output..so now I don't have to worry about number of columns and creating expressions with udf manually
+-----+--------------------+--------------------+--------------------+
| ID| Created_TS| Updated_TS| Modified_TS|
+-----+--------------------+--------------------+--------------------+
|101.0|2019-09-23 11:42:...|2019-09-23 11:42:...|2019-09-23 11:42:...|
+-----+--------------------+--------------------+--------------------+

Parquet data and partition issue in Spark Structured streaming

I am using Spark Structured streaming; My DataFrame has the following schema
root
|-- data: struct (nullable = true)
| |-- zoneId: string (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- timeSinceLast: long (nullable = true)
|-- date: date (nullable = true)
How can I do a writeStream with Parquet format and write the data
(containing zoneId, deviceId, timeSinceLast; everything except date) and partition the data by date ? I tried the following code and the partition by clause did
not work
val query1 = df1
.writeStream
.format("parquet")
.option("path", "/Users/abc/hb_parquet/data")
.option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
.partitionBy("data.zoneId")
.start()
If you want to partition by date then you have to use it in partitionBy() method.
val query1 = df1
.writeStream
.format("parquet")
.option("path", "/Users/abc/hb_parquet/data")
.option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
.partitionBy("date")
.start()
In case if you want to partition data structured by <year>/<month>/<day> you should make sure that the date column is of DateType type and then create columns appropriately formatted:
val df = dataset.withColumn("date", dataset.col("date").cast(DataTypes.DateType))
df.withColumn("year", functions.date_format(df.col("date"), "YYYY"))
.withColumn("month", functions.date_format(df.col("date"), "MM"))
.withColumn("day", functions.date_format(df.col("date"), "dd"))
.writeStream
.format("parquet")
.option("path", "/Users/abc/hb_parquet/data")
.option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
.partitionBy("year", "month", "day")
.start()
I think you should try the method repartition which can take two kinds of arguments:
column name
number of wanted partitions.
I suggest using repartition("date") to partition your data by date.
A great link on the subject: https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

Setting date format in PySpark SQL schema definition [duplicate]

I have a timestamp field in a csv file that I load to a dataframe using spark csv library. The same piece of code works on my local machine with Spark 2.0 version but throws an error on Azure Hortonworks HDP 3.5 and 3.6.
I have checked and Azure HDInsight 3.5 is also using the same Spark version so I don't think it's a problem with Spark version.
import org.apache.spark.sql.types._
val sourceFile = "C:\\2017\\datetest"
val sourceSchemaStruct = new StructType()
.add("EventDate",DataTypes.TimestampType)
.add("Name",DataTypes.StringType)
val df = spark.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","|")
.option("mode","FAILFAST")
.option("inferSchema","false")
.option("dateFormat","yyyy/MM/dd HH:mm:ss.SSS")
.schema(sourceSchemaStruct)
.load(sourceFile)
The whole exception is as follows:
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:237)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:179)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply$mcJ$sp(UnivocityParser.scala:142)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply(UnivocityParser.scala:142)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply(UnivocityParser.scala:142)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13.apply(UnivocityParser.scala:139)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13.apply(UnivocityParser.scala:135)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$nullSafeDatum(UnivocityParser.scala:179)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9.apply(UnivocityParser.scala:135)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9.apply(UnivocityParser.scala:134)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:215)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:187)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:304)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:304)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
... 27 more
The csv file has only one row as follows:
"EventDate"|"Name"
"2016/12/19 00:43:27.583"|"adam"
TL;DR Use timestampFormat option (not dateFormat).
I've managed to reproduce it in the latest Spark version 2.3.0-SNAPSHOT (built from the master).
// OS shell
$ cat so-43259485.csv
"EventDate"|"Name"
"2016/12/19 00:43:27.583"|"adam"
// spark-shell
scala> spark.version
res1: String = 2.3.0-SNAPSHOT
case class Event(EventDate: java.sql.Timestamp, Name: String)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Event].schema
scala> spark
.read
.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.schema(schema)
.load("so-43259485.csv")
.show(false)
17/04/08 11:03:42 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 7)
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:237)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:167)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply$mcJ$sp(UnivocityParser.scala:146)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply(UnivocityParser.scala:146)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply(UnivocityParser.scala:146)
at scala.util.Try.getOrElse(Try.scala:79)
The corresponding line in the Spark sources is the "root cause" of the issue:
Timestamp.valueOf(s)
Having read the javadoc of Timestamp.valueOf, you can learn that the argument should be:
timestamp in format yyyy-[m]m-[d]d hh:mm:ss[.f...]. The fractional seconds may be omitted. The leading zero for mm and dd may also be omitted.
Note "The fractional seconds may be omitted" so let's cut it off by first loading the EventDate as a String and only after removing the unneeded fractional seconds convert it to Timestamp.
val eventsAsString = spark.read.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.load("so-43259485.csv")
It turns out that for fields of TimestampType type Spark uses timestampFormat option first if defined and only if not uses the code the uses Timestamp.valueOf.
It turns out the fix is just to use timestampFormat option (not dateFormat!).
val df = spark.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","|")
.option("mode","FAILFAST")
.option("inferSchema","false")
.option("timestampFormat","yyyy/MM/dd HH:mm:ss.SSS")
.schema(sourceSchemaStruct)
.load(sourceFile)
scala> df.show(false)
+-----------------------+----+
|EventDate |Name|
+-----------------------+----+
|2016-12-19 00:43:27.583|adam|
+-----------------------+----+
Spark 2.1.0
Use schema inference in CSV using inferSchema option with your custom timestampFormat.
It's important to trigger schema inference using inferSchema for timestampFormat to take effect.
val events = spark.read
.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.option("inferSchema", true)
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
.load("so-43259485.csv")
scala> events.show(false)
+-------------------+----+
|EventDate |Name|
+-------------------+----+
|2016-12-19 00:43:27|adam|
+-------------------+----+
scala> events.printSchema
root
|-- EventDate: timestamp (nullable = true)
|-- Name: string (nullable = true)
"Incorrect" initial version left for learning purposes
val events = eventsAsString
.withColumn("date", split($"EventDate", " ")(0))
.withColumn("date", translate($"date", "/", "-"))
.withColumn("time", split($"EventDate", " ")(1))
.withColumn("time", split($"time", "[.]")(0)) // <-- remove millis part
.withColumn("EventDate", concat($"date", lit(" "), $"time")) // <-- make EventDate right
.select($"EventDate" cast "timestamp", $"Name")
scala> events.printSchema
root
|-- EventDate: timestamp (nullable = true)
|-- Name: string (nullable = true)
events.show(false)
scala> events.show
+-------------------+----+
| EventDate|Name|
+-------------------+----+
|2016-12-19 00:43:27|adam|
+-------------------+----+
Spark 2.2.0
As of Spark 2.2 you can use to_timestamp function to do the string to timestamp conversion.
eventsAsString.select($"EventDate", to_timestamp($"EventDate", "yyyy/MM/dd HH:mm:ss.SSS")).show(false)
scala> eventsAsString.select($"EventDate", to_timestamp($"EventDate", "yyyy/MM/dd HH:mm:ss.SSS")).show(false)
+-----------------------+----------------------------------------------------+
|EventDate |to_timestamp(`EventDate`, 'yyyy/MM/dd HH:mm:ss.SSS')|
+-----------------------+----------------------------------------------------+
|2016/12/19 00:43:27.583|2016-12-19 00:43:27 |
+-----------------------+----------------------------------------------------+
I searched for this issue, and discovered the offical Github issue page https://github.com/databricks/spark-csv/pull/280 which has fixed a related bug for parsing data with custom date format. I reviewed some source codes, and according to the code to find out your issue reason which is set inferSchema with the default value false as below.
inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default
Please change inferSchema with true for your date format yyyy/MM/dd HH:mm:ss.SSS using SimpleDateFormat.

Trim csv file before importing into a Spark Dataset

I've seen this post about how to specify an schema for creating a Dataset
Spark Scala: Cannot up cast from string to int as it may truncate
val spark = SparkSession.builder()
.master("local")
.appName("test")
.getOrCreate()
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Record].schema
val ds = spark.read
.option("header", "true")
.schema(schema) // passing schema
.option("timestampFormat", "MM/dd/yyyy HH:mm") // passing timestamp format
.csv(path)// csv path
.as[Record] // convert to DS
It works for me, but not when there are withispaces in the csv. Is it possible to trim the csv in this same spark.read sequence?

Resources