Parse micro/nanosecond timestamps in spark-csv DataFrame reader: inconsistent results - apache-spark

I'm trying to read a CSV file that has timestamps with nanosecond precision.
Spark 2.4.0, Scala 2.11.11.
Sample content of the file TestTimestamp.csv:
/**
* TestTimestamp.csv -
* 101,2019-SEP-23 11.42.35.456789123 AM
*
*/
I tried to read it using timestampFormat = "yyyy-MMM-dd hh.mm.ss.SSSSSSSSS aaa":
val dataSchema = StructType(Array(StructField("ID", DoubleType, true), StructField("Created_TS", TimestampType, true)))
val data = spark.read.format("csv")
.option("header", "false")
.option("inferSchema", "false")
.option("treatEmptyValuesAsNulls", "true")
//.option("nullValue", "")
.option("dateFormat", "yyyy-MMM-dd")
.option("timestampFormat", "yyyy-MMM-dd hh.mm.ss.SSSSSSSSS aaa")
.schema(dataSchema)
.load("C:\\TestData\\Raw\\TetraPak\\Shipments\\TestTimeStamp.csv")
data.select('Created_TS).show
The output I get is a completely wrong date-time: 23 September got changed to 28 September.
+--------------------+
| Created_TS|
+--------------------+
|2019-09-28 18:35:...|
+--------------------+
Even if I have hours in 24-hour format, like
"2019-SEP-23 16.42.35.456789123"
and I try to use only the first few digits of the second fraction by giving timestampFormat = "yyyy-MMM-dd HH.mm.ss.SSS", I get a similarly incorrect result:
val data2 = spark.read.format("csv")
.option("header", "false")
.option("inferSchema", "false")
.option("treatEmptyValuesAsNulls", "true")
//.option("nullValue", "")
.option("dateFormat", "yyyy-MMM-dd")
.option("timestampFormat", "yyyy-MMM-dd hh.mm.ss.SSS")
.schema(dataSchema)
.load("C:\\TestData\\Raw\\TetraPak\\Shipments\\TestTimeStamp.csv")
data2.select('Created_TS).show
+--------------------+
| Created_TS|
+--------------------+
|2019-09-28 23:35:...|
+--------------------+
Is there any way to parse such timestamp strings while creating a DataFrame using the CSV reader?

The DataFrameReader uses SimpleDateFormat for parsing timestamps:
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
Unfortunately, SimpleDateFormat does not support nanoseconds, so the part of your dates after the last dot is interpreted as 456789123 milliseconds, which is approximately 126 hours. This time is added to your date, which explains the strange results that you see. More details on this topic can be found in this answer.
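To see where the shift comes from, here is a minimal JVM-only sketch (plain java.text.SimpleDateFormat, independent of Spark):
import java.text.SimpleDateFormat

// SSS is a lenient numeric field, so SimpleDateFormat reads the whole fraction
// 456789123 as milliseconds (roughly 5 days 6 h 53 min) and adds it to 11:42:35,
// which is how 23 Sep turns into 28 Sep 18:35 in the output above
val sdf = new SimpleDateFormat("yyyy-MMM-dd hh.mm.ss.SSS aaa")
println(sdf.parse("2019-SEP-23 11.42.35.456789123 AM"))
// prints something like: Sat Sep 28 18:35:44 ... 2019 (in the JVM's default time zone)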
So the dates have to be parsed in a second step after reading the CSV, for example with a UDF that uses a DateTimeFormatter:
import java.sql.Timestamp
import java.time.LocalDateTime
import java.time.format.DateTimeFormatterBuilder
import org.apache.spark.sql.functions.udf
import spark.implicits._

val dataSchema = StructType(Array(StructField("ID", DoubleType, true), StructField("Created_TS_String", StringType, true)))

// read the timestamp column as a plain string first
var df = spark.read.option("header", false)
  .option("inferSchema", "false")
  .option("treatEmptyValuesAsNulls", "true")
  .schema(dataSchema)
  .csv("C:\\TestData\\Raw\\TetraPak\\Shipments\\TestTimeStamp.csv")

// parse the string with java.time, which handles nanosecond fractions
val toDate = udf((date: String) => {
  val formatter = new DateTimeFormatterBuilder()
    .parseCaseInsensitive()
    .appendPattern("yyyy-MMM-dd hh.mm.ss.SSSSSSSSS a").toFormatter()
  Timestamp.valueOf(LocalDateTime.parse(date, formatter))
})

df = df.withColumn("Created_TS", toDate('Created_TS_String))
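With the UDF applied, the same show call as in the question should now give the expected value (Spark timestamps carry microsecond precision, so the nanoseconds get truncated); roughly:
df.select('Created_TS).show(false)
// +--------------------------+
// |Created_TS                |
// +--------------------------+
// |2019-09-23 11:42:35.456789|
// +--------------------------+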

Here is the solution, inspired by werner's answer about using UDFs.
Input CSV:
101,2019-SEP-23 11.42.35.456789123 AM,2019-SEP-23 11.42.35.456789123 AM,2019-SEP-23 11.42.35.456789123 AM
Original Schema with TimestampType columns
val orig_schema = StructType(Array(StructField("ID", DoubleType, true), StructField("Created_TS", TimestampType, true), StructField("Updated_TS", TimestampType, true), StructField("Modified_TS", TimestampType, true)))
Convert all TimestampType to StringType
val dataSchema = StructType(orig_schema.map(x => {
  x.dataType match {
    case TimestampType => StructField(x.name, StringType, x.nullable)
    case _ => x
  }
}))
toDate function to convert a String to a Timestamp:
//TODO parameterize string formats
def toDate(date: String): java.sql.Timestamp = {
  val formatter = new DateTimeFormatterBuilder()
    .parseCaseInsensitive()
    .appendPattern("yyyy-MMM-dd hh.mm.ss.SSSSSSSSS a").toFormatter()
  Timestamp.valueOf(LocalDateTime.parse(date, formatter))
}
// register toDate as udf
val to_timestamp = spark.sqlContext.udf.register("to_timestamp", toDate _)
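Because the UDF is registered with the SQLContext under the name to_timestamp, it can be called from SQL expression strings, which is what the CASE WHEN expressions below rely on. A quick sanity check might look like:
spark.sql("SELECT to_timestamp('2019-SEP-23 11.42.35.456789123 AM') AS ts").show(false)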
Create Column Expression to select from raw Dataframe
// Array of Column Name & Types
val nameType: Array[(String, DataType)] = orig_schema.fields.map(f => (f.name, f.dataType))
// Create Column Expression to select from raw Dataframe
val selectExpr = nameType.map(f => {
  f._2 match {
    case TimestampType => expr(s"CASE WHEN ${f._1} is NULL THEN NULL ELSE to_timestamp(${f._1}) END AS ${f._1}")
    case _ => expr(s"${f._1}")
  }
})
Read the timestamp columns as StringType, then use the column selector expressions, which apply the UDF to convert String to Timestamp:
val data = spark.read.format("csv")
.option("header", "false")
.option("inferSchema", "false")
.option("treatEmptyValuesAsNulls", "true")
//.option("nullValue", "")
.option("dateFormat", "yyyy-MMM-dd")
.option("timestampFormat", "yyyy-MMM-dd hh.mm.ss.SSSSSSSSS aaa")
.schema(dataSchema)
.load("C:\\TestData\\Raw\\TetraPak\\Shipments\\TestTimestamp_new.csv").select(selectExpr: _*)
data.show
Here's the desired output. Now I don't have to worry about the number of columns or about creating UDF expressions manually:
+-----+--------------------+--------------------+--------------------+
| ID| Created_TS| Updated_TS| Modified_TS|
+-----+--------------------+--------------------+--------------------+
|101.0|2019-09-23 11:42:...|2019-09-23 11:42:...|2019-09-23 11:42:...|
+-----+--------------------+--------------------+--------------------+

Related

spark.read.excel - Not reading all Excel rows when using custom schema

I am trying to read a Spark DataFrame from an 'excel' file. I used the crealytics dependency.
Without any predefined schema, all rows are read correctly, but only as string type columns.
To prevent that, I am using my own schema (where I specify certain columns to be Integer type), but in this case most of the rows are dropped when the file is read.
The library dependency used in build.sbt:
"com.crealytics" %% "spark-excel" % "0.11.1",
Scala version - 2.11.8
Spark version - 2.3.2
val inputDF = sparkSession.read.excel(useHeader = true).load(inputLocation(0))
The above reads all the data - around 25000 rows.
But,
val inputWithSchemaDF: DataFrame = sparkSession.read
.format("com.crealytics.spark.excel")
.option("useHeader" , "false")
.option("inferSchema", "false")
.option("addColorColumns", "true")
.option("treatEmptyValuesAsNulls" , "false")
.option("keepUndefinedRows", "true")
.option("maxRowsInMey", 2000)
.schema(templateSchema)
.load(inputLocation)
This gives me only 450 rows.
Is there a way to prevent that? Thanks in advance!
As of now, I haven't found a fix for this problem, but I tried solving it in a different way by manually type-casting. To keep the number of lines of code down, I took the help of a for loop. My solution is as follows:
Step 1: Create my own schema of type 'StructType':
val requiredSchema = new StructType()
.add("ID", IntegerType, true)
.add("Vendor", StringType, true)
.add("Brand", StringType, true)
.add("Product Name", StringType, true)
.add("Net Quantity", StringType, true)
Step 2: Type-cast the DataFrame AFTER it has been read (WITHOUT the custom schema) from the Excel file, instead of applying the schema while reading the data:
def convertInputToDesiredSchema(inputDF: DataFrame, requiredSchema: StructType)(implicit sparkSession: SparkSession): DataFrame = {
  var schemaDf: DataFrame = inputDF
  for (i <- inputDF.columns.indices) {
    if (inputDF.schema(i).dataType.typeName != requiredSchema(i).dataType.typeName) {
      schemaDf = schemaDf.withColumn(schemaDf.columns(i), col(schemaDf.columns(i)).cast(requiredSchema.apply(i).dataType))
    }
  }
  schemaDf
}
This might not be an efficient solution, but it is better than typing out many lines of code to typecast multiple columns.
I am still searching for a solution to my original question.
This solution is just in case someone wants to try it and is in immediate need of a quick fix.
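For what it's worth, the same positional casting can also be written as a single select instead of a loop; a minimal sketch, under the same assumption that the columns line up with requiredSchema by position:
import org.apache.spark.sql.functions.col

// cast each column (matched by position) to the type declared in requiredSchema
// and rename it to the schema's field name, in one pass over the column list
val castedDF = inputDF.select(
  inputDF.columns.zip(requiredSchema.fields).map { case (c, f) =>
    col(c).cast(f.dataType).as(f.name)
  }: _*
)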
Here's a workaround using PySpark, with a schema file that consists of "fieldname" and "dataType":
# 1st load the dataframe with StringType for all columns
from pyspark.sql.types import *
input_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", isHeaderOn) \
.option("treatEmptyValuesAsNulls", "true") \
.option("dataAddress", xlsxAddress1) \
.option("setErrorCellsToFallbackValues", "true") \
.option("ignoreLeadingWhiteSpace", "true") \
.option("ignoreTrailingWhiteSpace", "true") \
.load(inputfile)
# 2nd Modify the datatypes within the dataframe using a file containing column names and the expected data types.
import pandas as pd
from pyspark.sql.functions import col

dtypes = pd.read_csv("/dbfs/mnt/schema/{}".format(file_schema_location), header=None).to_records(index=False).tolist()
fields = [StructField(dtype[0], globals()[f'{dtype[1]}']()) for dtype in dtypes]
schema = StructType(fields)
for dt in dtypes:
    colname = dt[0]
    coltype = dt[1].replace("Type", "")
    input_df = input_df.withColumn(colname, col(colname).cast(coltype))

Spark Delta Table Add new columns in middle Schema Evolution

I have to ingest a file with a new column into an existing table structure.
create table sch.test (
name string ,
address string
) USING DELTA
--OPTIONS ('mergeSchema' 'true')
PARTITIONED BY (name)
LOCATION '/mnt/loc/fold'
TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true);
Code to read the file:
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/mnt/loc/fold")
display(df)
File in path contains below data
name,address
raghu,india
raj,usa
On writing it to a table,
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("mergeSchema", "true")
.format("delta")
.saveAsTable("sch.test")
display(spark.read.table("sch.test"))
Adding a new column,
name,address,age
raghu,india,12
raj,usa,13
Read the file,
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/mnt/loc/fold")
display(df)
While writing into the table using insertInto,
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("mergeSchema", "true")
.format("delta")
.insertInto("sch.test")
display(spark.read.table("sch.test"))
Getting the below error,
Setting overwriteSchema to true will wipe out the old schema and let you create a completely new table.
import org.apache.spark.sql.functions._
df.withColumn(""az_insert_ts"", current_timestamp())
.withColumn(""exec_run_id"",lit(""233""))
.withColumn(""az_inp_file_name"",lit(""24234filename""))
.coalesce(12)
.write
.mode(""append"")
.option(""overwriteSchema"", ""true"")
.format(""delta"")
.insertInto(""sch.test"")
display(spark.read.table(""sch.test""))

Setting date format in PySpark SQL schema definition [duplicate]

I have a timestamp field in a CSV file that I load to a DataFrame using the spark-csv library. The same piece of code works on my local machine with Spark 2.0 but throws an error on Azure Hortonworks HDP 3.5 and 3.6.
I have checked, and Azure HDInsight 3.5 also uses the same Spark version, so I don't think it's a problem with the Spark version.
import org.apache.spark.sql.types._
val sourceFile = "C:\\2017\\datetest"
val sourceSchemaStruct = new StructType()
.add("EventDate",DataTypes.TimestampType)
.add("Name",DataTypes.StringType)
val df = spark.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","|")
.option("mode","FAILFAST")
.option("inferSchema","false")
.option("dateFormat","yyyy/MM/dd HH:mm:ss.SSS")
.schema(sourceSchemaStruct)
.load(sourceFile)
The whole exception is as follows:
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:237)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:179)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply$mcJ$sp(UnivocityParser.scala:142)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply(UnivocityParser.scala:142)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply(UnivocityParser.scala:142)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13.apply(UnivocityParser.scala:139)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13.apply(UnivocityParser.scala:135)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$nullSafeDatum(UnivocityParser.scala:179)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9.apply(UnivocityParser.scala:135)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9.apply(UnivocityParser.scala:134)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:215)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:187)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:304)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:304)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
... 27 more
The csv file has only one row as follows:
"EventDate"|"Name"
"2016/12/19 00:43:27.583"|"adam"
TL;DR Use timestampFormat option (not dateFormat).
I've managed to reproduce it in the latest Spark version 2.3.0-SNAPSHOT (built from the master).
// OS shell
$ cat so-43259485.csv
"EventDate"|"Name"
"2016/12/19 00:43:27.583"|"adam"
// spark-shell
scala> spark.version
res1: String = 2.3.0-SNAPSHOT
case class Event(EventDate: java.sql.Timestamp, Name: String)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Event].schema
scala> spark
.read
.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.schema(schema)
.load("so-43259485.csv")
.show(false)
17/04/08 11:03:42 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 7)
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:237)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:167)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply$mcJ$sp(UnivocityParser.scala:146)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply(UnivocityParser.scala:146)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply(UnivocityParser.scala:146)
at scala.util.Try.getOrElse(Try.scala:79)
The corresponding line in the Spark sources is the "root cause" of the issue:
Timestamp.valueOf(s)
Having read the javadoc of Timestamp.valueOf, you can learn that the argument should be:
timestamp in format yyyy-[m]m-[d]d hh:mm:ss[.f...]. The fractional seconds may be omitted. The leading zero for mm and dd may also be omitted.
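A quick way to see this is to call Timestamp.valueOf directly on the value from the CSV; a minimal check:
import java.sql.Timestamp

// slash-separated dates do not match yyyy-[m]m-[d]d hh:mm:ss[.f...],
// so this throws the same exception that appears in the stack trace above
Timestamp.valueOf("2016/12/19 00:43:27.583")
// java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]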
Note "The fractional seconds may be omitted" so let's cut it off by first loading the EventDate as a String and only after removing the unneeded fractional seconds convert it to Timestamp.
val eventsAsString = spark.read.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.load("so-43259485.csv")
It turns out that for fields of TimestampType, Spark uses the timestampFormat option first if it is defined, and only falls back to the code that uses Timestamp.valueOf when it is not.
It turns out the fix is just to use the timestampFormat option (not dateFormat!).
val df = spark.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","|")
.option("mode","FAILFAST")
.option("inferSchema","false")
.option("timestampFormat","yyyy/MM/dd HH:mm:ss.SSS")
.schema(sourceSchemaStruct)
.load(sourceFile)
scala> df.show(false)
+-----------------------+----+
|EventDate |Name|
+-----------------------+----+
|2016-12-19 00:43:27.583|adam|
+-----------------------+----+
Spark 2.1.0
Use schema inference in CSV using inferSchema option with your custom timestampFormat.
It's important to trigger schema inference using inferSchema for timestampFormat to take effect.
val events = spark.read
.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.option("inferSchema", true)
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
.load("so-43259485.csv")
scala> events.show(false)
+-------------------+----+
|EventDate |Name|
+-------------------+----+
|2016-12-19 00:43:27|adam|
+-------------------+----+
scala> events.printSchema
root
|-- EventDate: timestamp (nullable = true)
|-- Name: string (nullable = true)
"Incorrect" initial version left for learning purposes
val events = eventsAsString
.withColumn("date", split($"EventDate", " ")(0))
.withColumn("date", translate($"date", "/", "-"))
.withColumn("time", split($"EventDate", " ")(1))
.withColumn("time", split($"time", "[.]")(0)) // <-- remove millis part
.withColumn("EventDate", concat($"date", lit(" "), $"time")) // <-- make EventDate right
.select($"EventDate" cast "timestamp", $"Name")
scala> events.printSchema
root
|-- EventDate: timestamp (nullable = true)
|-- Name: string (nullable = true)
events.show(false)
scala> events.show
+-------------------+----+
| EventDate|Name|
+-------------------+----+
|2016-12-19 00:43:27|adam|
+-------------------+----+
Spark 2.2.0
As of Spark 2.2 you can use to_timestamp function to do the string to timestamp conversion.
eventsAsString.select($"EventDate", to_timestamp($"EventDate", "yyyy/MM/dd HH:mm:ss.SSS")).show(false)
scala> eventsAsString.select($"EventDate", to_timestamp($"EventDate", "yyyy/MM/dd HH:mm:ss.SSS")).show(false)
+-----------------------+----------------------------------------------------+
|EventDate |to_timestamp(`EventDate`, 'yyyy/MM/dd HH:mm:ss.SSS')|
+-----------------------+----------------------------------------------------+
|2016/12/19 00:43:27.583|2016-12-19 00:43:27 |
+-----------------------+----------------------------------------------------+
I searched for this issue and discovered the official GitHub pull request https://github.com/databricks/spark-csv/pull/280, which fixed a related bug for parsing data with a custom date format. I reviewed some of the source code, and according to the code the reason for your issue is that inferSchema is left at its default value of false, as described below.
inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default
Please change inferSchema to true for your date format yyyy/MM/dd HH:mm:ss.SSS, which uses SimpleDateFormat.
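In other words, a minimal sketch of the suggested change (keeping the reader options from the question, but dropping the explicit schema so that inference can run, and using the timestampFormat option as in the accepted answer above):
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("mode", "FAILFAST")
  .option("inferSchema", "true")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss.SSS")
  .load(sourceFile)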

Trim csv file before importing into a Spark Dataset

I've seen this post about how to specify a schema for creating a Dataset:
Spark Scala: Cannot up cast from string to int as it may truncate
val spark = SparkSession.builder()
.master("local")
.appName("test")
.getOrCreate()
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Record].schema
val ds = spark.read
.option("header", "true")
.schema(schema) // passing schema
.option("timestampFormat", "MM/dd/yyyy HH:mm") // passing timestamp format
.csv(path)// csv path
.as[Record] // convert to DS
It works for me, but not when there are whitespaces in the CSV. Is it possible to trim the CSV in this same spark.read sequence?
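One option that may help here (untested against this exact data, so treat it as a sketch): the CSV reader has built-in ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options, which trim values during the same spark.read call:
val ds = spark.read
  .option("header", "true")
  .option("ignoreLeadingWhiteSpace", "true")   // strip spaces before each value
  .option("ignoreTrailingWhiteSpace", "true")  // strip spaces after each value
  .schema(schema)                              // passing schema, as in the question
  .option("timestampFormat", "MM/dd/yyyy HH:mm")
  .csv(path)
  .as[Record]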

Why specifying schema to be DateType / TimestampType will make querying extremely slow?

I'm using spark-csv 1.1.0 and Spark 1.5. I make the schema as follows:
private def makeSchema(tableColumns: List[SparkSQLFieldConfig]): StructType = {
  new StructType(
    tableColumns.map(p => p.ColumnDataType match {
      case FieldDataType.Integer => StructField(p.ColumnName, IntegerType, nullable = true)
      case FieldDataType.Decimal => StructField(p.ColumnName, FloatType, nullable = true)
      case FieldDataType.String => StructField(p.ColumnName, StringType, nullable = true)
      case FieldDataType.DateTime => StructField(p.ColumnName, TimestampType, nullable = true)
      case FieldDataType.Date => StructField(p.ColumnName, DateType, nullable = true)
      case FieldDataType.Boolean => StructField(p.ColumnName, BooleanType, nullable = false)
      case _ => StructField(p.ColumnName, StringType, nullable = true)
    }).toArray
  )
}
But when there are DateType columns, my queries on the DataFrame are very slow. (The queries are just simple groupBy(), sum() and so on.)
With the same dataset, after I commented out the two lines that map Date to DateType and DateTime to TimestampType (that is, mapped them to StringType instead), the queries become much faster.
What is the possible reason for this? Thank you very much!
We have found a possible answer to this problem.
When a column is simply specified as DateType or TimestampType, spark-csv will try to parse the dates with all of its internal formats for each row, which makes the parsing process much slower.
From its official documentation, it seems that we can specify the format for the dates in an option. I suppose that can make the parsing process much faster.
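A hedged sketch of that suggestion for spark-csv on Spark 1.5 (the dateFormat option comes from the spark-csv README; the pattern below is a placeholder and has to match the format actually present in the files):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss") // placeholder pattern
  .schema(makeSchema(tableColumns))
  .load(path)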
