The date format I have is:
2006-04-01 01:00:00.000 +0200
but I require: 2006-04-01
Spark is not able to recognize it with the unix_timestamp-based conversion below:
valid_wdf
.withColumn("MYDateOnly", to_date(from_unixtime(unix_timestamp("Formatted Date","yyyy-MM-dd"))))
.show()
Moreover, it throws an error like this:
org.apache.spark.SparkUpgradeException: You may get a different result
due to the upgrading of Spark 3.0: Fail to parse '2006-04-01
00:00:00.000 +0200' in the new parser. You can set
spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior
before Spark 3.0, or set to CORRECTED and treat it as an invalid
datetime string.
I also want to know why this parser library is used; any explanation would be appreciated.
Let's use the following query in Spark 3.0.1 and review the exception.
Seq("2006-04-01 01:00:00.000 +0200")
.toDF("d")
.select(unix_timestamp($"d","yyyy-MM-dd"))
.show
The exception does state the cause of the SparkUpgradeException, but you have to look at the bottom of the stack trace, where it says:
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2006-04-01 01:00:00.000 +0200' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150)
...
Caused by: java.time.format.DateTimeParseException: Text '2006-04-01 01:00:00.000 +0200' could not be parsed, unparsed text found at index 10
at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2049)
at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1874)
at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:78)
... 140 more
There is "unparsed text found at index 10" since the pattern yyyy-MM-dd does not cover the remaining part of the input date.
Consult Datetime Patterns for valid date and time format patterns. The easiest fix seems to be the date_format standard function.
val q = Seq("2006-04-01 01:00:00.000 +0200")
.toDF("d")
.select(date_format($"d","yyyy-MM-dd")) // date_format
scala> q.show
+--------------------------+
|date_format(d, yyyy-MM-dd)|
+--------------------------+
| 2006-04-01|
+--------------------------+
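If you need an actual DateType column rather than a formatted string, a minimal sketch (my addition, not part of the original answer) is to slice off the first 10 characters and cast them; this deliberately ignores the +0200 offset and keeps the calendar date exactly as written in the text.
import spark.implicits._ // only needed outside spark-shell
import org.apache.spark.sql.functions.substring
// Take the literal "yyyy-MM-dd" prefix and cast it to a proper DateType column.
val dates = Seq("2006-04-01 01:00:00.000 +0200")
  .toDF("d")
  .select(substring($"d", 1, 10).cast("date").as("MYDateOnly"))
dates.show()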
Related
I am trying to run a Spark job using a Synapse pipeline, passing a timestamp as a command line argument. Before the Spark job code runs, Synapse somehow converts the string argument value from the ISO format 2019-04-25T09:00:00 to 04/25/2019 09:00:00, and that throws an error because my Spark code is designed to parse dates only in ISO format.
Does anyone know why or how Synapse is converting the timestamp argument? How can I make Synapse pass the command line argument to the Spark code as is?
Also, I see from the input shown for Spark jobs in the Synapse Monitor UI that Synapse adds a Z at the end of the argument (2019-04-25T09:00:00Z).
Error stdout Driver: Text '04/25/2019 09:00:00' could not be parsed at index 0
I tried to pass only 2019-04-25 and it still fails, saying Text '04/25/2019' could not be parsed at index 10
Pipeline: (screenshot omitted)
Spark Job: doesn't have any command line arguments (they are only passed through the pipeline).
Apparently, it's not reading the timestamp correctly.
Try casting it in the command line like so:
toTimestamp('2019-04-25T09:00:00','YYYY-MM-DDThh:mm:ss')
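Alternatively, you could make the Spark side tolerant of both formats instead of relying on the pipeline expression. A rough Scala sketch of the idea (my own suggestion, not tested in Synapse; the two patterns are assumptions based on the values shown above, and to_timestamp yields null for a value that does not match its pattern, so coalesce keeps whichever parse succeeded):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, lit, to_timestamp}
// Try the ISO pattern first, then fall back to the US-style pattern Synapse was sending.
def parseArgTimestamp(c: Column): Column =
  coalesce(
    to_timestamp(c, "yyyy-MM-dd'T'HH:mm:ss"),
    to_timestamp(c, "MM/dd/yyyy HH:mm:ss")
  )
// Example usage with a command line argument:
// df.withColumn("run_ts", parseArgTimestamp(lit(args(0))))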
Right, I am on a new environment upgraded from Spark 2.4 to Spark 3.0, and I am receiving these errors:
ERROR 1
You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyy-MM-dd hh:mm:ss aa' pattern in the DateTimeFormatter
Lines causing this –
from_unixtime(unix_timestamp(powerappcapturetime_local, 'yyyy-MM-dd hh:mm:ss aa')+ (timezoneoffset*60),'yyyy-MM-dd HH:mm:ss') as powerappcapturetime
ERROR 2
DataSource.Error: ODBC: ERROR [42000] [Microsoft][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 94.0 failed 4 times, most recent failure: Lost task 0.3 in stage 94.0 (TID 1203) (10.139.64.43 executor 3): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse ' 01/19/2022' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150)
at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:86)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.Option.getOrElse(Option.scala:189)
Lines causing this –
cast ( to_date ( TT_VALID_FROM_TEXT, 'MM/dd/yyyy') as timestamp) as ttvalidfrom
My code is Python with SQL in the middle of it.
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
query_view_create = '''
CREATE OR REPLACE VIEW {0}.{1} as
SELECT
customername
,
cast ( to_date ( TT_VALID_FROM_TEXT, 'MM/dd/yyyy') as timestamp)
as ttvalidfrom
, from_unixtime(unix_timestamp(powerappcapturetime_local, 'yyyy-MM-dd hh:mm:ss aa')+ (timezoneoffset*60),'yyyy-MM-dd HH:mm:ss') as powerappcapturetime
from {0}.{2}
'''.format(DATABASE_NAME,VIEW_NAME_10,TABLE_NAME_12,ENVIRONMENT)
print(query_view_create)
# Added to fix datetime issues we see when using Spark 3.0 with Power BI that don't appear in Spark 2.4
spark.sql(query_view_create)
The error still comes from Power BI when I import the table. I'm not sure what I can do to make this work and stop these errors from appearing.
@James Khan, thanks for finding the source of the problem. Posting your discussion as an answer to help other community members.
To set the legacy timeParserPolicy, the code below may work.
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
OR
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
If you are still getting the same error after this, please check this similar SO thread.
Reference:
https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/parameters/legacy_time_parser_policy
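If you would rather keep the new (CORRECTED) parser than fall back to LEGACY, the usual fixes for these two specific errors are to use a single 'a' instead of 'aa' for the AM/PM marker and to trim the leading space out of values like ' 01/19/2022' before parsing them with 'MM/dd/yyyy'. Below is a rough sketch of the reworked view query (my own suggestion, untested against your data; some_db, some_view and some_table are placeholders for the format() parameters in the original code):
spark.sql("""
  CREATE OR REPLACE VIEW some_db.some_view AS
  SELECT
    customername,
    cast(to_date(trim(TT_VALID_FROM_TEXT), 'MM/dd/yyyy') as timestamp) AS ttvalidfrom,
    from_unixtime(
      unix_timestamp(powerappcapturetime_local, 'yyyy-MM-dd hh:mm:ss a')
        + (timezoneoffset * 60),
      'yyyy-MM-dd HH:mm:ss'
    ) AS powerappcapturetime
  FROM some_db.some_table
""")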
I want to run a simple sql select of timestamp fields from my data using spark sql (pyspark).
However, all the timestamp fields appear as 1970-01-19 10:45:37.009.
So looks like I have some conversion incompatibility between timestamp in Glue and in Spark.
I'm running with pyspark, and I have the glue catalog configuration so I get my database schema from Glue. In both Glue and the spark sql dataframe these columns appear with timestamp type.
However, it looks like when I read the parquet files directly from the S3 path, the event_time column (for example) is of type long, and there I do get the correct event time as epoch milliseconds = 1593938489000, so I can convert it and get the actual datetime.
But when I run spark.sql, the event_time column gets timestamp type, yet the value is wrong and missing precision, so I get this: 1970-01-19 10:45:37.009.
When I run the same sql query in Athena, the timestamp field looks fine so my schema in Glue looks correct.
Is there a way to overcome it?
I didn't manage to find any spark.sql configurations that solved it.
You are getting 1970 most likely because the epoch value in milliseconds is being interpreted at the wrong scale (1593938489000 read as microseconds, Spark's internal timestamp resolution, lands in mid-January 1970). Please try the code below to convert the long value to a UTC timestamp:
from pyspark.sql import types as T
from pyspark.sql import functions as F
df = df.withColumn('timestamp_col_original', F.lit('1593938489000'))
df = df.withColumn('timestamp_col', (F.col('timestamp_col_original') / 1000).cast(T.TimestampType()))
df.show()
While converting 1593938489000, I was getting the output below:
+----------------------+-------------------+
|timestamp_col_original|      timestamp_col|
+----------------------+-------------------+
|         1593938489000|2020-07-05 08:41:29|
|         1593938489000|2020-07-05 08:41:29|
|         1593938489000|2020-07-05 08:41:29|
|         1593938489000|2020-07-05 08:41:29|
+----------------------+-------------------+
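The same conversion also works if you read the parquet files directly, as described in the question. A Scala sketch for illustration (the S3 path is a placeholder and event_time is the column name mentioned in the question):
import org.apache.spark.sql.functions.col
// Placeholder path: point this at the S3 location the Glue table reads from.
val raw = spark.read.parquet("s3://your-bucket/your-path/")
// event_time holds epoch milliseconds; dividing by 1000 gives seconds, which cast cleanly to a timestamp.
val converted = raw.withColumn("event_time_ts", (col("event_time") / 1000).cast("timestamp"))
converted.select("event_time", "event_time_ts").show(false)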
I have an xlsx file containing a date/time field (My Time) in the following format, with these sample records:
5/16/2017 12:19:00 AM
5/16/2017 12:56:00 AM
5/16/2017 1:17:00 PM
5/16/2017 5:26:00 PM
5/16/2017 6:26:00 PM
I am reading the xlsx file in the following manner:
val inputDF = spark.sqlContext.read.format("com.crealytics.spark.excel")
.option("location","file:///C:/Users/file.xlsx")
.option("useHeader","true")
.option("treatEmptyValuesAsNulls","true")
.option("inferSchema","true")
.option("addColorColumns","false")
.load()
When I try to get the schema using
inputDF.printSchema()
I get Double. Sometimes I even get the schema as String.
And when I print the data, I get this output:
------------------
My Time
------------------
42871.014189814814
42871.03973379629
42871.553773148145
42871.72765046296
42871.76887731482
------------------
The above output is clearly not correct for the given input.
Moreover, if I convert the xlsx file to CSV format and read that instead, I get the output correctly. Here is how I read it in CSV format:
spark.sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", true)
.load("file:///C:/Users/file.xlsx")
So, any help on how to infer the correct schema for a column of date type would be appreciated.
Note:
Spark version is 2.0.0
Language used is Scala
I ran into the same problem, and I also have no idea why. However, I suggest you set "inferSchema" to "false" and then define the schema yourself.
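To sketch what defining it yourself could look like: those Double values are Excel serial dates (days since 1899-12-30 in Excel's 1900 date system), and 25569 is the number of days between that origin and the Unix epoch, so subtracting it and multiplying by 86400 gives epoch seconds. This is my own rough conversion, untested against this exact file and ignoring time zone subtleties:
import org.apache.spark.sql.functions.{col, lit}
// Convert the Excel serial number in "My Time" (days since 1899-12-30) to a timestamp:
// (serial - 25569) days * 86400 seconds/day = seconds since the Unix epoch.
val withTime = inputDF.withColumn(
  "My Time parsed",
  ((col("My Time") - lit(25569)) * lit(86400)).cast("timestamp")
)
withTime.select("My Time", "My Time parsed").show(false)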
I am trying to save a DataFrame as .csv in Spark. It is required to have all fields enclosed in quotes. Currently, the fields in the file are not enclosed in quotes.
I am using Spark 2.1.0
Code :
DataOutputResult.write.format("com.databricks.spark.csv").
option("header", true).
option("inferSchema", false).
option("quoteMode", "ALL").
mode("overwrite").
save(Dataoutputfolder)
Output format (actual):
Name, Id,Age,Gender
XXX,1,23,Male
Output format (required):
"Name", "Id" ," Age" ,"Gender"
"XXX","1","23","Male"
Options I tried so far:
quoteMode and quote in the write options, but with no success.
("quote", "all"), replace quoteMode with quote
or play with concat or concat_wsdirectly on df columns and save without quote - mode
import org.apache.spark.sql.functions.{concat, lit}
val newDF = df.select(concat($"Name", lit("\""), $"Age"))
Or create your own UDF to add the desired behaviour; please find more examples in Concatenate columns in apache spark dataframe.
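For completeness, here is a rough sketch of that manual approach applied to every column (the quoteAll option in the answer below is the simpler route if your Spark version supports it). It is untested; df and Dataoutputfolder refer to the DataFrame and output path from the snippets above, and setting quote to \u0000 is meant to stop the writer from escaping the quotes added here. Note that the header row itself will not be quoted this way, and null values come out empty and unquoted.
import org.apache.spark.sql.functions.{col, concat, lit}
// Wrap every value in literal double quotes ourselves.
val quotedDF = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, concat(lit("\""), col(c).cast("string"), lit("\"")))
}
// Disable the writer's own quoting so it does not escape the quotes we just added.
quotedDF.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("quote", "\u0000")
  .mode("overwrite")
  .save(Dataoutputfolder)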
Unable to add as a comment to the above answer, so posting as an answer.
In Spark 2.3.1, use quoteAll
df1.write.format("csv")
.option("header", true)
.option("quoteAll","true")
.save(Dataoutputfolder)
Also, to add to the comment of @Karol Sudol (great answer, btw): .option("quote","\u0000") will work only if one is using PySpark with Python 3, whose default encoding is 'utf-8'. A few people reported that the option did not work because they must have been using PySpark with Python 2, whose default encoding is 'ascii', hence the error "java.lang.RuntimeException: quote cannot be more than one character".