I am on a new environment upgraded from Spark 2.4 to Spark 3.0, and I am receiving these errors.
ERROR 1
You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyy-MM-dd hh:mm:ss aa' pattern in the DateTimeFormatter
Line causing this:
from_unixtime(unix_timestamp(powerappcapturetime_local, 'yyyy-MM-dd hh:mm:ss aa')+ (timezoneoffset*60),'yyyy-MM-dd HH:mm:ss') as powerappcapturetime
ERROR 2
DataSource.Error: ODBC: ERROR [42000] [Microsoft][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 94.0 failed 4 times, most recent failure: Lost task 0.3 in stage 94.0 (TID 1203) (10.139.64.43 executor 3): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse ' 01/19/2022' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150)
at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:86)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.Option.getOrElse(Option.scala:189)
Line causing this:
cast ( to_date ( TT_VALID_FROM_TEXT, 'MM/dd/yyyy') as timestamp) as ttvalidfrom
My code is Python with SQL embedded in it:
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
query_view_create = '''
CREATE OR REPLACE VIEW {0}.{1} as
SELECT
    customername
    , cast(to_date(TT_VALID_FROM_TEXT, 'MM/dd/yyyy') as timestamp) as ttvalidfrom
    , from_unixtime(unix_timestamp(powerappcapturetime_local, 'yyyy-MM-dd hh:mm:ss aa') + (timezoneoffset*60), 'yyyy-MM-dd HH:mm:ss') as powerappcapturetime
FROM {0}.{2}
'''.format(DATABASE_NAME, VIEW_NAME_10, TABLE_NAME_12, ENVIRONMENT)
print(query_view_create)
# Added to fix datetime issues we see when using Spark 3.0 with Power BI that don't appear in Spark 2.4
spark.sql(query_view_create)
The error still comes from Power BI when I import the table into Power BI. I am not sure what I can do to make this work and not display these errors.
@James Khan, thanks for finding the source of the problem. Posting your discussion as an answer to help other community members.
To set the legacy timeParserPolicy, the code below may work:
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
OR
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
If you are still getting the same error after this, please check this similar SO thread.
Reference:
https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/parameters/legacy_time_parser_policy
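Separately, if you would rather stay on the new (CORRECTED) parser instead of falling back to LEGACY, adjusting the patterns themselves may also work: in the Spark 3.0 datetime patterns the AM/PM marker is a single 'a' rather than 'aa', and the leading space in values like ' 01/19/2022' can be removed with trim. A rough sketch of the view from the question with those two changes (same placeholder variables; untested against your data):
# Hypothetical variant of the view above: 'a' instead of 'aa' for the AM/PM marker,
# and trim() to drop the leading space seen in values like ' 01/19/2022'.
query_view_create = '''
CREATE OR REPLACE VIEW {0}.{1} as
SELECT
    customername
    , cast(to_date(trim(TT_VALID_FROM_TEXT), 'MM/dd/yyyy') as timestamp) as ttvalidfrom
    , from_unixtime(unix_timestamp(powerappcapturetime_local, 'yyyy-MM-dd hh:mm:ss a') + (timezoneoffset*60), 'yyyy-MM-dd HH:mm:ss') as powerappcapturetime
FROM {0}.{2}
'''.format(DATABASE_NAME, VIEW_NAME_10, TABLE_NAME_12, ENVIRONMENT)
spark.sql(query_view_create)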
Related
I'm trying to read a Delta Lake table which I loaded previously using Spark, and I'm using the IntelliJ IDE.
val dt = DeltaTable.forPath(spark, "/some/path/")
Now when I try to read the table again I get the error below. It was working fine, but suddenly it throws errors like this. What might be the reason?
Note:
Checked the files in the Delta Lake path; they look good.
A colleague was able to read the same Delta Lake files.
Exception in thread "main" org.apache.spark.sql.AnalysisException: `/some/path/` is not a Delta table.
at org.apache.spark.sql.delta.DeltaErrors$.notADeltaTableException(DeltaErrors.scala:260)
at io.delta.tables.DeltaTable$.forPath(DeltaTable.scala:593)
at com.datalake.az.core.DeltaLake$.delayedEndpoint$com$walmart$sustainability$datalake$az$core$DeltaLake$1(DeltaLake.scala:66)
at com.datalake.az.core.DeltaLake$delayedInit$body.apply(DeltaLake.scala:18)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at com.datalake.az.core.DeltaLake$.main(DeltaLake.scala:18)
at com.datalake.az.core.DeltaLake.main(DeltaLake.scala)
AnalysisException: /some/path/ is not a Delta table.
This AnalysisException is thrown when the given path has no transaction log under the _delta_log directory.
There could be other issues but that's the first check.
By the way, from the stack trace I figured you may not be using the latest and greatest Delta Lake 2.0.0. Please upgrade as soon as possible, as it brings tons of improvements you don't want to miss.
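As a quick sanity check before calling forPath, you can ask Delta Lake whether the path really carries a transaction log. A minimal sketch, shown here in PySpark (the question uses Scala, but the API has the same name there), assuming the delta-spark package is available and using the placeholder path from the question:
from delta.tables import DeltaTable

path = "/some/path/"
# isDeltaTable returns True only if a valid _delta_log directory exists under the path
if DeltaTable.isDeltaTable(spark, path):
    dt = DeltaTable.forPath(spark, path)
    dt.toDF().show()
else:
    print(path + " has no _delta_log, so it is not a Delta table")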
The date format that I have is:
2006-04-01 01:00:00.000 +0200
but I require: 2006-04-01
It is not able to recognize it from the UNIX timestamp format.
valid_wdf
.withColumn("MYDateOnly", to_date(from_unixtime(unix_timestamp("Formatted Date","yyyy-MM-dd"))))
.show()
Moreover, it says something like this:
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2006-04-01 00:00:00.000 +0200' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
I also want to know why this library is used; any explanation would be appreciated.
Let's use the following query in Spark 3.0.1 and review the exception.
Seq("2006-04-01 01:00:00.000 +0200")
.toDF("d")
.select(unix_timestamp($"d","yyyy-MM-dd"))
.show
The exception does say the cause of the SparkUpgradeException, but you have to look at the bottom of the stack trace where it says:
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2006-04-01 01:00:00.000 +0200' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150)
...
Caused by: java.time.format.DateTimeParseException: Text '2006-04-01 01:00:00.000 +0200' could not be parsed, unparsed text found at index 10
at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2049)
at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1874)
at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:78)
... 140 more
There is "unparsed text found at index 10" since the pattern yyyy-MM-dd does not cover the remaining part of the input date.
Consult Datetime Patterns for valid date and time format patterns. The easiest fix seems to be to use the date_format standard function.
val q = Seq("2006-04-01 01:00:00.000 +0200")
.toDF("d")
.select(date_format($"d","yyyy-MM-dd")) // date_format
scala> q.show
+--------------------------+
|date_format(d, yyyy-MM-dd)|
+--------------------------+
| 2006-04-01|
+--------------------------+
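For comparison, an approach that parses the whole string (offset included) and then extracts the date could look like this in PySpark. The pattern string is my assumption based on the Spark 3 datetime-pattern guide, so verify it against your real data:
from pyspark.sql import functions as F

df = spark.createDataFrame([("2006-04-01 01:00:00.000 +0200",)], ["d"])
(df
 .withColumn("ts", F.to_timestamp("d", "yyyy-MM-dd HH:mm:ss.SSS Z"))  # parse the full string, offset included
 .withColumn("MYDateOnly", F.to_date("ts"))                           # keep only the date part
 .show(truncate=False))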
I am running a SparkSession in a Jupyter notebook.
I sometimes get an error on a DataFrame that was initialized by spark.read.parquet(some_path) when the files under that path have changed, even if I cache the DataFrame.
For example, the reading code is:
sp = spark.read.parquet(TB.STORE_PRODUCT)
sp.cache()
Sometimes sp cannot be accessed anymore; it complains:
Py4JJavaError: An error occurred while calling o3274.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 326.0 failed 4 times, most recent failure: Lost task 10.3 in stage 326.0 (TID 111818, dc38, executor 7): java.io.FileNotFoundException: File does not exist: hdfs://xxxx/data/dm/sales/store_product/part-00000-169428df-a9ee-431e-918b-75477c073d71-c000.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
The problem
'REFRESH TABLE tableName' doesn't work, because I don't have a Hive table; it is only an HDFS path.
Restarting the SparkSession and reading that path again can solve the problem, but I don't want to restart the SparkSession; it would waste a lot of time.
One more thing: executing sp = spark.read.parquet(TB.STORE_PRODUCT) again doesn't work. I can understand why; Spark would have to scan the path again, or there must be an option/setting to force it to scan. Keeping the whole path listing in memory is not smart.
spark.read.parquet doesn't have a force-scan option:
Signature: spark.read.parquet(*paths)
Docstring:
Loads Parquet files, returning the result as a :class:`DataFrame`.
You can set the following Parquet-specific option(s) for reading Parquet files:
* ``mergeSchema``: sets whether we should merge schemas collected from all Parquet part-files. This will override ``spark.sql.parquet.mergeSchema``. The default value is specified in ``spark.sql.parquet.mergeSchema``.
>>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
>>> df.dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
.. versionadded:: 1.4
Source:
@since(1.4)
def parquet(self, *paths):
"""Loads Parquet files, returning the result as a :class:`DataFrame`.
You can set the following Parquet-specific option(s) for reading Parquet files:
* ``mergeSchema``: sets whether we should merge schemas collected from all \
Parquet part-files. This will override ``spark.sql.parquet.mergeSchema``. \
The default value is specified in ``spark.sql.parquet.mergeSchema``.
>>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
>>> df.dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
"""
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File: /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/sql/readwriter.py
Type: method
Is there a proper way to solve my problem?
The problem is caused by DataFrame.cache.
I need to clear that cache first; then reading the path again solves the problem.
Code:
try:
sp.unpersist()
except:
pass
sp = spark.read.parquet(TB.STORE_PRODUCT)
sp.cache()
You can try two solutions.
One is to unpersist the DataFrame before reading it every time, as suggested by @Mithril.
The other is to create a temp view and trigger the REFRESH command:
sp.createOrReplaceTempView('sp_table')
spark.sql('''REFRESH TABLE sp_table''')
df=spark.sql('''select * from sp_table''')
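A third option that may work without the temp view is to invalidate Spark's cached file listing for the path directly; spark.catalog.refreshByPath has existed since Spark 2.2, though treat this as a sketch rather than a guaranteed fix for your environment:
# Invalidate cached data and file-listing metadata for everything under this path,
# then re-read so the new part files are picked up.
spark.catalog.refreshByPath(TB.STORE_PRODUCT)
sp = spark.read.parquet(TB.STORE_PRODUCT)
sp.cache()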
While trying to load data from a Dataset into a Hive table, I am getting this error:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'ipl_appl_signed_date
My Dataset contains the same columns as the Hive table, and the column for which I am getting the error has the Date datatype in my code (Java) as well as in Hive.
Java code:
Date IPL_APPL_SIGNED_DATE = rs.getDate("DTL.IPL_APPL_SIGNED_DATE"); // using JDBC to get the record
Encoder<DimPolicy> encoder = Encoders.bean(Foo.class);
Dataset<DimPolicy> test = spark.createDataset(allRows, encoder); // spark is the SparkSession
test.write().mode("append").insertInto("someSchema.someTable");
I think the issue is due to a bug in Spark, i.e. [SPARK-26379] "Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp", which was fixed in 2.3.3, 2.4.1, and 3.0.0.
A solution is to downgrade to the version of Spark that is unaffected by the bug (or wait for a new version).
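To see which Spark version a session is actually running (and therefore whether it already includes the fix), a quick check in PySpark is:
print(spark.version)  # 2.3.3, 2.4.1 and 3.0.0+ contain the SPARK-26379 fix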
I have a Cloudera VM running Spark version 1.6.0.
I created a DataFrame from a CSV file and am now filtering columns based on a WHERE clause:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///home/cloudera/sample.csv')
df.registerTempTable("closedtrips")
result = sqlContext.sql("SELECT id,`safety rating` as safety_rating, route FROM closedtrips WHERE `trip frozen` == 'YES'")
However, it gives me a runtime error on the SQL line:
py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
: java.lang.RuntimeException: [1.96] failure: identifier expected
SELECT consigner,`safety rating` as safety_rating, route FROM closedtrips WHERE `trip frozen` == 'YES'
^
Where am I going wrong here?
The above command fails on the VM command line, but works fine when run in the Databricks environment.
Also, why are column names case-sensitive in the VM? It fails to recognise 'trip frozen' because the actual column is 'Trip Frozen'.
All of this works fine in Databricks and breaks in the VM.
In your VM, are you creating sqlContext as a SQLContext or as a HiveContext?
In Databricks, the automatically-created sqlContext will always point to a HiveContext.
In Spark 2.0 this distinction between HiveContext and regular SQLContext should not matter because both have been subsumed by SparkSession, but in Spark 1.6 the two types of contexts differ slightly in how they parse SQL language input.
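If the VM script is creating a plain SQLContext, a hedged sketch of switching it to a HiveContext in Spark 1.6 PySpark would be something like the following (the CSV reading lines are copied from the question; '=' is used as the standard equality operator to avoid any parser quirks with '=='):
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="closedtrips")  # or reuse the sc already created by the pyspark shell
sqlContext = HiveContext(sc)              # HiveContext uses the more permissive HiveQL parser

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///home/cloudera/sample.csv')
df.registerTempTable("closedtrips")
result = sqlContext.sql("SELECT id, `safety rating` AS safety_rating, route FROM closedtrips WHERE `trip frozen` = 'YES'")
result.show()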