spark sql current timestamp function - apache-spark

Is there a SQL function in Spark SQL that returns the current timestamp? For example, in Impala, NOW() is the function that returns the current timestamp. Is there something similar in Spark SQL?
Thanks

Try the current_timestamp function.
current_timestamp() - Returns the current timestamp at the start of query evaluation. All calls of current_timestamp within the same query return the same value.

You can use the date and timestamp functions from Spark SQL; the same functions are also available in pyspark.sql.functions.
Example:
spark-sql> select current_timestamp();
2022-05-07 16:43:43.207
Time taken: 0.17 seconds, Fetched 1 row(s)
spark-sql> select current_date();
2022-05-07
Time taken: 5.224 seconds, Fetched 1 row(s)
spark-sql>
Reference:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.current_timestamp.html

You can use the following Scala code to get the current date and timestamp in Spark.
import org.apache.spark.sql.functions._

// df is an existing DataFrame; add the current date and timestamp as new columns
val newDf = df.withColumn("current_date", current_date())
  .withColumn("current_timestamp", current_timestamp())
The result will be something like this.
+------------+-----------------------+
|current_date|current_timestamp |
+------------+-----------------------+
|2022-06-06 |2022-06-06 12:25:55.349|
+------------+-----------------------+

Related

Spark sql using posix

To find non-numeric rows we can do something like the below in Spark SQL:
spark.sql("select * from tabl where UPC rlike '[^0-9]'").show()
Can this same query also be written like the below? I tested it and it does not seem to work; basically I am trying to use the POSIX character classes [:alpha:], [:digit:], and [:alnum:].
spark.sql("select * from tabl where UPC rlike '[^[:digit:]]'").show()
spark-sql> select * from ( select '8787687' col1) where rlike (col1,'^[[:digit:]]+$');
Time taken: 0.02 seconds
spark-sql>
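The empty result above is expected: Spark's rlike goes through Java regular expressions, which do not understand POSIX bracket expressions such as [[:digit:]]. Below is a minimal sketch of working alternatives; the DataFrame and sample values are made up for illustration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rlike-posix-sketch").getOrCreate()
import spark.implicits._

// illustrative data: one all-digit UPC and one containing a letter
val df = Seq("8787687", "87A7687").toDF("UPC")

// rows whose UPC contains at least one non-digit character
df.filter($"UPC".rlike("[^0-9]")).show()

// the same check using Java's POSIX-style \p{...} classes
df.filter($"UPC".rlike("\\P{Digit}")).show()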

Azure Databricks Delta Table modifies the TIMESTAMP format while writing from Spark DataFrame

I am new to Azure Databricks. I am trying to write a DataFrame output to a Delta table that contains a TIMESTAMP column, but strangely the TIMESTAMP pattern changes after writing to the Delta table.
My DataFrame output column holds the value in this format: 2022-05-13 17:52:09.771
But after writing it to the table, the column value is populated as
2022-05-13T17:52:09.771+0000
I am using the code below to generate this DataFrame output.
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{lit, to_timestamp}

val pretsUTCText = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
val tsUTCText: String = pretsUTCText.format(ts) // ts is a java.util.Date defined elsewhere
val tsUTCCol = lit(tsUTCText)
// the target column name is assumed here; the original snippet omitted it
val df = df2.withColumn("TIMESTAMP", to_timestamp(tsUTCCol, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
The DataFrame output returns 2022-05-13 17:52:09.771 as the TIMESTAMP pattern,
but after writing it to the Delta table I see the same value populated as 2022-05-13T17:52:09.771+0000.
I could not find any solution. Thanks in advance.
I have just found the same behaviour on Databricks as you, and it differs from what the Databricks documentation describes. It seems that after some version Databricks shows the timezone by default, which is why you see the additional +0000. If you don't want it, I think you can use the date_format function when you populate the data. Also, I don't think you need the 'Z' in the format text, since it stands for the timezone.
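As a rough sketch of that suggestion (this assumes an existing SparkSession named spark, as in a Databricks notebook; the column and table names are illustrative, not taken from the question):
import org.apache.spark.sql.functions.{col, current_timestamp, date_format}

// illustrative DataFrame with a single timestamp column
val df = spark.range(1).withColumn("event_ts", current_timestamp())

// render the timestamp as plain text without a timezone offset
val formatted = df.withColumn(
  "event_ts_text",
  date_format(col("event_ts"), "yyyy-MM-dd HH:mm:ss.SSS"))

// table name is illustrative
formatted.write.format("delta").mode("append").saveAsTable("my_delta_table")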

Same query resulting in different outputs in Hive vs Spark

Hive 2.3.6-mapr
Spark v2.3.1
I am running the same query in both:
select count(*)
from TABLE_A a
left join TABLE_B b
on a.key = b.key
and b.date > '2021-01-01'
and date_add(last_day(add_months(a.create_date, -1)),1) < '2021-03-01'
where cast(a.TIMESTAMP as date) >= '2021-01-20'
and cast(a.TIMESTAMP as date) < '2021-03-01'
But I get 1B rows as output in Hive, while 1.01B in Spark SQL.
From some initial analysis, it seems like all the extra rows in Spark have the TIMESTAMP column set to 2021-02-28 00:00:00.000000.
Both the TIMESTAMP and create_date columns have data type string.
What could be the reason behind this?
I will give you one possibility, but I need more information.
If you drop an external table, the data remains and Spark can still read it, but the metadata in Hive says it doesn't exist, so Hive doesn't read it.
That could explain the difference.
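One way to follow up on the initial analysis above is to pull out just those boundary rows and check how the string TIMESTAMP is being cast. A rough sketch (table and column names follow the question, the filter value comes from the observation above, and spark is assumed to be an existing SparkSession):
// isolate the rows that only Spark appears to count and inspect the cast
val extras = spark.sql("""
  SELECT a.TIMESTAMP, CAST(a.TIMESTAMP AS date) AS ts_as_date
  FROM TABLE_A a
  WHERE a.TIMESTAMP LIKE '2021-02-28 00:00:00%'
""")
extras.show(truncate = false)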

In Cassandra cql query how to convert string containing timestamp and then using in where clause for time series queries?

In a Cassandra table, created_date is a string column, e.g. "2018-04-26 12:59:38 UTC". I need to use this column to build a time series query like:
Select * from dynamic_data where toTimeStamp(created_date) >=? and created_date <=?;
Is there any built-in function in Cassandra to convert the string to a timestamp and then use it in a time series query?

Spark SQL: How to convert time string column in "yyyy-MM-dd HH:mm:ss.SSSSSSSSS" format to timestamp preserving nanoseconds?

I am trying to convert a String type column holding timestamp strings in "yyyy-MM-dd HH:mm:ss.SSSSSSSSS" format to Timestamp type. This cast operation should preserve the nanosecond values.
I tried using the unix_timestamp() and to_timestamp() methods, specifying the timestamp format, but they return NULL values.
using cast:
hive> select cast('2019-01-01 12:10:10.123456789' as timestamp);
OK
2019-01-01 12:10:10.123456789
Time taken: 0.611 seconds, Fetched: 1 row(s)
using timestamp():
hive> select timestamp('2019-01-01 12:10:10.123456789','yyyy-MM-dd HH:mm:ss.SSSSSSSSS');
OK
2019-01-01 12:10:10.123456789
Time taken: 12.845 seconds, Fetched: 1 row(s)
As per the descriptions in the source code of the TimestampType and DateTimeUtils classes, they support timestamps up to microsecond precision only.
So we cannot store timestamps with nanosecond precision in a Spark SQL TimestampType column.
References:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
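A quick sketch illustrating that limit (spark is assumed to be an existing SparkSession; the exact behaviour of the cast varies by Spark version, but the nanosecond digits are never kept):
import spark.implicits._

val df = Seq("2019-01-01 12:10:10.123456789").toDF("ts_str")

// depending on the Spark version this cast either truncates the fraction to
// microseconds or yields NULL; it never preserves the nanosecond digits
df.select($"ts_str".cast("timestamp").as("ts")).show(truncate = false)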
