Can someone show me how to extract the year from a string date in Databricks SQL?
I am based in the UK and our date format is normally as follows:
dd/mm/yyyy
The field containing the dates is set as StringType()
I am trying to extract the year from the string as follows:
select year(cast(financials_0_accountsDate as Date)) from `financiallimited_csv`
I'm using the following code to extract the quarter:
select quarter(cast(financials_0_accountsDate as Date)) from `financiallimited_csv`
However, both result in NULL values.
Any thoughts on how to extract the year and quarter from dates stored as StringType() in dd/mm/yyyy format?
Could you try the to_date function?
select year(to_date(financials_0_accountsDate, 'dd/MM/yyyy')) from `financiallimited_csv`
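The same conversion should work for the quarter (and month, day, etc.), so against the same table:

select quarter(to_date(financials_0_accountsDate, 'dd/MM/yyyy')) from `financiallimited_csv`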
I get null for the timestamp 27-04-2021 14:11 with this code. What mistake am I making? Why is the timestamp format string DD-MM-yyyy HH:mm not correct here?
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([('27-04-2021 14:11',)], ['t'])
df = df.select(to_timestamp(df.t, 'DD-MM-yyyy HH:mm').alias('dt'))
display(df)
In Spark's datetime patterns, D is day of the year and d is day of the month.
Try this:
from pyspark.sql import functions as F

df = df.select(F.to_timestamp(df.t, "dd-MM-yyyy HH:mm").alias("dt"))
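With the corrected pattern, dt should parse as 2021-04-27 14:11:00 instead of null.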
Our DBAs set up our Hive table with the date column as the partition column, but as a string in YYYYMMDD format.
How can I filter this "date" column in a WHERE clause for something like the last 30 days?
Use date_format to format the system date minus 30 days as yyyyMMdd, then compare that with your partition column. Note: use the partition column as-is so Hive can choose the correct partitions.
If you want the data from exactly 30 days ago:
select *
from mytable
where partition_col = date_format( current_date() - interval '30' days, 'yyyyMMdd')
If you want all data from the last 30 days:
select *
from mytable
where cast(partition_col as INT) >= cast(date_format(current_date() - interval '30' days, 'yyyyMMdd') as INT)
Casting shouldn't impact the partitioning benefits, but check the performance before relying on it, and please report back if you hit problems.
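Since yyyyMMdd strings sort in the same order as the dates they encode, a plain string comparison should also work and avoids the cast entirely (same hypothetical table and column names as above):

select *
from mytable
where partition_col >= date_format(current_date() - interval '30' days, 'yyyyMMdd')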
I am trying to convert a string-typed column holding timestamps in "yyyy-MM-dd HH:mm:ss.SSSSSSSSS" format to TimestampType. The cast should preserve the nanosecond values.
I tried the unix_timestamp() and to_timestamp() methods with the timestamp format specified, but both return NULL values.
using cast:
hive> select cast('2019-01-01 12:10:10.123456789' as timestamp);
OK
2019-01-01 12:10:10.123456789
Time taken: 0.611 seconds, Fetched: 1 row(s)
using timestamp():
hive> select timestamp('2019-01-01 12:10:10.123456789','yyyy-MM-dd HH:mm:ss.SSSSSSSSS');
OK
2019-01-01 12:10:10.123456789
Time taken: 12.845 seconds, Fetched: 1 row(s)
As described in the source code of the TimestampType and DateTimeUtils classes, Spark supports timestamps up to microsecond precision only.
So you cannot store nanosecond-precision timestamps in a Spark SQL TimestampType column; anything beyond microseconds is lost.
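If you need to round-trip the nanoseconds anyway, one possible workaround is to keep a microsecond-truncated timestamp for querying alongside the raw fractional digits. A Spark SQL sketch, assuming the fixed-width yyyy-MM-dd HH:mm:ss.SSSSSSSSS format (the table t and column ts_str are hypothetical):

select
  to_timestamp(substring(ts_str, 1, 26)) as ts_micros, -- keeps only the first 6 fractional digits
  cast(substring(ts_str, 21, 9) as int) as ts_nanos    -- full 9-digit fractional part, for exact reconstruction
from t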
References:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
In Postgres, we have the EXTRACT function to pull MONTH, YEAR, etc. out of a timestamp. See below.
SELECT EXTRACT(MONTH FROM TIMESTAMP '2001-02-16 20:38:40');
Is it possible to do the same in Cassandra? Is there a function for this?
If so, I could then run queries such as finding all entries in year 2015 and month May, which is possible in Postgres using EXTRACT.
You may have found the answer already, but you can create a function in your keyspace like this:
cqlsh> use keyspace;
cqlsh:keyspace> CREATE OR REPLACE FUNCTION YEAR (input TIMESTAMP)
RETURNS NULL ON NULL INPUT RETURNS TEXT
LANGUAGE java AS 'return input.toInstant().toString().substring(0,4);';
cqlsh:keyspace> SELECT YEAR('2001-02-16 20:38:40') as year FROM ...;
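Note that user-defined functions are disabled by default; you may need enable_user_defined_functions: true in cassandra.yaml before CREATE FUNCTION will succeed (the exact setting name can vary between Cassandra versions).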
In Cassandra you would handle that a little differently. You can have fields in your table of type timestamp or timeuuid, and then use that field in a time range query.
For example, if you wanted all entries for May 2015 and had a timestamp field called 'date', you could do a query like this:
SELECT * from mytable where date >= '2015-05-01' and date < '2015-06-01' allow filtering;
You can use a number of different date formats to specify the time more precisely (down to fractions of a second).
Cassandra converts date strings using the org.apache.commons.lang3.time.DateUtils class, and allows the following date formats:
private static final String[] dateStringPatterns = new String[] {
"yyyy-MM-dd HH:mm",
"yyyy-MM-dd HH:mm:ss",
"yyyy-MM-dd HH:mmX",
"yyyy-MM-dd HH:mmXX",
"yyyy-MM-dd HH:mmXXX",
"yyyy-MM-dd HH:mm:ssX",
"yyyy-MM-dd HH:mm:ssXX",
"yyyy-MM-dd HH:mm:ssXXX",
"yyyy-MM-dd HH:mm:ss.SSS",
"yyyy-MM-dd HH:mm:ss.SSSX",
"yyyy-MM-dd HH:mm:ss.SSSXX",
"yyyy-MM-dd HH:mm:ss.SSSXXX",
"yyyy-MM-dd'T'HH:mm",
"yyyy-MM-dd'T'HH:mmX",
"yyyy-MM-dd'T'HH:mmXX",
"yyyy-MM-dd'T'HH:mmXXX",
"yyyy-MM-dd'T'HH:mm:ss",
"yyyy-MM-dd'T'HH:mm:ssX",
"yyyy-MM-dd'T'HH:mm:ssXX",
"yyyy-MM-dd'T'HH:mm:ssXXX",
"yyyy-MM-dd'T'HH:mm:ss.SSS",
"yyyy-MM-dd'T'HH:mm:ss.SSSX",
"yyyy-MM-dd'T'HH:mm:ss.SSSXX",
"yyyy-MM-dd'T'HH:mm:ss.SSSXXX",
"yyyy-MM-dd",
"yyyy-MM-ddX",
"yyyy-MM-ddXX",
"yyyy-MM-ddXXX"
};
But note that Cassandra is not as good at ad hoc queries as a relational database like Postgres. So typically you would set up your table schema to group the time ranges you wanted to query into separate partitions within a table.
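For example, a sketch of that partitioning idea (the table and column names are hypothetical):

CREATE TABLE events_by_month (
    year int,
    month int,
    date timestamp,
    id timeuuid,
    payload text,
    PRIMARY KEY ((year, month), date, id)
);

-- All entries for May 2015 land in a single partition, so no ALLOW FILTERING is needed:
SELECT * FROM events_by_month WHERE year = 2015 AND month = 5;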