How to use a date range in Presto

I have to select DS (a date timestamp) over a range of years.
Example:
SELECT
  id_product,
  code,
  substr(ds_date, 1, 10) as date,
  product_type
from Table A
where 1=1
  AND ds <= '2019-12-31'
  AND (ds_date BETWEEN '2017-01-01' AND '2019-12-31')
group by 1,2,3,4
Is this the right way to declare a DS range in Presto?

If ds_date is a varchar of the form yyyy-MM-dd..., then this is readable:
substr(ds_date, 1, 4) BETWEEN '2017' AND '2019' -- inclusive: 2017, 2018, 2019
However, this prevents predicate pushdown on ds_date into the data source, so this may be more performant:
ds_date >= '2017' AND ds_date < '2020'
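For illustration, here is a hedged sketch of the question's query with that second form applied (keeping only the ds_date predicate; TableA stands in for the question's table name):
SELECT
  id_product,
  code,
  substr(ds_date, 1, 10) AS date,
  product_type
FROM TableA
WHERE ds_date >= '2017'
  AND ds_date < '2020'
GROUP BY 1, 2, 3, 4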

Related

How do I group by date in Cassandra?

I'm trying to find a query in Cassandra CQL to group by date. I have a "date" datatype where the date looks like "mm-dd-yyyy". I'm just trying to extract the year and then group by it. How can I achieve that?
SELECT sum(amount) FROM data WHERE date = 'yyyy'
You cannot do a partial filter with just the year on a column of type date. It is an invalid query in Cassandra.
The CQL date type is encoded as a 32-bit integer that represents the days since epoch (Jan 1, 1970).
If you need to filter based on year, then you will need to add a column to your table, as in this example:
CREATE TABLE movies (
    movie_title text,
    release_year int,
    ...
    PRIMARY KEY ((movie_title, release_year))
)
Here's an example for retrieving information about a movie:
SELECT ... FROM movies WHERE movie_title = ? AND release_year = ?
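Applying the same idea to the original sum-by-year case, one hedged sketch (table and column names here are illustrative, not from the original schema) is to store the year in its own column and make it the partition key, so the aggregate runs against a single partition:
CREATE TABLE data_by_year (
    year int,
    event_date date,
    event_id uuid,
    amount decimal,
    -- all rows for one year sit in one partition, keyed by year
    PRIMARY KEY ((year), event_date, event_id)
);

-- single-partition aggregate for one year
SELECT sum(amount) FROM data_by_year WHERE year = 2019;
Be aware that a whole year in one partition can grow large; a (year, month) partition key may be a better fit depending on data volume.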

HIVE where date filter by x days back? string format

So our DBAs set up our Hive table with the date column as the partition column, but as a "string" in YYYYMMDD format.
How can I WHERE-filter this "date" column for something like the last 30 days?
Use date_format to format the system date minus 30 days into yyyyMMdd, and then compare it with your partition column. Note: use the partition column as-is so Hive can choose the correct partitions.
If you want the data from exactly 30 days ago:
select *
from mytable
where partition_col = date_format( current_date() - interval '30' days, 'yyyyMMdd')
If you want all data from the last 30 days:
select *
from mytable
where cast(partition_col as INT) >= cast(date_format( current_date() - interval '30' days, 'yyyyMMdd') as INT)
Casting shouldn't impact the partition benefits, but you should check the performance before using it. Please get back to me in that scenario.
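As a hedged alternative (worth verifying on your cluster): if every partition value really is a fixed-width 8-digit yyyyMMdd string, lexicographic comparison matches numeric order, so the casts can be dropped and the partition column stays untouched:
select *
from mytable
-- string comparison is safe for fixed-width yyyyMMdd values
where partition_col >= date_format( current_date() - interval '30' days, 'yyyyMMdd')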

How to use SparkSQL to select rows in Spark DF based on multiple conditions

I am relatively new to PySpark and I have a Spark dataframe with a date column "Issue_Date". The "Issue_Date" column contains several dates from 1970 to 2060 (due to errors). From the Spark dataframe I have created a temp table and have been able to filter the data from year 2018. I would also like to include the data from year 2019 (i.e., multiple conditions). Is there a way to do so? I've tried many combinations but couldn't get it to work. Any help is appreciated, thank you.
# Filter data from 2018
sparkdf3.createOrReplaceTempView("table_view")
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) = 2018")
sparkdf4.count()
Did you try using year(Issue_Date) >= 2018?:
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) >= 2018")
If your column has errors and you want to specify a range, you can use year IN (2018, 2019):
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) in (2018, 2019)")
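Another option, assuming Spark 2.2+ where to_date accepts a format string (a sketch, not tested against your data), is to express the range with BETWEEN:
sparkdf4 = spark.sql("select * from table_view where year(to_date(Issue_Date, 'MM/dd/yyyy')) between 2018 and 2019")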

To_char function in databricks

I am using SQL as the language for my notebook in Databricks.
I want to get the day of the week from a given date.
To do this I used to_char(date, 'fmday'), but I get an error saying the function is not registered as temporary or permanent in Databricks. Is there another way to get the name of the day?
The date is in yyyyMMdd format.
You are getting that error because to_char is not a SparkSQL function. You can see the list of functions in the ScalaDocs here: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html
If your date is a DateType, you can do dayofweek(date) in SparkSQL.
get the name of the day
Since you want to get the name of the day, you can use the date_format function with the argument 'EEEE' to get the day name, e.g. Monday. If you want to pass in an integer (e.g. a number between 1 and 7) then you could just code a CASE statement, something like:
%sql
SELECT
  dayofweek( CAST( '2018-12-31' AS DATE ) ) AS d,
  date_format( CAST( '2018-12-31' AS DATE ), 'EEEE' ) AS dayname,
  CASE dayofweek( CAST( '2018-12-31' AS DATE ) )
    WHEN 1 THEN 'Monday'
    WHEN 2 THEN 'Tuesday'
    WHEN 3 THEN 'Wednesday'
    WHEN 4 THEN 'Thursday'
    WHEN 5 THEN 'Friday'
    WHEN 6 THEN 'Saturday'
    WHEN 7 THEN 'Sunday'
    ELSE 'Unknown'
  END AS caseTest
NB: I have coded the CASE to start the week from day 1 = Monday, which is different from the dayofweek default (where day 1 is Sunday); wanting a different week start is one reason you might do this.
I found a way to get the name of the day of the week, as below:
date_format(to_date('20170821','yyyyMMdd'),'EEEE')
Now I want to pass a column of integer datatype, but when I pass it to the query I get null as output. Could someone please help?
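One hedged guess at the null output: to_date parses strings, so an integer yyyyMMdd column may need an explicit cast to string first (int_date_col below is a hypothetical column name):
date_format(to_date(cast(int_date_col as string), 'yyyyMMdd'), 'EEEE')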

Query min partition key based on date range (clustering key)

I have a table Foo in Cassandra with 4 columns: foo_id bigint, date timestamp, ref_id bigint, type int.
The partition key is foo_id; the clustering keys are date (desc), ref_id, and type.
I want to write a CQL query which is the equivalent of the SQL below:
select min(foo_id) from foo where date >= '2016-04-01 00:00:00+0000'
I wrote the following CQL:
select foo_id from foo where
foo_id IN (-9223372036854775808, 9223372036854775807)
and date >= '2016-04-01 00:00:00+0000';
but this returns empty results.
Then I tried
select foo_id from foo where
token(foo_id) > -9223372036854775808
and token(foo_id) < 9223372036854775807
and date >= '2016-04-01 00:00:00+0000';
but this results in an error:
Unable to execute CSQL Script on 'Cassandra'. Cannot execute this query
as it might involve data filtering and thus may have unpredictable
performance. If you want to execute this query despite performance
unpredictability, use ALLOW FILTERING.
I don't want to use ALLOW FILTERING, but I do want the minimum foo_id at the start of the specified date.
You should probably denormalize your data and create a new table for the purpose. I propose something like:
CREATE TABLE foo_reverse (
    year int,
    month int,
    day int,
    foo_id bigint,
    date timestamp,
    ref_id bigint,
    type int,
    PRIMARY KEY ((year, month, day), foo_id)
)
To get the minimum foo_id you would query that table by something like:
SELECT * FROM foo_reverse WHERE year = 2016 AND month = 4 AND day = 1 LIMIT 1;
That table allows you to query on a "per day" basis. You can change the partition key to better reflect your needs. Beware of the potential hot spots you (and I) could create by choosing an inappropriate time range for the partitions.
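For a range spanning several days, a hedged approach with this layout is to run one single-partition query per day and keep the smallest result on the client side; since foo_id is the ascending clustering key, LIMIT 1 returns each day's minimum. For example, for the first two days of April 2016:
SELECT foo_id FROM foo_reverse WHERE year = 2016 AND month = 4 AND day = 1 LIMIT 1;
SELECT foo_id FROM foo_reverse WHERE year = 2016 AND month = 4 AND day = 2 LIMIT 1;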
