I'm trying to write a Cassandra CQL query that groups by date. I have a column of the "date" data type whose values look like "mm-dd-yyyy". I just want to extract the year and then group by it. How can I achieve that?
SELECT sum(amount) FROM data WHERE date = 'yyyy'
You cannot do a partial filter with just the year on a column of type date. It is an invalid query in Cassandra.
The CQL date type is encoded as a 32-bit integer that represents the days since epoch (Jan 1, 1970).
If you need to filter based on year, then you will need to add a column to your table, like in this example:
CREATE TABLE movies (
movie_title text,
release_year int,
...
PRIMARY KEY ((movie_title, release_year))
)
Here's an example for retrieving information about a movie:
SELECT ... FROM movies WHERE movie_title = ? AND release_year = ?
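The year itself can be computed in application code, either to populate a year column at write time or to aggregate after the rows have been read. A minimal sketch, assuming the rows are already available as (date string, amount) pairs in the mm-dd-yyyy format from the question; the function name sum_by_year and the sample rows are made up for illustration:

```python
from collections import defaultdict
from datetime import datetime


def sum_by_year(rows):
    """Sum amounts per calendar year for (date_string, amount) pairs."""
    totals = defaultdict(int)
    for date_string, amount in rows:
        # Parse the mm-dd-yyyy text and keep only the year component.
        year = datetime.strptime(date_string, "%m-%d-%Y").year
        totals[year] += amount
    return dict(totals)


# Hypothetical sample rows, not taken from the original table.
rows = [("01-15-2020", 10), ("07-04-2020", 5), ("03-09-2021", 7)]
totals = sum_by_year(rows)
print(totals)
```

The same `.year` extraction could be used to fill an integer year column when inserting, which is what makes the equality filter above possible.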
I'm quite new to SQL and I'm trying to filter the latest date record (DateTime column) for each unique ID present in the table.
Sample data: there are 2 unique IDs (16512) and (76513).
DateTime                    | ID    | Notes
----------------------------+-------+--------------------------
2021-03-26T10:39:54.9770238 | 16512 | Still a work in Progress
2021-04-29T12:46:12.8277807 | 16512 | Still working on it
2021-03-21T10:39:54.9770238 | 76513 | Still a work in Progress
2021-04-20T12:46:12.8277800 | 76513 | Still working on project
Desired result (get last row of each ID based on the DateTime column):
DateTime                    | ID    | Notes
----------------------------+-------+--------------------------
2021-04-29T12:46:12.8277807 | 16512 | Still working on it
2021-04-20T12:46:12.8277800 | 76513 | Still working on project
My query:
SELECT MAX(DateTime), ID
FROM Table1
GROUP BY DateTime, ID
Thanks in advance for your help.
SELECT max(DateTime), ID
FROM Table1
GROUP BY ID
You can use row_number here
with d as (
select *, row_number() over (partition by ID order by DateTime desc) as rn
from Table1
)
select DateTime, ID, Notes
from d
where rn = 1;
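To try the row_number approach without a database server, the sketch below runs it against the sample data from the question in an in-memory SQLite database. SQLite is only a stand-in here (the question does not use it), and window functions need SQLite 3.25 or newer:

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (DateTime TEXT, ID INTEGER, Notes TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [
        ("2021-03-26T10:39:54.9770238", 16512, "Still a work in Progress"),
        ("2021-04-29T12:46:12.8277807", 16512, "Still working on it"),
        ("2021-03-21T10:39:54.9770238", 76513, "Still a work in Progress"),
        ("2021-04-20T12:46:12.8277800", 76513, "Still working on project"),
    ],
)

# Number the rows of each ID from newest to oldest and keep the newest one.
query = """
WITH d AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY DateTime DESC) AS rn
    FROM Table1
)
SELECT DateTime, ID, Notes FROM d WHERE rn = 1 ORDER BY ID
"""
latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
```

The timestamps are stored as text, which works here only because every value uses the same ISO 8601 layout and therefore sorts correctly as a string.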
You didn't state a particular database, but if you are using Postgres you can use its DISTINCT ON clause, which is often the fastest solution when the groups are not too big (in your case, a group is the set of rows that share the same id).
Here's an example. Please note I've excluded your notes column for brevity but it will work if you include it and will give you the output you desire above.
create temporary table tasks (
id int,
created_at date
);
insert into tasks(id, created_at) values
(16512, '2021-03-26'),
(16512, '2021-04-29'),
(76513, '2021-03-21'),
(76513, '2021-04-20')
;
select
distinct on (id)
id,
created_at
from tasks
order by id, created_at desc
/*
id | created_at
-------+------------
16512 | 2021-04-29
76513 | 2021-04-20
*/
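For readers unfamiliar with DISTINCT ON, its effect can be imitated in a few lines of plain Python: sort the rows the way the ORDER BY does, then keep the first row of each id. This is only an illustration of the semantics on the sample rows above, not how Postgres implements it:

```python
from itertools import groupby
from operator import itemgetter

# The same (id, created_at) rows that were inserted into the tasks table.
tasks = [
    (16512, "2021-03-26"),
    (16512, "2021-04-29"),
    (76513, "2021-03-21"),
    (76513, "2021-04-20"),
]

# ORDER BY id, created_at DESC: sort by date descending first, then apply a
# stable sort by id so the date order is preserved inside each id.
ordered = sorted(tasks, key=itemgetter(1), reverse=True)
ordered = sorted(ordered, key=itemgetter(0))

# DISTINCT ON (id): keep only the first row of each run of equal ids.
latest = [next(group) for _, group in groupby(ordered, key=itemgetter(0))]
print(latest)
```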
The row_number approach already mentioned is one way to solve your problem. You tagged databricks in your question, so let me show you another option that you can implement with Spark SQL, using the last function from the aggregate functions.
In reference to the Spark documentation:
last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
Note that:
The function is non-deterministic because its results depends on the order of the rows which may be non-deterministic after a shuffle.
In your example:
%sql
WITH cte AS (
SELECT *
FROM my_table
ORDER BY DateTime asc
)
SELECT Id, last(DateTime) AS DateTime, last(Notes) as Notes
FROM cte
GROUP BY Id
Similarly, you can use first function to obtain the first record in a sorted dataset.
Check if that works for you.
I have a table with the following (with other fields removed)
CREATE TABLE IF NOT EXISTS request_audit (
user_id text,
request_body text,
lookup_timestamp timestamp,
PRIMARY KEY ((user_id), lookup_timestamp)
) WITH CLUSTERING ORDER BY (lookup_timestamp DESC);
I create a record with the following
INSERT INTO request_audit (user_id, request_body, lookup_timestamp) VALUES (?, ?, toTimestamp(now()))
I am trying to retrieve all rows from the last 24 hours, but I am having trouble with the timestamp.
I have tried
SELECT * from request_audit WHERE user_id = '1234' AND lookup_timestamp > toTimestamp(now() - "1 day" )
and various other ways of trying to take a day away from the query.
Cassandra has very limited support for date operations. What you need is a custom function to do the date arithmetic.
This is inspired by the answer to "How to get Last 6 Month data comparing with timestamp column using cassandra query?".
You can write a UDF (user-defined function) to do the date operation.
CREATE FUNCTION dateAdd(date timestamp, day int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS
$$java.util.Calendar c = java.util.Calendar.getInstance();
c.setTime(date);
c.add(java.util.Calendar.DAY_OF_MONTH, day);
return c.getTime();$$ ;
Remember that you have to enable UDFs in the configuration file, cassandra.yaml. Hopefully that is possible in your setup.
enable_user_defined_functions: true
Once that is done, this query works:
SELECT * from request_audit WHERE user_id = '1234' AND lookup_timestamp > dateAdd(dateof(now()), -1)
You can't do this directly in CQL, as it doesn't support this kind of expression. If you're running the query from cqlsh, you can substitute the desired date with the output of something like this:
date --date='-1 day' '+%F %T%z'
and then execute the query.
If you're invoking this from your program, just use the corresponding date/time library to compute the date one day back; the details depend on the language that you're using.
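As one possible illustration in Python, the cutoff can be computed with the standard datetime module and placed into the query. The helper name cutoff_literal is hypothetical, and in a real application you would more likely pass the datetime as a bound parameter through your driver instead of formatting it into the query string:

```python
from datetime import datetime, timedelta, timezone


def cutoff_literal(now, days=1):
    """Return the instant `days` before `now` as a CQL timestamp literal."""
    cutoff = now - timedelta(days=days)
    # Produces e.g. 2021-04-28 12:00:00+0000 for a UTC-aware datetime.
    return cutoff.strftime("%Y-%m-%d %H:%M:%S%z")


now = datetime.now(timezone.utc)
query = (
    "SELECT * FROM request_audit "
    f"WHERE user_id = '1234' AND lookup_timestamp > '{cutoff_literal(now)}'"
)
print(query)
```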
I have a table foo in Cassandra with 4 columns: foo_id bigint, date datetime, ref_id bigint, type int.
The partition key is foo_id. The clustering keys are date desc, ref_id and type.
I want to write a CQL query that is the equivalent of the SQL below:
select min(foo_id) from foo where date >= '2016-04-01 00:00:00+0000'
I wrote the following CQL:
select foo_id from foo where
foo_id IN (-9223372036854775808, 9223372036854775807)
and date >= '2016-04-01 00:00:00+0000';
but this returns empty results.
Then I tried
select foo_id from foo where
token(foo_id) > -9223372036854775808
and token(foo_id) < 9223372036854775807
and date >= '2016-04-01 00:00:00+0000';
but this results in error
Unable to execute CSQL Script on 'Cassandra'. Cannot execute this query
as it might involve data filtering and thus may have unpredictable
performance. If you want to execute this query despite performance
unpredictability, use ALLOW FILTERING.
I don't want to use ALLOW FILTERING, but I want the minimum foo_id at the start of the specified date.
You should probably denormalize your data and create a new table for the purpose. I propose something like:
CREATE TABLE foo_reverse (
year int,
month int,
day int,
foo_id bigint,
date timestamp,
ref_id bigint,
type int,
PRIMARY KEY ((year, month, day), foo_id)
)
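To get the minimum foo_id, query that table with something like the following. Rows within a partition are stored in ascending order of the clustering column foo_id, so the first row returned holds the minimum: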
SELECT * FROM foo_reverse WHERE year = 2016 AND month = 4 AND day = 1 LIMIT 1;
That table allows you to query on a per-day basis. You can change the partition key to better reflect your needs. Beware of the potential hot spots you (and I) could create by selecting an inappropriate time range.
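Because each day is its own partition in this layout, a condition such as "date >= some day" turns into one query per day on the client side. A minimal sketch of enumerating those partition keys, assuming the (year, month, day) bucketing proposed above; the function name day_buckets and the sample dates are made up for illustration:

```python
from datetime import date, timedelta


def day_buckets(start, end):
    """Yield (year, month, day) partition keys for each day from start to end, inclusive."""
    current = start
    while current <= end:
        yield (current.year, current.month, current.day)
        current += timedelta(days=1)


# Each tuple would be bound to the year, month and day parameters of one query.
buckets = list(day_buckets(date(2016, 3, 30), date(2016, 4, 1)))
print(buckets)
```

The minimum over the whole range is then the smallest of the per-day results.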
I have a very simple table to store a collection of IDs by a date range
CREATE TABLE schedule_range (
start_date timestamp,
end_date timestamp,
schedules set<text>,
PRIMARY KEY ((start_date, end_date)));
I was hoping to be able to query it by a date range
SELECT *
FROM schedule_range
WHERE start_date >= 'xxx'
AND end_date < 'yyy'
Unfortunately it doesn't work this way. I've tried a few different approaches and each one fails for a different reason.
How should I store IDs to be able to get them all by a date range?
In Cassandra you can only use the > and < operators on a clustering column, and only after every preceding primary key column has been restricted with an equality operator. In your table both start_date and end_date belong to the partition key, so neither can take a range condition as the schema stands. If you are not tied to that schema, you have other options.
One approach is to use Apache Spark. There are some projects that build an abstraction layer in Spark over Cassandra and let you run operations on Cassandra data such as joins, arbitrary filters, group by, and so on.
Check these projects:
Stratio Deep
Datastax Connector
Using the table below with a query that somewhat resembles yours works for two reasons. 1) It does not put a range condition on the partition key start_date; only EQ and IN relations are supported on the partition key. 2) Greater-than and less-than comparisons on a clustering column are restricted to filters that select a contiguous range of rows, and restricting id, the first clustering column (the second component of the compound key), with an equality achieves that.
create table schedule_range2(start_date timestamp, end_date timestamp, id int, schedules set<text>, primary key (start_date, id, end_date));
insert into schedule_range2 (start_date, id, end_date, schedules) VALUES ('2014-02-03 04:05', 1, '2014-02-04 04:00', {'event1', 'event2'});
insert into schedule_range2 (start_date, id, end_date, schedules) VALUES ('2014-02-05 04:05', 1, '2014-02-06 04:00', {'event3', 'event4'});
select * from schedule_range2 where id=1 and end_date >='2014-02-04 04:00' and end_date < '2014-02-06 04:00' ALLOW FILTERING;
The structure of my column family is something like
CREATE TABLE product (
id UUID PRIMARY KEY,
product_name text,
product_code text,
status text,//in stock, out of stock
mfg_date timestamp,
exp_date timestamp
);
Secondary indexes are created on the status, mfg_date, product_code and exp_date fields.
I want to select the list of products whose status is IS (In Stock) and the manufactured date is between timestamp xxxx to xxxx.
So I tried the following query.
SELECT * FROM product where status='IS' and mfg_date>= xxxxxxxxx and mfg_date<= xxxxxxxxxx LIMIT 50 ALLOW FILTERING;
It throws an error like: No indexed columns present in by-columns clause with "equals" operator.
Is there anything I need to change in the structure? Please help me out. Thanks in Advance.
Cassandra is not supporting >= so you have to change the value and use only > (greater than) and < (less than) to execute the query.
You should have at least one "equals" operator on one of the indexed or primary key column fields in your where clause, i.e. "mfg_date = xxxxx"