I have a Cassandra table that looks something like this:
 type                 | timevalue
----------------------+--------------------------
 VmwareVirtualMachine | 2020-04-24 02:04:57+0000
 VmwareVirtualMachine | 2020-05-14 06:02:23+0000
 VmwareVirtualMachine | 2020-05-14 06:02:23+0000
 VmwareVirtualMachine | 2020-06-26 13:19:03+0000
 VmwareVirtualMachine | 2020-06-29 06:14:00+0000
My requirement is to print all rows whose timevalue is less than or equal to 2020-05-14 06:02:23+0000.
My query is:
cqlsh -ksd -e "select type, timevalue from table_one where timevalue != null and timevalue<='2020-05-14 06:02:23+0000'"
It prints
VmwareVirtualMachine | 2020-04-24 02:04:57+0000
Since I used <=, I expected it to print:
VmwareVirtualMachine | 2020-04-24 02:04:57+0000
VmwareVirtualMachine | 2020-05-14 06:02:23+0000
VmwareVirtualMachine | 2020-05-14 06:02:23+0000
If I do something like this:
cqlsh -ksd -e "select type, timevalue from table_one where timevalue != null and timevalue<='2020-05-14 06:02:24+0000'"
If I increase the time by a second, it prints the first 3 rows. Not sure why less than or equal to does not work in my original statement.
Any help?
Related
I have a dataframe with two columns, Order date and Customer (each Customer value appears at most twice, and the data is sorted). I want to subtract the Order date of the second occurrence of a Customer from the Order date of the first occurrence. Order date is in datetime format.
For context, I'm trying to calculate the time it takes for a customer to make a second order.
Here is a sample of the table:
Order date Customer
4260 2022-11-11 16:29:00 (App admin)
8096 2022-10-22 12:54:00 (App admin)
996 2021-09-22 20:30:00 10013
946 2021-09-14 15:16:00 10013
3499 2022-04-20 12:17:00 100151
... ... ...
2856 2022-03-21 13:49:00 99491
2788 2022-03-18 12:15:00 99523
2558 2022-03-08 12:07:00 99523
2580 2022-03-04 16:03:00 99762
2544 2022-03-02 15:40:00 99762
I have tried deleting by index but it returns just the first two values.
The expected output should be another dataframe with just the Customer name and the difference between the second and first Order dates of the duplicate customer, in minutes.
Expected output:
| Customer    | difference in minutes |
| ----------- | --------------------- |
| 1232        | 445.0                 |
| (App Admin) | 3432.0                |
| 1145        | 2455.0                |
| 6653        | 32.0                  |
You can use groupby:
import pandas as pd

df['Order date'] = pd.to_datetime(df['Order date'])
# Per customer: first row minus last row, converted to minutes;
# drop customers with a single order (their difference is 0)
out = (df.groupby('Customer', as_index=False)['Order date']
         .agg(lambda x: (x.iloc[0] - x.iloc[-1]).total_seconds() / 60)
         .query('`Order date` != 0'))
print(out)
# Output:
Customer Order date
0 (App admin) 29015.0
1 10013 11834.0
4 99523 14408.0
5 99762 2903.0
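If you want the columns named as in your expected output, you can rename the aggregated column afterwards (a small extra step on top of the answer above):
out = out.rename(columns={'Order date': 'difference in minutes'})
print(out)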
I have a table defined as:
CREATE TABLE downtime(
asset_code text,
down_start timestamp,
down_end timestamp,
down_duration duration,
down_type text,
down_reason text,
PRIMARY KEY ((asset_code, down_start), down_end)
);
I'd like to get downtime on a particular day, such as:
SELECT * FROM downtime \
WHERE asset_code = 'CA-PU-03-LB' \
AND todate(down_start) = '2022-12-11';
I got a syntax error:
SyntaxException: line 1:66 no viable alternative at input '(' (...where asset_code = 'CA-PU-03-LB' and [todate](...)
If a function is not allowed on a partition key in the WHERE clause, how can I get the data whose down_start falls on a particular day?
You don't need the TODATE() function to filter for a specific date. You can simply specify the date as '2022-12-11' when applying a filter on a CQL timestamp column.
The catch is that you cannot use the equality operator (=), because the CQL timestamp data type is encoded as the number of milliseconds since the Unix epoch (Jan 1, 1970 00:00 GMT), so you need to be precise when you're working with timestamps.
Let me illustrate using this example table:
CREATE TABLE tstamps (
id int,
tstamp timestamp,
colour text,
PRIMARY KEY (id, tstamp)
)
My table contains the following sample data:
cqlsh> SELECT * FROM tstamps ;
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-05 11:25:01.000000+0000 | red
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
1 | 2022-12-07 01:48:07.870000+0000 | blue
1 | 2022-12-07 03:13:27.313000+0000 | indigo
The cqlsh client formats the tstamp column into a human-readable date in UTC. But really, the tstamp values are stored as integers:
cqlsh> SELECT tstamp, TOUNIXTIMESTAMP(tstamp) FROM tstamps ;
tstamp | system.tounixtimestamp(tstamp)
---------------------------------+--------------------------------
2022-12-05 11:25:01.000000+0000 | 1670239501000
2022-12-06 02:45:04.564000+0000 | 1670294704564
2022-12-06 11:06:48.119000+0000 | 1670324808119
2022-12-06 19:02:52.192000+0000 | 1670353372192
2022-12-07 01:48:07.870000+0000 | 1670377687870
2022-12-07 03:13:27.313000+0000 | 1670382807313
To retrieve the rows for a specific date, you need to specify the range of timestamps which fall on that date. For example, the timestamps for 6 Dec 2022 UTC range from 1670284800000 (2022-12-06 00:00:00.000 UTC) to 1670371199999 (2022-12-06 23:59:59.999 UTC).
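If you want to sanity-check those boundary values outside of Cassandra, here is a minimal Python sketch (my addition, not part of the original answer; the helper name is just for illustration) that reproduces the millisecond range covering a UTC day:
from datetime import datetime, timedelta, timezone

# Hypothetical helper (my naming): first and last millisecond of a UTC day,
# matching what TOUNIXTIMESTAMP() returns for CQL timestamps on that day.
def day_bounds_ms(year, month, day):
    start = datetime(year, month, day, tzinfo=timezone.utc)
    next_day = start + timedelta(days=1)
    return int(start.timestamp()) * 1000, int(next_day.timestamp()) * 1000 - 1

print(day_bounds_ms(2022, 12, 6))  # -> (1670284800000, 1670371199999)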
This means if we want to query for December 6, we need to filter using a range query:
SELECT * FROM tstamps \
WHERE id = 1 \
AND tstamp >= '2022-12-06' \
AND tstamp < '2022-12-07';
and we get:
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
WARNING - In your case where the timestamp column is part of the partition key, performing a range query is dangerous because it results in a multi-partition query -- there are 86M possible values between 1670284800000 and 1670371199999. For this reason, timestamps are not a good choice for partition keys. Cheers!
How to rewrite the following query:
WHERE (
parsedTime BETWEEN
TIMESTAMP '2019-10-29 00:00:00 America/New_York' AND
TIMESTAMP '2019-11-11 23:59:59 America/New_York'
)
but making the interval dynamic: from 14 days ago to current_date?
Presto provides quite handy interval functionality among its date and time functions and operators.
-- Creating sample dataset
WITH dataset AS (
SELECT
'engineering' as department,
ARRAY[
TIMESTAMP '2019-11-05 00:00:00',
TIMESTAMP '2018-10-29 00:00:00'
] as parsedTime_array
)
SELECT department, parsedTime FROM dataset
CROSS JOIN UNNEST(parsedTime_array) as t(parsedTime)
-- Filtering records for the past 14 days from current_date
WHERE (
parsedTime > current_date - interval '14' day
)
Result
  | department  | parsedTime
--+-------------+--------------------------
1 | engineering | 2019-11-05 00:00:00.000
Update 2019-11-11
Note: current_date returns the current date as of the start of the query. I think Athena always uses UTC time, but I'm not 100% sure. So to extract the current date in a particular time zone, I'd suggest using timestamps with a time zone conversion. Although it is true that
current_timestamp = current_timestamp at TIME ZONE 'America/New_York'
since AT TIME ZONE represents the same instant in time and differs only in the time zone used to print it, the following is not always true, due to the 5-hour offset:
DATE(current_timestamp) = DATE(current_timestamp at TIME ZONE 'America/New_York')
This can be easily verified with:
WITH dataset AS (
SELECT
ARRAY[
TIMESTAMP '2019-10-29 23:59:59 UTC',
TIMESTAMP '2019-10-30 00:00:00 UTC',
TIMESTAMP '2019-10-30 04:59:59 UTC',
TIMESTAMP '2019-10-30 05:00:00 UTC'
] as parsedTime_array
)
SELECT
parsedTime AS "Time UTC",
DATE(parsedTime) AS "Date UTC",
DATE(parsedTime at TIME ZONE 'America/New_York') AS "Date NY",
to_unixtime(DATE(parsedTime)) AS "Unix UTC",
to_unixtime(DATE(parsedTime at TIME ZONE 'America/New_York')) AS "Unix NY"
FROM
dataset,
UNNEST(parsedTime_array) as t(parsedTime)
Result: here we can see that 2 of the NY dates fall on 2019-10-29 and 2 on 2019-10-30, whereas for UTC it is 1 and 3 respectively.
Time UTC | Date UTC | Date NY | Unix UTC | Unix NY
-----------------------------|------------|------------|------------|------------
2019-10-29 23:59:59.000 UTC | 2019-10-29 | 2019-10-29 | 1572307200 | 1572307200
2019-10-30 00:00:00.000 UTC | 2019-10-30 | 2019-10-29 | 1572393600 | 1572307200
2019-10-30 04:59:59.000 UTC | 2019-10-30 | 2019-10-30 | 1572393600 | 1572393600
2019-10-30 05:00:00.000 UTC | 2019-10-30 | 2019-10-30 | 1572393600 | 1572393600
Now, let's fast forward a month. There was a change to winter time in NY on 3 November 2019. However, timestamps in UTC are not affected by it. Therefore:
WITH dataset AS (
SELECT
ARRAY[
TIMESTAMP '2019-11-29 23:59:59 UTC',
TIMESTAMP '2019-11-30 00:00:00 UTC',
TIMESTAMP '2019-11-30 04:59:59 UTC',
TIMESTAMP '2019-11-30 05:00:00 UTC'
] as parsedTime_array
)
SELECT
parsedTime AS "Time UTC",
DATE(parsedTime) AS "Date UTC",
DATE(parsedTime at TIME ZONE 'America/New_York') AS "Date NY",
to_unixtime(DATE(parsedTime)) AS "Unix UTC",
to_unixtime(DATE(parsedTime at TIME ZONE 'America/New_York')) AS "Unix NY"
FROM
dataset,
UNNEST(parsedTime_array) as t(parsedTime)
Result: here we can see that 3 of the NY dates fall on 2019-11-29 and 1 on 2019-11-30, whereas for UTC the 1/3 split remains the same.
Time UTC | Date UTC | Date NY | Unix UTC | Unix NY
-----------------------------|------------|------------|------------|------------
2019-11-29 23:59:59.000 UTC | 2019-11-29 | 2019-11-29 | 1574985600 | 1574985600
2019-11-30 00:00:00.000 UTC | 2019-11-30 | 2019-11-29 | 1575072000 | 1574985600
2019-11-30 04:59:59.000 UTC | 2019-11-30 | 2019-11-29 | 1575072000 | 1574985600
2019-11-30 05:00:00.000 UTC | 2019-11-30 | 2019-11-30 | 1575072000 | 1575072000
Furthermore, different countries switch to winter/summer time on different dates. For instance, in 2019 London (UK) moved its clocks back 1 hour on 27 October, whereas NY (USA) moved its clocks back 1 hour on 3 November.
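The same behaviour is easy to reproduce outside Athena/Presto. Here is a minimal Python sketch (my addition, assuming Python 3.9+ for zoneinfo) showing why a timestamp at or just after midnight UTC still falls on the previous NY date, with the offset being 4 hours before the DST change and 5 hours after:
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

ny = ZoneInfo("America/New_York")

before_dst = datetime(2019, 10, 30, 0, 0, 0, tzinfo=timezone.utc)
print(before_dst.date())          # 2019-10-30 (UTC date)
print(before_dst.astimezone(ny))  # 2019-10-29 20:00:00-04:00 (EDT)

after_dst = datetime(2019, 11, 30, 4, 59, 59, tzinfo=timezone.utc)
print(after_dst.date())           # 2019-11-30 (UTC date)
print(after_dst.astimezone(ny))   # 2019-11-29 23:59:59-05:00 (EST)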
I am trying a query which groups the data by months.
test_db=# select date_trunc('month', install_ts) AS month, count(id) AS count from api_booking group by month order by month asc;
month | count
------------------------+-------
2016-08-01 00:00:00+00 | 297
2016-09-01 00:00:00+00 | 2409
2016-10-01 00:00:00+00 | 2429
2016-11-01 00:00:00+00 | 3512
(4 rows)
This is the output in my Postgres db shell.
However, when I run the same query from Excel, this is the output:
month | count
------------------------+-------
2016-07-31 17:00:00+00 | 297
2016-08-31 17:00:00+00 | 2409
2016-09-30 17:00:00+00 | 2429
2016-10-31 17:00:00+00 | 3512
(4 rows)
The problem, I think, is that Excel is interpreting the date in a different timezone.
So, how can I tell Excel to read it correctly?
Or is there any other solution to this problem?
Try:
select date(date_trunc('month', install_ts)) AS month, count(id) AS count from api_booking group by month order by month asc;
The date() function strips the time portion out of a timestamp, leaving just the date, so there is no time component left for Excel to shift into another timezone.
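To see why this works, here is a tiny Python illustration (my addition, not part of the answer; it assumes, based on your Excel output, that the client renders timestamps 7 hours behind UTC): the 2016-07-31 17:00:00 you see is the same instant shifted by the client's zone, and once the value is reduced to a date there is nothing left to shift.
from datetime import datetime, timedelta, timezone

ts = datetime(2016, 8, 1, tzinfo=timezone.utc)  # 2016-08-01 00:00:00+00
client_tz = timezone(timedelta(hours=-7))       # assumed client offset
print(ts.astimezone(client_tz))                 # 2016-07-31 17:00:00-07:00
print(ts.date())                                # 2016-08-01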
I have a schema pretty similar to this:
create table x(id int, start_date timestamp, end_date timestamp,
primary key((id), start_date, end_date))
with clustering order by (start_date desc, end_date desc);
Now I am stuck with a problem where I have to query between a start date and an end date, something like this:
select count(*) from x where id=2 and start_date > 'date' and end_date < 'date' ;
But it gives me an error similar to the following:
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY column "end_date"
cannot be restricted (preceding column "start_date" is restricted
by a non-EQ relation)"
I am new to Cassandra; any and all suggestions are welcome, even if it requires a schema change. :)
You don't say which version of Cassandra you are running, but in 2.2 and later you can do multi-column slice restrictions on clustering columns. This can get close to what you want. The syntax in CQL is a little ugly, but basically you have to specify the starting range with all the clustering columns specified, like a compound key. It's important to think about the rows being sorted first by the first column, then within that sorted by the second column.
So assume we have this data:
SELECT * from x;
id | start_date | end_date
----+--------------------------+--------------------------
2 | 2015-09-01 09:16:47+0000 | 2015-11-01 09:16:47+0000
2 | 2015-08-01 09:16:47+0000 | 2015-10-01 09:16:47+0000
2 | 2015-07-01 09:16:47+0000 | 2015-09-01 09:16:47+0000
2 | 2015-06-01 09:16:47+0000 | 2015-10-01 09:16:47+0000
Now let's select based on both dates:
SELECT * from x where id=2
and (start_date,end_date) >= ('2015-07-01 09:16:47+0000','2015-07-01 09:16:47+0000')
and (start_date,end_date) <= ('2015-09-01 09:16:47+0000','2015-09-01 09:16:47+0000');
id | start_date | end_date
----+--------------------------+--------------------------
2 | 2015-08-01 09:16:47+0000 | 2015-10-01 09:16:47+0000
2 | 2015-07-01 09:16:47+0000 | 2015-09-01 09:16:47+0000
Now you'll notice that one of those end dates appears to be later than our restriction, but tuple-wise it isn't outside the range. Since rows are sorted by start_date first, you'll get every end_date whose start_date places the row inside the compound range. To get rid of rows like that you'll probably need to do a little filtering on the client side, as in the sketch at the end of this answer.
See more information here, under "Multi-column slice restrictions".
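A minimal Python sketch of that client-side filtering (my addition; the row layout is made up to mirror the sample data above):
from datetime import datetime, timezone

# Rows as they might come back from the multi-column slice query above
rows = [
    (datetime(2015, 8, 1, 9, 16, 47, tzinfo=timezone.utc),
     datetime(2015, 10, 1, 9, 16, 47, tzinfo=timezone.utc)),
    (datetime(2015, 7, 1, 9, 16, 47, tzinfo=timezone.utc),
     datetime(2015, 9, 1, 9, 16, 47, tzinfo=timezone.utc)),
]

end_bound = datetime(2015, 9, 1, 9, 16, 47, tzinfo=timezone.utc)
# Keep only rows whose end_date really satisfies the intended restriction
filtered = [(start, end) for start, end in rows if end <= end_bound]
print(len(filtered))  # 1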