I want to create a VIEW that always returns the last 1 hour of data from the 2 most recent Athena partitions.
I am using the following Amazon Athena DDL with a partition column called 'datehour' of type varchar.
CREATE EXTERNAL TABLE mydb.table_foo (
`account_id` string,
`account_email_address` string,
`record_timestamp` timestamp)
PARTITIONED BY (
`datehour` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://data-here-9191919191/table_foo'
TBLPROPERTIES (
'projection.datehour.format'='yyyy/MM/dd/HH',
'projection.datehour.interval'='1',
'projection.datehour.interval.unit'='HOURS',
'projection.datehour.range'='2020/11/01/00,NOW',
'projection.datehour.type'='date',
'projection.enabled'='true',
'storage.location.template'='s3://data-here-9191919191/table_foo/${datehour}',
'transient_lastDdlTime'='1604013447')
Here's the query I want to run to create a VIEW that always returns the last 1 hour of data from the 2 most recent partitions.
select *
from mydb.table_foo
where
(datehour = CONCAT(
CAST( year(current_timestamp) AS varchar) , '/',
CAST( month(current_timestamp) AS varchar), '/',
CAST( day(current_timestamp) AS varchar), '/',
CAST( hour(current_timestamp) AS varchar))
OR
datehour = CONCAT(
CAST( year(current_timestamp) AS varchar) , '/',
CAST( month(current_timestamp) AS varchar), '/',
CAST( day(current_timestamp) AS varchar), '/',
CAST( ( hour(current_timestamp - interval '1' hour) ) AS varchar)))
AND
record_timestamp BETWEEN (current_timestamp - interval '1' hour) AND current_timestamp
Sample SQL I'd like to make dynamic would be something like this:
select *
from table_foo
where (datehour = '2020/11/17/23' or datehour = '2020/11/18/00') AND
record_timestamp BETWEEN (current_timestamp - interval '1' hour) AND current_timestamp
The dynamic WHERE logic poses problems whenever the hour rollover also changes the day, month, or year; I ran into this on day one, around midnight today. How can I generate my datehour partition values dynamically?
The solution is to cast current_timestamp - interval '1' hour to a varchar in the correct format. Assuming the data is partitioned as yyyy/MM/dd/HH as described in your table definition:
select *
from mydb.table_foo
WHERE datehour BETWEEN date_format(date_trunc('hour', current_timestamp - interval '1' hour), '%Y/%m/%d/%H')
                   AND date_format(date_trunc('hour', current_timestamp), '%Y/%m/%d/%H')
AND record_timestamp BETWEEN (current_timestamp - interval '1' hour) AND current_timestamp
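To make this the always-current VIEW the question asks for, you can wrap the query above in CREATE OR REPLACE VIEW (the view name below is just a placeholder):

CREATE OR REPLACE VIEW mydb.table_foo_last_hour AS
SELECT *
FROM mydb.table_foo
WHERE datehour BETWEEN date_format(date_trunc('hour', current_timestamp - interval '1' hour), '%Y/%m/%d/%H')
                   AND date_format(date_trunc('hour', current_timestamp), '%Y/%m/%d/%H')
  AND record_timestamp BETWEEN (current_timestamp - interval '1' hour) AND current_timestamp;

Because current_timestamp is evaluated each time the view is queried, the view always returns the trailing hour of data spanning the two most recent partitions.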
I am loading a couple of csv files using this Query
SELECT
*
FROM
OPENROWSET(
BULK 'https://xxxxxx.core.windows.net/jde/*.CSV',
FORMAT = 'CSV',
FIELDTERMINATOR =',',
FIRSTROW = 2 ,
PARSER_VERSION='2.0'
)
with (
Project varchar(255),
"Description 2" varchar(255),
"Unit Num" varchar(255),
"Date Issue" Date
) as rows
I get an error. My date format is 25/12/20; when I change the date column to varchar everything works, but then the date is obviously loaded as text. How do I define the date format for Synapse serverless (on-demand)?
I was not able to parse the date correctly in the WITH clause. However, using CONVERT with style 3 (dd/mm/yy) converts the character date into a DATE:
SELECT
CONVERT(DATE, DateIssue, 3) as FormatDate,
*
FROM
OPENROWSET(
BULK 'https://storage.dfs.core.windows.net/datalakehouse/bronze/sourcedata/static/csvtest/test_ddmmyy.csv',
FORMAT = 'CSV',
PARSER_VERSION='2.0',
FIRSTROW = 2
)
WITH (
DescriptionText VARCHAR(10),
UnitNum TINYINT,
DateIssue VARCHAR(10)
) AS rowsoutput
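If some source rows might contain dates that cannot be parsed, a variant worth trying (same placeholder file path and column list as above) is TRY_CONVERT, which returns NULL instead of raising an error:

SELECT
    TRY_CONVERT(DATE, DateIssue, 3) AS FormatDate, -- style 3 = dd/mm/yy; NULL when the value cannot be parsed
    *
FROM
    OPENROWSET(
        BULK 'https://storage.dfs.core.windows.net/datalakehouse/bronze/sourcedata/static/csvtest/test_ddmmyy.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 2
    )
    WITH (
        DescriptionText VARCHAR(10),
        UnitNum TINYINT,
        DateIssue VARCHAR(10)
    ) AS rowsoutput;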
I have a table of events that are recorded every minute. I want to be able to filter these events by time period and also aggregate the data by hour/day/etc.
My data model:
create table min_dev_data (
device TEXT,
event_time BIGINT,
hour BIGINT,
day BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
)
CREATE MATERIALIZED VIEW hour_dev_data AS
SELECT device, event_time, hour, value
FROM min_dev_data
WHERE hour IS NOT NULL AND value IS NOT NULL
and event_time IS NOT NULL AND device IS NOT NULL
PRIMARY KEY ((device), hour, event_time)
my query is
select hour, sum(value)
from hour_dev_data
where device = 'tst' and event_time < 149000000 group by device, hour;
fails with error
code=2200 [Invalid query] message="PRIMARY KEY column "event_time" cannot be restricted as preceding column "hour" is not restricted"
The only way to make it work is to add ALLOW FILTERING, which is unpredictable.
How can I change my data model to address my query and avoid ALLOW FILTERING mode?
You have to proactively produce these results:
create table min_dev_data (
device TEXT,
event_time BIGINT,
hour BIGINT,
day BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
create table hour_dev_data (
device TEXT,
hour BIGINT,
day BIGINT,
event_time BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
create table day_dev_data (
device TEXT,
day BIGINT,
event_time BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
Each table will satisfy ONE granularity only.
Every hour, query the minute table for the last hour of data for each device with something like:
SELECT * FROM min_dev_data WHERE device = X AND event_time < YYYY
Sum that at application level and store this value into the hour table:
INSERT INTO hour_dev_data (device, hour, day, event_time, value) VALUES (....);
And every day you query the hour table to produce the further aggregate data:
SELECT * FROM hour_dev_data WHERE device = X AND event_time < YYYY
Sum that at application level and store the value into the day table.
Please consider adding some form of bucketing because, at a one-minute interval, within two months your minute table will have wide partitions. This should not be a problem if you keep the table in reverse order (like I did) and query only the last couple of hours, but if you want to query back in time as well then you should definitely use bucketing in your tables.
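A rough sketch of such a bucketed minute table (the month_bucket column, its yyyyMM encoding, and the literal values are assumptions, not part of the original model):

create table min_dev_data (
    device TEXT,
    month_bucket INT,   -- e.g. 201705 for May 2017, computed by the application on insert
    event_time BIGINT,
    hour BIGINT,
    day BIGINT,
    value DOUBLE,
    PRIMARY KEY ((device, month_bucket), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- queries must then also restrict the bucket:
SELECT * FROM min_dev_data WHERE device = 'tst' AND month_bucket = 201705 AND event_time < 149000000;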
I think you had things pretty much right already, but you need to change your filter on event_time to be a filter on hour.
select hour, sum(value)
from hour_dev_data
where device = 'tst' and hour < 1500000000
group by device, hour;
When you were filtering on event_time, you were implicitly requiring a full scan of the partition, as event_time is clustered after hour. To filter by event_time, every cell would need to be examined to check its event_time. When you filter by hour, it is first in the clustering key, so it can be efficiently scanned and filtered. See this post on ALLOW FILTERING for more on this.
I agree with xmas79 that you probably want to be bucketing at some level, perhaps by month or year depending on your frequency of events. If you're always going to be looking for the most recent values, then setting the clustering key ordering to desc is probably going to be helpful too:
CREATE MATERIALIZED VIEW hour_dev_data3 AS
SELECT device, event_time, hour, value
FROM min_dev_data
WHERE hour IS NOT NULL AND value IS NOT NULL
and event_time IS NOT NULL AND device IS NOT NULL
PRIMARY KEY ((device), hour, event_time)
WITH CLUSTERING ORDER BY (hour DESC);
Scheduling aggregations as xmas79 suggests is going to be more efficient, since the sum is computed once rather than on every read; however, it adds maintenance burden, whereas the materialised view handles that for you.
I have a table with timestamps at a 15 minute interval. Is it possible to aggregate or group by hour, with the load field being the average?
There's a post on materialized views. You can use them to create a copy of the data batched by hour, then use the average aggregate function on load. I think CASSANDRA-11871 is what you're looking for though; its dependencies are in (GROUP BY has recently been completed), but it hasn't been worked on yet.
Kinda just guessing on your schema but something like (disclaimer not really tested):
CREATE TABLE load (
ref_equip text,
ptd_assoc text,
date timestamp,
date_hour bigint,
load float,
PRIMARY KEY ((ref_equip, ptd_assoc), date)
);
CREATE MATERIALIZED VIEW load_by_hour AS
SELECT * FROM load
WHERE ref_equip IS NOT NULL AND ptd_assoc IS NOT NULL
PRIMARY KEY ((ref_equip, ptd_assoc), date_hour, date);
where date_hour is just the timestamp at hour resolution, i.e. divide the epoch milliseconds by 1000*60*60 when doing the insert. You can then select the average:
SELECT avg(load) FROM load_by_hour WHERE ref_equip='blarg' AND ptd_assoc='blargy' AND date_hour = 410632;
Alternatively, something that may just be better to begin with is to store your data partitioned by hour:
CREATE TABLE load (
ref_equip text,
ptd_assoc text,
date timestamp,
date_hour bigint,
load float,
PRIMARY KEY ((ref_equip, ptd_assoc, date_hour), date)
);
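With the hour-level partitioning, a per-hour average is read from a single partition; a minimal usage sketch with placeholder values:

SELECT avg(load)
FROM load
WHERE ref_equip = 'blarg'
  AND ptd_assoc = 'blargy'
  AND date_hour = 410632;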
I have a table Foo in Cassandra with 4 columns: foo_id bigint, date timestamp, ref_id bigint, type int.
Here the partition key is foo_id; the clustering keys are date desc, ref_id, and type.
I want to write a CSQL query which is the equivalent of the SQL below
select min(foo_id) from foo where date >= '2016-04-01 00:00:00+0000'
I wrote the following CSQL
select foo_id from foo where
foo_id IN (-9223372036854775808, 9223372036854775807)
and date >= '2016-04-01 00:00:00+0000';
but this returns empty results.
Then I tried
select foo_id from foo where
token(foo_id) > -9223372036854775808
and token(foo_id) < 9223372036854775807
and date >= '2016-04-01 00:00:00+0000';
but this results in error
Unable to execute CSQL Script on 'Cassandra'. Cannot execute this query
as it might involve data filtering and thus may have unpredictable
performance. If you want to execute this query despite performance
unpredictability, use ALLOW FILTERING.
I don't want to use ALLOW FILTERING, but I want the minimum foo_id at the start of the specified date.
You should probably denormalize your data and create a new table for the purpose. I propose something like:
CREATE TABLE foo_reverse (
year int,
month int,
day int,
foo_id bigint,
date timestamp,
ref_id bigint,
type int,
PRIMARY KEY ((year, month, day), foo_id)
)
To get the minimum foo_id you would query that table by something like:
SELECT * FROM foo_reverse WHERE year = 2016 AND month = 4 AND day = 1 LIMIT 1;
That table would allow you to query on a "per day" basis. You can change the partition key to better reflect your needs. Beware of the potential hot spots you (and I) could create; choose the time range for the partition key appropriately.
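A minimal sketch of how the application could populate foo_reverse whenever a row is written to foo, and then read back the minimum (all literal values are placeholders):

INSERT INTO foo_reverse (year, month, day, foo_id, date, ref_id, type)
VALUES (2016, 4, 1, 42, '2016-04-01 10:15:00+0000', 7, 1);

-- smallest foo_id for 2016-04-01 (foo_id is the clustering key, ascending by default)
SELECT foo_id FROM foo_reverse WHERE year = 2016 AND month = 4 AND day = 1 LIMIT 1;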
There are 100s of data points; each data point has its own separate table, with the schema and queries as mentioned below:
Current Schema in SQLite
Table Name: Name of Data Point, e.g. Temperature
Column-1: Name: Timestamp Type: TEXT (yyyy-MM-dd HH:mm:ss.ttt format) PRIMARY KEY
Column-2: Name: Value Type: FLOAT
Column-3: Name: Quality Type: TEXT ("GOOD", "BAD")
Queries for SQLite
SELECT * FROM data-point-name;
SELECT * FROM data-point-name WHERE Timestamp BETWEEN timesamp-1 AND timestamp-2;
INSERT INTO data-point-name (Timestamp, Value, Quality) VALUES ("2016-01-01 00:00:05.254", 123.25454, "GOOD"); (this is an example)
Currently I have a SQLite db with a table per data point using the above schema; essentially I have 100s of tables. This way reads/writes on one data point do not disturb queries running on different data points.
How to translate this schema to be used in Cassandra?
In your case, you can store all your data points in a single table:
CREATE TABLE datapoints (
datatype varchar,
time timestamp,
value float,
quality varchar,
PRIMARY KEY (datatype, time)
);
With this structure, you can run queries like:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-name';
SELECT *
FROM datapoints
WHERE datatype = 'data-point-name'
AND time >= '2016-01-01 00:00:00'
AND time <= '2016-01-02 00:00:00';
But with this structure, Cassandra will partition the data by datapoint name; if a datapoint has many rows, its partition will be huge and you can run into query performance issues. You can also refine the partitioning by decomposing the time:
CREATE TABLE datapoints (
datatype varchar,
year int,
month int,
day int,
milisecondsinday int,
value float,
quality varchar,
PRIMARY KEY ((datatype, year, month, day), milisecondsinday)
) WITH CLUSTERING ORDER BY (milisecondsinday ASC);
This structure allows Cassandra to store the data in much smaller partitions than the first example, and it works well if you query your data by day:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-type'
AND year = 2016
AND month = 1
AND day = 1;
To get all points for 'data-point-type' on 2016-01-01 between 00:00 and 01:00:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-type'
AND year = 2016
AND month = 1
AND day = 1
AND milisecondsinday >= 0
AND milisecondsinday <= 3600000;
Of course, you can partition by day (as in the example) or at other time scales (hours, minutes, seconds, or milliseconds). If you can, smaller partitions are good for performance.
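For instance, a variant partitioned down to the hour (the table and column names here are only illustrative) could look like:

CREATE TABLE datapoints_by_hour (
    datatype varchar,
    year int,
    month int,
    day int,
    hour int,
    millisecondsinhour int,
    value float,
    quality varchar,
    PRIMARY KEY ((datatype, year, month, day, hour), millisecondsinhour)
) WITH CLUSTERING ORDER BY (millisecondsinhour ASC);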
Hope this can help you.