Why does creating an index in MemSQL take so long? - singlestore

Creating an index on a distributed table of ~500m narrow rows on a 20c/40t 256gb server takes many hours and for the life of me I cannot understand why.
CREATE TABLE users_userlocation (
id bigint(20) unsigned NOT NULL,
user_id mediumint(9) unsigned NOT NULL,
lat float NOT NULL,
lon float NOT NULL,
speed decimal(4,2) NOT NULL,
status_id tinyint(4) unsigned NOT NULL,
date datetime NOT NULL,
prev_date datetime DEFAULT NULL,
next_date datetime DEFAULT NULL,
point geographypoint DEFAULT null,
/*!90618 SHARD */ KEY user_id (user_id),
KEY date (date DESC,user_id),
KEY point (point),
KEY date2 (user_id,date DESC),
KEY date3 (date,user_id)
);
alter table users_userlocation add index date3 (date, user_id);
As of this post, the above has been running for 6.5 hours.

Index build is a reasonably slow operation in MemSQL. A rate of around 10 to 20 thousand rows per second per core is typical. It depends on the characteristics of the table or index you're adding (skinny rows on a table with fewer indexes will be faster). Index build should have minimal impact on your running workload (some CPU use). If you can share your SHOW CREATE TABLE and CREATE INDEX statement I can check whether you're seeing something abnormal.
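As a rough sanity check against those numbers (back-of-the-envelope only, using the ~500m rows from the question):
-- 500,000,000 rows at 10-20k rows/sec on a single core  ->  roughly 7-14 hours
-- the same build fully parallel across 20 cores         ->  roughly 21-42 minutes
-- To confirm the ALTER is still making progress rather than stuck, the
-- MySQL-compatible process list should show the long-running statement:
SHOW PROCESSLIST;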

Related

Cassandra: Data Modeling for event based time series

I have a data modeling question. In my application I'm reading data from a few different sensors and storing it in Cassandra. The sensors generate new values at very different rates: Some every other second, some every other month.
Furthermore, the assumption is that a value stays valid until the next one is encountered. Example: Sensor 1 sent a value of 500 at 10s after EPOCH and a value of 1000 at 20s after EPOCH. The valid value for 15s after EPOCH would need to be 500.
Since some rates are going to be high and I don't want unbounded partitions, I want to apply bucketing. I'm thinking about modeling my data like this:
CREATE TABLE sensor_data (
sensor_id text,
some_timing_bucket date,
measured_at time,
value double,
PRIMARY KEY ((sensor_id, some_timing_bucket), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
The usual queries the application would need to serve are "give me the data of the last 5/15 minutes/1 day", so I would choose the some_timing_bucket accordingly. Maybe even have multiple tables with different bucket sizes.
What I cannot wrap my head around is this: Consider I choose one day as bucketing interval. Now I want to retrieve the current value of a sensor that hasn't updated in ten days. There will be no partition for today, so on my application layer I would need to send nine queries that yield nothing until I have gone far enough back in time to encounter the value that is currently valid. That doesn't sound very efficient and I'd appreciate any input on how to model this.
Side note: This would not be an issue if all data for the same sensor was in the same partition: Just ask for all the points with a timestamp less than the beginning of the ranged query and limit the results to one. But that's not feasible because of the unbounded partition.
There is a much simpler way to model your data by using one-day buckets. Something like:
CREATE TABLE sensor_data_by_day (
sensor_id text,
year int,
month int,
day int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, year, month, day), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
If a sensor measures a data point every second, then there are at most 86,400 values for a single day (60 secs x 60 mins x 24 hrs). 86K rows per partition is still manageable.
If today is 17 August 2022 and you wanted to retrieve the data for the previous day, the query would be:
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 16
Assuming it is currently 08:30:00 GMT on the 17th of August (1660725000000 ms since epoch), to retrieve the data for the last 15 minutes (900 secs ago or 1660724100000 ms):
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at > 1660724100000
I think you'll find that it is easier to work with timestamps because they provide a bit more flexibility when it comes to doing range queries. Cheers!
You can do this with a simpler table like this:
CREATE TABLE sensor_data (
sensor_id text,
day_number_from_1970 int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, day_number_from_1970), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
and you can query data like that:
SELECT value
FROM sensor_data
WHERE sensor_id = some_sensor_id
AND day_number_from_1970 = day_number
AND measured_at > start_time
AND measured_at < end_time
With a single int column you store less data on disk and still get results efficiently.
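For illustration, a write against that table might look like the following (the sensor id and value are made up; the day bucket is computed in the application):
-- day_number_from_1970 = epoch milliseconds / 86400000, truncated
-- e.g. 2022-08-17 00:00:00 UTC = 1660694400000 ms  ->  day 19221
INSERT INTO sensor_data (sensor_id, day_number_from_1970, measured_at, value)
VALUES ('sensor-1', 19221, '2022-08-17 08:30:00+0000', 42.0);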

Cassandra grouping with filter

I have a table of events that are done every minute. I want to be able to filter these events by time period and also aggregate data for hour/day/etc.
My data model:
create table min_dev_data (
device TEXT,
event_time BIGINT,
hour BIGINT,
day BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
);
CREATE MATERIALIZED VIEW hour_dev_data AS
SELECT device, event_time, hour, value
FROM min_dev_data
WHERE hour IS NOT NULL AND value IS NOT NULL
and event_time IS NOT NULL AND device IS NOT NULL
PRIMARY KEY ((device), hour, event_time);
my query is
select hour, sum(value)
from hour_dev_data
where device = 'tst' and event_time < 149000000 group by device, hour;
fails with error
code=2200 [Invalid query] message="PRIMARY KEY column "event_time" cannot be restricted as preceding column "hour" is not restricted"
The only way to make it work is to add ALLOW FILTERING, which is unpredictable.
How can I change my data model to address my query and avoid ALLOW FILTERING mode?
You have to proactively produce these results:
create table min_dev_data (
device TEXT,
event_time BIGINT,
hour BIGINT,
day BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
create table hour_dev_data (
device TEXT,
hour BIGINT,
day BIGINT,
event_time BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
create table day_dev_data (
device TEXT,
day BIGINT,
event_time BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
Each table will satisfy ONE granularity only.
Every hour you query the minute table for the last hour of data for each device with something like:
SELECT * FROM min_dev_data WHERE device = X AND event_time < YYYY
Sum that at application level and store this value into the hour table:
INSERT INTO hour_dev_data (device, hour, day, event_time, value) VALUES (....);
And every day you query the hour table to produce the daily aggregates:
SELECT * FROM hour_dev_data WHERE device = X AND event_time < YYYY
Sum at the application level and store this value in the day table.
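Sketched out for the hourly job with illustrative values (assuming event_time is stored as epoch milliseconds; the summing itself happens in the application, and the daily job follows the same pattern against the hour table):
-- 1) Read the raw minute rows for the hour that just ended
SELECT event_time, value
FROM min_dev_data
WHERE device = 'tst'
AND event_time >= 1490000400000
AND event_time < 1490004000000;
-- 2) Sum the returned values in the application, then write one
--    pre-aggregated row into the hour table
INSERT INTO hour_dev_data (device, hour, day, event_time, value)
VALUES ('tst', 1490000400000, 1489968000000, 1490000400000, 123.45);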
Please consider adding some form of bucketing because, at a one-minute interval, in two months your minute table will have wide partitions. This should not be a problem if you keep the table in reverse order (like I did) and query only for the last couple of hours. But if you want to query back in time as well then you must definitely use bucketing in your tables.
I think you had things pretty much right already, but you need to change your filter on event_time to be a filter on hour.
select hour, sum(value)
from hour_dev_data
where device = 'tst' and hour < 1500000000
group by device, hour;
When you were filtering on event_time, you were implicitly requiring a full scan of the row, as event_time is clustered after hour. To filter by event_time, every cell would need to be examined to check its event_time. When you filter by hour, it is first in the clustering key, so it can be efficiently scanned and filtered. See this post on ALLOW FILTERING for more on this.
I agree with xmas79 that you probably want to be bucketing at some level, perhaps by month or year depending on your frequency of events. If you're always going to be looking for the most recent values, then setting the clustering key ordering to desc is probably going to be helpful too:
CREATE MATERIALIZED VIEW hour_dev_data3 AS
SELECT device, event_time, hour, value
FROM min_dev_data
WHERE hour IS NOT NULL AND value IS NOT NULL
and event_time IS NOT NULL AND device IS NOT NULL
PRIMARY KEY ((device), hour, event_time)
WITH CLUSTERING ORDER BY (hour DESC);
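The original aggregation should then run against this view (newest hours first) without ALLOW FILTERING:
select hour, sum(value)
from hour_dev_data3
where device = 'tst' and hour < 1500000000
group by device, hour;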
Scheduling aggregations as xmas79 suggests is going to be more efficient, as the sum is done once rather than on every read; however, it does add a maintenance burden, whereas the materialised view handles that for you.

Cassandra - aggregate timestamp by hour

I have a table with timestamps at a 15-minute interval. Is it possible to aggregate or group by hour, with the load field being the average?
There's a post on materialized views. You can use them to create a copy of the data batched by hour, then use the average aggregate function on load. I think CASSANDRA-11871 is what you're looking for though; its dependencies are in (GROUP BY was recently completed) but it hasn't been worked on yet.
Kinda just guessing at your schema, but something like (disclaimer: not really tested):
CREATE TABLE load (
ref_equip text,
ptd_assoc text,
date timestamp,
date_hour bigint,
load float,
PRIMARY KEY ((ref_equip, ptd_assoc), date)
);
CREATE MATERIALIZED VIEW load_by_hour AS
SELECT * FROM load
WHERE ref_equip IS NOT NULL AND ptd_assoc IS NOT NULL
PRIMARY KEY ((ref_equip, ptd_assoc), date_hour, date);
where date_hour is just the timestamp at hour resolution, i.e. the epoch milliseconds divided by 1000*60*60 at insert time. You can then select the average:
SELECT avg(load) FROM load_by_hour WHERE ref_equip='blarg' AND ptd_assoc='blargy' AND date_hour = 410632;
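For reference, the corresponding insert into the base table could look like this (the load value is illustrative; date_hour is computed in the application):
-- date_hour = epoch milliseconds / (1000 * 60 * 60)
-- e.g. 2016-11-04 16:20:00 UTC = 1478276400000 ms  ->  410632
INSERT INTO load (ref_equip, ptd_assoc, date, date_hour, load)
VALUES ('blarg', 'blargy', '2016-11-04 16:20:00+0000', 410632, 1.25);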
Alternatively, something that may just be better to begin with is to store your data partitioned by hour:
CREATE TABLE load (
ref_equip text,
ptd_assoc text,
date timestamp,
date_hour bigint,
load float,
PRIMARY KEY ((ref_equip, ptd_assoc, date_hour), date)
);
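With that layout the hourly average comes straight off the base table, e.g.:
SELECT avg(load) FROM load
WHERE ref_equip = 'blarg' AND ptd_assoc = 'blargy' AND date_hour = 410632;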

How to define keyspaces for a timeseries data in Cassandra?

There are 100s of data points; each data point has its own separate table, with the schema and queries as mentioned below:
Current Schema in SQLite
Table Name: Name of Data Point, e.g. Temperature
Column-1: Name: Timestamp Type: TEXT (yyyy-MM-dd HH:mm:ss.ttt format) PRIMARY KEY
Column-2: Name: Value Type: FLOAT
Column-3: Name: Quality Type: TEXT ("GOOD", "BAD")
Queries for SQLite
SELECT * FROM data-point-name;
SELECT * FROM data-point-name WHERE Timestamp BETWEEN timestamp-1 AND timestamp-2;
INSERT INTO data-point-name (Timestamp, Value, Quality) VALUES ("2016-01-01 00:00:05.254", 123.25454, "GOOD"); (this is an example)
Currently I have an SQLite db where I have a table per data point with the above schema; essentially I have 100s of tables. This way reads/writes do not disturb queries running on different data points.
How to translate this schema to be used in Cassandra?
In your case, you can store all your data points in a single table:
CREATE TABLE datapoints (
datatype varchar,
time timestamp,
value float,
quality varchar,
PRIMARY KEY (datatype, time)
);
With this structure, you can run queries like:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-name';
SELECT *
FROM datapoints
WHERE datatype = 'data-point-name'
AND time >= '2016-01-01 00:00:00'
AND time <= '2016-01-02 00:00:00';
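The insert from the question translates directly too (assuming your Cassandra version accepts millisecond-precision timestamp literals; otherwise pass the value as epoch milliseconds):
INSERT INTO datapoints (datatype, time, value, quality)
VALUES ('Temperature', '2016-01-01 00:00:05.254+0000', 123.25454, 'GOOD');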
But with this structure, Cassandra will partition the data by data point name; if you have many points, your partitions will be huge and you can run into query performance issues.
You can also refine the partitioning by decomposing the time:
CREATE TABLE datapoints (
datatype varchar,
year int,
month int,
day int,
milisecondsinday int,
value float,
quality varchar,
PRIMARY KEY ((datatype, year, month, day), milisecondsinday)
) WITH CLUSTERING ORDER BY (milisecondsinday ASC);
In this case, this structure allows Cassandra to store the data in smaller partitions than the first example, and it is more efficient if you query your data by day:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-type'
AND year = 2016
AND month = 1
AND day = 1;
To get all points for 'data-point-type' on 2016-01-01, between 00:00 AM and 01:00 AM:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-type'
AND year = 2016
AND month = 1
AND day = 1
AND milisecondsinday >= 0
AND milisecondsinday <= 3600000;
Of course, you can partition by day (as in the example) or at other time scales (hours, minutes, seconds, milliseconds). If you can, smaller partitions will be good for performance.
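A write against the decomposed table, reusing the sample row from the question (milisecondsinday is computed in the application as milliseconds since midnight):
-- '2016-01-01 00:00:05.254'  ->  year 2016, month 1, day 1, milisecondsinday 5254
INSERT INTO datapoints (datatype, year, month, day, milisecondsinday, value, quality)
VALUES ('Temperature', 2016, 1, 1, 5254, 123.25454, 'GOOD');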
Hope this can help you.

Do I have a wide row?

I created a table with this statement:
CREATE TABLE history (
salt int,
tagName varchar,
day timestamp,
room int static,
component varchar static,
instance varchar static,
property varchar static,
offset int,
value float,
PRIMARY KEY ((salt,tagName,day), offset)
);
The goal is to have for each rowkey (salt, tagName, day)
One column for component, instance and property.
One column for each offset with value as column value.
Day is just the current day (e.g. '2016-06-08'), not the current timestamp.
Salt will be very small. It is there to avoid exceeding the row size if data is sampled very fast.
I wanted to check my schema with the thrift client but it is no longer installed with the 3.5 version I have.
Is my schema correct for my goal? Is there a way to see the actual 'physical' rows with cqlsh?
Thanks!
The cassandra-cli equivalent of your CQL will be:
RowKey (salt:tagName:day)
column(offsetvalue:,value= ,timestamp=sometimestamp)
column(offsetvalue:room,value=roomValue,timestamp=sometimestamp)
column(offsetvalue:component ,value=componentValue,timestamp=sometimestamp)
column(offsetvalue:instance,value=instanceValue,timestamp=sometimestamp)
column(offsetvalue:property,value=propertyValue,timestamp=sometimestamp)
column(offsetvalue:value,value=valueValue,timestamp=sometimestamp)
