Primary key in nested structure on ClickHouse database

In ClickHouse I have created a table with a nested structure:
CREATE TABLE IF NOT EXISTS table_name (
    timestamp Date,
    str_1 String,
    Nested_structure Nested (
        index_array UInt32,
        metric_2 UInt64,
        metric_3 UInt8
    ),
    sign Int8 DEFAULT 1
) ENGINE = CollapsingMergeTree(sign) PARTITION BY (toYYYYMM(timestamp)) ORDER BY (timestamp, str_1)
The queries that I will make are like:
SELECT count(*) AS count FROM table_name
WHERE (timestamp = '2017-09-01')
AND arrayFirst((i, x) -> x = 7151, Nested_structure.metric_2, Nested_structure.index_array) > 50000
I want to count the str_1 rows where the value of the array column metric_2, at the position where index_array holds the value 7151, is greater than a given threshold (50000).
I was wondering if it is possible to have a primary key on the column index_array in order to make the query faster.
If I add the column Nested_structure.index_array to the ORDER BY clause, it is treated as the whole array column of the table, not as the individual values of index_array inside Nested_structure, e.g.:
ORDER BY (timestamp, str_1, Nested_structure.index_array)
The algorithm is:
1. Search for the index of a given value in index_array.
2. Having the index from step (1), retrieve the values from the other arrays.
If index_array is sorted and the table has knowledge of that, then step (1) could be faster (using a binary search algorithm, for example).
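For reference, the two-step lookup can be written out explicitly with indexOf() and array subscripting; a sketch against the table above (still a full scan, and out-of-range subscripts return the type's default value, so a missing 7151 safely yields 0):
SELECT count(*) AS count
FROM table_name
WHERE timestamp = '2017-09-01'
  AND Nested_structure.metric_2[indexOf(Nested_structure.index_array, 7151)] > 50000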
Does anyone have an idea?
=============
EDIT
Cardinality of the columns:
str_1: ~15,000,000 distinct values
index_array: 15,000 - 20,000 distinct values
Assuming that the distinct values of index_array are column_1, ..., column_15000, a denormalized table would have the structure below:
timestamp,
str_1,
column_1a, <-- store values for metric_2
...
column_15000a, <-- store values for metric_2
column_1b, <-- store values for metric_3
...
column_15000b, <-- store values for metric_3
@Amos, could you give me the structure of the table if I use a column of type LowCardinality?

I was wondering if it is possible to have a primary key on the column index_array in order to make the query faster.
Nope, ClickHouse doesn't have array indices. If you supply Nested_structure.index_array as the third argument of the ORDER BY clause, it will just order entire rows, taking the array column into account. Note that [1,2] < [1,2,3].
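The array comparison rule is easy to check directly:
SELECT [1,2] < [1,2,3] AS lt  -- lexicographic comparison, returns 1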
You can just denormalize the table without the nested column and give the first two columns the type LowCardinality, which is almost production-ready.
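For reference, LowCardinality just wraps the declared type; a minimal sketch of the denormalized table with it applied to str_1 (whether it pays off depends on the real distinct counts, see the update below):
CREATE TABLE IF NOT EXISTS table_name (
    timestamp Date,
    str_1 LowCardinality(String),  -- dictionary-encoded variant of String
    index_array UInt32,
    metric_2 UInt64,
    metric_3 UInt8,
    sign Int8 DEFAULT 1
) ENGINE = CollapsingMergeTree(sign) PARTITION BY (toYYYYMM(timestamp)) ORDER BY (timestamp, str_1, index_array)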
Update
It seems you won't benefit much from LowCardinality types. What I meant was doing something like this:
CREATE TABLE IF NOT EXISTS table_name (
    timestamp Date,
    str_1 String,
    index_array UInt32,
    metric_2 UInt64,
    metric_3 UInt8,
    sign Int8 DEFAULT 1
) ENGINE = CollapsingMergeTree(sign) PARTITION BY (toYYYYMM(timestamp)) ORDER BY (timestamp, str_1, index_array)
And you can still use the old insertion logic by doing this:
CREATE TABLE IF NOT EXISTS table_name (
    timestamp Date,
    str_1 String,
    index_array UInt32,
    metric_2 UInt64,
    metric_3 UInt8,
    sign Int8 DEFAULT 1
) ENGINE = CollapsingMergeTree(sign) PARTITION BY (toYYYYMM(timestamp)) ORDER BY (timestamp, str_1, index_array);

CREATE TABLE IF NOT EXISTS source_table (
    timestamp Date,
    str_1 String,
    Nested_structure Nested (
        index_array UInt32,
        metric_2 UInt64,
        metric_3 UInt8
    ),
    sign Int8 DEFAULT 1
) ENGINE = Null;

CREATE MATERIALIZED VIEW data_pipe TO table_name AS
SELECT
    timestamp,
    str_1,
    Nested_structure.index_array AS index_array,
    Nested_structure.metric_2 AS metric_2,
    Nested_structure.metric_3 AS metric_3,
    sign
FROM source_table
ARRAY JOIN Nested_structure;

INSERT INTO source_table VALUES (today(), 'fff', [1,2,3], [2,3,4], [3,4,5], 1);
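To verify the pipeline, selecting from the target table should show one flattened row per array element; the rows below are what the ARRAY JOIN semantics predict for the insert above, not captured output:
SELECT * FROM table_name;
-- <today> | fff | 1 | 2 | 3 | 1
-- <today> | fff | 2 | 3 | 4 | 1
-- <today> | fff | 3 | 4 | 5 | 1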

Related

Set primary key for range query in Cassandra

I want to create a table with these columns: id1, id2, type, time, data, version.
The frequent query is:
select * from table_name where id1 = ... and id2 =... and type = ...
select * from table_name where id1= ... and type = ... and time > ... and time < ...
I don't know how to define the primary key so that these queries are fast.
As you have two different queries, you will likely need to have two different tables for them to perform well. This is not unusual for Cassandra data models. Keep in mind that for both of these, the PRIMARY KEY definition in Cassandra is largely dependent on the cardinalities and anticipated query patterns. As you have only provided the latter, you may need to make adjustments based on the cardinalities of id1, id2, and type.
select * from table_name where id1 = X and id2 = Y and type = Z;
So here I'm going to make an educated guess that id1 and id2 are nigh unique (high cardinality), as IDs usually are. I don't know how many types are available in your application, but as long as there aren't more than 10,000 this should work:
CREATE TABLE table_name_by_ids (
    id1 TEXT,
    id2 TEXT,
    type TEXT,
    time TIMESTAMP,
    data TEXT,
    version TEXT,
    PRIMARY KEY ((id1, id2), type));
This will key your partitions on a joint hash of id1 and id2, sorting the rows inside by type (default ascending).
select * from table_name where id1= X and type = Z and time > A and time < B;
Likewise, the table to support this query will look like this:
CREATE TABLE table_name_by_id1_time (
    id1 TEXT,
    id2 TEXT,
    type TEXT,
    time TIMESTAMP,
    data TEXT,
    version TEXT,
    PRIMARY KEY ((id1), type, time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
Again, this should work as long as you don't have more than several thousand type/time combinations.
One final adjustment that I would make though, would be around judging just how many type/time combinations you expect to have over the life of the application. If this data will grow over time, then the above will cause the partitions to grow to an unmaintainable point. To keep that from happening, I'd also recommend adding a time "bucket."
    version TEXT,
    month_bucket TEXT,
    PRIMARY KEY ((id1, month_bucket), type, time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
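Put together, the adjusted table would look something like this (a sketch; the 'YYYYMM' bucket format is an assumption taken from the query below):
CREATE TABLE table_name_by_id1_time (
    id1 TEXT,
    id2 TEXT,
    type TEXT,
    time TIMESTAMP,
    data TEXT,
    version TEXT,
    month_bucket TEXT,
    PRIMARY KEY ((id1, month_bucket), type, time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);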
Likewise for this, the query will need to be adjusted as well:
select * from table_name_by_id1_time
where id1= 'X' and type = 'Z'
and month_bucket='201910'
and time > '2019-10-07 00:00:00' and time < '2019-10-07 16:22:12';
Hope this helps.
How do I guarantee the atomicity of these two insertions?
Simply put, you can run the two INSERTs together in an atomic batch.
BEGIN BATCH
  INSERT INTO table_name_by_ids (
    id1, id2, type, time, data, version
  ) VALUES (
    'X', 'Y', 'Z', '2019-10-07 12:00:01', 'stuff', '1.0'
  );
  INSERT INTO table_name_by_id1_time (
    id1, id2, type, time, data, version, month_bucket
  ) VALUES (
    'X', 'Y', 'Z', '2019-10-07 12:00:01', 'stuff', '1.0', '201910'
  );
APPLY BATCH;
For more info, check out the DataStax docs on atomic batches: https://docs.datastax.com/en/dse/6.7/cql/cql/cql_using/useBatchGoodExample.html

SASI indexes on year and month

I am new to SASI indexes in Cassandra, and I am unclear on how they behave when multiple indexed columns are included in the WHERE predicate.
Here is one option I am looking at:
Option 1:
CREATE TABLE IF NOT EXISTS my_timeseries_data (
    id text,
    event_time timestamp,
    value text,
    year int,
    month int,
    PRIMARY KEY (id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
CREATE CUSTOM INDEX year_idx ON my_timeseries_data (year)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };
CREATE CUSTOM INDEX month_idx ON my_timeseries_data (month)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };
I expect to query like this sometimes:
select * from my_timeseries_data
where year = 2016 and month = 1 ALLOW FILTERING;
Does the SASI index on 'month' column help my performance?
Option 2:
Would it be better to index a concatenated column like 'year_and_month' below?
CREATE TABLE IF NOT EXISTS my_timeseries_data (
    id text,
    event_time timestamp,
    value text,
    year_and_month text,
    PRIMARY KEY (id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
CREATE CUSTOM INDEX year_idx ON my_timeseries_data (year_and_month)
USING 'org.apache.cassandra.index.sasi.SASIIndex';
And then query like this on a single SASI index:
select * from my_timeseries_data
where year_and_month = '2016_1';
Option 3:
Is there no need for the extra month and year columns and SASI indexes, because having event_time as a CLUSTERING COLUMN already allows the scalable time-range queries that I want to do anyway?
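In other words, something like this per-id range query, which needs no secondary index at all (the id value is made up):
SELECT * FROM my_timeseries_data
WHERE id = 'station-1'
  AND event_time >= '2016-01-01' AND event_time < '2016-02-01';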

How to do the query in Cassandra if I have two clustering keys in a column family

I have a column family defined like this:
CREATE TABLE sr_number_callrecord (
    id int,
    callerph text,
    sr_number text,
    callid text,
    start_time text,
    plan_id int,
    PRIMARY KEY ((sr_number), start_time, callerph)
);
I want to run queries like:
a) select * from dummy where sr_number='+919xxxx8383'
and start_time >='2014-12-02 08:23:18' limit 10;
b) select * from dummy where sr_number='+919xxxxxx83'
and start_time >='2014-12-02 08:23:18'
and callerph='+9120xxxxxxxx0' limit 10;
The first query works fine, but the second query gives an error:
Bad Request: PRIMARY KEY column "callerph" cannot be restricted
(preceding column "start_time" is either not restricted or by a non-EQ
relation)
Since the first query returns results, in the second query I am just adding one more clustering key to filter the result further, so there should be fewer rows.
Just like you cannot skip PRIMARY KEY components, you may only use a non-equality (range) operator on the last component that you query (which is why your first query works).
If you do need to serve both of the queries you have listed above, then you will need to have separate query tables for each. To serve the second query, a query table (with the same columns) will work if you define it with a PRIMARY KEY like this:
PRIMARY KEY((sr_number), callerph, start_time)
That way you are still specifying the parts of your PRIMARY KEY in order, and your non-equals condition is on the last PRIMARY KEY component.
There are certain restrictions on how the primary key columns can be used in the WHERE clause: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/select_r.html
One solution that will work in your situation is to change the order of the clustering columns in the primary key:
CREATE TABLE sr_number_callrecord (
    id int,
    callerph text,
    sr_number text,
    callid text,
    start_time text,
    plan_id int,
    PRIMARY KEY ((sr_number), callerph, start_time)
);
Now you can use a range query on the last column:
select * from sr_number_callrecord where sr_number = '1234' and callerph = '+91123' and start_time >= '1234';

Order in a limited query with composite keys on Cassandra

In the following scenario:
CREATE TABLE temperature_by_day (
    weatherstation_id text,
    date text,
    event_time timestamp,
    temperature text,
    PRIMARY KEY ((weatherstation_id, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-03','2013-04-03 07:01:00','72F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-03','2013-04-03 08:01:00','74F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-04','2013-04-04 07:01:00','73F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-04','2013-04-04 08:01:00','76F');
If I do the following query:
SELECT *
FROM temperature_by_day
WHERE weatherstation_id='1234ABCD'
AND date in ('2013-04-04', '2013-04-03') limit 2;
I noticed that Cassandra orders the result in the same sequence as the partition keys given in the IN clause. In this case, I'd like to know whether the expected result is ALWAYS the two records of the day '2013-04-04'. That is, does Cassandra respect the order of the IN clause when ordering the result, even in a scenario with multiple nodes?

Understanding Cassandra's storage overhead

I have been reading this section of the Cassandra docs and found the following a little puzzling:
Determine column overhead:
regular_total_column_size = column_name_size + column_value_size + 15
counter/expiring total_column_size = column_name_size + column_value_size + 23
Every column in Cassandra incurs 15 bytes of overhead. Since each row in a table can have different column names as well as differing numbers of columns, metadata is stored for each column. For counter columns and expiring columns, you should add an additional 8 bytes (23 bytes total).
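Plugging in one of the columns from the schema below: end_date has an 8-byte name and a timestamp value is 8 bytes, so regular_total_column_size = 8 + 8 + 15 = 31 bytes, versus 1 + 8 + 15 = 24 bytes with a single-character name.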
The way I interpret the above for a CQL3 defined schema such as:
CREATE TABLE mykeyspace.mytable(
id text,
report_id text,
subset_id text,
report_date timestamp,
start_date timestamp,
end_date timestamp,
subset_descr text,
x int,
y double,
z int,
PRIMARY KEY (id, report_id, subset_id)
);
is that each row will contain the metadata for the column names, e.g., the strings report_date, start_date, end_date, etc. and their type along with the data. However, it's not clear to me what it means that each row in a table can have different column names. This sounds wrong to me given the schema above is totally static, i.e., Cassandra 2.0 will most certainly complain if I try to write:
INSERT INTO mykeyspace.mytable (id, report_id , subset_id, x, y, z, w)
VALUES ( 'asd','qwe','rty',100,1.234,12, 123.123);
Bad Request: Unknown identifier w
Now it looks to me like column names are fixed given this table schema and thus the metadata should not need to be stored per each row. I am guessing either the phrasing in the documentation is outdated (it's the same as Cassandra 1.2) or I'm misunderstanding some core concept at work here.
Can anybody clarify? Bottom line: do I have to worry about the length of the names of my columns or not?
We have been playing it safe and used single-character names where possible (so the above columns would actually be i, r, s, dr, ds, de, sd, ...), but that is hardly human-readable and can be confusing to work with.
The easiest way to figure out what is going on in situations like this is to check the sstable2json (cassandra/bin) representation of your data. This will show you what actually ends up being saved on disk.
Here is the example for your situation:
[
{"key": "4b6579","columns": [
["rid1:ssid1:","",1401469033325000],
["rid1:ssid1:end_date","2004-10-03 00:00:00-0700",1401469033325000],
["rid1:ssid1:report_date","2004-10-03 00:00:00-0700",1401469033325000],
["rid1:ssid1:start_date","2004-10-03 00:00:00-0700",1401469033325000],
["rid1:ssid1:subset_descr","descr",1401469033325000],
["rid1:ssid1:x","1",1401469033325000],
["rid1:ssid1:y","5.5",1401469033325000],
["rid1:ssid1:z","1",1401469033325000],
["rid2:ssid2:","",1401469938599000],
["rid2:ssid2:end_date", "2004-10-03 00:00:00-0700",1401469938599000],
["rid2:ssid2:report_date","2004-10-03 00:00:00-0700",1401469938599000],
["rid2:ssid2:start_date","2004-10-03 00:00:00-0700",1401469938599000],
["rid2:ssid2:subset_descr","descr",1401469938599000],
["rid2:ssid2:x","1",1401469938599000],
["rid2:ssid2:y","5.5",1401469938599000],
["rid2:ssid2:z","1",1401469938599000]
}
]
The value of the partition key is saved once per partition (per sstable), as you can see above; the column name in this case doesn't matter at all, since it is implicit given the table. The column names for the clustering columns are also not present, because with C* you aren't allowed to insert without specifying all portions of the key.
What's left, though, does have the column name. This is needed in case a partial update to a row is made, so that it can be saved without the rest of the row's information. You could imagine an update to a single column of a row; to indicate which field it is, C* currently uses the column name, but there are tickets to change this to a smaller representation.
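For instance, a hypothetical single-cell write like the following ships only the x cell, so that cell has to carry its own column name on disk:
UPDATE mykeyspace.mytable SET x = 2
WHERE id = 'Key' AND report_id = 'rid1' AND subset_id = 'ssid1';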
https://issues.apache.org/jira/browse/CASSANDRA-4175
To generate this
cqlsh
CREATE TABLE mykeyspace.mytable (
    id text, report_id text, subset_id text,
    report_date timestamp, start_date timestamp, end_date timestamp,
    subset_descr text, x int, y double, z int,
    PRIMARY KEY (id, report_id, subset_id)
);
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid1','ssid1', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid2','ssid2', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
exit;
nodetool flush
bin/sstable2json $DATA_DIR/mytable/mykeyspace-mytable-jb-1-Data.db