CQL3 and millions of columns composite key use case - cassandra

How in CQL3 do we do millions of columns? We have one special table where all rows are basically composite keys and very very wide.
I was reading this question that implied two ways
Does collections in CQL3 have certain limits?
Also, the types of our composite keys are String.bytes, ordered by String.
We have an exactly matching table that is Decimal.bytes, ordered by decimal.
How would one handle this in CQL3?
thanks,
Dean

"Oh, and part of my question was missing since SO formatted it out of the question. I was looking for Decimal.bytes and String.bytes as my composite key... there is no "value", just a column name, and I want all columns where decimal > 10 and decimal < 20, so to speak, and the column name = 10 occurs multiple times, as in 10.a, 10.b, 11.c, 11.d, 11.e"
CREATE TABLE widerow
(
row_key text, //whatever
column_composite1 decimal,
column_composite2 text,
PRIMARY KEY(row_key,column_composite1,column_composite2)
);
SELECT * FROM widerow WHERE row_key=...
AND column_composite1>=10.0
AND column_composite1<=20.0;
In that case, you can run a range query over column_composite1 and, for EACH value of column_composite1, have different values of column_composite2 (10.a, 10.b, 11.c, 11.d, 11.e...).
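As a rough illustration of the answer above, the following sketch loads the five column names from the question into the widerow table and runs a strict range over the decimal component. The row key 'r1' is a made-up value for the example.

```cql
-- Hypothetical sample data: each row stands for one "column name" (decimal.text) in the wide row.
INSERT INTO widerow (row_key, column_composite1, column_composite2) VALUES ('r1', 10.0, 'a');
INSERT INTO widerow (row_key, column_composite1, column_composite2) VALUES ('r1', 10.0, 'b');
INSERT INTO widerow (row_key, column_composite1, column_composite2) VALUES ('r1', 11.0, 'c');
INSERT INTO widerow (row_key, column_composite1, column_composite2) VALUES ('r1', 11.0, 'd');
INSERT INTO widerow (row_key, column_composite1, column_composite2) VALUES ('r1', 11.0, 'e');

-- decimal > 10 and decimal < 20, whatever the text component is.
SELECT column_composite1, column_composite2
FROM widerow
WHERE row_key = 'r1'
  AND column_composite1 > 10.0
  AND column_composite1 < 20.0;
```

With this data the query should return the 11.c, 11.d and 11.e entries, ordered by the decimal component and then by the text component.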
"How do I get all the columns where row_composite1 > "a" and row_composite1 < "b" in that use case? I.e., I don't care about the second half of the composite name."
There are two possible solutions here:
1. Make row_composite1 a composite component of the column name
2. Use OrderPreservingPartitioner (this is strongly discouraged)
For solution 1:
CREATE TABLE widerow
(
row_key text, // fake row key, whatever
column_composite1 text, // previously row_composite1
column_composite2 decimal,
column_composite3 text,
PRIMARY KEY(row_key,column_composite1,column_composite2,column_composite3)
);
SELECT * FROM widerow WHERE row_key=...
AND column_composite1>='a'
AND column_composite1<='b';
This modeling has a drawback, though. To run a range query over the decimal values (column_composite2), you first need to pin column_composite1 with an equality:
SELECT * FROM widerow WHERE row_key=...
AND column_composite1='a'
AND column_composite2>=10.0
AND column_composite2<=20.0;

CREATE TABLE widerow
(
row_composite1 text,
row_composite2 text,
column_name decimal,
value text,
PRIMARY KEY((row_composite1,row_composite2),column_name)
);
SELECT * FROM widerow WHERE row_composite1=...
AND row_composite2=...
AND column_name>=10.0
AND column_name<=20.0
ORDER BY column_name DESC;

Related

Set primary key for range query in Cassandra

I want to create a table with these columns: id1, id2, type, time, data, version.
The frequent query is:
select * from table_name where id1 = ... and id2 =... and type = ...
select * from table_name where id1= ... and type = ... and time > ... and time < ...
I don't know how to set the primary key for the fast query?
As you have two different queries, you will likely need to have two different tables for them to perform well. This is not unusual for Cassandra data models. Keep in mind that for both of these, the PRIMARY KEY definition in Cassandra is largely dependent on the cardinalities and anticipated query patterns. As you have only provided the latter, you may need to make adjustments based on the cardinalities of id1, id2, and type.
select * from table_name where id1 = X and id2 = Y and type = Z;
So here I'm going to make an educated guess that id1 and id2 are nigh unique (high cardinality), as IDs usually are. I don't know how many types are available in your application, but as long as there aren't more than 10,000 this should work:
CREATE TABLE table_name_by_ids (
id1 TEXT,
id2 TEXT,
type TEXT,
time TIMESTAMP,
data TEXT,
version TEXT,
PRIMARY KEY ((id1,id2),type));
This will key your partitions on a joint hash of id1 and id2, sorting the rows inside by type (default ascending).
select * from table_name where id1= X and type = Z and time > A and time < B;
Likewise, the table to support this query will look like this:
CREATE TABLE table_name_by_id1_time (
id1 TEXT,
id2 TEXT,
type TEXT,
time TIMESTAMP,
data TEXT,
version TEXT,
PRIMARY KEY ((id1),type,time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
Again, this should work as long as you don't have more than several thousand type/time combinations.
One final adjustment that I would make though, would be around judging just how many type/time combinations you expect to have over the life of the application. If this data will grow over time, then the above will cause the partitions to grow to an unmaintainable point. To keep that from happening, I'd also recommend adding a time "bucket."
version TEXT,
month_bucket TEXT,
PRIMARY KEY ((id1,month_bucket),type,time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
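Assuming the fragment above extends the table_name_by_id1_time definition shown earlier (an assumption, since the answer only shows the changed lines), the full bucketed table would look roughly like this:

```cql
-- Sketch: the earlier table_name_by_id1_time columns plus month_bucket in the partition key.
CREATE TABLE table_name_by_id1_time (
    id1 TEXT,
    id2 TEXT,
    type TEXT,
    time TIMESTAMP,
    data TEXT,
    version TEXT,
    month_bucket TEXT,   -- e.g. '201910' for October 2019
    PRIMARY KEY ((id1, month_bucket), type, time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
```

Each (id1, month_bucket) pair is then its own partition, so a partition only ever holds one month of type/time rows.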
Likewise for this, the query will need to be adjusted as well:
select * from table_name_by_id1_time
where id1= 'X' and type = 'Z'
and month_bucket='201910'
and time > '2019-10-07 00:00:00' and time < '2019-10-07 16:22:12';
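When the requested time range crosses a month boundary, one option is to list every bucket the range touches. This is only a sketch with made-up dates. Depending on the Cassandra version, issuing one query per bucket from the application may be the safer choice.

```cql
-- Range spanning late September and early October 2019, so both buckets are named.
SELECT * FROM table_name_by_id1_time
WHERE id1 = 'X'
  AND month_bucket IN ('201909', '201910')
  AND type = 'Z'
  AND time > '2019-09-28 00:00:00' AND time < '2019-10-07 16:22:12';
```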
Hope this helps.
how do I guarantee the atomicity of these two insertions?
Simply put, you can run the two INSERTs together in an atomic batch.
BEGIN BATCH
INSERT INTO table_name_by_ids (
id1, id2, type, time, data, version
) VALUES (
'X', 'Y', 'Z', '2019-10-07 12:00:01','stuff','1.0'
) ;
INSERT INTO table_name_by_id1_time (
id1, id2, type, time, data, version, month_bucket
) VALUES (
'X', 'Y', 'Z', '2019-10-07 12:00:01','stuff','1.0','201910'
);
APPLY BATCH;
For more info, check out the DataStax docs on atomic batches: https://docs.datastax.com/en/dse/6.7/cql/cql/cql_using/useBatchGoodExample.html

Storing time specific data in cassandra

I am looking for a good way to store time specific data in cassandra.
Each entry can look like (start_time, value). Later, I would like to retrieve the current value.
The logic for retrieving the current value is as follows.
Find all rows with start_time<=current_time.
Then find the value with maximum start_time from the rows obtained in the first step.
PS:- Edited the question to make it more clear
The exact requirements cannot be met, but we can get close with one more column.
First, to be able to use the <= operator, your start_time column needs to be a clustering column of your table.
Then, you need a different partition key. You could choose a fixed value, but that causes problems once the partition holds too many rows. It is better to use something like the year or the month of the start_time.
CREATE TABLE time_specific_table (
year bigint,
start_time timestamp,
value text,
PRIMARY KEY((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The problem is that when you query the table, you need to know the value of the partition key:
Find all rows with start_time<=current_time
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
select the value with maximum start_time
SELECT * FROM time_specific_table
WHERE year = :year LIMIT 1;
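Because the table is clustered by start_time in descending order, the two steps above can probably be combined into one query: restricting start_time and taking the first row yields the latest entry that has already started. A sketch reusing the :year and :time bind markers from the queries above:

```cql
-- Latest value whose start_time is not in the future, within one year partition.
SELECT start_time, value FROM time_specific_table
WHERE year = :year AND start_time <= :time
LIMIT 1;
```

If the partition for the current year has no matching row yet, the application would need to repeat the query against the previous year.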
Create two separate tables like below:
CREATE TABLE data (
start_time timestamp,
value int,
PRIMARY KEY(start_time, value)
);
CREATE TABLE current_value (
partition int PRIMARY KEY,
value int
);
Now you have to insert data into both tables. To insert into the second table, use a static partition value such as 1:
INSERT INTO current_value(partition, value) VALUES(1, 10);
The row in the current_value table is now upserted on every write, so a select on it always returns the latest value written.
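A sketch of how the two tables from this answer might be used together. The timestamp and values are made up, and the approach assumes that entries are written in start_time order, so that the last write is also the current value.

```cql
-- Keep the full history in data...
INSERT INTO data (start_time, value) VALUES ('2019-10-07 12:00:00', 10);
-- ...and overwrite the single row in current_value (fixed partition key 1).
INSERT INTO current_value (partition, value) VALUES (1, 10);

-- Reading the current value is then a single-partition lookup.
SELECT value FROM current_value WHERE partition = 1;
```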

how to do the query in cassandra If i have two cluster key in column family

I have a column family and syntax like this:
CREATE TABLE sr_number_callrecord (
id int,
callerph text,
sr_number text,
callid text,
start_time text,
plan_id int,
PRIMARY KEY((sr_number), start_time, callerph)
);
I want to do the query like :
a) select * from dummy where sr_number='+919xxxx8383'
and start_time >='2014-12-02 08:23:18' limit 10;
b) select * from dummy where sr_number='+919xxxxxx83'
and start_time >='2014-12-02 08:23:18'
and callerph='+9120xxxxxxxx0' limit 10;
The first query works fine, but the second query gives an error like:
Bad Request: PRIMARY KEY column "callerph" cannot be restricted
(preceding column "start_time" is either not restricted or by a non-EQ
relation)
Since the first query returns results, I expected the second query, which only adds one more clustering key, to filter them down to fewer rows.
Just like you cannot skip PRIMARY KEY components, you may only use a non-equals operator on the last component that you query (which is why your 1st query works).
If you do need to serve both of the queries you have listed above, then you will need to have separate query tables for each. To serve the second query, a query table (with the same columns) will work if you define it with a PRIMARY KEY like this:
PRIMARY KEY((sr_number), callerph, start_time)
That way you are still specifying the parts of your PRIMARY KEY in order, and your non-equals condition is on the last PRIMARY KEY component.
There are certain restrictions in the way the primary key columns are to be used in the where clause http://docs.datastax.com/en/cql/3.1/cql/cql_reference/select_r.html
One solution that will work in your situation is to change the order of clustering columns in the primary key
CREATE TABLE sr_number_callrecord (
id int,
callerph text,
sr_number text,
callid text,
start_time text,
plan_id int,
PRIMARY KEY((sr_number), callerph, start_time)
);
Now you can use range query on the last column as
select * from sr_number_callrecord where sr_number = '1234' and callerph = '+91123' and start_time >= '1234';

Cassandra: Is there a limit to amount of data that a collection column can hold?

In the below table, what is the maximum size phone_numbers column can accommodate ?
Like normal columns, is it 2GB ?
Is it 64K*64K as mentioned here
CREATE TABLE d2.employee (
id int PRIMARY KEY,
doj timestamp,
name text,
phone_numbers map<text, text>
)
Collection types in Cassandra are represented as a set of distinct cells in the internal data model: you will have a cell for each key of your phone_numbers column. Therefore they are not normal columns, but a set of columns. You can verify this by executing the following command in cassandra-cli (1001 stands for a valid employee id):
use d2;
get employee[1001];
The correct answer is your second option.
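Since each map key is stored as its own cell, individual entries can be written and removed without rewriting the whole collection. A small sketch against the d2.employee table above (the keys and phone numbers are made up):

```cql
-- Add or overwrite a single map entry: only the cell for key 'home' is written.
UPDATE d2.employee SET phone_numbers['home'] = '555-0100' WHERE id = 1001;
-- Merge more entries into the existing map.
UPDATE d2.employee SET phone_numbers = phone_numbers + {'work': '555-0101'} WHERE id = 1001;
-- Remove one entry without touching the others.
DELETE phone_numbers['home'] FROM d2.employee WHERE id = 1001;
```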

Create a super column using CQL3

I am upgrading my thrift api to cql3. My data contains SuperColumns as follows:
- User //column family
- Division/name //my row key
-DivHead //SuperColumn
- name //Columns
- address //Columns
I understand all the column families to be changed to tables. And the primary key becomes the rowkey. So rest are the columns.
But my data has supercolumns. how do I create supercolumns using CQL3?
CREATE TABLE user (
rowkey varchar,
division text,
head_name text,
head_address text,
PRIMARY KEY (rowkey, division)
)
OR
CREATE TABLE user (
rowkey varchar,
division text,
head_name text,
head_address text,
PRIMARY KEY ((rowkey, division))
)
Under the covers, the first example will store all rows that share a rowkey in the same partition. Each rowkey will have a set of logical rows, one for each division. Those rows will contain two columns: head_name and head_address. You can query based on the rowkey and get all divisions (sorted!). Or you can query a rowkey with a range of divisions or a single division and get a subset of the divisions with their division head and address.
The second example will have one partition for each rowkey and division combination. Each such partition will be one logical row as well. The single row for each composite key will have two columns: head_name and head_address. To make a query, you must provide BOTH the rowkey and the division.
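The query patterns described above could be sketched as follows. The values 'acme' and 'sales' are placeholders, and since both definitions use the table name user, only one of them can exist at a time.

```cql
-- First definition, PRIMARY KEY (rowkey, division): division is a clustering column.
-- All divisions of one rowkey, sorted by division.
SELECT * FROM user WHERE rowkey = 'acme';
-- A range of divisions within one rowkey.
SELECT * FROM user WHERE rowkey = 'acme' AND division >= 'a' AND division < 'n';

-- Second definition, PRIMARY KEY ((rowkey, division)): both parts must be given.
SELECT * FROM user WHERE rowkey = 'acme' AND division = 'sales';
```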
