Confusion over data model in Cassandra

Hello, we have a table in Cassandra whose structure is as below:
CREATE TABLE dmp.user_profiles_6 (
vuid text PRIMARY KEY,
brand_model text,
first_seen timestamp,
last_seen timestamp,
total_day_count int,
total_usage_count int,
user_type text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.1
AND speculative_retry = '99PERCENTILE';
I read a few articles about data modeling in Cassandra from DataStax. They said that a primary key consists of a partition key and a clustering key.
Now in the above case we have a vuid column, which is an identifier for every unique user; it is the primary key. We have 400M unique users. So does this mean that Cassandra is creating 400M partitions? Surely this must degrade performance. In one DataStax article about data modeling, an example table shows a primary key on a uuid column which is unique and has very high cardinality. I am totally confused; can anyone help me identify which column can be set as the partition key and which as the clustering key?
Queries can be as below:
1. Select a record directly on the basis of vuid
2. Select vuids on the basis of a range of last_seen or first_seen

Select a record directly on the basis of vuid >>
Your table does that; it already has vuid as its primary key.
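For example (the vuid literal is a placeholder):
SELECT * FROM dmp.user_profiles_6 WHERE vuid = 'some-user-id';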
Select vuids on the basis of a range of last_seen or first_seen >>
There are two options here:
Either add last_seen or first_seen as clustering columns (you can do range selection on clustering columns only).
In this case you would need to provide vuid along with last_seen or first_seen in the query. I don't think you want that.
OR
Create another table which has the same data (yes, in C* we create another table for a different query with the same data, and change the keys to suit the query; welcome to data duplication). In this table you have to add a dummy column as the partition key and make last_seen and first_seen clustering keys. You pass these seen dates in the query to fetch vuids.
Hope this is clear.

You need to create 3 tables, as below.
table 1:-
CREATE TABLE dmp.user_profiles_ZZZZ (
Dummy_column uuid ,
vuid text,
........other columns
PRIMARY KEY((Dummy_column,vuid))
) .....
table 2:-
CREATE TABLE dmp.user_profiles_YYYY (
Dummy_column uuid ,
.......other columns
PRIMARY KEY((Dummy_column),first_seen)
) .....
table 3:-
CREATE TABLE dmp.user_profiles_XXXX (
Dummy_column uuid ,
.....other columns
PRIMARY KEY((Dummy_column),last_seen)
) .....
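As a sketch of how the range query would then look against, say, the second table (the dummy value is whatever constant the application writes, and it is assumed vuid is among the other columns):
SELECT vuid FROM dmp.user_profiles_YYYY
WHERE Dummy_column = 00000000-0000-0000-0000-000000000000
AND first_seen >= '2016-01-01' AND first_seen < '2016-02-01';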

In Cassandra (a query-driven model), tables are created to satisfy the queries; this is different from relational database data modeling.
In Cassandra, a primary key consists of 2 types of keys, used depending on the query:
1. Partition key -> defines the partitions
2. Clustering key -> defines the order within a partition
If the columns in the partition key and clustering key are not enough to provide uniqueness, then we need to add the primary key of the relation to the primary key.
Apart from that, as a tip:
[Column name XX] = ?  -> equality check: add the column to the partition key
[Column name YY] >= ? -> range check: add the column to the clustering key
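For instance, under this rule of thumb, a hypothetical query such as
SELECT vuid FROM profiles_by_type WHERE user_type = 'premium' AND last_seen >= '2016-01-01';
would suggest PRIMARY KEY ((user_type), last_seen) for the table that serves it (profiles_by_type and the literal values are made up for illustration).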
Here in the question it is not mentioned which query should be served.
Please share the query; based on that, the table can be created.

Related

Is there a way to index a map type column in Cassandra

I have a table susbcriber, which will contain millions of rows.
The table schema in Cassandra is as below:
CREATE TABLE susbcriber (
id int PRIMARY KEY,
age_identifier text,
alternate_mobile_identifier text,
android_identifier text,
batch_id text,
circle text,
city_identifier text,
country text,
country_identifier text,
created_at text,
deleted_at text,
email_identifier text,
gender_identifier text,
ios_identifier text,
list_master_id int,
list_subscriber_id text,
mobile_identifier text,
operator text,
partition_id text,
raw_data map<text, text>,
region_identifier text,
unique_identifier text,
updated_at text,
web_push_identifier text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I have to make filter queries mostly on the 'raw_data map<text, text>' column; it contains JSON keys and values. How can I model the data so that selects and updates are fast?
I am also trying to achieve some bulk update operations.
Any suggestion is highly appreciated.
Yeah, you can.
A map is used to store dynamic data in a table. You can have an index based upon the keys, entries, or values of a map. There are three options, mentioned below:
If your use case is to search the keys of the dynamic data, use the first.
If you want to search on the value of a known key in the map, use the second.
If you don't know the keys and just want to search the values in the map, use the third.
CREATE INDEX idx_first ON <keyspaceName.tableName> (KEYS(<mapColumn>));
CREATE INDEX idx_second ON <keyspaceName.tableName> (ENTRIES(<mapColumn>));
CREATE INDEX idx_third ON <keyspaceName.tableName> (VALUES(<mapColumn>));
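For reference, a sketch of the query shape each index enables ('city' and 'london' are made-up key/value examples):
SELECT * FROM susbcriber WHERE raw_data CONTAINS KEY 'city';   -- needs the KEYS index
SELECT * FROM susbcriber WHERE raw_data['city'] = 'london';    -- needs the ENTRIES index
SELECT * FROM susbcriber WHERE raw_data CONTAINS 'london';     -- needs the VALUES index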
If the data is already in the map you don't really need to keep the values in their own columns as well, and if it's just a key to a map it's easier on Cassandra to represent it as a clustering key instead of a collection, like:
CREATE TABLE susbcriber_data (
id int,
key text,
value text,
PRIMARY KEY((id), key))
Then you can query by any id and key. If you are looking for rows where a specific key has a given value, then:
CREATE TABLE susbcriber_data_by_value (
id int,
shard int,
key text,
value text,
PRIMARY KEY((key, shard), value, id))
Then when you insert, you set shard to be id % 12 or some value such that your partitions do not get too large (this needs some guessing based on expected load). To see all the values where key = value, you then need to query all 12 of those shards (an async call to each, then merge). If your cardinality for the key/value pairs is low enough, the shard might be unnecessary. This gives you a list of the ids, which you can then look up. If you want to avoid the lookup you can add additional keys and values to that table, but your data may explode quite a bit depending on the number of keys in your map, and keeping everything updated will be painful.
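A minimal sketch of that pattern, assuming 12 shards and the table above (the shard value is computed client-side):
-- on write: shard = id % 12, computed in the application
INSERT INTO susbcriber_data_by_value (key, shard, value, id)
VALUES ('country', 6, 'my', 42);   -- 42 % 12 = 6
-- on read: fan out one query per shard (asynchronously) and merge the results
SELECT id FROM susbcriber_data_by_value
WHERE key = 'country' AND shard = 0 AND value = 'my';
-- ...repeat for shard = 1 through 11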
An option that I would not recommend, but which is available, is to index the map, i.e.:
CREATE INDEX raw_data_idx ON susbcriber ( ENTRIES (raw_data) );
SELECT * FROM susbcriber WHERE raw_data['ios_identifier'] = 'id';
Keeping in mind the issues with secondary indexes.

Cassandra does not support DELETE on indexed columns

Say I have a Cassandra table xyz with the following schema:
create table xyz(
xyzid uuid,
name text,
fileid int,
sid int,
PRIMARY KEY(xyzid));
I create indexes on the columns fileid and sid:
CREATE INDEX file_index ON xyz (fileid);
CREATE INDEX sid_index ON xyz (sid);
I insert data :
INSERT INTO xyz (xyzid, name, fileid, sid) VALUES (now(), 'p120', 1, 100);
INSERT INTO xyz (xyzid, name, fileid, sid) VALUES (now(), 'p120', 1, 101);
INSERT INTO xyz (xyzid, name, fileid, sid) VALUES (now(), 'p122', 2, 101);
I want to delete data using the indexed columns:
DELETE from xyz WHERE fileid=1 and sid=101;
Why do I get this error?
InvalidRequest: code=2200 [Invalid query] message="Non PRIMARY KEY fileid found in where clause"
Is it mandatory to specify the primary key in the WHERE clause for delete queries?
Does Cassandra support deletes using secondary indexes?
What has to be done to delete data using secondary indexes?
Any suggestions would help.
I am using DataStax Community Cassandra 2.1.8, but I also want to know whether deleting using indexed columns is supported by DataStax Community Cassandra 3.2.1.
Thanks
Let me try and answer your questions in order:
1) Yes, if you are going to use a WHERE clause in a CQL statement, then the partition key must be restricted by an equality operator in the WHERE clause. Other than that, you are only allowed to filter on the clustering columns specified in your primary key (unless you have a secondary index).
2) No, it does not. See this post for some more information, as it is essentially the same problem:
Why can cassandra "select" on secondary key, but not update using secondary key? (1.2.8+)
3) Why not add sid as a clustering column in your primary key? This would allow you to do the delete or the query using both, as you have shown:
create table xyz(
xyzid uuid,
name text,
fileid int,
sid int,
PRIMARY KEY(xyzid, sid));
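With sid as a clustering column, the delete is then keyed by the partition key plus the clustering value, something like this (the uuid is a placeholder):
DELETE FROM xyz WHERE xyzid = 53167d6a-e125-46ff-bacf-f5b267de0258 AND sid = 101;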
4) In general, using secondary indexes is considered an anti-pattern (a bit less so with SASI indexes in C* 3.4), so my question is: can you add these fields as clustering columns to your primary key? How are you querying these secondary indexes?
I suppose you can perform the delete in two steps:
1. Select data by the secondary index and get the primary key column values (xyzid) from the query result.
2. Perform the delete by primary key values.
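A sketch of those two steps in CQL (ALLOW FILTERING is needed when combining two indexed columns; the uuid stands for whatever step 1 returns):
SELECT xyzid FROM xyz WHERE fileid = 1 AND sid = 101 ALLOW FILTERING;
DELETE FROM xyz WHERE xyzid = 7f6f6a70-0000-1000-8000-000000000000;  -- placeholder: one delete per returned xyzid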

How to perform query with cassandra's timestamp column as WHERE condition

I have the following Cassandra table
cqlsh:mydb> describe table events;
CREATE TABLE mydb.events (
id uuid PRIMARY KEY,
country text,
insert_timestamp timestamp
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX country_index ON mydb.events (country);
CREATE INDEX insert_timestamp_index ON mydb.events (insert_timestamp);
As you can see, an index has already been created on the insert_timestamp column.
I had gone through https://stackoverflow.com/a/18698386/3238864
I thought the following was the correct query:
cqlsh:mydb> select * from events where insert_timestamp >= '2016-03-01 08:27:22+0000';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: 'insert_timestamp >= <value>'"
cqlsh:mydb> select * from events where insert_timestamp >= '2016-03-01 08:27:22+0000' ALLOW FILTERING;
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: 'insert_timestamp >= <value>'"
But a query with the country column as the WHERE condition does work:
cqlsh:mydb> select * from events where country = 'my';
id | country | insert_timestamp
--------------------------------------+---------+--------------------------
53167d6a-e125-46ff-bacf-f5b267de0258 | my | 2016-03-01 08:27:22+0000
Any idea why query with timestamp as condition doesn't work? Is there anything wrong with my query syntax?
Cassandra's native secondary index is limited to the = predicate. To enable inequality predicates you need to add ALLOW FILTERING, but it will perform a full cluster scan :-(
If you can afford to wait for a couple of weeks, Cassandra 3.4 will be released with the new SASI secondary index which is much more efficient for range queries: https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
Indexes in Cassandra are quite different from indexes in a relational DB. One of the differences is that range queries on a Cassandra index are not allowed at all. Usually, range queries only work on clustering keys (they can also work on partition keys if ByteOrderedPartitioner is used, but that is not common), meaning you have to design your column families carefully for your potential query patterns. There are already many discussions on StackOverflow about this topic.
To understand when to use Cassandra's index (it is designed for quite specific cases) and its limitations, this is a good post:
Direct queries on secondary indices support only =, CONTAINS or CONTAINS KEY restrictions.
Secondary index queries allow you to restrict the returned results using the =, >, >=, <= and <, CONTAINS and CONTAINS KEY restrictions on non-indexed columns using filtering.
So your query will work once you add ALLOW FILTERING to it.
select * from events where insert_timestamp >= '2016-03-01 08:27:22+0000' ALLOW FILTERING;
The link that you have mentioned in your question has the timestamp column as a clustering key; hence it is working there.
As per the comments, range queries on a secondary index are not allowed up to version 2.2.x.
FYI: when Cassandra must perform a secondary index query, it will contact all the nodes to check the part of the secondary index located on each node.
Hence it is considered an anti-pattern in Cassandra to have an index on a high-cardinality column like a timestamp.
You should consider changing your data model to suit your queries.
Using the cequel ORM:
now = DateTime.now
today = DateTime.new(now.year, now.month, now.day, 0, 0, 0, now.zone)
tomorrow = today + 1  # DateTime#+ advances by days, not seconds
MyObject.allow_filtering!.where("done_date" => today..tomorrow).select("*")
Has worked for me.
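For reference, the raw CQL that cequel call roughly corresponds to (my_objects and done_date come from the ORM mapping, so treat them as assumptions):
SELECT * FROM my_objects
WHERE done_date >= '2016-03-01 00:00:00+0000'
AND done_date < '2016-03-02 00:00:00+0000'
ALLOW FILTERING;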

Cassandra range slicing on composite key

I have a column family with a composite key, like this:
CREATE TABLE sometable(
keya varchar,
keyb varchar,
keyc varchar,
keyd varchar,
value int,
date timestamp,
PRIMARY KEY (keya,keyb,keyc,keyd,date)
);
What I need to do is:
SELECT * FROM sometable
WHERE
keya = 'abc' AND
keyb = 'def' AND
date < '2014-01-01'
And that is giving me this error:
Bad Request: PRIMARY KEY part date cannot be restricted (preceding part keyd is either not restricted or by a non-EQ relation)
What's the best way to solve this? Do I need to alter my column family?
I also need to query this table with all of keya, keyb, keyc, and date.
You cannot do it in Cassandra. Moreover, such range slicing is costly too. You are trying to slice through a set of equalities that have lower priority according to your schema.
I also need to query this table with all of keya, keyb, keyc, and date.
If you want to solve this problem, consider having this schema. What I would suggest is to keep the keys in a separate table:
CREATE TABLE keys_by_id (   -- table name illustrative; the original leaves it unnamed
id timeuuid,
keyType text,
PRIMARY KEY (id, keyType)
);
Use the timeuuid to store the values and do a range scan based on that.
CREATE TABLE values_by_key (   -- table name illustrative; the original leaves it unnamed
prevTableId timeuuid,
value int,
date timestamp,
PRIMARY KEY (prevTableId, date)
);
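With that layout, the date-bounded scan from the question becomes a plain clustering-key range query (the timeuuid literal is a placeholder):
SELECT * FROM values_by_key
WHERE prevTableId = 50554d6e-29bb-11e5-b345-feff819cdc9f
AND date < '2014-01-01';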
I guess, in this way, your table is normalized for better scalability in your use case, and it may save a lot of disk space if the keys are repetitive too.

Get a random row in Cassandra using Datastax and CQL

I am a NoSQL n00b, just trying things out. I have the following keyspace with a single table in Cassandra 2.0.2:
CREATE KEYSPACE PersonDB WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': '1'
};
USE PersonDB;
CREATE TABLE Persons (
id int,
lastname text,
firstname text,
PRIMARY KEY (id)
);
I have close to 500 entries in the Persons table. I want to select a random row from the table. Is there an efficient way to do it in CQL? I am using Groovy to invoke the APIs exposed by DataStax.
If you want to get "any" row, you can just use LIMIT:
select * from persons LIMIT 1;
You would get the row with the lowest hash of the partition key (id).
It will not be random; it will depend on your partitioner, but you would get a row.
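If you want something closer to random, one common trick (a sketch, not part of the original answer) is to pick a random 64-bit token client-side and start reading from it, wrapping around if nothing comes back:
SELECT * FROM persons WHERE token(id) >= -3485630610000000000 LIMIT 1;  -- random token chosen by the client
SELECT * FROM persons LIMIT 1;  -- fallback if the random token landed past the last row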
