Is it possible to limit text size in cassandra like you would do in sql when creating a table?
username character varying(20)
My CQL query:
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    username text,
    date_created bigint,
    profile_pic text,
    num_followers int,
    name text
);
Is it possible to limit text size in cassandra like you would do in sql when creating a table?
No, Cassandra does not allow you to limit the size of a VARCHAR/TEXT when creating a table.
user_id uuid PRIMARY KEY,
Out of curiosity, why is user_id (a UUID) the sole PRIMARY KEY? Do you need to support a lot of queries by user_id?
If not, then you should consider switching it to partition on something that provides a little more query flexibility, and maybe use user_id as a clustering key (to ensure uniqueness).
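For example, just a sketch of that idea, assuming username is something you frequently look up by (the table name users_by_username is illustrative; the columns are borrowed from your schema):

CREATE TABLE users_by_username (
    username text,
    user_id uuid,
    date_created bigint,
    profile_pic text,
    num_followers int,
    name text,
    -- username is the partition key; user_id as a clustering column keeps rows unique
    PRIMARY KEY (username, user_id)
);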
I have the following table, called inbox_items:
USE zwoop_chat;

CREATE TABLE IF NOT EXISTS inbox_items (
    postId text,
    userId text,
    partnerId text,
    fromUserId text,
    fromNickName text,
    fromAvatar text,
    toUserId text,
    toNickName text,
    toAvatar text,
    unread int static,
    lastMessage text,
    lastMessageDate timestamp,
    PRIMARY KEY ((postId, userId), lastMessageDate)
) WITH CLUSTERING ORDER BY (lastMessageDate DESC);
The problem with this table is that I want to query it, both by postId and userId, as well as by userId only.
In other words, I have an inbox per post, but I have an inbox per user as well.
Afaik there is no good way to achieve this because:
The partition key(s) uniquely determine the node where the data is stored, i.e. every partition key column must be present in the WHERE clause.
A secondary index is not a good fit for columns with high cardinality (and in this case postId has high cardinality).
The solution I currently see is to duplicate the table with different keys.
This feels like such an overkill though.
Is there a better solution I'm missing?
Assuming that partitioning by userId alone would not generate partitions that are too large, you can partition by userId and put postId in the clustering key. You specified that you would query by:
The problem with this table is that I want to query it, both by postId and userId, as well as by userId only.
So in this instance you do not need postId in the partition key, only in the clustering key. The only issue would be if you also intend to query by postId alone, but that was not mentioned.
If partitioning by userId would result in partitions that are too large, there are additional bucketing techniques available.
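Putting that together, a minimal sketch of the restructured table (the name inbox_items_by_user is mine; note that unread can no longer be static, because the partition is now a whole user rather than one post of a user):

CREATE TABLE IF NOT EXISTS inbox_items_by_user (
    userId text,
    postId text,
    partnerId text,
    fromUserId text,
    fromNickName text,
    fromAvatar text,
    toUserId text,
    toNickName text,
    toAvatar text,
    unread int,
    lastMessage text,
    lastMessageDate timestamp,
    PRIMARY KEY (userId, postId, lastMessageDate)
) WITH CLUSTERING ORDER BY (postId ASC, lastMessageDate DESC);

-- inbox per user: the whole partition
SELECT * FROM inbox_items_by_user WHERE userId = 'u1';

-- inbox per post for that user: restrict the first clustering column as well
SELECT * FROM inbox_items_by_user WHERE userId = 'u1' AND postId = 'p1';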
I have some data in Cassandra. Say
create table MyTable (
    id text PRIMARY KEY,
    data text,
    updated_on timestamp
);
My application, in addition to querying this data by the primary key id, needs to query it by the updated_on timestamp as well. To fulfil the query-by-time use case I have tried the following.
create table MyTable (
    id text PRIMARY KEY,
    data text,
    updated_on timestamp,
    updated_on_minute timestamp
);
A secondary index on the updated_on_minute field. As I understand it, secondary indexes are not recommended for high-cardinality cases (which is my case, because I could have a lot of data at the same minute mark). Moreover, my data gets updated frequently, which means updated_on_minute will keep changing.
A materialized view with updated_on_minute as the partition key and id as the clustering key. I am on version 3.9 of Cassandra and had just begun using these, but alas I found the release notes for 3.11.x (https://github.com/apache/cassandra/blob/cassandra-3.11/NEWS.txt), which declare them purely experimental and not meant for production clusters.
So then what are my options? Do I just need to maintain my own tables to track data that comes in timewise? Would love some input on this.
Thanks in advance.
As has always been the case, create an additional table to query by a different partition key.
In your case the table would be
create table MyTable_by_timestamp (
    id text,
    data text,
    updated_on timestamp,
    PRIMARY KEY (updated_on, id)
);
Write to both tables, mytable_by_timestamp and mytable_by_id. Use the corresponding table to read from based on the partition key, either updated_on or id.
It's absolutely fine to duplicate data based on the use case (query) it's trying to solve.
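A minimal sketch of such a dual write, using a logged batch so both tables see the row (mytable_by_id here stands for the original table keyed by id; the values are illustrative):

BEGIN BATCH
    INSERT INTO mytable_by_id (id, data, updated_on)
    VALUES ('row-1', 'payload', '2023-05-01 10:42:13');
    INSERT INTO mytable_by_timestamp (id, data, updated_on)
    VALUES ('row-1', 'payload', '2023-05-01 10:42:13');
APPLY BATCH;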
Edited:
If there is a concern about huge partitions, you can always bucket them into smaller partitions. For example, the table above could be broken down into
create table MyTable_by_timestamp (
    id text,
    data text,
    updated_on timestamp,
    updated_min timestamp,
    PRIMARY KEY (updated_min, id)
);
Here I have chosen every minute as the bucket size. Depending on how many updates you receive, you can change it to seconds (updated_sec) to reduce the partition size further.
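For instance (a sketch with illustrative values), the application truncates updated_on to the minute before writing, and reads then hit a single minute-sized partition:

-- write: updated_min is updated_on truncated to the minute (done in application code)
INSERT INTO MyTable_by_timestamp (updated_min, id, data, updated_on)
VALUES ('2023-05-01 10:42:00', 'row-1', 'payload', '2023-05-01 10:42:13');

-- read everything updated during that minute
SELECT id, data, updated_on
FROM MyTable_by_timestamp
WHERE updated_min = '2023-05-01 10:42:00';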
Here is a simple example of a user table in Cassandra. What is the best strategy for creating a primary key?
My requirements are
search by uuid
search by username
search by email
All of the keys mentioned are high-cardinality keys. Also, at any given moment I will have only one of them to search by.
PRIMARY KEY(uid,username,email)
What if I have only the username? Then the above primary key is not useful. I am not able to visualize a solution that achieves this with a compound primary key.
What are the other options? Should we go with a new table mapping username to uid, and then search the user table?
All the articles out there on the internet recommend not creating a secondary index for high-cardinality keys.
CREATE TABLE medicscity.user (
    uid uuid,
    fname text,
    lname text,
    user_id text,
    email_id text,
    password text,
    city text,
    state_id int,
    country_id int,
    dob timestamp,
    zipcode text,
    PRIMARY KEY (??)
);
How do we solve this kind of situation ?
Yes, you need to go with duplicate tables.
If you ever face a situation in Cassandra where you will have to query a table by column1, column2, or column3 independently, you will have to duplicate the table.
Now, how much duplication you use is an individual choice.
In this example, you can either duplicate the table with the full data,
or you can simply keep column1 (partition), column2 and column3 as the primary key of the main table and:
Create a new table with a primary key of column1, column2, column3 and the partition key on column2.
Create another one with the same primary key and the partition key on column3.
That way the duplicated data is only a narrow key row, but you end up querying twice: once from the duplicate table, and once from the full-fledged table.
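A sketch of that second, "keys only" variant for the table in this question (assuming the main medicscity.user table ends up partitioned by uid; the lookup table names are mine):

-- lookup tables hold only the alternate key plus the main table's partition key
CREATE TABLE medicscity.user_uid_by_email (
    email_id text PRIMARY KEY,
    uid uuid
);

CREATE TABLE medicscity.user_uid_by_username (
    user_id text PRIMARY KEY,
    uid uuid
);

-- two-step read: resolve uid from the lookup table, then fetch the full row
SELECT uid FROM medicscity.user_uid_by_email WHERE email_id = 'someone@example.com';
SELECT * FROM medicscity.user WHERE uid = 123e4567-e89b-12d3-a456-426614174000;  -- uid returned by the lookup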
Big data technology is there to speed up computation and let your system scale horizontally, and it comes at the expense of disk/storage. Just look at anything in Cassandra: even the replication factor, at its very core, duplicates data.
Your PRIMARY KEY (uuid, username, email) doesn't fit your requirement, because you can't filter on a clustering column without also filtering on the partition key, nor on the second clustering column without the first one.
E.g. you cannot search for username without uuid in the WHERE clause, and you cannot search for email without both uuid and username.
All you need is denormalization and duplicated data.
Denormalization and duplication of data is a fact of life with Cassandra. Don’t be afraid of it. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPs, or network), and Cassandra is architected around that fact. In order to get the most efficient reads, you often need to duplicate data.
In your case, you need to create 3 tables that contain the same columns (the data you want to get), but these 3 tables will have different PRIMARY KEYs: one with uuid as the PK, one with username as the PK, and one with email as the PK. :)
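A sketch of that approach for the table in this question (table names are illustrative and the column list is trimmed for brevity; every table carries the full row, only the partition key differs):

CREATE TABLE medicscity.user_by_uid (
    uid uuid PRIMARY KEY,
    user_id text, email_id text, fname text, lname text
);

CREATE TABLE medicscity.user_by_username (
    user_id text PRIMARY KEY,
    uid uuid, email_id text, fname text, lname text
);

CREATE TABLE medicscity.user_by_email (
    email_id text PRIMARY KEY,
    uid uuid, user_id text, fname text, lname text
);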
What criteria should be considered when selecting a rowid for a column family in Cassandra? I want to migrate a relational database which does not contain any primary key. In that case, what would be the best rowid selection?
Use natural keys that can be derived from the dataset if possible (e.g. phone_number for a phone book, user_name for a user table). If that's not possible, use a UUID.
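For example (a sketch; the table and column names are only illustrative):

-- natural key derived from the data
CREATE TABLE phone_book (
    phone_number text PRIMARY KEY,
    name text
);

-- surrogate key when no natural key exists
CREATE TABLE imported_rows (
    row_id uuid PRIMARY KEY,
    payload text
);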
There are many things to consider when choosing the primary key in Cassandra.
Understand the difference between the primary key and the partition key.
CREATE TABLE users (
    user_name varchar PRIMARY KEY,
    password varchar
);
In the above case primary and partition keys are the same.
CREATE TABLE users (
    user_name varchar,
    user_email varchar,
    password varchar,
    PRIMARY KEY (user_name, user_email)
);
Here the primary key is user_name and user_email together, whereas user_name alone is the partition key.
CREATE TABLE users (
    user_name varchar,
    user_email varchar,
    password varchar,
    PRIMARY KEY ((user_name, user_email))
);
Here the primary key and the partition key are both (user_name, user_email), i.e. a composite partition key.
Carefully define your partition key. Cassandra uses the partition key for lookups, so you must define it by looking at your SELECT queries. Using the previous example, Cassandra organizes the data like this.
For the first case:
user_name ---> email:password email:date_of_birth
ABC --> abc@gmail.com:abc123 abc@gmail.com:22/02/1950 abc@yahoo.com:def123 ...
In the second case:
(user_name, email) ---> password date_of_birth
(ABC, abc@gmail.com) --> abc123 22/02/1950
Making the partition key more complex, so that it contains more data, ensures that you get many small partitions instead of a single partition with many columns. It can be beneficial to balance the number of partitions you create against the number of columns each one holds; an enormous number of tiny partitions is not necessarily good for reads either.
The partition key determines how data is distributed across nodes, so consider whether you have hotspots and decide whether you want to break the key down further.
Case 1:
All users named ABC will be in a single node
Case 2:
Users named ABC may or may not be on a single node, depending on the token generated from the combination of their name and email.
Your partition key(s) should be how you want to store the data and how you will always look it up. You can only retrieve data by partition key, so it's important to choose something that you will naturally look up (this is why sometimes data is denormalized in Cassandra by storing it in multiple tables that mimic materialized views).
The clustering column key(s), if any, are mostly useful if you sometimes want to retrieve all the data in a partition and sometimes only want some of it. This is great for things like timeseries data because you can cluster the data on a timeuuid, store it sorted, and then do efficient range queries over the data.
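For example (a sketch, with a hypothetical sensor_readings table): clustering on a timeuuid keeps the rows within a partition sorted, so range queries over a time window are cheap:

CREATE TABLE sensor_readings (
    sensor_id text,
    reading_time timeuuid,
    value double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- read only the readings in a time window, within one partition
SELECT value FROM sensor_readings
WHERE sensor_id = 'sensor-42'
  AND reading_time > maxTimeuuid('2023-05-01 00:00:00')
  AND reading_time < minTimeuuid('2023-05-02 00:00:00');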
Does this simple schema make sense in the Cassandra context? Or can I just use a unique constraint index instead of manually indexing username and email through partition keys?

My understanding is that to guarantee normal index efficiency in Cassandra, the query must include the partition key. So if I want to execute a "get by" on a table with millions of rows while supplying only the indexed column and not the partition key, it may not be as fast as it should be, and manual indexing by creating new partition keys becomes the better choice. Is this notion correct?

The only problem with manual indexing is that you have to do it manually: if you delete a row from "users", you need to fetch the values of the indexed columns before deleting, so that you can delete the index entries along with it, and you may also need to batch it. Did I misunderstand Cassandra?
CREATE TABLE users (
id uuid PRIMARY KEY,
username text,
email text,
password_hash text,
password_salt text,
display_name text,
timezone int,
created_at timestamp,
last_login_at timestamp
);
CREATE TABLE usernames (
username text PRIMARY KEY,
user_id uuid
);
CREATE TABLE user_emails (
email text PRIMARY KEY,
user_id uuid
);
Manual indexing adds overhead: you need to maintain the index tables along with the data while doing CRUD operations.
So it is recommended to use Cassandra's secondary indexing support.
If you want to query on the username and email columns, then you should create secondary indexes on those columns. Secondary indexes are Cassandra's built-in mechanism for indexing non-key columns.
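For example (a sketch against the users table above; the index names are arbitrary):

CREATE INDEX IF NOT EXISTS users_username_idx ON users (username);
CREATE INDEX IF NOT EXISTS users_email_idx ON users (email);

-- the indexed columns can now appear in WHERE without the partition key
SELECT id, display_name FROM users WHERE username = 'some_user';
SELECT id, display_name FROM users WHERE email = 'user@example.com';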