I'm confused as to how primary keys in Cassandra allow for quick data access. Say for example I create a table of Students with the following schema columns:
I choose the primary key to be Student Id. My understanding is that all the students will be placed around the cluster based on some hash of this value. Say I also choose Country as a clustering column. So within each partition of the students (who have been split based on their Id) they will be ordered by Country (presumably alphabetically).
So if I then want to retrieve all students for a specific country will I have to visit multiple nodes in the cluster? While the students have been ordered by Country within each node there is nothing to say that all the students for a specific country have been stored on the same node? Is this type of query even supported?
If I had only added 5 students to a 5 nodes cluster would it be possible that all the students would be stored on separate nodes if the Student Id was a UUID?
So if I then want to retrieve all students for a specific country will I have to visit multiple nodes in the cluster?
Yes.
While the students have been ordered by Country within each node there is nothing to say that all the students for a specific country have been stored on the same node?
Correct.
Is this type of query even supported?
It is, but it's considered an anti-pattern in Cassandra. The coordinator (the node that receives the request from the client) has to query ALL the other nodes, because every row in that column family has to be scanned.
If I had only added 5 students to a 5 nodes cluster would it be possible that all the students would be stored on separate nodes if the Student Id was a UUID?
Yes.
The way to solve your problem is to have a column family for each query (one for selecting by Student ID and the other for selecting by Country, each one having a different primary key) while duplicating the rows (when you create a student you have to insert it into both column families).
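To make the difference concrete, here is a small Python simulation (not Cassandra code) of a 5-node cluster. It is only a sketch: the MD5-based node_for function is a stand-in for Cassandra's partitioner, replication is ignored, and the table names and sample students are made up for illustration.

```python
import hashlib
import uuid

NUM_NODES = 5  # assumed cluster size, as in the question


def node_for(partition_key: str) -> int:
    """Map a partition key to a node index (stand-in for the real partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


def new_cluster():
    """One dict per node: partition key -> list of rows."""
    return [dict() for _ in range(NUM_NODES)]


def insert(cluster, partition_key, row):
    cluster[node_for(partition_key)].setdefault(partition_key, []).append(row)


# Hypothetical sample data.
students = [
    {"id": str(uuid.uuid4()), "name": name, "country": country}
    for name, country in [("Ann", "FR"), ("Bob", "US"), ("Cid", "FR"),
                          ("Dee", "DE"), ("Eve", "US")]
]

students_by_id = new_cluster()       # partition key: student id
students_by_country = new_cluster()  # partition key: country

for s in students:  # every write goes to both tables
    insert(students_by_id, s["id"], s)
    insert(students_by_country, s["country"], s)


def by_country_scan(cluster, country):
    """Query the id-partitioned table by country: every node is scanned."""
    nodes_visited = 0
    found = []
    for node in cluster:
        nodes_visited += 1
        for rows in node.values():
            found.extend(r for r in rows if r["country"] == country)
    return found, nodes_visited


def by_country_lookup(cluster, country):
    """Query the country-partitioned table: exactly one node is contacted."""
    node = cluster[node_for(country)]
    return node.get(country, []), 1


scan_rows, scan_nodes = by_country_scan(students_by_id, "FR")
lookup_rows, lookup_nodes = by_country_lookup(students_by_country, "FR")
print(sorted(r["name"] for r in scan_rows), scan_nodes)
print(sorted(r["name"] for r in lookup_rows), lookup_nodes)
```

Both queries return the same students, but the scan touches every node while the lookup against the country-partitioned table touches only the node that owns that country's partition.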
Related
We have a table designed to retrieve products by name in ascending order. OrganisationID and ProductType will be the compound partition key, whereas the ProductName will be the clustering key. So, the primary key structure is ((organisation_id, product_type), product_name) with clustering order by(product_name asc). All have text as a datatype.
We have 20-30 other product attributes stored in separate columns, of which about 5 are significant: for instance, description, colour, city, size and date_of_manufacturing. All of these are of text datatype except date_of_manufacturing, which is a timestamp.
Let's say a user wants to filter products based on all 5 of these attributes. Can this be done using Cassandra? We know that this can be achieved using Elasticsearch on top of Cassandra, but our constraint is to use Cassandra alone. Storing the same data across many tables is allowed.
Note: At any instant, only 20 products can be listed on the page, which means that after applying all filters, we must display only 20 products.
Consider the problem of storing users and their contacts. There are about 100 million users, each has a few hundred contacts, and on average contacts are 1kb in size. There may be some users with too many contacts (>5000) and there may be some contacts that are much (say 10x) bigger than the average of 1kb. Users actively add contacts and less often also delete them. Contacts are not pointers to other users but just a bundle of information.
There are two kinds of queries -
Given a user and a contact name, lookup the contact details
Given a user, look up all associated contact names
I was thinking of a contacts table like this -
CREATE TABLE contacts (
user_name text,
contact_name text,
contact_details map<text, text>,
PRIMARY KEY ( (user_name, contact_name) )
// ^ Notice the composite partition key (inner parentheses)
);
The choice of composite primary key is due to the number and size of contacts per user. I wanted one contact per row.
This table easily addresses the query of looking up a contact's details given a user and a contact name.
I'm looking for suggestions to address the second query.
Two options (with related concerns) on my mind -
Create a second table called contact_names_by_user, with user_name as the partition key and contact_name as a clustering key. Concern: If there is a user with way too many contacts (say 20k), would that result in a non-optimally wide row?
Create an index on user_name. Concern: Given the ratio of the total number of users (100M) to average contacts per user (say 200), would that value be considered high-cardinality, and hence bad for indexing?
In general, are there guidelines for looking up many items (like contacts here) referred to by one item (like user here) without running into wide rows or non-optimal indexes?
Creating an index should not be a problem in itself, IMHO. An average cardinality of 200 sounds good.
Another option is to maintain your own index, like:
CREATE TABLE contacts_by_user (
user_name text PRIMARY KEY,
contacts set<text>
)
though your index and contacts can go out of sync.
I'm so confused.
When to use them and how to determine which one to use?
If a column is index/primary key/row key, could it be duplicated?
I want to create a column family to store some many-to-many info. For example, one column is the given name and the other is the surname. One given name can relate to many surnames, and one surname can have different given names.
I need to query surnames by a given name, and the given names by a specified surname too.
How to create the table?
Thanks!
Cassandra is a NoSQL database, and as such has no concept of many-to-many relationships. Ideally a table should not have anything other than a primary key. In your case the right way to model it in Cassandra is to create two tables, one with the given name as the primary key and the other with the surname as the primary key.
When you need to query by either key, you query the table that has that key as its primary key.
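The two-table idea can be sketched in plain Python with two dictionaries standing in for the two tables. This only illustrates the write-twice, read-once pattern; the table and function names are hypothetical.

```python
from collections import defaultdict

# Two in-memory "tables", each keyed by the column it is queried on.
surnames_by_given_name = defaultdict(set)
given_names_by_surname = defaultdict(set)


def add_person(given_name: str, surname: str) -> None:
    """Write the pair to both tables so either lookup is a single-key read."""
    surnames_by_given_name[given_name].add(surname)
    given_names_by_surname[surname].add(given_name)


for given, sur in [("John", "Smith"), ("John", "Doe"), ("Jane", "Doe")]:
    add_person(given, sur)

print(sorted(surnames_by_given_name["John"]))
print(sorted(given_names_by_surname["Doe"]))
```

The cost is that the application has to keep both tables in step on every write.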
EDIT:
From the Cassandra docs:
Cassandra's built-in indexes are best on a table having many rows that
contain the indexed value. The more unique values that exist in a
particular column, the more overhead you will have, on average, to
query and maintain the index. For example, suppose you had a races
table with a billion entries for cyclists in hundreds of races and
wanted to look up rank by the cyclist. Many cyclists' ranks will share
the same column value for race year. The race_year column is a good
candidate for an index.
Do not use an index in these situations:
On high-cardinality columns for a query of a huge volume of records for a small number of results.
In tables that use a counter column.
On a frequently updated or deleted column.
To look for a row in a large partition unless narrowly queried.
I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add 'birthday' to the primary key (let's say 'ID', 'birthday') I cannot perform this query because part of the primary key is missing.
How can I approach this by designing my column family differently?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This is not an accident: a feature like update users set status=1 where id>10 would let a user update all the data in a table at once, which can be extremely expensive on large databases. Cassandra explicitly forbids all operations requiring data scans across multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing the users' primary keys, you can run queries like update users set status='foo' where id in (1,2,3,4). But queries with really large sets of keys inside an IN clause may cause performance issues on C*.
But how can you have an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and neither of them is really satisfactory:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with a compound primary key (partition key plus clustering column), and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've split the table into multiple partitions (one per year) to get a reasonably even distribution across the cluster. The general idea is that you shouldn't have a small number of very large partitions. Prefer a large number of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
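For the first option (the year-bucketed index table), a range over dob has to be split into one query per year partition. Below is a minimal Python sketch of that splitting. The helper name dob_index_queries is hypothetical, and the %s placeholders are only illustrative; the function builds query text and parameters and does not talk to a cluster.

```python
from datetime import datetime


def dob_index_queries(start: datetime, end: datetime):
    """Return one (cql, params) pair per year partition covering [start, end)."""
    if start >= end:
        return []
    cql = "SELECT ids FROM dob_index WHERE year = %s AND dob >= %s AND dob < %s"
    queries = []
    for year in range(start.year, end.year + 1):
        # Clamp the requested range to the boundaries of this year's partition.
        lower = max(start, datetime(year, 1, 1))
        upper = min(end, datetime(year + 1, 1, 1))
        if lower >= upper:
            continue  # the range ends exactly on Jan 1 of this year
        queries.append((cql, (year, lower, upper)))
    return queries


for _, params in dob_index_queries(datetime(2013, 6, 1), datetime(2015, 3, 1)):
    print(params)
```

Each returned query hits exactly one partition, so the per-year queries can be issued independently and their results merged by the client.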
But I suggest you revisit your application logic so that you avoid updating/deleting data at all:
instead of updating the users table directly, you can have a separate table, user_statuses, into which you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
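The append-only idea behind user_statuses can be sketched in Python as follows: a status change is a new row, and the current status is whichever row has the greatest updated_at. The in-memory dictionary and the function names are hypothetical stand-ins for the table and the queries against it.

```python
from collections import defaultdict
from datetime import datetime

# user id -> list of (updated_at, status); stands in for one partition's rows
user_statuses = defaultdict(list)


def record_status(user_id, updated_at, status):
    """Append a new status row; existing rows are never modified."""
    user_statuses[user_id].append((updated_at, status))


def current_status(user_id):
    """Latest status by updated_at, or None if the user has no rows."""
    rows = user_statuses.get(user_id)
    if not rows:
        return None
    return max(rows, key=lambda row: row[0])[1]


record_status("u1", datetime(2014, 5, 2), "active")
record_status("u1", datetime(2014, 5, 1), "new")
print(current_status("u1"))
```

In Cassandra itself the rows of a partition are already stored sorted by the clustering column, so the equivalent read is a single-partition query ordered by updated_at descending with a limit of 1.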
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.
I have a column family like
object
(
object_id,
company_id,
group_id,
family_id,
description,
..
);
I want to query it by object id, company id, group id, and any combination of these.
My question is:
should I make a composite primary key
(object id, company id, group id)
or create separate column families?
Only object id is unique in the CF; company id can repeat in multiple rows, but group id does not repeat in many rows.
You may well want to duplicate your data in multiple CFs depending on your query patterns. This is quite common practice.
If a common query is "Get all objects by company_id" then you might want to store all objects in a CF partitioned just by company_id as the row key. If you need to do individual object lookups as well, then you store that data duplicated in another CF, with each object partitioned by object_id. If groups are always a subset of a specific company, perhaps you want to row key by company, but then cluster by group.
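As a rough illustration of "row key by company, cluster by group", here is a Python sketch in which each company is one partition whose rows are kept sorted by (group_id, object_id). Using object_id as a second clustering component is an assumption made here to keep rows unique, and all names are hypothetical.

```python
from collections import defaultdict

# partition key (company_id) -> rows sorted by clustering key (group_id, object_id)
objects_by_company = defaultdict(list)


def insert_object(company_id, group_id, object_id, description):
    rows = objects_by_company[company_id]
    rows.append((group_id, object_id, description))
    rows.sort()  # keep the partition ordered by its clustering key


def objects_for_company(company_id):
    """Read the whole partition for one company."""
    return list(objects_by_company.get(company_id, []))


def objects_for_group(company_id, group_id):
    """Read only the rows of one group within the company's partition."""
    return [r for r in objects_by_company.get(company_id, []) if r[0] == group_id]


insert_object("acme", "g2", "o3", "widget")
insert_object("acme", "g1", "o1", "gear")
insert_object("acme", "g1", "o2", "bolt")
print(objects_for_company("acme"))
print(objects_for_group("acme", "g1"))
```

Lookups by object_id alone are not served by this layout, which is why the answer suggests a second CF partitioned by object_id.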
You should be designing your Cassandra schema based on the queries you need to run, rather than the data that needs to go in it.