I'm new to Cassandra and I'm trying to learn a bit more about how this DB engine works (especially the CQL part) and compare it with MySQL.
With this in mind I was trying some queries, but there is one particular query that I can't figure out.
From what I could read, it seems that it's not possible to do this query in Cassandra, but I would like to know for sure whether there is some workaround.
Imagine the following table [Customer] with PRIMARY_KEY = id:
id, name, city, country, email
01, Jhon, NY, USA, jhon#
02, Mary, DC, USA, mary#
03, Smith, L, UK, smith#
.....
I want to get a listing that shows me how many customers I have per country and ORDER BY DESC.
In MySQL it would be something like
SELECT COUNT(Id), country
FROM customer
GROUP BY country
ORDER BY COUNT(Id) DESC
But in Cassandra (CQL) it seems that I can't GROUP BY columns that aren't part of the PRIMARY KEY (as is the case with "country"). Is there any way around this?
You need to define a secondary index on "country". Secondary indexes are used to query a table by a column that is not normally queryable.
For ORDER BY, you define a clustering key on 'id'. Clustering keys are responsible for sorting data within a partition.
The main thing to remember when building a table in Cassandra, is to model its PRIMARY KEY based on how you plan to query it. In any case, defining id as the PRIMARY KEY isn't very helpful for what you're trying to do.
Also, keywords like GROUP BY and ORDER BY have special requirements. ORDER BY specifically is pretty useless (IMO), unless you plan to reverse the sort direction. But you cannot pick an arbitrary column to sort your data by.
To solve for your query above, I'll create a new table, keyed on the country, city, and id columns (in that order):
CREATE TABLE customer_by_city (
id TEXT,
name TEXT,
city TEXT,
country TEXT,
email TEXT,
PRIMARY KEY (country,city,id)
) WITH CLUSTERING ORDER BY (city ASC, id DESC);
Now, I'll INSERT the rows:
INSERT INTO customer_by_city (id,name,city,country,email)
VALUES ('01', 'Jhon', 'NY', 'USA', 'jhon#gmail.com');
INSERT INTO customer_by_city (id,name,city,country,email)
VALUES ('02', 'Mary', 'DC', 'USA', 'mary#gmail.com');
INSERT INTO customer_by_city (id,name,city,country,email)
VALUES ('03', 'Smith', 'London', 'UK', 'smith#gmail.com');
SELECT COUNT(Id), country FROM customer_by_city GROUP BY country ;
system.count(id) | country
------------------+---------
2 | USA
1 | UK
(2 rows)
Warnings :
Aggregation query used without partition key
Notes:
That last message means you're running a query without a WHERE clause keyed by the partition key. That means Cassandra is going to have to check every node in the cluster to serve this query. Highly inefficient.
While it works for this example, country as a partition key may not be the best way to distribute data. After all, if most of the customers are in one particular country, then they could potentially push the bounds of the maximum partition size.
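One piece of the original MySQL query is still missing: ORDER BY COUNT(Id) DESC. CQL only lets you ORDER BY clustering columns, so the descending sort on the count has to happen in the client. A minimal sketch in Python, assuming the (count, country) rows have already been fetched; the rows list is hard-coded here and simply mirrors the output above, and order_by_count_desc is a made-up helper name:

```python
# Hypothetical rows, shaped like the GROUP BY output above: (count, country).
rows = [(1, "UK"), (2, "USA")]

def order_by_count_desc(rows):
    """Sort (count, country) pairs by count, largest first; ties by country name."""
    return sorted(rows, key=lambda r: (-r[0], r[1]))

for count, country in order_by_count_desc(rows):
    print(country, count)
```

With one row per country the result set is small, so sorting it client-side is cheap compared to the cluster-wide aggregation itself.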
Related
I have the following table structure:
CREATE TABLE test_keyspace.persons (
id uuid,
country text,
city text,
address text,
phone_number text,
PRIMARY KEY (id, country, address)
);
My main scenario is to get a person by id. But sometimes I want to get all cities inside a country and all persons inside a city as well.
I know that Cassandra must have at least one partition key and zero or more clustering keys, but I don't understand how to organize it to work most effectively (and generally work).
Can anybody give me advice?
So it sounds like you want to be able to query by both id and country. Typically in Cassandra, the way to build your data models is a "one table == one query" approach. In that case, you would have two tables, just keyed differently:
CREATE TABLE test_keyspace.persons_by_id (
id uuid,
country text,
city text,
address text,
phone_number text,
PRIMARY KEY (id));
TBH, you don't really need to cluster on country and address, unless a person can have multiple addresses. But a single PK is a completely legit approach.
For the second table:
CREATE TABLE test_keyspace.persons_by_country (
id uuid,
country text,
city text,
address text,
phone_number text,
PRIMARY KEY (country,city,id));
This will allow you to query by country, with persons grouped/sorted by city and sorted by id. In theory, you could also serve the query by id approach here, as long as you also had the country and city. But that might not be possible in your scenario.
Duplicating data in Cassandra (NoSQL) to help queries perform better is ok. The trick becomes keeping the tables in-sync, but you can use the BATCH functionality to apply writes to both tables atomically.
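As a rough illustration of the BATCH idea, here is a small Python helper that builds one logged batch writing the same person to both tables. It only assembles the CQL text and the parameter tuple and does not talk to a cluster. The function name build_person_batch and the %s placeholder style are assumptions for this sketch; adapt them to whatever your driver expects.

```python
COLUMNS = ("id", "country", "city", "address", "phone_number")

def build_person_batch(person):
    """Return (cql, params) for a single logged batch that inserts one person
    into persons_by_id and persons_by_country.

    The %s placeholders are an assumption about the driver's parameter style.
    """
    col_list = ", ".join(COLUMNS)
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    values = tuple(person[c] for c in COLUMNS)
    cql = (
        "BEGIN BATCH\n"
        f"  INSERT INTO test_keyspace.persons_by_id ({col_list}) VALUES ({placeholders});\n"
        f"  INSERT INTO test_keyspace.persons_by_country ({col_list}) VALUES ({placeholders});\n"
        "APPLY BATCH;"
    )
    # One copy of the values per INSERT statement.
    return cql, values + values
```

Keeping both inserts in one place like this also makes it harder to forget one of the tables when the schema changes.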
In case you haven't already, you might benefit from DataStax's (free) course on data modeling - Data Modeling with Apache Cassandra and DataStax Enterprise.
Creating the following employee column family in Cassandra
Case 1:
CREATE TABLE employee (
name text,
designation text,
gender text,
created_by text,
created_date timestamp,
modified_by text,
modified_date timestamp,
PRIMARY KEY (name)
);
From the UI, if I want to get all employees, it is not possible to
retrieve them. Is that true?
select * from employee; //not possible as it is partitioned by name
Case 2:
I was told to do this way to retrieve all employees.
We need to design this with a static key, to retrieve all the employees.
CREATE TABLE employee (
static_name text,
name text,
designation text,
gender text,
created_by text,
created_date timestamp,
modified_by text,
modified_date timestamp,
PRIMARY KEY (static_name,name)
);
static_name (i.e. "EMPLOYEE") will be the partition key and name will be the clustering key. The primary key is the combination of static_name and name.
static_name -> every time you add an employee, insert it with the static value, i.e. EMPLOYEE
Now you will be able to do a "select all employees" query:
//this will return you all the employees
select * from employee where static_name='EMPLOYEE';
Is this true? Can't we use case 1 to return all the employees?
Both approaches are o.k. with some catches
Approach 1:
When you say UI, I guess you mean running a simple select * ... It's correct that this won't really work out of the box if you want to get every single row out, especially if the data set is big. You could use the driver's pagination (I'm not 100% sure, since I haven't had to use it in a while), but when I needed to walk over all the partitions I used the token function, i.e.:
select token(name), name from employee limit 1;
system.token(name) | name
----------------------+------
-8839064797231613815 | a
Now you take the returned token and put it into the next query. This has to be done by your program. After it has fetched all the elements with a greater token, you would also need to cover everything with a token lower than -8839064797231613815.
select token(name), name from employee where token(name) > -8839064797231613815 limit 1;
system.token(name) | name
----------------------+------
-8198557465434950441 | c
and then I would wrap this into a loop until I would fetch all the elements out. (I think this is also how spark cassandra does it when retrieving wide rows out from a cluster).
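The loop described above can be sketched like this. It is a simulation only: TABLE is an in-memory stand-in for the employee table, fetch_page plays the role of the token(name) > ? ... LIMIT ? query, and the last two tokens are made up for illustration. It also assumes one row per partition, as in Case 1.

```python
import bisect

# Stand-in for the table: (token, name) pairs kept in token order, which is
# the order rows come back in when scanning by token.
TABLE = sorted([
    (-8839064797231613815, "a"),
    (-8198557465434950441, "c"),
    (1164912986104775310, "b"),   # hypothetical token
    (5293579765126103566, "d"),   # hypothetical token
])

def fetch_page(after_token, limit):
    """Simulates: SELECT token(name), name FROM employee
    [WHERE token(name) > after_token] LIMIT limit"""
    if after_token is None:
        start = 0  # first query: no WHERE clause, start at the lowest token
    else:
        tokens = [token for token, _ in TABLE]
        start = bisect.bisect_right(tokens, after_token)
    return TABLE[start:start + limit]

def scan_all(page_size=2):
    """Feed the last token of each page into the next query until a page is empty."""
    names, last_token = [], None
    while True:
        page = fetch_page(last_token, page_size)
        if not page:
            return names
        names.extend(name for _, name in page)
        last_token = page[-1][0]
```

Note that the names come out in token order (a, c, b, d here), not alphabetically, which is one more reason this pattern suits batch jobs better than a UI listing.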
The disadvantage of this model is that it has to go all over the cluster, so it is really only suited to analytical workloads. Since you mentioned UI: it would take too long for the user to get the result, so I would advise against approach 1 for UI-related work.
Approach 2.
The disadvantage of the second one is that it creates what is called a hot row: every update goes to a single partition, which is a bad model most of the time.
The advantage is that you could easily paginate over the one partition and get your data out by pagination functions built into the driver.
This would, however, behave just fine if you have a moderate load (tens or low hundreds of updates per second) and a relatively low number of users; for, say, 100 000 it would work just fine. If your numbers are greater, you have to somehow split the users into multiple partitions so that the load gets distributed more evenly.
One possibility is to append a letter of the alphabet to "EMPLOYEE", so you would have "EMPLOYEE_A", "EMPLOYEE_B", etc. This would work relatively well, but it is not ideal either: because of the lexicographical distribution, some partitions may end up with considerably more data than others.
Another approach is to create artificial buckets. Say by design there are 10 buckets; when you insert, you add a random bucket number to the static prefix ("EMPLOYEE_1" and so on), and when retrieving you go over each partition in turn until you exhaust the results.
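A small sketch of the bucket idea in Python. One deliberate difference from the text: instead of a random bucket, this picks the bucket from a CRC32 hash of the name, so the same employee always lands in the same partition. That choice, the helper names, and the dict standing in for the table are all assumptions for illustration.

```python
import zlib

BUCKETS = 10  # number of buckets fixed by design, as in the text

def partition_key_for(name, buckets=BUCKETS):
    """Pick a bucket deterministically from the name (an assumption; the text
    uses a random bucket, which spreads writes just as well)."""
    return f"EMPLOYEE_{zlib.crc32(name.encode('utf-8')) % buckets}"

def all_partition_keys(buckets=BUCKETS):
    """Every partition key a 'select all employees' has to visit."""
    return [f"EMPLOYEE_{i}" for i in range(buckets)]

def read_all(store, buckets=BUCKETS):
    """Visit each bucket in turn; store maps partition key -> list of names."""
    employees = []
    for key in all_partition_keys(buckets):
        # Within a partition, rows come back sorted by the clustering key (name).
        employees.extend(sorted(store.get(key, [])))
    return employees

# Usage: write three employees, then read them all back bucket by bucket.
store = {}
for name in ["alice", "bob", "carol"]:
    store.setdefault(partition_key_for(name), []).append(name)
```

The trade-off is that a full listing now costs one query per bucket, and the combined result is only sorted within each bucket, not overall.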
What is the best approach to update table with duplicate data?
I have a table
CREATE TABLE users (
id text PRIMARY KEY,
email text,
description text,
salary int
);
I will delete, update, insert etc. on this table. But I also have a requirement to be able to search by email and description. If I create a new table with a composite key on email and description,
then when I update my base table I do
INSERT INTO users (id, salary) VALUES ('1', 500);
I do not have the required data to also update my secondary table, since all the client has is id and salary. How is the second table updated?
Other workarounds and shortcomings
I could have created a materialized view, but since the base table has only one primary key I can only add one more column. my search requirement requires more than one column.
Create secondary indexes on the columns that will be searched on. But the performance for this would be bad since the columns I will be searching on would have high cardinality. i.e. description, email, etc
So, the "correct" way of doing this is to create 3 tables. salary_by_id, salary_by_email and salary_by_description.
CREATE TABLE salary_by_id (
id text PRIMARY KEY,
salary int
);
CREATE TABLE salary_by_email (
email text PRIMARY KEY,
salary int
);
CREATE TABLE salary_by_description (
description text,
id text,
salary int,
PRIMARY KEY (description, id)
);
The reason I added id to salary_by_description is that, as far as I can guess, description won't be globally unique, so it needs something else in its primary key.
Depending on the size of these tables, the last one might need something extra added to its partition key. And if needed you can add id, email and description to the other tables.
Now, when inserting or deleting values you need to do it in all 3 tables. If you use a driver that supports asynchronous calls, like the Java one, then this doesn't cost very much extra.
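For the case where the client only has id and salary, one option is to read the row by id first and use what comes back to address the other two tables. The sketch below simulates that with plain dicts standing in for the three tables. It assumes the optional extension mentioned above (email and description are also stored in salary_by_id); the function names and sample values are made up.

```python
# In-memory stand-ins for the three tables, keyed like their primary keys.
salary_by_id = {}           # id -> {"email": ..., "description": ..., "salary": ...}
salary_by_email = {}        # email -> salary
salary_by_description = {}  # (description, id) -> salary

def insert_user(user_id, email, description, salary):
    """Write the new user to all three tables."""
    salary_by_id[user_id] = {"email": email, "description": description, "salary": salary}
    salary_by_email[email] = salary
    salary_by_description[(description, user_id)] = salary

def update_salary(user_id, salary):
    """Client only knows id and salary: read by id, then update every table."""
    row = salary_by_id[user_id]          # the read-before-write step
    row["salary"] = salary
    salary_by_email[row["email"]] = salary
    salary_by_description[(row["description"], user_id)] = salary

insert_user("1", "a@example.com", "engineer", 400)
update_salary("1", 500)
```

Against a real cluster the read and the writes are separate requests, so the application has to decide what to do if one of them fails part-way.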
Why might one want to use a clustered index in a Cassandra table?
For example; in a table like this:
CREATE TABLE blah (
key text,
a text,
b timestamp,
c double,
PRIMARY KEY ((key), a, b, c)
)
The clustered part is the a, b, c part of the PRIMARY KEY.
What are the benefits? What considerations are there?
Clustering keys do three main things.
1) They affect the available query pattern of your table.
2) They determine the on-disk sort order of your table.
3) They determine the uniqueness of your primary key.
Let's say that I run an ordering system and want to store product data on my website. Additionally I have several distribution centers, as well as customer contracted pricing. So when a certain customer is on my site, they can only access products that are:
Available in a distribution center (DC) in their geographic area.
Defined in their contract (so they may not necessarily have access to all products in a DC).
To keep track of those products, I'll create a table that looks like this:
CREATE TABLE customerDCProducts (
customerid text,
dcid text,
productid text,
productname text,
productPrice int,
PRIMARY KEY (customerid, dcid, productid));
For this example, if I want to see product 123, in DC 1138, for customer B-26354, I can use this query:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354' AND dcid='1138' AND productid='123';
Maybe I want to see products available in DC 1138 for customer B-26354:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354' AND dcid='1138';
And maybe I just want to see all products in all DCs for customer B-26354:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354';
As you can see, the clustering keys of dcid and productid allow me to run high-performing queries on my partition key (customerid) that are as focused as I may need.
The drawback? If I want to query all products for a single DC, regardless of customer, I cannot. I'll need to build a different query table to support that. Even if I want to query just one product, I can't unless I also provide a customerid and dcid.
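The rule behind those three working queries and the ones that fail can be written down as a small check. This is a simplified sketch in Python: it only covers equality conditions, and it ignores secondary indexes and ALLOW FILTERING. The function name can_query is made up.

```python
def can_query(partition_key, clustering_keys, where_columns):
    """Simplified check: the WHERE clause must name every partition key column,
    and any clustering columns it names must form a prefix of the declared
    clustering order."""
    where = set(where_columns)
    if not set(partition_key) <= where:
        return False
    remaining = where - set(partition_key)
    # The restricted clustering columns must be exactly the first N of them.
    prefix = clustering_keys[:len(remaining)]
    return remaining == set(prefix)

# PRIMARY KEY (customerid, dcid, productid)
pk, ck = ["customerid"], ["dcid", "productid"]
print(can_query(pk, ck, ["customerid", "dcid", "productid"]))  # one product
print(can_query(pk, ck, ["customerid", "dcid"]))               # one DC
print(can_query(pk, ck, ["dcid"]))                             # no customerid
```

The first two calls print True and the last prints False, matching the queries above: skipping customerid, or skipping dcid while giving productid, is what the table cannot serve.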
What if I want my data ordered a certain way? For this example, I'll take a cue from Patrick McFadin's article on Getting Started With Time Series Data Modeling, and build a table to keep track of the latest temperatures for weather stations.
CREATE TABLE latestTemperatures (
weatherstationid text,
eventtime timestamp,
temperature text,
PRIMARY KEY (weatherstationid,eventtime)
) WITH CLUSTERING ORDER BY (eventtime DESC);
By clustering on eventtime, and specifying a DESCending ORDER BY, I can query the recorded temperatures for a particular station like this:
SELECT * FROM latestTemperatures
WHERE weatherstationid='1234ABCD';
When those values are returned, they will be in DESCending order by eventtime.
Of course, the one question that everyone (with a RDBMS background...so yes, everyone) wants to know, is how to query all results ordered by eventtime? And again, you cannot. Of course, you can query for all rows by omitting the WHERE clause, but that won't return your data sorted in any meaningful order. It's important to remember that Cassandra can only enforce clustering order within a partition key. If you don't specify one, your data will not be ordered (at least, not in the way that you want it to be).
Let me know if you have any additional questions, and I'll be happy to explain.
I have two issues while querying Cassandra:
Query 1
> select * from a where author='Amresh' order by tweet_id DESC;
Order by with 2ndary indexes is not supported
What I learned: secondary indexes are made to be used only with a WHERE clause and not ORDER BY? If so, then how can I sort?
Query 2
> select * from a where user_id='xamry' ORDER BY tweet_device DESC;
Order by currently only supports the ordering of columns following their
declared order in the PRIMARY KEY.
What I learned: The ORDER BY column should be in the 2nd place in the primary key, maybe? If so, then what if I need to sort by multiple columns?
Table:
CREATE TABLE a(
user_id varchar,
tweet_id varchar,
tweet_device varchar,
author varchar,
body varchar,
PRIMARY KEY(user_id,tweet_id,tweet_device)
);
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't1', 'web', 'Amresh', 'Here is my first tweet');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't2', 'sms', 'Saurabh', 'Howz life Xamry');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't1', 'iPad', 'Kuldeep', 'You der?');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't2', 'mobile', 'Vivek', 'Yep, I suppose');
Create index user_index on a(author);
To answer your questions, let's focus on your choice of primary key for this table:
PRIMARY KEY(user_id,tweet_id,tweet_device)
As written, the user_id will be used as the partition key, which distributes your data around the cluster but also keeps all of the data for the same user ID on the same node. Within a single partition, unique rows are identified by the pair (tweet_id, tweet_device), and those rows are automatically ordered by tweet_id, then by tweet_device, because clustering columns sort in the order they are listed in the primary key. (Or put another way, the first column in the PK that is not a part of the partition key is the leading sort column of the partition.)
Query 1
The WHERE clause is author='Amresh'. Note that this clause does not involve any of the columns listed in the primary key; instead, it is filtering using a secondary index on author. Since the WHERE clause does not specify an exact value for the partition key column (user_id), using the index involves scanning all cluster nodes for possible matches. Results cannot be sorted when they come from more than one replica (node) because that would require holding the entire result set on the coordinator node before it could return any results to the client. The coordinator can't know what the real "first" result row is until it has confirmed that it has received and sorted every possible matching row.
If you need the information for a specific author name, separate from user ID, and sorted by tweet ID, then consider storing the data again in a different table. The data design philosophy with Cassandra is to store the data in the format you need when reading it and to actually denormalize (store redundant information) as necessary. This is because in Cassandra, writes are cheap (though it places the burden of managing multiple copies of the same logical data on the application developer).
Query 2
Here, the WHERE clause is user_id = 'xamry', which happens to be the partition key for this table. The good news is that this will go directly to the replica(s) holding this partition and not bother asking the other nodes. However, you cannot ORDER BY tweet_device because of what I explained at the top of this answer. Cassandra stores rows (within a single partition) sorted by the clustering columns in their declared order, so the leading sort column is the second column in the primary key. In your case, you can access data for user_id = 'xamry' ORDER BY tweet_id but not ordered by tweet_device. The answer, if you really need the data sorted by device, is the same as for Query 1: store it in a table where tweet_device is the second column in the primary key.
If, when looking up the tweets by user_id you only ever need them sorted by device, simply flip the order of the last two columns in your primary key. If you need to be able to sort either way, store the data twice in two different tables.
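To make "store the data twice" concrete, here is a toy model in Python. Two dicts stand in for two tables with the same partition key (user_id) but the last two primary key columns flipped; each partition keeps its rows sorted the way that table's clustering order would. The structure names are made up, and the rows are the ones from the question.

```python
import bisect
from collections import defaultdict

by_tweet_id = defaultdict(list)  # user_id -> rows sorted by (tweet_id, tweet_device)
by_device = defaultdict(list)    # user_id -> rows sorted by (tweet_device, tweet_id)

def insert_tweet(user_id, tweet_id, tweet_device, author, body):
    """Write the same logical row to both 'tables', each in its own sort order."""
    bisect.insort(by_tweet_id[user_id], (tweet_id, tweet_device, author, body))
    bisect.insort(by_device[user_id], (tweet_device, tweet_id, author, body))

insert_tweet("xamry", "t1", "web", "Amresh", "Here is my first tweet")
insert_tweet("xamry", "t2", "sms", "Saurabh", "Howz life Xamry")
insert_tweet("mevivs", "t1", "iPad", "Kuldeep", "You der?")
insert_tweet("mevivs", "t2", "mobile", "Vivek", "Yep, I suppose")

# Reading one partition returns rows already in that table's order.
print([row[0] for row in by_tweet_id["xamry"]])  # tweet ids, ascending
print([row[0] for row in by_device["xamry"]])    # devices, ascending
```

Each read picks the table whose order it needs; no sorting happens at query time, which mirrors how the clustering order works inside a partition.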
The Cassandra storage engine does not offer multi-column sorting other than the order of columns listed in your primary key.