I Know cassandra doesn't support joins, so to use cassandra we need to denormalize tables. I would like to know how?
Suppose I have two tables
<dl>
<dt>Publisher</dt>
<dd>Id : <i>Primary Key</i></dd>
<dd>Name</dd>
<dd>TimeStamp</dd>
<dd>Address</dd>
<dd>PhoneNo</dd>
<dt>Book</dt>
<dd>Id : <i>Primary Key</i></dd>
<dd>Name</dd>
<dd>ISBN</dd>
<dd>Year</dd>
<dd>PublisherId : <i>Foreign Key - Referenes Publisher table's Id</i></dd>
<dd>Cost</dd>
</dt>
</dl>
Please let me know how can I denormalize these tables in order to achieve the following operations efficiently
1. Search for all Books published by a particular publisher.
2. Search for all Publishers who published books in a given year.
3. Search for all Publishers who has not published books in a given year.
4. Search for all Publishers who has not published books till now.
I saw few articles regarding cassandra. But not able to conclude the denormalize for above operations. Please help me.
Designing a whole schema is a rather big task for one question, but in general terms denormalization means you will repeat the same data in multiple tables so that you can read a single row to get all the data you need for each type of query.
So you would create a table for each type of query, something along these lines:
Create a table partitioned by publisher id and with book id as a clustering column.
Create a table partitioned by year and with publisher id as a clustering column.
Create a table with a list of all publishers. In an application you could then read this list and programmatically subtract the rows present in the desired year read from the table 2.
I'm not sure what "published till now" means. When you insert a new book, you could check if the publisher is present in table 3. If not, then it's a new publisher.
So within each row of the data, you would repeat all the data you wanted to get back with the query (i.e. the union of all the columns in your example tables). When you insert a new book, you would insert it into all of your tables.
This sounds like it could get huge, so I'll take the first one and walk through how I would approach it. You don't have to do it this way, it's just one way to go about it. Note that you may have to create query tables for each of your 4 scenarios above. This table will solve for the first scenario only.
First of all, I'll create a type for publisher address.
CREATE TYPE address (
street text,
city text,
state text,
postalCode text
);
Next I'll create a table called booksByPublisher. I'll use my address type for publisherAddress. And I'll build my PRIMARY KEY with publisherid as the partition key, clustering on bookYear and isbn.
As you want to be able to query all books by a particular publisher, it makes sense to designate that as the partition key. It may prove helpful to have your results sorted year, or at the very least be able to look at a specific year for a specific publisher, so I have bookYear as the first clustering key. And of course, to create a unique CQL row for each book within a publisher, I'll add isbn for uniqueness.
CREATE TABLE booksByPublisher (
publisherid UUID,
publisherName text,
publisherAddress frozen<address>,
publisherPhoneNo text,
bookName text,
isbn text,
bookYear bigint,
bookCost bigint,
bookAuthor text,
PRIMARY KEY (publisherid, bookYear, isbn)
);
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Ready Player One','978-0307887443',2005,812,'Ernest Cline');
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Armada','978-0804137256',2015,1560,'Ernest Cline');
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (uuid(),'The Berkley Publishing Group',{ street: '375 Hudson Street', city: 'New York', state:'NY', postalcode: '10014'},'212-333-2354','Rainbox Six','978-0425170342',1999,867,'Tom Clancy');
Now I can query all books (out of my 3 rows) published by Crown Publishing (publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b) like this:
aploetz#cqlsh:stackoverflow2> SELECT * FROM booksbypublisher
WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b;
publisherid | bookyear | isbn | bookauthor | bookcost | bookname | publisheraddress | publishername | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+------------------+-------------------------------------------------------------------------------+------------------+------------------
b7b99ee9-f495-444b-b849-6cea82683d0b | 2005 | 978-0307887443 | Ernest Cline | 812 | Ready Player One | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
b7b99ee9-f495-444b-b849-6cea82683d0b | 2015 | 978-0804137256 | Ernest Cline | 1560 | Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
(2 rows)
If I want, I can also query for all books by Crown Publishing during 2015:
aploetz#cqlsh:stackoverflow2> SELECT * FROM booksbypublisher
WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b AND bookyear=2015;
publisherid | bookyear | isbn | bookauthor | bookcost | bookname | publisheraddress | publishername | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+----------+-------------------------------------------------------------------------------+------------------+------------------
b7b99ee9-f495-444b-b849-6cea82683d0b | 2015 | 978-0804137256 | Ernest Cline | 1560 | Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
(1 rows)
But I cannot query by just bookyear:
aploetz#cqlsh:stackoverflow2> SELECT * FROM booksbypublisher WHERE bookyear=2015;
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might
involve data filtering and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW FILTERING"
And don't listen to the error message and add ALLOW FILTERING. That might work fine for a table with 3 rows (or even 300). But it won't work for a table with 3 million rows (you'll get a timeout). Cassandra works best when you query by a complete partition key. As publisherid is our partition key, this query will perform just fine. But if you need to query by bookYear, then you should create a table which uses bookYear as its partitioning key.
Related
I need to do this query for cassndra:
select * from classes where students = null allow filtering;
students is a set
but looks like set do not allow = operator.
To test this out, I followed the DataStax docs on Indexing a Collection.
> CREATE TABLE cyclist_career_teams ( id UUID PRIMARY KEY, lastname text, teams set<text> );
> CREATE INDEX team_idx ON cyclist_career_teams ( teams );
With the table created and a secondary index on the teams set, I then inserted some test data:
> SELECT lastname,teams FROM cyclist_career_teams ;
lastname | teams
-----------------+---------------------------------------------------------------------------------------------------------
Vos | {'Neiderland bloeit', 'Rabobank Womens Team', 'Rabobonk-Liv Giant', 'Rabobonk-Liv Womens Cycling Team'}
Van Der Breggen | {'Rabobonk-Liv Womens Cycling Team', 'Sengers Ladies Cycling Team', 'Team Flexpoint'}
Brand | {'AA Drink - Leontien.nl', 'Rabobonk-Liv Giant', 'Rabobonk-Liv Womens Cycling Team'}
Armistead | null
Note that for Lizzie Armistead, I intentionally omitted a value for the teams column. While CQL does not allow the equals "=" relation on set types, it does allow CONTAINS. However, attempting to use that with null yields a different error:
> SELECT lastname,teams FROM cyclist_career_teams WHERE teams CONTAINS null;
[Invalid query] message="Unsupported null value for column teams"
The reason for this behavior, is related to how Cassandra has some special treatment for null values and the "null" keyword. Essentially, writing a null creates a tombstone, which is Cassandra's structure signifying a delete.
Even if Cassandra's treatment of null was not a factor, you'd still be faced with the problem that a value of "null" is not unique and your query would have to poll each node in the cluster. Such use cases are well-known anti-patterns. Unfortunately, Cassandra is just not good at querying data (or filtering on a key value) which does not exist.
One thing you could try, would be to use a string literal to indicate an empty value, like this:
> INSERT INTO cyclist_career_teams (id,lastname) VALUES (uuid(),'Armistead',{'empty'});
> SELECT lastname,teams FROM cyclist_career_teams WHERE teams CONTAINS 'empty';
lastname | teams
-----------+-----------
Armistead | {'empty'}
(1 rows)
To be honest though, because of the afore-mentioned anti-pattern, I can't recommend this approach in good faith. But with some added application logic at creation time, an "empty" string literal could work for you.
I have a table/columnfamily in Cassandra 3.7 with sensordata.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
)
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using java and Datastax java driver, and it has been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting with the legacy cassandra-cli to this cluster, perhaps that would have told me something that I can't see using cqlsh.
So, the questions are:
* Is there anyway this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically write time of these two rows. the writetime()-function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, from comments, answers and other sources
* selecting using blobAsBigInt gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling thrift, is possible but reading the table isn't. It's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
I don't expect to see nanoseconds in a timestamp field and additionally i'm of the impression they're fully not supported? Try this:
SELECT house_id, sensor_id, time_bucket, blobAsBigint(sensor_time) FROM test.sensor_data;
I WAS able to replicate it doing by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense because I would suspect one of your drivers is using a bigint to insert the timestamp, and one is likely actually using the datetime.
Tried playing with both timezones and bigints to reproduce this... seems like only bigint is reproducable
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
edit: Tried some shenanigans using bigint in place of datetime insert, managed to reproduce...
Adding some observations on top of what Nick mentioned,
Cassandra Primary key = one or combination of {Partition key(s) + Clustering key(s)}
Keeping in mind the concepts of partition keys used within angular brackets which can be simple (one key) or composite (multiple keys) for unique identification and clustering keys to sort data, the below have been observed.
Query using select: sufficient to query using all the partition key(s) provided, additionally can query using clustering key(s) but in the same order in which they have been mentioned in primary key during table creation.
Update using set or update: the update statement needs to have search/condition clauses which not only include all the partition key(s) but also all the clustering key(s)
Answering the question - Is there anyway this could happen under known circumstances?
Yes, it is possible when same data is inserted from different sources.
To explain further, incase one tries to insert data from code (API etc) into Cassandra and then tries inserting the same data from DataStax Studio/any tool used to perform direct querying, a duplicate record is inserted.
Incase the same data is being pushed multiple times either from code alone or querying tool alone or from another source used to do the same operation multiple times, the data behaves idempotently and is not inserted again.
The possible explanation could be the way the underlying storage engine computes internal indexes or hashes to identify a row pertaining to set of columns (since column based).
Note:
The above information of duplicacy incase same data is pushed from different sources has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3
"sensor_time" is part of the primary key. It is not in "Partition Key", but is "Clustering Column". this is why you get two "rows".
However, in the disk table, both "visual rows" are stored on single Cassandra row. In reality, they are just different columns and CQL just pretend they are two "visual rows".
Clarification - I did not worked with Cassandra for a while so I might not use correct terms. When i say "visual rows", I mean what CQL result shows.
Update
You can create following experiment (please ignore and fix any syntax errors I will do).
This suppose to do table with composite primary key:
"state" is "Partition Key" and
"city" is "Clustering Column".
create table cities(
state int,
city int,
name text,
primary key((state), city)
);
insert into cities(state, city, name)values(1, 1, 'New York');
insert into cities(state, city, name)values(1, 2, 'Corona');
select * from cities where state = 1;
this will return something like:
1, 1, New York
1, 2, Corona
But on the disk this will be stored on single row like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such composite primary key you can select or delete on it, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
this means
"house_id", "sensor_id", "time_bucket" is "Partition Key" and
"sensor_time" is the "Clustering Column".
So when you select, the real row is spitted and show as if there are several rows.
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".
We use cassandra wide rows heavily to store per user time-series as they are perfect for that use-case. Let's assume we have a table:
create table user_events (
user_id text,
timestmp timestamp,
event text,
primary key((user_id), timestmp));
What if clashes on timestamp may happen (same user can emit two different events with the same timestamp). What is the best way to tweak this schema to resolve that assuming we have an ordering for all events present (have a sequence int for each event).
If I modify schema the following way:
create table user_events (
user_id text,
timestmp timestamp,
seq int,
event text,
primary key((user_id), timestmp, seq));
I won’t be able to do WHERE user_id = ? ORDER BY timestmp ASC, seq ASC – cassandra does not allow that.
I won’t be able to do WHERE user_id = ? ORDER BY timestmp ASC, seq ASC – cassandra does not allow that.
You might be seeing an error because you are repeating ASC. This should work:
WHERE user_id = ? ORDER BY timestmp,seq ASC
Also, as long as you have defined your primary key as PRIMARY KEY((user_id),timestmp,seq)) you don't even need to specify ORDER BY x[,y] ASC. It will cluster the data on disk in that order, and thus return it to you already sorted in that order. ORDER BY should only be necessary when you want to put your results in descending order (or whatever the opposite of how you have it defined is).
What if clashes on timestamp may happen?
I think your extra seq column should be sufficient, depending on how you plan on inserting the data. If you are setting the timestmp from the client, then you should be ok. However, look what happens when I (using your second table) INSERT rows while creating the timestamp two different ways.
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('Mal',dateof(now()),1,'commanding');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('Wash',dateof(now()),1,'piloting');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),1,'freaking out');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),3,'being weird');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River',dateof(now()),2,'killing reavers');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',1,'freaking out');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',3,'being weird');
INSERT INTO user_events(user_id,timestmp,seq,event) VALUES ('River','2015-01-13 13:14-0600',2,'killing reavers');
Querying that data by a user_id of "River" yields:
aploetz#cqlsh:stackoverflow> SELECT * FROM user_events WHERE user_id='River';
user_id | timestmp | seq | event
---------+--------------------------+-----+-----------------
River | 2015-01-13 13:14:00-0600 | 1 | freaking out
River | 2015-01-13 13:14:00-0600 | 2 | killing reavers
River | 2015-01-13 13:14:00-0600 | 3 | being weird
River | 2015-01-14 12:58:41-0600 | 1 | freaking out
River | 2015-01-14 12:58:57-0600 | 3 | being weird
River | 2015-01-14 12:58:57-0600 | 2 | killing reavers
(6 rows)
Notice that using the now() function to generate a timeuuid, and then converting that to a timestamp with dateof() causes the two rows with the timestmp "2015-01-14 12:58:57-0600" to appear to be the same. But they are not the same, as you can tell by the seq column.
So just a bit of caution on using/generating timestamps. They might look the same, but they may not be stored as the same value. Just to be on the safe side, I would use a timeuuid instead.
I have a table that has a column of list type (tags):
CREATE TABLE "Videos" (
video_id UUID,
title VARCHAR,
tags LIST<VARCHAR>,
PRIMARY KEY (video_id, upload_timestamp)
) WITH CLUSTERING ORDER BY (upload_timestamp DESC);
I have plenty of rows containing various values in the tags column, ie. ["outdoor","funny cats","funny mice"].
I want to perform a SELECT query that will return all rows that contain "funny cats" in the tags column. How can I do that?
To directly answer your question, yes there is a way to accomplish this. As of Cassandra 2.1 you can create a secondary index on a collection. First, I'll re-create your column family definition (while adding a definition for upload_timestamp timeuuid) and put some values in it.
aploetz#cqlsh:stackoverflow> SELECT * FROM videos ;
video_id | upload_timestamp | tags | title
--------------------------------------+--------------------------------------+-----------------------------------------------+---------------------------
2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d | ['sci-fi', 'action', 'adventure'] | Star Wars
ab696e1f-78c0-45e6-893f-430e88db7f46 | 8db7c4b0-64fa-11e4-a819-21b264d4c94d | ['documentary'] | The Witches of Whitewater
15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] | Pulp Fiction
(3 rows)
Next, I'll create a secondary index on the tags column:
aploetz#cqlsh:stackoverflow> CREATE INDEX ON videos (tags);
Now, if I want to query the videos that contain the tag "action," I can accomplish this with the CONTAINS keyword:
aploetz#cqlsh:stackoverflow> SELECT * FROM videos WHERE tags CONTAINS 'action';
video_id | upload_timestamp | tags | title
--------------------------------------+--------------------------------------+-----------------------------------------------+--------------
2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d | ['sci-fi', 'action', 'adventure'] | Star Wars
15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] | Pulp Fiction
(2 rows)
With this all being said, I should pass along a couple of warnings:
Secondary indexes do not perform well at scale. They exist to provide convenience, not performance. If you are expecting to have to query by tag often, then the right way to solve this would be to create a videosbytag query table, with the same data but keyed like this: PRIMARY KEY (tag,video_id)
You don't need the double-quotes in your table name. In fact, having it in quotes may cause you problems (ok, maybe minor irritations) down the road.
It could be kind of lame but in cassandra has the primary key to be unique?
For example in the following table:
CREATE TABLE users (
name text,
surname text,
age int,
adress text,
PRIMARY KEY(name, surname)
);
So if is it possible in my database to have 2 persons in my database with the same name and surname but different ages? Which means same primary key..
Yes the primary key has to be unique. Otherwise there would be no way to know which row to return when you query with a duplicate key.
In your case you can have 2 rows with the same name or with the same surname but not both.
By definition, the primary key has to be unique. But that doesn't mean you can't accomplish your goals. You just need to change your approach/terminology.
First of all, if you relax your goal of having the name+surname be a primary key, you can do the following:
CREATE TABLE users ( name text, surname text, age int, address text, PRIMARY KEY((name, surname),age) );
insert into users (name,surname,age,address) values ('name1','surname1',10,'address1');
insert into users (name,surname,age,address) values ('name1','surname1',30,'address2');
select * from users where name='name1' and surname='surname1';
name | surname | age | address
-------+----------+-----+----------
name1 | surname1 | 10 | address1
name1 | surname1 | 30 | address2
If, on the other hand, you wanted to ensure that the address is shared as well, then you probably just want to store a collection of ages in the user record. That could be achieved by:
CREATE TABLE users2 ( name text, surname text, age set<int>, address text, PRIMARY KEY(name, surname) );
insert into users2 (name,surname,age,address) values ('name1','surname1',{10,30},'address2');
select * from users2 where name='name1' and surname='surname1';
name | surname | address | age
-------+----------+----------+----------
name1 | surname1 | address2 | {10, 30}
So it comes back to what you actually need to accomplish. Hopefully the above examples give you some ideas.
The primary key is unique. With your data model, you can only have one age per (name, surname) combination.
Yes as mentioned in above comments you can have a composite key with name, surname, and age to achieve your goal but still, that won't solve the problem. Rather you can consider adding a new column userID and make that as the primary key. So even in case of name, surname and age duplicate, you don't have to revisit your data model.
CREATE TABLE users (
userId int,
name text,
surname text,
age int,
adress text,
PRIMARY KEY(userid)
);
I would state specifically that partition key should be unique.I could not get it in one place but from the following statements.
Cassandra needs all the partition key columns to be able to compute
the hash that will allow it to locate the nodes containing the
partition.
The partition key has a special use in Apache Cassandra beyond
showing the uniqueness of the record in the database..
Please note that there will not be any error if you insert same
partition key again and again as there is no constraint check.
Queries that you'll run equality searches on should be in a partition
key.
References
https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
how Cassandra chooses the coordinator node and the replication nodes?
Insert query replaces rows having same data field in Cassandra clustering column