I am aware that updating individual fields of a frozen UDT column is not possible and that the entire value needs to be rewritten. Does that imply that an UPDATE on a frozen UDT column is not possible at all? And in the scenario where a field of a frozen UDT column needs to change, does one have to insert a new record and delete the older one?
You are correct that you cannot update individual fields of a frozen UDT column, but you can update the whole column value. You do not need to delete the previous record; it's fine to update the column in place. Let me illustrate with an example I created on Astra.
Here is a user-defined type that stores a user's address:
CREATE TYPE address (
    number int,
    street text,
    city text,
    zip int
);
and here is the definition for the table of users:
CREATE TABLE users (
    name text PRIMARY KEY,
    address frozen<address>
);
In this table, there is one user with their address stored as:
cqlsh> SELECT * FROM users ;
name | address
-------+----------------------------------------------------------------
alice | {number: 100, street: 'Main Rd', city: 'Melbourne', zip: 3000}
Let's say that the street number is incorrect. If we try to update just the street number field with:
cqlsh> UPDATE users SET address = {number: 456} WHERE name = 'alice';
We'll end up with an address that only has the street number and nothing else:
cqlsh> SELECT * FROM users ;
name | address
-------+----------------------------------------------------
alice | {number: 456, street: null, city: null, zip: null}
This is because the whole value (not just the street number field) got overwritten by the update. The correct way to update the street number is to explicitly set a value for all the fields of the address with:
cqlsh> UPDATE users SET address = {number: 456, street: 'Main Rd', city: 'Melbourne', zip: 3000} WHERE name = 'alice';
so we end up with:
cqlsh> SELECT * FROM users ;
name | address
-------+----------------------------------------------------------------
alice | {number: 456, street: 'Main Rd', city: 'Melbourne', zip: 3000}
Cheers!
You can update a column that is a frozen UDT, but you'll need to supply values for all of the fields inside that UDT. So you can just do a normal update of that column only:
UPDATE table SET udt_col = new_value WHERE pk = ....
without needing to delete anything first.
Basically, a frozen value is just a blob obtained by serializing the UDT or collection, stored as a single cell inside the row with a single timestamp. That's different from a non-frozen value, where the individual pieces of the UDT/collection can be stored in different places, each with its own timestamp.
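For contrast, here is a minimal sketch of the non-frozen case (assuming Cassandra 3.6 or later, where non-frozen UDT columns are supported; the table name is hypothetical). Since each field is then its own cell, a single field can be set on its own:

CREATE TABLE users_nonfrozen (
    name text PRIMARY KEY,
    address address  -- no frozen<> wrapper; requires Cassandra 3.6+
);

-- Sets one field only; the other fields keep their existing values and timestamps.
UPDATE users_nonfrozen SET address.number = 456 WHERE name = 'alice';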
My understanding of inserts and updates in Cassandra was that they were basically the same thing. That is also what the documentation says (https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlUpdate.html?hl=upsert):
Note: Unlike the INSERT command, the UPDATE command supports counters. Otherwise, the UPDATE and INSERT operations are identical.
So aside from support for counters they should be the same.
But then I ran across a problem where rows that were created via UPDATE would disappear if I set their columns to null, whereas this doesn't happen if the rows are created with INSERT.
cqlsh:test> CREATE TABLE IF NOT EXISTS address_table (
        ...     name text PRIMARY KEY,
        ...     addresses text
        ... );
cqlsh:test> insert into address_table (name, addresses) values ('Alice', 'applelane 1');
cqlsh:test> update address_table set addresses = 'broadway 2' where name = 'Bob' ;
cqlsh:test> select * from address_table;
name | addresses
-------+-------------
Bob | broadway 2
Alice | applelane 1
(2 rows)
cqlsh:test> update address_table set addresses = null where name = 'Alice' ;
cqlsh:test> update address_table set addresses = null where name = 'Bob' ;
cqlsh:test> select * from address_table;
name | addresses
-------+-----------
Alice | null
(1 rows)
The same thing happens if I skip the separate step of first creating a row. With insert I can create a row with a null value, but if I use update the row is nowhere to be found.
cqlsh:test> insert into address_table (name, addresses) values ('Caroline', null);
cqlsh:test> update address_table set addresses = null where name = 'Dexter' ;
cqlsh:test> select * from address_table;
name | addresses
----------+-----------
Caroline | null
Alice | null
(2 rows)
Can someone explain what's going on?
We're using Cassandra 3.11.3
This is expected behavior. See details in https://issues.apache.org/jira/browse/CASSANDRA-14478
INSERT adds a row marker, while UPDATE does not. What does this mean? Basically, an UPDATE requests that individual cells of the row be added, but not that the row itself be added; so if one later deletes those same individual cells with DELETE, the entire row goes away. An INSERT, however, not only adds the cells, it also requests that the row itself be added (this is implemented via a "row marker"). So if all of the row's individual cells are later deleted, an empty row remains behind (i.e., the primary key of the row, which now has no content, is still remembered in the table).
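Here is a minimal cqlsh sketch of that difference (marker_demo is a hypothetical table):

CREATE TABLE marker_demo (pk text PRIMARY KEY, val text);

-- INSERT writes the cell plus a row marker; UPDATE writes only the cell.
INSERT INTO marker_demo (pk, val) VALUES ('a', 'x');
UPDATE marker_demo SET val = 'y' WHERE pk = 'b';

-- Delete the only regular cell of each row.
DELETE val FROM marker_demo WHERE pk = 'a';
DELETE val FROM marker_demo WHERE pk = 'b';

-- Row 'a' survives as an empty row thanks to its row marker;
-- row 'b' disappears entirely.
SELECT * FROM marker_demo;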
We are unable to delete the last row in a table with a static column.
We have tried with Cassandra 2.2, 3.0, and 3.11.2, with a replication factor of 1 or more.
You can reproduce this by creating the following table:
CREATE TABLE playlists (
    username text,
    playlist_id bigint,
    playlist_order bigint,
    last_modified bigint static,
    PRIMARY KEY ((username, playlist_id), playlist_order)
)
WITH CLUSTERING ORDER BY (playlist_order DESC);
Then insert some test data:
INSERT INTO playlists (username, playlist_id, playlist_order, last_modified)
VALUES ('test', 123, 123, 123);
Then delete said row:
DELETE FROM playlists WHERE username = 'test' AND playlist_id = 123 AND playlist_order = 123;
Now do a select:
SELECT * FROM playlists WHERE username = 'test' AND playlist_id = 123;
Your result should look like this:
username | playlist_id | playlist_order | last_modified
----------+-------------+----------------+---------------
test | 123 | null | 123
As you can see, the record is not deleted; only the clustering data is gone. We suspect this has to do with the static column but are unable to explain it beyond that.
However if you omit the clustering key in the delete query, like so:
DELETE FROM playlists WHERE username = 'test' AND playlist_id = 123;
Then the record is deleted, but this requires unnecessary application logic to complete.
The behaviour only applies to the last record sharing the static column: you can populate the table with multiple records and delete them successfully, but the last one will always be left dangling.
Static columns exist per partition so in your case the last_modified value 123 exists for all rows in the partition test:123.
Your DELETE statement will not delete the static column because you are specifying a specific row for deletion. The static column will remain even though there are no rows left in the partition.
To delete the static column you need to issue:
DELETE last_modified FROM playlists WHERE username = 'test' AND playlist_id = 123;
This will remove the static column from the partition.
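You can verify this with the data from the question; after that delete, the partition should disappear entirely:

SELECT * FROM playlists WHERE username = 'test' AND playlist_id = 123;

 username | playlist_id | playlist_order | last_modified
----------+-------------+----------------+---------------

(0 rows)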
I know Cassandra doesn't support joins, so to use Cassandra we need to denormalize tables. I would like to know how.
Suppose I have two tables
Publisher
- Id: Primary Key
- Name
- TimeStamp
- Address
- PhoneNo

Book
- Id: Primary Key
- Name
- ISBN
- Year
- PublisherId: Foreign Key - References Publisher table's Id
- Cost
Please let me know how I can denormalize these tables in order to achieve the following operations efficiently:
1. Search for all Books published by a particular publisher.
2. Search for all Publishers who published books in a given year.
3. Search for all Publishers who have not published books in a given year.
4. Search for all Publishers who have not published any books so far.
I saw a few articles regarding Cassandra, but was not able to work out the denormalization for the above operations. Please help me.
Designing a whole schema is a rather big task for one question, but in general terms denormalization means you will repeat the same data in multiple tables so that you can read a single row to get all the data you need for each type of query.
So you would create a table for each type of query, something along these lines (tables 2 and 3 are sketched below):
1. Create a table partitioned by publisher id and with book id as a clustering column.
2. Create a table partitioned by year and with publisher id as a clustering column.
3. Create a table with a list of all publishers. In the application you could then read this list and programmatically subtract the publishers present in the desired year, read from table 2.
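A possible sketch of tables 2 and 3 (names and types are illustrative):

-- Table 2: partitioned by year, publishers clustered within it.
CREATE TABLE publishers_by_year (
    year int,
    publisher_id uuid,
    publisher_name text,
    PRIMARY KEY (year, publisher_id)
);

-- Table 3: the full list of publishers, to subtract from in the application.
CREATE TABLE publishers (
    publisher_id uuid PRIMARY KEY,
    publisher_name text
);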
I'm not sure what "published till now" means. When you insert a new book, you could check if the publisher is present in table 3. If not, then it's a new publisher.
So within each row of the data, you would repeat all the data you wanted to get back with the query (i.e. the union of all the columns in your example tables). When you insert a new book, you would insert it into all of your tables.
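To keep those tables consistent with each other, the writes for a new book can go into a single logged batch, something like this (values are illustrative, reusing the sketched tables above):

BEGIN BATCH
    INSERT INTO publishers_by_year (year, publisher_id, publisher_name)
    VALUES (2015, b7b99ee9-f495-444b-b849-6cea82683d0b, 'Crown Publishing');
    INSERT INTO publishers (publisher_id, publisher_name)
    VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b, 'Crown Publishing');
APPLY BATCH;

A logged batch guarantees that all of the writes will eventually be applied, at the cost of an extra coordinator round trip, which is usually an acceptable trade-off for keeping denormalized views in step.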
This sounds like it could get huge, so I'll take the first one and walk through how I would approach it. You don't have to do it this way, it's just one way to go about it. Note that you may have to create query tables for each of your 4 scenarios above. This table will solve for the first scenario only.
First of all, I'll create a type for publisher address.
CREATE TYPE address (
    street text,
    city text,
    state text,
    postalCode text
);
Next I'll create a table called booksByPublisher. I'll use my address type for publisherAddress. And I'll build my PRIMARY KEY with publisherid as the partition key, clustering on bookYear and isbn.
As you want to be able to query all books by a particular publisher, it makes sense to designate that as the partition key. It may prove helpful to have your results sorted by year, or at the very least be able to look at a specific year for a specific publisher, so I have bookYear as the first clustering key. And of course, to create a unique CQL row for each book within a publisher, I'll add isbn for uniqueness.
CREATE TABLE booksByPublisher (
    publisherid UUID,
    publisherName text,
    publisherAddress frozen<address>,
    publisherPhoneNo text,
    bookName text,
    isbn text,
    bookYear bigint,
    bookCost bigint,
    bookAuthor text,
    PRIMARY KEY (publisherid, bookYear, isbn)
);
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Ready Player One','978-0307887443',2005,812,'Ernest Cline');
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Armada','978-0804137256',2015,1560,'Ernest Cline');
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (uuid(),'The Berkley Publishing Group',{ street: '375 Hudson Street', city: 'New York', state:'NY', postalcode: '10014'},'212-333-2354','Rainbow Six','978-0425170342',1999,867,'Tom Clancy');
Now I can query all books (out of my 3 rows) published by Crown Publishing (publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b) like this:
aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher
WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b;
publisherid | bookyear | isbn | bookauthor | bookcost | bookname | publisheraddress | publishername | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+------------------+-------------------------------------------------------------------------------+------------------+------------------
b7b99ee9-f495-444b-b849-6cea82683d0b | 2005 | 978-0307887443 | Ernest Cline | 812 | Ready Player One | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
b7b99ee9-f495-444b-b849-6cea82683d0b | 2015 | 978-0804137256 | Ernest Cline | 1560 | Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
(2 rows)
If I want, I can also query for all books by Crown Publishing during 2015:
aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher
WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b AND bookyear=2015;
publisherid | bookyear | isbn | bookauthor | bookcost | bookname | publisheraddress | publishername | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+----------+-------------------------------------------------------------------------------+------------------+------------------
b7b99ee9-f495-444b-b849-6cea82683d0b | 2015 | 978-0804137256 | Ernest Cline | 1560 | Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
(1 rows)
But I cannot query by just bookyear:
aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher WHERE bookyear=2015;
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might
involve data filtering and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW FILTERING"
And don't listen to the error message and add ALLOW FILTERING. That might work fine for a table with 3 rows (or even 300). But it won't work for a table with 3 million rows (you'll get a timeout). Cassandra works best when you query by a complete partition key. As publisherid is our partition key, this query will perform just fine. But if you need to query by bookYear, then you should create a table which uses bookYear as its partitioning key.
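Such a table could mirror booksByPublisher but lead with the year; a sketch (the name booksByYear is an assumption):

CREATE TABLE booksByYear (
    bookYear bigint,
    publisherid UUID,
    isbn text,
    publisherName text,
    bookName text,
    bookAuthor text,
    bookCost bigint,
    PRIMARY KEY (bookYear, publisherid, isbn)
);

-- Now the year query hits a single partition:
SELECT * FROM booksByYear WHERE bookYear = 2015;

Bear in mind that partitioning by year alone can produce very large partitions for a big catalog; a composite partition key such as ((bookYear, publisherid)) is a common refinement.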
In the table below, what is the maximum size the phone_numbers column can accommodate?
Like normal columns, is it 2GB?
Or is it 64K*64K as mentioned here?
CREATE TABLE d2.employee (
    id int PRIMARY KEY,
    doj timestamp,
    name text,
    phone_numbers map<text, text>
)
Collection types in Cassandra are represented as a set of distinct cells in the internal data model: you will have a cell for each key of your phone_numbers column. Therefore they are not normal columns, but a set of columns. You can verify this by executing the following commands in cassandra-cli (1001 stands for a valid employee id):
use d2;
get employee[1001];
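Because each map entry is its own cell, you can also write or delete a single entry without touching the rest (the 'home' key is illustrative):

UPDATE d2.employee SET phone_numbers['home'] = '555-1234' WHERE id = 1001;
DELETE phone_numbers['home'] FROM d2.employee WHERE id = 1001;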
The correct answer is your point 2: 64K*64K.
It could be kind of a lame question, but in Cassandra does the primary key have to be unique?
For example in the following table:
CREATE TABLE users (
    name text,
    surname text,
    age int,
    address text,
    PRIMARY KEY(name, surname)
);
So is it possible to have 2 persons in my database with the same name and surname but different ages? That would mean the same primary key.
Yes the primary key has to be unique. Otherwise there would be no way to know which row to return when you query with a duplicate key.
In your case you can have 2 rows with the same name or with the same surname but not both.
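Note that Cassandra will not raise an error on a duplicate key: a second write with the same primary key silently upserts over the first. For example, with the table above:

INSERT INTO users (name, surname, age, address) VALUES ('John', 'Smith', 25, 'address1');
INSERT INTO users (name, surname, age, address) VALUES ('John', 'Smith', 52, 'address2');

SELECT * FROM users WHERE name = 'John' AND surname = 'Smith';

 name | surname | age | address
------+---------+-----+----------
 John | Smith   |  52 | address2

(1 rows)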
By definition, the primary key has to be unique. But that doesn't mean you can't accomplish your goals. You just need to change your approach/terminology.
First of all, if you relax your goal of having the name+surname be a primary key, you can do the following:
CREATE TABLE users (
    name text,
    surname text,
    age int,
    address text,
    PRIMARY KEY((name, surname), age)
);
insert into users (name,surname,age,address) values ('name1','surname1',10,'address1');
insert into users (name,surname,age,address) values ('name1','surname1',30,'address2');
select * from users where name='name1' and surname='surname1';
name | surname | age | address
-------+----------+-----+----------
name1 | surname1 | 10 | address1
name1 | surname1 | 30 | address2
If, on the other hand, you wanted to ensure that the address is shared as well, then you probably just want to store a collection of ages in the user record. That could be achieved by:
CREATE TABLE users2 (
    name text,
    surname text,
    age set<int>,
    address text,
    PRIMARY KEY(name, surname)
);
insert into users2 (name,surname,age,address) values ('name1','surname1',{10,30},'address2');
select * from users2 where name='name1' and surname='surname1';
name | surname | address | age
-------+----------+----------+----------
name1 | surname1 | address2 | {10, 30}
So it comes back to what you actually need to accomplish. Hopefully the above examples give you some ideas.
The primary key is unique. With your data model, you can only have one age per (name, surname) combination.
Yes, as mentioned in the comments above, you can have a composite key with name, surname, and age to achieve your goal, but that still won't solve the problem. Rather, you can consider adding a new column userId and making that the primary key. Then even if name, surname, and age are duplicated, you don't have to revisit your data model.
CREATE TABLE users (
    userId int,
    name text,
    surname text,
    age int,
    address text,
    PRIMARY KEY(userId)
);
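With the surrogate key, two people with identical details no longer collide (a sketch; a uuid is often a better fit than int here):

INSERT INTO users (userId, name, surname, age, address) VALUES (1, 'John', 'Smith', 25, 'address1');
INSERT INTO users (userId, name, surname, age, address) VALUES (2, 'John', 'Smith', 25, 'address1');
-- Two distinct rows now exist, one per userId.

The trade-off is that lookups by name now need a separate query table (or a secondary index), since userId is the only partition key.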
I would state specifically that the partition key should be unique. I could not find it spelled out in one place, but it follows from these statements:
- Cassandra needs all the partition key columns to be able to compute the hash that will allow it to locate the nodes containing the partition.
- The partition key has a special use in Apache Cassandra beyond showing the uniqueness of the record in the database.
- Please note that there will not be any error if you insert the same partition key again and again, as there is no constraint check.
- Queries that you'll run equality searches on should be in a partition key.
References:
- https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
- How Cassandra chooses the coordinator node and the replication nodes?
- Insert query replaces rows having same data field in Cassandra clustering column