What happens when adding a field to a UDT in Cassandra?

For example, suppose I have a basic_info type:
CREATE TYPE basic_info (first_name text, last_name text, nationality text)
And a table like this:
CREATE TABLE student_stats (id int PRIMARY KEY, grade text, basics FROZEN<basic_info>)
And I have millions of records in the table.
If I add a field to basic_info like this:
ALTER TYPE basic_info ADD address text;
What happens inside Cassandra when you add a new field to a UDT that is already used as a column in a table? I ask because I'm afraid of side effects when the table already contains a lot of data (millions of records). It would be best if you could explain what happens from start to finish.

The fields of a UDT are described in the table system_schema.types. When you add a new field, the entry for that type is updated inside Cassandra, but no change happens to the data on disk (SSTables are immutable). Instead, when Cassandra reads the data, it checks whether the field is present, and if it isn't (because it was never set, or because it's a new field of the UDT), it returns null for that value without modifying the data on disk.
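You can watch the schema change by querying system_schema.types before and after the ALTER; a minimal check, using the test.udt type from the example below:

-- The new field shows up in field_names/field_types after ALTER TYPE ... ADD,
-- while the existing SSTables stay untouched.
SELECT field_names, field_types
FROM system_schema.types
WHERE keyspace_name = 'test' AND type_name = 'udt';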
For example, if I have the following type and a table that uses it:
CREATE TYPE test.udt (
    id int,
    t1 int
);
CREATE TABLE test.u2 (
    id int PRIMARY KEY,
    u udt
);
And I have some data in the table, so I get:
cqlsh> select * from test.u2;

 id | u
----+----------------
  5 | {id: 1, t1: 3}
If I add a field to the UDT with alter type test.udt add t2 int;, I immediately see null as the value for the new UDT field:
cqlsh> select * from test.u2;
id | u
----+--------------------------
5 | {id: 1, t1: 3, t2: null}
And if I do sstabledump on the SSTable, I can see that it contains only old data:
[
  {
    "partition" : {
      "key" : [ "5" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 46,
        "liveness_info" : { "tstamp" : "2019-07-28T09:33:12.019Z" },
        "cells" : [
          { "name" : "u", "path" : [ "id" ], "value" : 1 },
          { "name" : "u", "path" : [ "t1" ], "value" : 3 }
        ]
      }
    ]
  }
]
See also my answer about adding/removing columns.

Related

Save array of objects in Cassandra

How can I save an array of objects in Cassandra?
I have a Node.js application that uses cassandra-driver to connect to the Cassandra DB. I want to save records like the one below:
{
  "id" : "5f1811029c82a61da4a44c05",
  "logs" : [
    {
      "conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667",
      "source" : "source1",
      "destination" : "destination1",
      "url" : "https://asdasdas.com",
      "data" : "data1"
    },
    {
      "conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667",
      "source" : "source2",
      "destination" : "destination2",
      "url" : "https://afdvfbwadvsffd.com",
      "data" : "data2"
    }
  ],
  "conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667"
}
In the above record, I can use type "text" to save the values of the "id" and "conversationId" columns, but I'm not sure how to define the schema and save the data for the "logs" field.
With Cassandra, you'll want to store the data in the same way that you want to query it. As you mentioned querying by conversationid, that's going to influence how the PRIMARY KEY definition should look. Given this, conversationid should make a good partition key. As for the clustering columns, I had to make some guesses about cardinality. source looked like it could help uniquely identify a log entry within a conversation, so I went with that next.
I thought about using id as the final clustering column, but it looks like all entries with the same conversationid would also have the same id. It might be a good idea to give each entry its own unique identifier, to help ensure uniqueness:
{
  "uniqueid": "e53723ca-2ab5-441f-b360-c60eacc2c854",
  "conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667",
  "source" : "source1",
  "destination" : "destination1",
  "url" : "https://asdasdas.com",
  "data" : "data1"
}
This makes the final table definition look like this:
CREATE TABLE conversationlogs (
    id TEXT,
    conversationid TEXT,
    uniqueid UUID,
    source TEXT,
    destination TEXT,
    url TEXT,
    data TEXT,
    PRIMARY KEY (conversationid, source, uniqueid));
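A quick usage sketch against that table (values taken from the question's JSON; uuid() generates the per-entry unique id):

INSERT INTO conversationlogs (id, conversationid, uniqueid, source, destination, url, data)
VALUES ('5f1811029c82a61da4a44c05',
        'e9b55229-f20c-4453-9c18-a1f4442eb667',
        uuid(),  -- server-generated unique id for this entry
        'source1', 'destination1', 'https://asdasdas.com', 'data1');

-- All log entries for one conversation, in clustering order:
SELECT * FROM conversationlogs
WHERE conversationid = 'e9b55229-f20c-4453-9c18-a1f4442eb667';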
You have a few options depending on how you want to query this data.
The first is to stringify the JSON in the logs field, save that string to the database, and convert it back to JSON after querying the data.
The second option is similar to the first, but instead of stringifying the array, you store the data as a list in the database (a sketch follows the third option's table below).
The third option is to define a new table for the logs with a primary key of the conversation and clustering keys for each element of the logs. This will allow you to look up either by the full key or query by just the partition key and retrieve all the rows that match those criteria.
CREATE TABLE conversationlogs (
    conversationid uuid,
    logid timeuuid,
    ...
    PRIMARY KEY ((conversationid), logid));
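For the second option, a minimal sketch, assuming a frozen UDT for the log entries (the log_entry type and table name are illustrative, mirroring the question's JSON):

CREATE TYPE log_entry (
    source text,
    destination text,
    url text,
    data text
);
CREATE TABLE conversationlogs_as_list (
    conversationid text PRIMARY KEY,
    logs list<frozen<log_entry>>  -- the whole array lives with the conversation row
);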

Cassandra: If a field inside a UDT is set to null, does this create a tombstone in Cassandra?

Please look at the following example:
Insert
INSERT INTO my_keyspace.my_table (id, name, my_info) VALUES (
3464546,
'Sumit',
{ birthday : '1990-01-01', height : '6.2 feet', weight : '74 kg' }
);
Second Insert
INSERT INTO my_keyspace.my_table (id, name, my_info) VALUES (
3464546,
'Sumit',
{ birthday : '1990-01-01', height : '6.2 feet', weight : null }
);
Consider "id" as the Primary Key.
In the second insert, the "weight" field inside the "my_info" UDT is set to null. Does this create a tombstone? How is a null inside a UDT stored in the Cassandra database?
Yes, setting a column to null is the same as writing a tombstone in some cases. With a non-frozen UDT, every field is stored as its own cell, so writing null for "weight" writes a tombstone for that cell. With a frozen UDT, the whole value is serialized as a single cell, so the null is simply encoded inside that blob and no per-field tombstone is written.
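A sketch of the two layouts (the my_info_t type name is illustrative; note that non-frozen UDT columns require Cassandra 3.6+):

CREATE TYPE my_keyspace.my_info_t (birthday date, height text, weight text);

-- Non-frozen: each UDT field is its own cell; null for "weight" -> cell tombstone.
CREATE TABLE my_keyspace.my_table (
    id int PRIMARY KEY,
    name text,
    my_info my_info_t
);

-- Frozen: the UDT is one serialized cell; a null field is just part of the blob.
CREATE TABLE my_keyspace.my_table_frozen (
    id int PRIMARY KEY,
    name text,
    my_info frozen<my_info_t>
);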

Too many columns in Cassandra

I have 20 columns in a table in Cassandra. Will there be a performance impact in performing
select * from table where partitionKey = 'test';
I was not able to work this out from this link:
https://wiki.apache.org/cassandra/CassandraLimitations
1) What will be the consequence of having too many columns (say 20) in the Cassandra tables?
Unless you have a lot of rows in the partition, I don't see an impact from having 20 columns. As stated in the documentation that you linked:
The maximum number of cells (rows x columns) in a single partition is 2 billion.
So with 20 columns you would only hit that limit at around 100 million rows in a single partition (2 billion cells / 20 columns); I don't see why 20 columns would be an issue. Keep in mind that Cassandra is a column-family store, which means it is designed to hold a large number of columns per partition.
Having said that, I would personally recommend not going over 100 MB per partition; larger partitions can cause problems later with streaming during repairs.
===============================
To answer your comment: keep in mind that partitions and rows are two different things in Cassandra. A partition is equal to a single row only if there are no clustering columns. For instance, take a look at this table creation and the values we insert, and then look at the sstabledump:
CREATE TABLE tt2 (foo int, bar int, mar int, PRIMARY KEY (foo, bar));
INSERT INTO tt2 (foo, bar, mar) VALUES (1, 2, 3);
INSERT INTO tt2 (foo, bar, mar) VALUES (1, 3, 4);
sstabledump:
./cassandra/tools/bin/sstabledump ~/cassandra/data/data/tk/tt2-1386f69005bd11e89c0bbfb5c1157523/mc-1-big-Data.db
[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 32,
        "clustering" : [ "2" ],
        "liveness_info" : { "tstamp" : "2018-01-30T12:57:36.362483Z" },
        "cells" : [
          { "name" : "mar", "value" : 3 }
        ]
      },
      {
        "type" : "row",
        "position" : 32,
        "clustering" : [ "3" ],
        "liveness_info" : { "tstamp" : "2018-01-30T12:58:03.538482Z" },
        "cells" : [
          { "name" : "mar", "value" : 4 }
        ]
      }
    ]
  }
]
Also, if you use the -d option, the internal representation may be easier to see. As you can see, the same partition contains 2 distinct rows:
./cassandra/tools/bin/sstabledump -d ~/cassandra/data/data/tk/tt2-1386f69005bd11e89c0bbfb5c1157523/mc-1-big-Data.db
[1]#0 Row[info=[ts=1517317056362483] ]: 2 | [mar=3 ts=1517317056362483]
[1]#32 Row[info=[ts=1517317083538482] ]: 3 | [mar=4 ts=1517317083538482]

MongoDB Data Structure

I'm a bit of a noob with MongoDB, so I would appreciate some help figuring out the best solution/format/structure for storing some data.
Basically, the data that will be stored will be updated every second with a name, value and timestamp for a certain meter reading.
For example, one possibility is water level and temperature in a tank. The tank will have a name, and then the level and temperature will be read and stored every second. Overall, there will be hundreds of items (i.e. tanks), each with millions of timestamped values.
From what I've learnt so far (and please correct me if I'm wrong), there are a few options for how to structure the data:
A slightly RDBMS-like approach:
This would consist of two collections, Items and Values
Items : {
    _id : "id",
    name : "name"
}
Values : {
    _id : "id",
    item_id : "item_id",
    name : "name", // temp or level etc
    value : "value",
    timestamp : "timestamp"
}
The more document-DB denormalized method:
This method involves one collection of items, each with an array of timestamped values.
Items : {
    _id : "id",
    name : "name",
    values : [{
        name : "name", // temp or level etc
        value : "value",
        timestamp : "timestamp"
    }]
}
A collection for each item
Save all the values in a collection named after that item.
ItemName : {
    _id : "id",
    name : "name", // temp or level etc
    value : "value",
    timestamp : "timestamp"
}
The majority of read queries will retrieve the timestamped values for a specified time period of an item (i.e. tank) and display them in a graph. For this, the first option makes more sense to me, as I don't want to retrieve millions of values when querying for a specific item.
Is it even possible to query for values between specific timestamps for option 2?
I will also need to query for a list of items, so maybe a combination of the first and third option with a collection for all the items and then a number of collections to store the values for each of those items?
Any feedback on this is greatly appreciated.
Don't store a separate timestamp if you are not modifying the ObjectId, because the ObjectId itself already embeds a timestamp, so you will save a lot of space by relying on it.
MongoDB Id Documentation
If you don't require the previous data, you can use an update query in MongoDB to update the fields every second instead of storing new documents.
If you want to keep the updated data each time, then instead of updating, store each reading in a flat structure:
{ "_id" : ObjectId("XXXXXX"),
"name" : "ItemName",
"value" : "ValueOfItem"
"created_at" : "timestamp"
}
Edit 1: Added timestamp as per the comments

Cassandra Hector Client: Is a RangeSlicesQuery on Composite Row Keys possible when using Random Partitioning?

Is there any way to range query rows with a composite row key when using random partitioning?
I'm working with column families created via CQL v3 like this:
CREATE TABLE products (
    rowkey CompositeType(UTF8Type,UTF8Type,UTF8Type,UTF8Type) PRIMARY KEY,
    prod_id varchar,
    class_id varchar,
    date varchar);
The data in the table looks like this:
RowKey: 6:3:2:19
=> (column=class_id, value=254, timestamp=1346800102625002)
=> (column=date, value=2034, timestamp=1346800102625000)
=> (column=prod_id, value=1922, timestamp=1346800102625001)
-------------------
RowKey: 0:14:1:16
=> (column=class_id, value=144, timestamp=1346797896819002)
=> (column=date, value=234, timestamp=1346797896819000)
=> (column=prod_id, value=4322, timestamp=1346797896819001)
-------------------
I'm trying to find a way to range-query over these composite row keys, analogous to how we slice-query over composite columns. The following approach sometimes actually succeeds in returning something useful, depending on the start and stop keys I choose.
Composite startKey = new Composite();
startKey.addComponent(0, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(1, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(2, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(3, "3", Composite.ComponentEquality.EQUAL);
Composite stopKey = new Composite();
stopKey.addComponent(0, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(1, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(2, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(3, "6" , Composite.ComponentEquality.GREATER_THAN_EQUAL);
RangeSlicesQuery<Composite, String, String> rangeSlicesQuery =
HFactory.createRangeSlicesQuery(keyspace, CompositeSerializer.get(), StringSerializer.get(), StringSerializer.get());
rangeSlicesQuery.setColumnFamily(columnFamilyName);
rangeSlicesQuery.setKeys(startKey,stopKey);
rangeSlicesQuery.setRange("", "", false, 3);
Most of the time the database returns this:
InvalidRequestException(why:start key's md5 sorts after end key's md5.
this is not allowed; you probably should not specify end key at all,
under RandomPartitioner)
Does somebody have an idea if something like this can be achieved WITHOUT using the order preserving partitioner? Do I have to build a custom row key index for this use case?
Thanks a lot!
Additional information:
What I'm trying to do is store sales transaction data in a table which uses both composite row keys to encode date/time/place and composite columns to store information about the sold items.
The set of items per transaction varies in size and includes information about size, color and quantity of every item:
{ ... items :
[ { item_id : 43523 , size : 050 , color : 123 , qty : 1 } ,
{ item_id : 64233 , size : 048 , color : 834 , qty : 1 } ,
{ item_id : 23984 , size : 000 , color : 341 , qty : 3 } ,
… ] }
There’s also information about where and when the transaction happened including a unique transaction id:
{ trx_id : 23324827346, store_id : 8934, date : 20110303, time : 0947, … }
My initial approach was putting every item in a separate row and let the application group items back together by transaction id. That’s working fine. But now I’m trying to leverage the structuring capabilities of composite columns to persist the nested item data within a representation (per item) like this:
item_id:’size’ = <value> ; item_id:’color’ = <value> ; item_id:’qty’ = <value> ; …
43523:size = 050 ; 43523:color = 123 ; 43523:qty = 1 ; …
The rest of the data would be encoded in a composite row key like this:
date : time : store_id : trx_id
20110303 : 0947 : 001 : 23324827346
I need to be able to run queries like: All items which were sold between the dates 20110301 and 20110310 between times 1200 and 1400 in stores 25 - 50. What I achieved so far with composite columns was using one wide row per store and putting all the rest of the data into 3 different composite columns per item:
date:time:<type>:prod_id:transaction_id = <value> ; …
20110303:0947:size:43523:23324827346 = 050 ;
20110303:0947:color:43523:23324827346 = 123 ;
20110303:0947:qty:43523:23324827346 = 1 ;
It’s working, but it doesn’t really look highly efficient.
Is there any other alternative?
You're creating one partition per row key, and RandomPartitioner orders partitions by the md5 hash of the key, so it should be clear that it will not give you ordered range queries across rows.
You can do ordered ranges within a partition, which is very common, e.g. http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/ and http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
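A minimal CQL 3 sketch of that pattern, with illustrative table and column names based on the question's data: partition by store and day, and let clustering columns keep transactions ordered by time within each partition:

CREATE TABLE sales_by_store_day (
    store_id int,
    date text,   -- e.g. '20110303'
    time text,   -- e.g. '0947'
    trx_id bigint,
    item_id int,
    size text,
    color text,
    qty int,
    PRIMARY KEY ((store_id, date), time, trx_id, item_id)
);

-- Ordered range scan within one partition (one store, one day):
SELECT * FROM sales_by_store_day
WHERE store_id = 25
  AND date = '20110303'
  AND time >= '1200' AND time < '1400';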
