Is it legit to store CQL tuples with null components in Cassandra 3.x?

I have to store a protocol buffer structure in Cassandra 3.x. It is defined in a .proto file as:
message Attribute
{
    required string key = 1;
    oneof value {
        int64 integerValue = 2;
        float floatValue = 3;
        string stringValue = 4;
    }
}
To store multiple Attributes I was thinking about this CQL definition.
CREATE TABLE ... attributes map<text, tuple<int, float, text>> ...
and in each tuple two of the three components would actually be null. I haven't tested this syntax yet, but are there any downsides to using this approach? Maybe there is a better way, e.g. User Defined Types?

Let's try this out. I'll start with a simple table, containing a valuemap column of type map<text,tuple<int,float,text>> as you have above:
CREATE TABLE tupletest (
    key text,
    value text,
    valuemap map<text, FROZEN<tuple<int,float,text>>>,
    PRIMARY KEY (key));
I'll INSERT some data:
INSERT INTO tupletest (key,value,valuemap) VALUES ('1','A',{'a':(0,0.0,'hi')});
INSERT INTO tupletest (key,value,valuemap) VALUES ('2','B',{'b':(0,null,'hi')});
INSERT INTO tupletest (key,value,valuemap) VALUES ('3','C',{'c':(null,null,'hi')});
And then I'll SELECT it, just to see:
aploetz@cqlsh:stackoverflow> SELECT * FROM tupletest ;

 key | value | valuemap
-----+-------+---------------------------
   3 |     C | {'c': (None, None, 'hi')}
   2 |     B |    {'b': (0, None, 'hi')}
   1 |     A |       {'a': (0, 0, 'hi')}

(3 rows)
The main apprehension about explicitly INSERTing NULL values into Cassandra is that in "normal" columns they actually create tombstones. But since we are not setting an entire column to NULL, merely an element in a tuple (nested inside a map), this is not the case: they simply show as None. And when I view the underlying SSTables, I also do not see evidence that a tombstone has been written.
Normally, I'd say that explicitly INSERTing a NULL into Cassandra is a terrible, terrible idea. But in this case, it shouldn't cause you any issues. Now, as to whether or not this is considered "legit" or good practice...well, my data modeling senses do not approve. I would find another way to represent the absence of a value in a tuple type, as someone (the developer who follows you) could see this and interpret it as being "ok" to explicitly INSERT NULLs into other column values.
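For reference, here is a rough sketch of driving the same experiment from application code with the DataStax Java driver 3.x (the contact point, keyspace name, and class name are my assumptions; the table is the one created above):
import java.util.Collections;
import com.datastax.driver.core.*;

public class TupleNullExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("stackoverflow");

        // The tuple type must match the column definition: tuple<int,float,text>.
        TupleType valueType = cluster.getMetadata()
                .newTupleType(DataType.cint(), DataType.cfloat(), DataType.text());

        // Two of the three components are deliberately null, mirroring the
        // protobuf "oneof": only stringValue is set here.
        TupleValue stringOnly = valueType.newValue(null, null, "hi");

        session.execute(
                "INSERT INTO tupletest (key, value, valuemap) VALUES (?, ?, ?)",
                "4", "D", Collections.singletonMap("d", stringOnly));

        cluster.close();
    }
}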

Related

SELECT rows with primary key of multiple columns

How do I select all relevant records according to the provided list of pairs?
table:
CREATE TABLE "users_groups" (
"user_id" INTEGER NOT NULL,
"group_id" BIGINT NOT NULL,
PRIMARY KEY (user_id, group_id),
"permissions" VARCHAR(255)
);
For example, say I have the following JavaScript array of pairs that I need to fetch from the DB:
[
    {user_id: 1, group_id: 19},
    {user_id: 1, group_id: 11},
    {user_id: 5, group_id: 19}
]
Here we see that the same user_id can be in multiple groups.
I could loop over the array elements and build the following query:
SELECT * FROM users_groups
WHERE (user_id = 1 AND group_id = 19)
OR (user_id = 1 AND group_id = 11)
OR (user_id = 5 AND group_id = 19);
But is this the best solution? Let's say the array is very long; as far as I know, the maximum query length is around 1 GB. What is the best and quickest way to do this?
Bill Karwin's answer will work for Postgres just as well.
However, in my experience, joining against a VALUES clause is very often faster than a large IN list (with hundreds if not thousands of elements):
select ug.*
from users_groups ug
join (
    values (1,19), (1,11), (5,19), ...
) as l(uid, guid) on l.uid = ug.user_id and l.guid = ug.group_id;
This assumes that there are no duplicates in the values provided, otherwise the JOIN would result in duplicated rows, which the IN solution would not do.
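As a rough illustration of how to parameterize this from application code, here is a hedged JDBC sketch (the class, method, and list shape are my inventions; the casts give Postgres concrete types for the placeholders inside VALUES):
import java.sql.*;
import java.util.*;

public class PairLookup {
    // Builds the VALUES join above with one "(?, ?)" row per pair.
    static List<String> permissionsFor(Connection conn, List<int[]> pairs) throws SQLException {
        String row = "(CAST(? AS integer), CAST(? AS bigint))";
        String values = String.join(", ", Collections.nCopies(pairs.size(), row));
        String sql = "SELECT ug.* FROM users_groups ug"
                + " JOIN (VALUES " + values + ") AS l(uid, guid)"
                + " ON l.uid = ug.user_id AND l.guid = ug.group_id";

        List<String> result = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int idx = 1;
            for (int[] p : pairs) {
                ps.setInt(idx++, p[0]);   // user_id
                ps.setLong(idx++, p[1]);  // group_id
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.add(rs.getString("permissions"));
                }
            }
        }
        return result;
    }
}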
You tagged both mysql and postgresql, so I don't know which SQL database you're really using.
MySQL at least supports tuple comparisons:
SELECT * FROM users_groups WHERE (user_id, group_id) IN ((1,19), (1,11), (5,19), ...)
This kind of predicate can be optimized in MySQL 5.7 and later. See https://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#row-constructor-range-optimization
I don't know whether PostgreSQL supports this type of predicate, or if it optimizes it.
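For completeness, the tuple-IN form can be parameterized the same way; a small JDBC sketch under the same assumptions (invented class and method names):
import java.sql.*;
import java.util.*;

public class PairLookupIn {
    // Builds "(user_id, group_id) IN ((?, ?), (?, ?), ...)" for MySQL.
    static PreparedStatement build(Connection conn, List<int[]> pairs) throws SQLException {
        String placeholders = String.join(", ", Collections.nCopies(pairs.size(), "(?, ?)"));
        PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM users_groups WHERE (user_id, group_id) IN (" + placeholders + ")");
        int idx = 1;
        for (int[] p : pairs) {
            ps.setInt(idx++, p[0]);   // user_id
            ps.setLong(idx++, p[1]);  // group_id
        }
        return ps;
    }
}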

How to convert a cassandra column value to a corresponding enum while querying

We have a Cassandra column of type int whose value corresponds to a value of an enum. Is there a way to convert this int column to the enum's string representation while querying?
It's not possible in CQL itself, but the Java driver supports the corresponding functionality through the EnumOrdinalCodec class (example from the documentation):
enum State {INIT, RUNNING, STOPPING, STOPPED}

cluster.getConfiguration().getCodecRegistry()
       .register(new EnumOrdinalCodec<State>(State.class));

// schema: create table ordinal_example(id int PRIMARY KEY, state int)
session.execute("insert into ordinal_example (id, state) values (1, ?)", State.INIT);

Updating to empty set

I just created a new column for my table
alter table user add questions set<timeuuid>;
Now the table looks like
user (
    google_id text PRIMARY KEY,
    date_of_birth timestamp,
    display_name text,
    joined timestamp,
    last_seen timestamp,
    points int,
    questions set<timeuuid>
)
Then I tried to update all those null values to empty sets by doing
update user set questions = {} where google_id = ?;
for each google_id.
However, they are still null.
How can I fill that column with empty sets?
A set, list, or map needs to have at least one element because an empty set, list, or map is stored as a null set.
(source)
Also, this might be helpful if you're using a client (Java, for instance).
I've learnt that there's not really such a thing as an empty set, list, etc.; these display as null in cqlsh.
However, you can still add elements to them, e.g.
> select * from id_set;

 set_id                | set_content
-----------------------+---------------------
 104649882895086167215 |                null
 105781005288147046623 |                null

> update id_set set set_content = set_content + {'apple','orange'} where set_id = '105781005288147046623';
> select * from id_set;

 set_id                | set_content
-----------------------+---------------------
 104649882895086167215 |                null
 105781005288147046623 | {'apple', 'orange'}
So even though it displays as null, you can think of it as already containing the empty set.
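This also explains the client hint above: with the DataStax Java driver 3.x, for instance (table from the example, open session assumed), a "null" set comes back as an empty collection, so no null check is needed:
Row row = session.execute(
        "select set_content from id_set where set_id = '104649882895086167215'").one();
Set<String> content = row.getSet("set_content", String.class);
System.out.println(content.isEmpty());  // prints true; the driver never returns null for collections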

Inserting a value on a frozen set in cassandra 3

I am currently working on a Cassandra 3 database in which one of its tables has a column that is defined like this:
column_name map<int, frozen <set<int>>>
When I have to replace a complete set for a given map key x, I just do this:
UPDATE keyspace.table SET column_name[x] = {1,2,3,4,5} WHERE ...
The thing is, I need to insert a value into a set for a given key. I tried this:
UPDATE keyspace.table SET column_name[x] = column_name[x] + {1} WHERE ...
But it returns:
SyntaxException: line 1:41 no viable alternative at input '[' (... SET column_name[x] = [column_name][...)
What am I doing wrong? Does anyone know how to insert data the way I need?
Since the value of the map is frozen, you can't update it like this.
A frozen value serializes multiple components into a single value; non-frozen types allow updates to individual fields. Cassandra treats the value of a frozen type as a blob, so the entire value must be overwritten.
You have to read the full map, get the value for the key, append the new item, and then reinsert it.
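A hedged sketch of that read-modify-write with the DataStax Java driver 3.x (keyspace.table and column_name are the placeholders from the question; the partition key column "id" and the method name are my assumptions). Note that this is not atomic, so two concurrent writers can overwrite each other's changes:
import java.util.*;
import com.datastax.driver.core.*;
import com.google.common.reflect.TypeToken;

public class FrozenSetAppend {
    static void addToFrozenSet(Session session, Object id, int mapKey, int newElement) {
        // Read the full map for the row.
        Row row = session.execute(
                "SELECT column_name FROM keyspace.table WHERE id = ?", id).one();
        Map<Integer, Set<Integer>> full = row.getMap("column_name",
                TypeToken.of(Integer.class), TypeTokens.setOf(Integer.class));

        // The driver hands back immutable collections, so copy before mutating.
        Set<Integer> updated = new HashSet<>(
                full.getOrDefault(mapKey, Collections.<Integer>emptySet()));
        updated.add(newElement);

        // Overwrite the single map entry with the rebuilt frozen set.
        session.execute(
                "UPDATE keyspace.table SET column_name[?] = ? WHERE id = ?",
                mapKey, updated, id);
    }
}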

Cassandra create table won't keep column order

I am creating a column family in Cassandra and I expect the column order to match the one I am specifying in the create clause.
This
CREATE TABLE cf.mycf (
    timestamp timestamp,
    id text,
    score int,
    type text,
    publisher_id text,
    embed_url text,
    PRIMARY KEY (timestamp, id, score)
) WITH bloom_filter_fp_chance = 0.01
AND comment = ''
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE'
AND caching = {
'keys' : 'ALL',
'rows_per_partition' : 'NONE'
}
AND compression = {
'chunk_length_kb' : 64,
'crc_check_chance' : 1.0,
'sstable_compression' : 'LZ4Compressor'
}
AND compaction = {
'base_time_seconds' : 60,
'class' : 'DateTieredCompactionStrategy',
'enabled' : true,
'max_sstable_age_days' : 365,
'max_threshold' : 32,
'min_threshold' : 4,
'timestamp_resolution' : 'MICROSECONDS',
'tombstone_compaction_interval' : 86400,
'tombstone_threshold' : 0.2,
'unchecked_tombstone_compaction' : false
};
Should create a table with columns ordered like:
timestamp, id, score, type, publisher_id, embed_url
Instead I am getting this:
timestamp timestamp,
id text,
score int,
embed_url text,
publisher_id text,
type text,
I've created quite a few tables in the same way and this never happened, so any help would be appreciated.
I put id and score in the primary key to show that they keep their respective positions, while the actual schema I am looking for has only timestamp as the primary key.
Looks like there is no such thing as field order in Cassandra.
The other columns are displayed in alphabetical order by Cassandra.
http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
You should make a clear distinction between how you want the data to be presented and how it is effectively presented to you. Moreover, you should not rely on the ordinal position of fields, only on their names.
In order to be efficient, and against your will (you specified an order for the columns when you modeled your schema), Cassandra needs to store the columns in a particular order, and for simplicity this is reflected in how it (the CQL interface or the driver) gives your data back.
For a deeper look at how Cassandra stores data (column names included!), I suggest Understanding How CQL3 Maps to Cassandra’s Internal Data Structure.
By the way, if you absolutely need to keep your order at the application level (and are too lazy to specify all the fields in the SELECT instead of using SELECT *), you need to create an abstraction of your own, something like an ordered "field names" array (your order):
String myorder[] = { "timestamp", "id", "score", "type", "publisher_id", "embed_url" };
and then use it to drive your loops by ordinal position, as sketched below.
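For instance, a minimal sketch (DataStax Java driver, the table from the question, an open session assumed) that reads a row back through that array, so the presentation order stays under your control:
String[] myorder = { "timestamp", "id", "score", "type", "publisher_id", "embed_url" };
Row row = session.execute("SELECT * FROM cf.mycf LIMIT 1").one();
for (String col : myorder) {
    System.out.println(col + " = " + row.getObject(col));  // your order, not the server's
}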
Keep in mind that the rendering of the CQL string by DESCRIBE in cqlsh is just a function call in the Python driver iterating over the metadata. It has nothing to do with how C* stores or sends its results.
If it matters, you can set the order yourself. When you INSERT, you can define the column order explicitly:
INSERT INTO keyspace_name.table_name
(identifier, column_name, whatever, order)
VALUES (value, value ...)
When you do a SELECT, you can define the order explicitly as well:
SELECT identifier, whatever, order, column_name FROM keyspace_name.table_name
