Prior to CQL3, one could insert arbitrary columns, such as columns named by a date:
cqlsh:test>CREATE TABLE seen_ships (day text PRIMARY KEY)
WITH comparator=timestamp AND default_validation=text;
cqlsh:test>INSERT INTO seen_ships (day, '2013-02-02 00:08:22')
VALUES ('Tuesday', 'Sunrise');
Per this post, it seems that things are different in CQL3. Is it still somehow possible to insert arbitrary columns? Here's my failed attempt:
cqlsh:test>CREATE TABLE seen_ships (
day text,
time_seen timestamp,
shipname text,
PRIMARY KEY (day, time_seen)
);
cqlsh:test>INSERT INTO seen_ships (day, 'foo') VALUES ('Tuesday', 'bar');
Here I get Bad Request: line 1:29 no viable alternative at input 'foo'
So I try a slightly different table because maybe this is a limitation of compound keys:
cqlsh:test>CREATE TABLE seen_ships ( day text PRIMARY KEY );
cqlsh:test>INSERT INTO seen_ships (day, 'foo') VALUES ('Tuesday', 'bar');
Again with the Bad Request: line 1:29 no viable alternative at input 'foo'
What am I missing here?
There's a good blog post over on the Datastax blog about this: http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
The answer is that yes, CQL3 supports dynamic columns, just not the way it worked in earlier versions of CQL. I don't really understand your example; you mix timestamps with strings in a way that I don't see working in CQL2 either. If I understand you correctly, you want to make a timeline of ship sightings, where the partition key (row key) is the day and each sighting is a time/name pair. Here's a suggestion:
CREATE TABLE ship_sightings (
day TEXT,
time TIMESTAMP,
ship TEXT,
PRIMARY KEY (day, time)
)
And you insert entries with
INSERT INTO ship_sightings (day, time, ship) VALUES ('Tuesday', toTimestamp(NOW()), 'Titanic')
However, you should probably use a TIMEUUID instead of TIMESTAMP (and the day partition key could be a DATE), since otherwise you might add two sightings with the same timestamp and only one of them will survive.
This was an example of wide rows, but then there's the issue of dynamic columns, which isn't exactly the same thing. Here's an example of that in CQL3:
CREATE TABLE ship_sightings_with_properties (
day TEXT,
time TIMEUUID,
ship TEXT,
property TEXT,
value TEXT,
PRIMARY KEY (day, time, ship, property)
)
which you can insert into like this:
INSERT INTO ship_sightings_with_properties (day, time, ship, property, value)
VALUES ('Sunday', NOW(), 'Titanic', 'Color', 'Black')
-- you need to repeat the INSERT INTO for each statement; multiple VALUES clauses
-- aren't supported, but I've left them out here to keep this example shorter
VALUES ('Sunday', NOW(), 'Titanic', 'Captain', 'Edward John Smith')
VALUES ('Sunday', NOW(), 'Titanic', 'Status', 'Steaming on')
VALUES ('Monday', NOW(), 'Carpathia', 'Status', 'Saving the passengers off the Titanic')
The downside of this kind of dynamic column is that the property names are stored multiple times (so if you have a thousand sightings in a row and each has a property called "Captain", that string is saved a thousand times). On-disk compression takes away most of that overhead, and most of the time it's nothing to worry about.
Finally, a note about collections in CQL3. They're a useful feature, but they are not a way to implement wide rows or dynamic columns. First of all, they have a limit of 65536 items, and Cassandra can't enforce this limit, so if you add too many elements you might not be able to read them back later. Collections are mostly for small multi-valued fields -- the canonical example is an address book where each row is an entry and where entries have only a single name, but multiple phone numbers, email addresses, etc.
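For reference, here's a minimal sketch of that address-book style use of collections (the table and column names are made up for illustration):
CREATE TABLE address_book (
name TEXT PRIMARY KEY,
phone_numbers SET<TEXT>,
emails SET<TEXT>
);
-- add one more phone number to an existing entry
UPDATE address_book SET phone_numbers = phone_numbers + {'+1-555-0100'} WHERE name = 'Edward John Smith';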
It's not a truly dynamic column, but most of the time you can get away with collections. Using a map column you can store some dynamic data.
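For example, a sketch of that map approach (the names are illustrative, not from the question):
CREATE TABLE ship_properties (
ship TEXT PRIMARY KEY,
properties MAP<TEXT, TEXT> -- e.g. 'Captain' -> 'Edward John Smith'
);
UPDATE ship_properties SET properties['Captain'] = 'Edward John Smith' WHERE ship = 'Titanic';
The collection size caveat from the previous answer applies here as well.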
I have a requirement to keep the old values of a row in a history table for auditing whenever we do a row update. Is there any solution available in Apache Cassandra to achieve this?
I looked at triggers, but not much is mentioned in the docs, and I'm not sure about the performance impact of using them. Also, if we use a trigger, will it give the old value of a column when we do an update?
Cassandra is a good tool for keeping row history. I will try to explain it with an example. Consider the table design below:
CREATE TABLE user_by_id (
userId text,
timestamp timestamp,
name text,
fullname text,
email text,
PRIMARY KEY (userId,timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
With this kind of table design you can keep the history of the record.
Here, userid is the partition key and timestamp is the clustering key. Every insert for the same user will be recorded as a different row, for example:
insert into user_by_id (userId, timestamp, name, fullname, email) values ('1', <newTimeStamp>, 'x', 'xyz', 'x@xyz.com');
insert into user_by_id (userId, timestamp, name, fullname, email) values ('1', <newTimeStamp>, 'y', 'xyz', 'y@xyz.com');
insert into user_by_id (userId, timestamp, name, fullname, email) values ('1', <newTimeStamp>, 'z', 'xyz', 'z@xyz.com');
The insert statements above are actually updating the values of the name and email columns, but they will be saved as three different rows, because timestamp is a clustering key and is different for each insert. If you want to get the latest value, just use LIMIT in your select query.
This design keeps the history of the row, which can be used for audit purposes.
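For example, with the clustering order above, the latest state of user '1' can be read with:
SELECT name, fullname, email FROM user_by_id WHERE userId = '1' LIMIT 1;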
Obviously, when dealing with time-series data that relates to some natural partition key like a sensor id, that key can be used as the partition key. But what do we do if we are interested in a global view and there is no natural candidate for the partition key? If we model the schema like this:
CREATE TABLE my_data (
year smallint,
day smallint,
date timestamp,
value text,
PRIMARY KEY ((year, day), date)
) WITH CLUSTERING ORDER BY (date DESC);
It is (probably) going to work just fine for most cases, but only if we know which years and days to fetch.
What if we don't care what day it is but want to see the 50 most recent items? What if we then want to see the next 50 items? Is there a way to do this in Cassandra? What is the recommended way of doing it?
Keep a second table of the year/days. When reading, you can grab from it first. When adding to my_data, update that table as well, but keep a cache of the days already inserted so each app only tries the insert once per day. For example, adding an extra key so you can have multiple streams, not just a single table per time series:
CREATE TABLE my_data (
key blob,
year smallint,
day smallint,
date timestamp,
value text,
PRIMARY KEY ((key, year, day), date)
) WITH CLUSTERING ORDER BY (date DESC);
CREATE TABLE my_data_keys (
key blob,
year smallint,
day smallint,
PRIMARY KEY ((key), year, day)
)
For inserts:
INSERT INTO my_data_keys (key, year, day) VALUES (0x01, 1, 2)
INSERT INTO my_data ...
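For completeness, a full insert into my_data might look like this (the values are purely illustrative, matching the key/year/day inserted above):
INSERT INTO my_data (key, year, day, date, value) VALUES (0x01, 1, 2, '2016-01-02 10:00:00', 'some value')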
Then keep an in-memory set somewhere recording that you have stored that key/year/day, so you don't need to insert it every time. To read the most recent:
SELECT year, day FROM my_data_keys WHERE key = 0x01;
The driver returns an iterator; for each element in it, make a query to my_data until 50 records have been collected.
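Each of those per-day queries would look something like this (the key and day values are illustrative):
SELECT date, value FROM my_data WHERE key = 0x01 AND year = 1 AND day = 2 LIMIT 50;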
If inserts are frequent enough, you can instead just work backwards from "today", issuing queries until you get 50 events. If the data is sparse, though, that can be a lot of wasted reads, and the extra table works better.
I'm doing research on storing logs in Cassandra.
The schema for the logs would be something like this.
EDIT: I've changed the schema in order to clarify some things.
CREATE TABLE log_date (
userid bigint,
time timeuuid,
reason text,
item text,
price int,
count int,
PRIMARY KEY ((userid), time) -- #1
PRIMARY KEY ((userid), time, reason, item, price, count) -- #2
);
A new table will be created for each day, so a table contains logs for only one day.
My querying condition is as follows.
Query all logs from a specific user on a specific day (date, not time).
So reason, item, price, and count will not be used as hints or conditions for queries at all.
My question is which PRIMARY KEY design suits this better.
EDIT: And the key point here is that I want to store the logs in a schematic way.
If I choose #1, many columns would be created per log, and the possibility of having more values per log is very high. The schema above is just an example; the log can contain values like subreason, friendid, and so on.
If I choose #2, one (very) composite column will be created per log, and so far I couldn't find any valuable information about the overhead of composite columns.
Which one should I choose? Please help.
My advice is that neither of your two options seems ideal for your time series, and the fact that you're creating a table per day doesn't seem optimal either.
Instead, I'd recommend creating a single table partitioned by userid and day, with a timeuuid as the clustering column for the event. An example of this would look like:
CREATE TABLE log_per_day (
userid bigint,
date text,
time timeuuid,
value text,
PRIMARY KEY ((userid, date), time)
)
This will allow you to have all of a day's events in a single row and lets you query per day per user.
Declaring time as the clustering column gives you a wide row where you can insert as many events as you need in a day.
So the row key is a composite of the userid plus the date as text, e.g.:
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID1,'my value')
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID2,'my value2')
The two inserts above will end up in the same row, and therefore you will be able to read them with a single query.
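For example, that single read for the inserts above would be:
SELECT time, value FROM log_per_day WHERE userid = 1000 AND date = '2015-05-06';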
Also, if you want more information about time series, I highly recommend checking out Getting Started with Time Series Data Modeling.
Hope it helps,
José Luis
We have a data model for articles with a lot of properties. Here is our table model:
CREATE TABLE articles (
organization_id bigint,
gtin text,
barcodes text,
code text,
brand text,
season text,
name text,
option text,
style text,
color text,
sizes text,
supplier text,
category text,
prices text,
last_updated timeuuid,
content_hash uuid,
markdown boolean,
PRIMARY KEY (organization_id, gtin)
) WITH COMMENT='Articles';
Here gtin uniquely identifies an article, and we save all articles of an organization in one row. We have a constraint to update each article only if something has changed. This is important because when an article changes we update the last_updated field, and external devices then know which articles to synchronize, since they know when they last synchronized.
We added one more table for that:
CREATE TABLE articles_by_last_updated (
organization_id bigint,
gtin text,
barcodes text,
code text,
brand text,
season text,
name text,
option text,
style text,
color text,
sizes text,
supplier text,
category text,
prices text,
last_updated timeuuid,
content_hash uuid,
markdown boolean,
PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';
So we can easily return all articles updated after a certain point in time. This table must be cleared of duplicates per gtin, since we import articles each day and the sync is done from mobile devices, so we want to keep the dataset small. (In theory we could save everything in that table and overwrite with the latest info, but that created large datasets between syncs, so we started deleting from that table; and to delete we needed to know last_updated from the first table.)
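For reference, the sync read from a device is just a range scan on the clustering key, something like the following (the organization id and cutoff are illustrative):
SELECT * FROM articles_by_last_updated WHERE organization_id = 42 AND last_updated > maxTimeuuid('2016-01-01 00:00:00');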
Problems we are facing right now are:
In order to check whether article fields were updated, we need to do a read before write (we partially solved that with the content_hash field, which is a hash over all fields, so we read and compare the hash of the incoming article with the value in the DB).
We are deleting and inserting in the second table, since we need unique gtins there (we only need the latest change to send to devices, not duplicate articles), which produces an awful lot of tombstones.
We have a feature to add: search by many different combinations of fields.
Questions:
Is Cassandra a good choice for this kind of data, or should we move it to some other storage (or even have Elasticsearch and Cassandra in combination, where we can index changes over time and Cassandra holds only the master data per gtin)?
Can the data be modeled better for our use case, to avoid the read before write or the deletes in the second table?
Update
Just to clarify the use case: other devices are syncing with pagination (sending last_sync_date, skip, and count), so we need a table with all article information, sorted by last_updated, without duplicates, and searchable by last_updated.
If you are using Cassandra 2.1.1 or later, you can use the "not equal" comparison in the IF part of the UPDATE statement (see the CASSANDRA-6839 JIRA issue) to make sure you update data only if anything has changed. Your statement would look something like this:
UPDATE articles
SET
barcodes = <barcodes>,
... = <...>,
last_updated = <last_updated>
WHERE
organization_id = <organization_id>
AND gtin = <gtin>
IF content_hash != <content_hash>;
For your second table, you don't need to duplicate the entire data from the first table; you can do the following:
create your table like this:
CREATE TABLE articles_by_last_updated (
organization_id bigint,
last_updated timeuuid,
gtin text,
PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';
Once you've updated the first table, you can read the last_updated value for that gtin again; if it is equal to or greater than the last_updated value you passed in, you know the update was successful (done by your process or another one), and you can then insert that retrieved last_updated value into the second table. You don't need to delete records for this update. I assume you can build the distinct list of updated gtins on the application side if you poll (using a range query) on a regular basis, which I assume pulls a reasonable amount of data. You can also TTL these new records after a few poll cycles, for example, to remove the need for manual deletes.
Then, once you have found the affected gtins, you do a second query that pulls all of the data from the first table. You can then run a second sanity check on the updated dates to avoid sending anything that is supposed to be sent on the next update (if that is necessary, of course).
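A sketch of that insert into the trimmed-down second table, using a TTL (the three-day TTL is just an example of "a few poll cycles"):
INSERT INTO articles_by_last_updated (organization_id, last_updated, gtin)
VALUES (<organization_id>, <last_updated>, <gtin>)
USING TTL 259200; -- 3 days, expressed in seconds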
HTH.
I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply you can have varying column keys in Cassandra for a given row key. Is that true? And if it's true, how do you allow for varying column keys?
The reason I think this might be true is because, say, we have a user who can like many items, and we simply want the userId to be the row key. We let this row key (userId) map to all the items that the specific user might like. Each user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemID a user likes, then we could solve the problem that way.
Therefore, is it possible to have a varying number of Cassandra column keys for a specific row key? (And how do you do it?)
Providing an example and/or some CQL code would be awesome!
The thing that is confusing me is that I have seen some .cql files, and they define keyspaces beforehand; it seems pretty inflexible in terms of making things dynamic, i.e. allowing additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
How can this even allow growing columns? Don't we need to specify the names beforehand anyway? Or additional custom columns as the application desires?
Yes, you can have a varying number of columns per row key. From a relational perspective it's not obvious, but tid is not a fixed attribute name here; it acts as a placeholder for the variable column key. Note in the insert statements below that the schema is never altered: what varies from row to row is the value of tid (the column key), not the set of declared columns.
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
data text,
PRIMARY KEY(test, tid)
);
So in your example, you need to identify the row key, the column key, and the payload of the table.
The primary key contains both the row key and the column key.
test is your row key.
tid is your column key.
data is your payload.
The following inserts are all valid:
INSERT INTO your_keyspace.results (test, tid, data) VALUES (0x01, a4a70900-24e1-11df-8924-001ff3591711, 'blob_1');
INSERT INTO your_keyspace.results (test, tid, data) VALUES (0x01, a4a70900-24e1-11df-8924-001ff3591712, 'blob_2');
-- notice that the column key (tid) changed but the row key (test, here 0x01) remained the same
INSERT INTO your_keyspace.results (test, tid, data) VALUES (0x02, a4a70900-24e1-11df-8924-001ff3591711, 'blob_3');
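Reading a row key back then returns however many (tid, data) pairs were stored under it:
SELECT tid, data FROM your_keyspace.results WHERE test = 0x01;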
See here
Have you thought of exploring collection support in Cassandra for handling such relations in a colocated way (e.g. on the same data node)?
Not sure if it helps, but what about keeping the user id as the row key and a map containing the item id as the key and some value, as sketched below?
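A minimal sketch of that idea (the table name, value type, and item id are assumptions, not from the question):
CREATE TABLE user_likes (
userid bigint PRIMARY KEY,
liked_items map<text, int> -- item id -> some value, e.g. a rating or a count
);
-- record that user 1000 likes item 'item_42'
UPDATE user_likes SET liked_items['item_42'] = 1 WHERE userid = 1000;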
-Vivel