Get timestamp of a collection item in CQL / Cassandra - cassandra

I know I can use writetime() to get the internal timestamp for a column, but is it possible to get the timestamp for a particular item in a collection (like a list)? My understanding is that collection items are internally stored as individual columns, so it seems like they should have individual timestamps.

The individual columns within a collection do indeed have individual timestamps. You can see this by examining the SSTable directly using the sstable2json function.
CREATE TABLE users (
user_id text,
emails set<text>,
first_name text,
last_name text,
PRIMARY KEY ((user_id))
)
INSERT INTO users (user_id, first_name, last_name, emails)
VALUES('frodo', 'Frodo', 'Baggins', {'f#baggins.com', 'baggins#gmail.com'});
UPDATE users
SET emails = emails + {'fb#friendsofmordor.org'} WHERE user_id = 'frodo';
Then the SSTable looks like this:
[
{"key": "66726f646f","columns": [["","",1444170199819000],
["emails","emails:!",1444170199818999,"t",1444170199],
["emails:62616767696e7340676d61696c2e636f6d","",1444170199819000],
["emails:664062616767696e732e636f6d","",1444170199819000],
["emails:666240667269656e64736f666d6f72646f722e6f7267","",1444170213268000],
["first_name","Frodo",1444170199819000],
["last_name","Baggins",1444170199819000]]}
]
You can see the 3 entries in emails corresponding to timestamps 1444170199819000, 1444170199819000 and 1444170213268000.
However it seems like its not possible to return these through CQL. You won't be able to return individual columns from the set, but it would be reasonable to return the timestamps along with the values for all of the entries in the set, however I can't find any documentation on how to do this so it doesn't look like its supported.

Collection elements have their own timestamps, but it is impossible to get those timestamps using CQL. See corresponding Jira ticket to support this functionality: CASSANDRA-8877.

Related

Insert table Mutation to different Cassandra table

I have requirement to keep the old values of a row in a history table for auditing whenever we do row update. Is there any solution available in Apache Cassandra to achieve this?
I looked at the Trigger and not much mentioned in the docs. Not sure of performance issues if we use the triggers. Also if we use trigger, will it give the old value for a column when we do update?
Cassandra is best tool to keep the row history. I will try to explain it with an example. Consider the below table design -
CREATE TABLE user_by_id (
userId text,
timestamp timestamp,
name text,
fullname text,
email text,
PRIMARY KEY (userId,timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
With this kind of table design you can keep the history of the record.
Here, userid is row partition key and timestamp as clustering key. Every insert for same user will be recorded as different row. for example -
insert into user_by_id (userId,timestamp ,name, fullname, email ) values ('1',<newTimeStamp>,'x',xyz,'x#xyz.com');
insert into user_by_id (userId,timestamp ,name, fullname, email ) values ('1',<newTimeStamp>,'y',xyz,'y#xyz.com');
insert into user_by_id (userId,timestamp ,name, fullname, email ) values ('1',<newTimeStamp>,'z',xyz,'z#xyz.com');
Above insert statements are actually updating values of the name and email column. But, this will be saved in three different rows because of timestamp as a clustering key, timestamp will be different for each row. If you want to get the latest value, just use LIMIT in your select query.
This design keeps the history of the row which can be used foe audit purpose.

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples in the internet, I am still more and more confused how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primayr key, as this is my understanding problem):
CREATE TABLE documents (
uid text,
created text,
data text
}
Now my goal is to have to different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = ‘xxxx-yyyyy-zzzz’
3) Select by a date limit
SELECT * FROM documents
WHERE created >= ‘2015-06-05’
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz' Of course, uid must be unique. You might want to consider the uuid data type (instead of text)
Second one is more delicate. If you set your partition to the full date, you won't be able to do a range query, as range query is only available on the clustering column. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100MB,
otherwise you will run into trouble)
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid);
This works fine if within a day, you don't have too many documents (so your partition don't grow too much). And this allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 and month=12 and day>=6 and day<=24; If you need to issue a range query across multiple months, you will need to issue multiple queries.
If your partition is too large due to the data field, you will need to remove it from documents_by_date. And use documents table to retrieve the data, given the uid you retreived from documents_by_date.
If your partition is still too large, you will need to add hour in the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the stratio lucene cassandra plugin, and index your date.
Question does not specify how your data is going to be with respect user and create time. But since its a document, I am assuming that one user will be creating one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) can help you get the data order by created for a given user.
For your first requirement you can query like given below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query like given below
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
Use of Allow Filtering should be used diligently as it scans all partitions. If we have to create a separate table with date as primary key, it becomes tricky if there are many documents being inserted at very same second. Clustering order works best for the requirements where documents for a given user need to be sorted by time.

Cassandra how can I simulate a join statement

I am new to cassandra and am coming from Postgres. I was wondering if there is a way that I can get data from 2 different tables or column family and then return the results. I have this query
select p.fullname,p.picture s.post, s.id, s.comments, s.state, s.city FROM profiles as p INNER JOIN Chats as s ON(p.id==s.profile_id) WHERE s.latitudes>=28 AND 29>= s.latitudes AND s.longitudes
">=-21 AND -23>= s.longitudes
The query has 2 tables: Profiles and Chat and they both share a common field Chats.id==Proifles.profile_id it boils down to this basically return all rows where Chat ID is equal to Profiles id. I would like to keep it that way because now updating profiles are simple and would only need to update 1 row per profile update instead of de-normalizing everything and updating thousands of records. Any help or suggestions would be great
You have to design tables in way you won't need joins. Best practice is if your table matches exactly the use case it is used for.
Cassadra has a feature called shared static columns; this allows you to bind values with partition part of primary key. Thus, you can create "joined" version of table without duplicates.
CREATE TABLE t (
p_id uuid,
p_fullname text STATIC,
p_picture text STATIC,
s_id uuid,
s_post text,
s_comments text,
s_state text,
s_city text,
PRIMARY KEY (p_id, s_id)
);

Primary Key related CQL3 Queries cases & errors when sorting

I have two issues while querying Cassandra:
Query 1
> select * from a where author='Amresh' order by tweet_id DESC;
Order by with 2ndary indexes is not supported
What I learned: secondary indexes are made to be used only with a WHERE clause and not ORDER BY? If so, then how can I sort?
Query 2
> select * from a where user_id='xamry' ORDER BY tweet_device DESC;
Order by currently only supports the ordering of columns following their
declared order in the PRIMARY KEY.
What I learned: The ORDER BY column should be in the 2nd place in the primary key, maybe? If so, then what if I need to sort by multiple columns?
Table:
CREATE TABLE a(
user_id varchar,
tweet_id varchar,
tweet_device varchar,
author varchar,
body varchar,
PRIMARY KEY(user_id,tweet_id,tweet_device)
);
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't1', 'web', 'Amresh', 'Here is my first tweet');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't2', 'sms', 'Saurabh', 'Howz life Xamry');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't1', 'iPad', 'Kuldeep', 'You der?');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't2', 'mobile', 'Vivek', 'Yep, I suppose');
Create index user_index on a(author);
To answer your questions, let's focus on your choice of primary key for this table:
PRIMARY KEY(user_id,tweet_id,tweet_device)
As written, the user_id will be used as the partition key, which distributes your data around the cluster but also keeps all of the data for the same user ID on the same node. Within a single partition, unique rows are identified by the pair (tweet_id, tweet_device) and those rows will be automatically ordered by tweet_id because it is the second column listed in the primary key. (Or put another way, the first column in the PK that is not a part of the partition key determines the sort order of the partition.)
Query 1
The WHERE clause is author='Amresh'. Note that this clause does not involve any of the columns listed in the primary key; instead, it is filtering using a secondary index on author. Since the WHERE clause does not specify an exact value for the partition key column (user_id) using the index involves scanning all cluster nodes for possible matches. Results cannot be sorted when they come from more than one replica (node) because that would require holding the entire result set on the coordinator node before it could return any results to the client. The coordinator can't know what is the real "first" result row until it has confirmed that it has received and sorted every possible matching row.
If you need the information for a specific author name, separate from user ID, and sorted by tweet ID, then consider storing the data again in a different table. The data design philosophy with Cassandra is to store the data in the format you need when reading it and to actually denormalize (store redundant information) as necessary. This is because in Cassandra, writes are cheap (though it places the burden of managing multiple copies of the same logical data on the application developer).
Query 2
Here, the WHERE clause is user_id = 'xamry' which happens to be the partition key for this table. The good news is that this will go directly to the replica(s) holding this partition and not bother asking the other nodes. However, you cannot ORDER BY tweet_device because of what I explained at the top of this answer. Cassandra stores rows (within a single partition) sorted by a single column, generally the second column in the primary key. In your case, you can access data for user_id = 'xamry' ORDER BY tweet_id but not ordered by tweet_device. The answer, if you really need the data sorted by device, is the same as for Query 1: store it in a table where that is the second column in the primary key.
If, when looking up the tweets by user_id you only ever need them sorted by device, simply flip the order of the last two columns in your primary key. If you need to be able to sort either way, store the data twice in two different tables.
The Cassandra storage engine does not offer multi-column sorting other than the order of columns listed in your primary key.

How to make Cassandra have a varying column key for a specific row key?

I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply you can have varying column keys in cassandra for a given row key. Is that true? And if its true, how do you allow for varying row keys.
The reason I think this might be true is because say we have a user and it can like many items and we simply want the userId to be the rowkey. We let this rowKey (userID) map to all the items that specific user might like. Each specific user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemID each user likes, then we could solve the problem that way.
Therefore, is it possible to have varying length of cassandra column keys for a specific rowKey? (and how do you do it)
Providing an example and/or some cql code would be awesome!
The thing that is confusing me is that I have seen some .cql files and they define keyspaces before hand and it seems pretty inflexible on how to make it dynamic, i.e. allow it to have additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
How can this even allow growing columns? Don't we need to specify the name before hand anyway?Or additional custom columns as the application desires?
Yes, you can have a varying number of columns per row_key. From a relational perspective, it's not obvious that tid is the name of a variable. It acts as a placeholder for the variable column key. Note in the inserts statements below, "tid", "result", and "data" are never mentioned in the statement.
CREATE TABLE IF NOT EXISTS results (
data blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
So in your example, you need to identify the row_key, column_key, and payload of the table.
The primary key contains both the row_key and column_key.
Test is your row_key.
tid is your column_key.
data is your payload.
The following inserts are all valid:
INSERT your_keyspace.results('row_key_1', 'a4a70900-24e1-11df-8924-001ff3591711', 'blob_1');
INSERT your_keyspace.results('row_key_1', 'a4a70900-24e1-11df-8924-001ff3591712', 'blob_2');
#notice that the column_key changed but the row_key remained the same
INSERT your_keyspace.results('row_key_2', 'a4a70900-24e1-11df-8924-001ff3591711', 'blob_3');
See here
Did you thought of exploring collection support in cassandra for handling such relations in colocated way{e.g. on same data node}.
Not sure if it helps, but what about keeping user id as row key and a map containing item id as key and some value?
-Vivel

Resources