Accessing timestamp of a Cassandra column - cassandra

I am new to Cassandra.
I have a column Family where the columns are sorted by "LexicalUUIDType".
How can I access timestamp of each column in such a ColumnFamily?
I need to the timestamp because I have to read the oldest entry.
I can not use "TimeUUIDType" for sorting columns.
Thanks,

It depends on the library you are using. But if you are using the raw thrift api its something like (unreleased 0.7/trunk):
column.column.clock.timestamp
(To get all data you will have to use get_range_slices, start with "", and after each call use the last key as the start key in the next call)

You would have to get back all of the columns using get_slice http://wiki.apache.org/cassandra/API06#get_slice and then look at the timestamp field in each one.
Or you can make another column family sorted by timeuuid which has the corresponding column in the first cf as the value. Query cf #2 with the time you want, and use the result to get from cf #1.

Related

Cassandra Altering the table

I have a table in Cassandra say employee(id, email, role, name, password) with only id as my primary key.
I want to ...
1. Add another column (manager_id) in with a default value in it
I know that I can add a column in the table but there is no way i can provide a default value to that column through CQL. I can also not update the value for manager_id later since I need to know the id (Partition key and the values are randomly generated unique values which i don't know) to update the row. Is there any way I can achieve this?
2. Rename this table to all_employee.
I also know that its not allowed to rename a table in cassandra. So I am trying to copy the data of table(employee) to csv and copy from csv to new table (all_employee) and deleting the old table(employee). I am doing this through an automated script with cql queries in it and script works fine but will fail if it gets executed again(Which i can not restrict) since the table employee will not be there once its deleted. Essentially I am looking for "If exists" clause in COPY query which is not supported in cql. Is there any other way I can achieve the outcome?
Please note that the amount of data in the table is very small so performance in not an issue.
For #1
I dont think cassandra support default column . You need to do that from your appliaction. Write some default value every time you insert a row.
For #2
you can check if the table exists before trying to copy from it.
SELECT your_table_name FROM system_schema.tables WHERE keyspace_name='your_keyspace_name';

How to retrieve a very big cassandra table and delete some unuse data from it?

I hava created a cassandra table with 20 million records. Now I want to delete the expired data decided by one none primary key column. But it doesn't support the operation on the column. So I try to retrieve the table and get the data line by line to delete the data.Unfortunately,it is too huge to retrieve. Otherwise,I couldn't delete the whole table, how could I achieve my goal?
Your question is actually, how to get the data from the table in bulks (also called pagination).
You can do that by selecting different slices from your primary key: For example, if your primary key is some sort of ID, select a range of IDs each time, process the results and do whatever you want to do with them, then get the next range, and so on.
Another way, which depends on the driver you're working with, will be to use fetch_size. You can see a Python example here and a Java example here.

How to arrange data in Cassandra to get data in last in first out format

As we cannot sort data in Cassandra, I wanted to store data in such format that when I retrieve the data, I need to get data in ' last in first out format ' i.e if user enter comments when I retrieve data, I should first get very latest comment first and then older comments. I think it's something to do with comparator.
I have set following when configuring Cassandra:
assume posts comparator as utf8;
assume posts validator as utf8;
assume posts keys as utf8;
Please help - how should I create the column to arrange data in time format so that latest data is stored first?
Columns in a row are always sorted, and you can iterate over the columns in a row in reverse order. Given these two facs we could model the situation you're describing by storing comments in a column family called "comments" where the row key is the post ID, and the columns represent the comments to the corresponding post. The columns are timestamts (either ISO formatted dates, UNIX timestamps or time UUIDs) and the values are the comment text bodies.
If you would now get the columns for a row and specify that you wanted them in reverse order you would get what you want. How to specify reverse order depends on your driver, but it's usually just an option to the command that retrieves a row, or a column slice.
Another way, which is more hackish, would be to take the UNIX timestamp of a post, and subtract it from a large integer, like 2^31, and use that as column key. That way columns would sort in reverse order by default. It's not pretty and the above method is more elegant.
If you worry about using timestamps because there could be collisions where two comments are posted at exactly the same time, use Cassandra's time UUID type.
You need to organize your data such that the comparator is a timestamp. You store your data in natural order and specify reverse order in your slice query.

how to implement fixed number of (timeuuid) columns in cassandra (with CQL)?

Here is an example use case:
You need to store last N (let's say 1000 as fixed bucket size) user actions with all details in timeuuid based columns.
Normally, each users' actions are already in "UserAction" column family where user id as row key, and actions in timeuuid columns. You may also have "AllActions" column family which stores all actions with same timeuuid as column name and user id as column value. It's basically a relationship column family but unfortunately without any details of user actions. Querying with this column family is expensive I guess, because of random partioner. On the other hand, if you store all details in "AllActions" CF then cassandra can't handle that big row properly at one point. This is why I want to store last N user actions with all details in fixed number of timeuuid based columns.
Maybe you may have a better design solution for this use case... I like to hear that ...
If not, the question is how to implement fixed number of (timeuuid) columns in cassandra (with CQL) effectively?
After insertion we could delete old (overflow) columns if we had some sort of range support in cql's DELETE. AFAIK there is no support for this.
So, any idea? Thanks in advance...
IMHO, this is something that C* must handle itself like compaction. It's not a good idea to handle this on client side.
Maybe, we need some configuration (storage) options for column families to make them suitable for "most recent data".

Time UUID type in pycassa

I'm having problems with using the time_uuid type as a key in my columnfamily. I want to store my records, and have them ordered by when they were inserted, and then I figured that the time_uuid is a good way to go. This is how I've set up my column family:
sys.create_column_family("keyspace", "records", comparator_type=TIME_UUID_TYPE)
When I try to insert, I do this:
q=pycassa.ColumnFamily(pycassa.connect("keyspace"), "records")
myKey=pycassa.util.convert_time_to_uuid(datetime.datetime.utcnow())
q.insert(myKey,{'somedata':'comevalue'})
However, when I insert data, I always get an error:
Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number.
If I change the comparator_type to UTF8_TYPE, it works, but the order of the items when returned are not as they should be. What am I doing wrong?
The problem is that in your data model, you are using the time as a row key. Although this is possible, you won't get a meaningful ordering unless you also use the ByteOrderedPartitioner.
For this reason, most people insert time-ordered data using the time as a column name, not a row key. In this model, your insert statement would look like:
q.insert(someKey, {datetime.datetime.utcnow(): 'somevalue'})
where someKey is a key that relates to the entire time series that you're inserting (for example, a username). (Note that you don't have to convert the time to UUID, pycassa does it for you.) To store something more than a single value, use a supercolumn or a composite key.
If you really want to store the time in your row keys, then you need to specify key_validation_class, not comparator_type. comparator_type sets the type of the column names, while key_validation_class sets the type of the row keys.
sys.create_column_family("keyspace", "records", key_validation_class=TIME_UUID_TYPE)
Remember the rows will not be sorted unless you also use the ByteOrderedPartitioner.
The comparator for a column family is used for ordering the columns within each row. You are seeing that error because 'somedata' is valid utf-8 but not a valid uuid.
The ordering of the rows stored in cassandra is determined by the partitioner. Most likely you are using RandomPartitioner which distributes load evenly across your cluster but does not allow for meaningful range queries (the rows will be returned in a random order.)
http://wiki.apache.org/cassandra/FAQ#range_rp

Resources