How to arrange data in Cassandra to get data in last-in-first-out format

Since we cannot sort data at query time in Cassandra, I want to store data in such a way that when I retrieve it, I get it in last-in-first-out order: if users enter comments, then when I retrieve them I should get the very latest comment first, followed by the older ones. I think it has something to do with the comparator.
I have set following when configuring Cassandra:
assume posts comparator as utf8;
assume posts validator as utf8;
assume posts keys as utf8;
Please help: how should I create the columns to arrange data by time so that the latest data comes first?

Columns in a row are always sorted, and you can iterate over the columns in a row in reverse order. Given these two facts, we can model the situation you're describing by storing comments in a column family called "comments", where the row key is the post ID and the columns represent the comments on the corresponding post. The column names are timestamps (either ISO-formatted dates, UNIX timestamps, or time UUIDs) and the values are the comment text bodies.
If you now get the columns for a row and specify that you want them in reverse order, you get exactly what you're after. How to specify reverse order depends on your driver, but it's usually just an option on the command that retrieves a row or a column slice.
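To make that concrete, here is a minimal pycassa sketch of this model (the keyspace and column family names are hypothetical, and the column family is assumed to have been created with a TimeUUIDType comparator):
import datetime
import pycassa

pool = pycassa.ConnectionPool("blog")              # hypothetical keyspace
comments = pycassa.ColumnFamily(pool, "comments")

# The column name is a time UUID (pycassa converts the datetime for us);
# the value is the comment body.
comments.insert("post-42", {datetime.datetime.utcnow(): "first comment"})

# Newest comments first: slice the row's columns in reverse order.
latest = comments.get("post-42", column_reversed=True, column_count=10)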
Another, more hackish way would be to take the UNIX timestamp of a post, subtract it from a large integer such as 2^31, and use the result as the column name. That way the columns sort in reverse order by default. It's not pretty, and the method above is more elegant.
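For illustration, the reversed-key arithmetic is just this (a sketch, assuming a LongType comparator and the 2^31 offset mentioned above):
import time

# Later timestamps yield smaller column names, so the columns
# sort newest-first under the default ascending order.
reversed_name = 2**31 - int(time.time())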
If you worry that plain timestamps could cause collisions when two comments are posted at exactly the same time, use Cassandra's time UUID type.

You need to organize your data such that the comparator is a timestamp. You store your data in natural order and specify reverse order in your slice query.

Handling the following use case in Cassandra?

I've been given the task of modelling a simple feed system in Cassandra. Coming from an almost solely SQL background, though, I'm having a bit of trouble figuring it out.
Basically, we have a list of feeds that we're listening to that update periodically. This can be in RSS, JSON, ATOM, XML, etc (depending on the feed).
What we want to do is periodically check for new items in each feed, convert the data into a few formats (i.e. JSON and RSS) and store that in a Cassandra store.
So, in an RDBMS, the structure would be something akin to:
Feed:
    feedId
    name
    URL
FeedItem:
    feedItemId
    feedId
    title
    json
    rss
    created_time
I'm confused as to how to model that data in Cassandra to facilitate simple things such as getting the latest x items for a specific feed in descending created_time order (which is probably the most common query).
I've heard of one strategy that mentions using a composite column name storing, in this example, the created_time as a time-based UUID together with the feed item ID, but I'm still a little confused.
For example, let's say I have a series of rows whose key is basically the feedId. Inside each row, I store a range of columns as mentioned above. The question is: where does the actual data go (i.e. JSON, RSS, title)? Would I have to store all the data for that 'record' as the column value?
I think I'm confusing wide rows and narrow (short?) rows as I like the idea of the composite key but I also want to store other data with each record and I'm not sure how to meld the two together...
You can store everything in one column family. However, if the data for each FeedItem is very large, you can split the data for each FeedItem into another column family.
For example, you can have one column family for Feed, where the columns of each key are FeedItem ids, something like:
Feeds                              # column family
    FeedId1                        # key
        time-stamp-1-feed-item-id1 # columns have no value, or values are enough info
        time-stamp-2-feed-item-id2 # to show summary info in a results list
The Feeds column family allows you to quickly get the last N items from a feed, and querying for them doesn't require fetching all the data for each FeedItem: either nothing is fetched at all, or just a summary.
Then you can use another column family to store the actual FeedItem data:
FeedItems              # column family
    feed-item-id1      # key
        rss            # 1 column for each field of a FeedItem
        title          #
        ...
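To sketch how reads work against this layout (hedged: the keyspace name is made up, and the Feeds columns are assumed to be time UUIDs whose values hold the FeedItem ids):
import pycassa

pool = pycassa.ConnectionPool("feeds_ks")          # hypothetical keyspace
feeds = pycassa.ColumnFamily(pool, "Feeds")
feed_items = pycassa.ColumnFamily(pool, "FeedItems")

# Last 10 item ids for a feed, newest first, without touching FeedItems.
recent = feeds.get("FeedId1", column_reversed=True, column_count=10)

# Fetch the full records only for the items we actually need to display.
items = feed_items.multiget(list(recent.values()))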
Given your SQL background, you should find CQL easier to understand.
Cassandra (and NoSQL in general) is very fast, and you don't get real benefits from using a separate table for feeds; in any case you will not be able to do JOINs. Obviously you can still create two tables if that's more comfortable for you, but then you will have to manage the links between them in your application code.
You can use something like:
CREATE TABLE FeedItem (
    feedItemId ascii PRIMARY KEY,
    feedId ascii,
    feedName ascii,
    feedURL ascii,
    title ascii,
    json ascii,
    rss ascii,
    created_time ascii
);
Here I used ascii fields for everything. You can choose different data types for feedItemId or created_time (the available data types are listed in the CQL documentation), and depending on which language and client you are using, they can be transparent to work with or require some extra effort to make them work.
You may want to add some secondary indexes. For example, if you want to search for feed items from a specific feedId, something like:
SELECT * FROM FeedItem where feedId = '123';
To create the index:
CREATE INDEX FeedItem_feedId ON FeedItem (feedId);
Sorting and ordering, alas, are not easy in Cassandra. How to approach it really depends on the Cassandra version you're going to use, so check the documentation for your version to see where to start looking.

How to implement a fixed number of (timeuuid) columns in Cassandra (with CQL)?

Here is an example use case:
You need to store the last N (let's say 1000, as a fixed bucket size) user actions, with all their details, in timeuuid-based columns.
Normally, each user's actions are already in a "UserAction" column family, with the user id as the row key and the actions in timeuuid columns. You may also have an "AllActions" column family which stores all actions, with the same timeuuid as the column name and the user id as the column value. It's basically a relationship column family, but unfortunately without any details of the user actions. Querying with this column family is expensive, I guess, because of the random partitioner. On the other hand, if you store all the details in the "AllActions" CF, then at some point Cassandra can't handle that big row properly. This is why I want to store the last N user actions, with all details, in a fixed number of timeuuid-based columns.
Maybe you have a better design solution for this use case... I'd like to hear it...
If not, the question is: how do you implement a fixed number of (timeuuid) columns in Cassandra (with CQL) effectively?
After insertion we could delete the old (overflow) columns if we had some sort of range support in CQL's DELETE. AFAIK there is no support for this.
So, any idea? Thanks in advance...
IMHO, this is something that C* should handle itself, like compaction. It's not a good idea to handle this on the client side.
Maybe we need some configuration (storage) options on column families to make them suitable for "most recent data" workloads.
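Until something like that exists, the usual workaround is to trim on the client side after inserts. A minimal pycassa sketch, assuming the "UserAction" layout from the question; since there is no range DELETE, the overflow columns are removed by name:
import pycassa

pool = pycassa.ConnectionPool("app")               # hypothetical keyspace
actions = pycassa.ColumnFamily(pool, "UserAction")

N = 1000  # the fixed bucket size from the question

def trim_user_actions(user_id):
    # Read a little past the bucket size, newest first; everything
    # after the Nth column is overflow.
    row = actions.get(user_id, column_reversed=True, column_count=N + 100)
    overflow = list(row.keys())[N:]
    if overflow:
        actions.remove(user_id, columns=overflow)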

UniData UniQuery - two WITH

Alright, I have little to no knowledge of the SQL language, and am wondering what the possible reasons are for the slowness of two WITHs vs one WITH in UniData.
Database has around ~1 million rows.
I.e.:
SELECT somewhere WITH Column1 = "str" AND WITH Column2 = "Int"   (takes over 5 minutes)
compared to:
SELECT somewhere WITH Column1 = "str"   (~1 second)
somewhere is indexed (as far as I know), so is there anything I'm doing wrong?
If more information is required just ask, not sure what to supply.
Also, what's the difference between WITH and WHERE?
This isn't SQL, it is UniQuery.
To clarify it for you, you can't index the file (somewhere, in this case), only the columns of the file. You might find Column1 is indexed and Column2 is not. Type in LIST.INDEX somewhere to find out what columns have been indexed.
For your question: you have only compared selecting on Column1 against selecting on Column1 & Column2, and assumed the vastly slower response is purely because you selected on two columns. Your next test should have been to select only on Column2 and see how slow that was.
There are many possible reasons for the difference in response, aside from indexing. In UniData, columns are defined as 'dictionary items'. There are different types of dictionary items. The most basic is the D-type dictionary item, which is just a direct reference to a field in the record. Another type is the I- or V-type, which is a derived field. A derived field can be as simple as returning a constant, or as complex as performing the equivalent of a JOIN with another file and/or some form of complex calculation. From this it should be simple to see that different columns can take vastly different amounts of processing to handle.
Other reasons are how deep in the record the column is (references to the first field will be faster than references to fields later in the record), as well as potential query caching that can affect the timings of your SELECTs.
For more information, check out the database's manuals at Rocket Software.
A single-column SELECT on an indexed field will not even require that any data file records are read. If you look under the hood, you'll see that the index file is a normal hash file, and the single-column SELECT simply means that the index file record with the key "str" is read. This can return thousands and thousands of keys in less than a second.
Once you add the second column, you are probably forcing the system to read all of those thousands and thousands of records, EVEN IF THE SECOND COLUMN IS INDEXED. This is going to take measurably more time.
In general, an index on a field with a small number of unique values is of dubious use. If the second column contains data that has a large number of possible values, leading to a smaller number of records with each particular index value, then it would be best to arrange the SELECT such that the index used is on the second column. I'm not sure, but it might be possible to simply reverse the order of the columns in the SELECT statement to do this. Otherwise you might need to run two SELECT statements back to back.
As an example, assume that the file has 600,000 records with Column1 = "str", and 2,000 records with Column2 = "int":
>SELECT somewhere WITH Column2 = "int"
>>SELECT somewhere WITH Column1 = "str"
The first SELECT pulls the 2,000 keys from the index, and the second only has to read those 2,000 records (it runs against the active select list), so the pair should return almost instantly.
If the combination of Column1 and Column2 is something that you'll be SELECTing on frequently, then you might want to create a new dictionary item that combines the two, and build an index on that.
That being said, it shouldn't take a U2 system 5 minutes to run through a file of a million records. There's a very good chance that the file has become badly overflowed, and needs to be resized with a larger modulo to improve performance.

Time UUID type in pycassa

I'm having problems using the time_uuid type as a key in my column family. I want to store my records and have them ordered by when they were inserted, and I figured that time_uuid is a good way to go. This is how I've set up my column family:
sys.create_column_family("keyspace", "records", comparator_type=TIME_UUID_TYPE)
When I try to insert, I do this:
q = pycassa.ColumnFamily(pycassa.connect("keyspace"), "records")
myKey = pycassa.util.convert_time_to_uuid(datetime.datetime.utcnow())
q.insert(myKey, {'somedata': 'somevalue'})
However, when I insert data, I always get an error:
Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number.
If I change the comparator_type to UTF8_TYPE it works, but the order of the items when returned is not as it should be. What am I doing wrong?
The problem is that in your data model, you are using the time as a row key. Although this is possible, you won't get a meaningful ordering unless you also use the ByteOrderedPartitioner.
For this reason, most people insert time-ordered data using the time as a column name, not a row key. In this model, your insert statement would look like:
q.insert(someKey, {datetime.datetime.utcnow(): 'somevalue'})
where someKey is a key that relates to the entire time series that you're inserting (for example, a username). (Note that you don't have to convert the time to a UUID; pycassa does it for you.) To store something more than a single value, use a supercolumn or a composite key.
If you really want to store the time in your row keys, then you need to specify key_validation_class, not comparator_type. comparator_type sets the type of the column names, while key_validation_class sets the type of the row keys.
sys.create_column_family("keyspace", "records", key_validation_class=TIME_UUID_TYPE)
Remember the rows will not be sorted unless you also use the ByteOrderedPartitioner.
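Putting both corrections together, a minimal sketch (same keyspace and column family names as in the question):
import datetime
import pycassa
from pycassa.system_manager import SystemManager, TIME_UUID_TYPE

sys_mgr = SystemManager()
sys_mgr.create_column_family("keyspace", "records",
                             comparator_type=TIME_UUID_TYPE)

pool = pycassa.ConnectionPool("keyspace")
records = pycassa.ColumnFamily(pool, "records")

# The time goes in the column name; pycassa converts the datetime
# to a v1 UUID automatically.
records.insert("someKey", {datetime.datetime.utcnow(): "somevalue"})

# Columns come back oldest-first by default; reverse for latest-first.
latest = records.get("someKey", column_reversed=True, column_count=20)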
The comparator for a column family is used for ordering the columns within each row. You are seeing that error because 'somedata' is valid UTF-8 but not a valid UUID.
The ordering of the rows stored in Cassandra is determined by the partitioner. Most likely you are using RandomPartitioner, which distributes load evenly across your cluster but does not allow for meaningful range queries (the rows will be returned in a random order).
http://wiki.apache.org/cassandra/FAQ#range_rp

Accessing timestamp of a Cassandra column

I am new to Cassandra.
I have a column family where the columns are sorted by "LexicalUUIDType".
How can I access the timestamp of each column in such a column family?
I need the timestamp because I have to read the oldest entry.
I cannot use "TimeUUIDType" for sorting the columns.
Thanks,
It depends on the library you are using, but if you are using the raw Thrift API it's something like this (unreleased 0.7/trunk):
column.column.clock.timestamp
(To get all the data you will have to use get_range_slices, starting with "", and after each call use the last key as the start key for the next call.)
You would have to get back all of the columns using get_slice (http://wiki.apache.org/cassandra/API06#get_slice) and then look at the timestamp field in each one.
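If you are on pycassa rather than raw Thrift, the same idea is simpler: passing include_timestamp=True makes each column come back as a (value, timestamp) pair (the column family and row key below are hypothetical):
import pycassa

pool = pycassa.ConnectionPool("keyspace")          # hypothetical keyspace
cf = pycassa.ColumnFamily(pool, "MyLexicalCF")

row = cf.get("some_row", include_timestamp=True)
# Each column maps to (value, timestamp); pick the oldest by timestamp.
oldest_name, (oldest_value, oldest_ts) = min(row.items(), key=lambda kv: kv[1][1])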
Or you can make another column family, sorted by timeuuid, where each column's value points to the corresponding column in the first CF. Query CF #2 with the time you want, and use the result to read from CF #1.