Time UUID type in pycassa - cassandra

I'm having problems with using the time_uuid type as a key in my column family. I want to store my records and have them ordered by when they were inserted, and I figured that time_uuid is a good way to go. This is how I've set up my column family:
sys.create_column_family("keyspace", "records", comparator_type=TIME_UUID_TYPE)
When I try to insert, I do this:
q = pycassa.ColumnFamily(pycassa.connect("keyspace"), "records")
myKey = pycassa.util.convert_time_to_uuid(datetime.datetime.utcnow())
q.insert(myKey, {'somedata': 'somevalue'})
However, when I insert data, I always get an error:
Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number.
If I change the comparator_type to UTF8_TYPE, it works, but the order of the items when returned is not as it should be. What am I doing wrong?

The problem is that in your data model, you are using the time as a row key. Although this is possible, you won't get a meaningful ordering unless you also use the ByteOrderedPartitioner.
For this reason, most people insert time-ordered data using the time as a column name, not a row key. In this model, your insert statement would look like:
q.insert(someKey, {datetime.datetime.utcnow(): 'somevalue'})
where someKey is a key that relates to the entire time series that you're inserting (for example, a username). (Note that you don't have to convert the time to a UUID; pycassa does it for you.) To store something more than a single value, use a supercolumn or a composite key.
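For instance, here is a minimal sketch of that model, mirroring the connection style from the question (the keyspace, column family, and row key names are made up):

import datetime
import pycassa
from pycassa.system_manager import SystemManager, TIME_UUID_TYPE

# TimeUUID comparator: column names are TimeUUIDs, sorted by time within a row
sys = SystemManager()
sys.create_column_family("keyspace", "records", comparator_type=TIME_UUID_TYPE)

records = pycassa.ColumnFamily(pycassa.connect("keyspace"), "records")

# one row per series (e.g. per user); each insert adds a new TimeUUID column
records.insert("someUser", {datetime.datetime.utcnow(): "first value"})
records.insert("someUser", {datetime.datetime.utcnow(): "second value"})

# columns come back as an OrderedDict sorted by insertion time
events = records.get("someUser")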
If you really want to store the time in your row keys, then you need to specify key_validation_class, not comparator_type. comparator_type sets the type of the column names, while key_validation_class sets the type of the row keys.
sys.create_column_family("keyspace", "records", key_validation_class=TIME_UUID_TYPE)
Remember the rows will not be sorted unless you also use the ByteOrderedPartitioner.

The comparator for a column family is used for ordering the columns within each row. You are seeing that error because 'somedata' is valid UTF-8 but not a valid UUID.
The ordering of the rows stored in Cassandra is determined by the partitioner. Most likely you are using the RandomPartitioner, which distributes load evenly across your cluster but does not allow for meaningful range queries (the rows will be returned in a random order).
http://wiki.apache.org/cassandra/FAQ#range_rp

Related

Suggestions for the best schema for a Cassandra table?

I want to have a table in Cassandra that has a partition key, say column 'A', and a column, say 'B', which is of 'set' type and can have up to 10,000 elements in the set.
But when I retrieve a row from this table, the whole set is retrieved at once, and because of that the JVM heap increases rapidly. So should I stick to this schema, or go with another schema where 'A' is the partition key and I make a dynamic column for each element of the set, say 'B1', 'B2', ..., 'B10000', where each of these columns is a clustering key?
Which schema is best suited and will give optimal performance? Please recommend.
NOTE: cqlsh v5.0.1
Based on what you've described, and the documentation I've read, I would not create a collection with 10k elements. Instead I would have two tables: one with everything but the collection, and a second that uses the primary key values of the first table as its partition key columns, adding the element name (or whatever you can use to identify an individual element) as a clustering column.
So for a given query, if you wanted everything for a particular primary key value (including all elements), you'd query the first table with the primary key, grab whatever you need, then hit the second table as well, looping/fetching through all elements.
If the query only provides a filter on the partition key (not the full primary key, i.e., retrieving multiple rows), the first query would have to retrieve all columns that make up the primary key for each row, and then query the second table in a nested loop: an outer loop over each primary key record retrieved from the first table, and an inner loop to grab all elements for each of those records.
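For example, a minimal sketch of that two-table layout in CQL (all table and column names here are made up):

-- first table: everything except the collection
CREATE TABLE items (
    a text PRIMARY KEY,
    other_data text
);

-- second table: one row per element, clustered under the same partition key
CREATE TABLE item_elements (
    a text,
    element_id text,
    element_value text,
    PRIMARY KEY (a, element_id)
);

-- fetch the row, then loop/fetch through its elements
SELECT * FROM items WHERE a = 'some-key';
SELECT * FROM item_elements WHERE a = 'some-key';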
That's probably the best way to go, and how I would tackle this.
Does that make sense?
-Jim

How to have unique key except primary key in cassandra?

I am not good at English!
There is a table in Cassandra 3.5 in which all the columns of a row don't arrive at the same time. The table's uniqueness is defined by several columns that together must be unique in a row, but some of them are null at first. I cannot make them the primary key because of the null values. I have defined a column named id of type uuid in Cassandra to identify each row.
How can I enforce uniqueness across those columns together in Cassandra?
Is my data model correct?
How can I solve this problem?
You can't. It's not a relational DB. Use clustering and/or partitioning keys to add a unique constraint.
See this answer
To store unique values, create a separate table with your unique value as its key. Check whether the value exists by querying this table before inserting a row. But beware: even doing this, you cannot ensure the value will be unique in your final table if you have two concurrent inserts.
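A minimal sketch of that guard table in CQL (table, column, and value names are made up):

CREATE TABLE unique_values (
    value text PRIMARY KEY,
    id uuid
);

-- in the application, check first...
SELECT value FROM unique_values WHERE value = 'some-unique-value';
-- ...and only if nothing came back, claim the value and insert the real row
INSERT INTO unique_values (value, id) VALUES ('some-unique-value', uuid());
-- as noted above, two concurrent writers can still both pass this check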
Basically, I would recommend using Cassandra as what it really is: a data store. And find a way to implement your business logic where it belongs: in your code.

How to make Cassandra have a varying column key for a specific row key?

I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply you can have varying column keys in Cassandra for a given row key. Is that true? And if it is true, how do you allow for varying column keys?
The reason I think this might be true is because, say, we have a user that can like many items, and we simply want the userId to be the row key. We let this row key (userId) map to all the items that specific user might like. Each specific user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemId each user likes, then we could solve the problem that way.
Therefore, is it possible to have a varying number of column keys in Cassandra for a specific row key? (And how do you do it?)
Providing an example and/or some cql code would be awesome!
The thing that is confusing me is that I have seen some .cql files that define keyspaces beforehand, and it seems pretty inflexible to make them dynamic, i.e., to allow additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
    test blob,
    tid timeuuid,
    result text,
    PRIMARY KEY (test, tid)
);
How can this even allow growing columns? Don't we need to specify the names beforehand anyway, or can we add additional custom columns as the application desires?
Yes, you can have a varying number of columns per row key. From a relational perspective it's not obvious, but tid acts as a placeholder for the varying column key: each distinct tid value in the insert statements below creates a new column under the same row key.
CREATE TABLE IF NOT EXISTS results (
    test blob,
    tid timeuuid,
    data text,
    PRIMARY KEY (test, tid)
);
So in your example, you need to identify the row_key, column_key, and payload of the table.
The primary key contains both the row_key and column_key.
test is your row_key.
tid is your column_key.
data is your payload.
The following inserts are all valid:
INSERT INTO your_keyspace.results (test, tid, data) VALUES (textAsBlob('row_key_1'), a4a70900-24e1-11df-8924-001ff3591711, 'blob_1');
INSERT INTO your_keyspace.results (test, tid, data) VALUES (textAsBlob('row_key_1'), a4a70900-24e1-11df-8924-001ff3591712, 'blob_2');
-- notice that the column key (tid) changed but the row key (test) remained the same
INSERT INTO your_keyspace.results (test, tid, data) VALUES (textAsBlob('row_key_2'), a4a70900-24e1-11df-8924-001ff3591711, 'blob_3');
See here
Have you thought of exploring collection support in Cassandra for handling such relations in a colocated way (e.g., on the same data node)?
Not sure if it helps, but what about keeping the user id as the row key and a map containing the item id as the key and some value?
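If it does, a minimal sketch of that idea in CQL (table and column names are made up):

CREATE TABLE user_likes (
    user_id text PRIMARY KEY,
    -- item id mapped to some value
    liked_items map<text, text>
);

-- each liked item becomes one entry in the map
UPDATE user_likes SET liked_items['item42'] = 'some value' WHERE user_id = 'user1';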
-Vivel

Cassandra: change type from UUID to TIMEUUID

I'm trying to change the type of a column from UUID to TIMEUUID, but I'm unable to do so.
ALTER TABLE Person ALTER KEY TYPE timeuuid;
Error:
Bad Request: Cannot change key from type uuid to type timeuuid: types are incompatible.
Any idea on how to achieve this without losing the data from the column family?
It is not allowed. TimeUUIDs have a different sort order than regular UUIDs, so Cassandra does not allow the change. If you don't care about columns being sorted by time, you can generate and use TimeUUIDs in your application without changing the column type in Cassandra, but the sorting may be out of order when mixed with existing columns. If you absolutely need the type change, your only option is to migrate your existing data to a new column family with the desired type.
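A hedged sketch of such a migration using cqlsh's COPY command, assuming the existing values really are version-1 (time-based) UUIDs, since anything else cannot be stored in a timeuuid column (the new table name is made up):

-- new column family with the desired key type; repeat the other columns from Person
CREATE TABLE person_v2 (
    key timeuuid PRIMARY KEY
);

-- export from the old column family and import into the new one
COPY Person TO 'person.csv';
COPY person_v2 FROM 'person.csv';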

How to choose Azure Table PartitionKey and RowKey for a table that already has a unique attribute

My entity is a key-value pair. 90% of the time I'll be retrieving the entity based on its key, but 10% of the time I'll also do a reverse lookup, i.e., I'll search by value and get the key.
The key and value both are guaranteed to be unique and hence their combination is also guaranteed to be unique.
Is it correct to use Key as PartitionKey and Value as RowKey?
I believe this will also ensure that my data is perfectly load balanced between servers since the PartitionKey is unique.
Are there any problems in the above decision?
Under any circumstance, is it practical to have a hard-coded partition key, i.e., all rows having the same partition key while keeping the RowKey unique?
Is it doable? Yes, but depending on the size of your data, I'm not so sure it's a good idea. When you query on the partition key, Table Store can go directly to the exact partition and retrieve all your records. If you query on the RowKey alone, Table Store has to check whether the row exists in every partition of the table. So if you have 1000 key-value pairs, searching by your key will read a single partition/row; if you search by your value alone, it will read all 1000 partitions!
I faced a similar problem and solved it in two ways:
Have two different tables, one with your key as PartitionKey, the other with your value as PartitionKey. Storage is cheap, so duplicating the data shouldn't cost much (see the sketch after this list).
(What I finally did) If you're effectively returning single entities based on a unique key, just stick them in blobs (partitioned and pivoted as in point 1), because you don't need to traverse a table, so don't.
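A minimal sketch of the first approach with the azure-data-tables Python SDK; the table names and connection string are made up, and error handling is omitted:

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
by_key = service.create_table_if_not_exists("pairsByKey")
by_value = service.create_table_if_not_exists("pairsByValue")

def insert_pair(key, value):
    # write the pair in both directions so each lookup is a single-partition read
    by_key.create_entity({"PartitionKey": key, "RowKey": value})
    by_value.create_entity({"PartitionKey": value, "RowKey": key})

def get_value(key):
    # the 90% case: a direct lookup within one partition
    entity = next(iter(by_key.query_entities(f"PartitionKey eq '{key}'")))
    return entity["RowKey"]

def get_key(value):
    # the 10% reverse lookup hits the mirrored table instead of scanning every partition
    entity = next(iter(by_value.query_entities(f"PartitionKey eq '{value}'")))
    return entity["RowKey"]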
