Converting 128 bit int for row key in Cassandra

If I wish to have a comparable 128 bit integer equivalent as a row key in Cassandra, what data type is the most efficient to process this? ASCII using the full 8-bit range?
I need to be able to select row slices and ranges.

Row keys are not compared if you use the RandomPartitioner (the piece that determines how keys get distributed around the cluster).
If you want to compare row keys, use an order-preserving partitioner ... but that will surely lead to an unbalanced cluster and crashes.
Column names do get compared, though, against the other column names inside the same row.
So my advice is: bucket your values into number intervals, one row per interval, and insert your columns with a LongType column name, as in the sketch below.
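A minimal pycassa sketch of that bucketing idea, under assumptions: the keyspace, column family, connection settings, and bucket width are all placeholders, and LongType limits column names to 64 bits (a full 128-bit value would need IntegerType instead).
import time
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.system_manager import SystemManager, LONG_TYPE

# one-time schema setup: column names compared numerically as longs
sys_mgr = SystemManager('localhost:9160')
sys_mgr.create_column_family('keyspace', 'buckets', comparator_type=LONG_TYPE)

cf = ColumnFamily(ConnectionPool('keyspace'), 'buckets')

BUCKET = 1000000                       # interval covered by one row (an assumption)
value = 12345678                       # the number being indexed
row_key = str(value // BUCKET)         # which interval / row it falls into
cf.insert(row_key, {value: 'payload'})

# range slice over columns inside one bucket, returned in numerical order
cf.get(row_key, column_start=12000000, column_finish=12999999)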

Probably just use the raw byte[] representation of the int and avoid any conversion; comments above from le douard notwithstanding.

Raw byte[] comparison is not going to sort columns in numerical order. If that's what you want, you should use varint (CQL) / IntegerType (Thrift).
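For what it's worth, a hedged sketch of how a 128-bit Python int could be packed into the variable-length, big-endian two's-complement byte form that IntegerType/varint compares numerically (the same layout as Java's BigInteger.toByteArray). With CQL drivers you normally never do this by hand, since the driver serializes varint values for you; the helper name here is made up for illustration.
def pack_varint(n):
    # variable-length two's-complement big-endian encoding,
    # using the minimal number of bytes while leaving room for the sign bit
    length = max(1, (n.bit_length() + 8) // 8)
    return n.to_bytes(length, 'big', signed=True)

key = pack_varint(2 ** 127 - 1)   # a 128-bit value, packed into 16 bytes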

Related

How to store Bert embeddings in cassandra

I want to use Cassandra as a feature store for precomputed BERT embeddings.
Each row would consist of roughly 800 floating-point values (e.g. -0.18294132). Should I store all 800 in one large string column or in 800 separate columns?
The read pattern is simple: on read we want every value in a row. I'm not sure which would be better for serialization speed.
Having everything as a separate column will be quite inefficient - each value will carry its own metadata (writetime, for example) that adds significant overhead (at least 8 bytes per value). Storing the data as a string will also not be very efficient, and it adds complexity on the application side.
I would suggest storing the data as a frozen list of ints/longs or doubles/floats, depending on your requirements. Something like:
create table ks.bert(
rowid int primary key,
data frozen<list<int>>
);
In this case, the whole list will be effectively serialized as binary blob, occupying just one cell.
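A minimal sketch of writing and reading such a row with the DataStax Python driver (cassandra-driver); the contact point is an assumption, the table is the one defined above, and for float embeddings the column type would be frozen<list<float>> rather than frozen<list<int>>.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

embedding = [12, -18, 7]   # toy data; a real row would hold ~800 values
session.execute(
    "INSERT INTO ks.bert (rowid, data) VALUES (%s, %s)",
    (1, embedding))

# the whole list comes back from a single cell, deserialized to a Python list
row = session.execute("SELECT data FROM ks.bert WHERE rowid = %s", (1,)).one()
print(row.data)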

How do I most effectively compress my highly-unique columns?

I have a Spark DataFrame consisting of many double columns that are measurements, but I want a way of annotating each unique row by computing a hash of several other non-measurement columns. This hash results in garbled strings that are highly unique, and I've noticed my dataset size increases substantially when this column is present. How can I sort / lay out my data to decrease the overall dataset size?
I know that the Snappy compression protocol used on my parquet files executes best upon runs of similar data, so I think a sort over the primary key could be useful, but I also can't coalesce() the entire dataset into a single file (it's hundreds of GB in total size before the primary key creation step).
My hashing function is SHA2(128) FYI.
If you have a column that can be computed from the other columns, then simply omit that column before compression, and reconstruct it after decompression (see the sketch below).
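A hedged PySpark sketch of that idea, with made-up column names (id_col1, id_col2, row_hash), a toy DataFrame, and SHA2 at 256 bits standing in for whatever hash you actually use:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# toy frame standing in for the real measurement data
df = spark.createDataFrame(
    [("a", "x", 1.0), ("b", "y", 2.0)], ["id_col1", "id_col2", "measurement"])

# the derived, highly-unique column
hashed = df.withColumn("row_hash", F.sha2(F.concat_ws("|", "id_col1", "id_col2"), 256))

# before writing: drop the derivable hash so it never reaches parquet
hashed.drop("row_hash").write.mode("overwrite").parquet("/tmp/measurements")

# after reading: recompute it from the same non-measurement columns
restored = (spark.read.parquet("/tmp/measurements")
            .withColumn("row_hash", F.sha2(F.concat_ws("|", "id_col1", "id_col2"), 256)))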

When I use repartition on a dataframe in pyspark it gives one partition a size of zero and merges two types of keys together

Here is an example of it. The cell 44 output shows the count of distinct keys, but when I find the partition sizes in cell 45, it puts 3 and 5 together. Also, on saving, the size of one partition is still zero. Any help would be appreciated.
By default, Spark applies a HashPartitioner to the values of the column overint. Apparently, the values 3 and 5 fall into the same partition after being hashed.
You may want to choose the RangePartitioner instead. Or, in case you need full flexibility, you could write your own custom Partitioner class; however, that is only available on the RDD API, not the Structured API. A sketch of both options follows.
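A hedged PySpark sketch of both options; the column name overint comes from the question, and the partition counts and modulo function are arbitrary choices for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3, "a"), (5, "b"), (7, "c")], ["overint", "val"])

# Range-based partitioning on the DataFrame API: keys are split by sampled
# value ranges instead of hash buckets, so 3 and 5 need not share a partition.
ranged = df.repartitionByRange(3, "overint")

# Full control on the RDD API: supply your own partitioning function,
# the PySpark counterpart of a custom Partitioner class.
pairs = df.rdd.map(lambda row: (row["overint"], row))
custom = pairs.partitionBy(3, lambda key: key % 3)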

Limit on the number of columns in cassandra

Is there any limit on the number of columns in cassandra? I am thinking of using a unix timestamp (converted to TimeUUID) as the column key. In the worst case, I will end up having 86400 columns per row. Is this a good idea?
Having 86,400 columns per row is a piece of cake for Cassandra, as long as your columns are not too big and you don't retrieve all of them at once.
The maximum number of columns per row is 2 billion.
See http://wiki.apache.org/cassandra/CassandraLimitations
A suggestion: for the column name, use integer serialization, which takes just 4 bytes at 1-second precision instead of a UUID's 16 bytes, as long as your timestamps are all unique and 1 s precision is enough.
Column names are sorted, and you can use the unix time as an integer. With this you get fast lookups on columns (see the sketch below).
There is also a timestamp associated with each column, which can be useful to set in some cases. You cannot query on it, but it may provide additional information if needed.
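A minimal pycassa sketch of that suggestion, mirroring the pycassa calls used elsewhere on this page; the keyspace, column family, and row key are placeholders, and INT_TYPE is assumed as the integer comparator:
import time
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.system_manager import SystemManager, INT_TYPE

# one-time schema setup: integer column names, sorted numerically
sys_mgr = SystemManager('localhost:9160')
sys_mgr.create_column_family('keyspace', 'events', comparator_type=INT_TYPE)

cf = ColumnFamily(ConnectionPool('keyspace'), 'events')

now = int(time.time())                    # unix time, 1 s precision
cf.insert('sensor-1', {now: 'payload'})

# because column names are sorted, a time window is a cheap column slice
last_hour = cf.get('sensor-1', column_start=now - 3600, column_finish=now)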
Assuming you're doing that for a good reason, it's totally fine.

Time UUID type in pycassa

I'm having problems with using the time_uuid type as a key in my columnfamily. I want to store my records, and have them ordered by when they were inserted, and then I figured that the time_uuid is a good way to go. This is how I've set up my column family:
sys.create_column_family("keyspace", "records", comparator_type=TIME_UUID_TYPE)
When I try to insert, I do this:
q=pycassa.ColumnFamily(pycassa.connect("keyspace"), "records")
myKey=pycassa.util.convert_time_to_uuid(datetime.datetime.utcnow())
q.insert(myKey,{'somedata':'comevalue'})
However, when I insert data, I always get an error:
Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number.
If I change the comparator_type to UTF8_TYPE, it works, but the order of the items when returned are not as they should be. What am I doing wrong?
The problem is that in your data model, you are using the time as a row key. Although this is possible, you won't get a meaningful ordering unless you also use the ByteOrderedPartitioner.
For this reason, most people insert time-ordered data using the time as a column name, not a row key. In this model, your insert statement would look like:
q.insert(someKey, {datetime.datetime.utcnow(): 'somevalue'})
where someKey is a key that relates to the entire time series that you're inserting (for example, a username). (Note that you don't have to convert the time to UUID, pycassa does it for you.) To store something more than a single value, use a supercolumn or a composite key.
If you really want to store the time in your row keys, then you need to specify key_validation_class, not comparator_type. comparator_type sets the type of the column names, while key_validation_class sets the type of the row keys.
sys.create_column_family("keyspace", "records", key_validation_class=TIME_UUID_TYPE)
Remember the rows will not be sorted unless you also use the ByteOrderedPartitioner.
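A hedged pycassa sketch of reading such a time series back in order, assuming the column-name-as-timestamp model above (TIME_UUID_TYPE comparator) and a placeholder row key; pycassa should accept datetimes as slice bounds and convert them to TimeUUIDs:
import datetime
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

cf = ColumnFamily(ConnectionPool('keyspace'), 'records')

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=1)

# columns come back sorted by their TimeUUID names, i.e. by insertion time
recent = cf.get('someKey', column_start=start, column_finish=end)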
The comparator for a column family is used for ordering the columns within each row. You are seeing that error because 'somedata' is valid UTF-8 but not a valid UUID.
The ordering of the rows stored in Cassandra is determined by the partitioner. Most likely you are using the RandomPartitioner, which distributes load evenly across your cluster but does not allow for meaningful range queries (the rows will be returned in a random order).
http://wiki.apache.org/cassandra/FAQ#range_rp
