Murmur3 Hash Algorithm Used in Cassandra

I'm trying to reproduce the Murmur3 hashing used in Cassandra. Does anyone know how to get at the actual hash values used for the row keys? I just need a few key/hash-value pairs from my data to check that my implementation of the hashing is correct.
Alex

Ask Cassandra! Insert some data into your table; afterwards you can use the token function in a select query to get the token values that were used. For example:
select token(id), id from myTable;
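A minimal sketch of pulling those key/token pairs from Python with the DataStax cassandra-driver; the contact point, keyspace, table and column names just follow the example above and are assumptions:

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(['127.0.0.1'])          # assumed contact point
session = cluster.connect('mykeyspace')   # assumed keyspace

# token(id) is the Murmur3 token Cassandra computed for each partition key
for row in session.execute('SELECT token(id), id FROM myTable'):
    print(row[0], row[1])  # compare these (token, key) pairs with your implementation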
A composite partition key is serialized as n byte arrays, one per key element: each is prepended with a short indicating its length and followed by a closing 0 byte. It's unclear to me what these closing zeros are for; it may have something to do with SuperColumns...
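A rough Python sketch of that layout, written purely from the description above (it is not lifted from Cassandra's source, so treat it as an assumption to verify against real token values):

import struct

def serialize_composite_key(*components):
    # For each key element: a 2-byte big-endian length (the "short"),
    # the serialized element itself, then the closing 0 byte.
    out = b''
    for c in components:
        out += struct.pack('>H', len(c)) + c + b'\x00'
    return out

# e.g. a composite key of a text element and a 4-byte int element
key_bytes = serialize_composite_key(b'alice', struct.pack('>i', 42))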

Related

How to get a shuffled result from DynamoDB?

I'm using Node.js to connect to DynamoDB, and I need to render shuffled results on the page from AWS DynamoDB.
Is there any way to get shuffled results from DynamoDB directly, or any efficient way to shuffle them on the server?
I'm assuming that by shuffled you mean to ask DynamoDB to return a random item from your table. In this case the answer is basically: no - there is no such functionality built-in.
However, if you design your partition & range key schema in such a way that you can easily pick a random element, then you can do it client side from your query. Depending on the density of data in the table, this may require multiple queries to actually return a result, but it can be done.
Let's say your results are of the form ABC-123, where ABC is a partition key and 123 is the range key value. Then you could do random partition-key selection on the client and try a query on that key. If the key returns some data, you can select one of the items at random. Again, depending on the density of data in each partition, you may need a second random pick for the query.
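Here is a hedged sketch of that client-side approach (in Python with boto3 for illustration, though the question uses Node.js); the table name, the key attribute name pk, and the list of known partition keys are all placeholders:

import random
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('myTable')  # hypothetical table
known_partition_keys = ['ABC', 'DEF', 'GHI']         # assumed to be known client-side

def random_item(max_attempts=5):
    # Pick a random partition key and query it; retry if that partition is empty.
    for _ in range(max_attempts):
        pk = random.choice(known_partition_keys)
        items = table.query(KeyConditionExpression=Key('pk').eq(pk)).get('Items', [])
        if items:
            return random.choice(items)  # then pick one of its items at random
    return None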
I hope this helps.

Read HBase table salted with Phoenix in Hive with HBase SerDe

I have created an HBase table with a Phoenix SQL create table query and also specified salt_buckets. Salting adds a prefix to the rowkey as expected.
I have created an external Hive table mapped to this HBase table with the HBase SerDe. The problem is that when I query this table by filtering on the rowkey:
where key = "value"
it doesn't work, because I think the salt prefix is also getting fetched for the key. This limits the ability to filter the data on the key. The option:
where rowkey like "%value"
works, but it takes a long time as it likely does a full table scan.
My question is: how can I query this table efficiently on rowkey values in Hive (stripping off the salt prefix)?
Yes, you're correct in saying:
it doesn't work, because I think the salt prefix is also getting fetched for the key.
One way to mitigate this is to use hashing instead of a random prefix, and to prefix the rowkey with the calculated hash. With this technique you can recompute the hash for any rowkey you want to scan for:
mod(hash(rowkey), n), where n is the number of regions, will also remove the hotspotting issue.
Using a random prefix brings in the problem you mentioned in your question.
The option:
where rowkey like "%value"
works, but it takes a long time as it likely does a full table scan.
This is exactly what random prefix salting does: HBase is forced to scan the whole table to get the required value, so it would be better if you could prefix your rowkey with its calculated hash.
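As a small illustration of the hash-prefix idea (plain Python, with md5 standing in for whatever hash you choose; the bucket count is an assumption):

import hashlib

N_BUCKETS = 8  # number of salt buckets / regions (assumption)

def salted_rowkey(rowkey):
    # Deterministic prefix: the same rowkey always lands in the same bucket,
    # so a point lookup can recompute the prefix instead of scanning every bucket.
    bucket = int(hashlib.md5(rowkey.encode()).hexdigest(), 16) % N_BUCKETS
    return '%02d_%s' % (bucket, rowkey)

# At read time, rebuild the full key for an exact lookup:
# salted_rowkey('value')  ->  e.g. '05_value'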
That said, this hashing technique won't work well for range scans.
Now you may ask: why can't I simply replace my rowkey with its hash and store the original rowkey as a separate column?
It may or may not work, but I would be careful about implementing it this way, because HBase is already very sensitive when it comes to column families. Then again, I am not entirely clear on this solution.
You might also want to read this for a more detailed explanation.

Cassandra how to filter hex values in blob field

Consider the following table:
CREATE TABLE associations (
    someHash blob,
    someValue int,
    someOtherField text,
    PRIMARY KEY (someHash, someValue)
) WITH CLUSTERING ORDER BY (someValue ASC);
The inserts to this table have someHash as a hex value, like 0xA0000000000000000000000000000001, 0xA0000000000000000000000000000002, etc.
If a query needs to find all rows whose someHash starts with 0xA0000000000, what's the recommended Cassandra way to do it?
The main problem with your query is that it does not take into account limitations of Cassandra, namely:
someHash is a partition key column
The partition key columns [in WHERE clause] support only two operators: = and IN (i.e. exact match)
In other words, your schema is designed in such a way, that effectively query should say: "let's retrieve all possible keys [from all nodes], let's filter them (type not important here) and then retrieve values for keys that match predicate". This is a full-scan of some sort and is not what Cassandra is best at. You can try using UDFs to do some data transformation (trimming someHash), but I would expect it to work well only with trivial amounts of data.
Golden rule of Cassandra is "query first": if you have such a use-case, schema should be designed accordingly - sub-key you want to query by should be actual partition key (full someHash value can be part of clustering key).
BTW, same limitation applies to most maps in programming: you can't do lookup by part of key (because of hashing).
Following your 0xA0000000000 example directly:
You could split someHash into a 48-bit (6-byte) head and an 80-bit (10-byte) tail.
PRIMARY KEY ((someHash_head, someHash_tail), someValue)
The IN will then have 16 values, from 0xA00000000000 to 0xA0000000000F.
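A quick Python sketch of building that IN list; the split schema (someHash_head, someHash_tail) is the one proposed above, and the driver call is only indicated in a comment:

prefix = 0xA0000000000  # the 44 known bits of the 48-bit head
heads = [(prefix << 4) | i for i in range(16)]  # 0xA00000000000 .. 0xA0000000000F

in_list = ', '.join('0x%012x' % h for h in heads)
query = 'SELECT * FROM associations WHERE someHash_head IN (%s)' % in_list
# session.execute(query)  # against the (someHash_head, someHash_tail) schema above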

Time UUID type in pycassa

I'm having problems using the time_uuid type as a key in my column family. I want to store my records and have them ordered by when they were inserted, and I figured that time_uuid is a good way to go. This is how I've set up my column family:
sys.create_column_family("keyspace", "records", comparator_type=TIME_UUID_TYPE)
When I try to insert, I do this:
q = pycassa.ColumnFamily(pycassa.connect("keyspace"), "records")
myKey = pycassa.util.convert_time_to_uuid(datetime.datetime.utcnow())
q.insert(myKey, {'somedata': 'somevalue'})
However, when I insert data, I always get an error:
Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number.
If I change the comparator_type to UTF8_TYPE, it works, but the order of the items when returned are not as they should be. What am I doing wrong?
The problem is that in your data model, you are using the time as a row key. Although this is possible, you won't get a meaningful ordering unless you also use the ByteOrderedPartitioner.
For this reason, most people insert time-ordered data using the time as a column name, not a row key. In this model, your insert statement would look like:
q.insert(someKey, {datetime.datetime.utcnow(): 'somevalue'})
where someKey is a key that relates to the entire time series that you're inserting (for example, a username). (Note that you don't have to convert the time to UUID, pycassa does it for you.) To store something more than a single value, use a supercolumn or a composite key.
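A minimal sketch of that model with pycassa, reusing the keyspace and column family from the question ('alex' is just a placeholder row key):

import datetime
import pycassa

q = pycassa.ColumnFamily(pycassa.connect('keyspace'), 'records')

# All events of one series go under a single row key; the column names are times.
q.insert('alex', {datetime.datetime.utcnow(): 'first value'})
q.insert('alex', {datetime.datetime.utcnow(): 'second value'})

# Columns come back ordered by their TimeUUID names, oldest first.
for when, value in q.get('alex', column_count=100).items():
    print(when, value)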
If you really want to store the time in your row keys, then you need to specify key_validation_class, not comparator_type. comparator_type sets the type of the column names, while key_validation_class sets the type of the row keys.
sys.create_column_family("keyspace", "records", key_validation_class=TIME_UUID_TYPE)
Remember the rows will not be sorted unless you also use the ByteOrderedPartitioner.
The comparator for a column family is used for ordering the columns within each row. You are seeing that error because 'somedata' is valid utf-8 but not a valid uuid.
The ordering of the rows stored in cassandra is determined by the partitioner. Most likely you are using RandomPartitioner which distributes load evenly across your cluster but does not allow for meaningful range queries (the rows will be returned in a random order.)
http://wiki.apache.org/cassandra/FAQ#range_rp

How to choose Azure Table PartitionKey and RowKey for a table that already has a unique attribute

My entity is a key-value pair. 90% of the time I'll be retrieving the entity based on the key, but 10% of the time I'll also do a reverse lookup, i.e. I'll search by value and get the key.
The key and value both are guaranteed to be unique and hence their combination is also guaranteed to be unique.
Is it correct to use Key as PartitionKey and Value as RowKey?
I believe this will also ensure that my data is perfectly load balanced between servers, since the PartitionKey is unique.
Are there any problems in the above decision?
Under any circumstance, is it practical to have a hard-coded partition key, i.e. all rows have the same partition key while keeping the RowKey unique?
Is it doable? Yes, but depending on the size of your data, I'm not so sure it's a good idea. When you query on the partition key, Table Storage can go directly to the exact partition and retrieve all your records. If you query on the RowKey alone, Table Storage has to check whether the row exists in every partition of the table. So if you have 1000 key-value pairs, searching by your key will read a single partition/row, while searching by your value alone will read all 1000 partitions!
I faced a similar problem, and I solved it in two ways:
Have two different tables, one with your key as the PartitionKey and the other with your value as the PartitionKey. Storage is cheap, so duplicating data shouldn't cost much (see the sketch after this list).
(What I finally did) If you're effectively returning single entities based on a unique key, just stick them in blobs (partitioned and pivoted as in point 1): you don't need to traverse a table, so don't.
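A hedged sketch of option 1 with the azure-data-tables package; the connection string, table names and entity shape are assumptions for illustration:

from azure.data.tables import TableServiceClient

CONNECTION_STRING = '...'  # placeholder
svc = TableServiceClient.from_connection_string(CONNECTION_STRING)
by_key = svc.get_table_client('PairsByKey')      # PartitionKey = key
by_value = svc.get_table_client('PairsByValue')  # PartitionKey = value

def put(key, value):
    # Write the pair twice, once per lookup direction.
    by_key.create_entity({'PartitionKey': key, 'RowKey': value})
    by_value.create_entity({'PartitionKey': value, 'RowKey': key})

def value_for(key):
    # Keys are unique, so the partition holds exactly one entity.
    return next(iter(by_key.query_entities("PartitionKey eq '%s'" % key)))['RowKey']

def key_for(value):
    return next(iter(by_value.query_entities("PartitionKey eq '%s'" % value)))['RowKey']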
