Read HBase table salted with Phoenix in Hive with HBase SerDe

I have created an HBase table with a Phoenix SQL CREATE TABLE query and also specified SALT_BUCKETS. Salting adds a prefix to the rowkey, as expected.
I have also created an external Hive table mapped to this HBase table with the HBase SerDe. The problem is that when I query this table by filtering on rowkey:
where key = "value"
it doesn't work, because I think the salt prefix is part of the fetched key. This limits the ability to filter the data on the key. The option:
where rowkey like "%value"
works, but it takes a long time, as it likely does a full table scan.
My question is: how can I query this table efficiently on rowkey values in Hive (stripping off the salt prefix)?

Yes, you're correct when you say:
it doesn't work, because I think the salt prefix is part of the fetched key.
One way to mitigate this is to use hashing instead of a random prefix, and to prefix the rowkey with the calculated hash. Using this technique, you can recompute the hash for any rowkey you want to scan for:
mod(hash(rowkey), n), where n is the number of regions, removes the hotspotting issue while staying deterministic.
Using a random prefix brings in the problem you mentioned in your question. The option:
where rowkey like "%value"
works, but it takes a long time, as it likely does a full table scan.
This is exactly what random-prefix salting does: HBase is forced to scan the whole table to get the required value. So it would be better if you could prefix your rowkey with its calculated hash.
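As a minimal sketch, suppose you salt the keys yourself at write time as concat(bucket, ':', key), with bucket = pmod(hash(key), 16). A Hive point lookup can then recompute the prefix instead of scanning with LIKE. The table and column names below are assumptions, and this does not match Phoenix's internal salt byte, which uses its own hash function:
-- Hypothetical: rows were written with a self-computed salt prefix
-- "<pmod(hash(key), 16)>:<key>", not with Phoenix's built-in salting.
SELECT *
FROM my_hbase_table
WHERE rowkey = concat(cast(pmod(hash('value'), 16) AS STRING), ':', 'value');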
This hashing technique won't serve you well for range scans, though, since consecutive rowkeys hash to different buckets and end up scattered across regions.
Now you may ask: why can't I simply replace my rowkey with its hash and store the original rowkey as a separate column?
It may or may not work, but I would be wary of implementing it this way, because HBase is already very sensitive when it comes to column families.
But then again, I am not fully clear on this solution myself.
You also might want to read this for more detailed explanation.

Related

Cassandra IN clause for non-primary-key column

I want to use the IN clause on a non-primary-key column in Cassandra. Is it possible? If it is not, is there any alternative or suggestion?
Three possible solutions:
Create a secondary index. This is not recommended due to performance problems.
See if you can designate that column in the existing table as part of the primary key.
Create another denormalised table that is optimised for your query, i.e. model the data by query pattern (see the CQL sketch below).
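As a hedged sketch of option 3 in CQL, assuming a hypothetical table of users that you want to filter by a non-key city column, the denormalised lookup table keyed by that column could look like this:
CREATE TABLE users_by_city (
    city text,     -- the formerly non-primary-key column, now the partition key
    user_id uuid,
    name text,
    PRIMARY KEY (city, user_id)
);
-- IN is allowed here because city is a partition key column
SELECT * FROM users_by_city WHERE city IN ('Paris', 'Oslo');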
Update:
Also, even after you move that column into the primary key, operations with the IN clause can be further optimised. I found this question, cassandra lookup by list of primary keys in java, very useful.

Cassandra: how to filter hex values in a blob field

Consider the following table:
CREATE TABLE associations (
    someHash blob,
    someValue int,
    someOtherField text,
    PRIMARY KEY (someHash, someValue)
) WITH CLUSTERING ORDER BY (someValue ASC);
The inserts to this table have someHash as a hex value, like 0xA0000000000000000000000000000001, 0xA0000000000000000000000000000002, etc.
If a query needs to find all rows with 0xA0000000000, what's the recommended Cassandra way to do it?
The main problem with your query is that it does not take into account limitations of Cassandra, namely:
someHash is a partition key column
The partition key columns [in WHERE clause] support only two operators: = and IN (i.e. exact match)
In other words, your schema is designed in such a way that the query effectively has to say: "retrieve all possible keys [from all nodes], filter them (the type is not important here), and then retrieve the values for the keys that match the predicate". This is a full scan of sorts and is not what Cassandra is best at. You could try using UDFs to do some data transformation (trimming someHash), but I would expect that to work well only with trivial amounts of data.
The golden rule of Cassandra is "query first": if you have such a use case, the schema should be designed accordingly. The sub-key you want to query by should be the actual partition key (the full someHash value can be part of the clustering key).
BTW, the same limitation applies to most hash maps in programming: you can't look up by part of a key (because of hashing).
Following your 0xA0000000000 example directly:
You could split someHash into a 48-bit (6-byte) head and an 80-bit (10-byte) tail, making the head the partition key and the tail a clustering column. (With a composite partition key of both parts, CQL would reject an IN on the head alone, since every partition key column must be restricted.)
PRIMARY KEY (someHash_head, someHash_tail, someValue)
The IN will then have 16 values, from 0xA00000000000 to 0xA0000000000F, since the 11-hex-digit prefix fixes all but the last nibble of the 12-hex-digit head.
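Sketched in CQL (the table name and the head/tail split are assumptions from the description above; note the tail is a clustering column so that the IN on the head alone is valid):
CREATE TABLE associations_split (
    someHash_head blob,   -- first 6 bytes (48 bits) of someHash
    someHash_tail blob,   -- remaining 10 bytes (80 bits)
    someValue int,
    someOtherField text,
    PRIMARY KEY (someHash_head, someHash_tail, someValue)
) WITH CLUSTERING ORDER BY (someHash_tail ASC, someValue ASC);
-- 16 head values cover the 11-hex-digit prefix 0xA0000000000
SELECT * FROM associations_split
WHERE someHash_head IN (0xA00000000000, 0xA00000000001, 0xA00000000002, 0xA00000000003,
                        0xA00000000004, 0xA00000000005, 0xA00000000006, 0xA00000000007,
                        0xA00000000008, 0xA00000000009, 0xA0000000000A, 0xA0000000000B,
                        0xA0000000000C, 0xA0000000000D, 0xA0000000000E, 0xA0000000000F);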

SparkSQL restrict queries by Cassandra partition key ranges

Imagine that my primary key is a timestamp.
I would like to restrict the query by timestamp ranges.
I can't seem to make it work, even when using token(). Also, I can't create a secondary index on the partition key.
How should this be done?
Cassandra doesn't allow range queries on the partition key.
One way of dealing with this problem is changing your schema so that your timestamp value becomes a clustering column. For this to work, you need to introduce a sentinel column as the partition key. See this question for more detailed answers: Range Queries in Cassandra (CQL 3.0)
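A minimal sketch of that schema change, using a hypothetical day bucket as the sentinel partition key so the timestamp becomes a clustering column:
CREATE TABLE events_by_day (
    day text,        -- sentinel/bucket column, e.g. '2016-03-01'
    ts timestamp,    -- clustering column: range predicates now work
    payload text,
    PRIMARY KEY (day, ts)
);
SELECT * FROM events_by_day
WHERE day = '2016-03-01'
  AND ts >= '2016-03-01 00:00:00' AND ts < '2016-03-01 12:00:00';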
Another way is just to let Spark do the filtering. Range queries on the primary key should work in Spark SQL; they simply won't be pushed down to Cassandra, so Spark will fetch all the data and filter it on the Spark side.
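For example, with the Cassandra table registered in Spark SQL (the table name here is an assumption), this range query runs fine; the predicate is simply evaluated by Spark after fetching the data rather than pushed down to Cassandra:
SELECT * FROM cass_events
WHERE ts >= '2016-03-01 00:00:00' AND ts < '2016-03-02 00:00:00';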

Is Cassandra a row-column database?

I'm trying to learn Cassandra, but I'm confused by the terminology.
In many places it says the row stores key/value pairs,
but when I define a table it's more like declaring a SQL table, i.e. you create a table and specify the column names and data types.
Can someone clarify this?
Cassandra is a column-based NoSQL database. While, yes, at its lowest level it does store simple key-value pairs, it stores these key-value pairs in collections. This grouping of keys and collections is analogous to rows and columns in a traditional relational model. Cassandra tables have a schema and can be queried (with restrictions) using a SQL-like language called CQL.
In your comment you ask about apples being stored in a different table from oranges. The answer to that specific question is no, they will be in the same table. However, Cassandra tables have an additional concept, called the partition key, that doesn't really have an analogous concept in the relational world. Take for example the following table definition:
CREATE TABLE fruit_types (
    fruit text,
    location text,
    cost float,
    PRIMARY KEY ((fruit), location)
);
In this table definition you will notice that we are defining the schema for the table. You will also notice that we are defining a PRIMARY KEY. This primary key is similar to, but not exactly like, the relational concept. In Cassandra, the PRIMARY KEY is made up of two parts: the PARTITION KEY and the CLUSTERING COLUMNS. The PARTITION KEY consists of the first field(s) specified in the PRIMARY KEY and can contain one or more fields delimited by parentheses. The PARTITION KEY is hashed to determine the node that owns the data, and is also used to physically divide the information on disk into files. The CLUSTERING COLUMNS are the remaining columns listed in the PRIMARY KEY and, amongst other things, define how the data is physically ordered on disk inside the different files determined by the PARTITION KEY. I suggest you do some additional reading on the PRIMARY KEY here if you're interested in more detail:
https://docs.datastax.com/en/cql/3.0/cql/ddl/ddl_compound_keys_c.html
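To make the PARTITION KEY / CLUSTERING COLUMNS split concrete with the fruit_types table above, a few hedged example queries:
-- Hits exactly one partition (one node owns all 'apple' rows)
SELECT * FROM fruit_types WHERE fruit = 'apple';
-- Narrows down within that partition using the clustering column
SELECT * FROM fruit_types WHERE fruit = 'apple' AND location = 'NY';
-- Invalid without ALLOW FILTERING, because the partition key is not restricted:
-- SELECT * FROM fruit_types WHERE location = 'NY';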
Basically, Cassandra storage is like a sparse matrix. Earlier versions shipped a command-line tool called cassandra-cli which could show the exact storage footprint of your column family (a.k.a. table in later versions). Later, the community decided to keep RDBMS-like syntax for better understanding, because the query language (CQL) syntax is similar to SQL.
The main storage unit is the partition key (the result of a hash function over the chosen partition column(s) of your table), and the rest of the columns are tagged onto it, like a sparse matrix.

How to choose Azure Table PartitionKey and RowKey for a table that already has a unique attribute

My entity is a key-value pair. 90% of the time I'll be retrieving the entity by key, but 10% of the time I'll also do a reverse lookup, i.e. search by value and get the key.
The key and the value are both guaranteed to be unique, so their combination is also guaranteed to be unique.
Is it correct to use the key as PartitionKey and the value as RowKey?
I believe this will also ensure that my data is perfectly load-balanced between servers, since the PartitionKey is unique.
Are there any problems with the above decision?
Under any circumstance, is it practical to have a hard-coded partition key, i.e. all rows have the same partition key, keeping the RowKey unique?
Is it doable? Yes. But depending on the size of your data, I'm not so sure it's a good idea. When you query on PartitionKey, Table Storage can go directly to the exact partition and retrieve all your records. If you query on RowKey alone, Table Storage has to check whether the row exists in every partition of the table. So if you have 1000 key-value pairs, searching by your key will read a single partition/row, while searching by your value alone will read all 1000 partitions!
I faced a similar problem and solved it in two ways:
Have two different tables: one with your key as PartitionKey, the other with your value as PartitionKey. Storage is cheap, so duplicating the data shouldn't cost much.
(What I finally did:) If you're effectively returning single entities based on a unique key, just stick them in blobs (partitioned and pivoted as in point 1); you don't need to traverse a table, so don't.
