How to have unique key except primary key in cassandra? - cassandra

I am not good in English!
There is a table in Cassandra 3.5 which all columns of a row don't come at same time. Unique of table is some columns that are unique in a row together, but some of them are null at first. I can not set them the primary key because of null value. I have identify a column with name id and type uuid in Cassandra.
How can I have a unique key with that columns together in Cassandra?
Is my data model true?
How can I solve this problem?

You can't. It's not a relational DB. Use clustering and/or partitioning keys to add an unique constraint.
See this answer
To store unique values, create a separate table having your unique value as a key. Check if it exists by requesting this table before inserting a row. But beware, even doing this, you cannot ensure it will be unique in your final table if you have two concurrent inserts.
Basically, I would recommend using Cassandra as it really is: A data store. And find a way to implement your business logic where it belongs: in your code.

Related

Cassandra IN clause for non primary key column

I want to use the IN clause for the non-primary key column in Cassandra. Is it possible? if it is not is there any alternate or suggestion?
Three possible solutions
Create a secondary index. This is not recommended due to performance problems.
See if you can designate that column in the existing table as part of the primary key
Create another denormalised table that table is optimised for your query. i.e data model by query pattern
Update:
And also even after you move that to primary key, operations with IN clause can be further optimised. I found this cassandra lookup by list of primary keys in java very useful

Regarding suggestion of best schema for a cassandra table?

I want to have a table in Cassandra that has a partition key say column 'A', and a column say 'B' which is of 'set' type and can have up to 10000 elements in the set.
But when i retrieve a row from this table then the whole set is retrieved at once and because of that the JVM heap increases rapidly. So should i stick to this schema or go with other schema where 'A' is partition key and i make dynamic columns for each element in the set in my other schema say 'B1', 'B2' ..... 'B10,000'where each of this column is a clustering key.
Which schema is suited best and will give the optimal performance please recommend.
NOTE: cqlsh 5.0.1v
Based off of what you've described, and the documentation I've read, I would not create a collection with 10k elements. Instead I would have two tables, one with everything but the collection, and then use the primary key values of the first table, as the partition key columns of the second table; adding the element name (or whatever you can use to identify an individual element) as a clustering column.
So for a given query, if you wanted everything for a particular primary key value (including all elements), you'd query the first table with the primary key, grab whatever you need, then hit the second table as well, looping/fetching through all elements.
If the query only provides a filter on the partition key (not the primary key - i.e. retrieving multiple rows) , the first query would have to retrieve all columns that make up the primary key for each row, and then query the second table looping for all elements - nested loop here - one loop for each primary key record retrieved from the first table, and a second loop to grab all elements for each pk record.
Probably the best way to go with this. That's how I would probably tackle this.
Does that make sense?
-Jim

what's the difference among row key, primary key and index in cassandra?

I'm so confused.
When to use them and how to determine which one to use?
If a column is index/primary key/row key, could it be duplicated?
I want to create a column family to store some many-to-many info, for example, one column is the given name and the other is surname. One given name can related to many surnames, and one surname could have different given names.
I need to query surnames by a given name, and the given names by a specified surname too.
How to create the table?
Thanks!
Cassandra is a NoSQL database, and as such has no such concept of many-to-many relationships. Ideally a table should not have anything other than a primary key. In your case the right way to model it in Cassandra is to create two tables, one with name as the primary key and the other with surname as the primary key
When you need to query by either key, you need to query the table that has that key as the primary key
EDIT:
From the Cassandra docs:
Cassandra's built-in indexes are best on a table having many rows that
contain the indexed value. The more unique values that exist in a
particular column, the more overhead you will have, on average, to
query and maintain the index. For example, suppose you had a races
table with a billion entries for cyclists in hundreds of races and
wanted to look up rank by the cyclist. Many cyclists' ranks will share
the same column value for race year. The race_year column is a good
candidate for an index.
Do not use an index in these situations:
On high-cardinality columns for a query of a huge volume of records for a small number of results.
In tables that use a counter column On a frequently updated or deleted column.
To look for a row in a large partition unless narrowly queried.

What is the difference between a clustering column and secondary index in cassandra

I'm trying to understand the difference between these two and the scenarios in which you would prefer to use one over the other.
My specific use case is using cassandra as an event ingestion system backed by an analytics engine that interprets the event.
My model includes
event id (the partition key)
event time (a clustering column)
event type (i'm not sure whether to use clustering column or secondary index)
I figure the most common read scenario will be to get the events over a time range hence event time is the clustering column. A less frequent read scenario might involve further filtering the event query by event type.
A secondary index is pretty similar to what we know from regular relational databases. If you have a query with a where clause that uses column values that are not part of the primary key, lookup would be slow because a full row search has to be performed. Secondary indexes make it possible to service such queries efficiently. Secondary indexes are stored as extra tables, and just store extra data to make it easy to find your way in the main table.
So that's a good ol' index, which we already know about. So far, there's nothing new to cassandra and its distributed nature.
Partitioning and clustering is all about deciding how rows from the main table are spread among the nodes. This is unique to cassandara since it determines the distribution of data. So, the primary key consists of at least one column. The first column in the primary key is used as the partition key. The partition key is used to decide which node to store a row. If the primary key has additional columns, the columns are used to cluster the data on a given node - the data is stored in lexicographic order on a node by clustering columns.
This question has more specifics on clustering columns: Clustering Keys in Cassandra
So an index on a given column X makes the lookup X --> primary key efficient. The partition key (first column in the primary key) determines which node a row is stored on. Clustering columns (additional columns in the primary key) determine which order rows are stored in on their assigned node.
So your intuition sounds about right - the event ID is presumably guaranteed unique, so is great for building a primary key. Event time is a great way to order rows on disk on a given node.
If you never needed to lookup data by event type, eg, never had a query like SELECT * FROM Events WHERE Type = Warning, then you have no need for your additional indexes, but your demands for partitioning don't change. Indexes make it easy to serve queries with different predicates. Since you mentioned that you indeed were planning on performing queries like that, you do in fact likely want an index on your EventType column.
Check out the cassandra documentation: http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_compound_keys_c.html
Cassandra uses the first column name in the primary key definition as the partition key.
...
In the case of the playlists table, the song_order is the clustering column. The data for each partition is clustered by the remaining column or columns of the primary key definition. On a physical node, when rows for a partition key are stored in order based on the clustering columns

How to chose Azure Table ParitionKey and RowKey for table that already has a unique attribute

My entity is a key value pair. 90% of the time i'll be retrieving the entity based on key but 10% of the time I'll also do a reverse lookup i.e. I'll search by value and get the key.
The key and value both are guaranteed to be unique and hence their combination is also guaranteed to be unique.
Is it correct to use Key as PartitionKey and Value as RowKey?
I believe this will also ensure that my data is perfectly load balanced between servers since ParitionKey is unique.
Are there any problems in the above decision?
Under any circumstance is it practical to have a hard coded partition key? I.e all rows have same partition key? and keeping the RowKey unique?
Is it doable, yes, but depending on the size of your data, I'm not so sure it's a good idea. When you query on partition key, Table Store can go directly to the exact partition and retrieve all your records. If you query on Rowkey alone, Table store has to check if the row exists in every partition of the table. so if you have 1000 key value pairs, searching by your key will read a single partition/row. If your search via your value alone, it will read all 1000 partitions!
I face a similar problem, I solved it 2 ways:
Have 2 different tables, one with partitionKey as your-key, the other with your-value as partitionKey. Storage is cheap, so duplicating data shouldn't cost much.
(What I finally did) If you're effectively returning single entites based on a unique key, just stick them in blobs(partitioned and pivoted as in point 1), because you don't need to traverse a table, so don't.

Resources