Unenforced Unique vs enforced Unique in memsql - singlestore

I find this a bit confusing. I am using a MemSQL columnstore. I am trying to understand whether there is a way to enforce uniqueness (i.e. prevent duplicates) on a specific key (e.g. eventId). I found some documentation on Unenforced Unique, but I didn't really understand its intention.

The point of unenforced unique keys is as a hint:
An unenforced unique constraint is informational: the query planner may use the unenforced unique constraint as a hint to choose better query plans.
from https://docs.memsql.com/v6.8/concepts/unenforced-unique-constraints/.
Unfortunately MemSQL does not support (enforced) unique constraints on columnstore tables.

Since version 7, MemSQL supports an enforced unique constraint on columnstore tables, but it can be applied to a single column only:
https://docs.memsql.com/v7.1/guides/use-memsql/physical-schema-design/creating-a-columnstore-table/creating-a-columnstore-table/
Your columnstore table definition can contain metadata-only unenforced unique keys, single-column hash keys (which may be UNIQUE), and a FULLTEXT key. You cannot define more than one unique key.
One hack to get a UNIQUE constraint over multiple columns is to add a computed column that concatenates those columns and then apply UNIQUE to it, which indirectly enforces uniqueness on the combination.
Example of a columnstore table with a single-column UNIQUE hash key:
CREATE TABLE articles (
    id INT UNSIGNED,
    year INT UNSIGNED,
    title VARCHAR(200),
    body TEXT,
    SHARD KEY (title),
    KEY (id) USING CLUSTERED COLUMNSTORE,
    KEY (id) USING HASH,
    UNIQUE KEY (title) USING HASH,
    KEY (year) USING HASH
);
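The computed-column hack described above depends on how the columns are concatenated. The following is a minimal Python sketch of that idea, not MemSQL syntax; the function names `composite_key` and `naive_key` and the choice of separator are assumptions made for illustration. It shows that plain concatenation can make two different value pairs look identical, while a separator with escaping keeps them distinct.

```python
def composite_key(*parts, sep="\x1f"):
    """Join column values into one string that could carry a single-column UNIQUE key."""
    # Escape the escape character and the separator so that two different
    # tuples of values can never produce the same joined string.
    escaped = [str(p).replace("\\", "\\\\").replace(sep, "\\" + sep) for p in parts]
    return sep.join(escaped)


def naive_key(*parts):
    """Plain concatenation without a separator (collision-prone)."""
    return "".join(str(p) for p in parts)


# Two different two-column value pairs.
row_a = ("ab", "c")
row_b = ("a", "bc")

naive_collides = naive_key(*row_a) == naive_key(*row_b)        # both become "abc"
safe_collides = composite_key(*row_a) == composite_key(*row_b)
```

With the naive version, a UNIQUE constraint on the computed column would reject `row_b` even though it differs from `row_a`, so the expression used for the computed column deserves the same care.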

Related

Apache Cassandra stock data model design

I got a lot of data regarding stock prices and I want to try Apache Cassandra out for this purpose. But I'm not quite familiar with the primary/ partition/ clustering keys.
My database columns would be:
Stock_Symbol
Price
Timestamp
My users will always filter for the Stock_Symbol (where stock_symbol=XX) and then they might filter for a certain time range (Greater/ Less than (equals)). There will be around 30.000 stock symbols.
Also, what is the big difference when using another "filter", e.g. exchange_id (only two stock exchanges are available).
Exchange_ID
Stock_Symbol
Price
Timestamp
So my users would first filter for the stock exchange (which is more or less a foreign key), then for the stock symbol (which is also more or less a foreign key). The data would be inserted/ written in this order as well.
How do I have to choose the keys?
The Quick Answer
Based on your use-case and predicted query pattern, I would recommend one of the following for your table:
PRIMARY KEY (Stock_Symbol, Timestamp)
The partition key is made of Stock_Symbol, and Timestamp is the only clustering column. This will allow WHERE to be used with those two fields. If either is to be filtered on, the query must include a condition on Stock_Symbol (typically an equality); a condition on Timestamp alone is not allowed.
Or, for the second case you listed:
PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp)
The partition key is composed of Exchange_ID and Stock_Symbol, and Timestamp is the only clustering column. This will allow WHERE to be used with those three fields. If any of those three are to be filtered on, the query must include conditions on both Exchange_ID and Stock_Symbol, since together they form the partition key.
See the last section of this answer for a few other variations that could also be applied based on your needs.
Long Answer & Explanation
Primary Keys, Partition Keys, and Clustering Columns
Primary keys in Cassandra, similar to their role in relational databases, serve to identify records and index them in order to access them quickly. However, due to the distributed nature of records in Cassandra, they serve a secondary purpose of also determining which node a given record is stored on.
The primary key in a Cassandra table is further broken down into two parts - the Partition Key, which is mandatory and by default is the first column in the primary key, and optional clustering column(s), which are all fields that are in the primary key that are not a part of the partition key.
Here are some examples:
PRIMARY KEY (Exchange_ID)
Exchange_ID is the sole field in the primary key and is also the partition key. There are no additional clustering columns.
PRIMARY KEY (Exchange_ID, Timestamp, Stock_Symbol)
Exchange_ID, Timestamp, and Stock_Symbol together form a composite primary key. The partition key is Exchange_ID; Timestamp and Stock_Symbol are both clustering columns.
PRIMARY KEY ((Exchange_ID, Timestamp), Stock_Symbol)
Exchange_ID, Timestamp, and Stock_Symbol together form a composite primary key. The partition key is composed of both Exchange_ID and Timestamp. The extra parentheses around Exchange_ID and Timestamp group them into a single composite partition key, and Stock_Symbol is a clustering column.
PRIMARY KEY ((Exchange_ID, Timestamp))
Exchange_ID and Timestamp together form a composite primary key. The partition key is composed of both Exchange_ID and Timestamp. There are no clustering columns.
But What Do They Do?
Internally, the partition key is used to calculate a token, which determines on which node a record is stored. The clustering columns are not used in determining which node stores the record, but they determine the order in which records are laid out within a partition on that node; this is important when querying a range of records. Records in the same partition whose clustering columns are similar in value will be stored close to each other; they "cluster" together.
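As a rough illustration of the two roles described above, here is a small Python model; it is a sketch under simplifying assumptions, not how Cassandra is implemented. The node names are hypothetical, MD5 stands in for the real partitioner, and `token % len(NODES)` replaces the actual token ring. The partition key alone picks the node, and rows inside a partition stay sorted by the clustering column so a time range is a contiguous slice.

```python
import bisect
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key):
    """Map a partition key to a node via a hash token (MD5 is a stand-in)."""
    token = int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")
    return NODES[token % len(NODES)]


# node -> partition key -> rows kept sorted by the clustering column (timestamp)
storage = defaultdict(lambda: defaultdict(list))


def insert(stock_symbol, timestamp, price):
    partition = storage[node_for(stock_symbol)][stock_symbol]
    bisect.insort(partition, (timestamp, price))


def range_query(stock_symbol, start, end):
    """Return rows with start <= timestamp <= end from a single partition."""
    partition = storage[node_for(stock_symbol)][stock_symbol]
    lo = bisect.bisect_left(partition, (start,))
    hi = bisect.bisect_right(partition, (end, float("inf")))
    return partition[lo:hi]


insert("ACME", 30, 10.5)
insert("ACME", 10, 10.1)
insert("ACME", 20, 10.3)
insert("INIT", 15, 99.0)

result = range_query("ACME", 10, 20)
```

Because every "ACME" row hashes to the same node and is stored in timestamp order, the range query touches one node and reads one contiguous slice.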
Filtering in Cassandra
Due to the distributed nature of Cassandra, fields can only be filtered on if they are indexed. This can be accomplished in a few ways, usually by being a part of the primary key or by having a secondary index on the field. Secondary indexes can cause performance issues according to DataStax Documentation, so it is typically recommended to capture your use-cases using the primary key if possible.
Any field in the primary key can have a WHERE clause applied to it (unlike unindexed fields which cannot be filtered on in the general case), but there are some stipulations:
Order Matters - The primary key fields used in the WHERE clause must follow the order in which they are defined, without skipping any; if you have a primary key of (field1, field2, field3), you cannot do WHERE field2 = 'value', but rather you must include the preceding fields as well: WHERE field1 = 'value' AND field2 = 'value'. What matters is which fields are restricted, not the position of the conditions within the WHERE clause.
The Entire Partition Key Must Be Present - If applying a WHERE clause to the primary key, the entire partition key must be given so that the cluster can determine what node in the cluster the requested data is located in; if you have a primary key of ((field1, field2), field3), you cannot do WHERE field1 = 'value', but rather you must include the full partition key: WHERE field1 = 'value' AND field2 = 'value'.
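The two stipulations above can be expressed as a small check. The Python function below is a simplified model, not Cassandra's actual validation logic: the name `can_filter` is an assumption, and the model ignores secondary indexes, ALLOW FILTERING, and the distinction between equality and range conditions.

```python
def can_filter(partition_key, clustering_columns, restricted):
    """Return True if the restricted columns are queryable via the primary key alone."""
    restricted = set(restricted)
    # Rule 1: every partition key column must be restricted.
    if not set(partition_key) <= restricted:
        return False
    # Rule 2: clustering columns must be restricted as a prefix, with no gaps.
    remaining = restricted - set(partition_key)
    for column in clustering_columns:
        if column in remaining:
            remaining.discard(column)
        else:
            break
    # Anything left is a skipped-over clustering column or a non-key column.
    return not remaining


# PRIMARY KEY (field1, field2, field3)
simple = (["field1"], ["field2", "field3"])
# PRIMARY KEY ((field1, field2), field3)
composite = (["field1", "field2"], ["field3"])

checks = {
    "simple, field2 only": can_filter(*simple, {"field2"}),
    "simple, field1 + field2": can_filter(*simple, {"field1", "field2"}),
    "simple, field1 + field3": can_filter(*simple, {"field1", "field3"}),
    "composite, field1 only": can_filter(*composite, {"field1"}),
    "composite, field1 + field2": can_filter(*composite, {"field1", "field2"}),
}
```

The function takes a set of column names, which reflects that only the choice of restricted columns is modelled here.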
Applied to Your Use-Case
With the above info in mind, you can take the analysis of how users will query the database, as you've done, and use that information to design your data model, or more specifically in this case, the primary key of your table.
You mentioned that you will have about 30k unique values for Stock_Symbol and further that it will always be included in WHERE clauses. That sounds initially like a reasonable candidate for a partition key, as long as queries will include only a single value that they are searching for in Stock_Symbol (e.g. WHERE Stock_Symbol = 'value' as opposed to WHERE Stock_Symbol < 'value'). If a query is intended to return multiple records with multiple values in Stock_Symbol, there is a danger that the cluster will need to retrieve data from multiple nodes, which may result in performance penalties.
Further, if your users wish to filter on Timestamp, it should also be a part of the primary key, though wanting to filter on a range indicates to me that it probably shouldn't be a part of the partition key, so it would be a good candidate for a clustering column.
This brings me to my recommendation:
PRIMARY KEY (Stock_Symbol, Timestamp)
If it were important to distribute data based on both the Stock_Symbol and the Timestamp, you could introduce a pre-calculated time-bucketed field that is derived from the time but has lower cardinality, such as Day_Of_Week or Month:
PRIMARY KEY ((Stock_Symbol, Day_Of_Week), Timestamp)
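A time bucket like this has to be computed by the application, both when writing a row and when querying. The Python sketch below assumes a year-month bucket (a variation on the Month idea, chosen here only for illustration); the function names are hypothetical. It also shows the cost of bucketing: a time-range query has to visit every bucket that the range spans.

```python
from datetime import datetime, timezone


def month_bucket(epoch_seconds):
    """Derive a low-cardinality bucket such as '2024-01' from a Unix timestamp (UTC)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m")


def partition_key(stock_symbol, epoch_seconds):
    """Composite partition key (Stock_Symbol, bucket) for one row."""
    return (stock_symbol, month_bucket(epoch_seconds))


def buckets_for_range(start_seconds, end_seconds):
    """List every month bucket that a time-range query has to visit."""
    start = datetime.fromtimestamp(start_seconds, tz=timezone.utc)
    end = datetime.fromtimestamp(end_seconds, tz=timezone.utc)
    year, month = start.year, start.month
    buckets = []
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


ts = datetime(2024, 1, 15, tzinfo=timezone.utc).timestamp()
key = partition_key("ACME", ts)

range_start = datetime(2023, 11, 20, tzinfo=timezone.utc).timestamp()
range_end = datetime(2024, 2, 3, tzinfo=timezone.utc).timestamp()
needed = buckets_for_range(range_start, range_end)
```

Each entry in `needed` corresponds to one partition, so a wide time range turns into several single-partition queries.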
If you wanted to introduce another field to filter on, such as Exchange_ID, it could be a part of the partition key, which would mandate it being included in filters, or it could be a clustering column, which would mean that it wouldn't be required unless subsequent fields in the primary key needed to be filtered on. As you mentioned that users will always filter by Exchange_ID and then by Stock_Symbol, it might make sense to do:
PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp)
Or to make it non-mandatory:
PRIMARY KEY (Stock_Symbol, Exchange_ID, Timestamp)

Is it necessary to use all the columns defined as the primary key to query a Cassandra database?

I am using a Cassandra database and need to define the primary key, which is a combination of a partition key and clustering keys. As per the business requirement, the database needs to be queried on a combination of two fields: a customer number and createdAt (a Unix timestamp value). These columns alone cannot be used as the primary key because they do not uniquely identify a row. So, is it correct to add the uuid column as a clustering key to make the primary key unique, so that the primary key becomes customerNumber (partition key), createdAt (clustering key), uuid (clustering key)? The database will never be queried on the whole primary key, however. It will always be queried based on the part of the Primary key, i.e. customerNumber and createdAt; uuid will never be used to query the database.
So if I understand correctly, your PRIMARY KEY definition looks like this:
PRIMARY KEY (customerNumber,createdAt,uuid)
It will always be queried based on the part of the Primary key
Yes, querying by part of the PRIMARY KEY definition is fine, in your case. Cassandra tries to restrict queries to a single node, and it achieves this by ensuring that an entire partition is written to a single node (and then replicated). Because of this, you really only need to supply the partition key on your queries (customerNumber), and they should work.
Supplying an additional PRIMARY KEY component however, is helpful. In a high-throughput scenario, the smaller you can keep your result set payloads, the better.
tl;dr;
Querying by customerNumber and createdAt will be just fine.
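The reason the uuid tie-breaker matters is that a write to an existing primary key in Cassandra replaces the previous row rather than failing. The toy Python model below illustrates this under simplifying assumptions; the `Table` class and the sample values are hypothetical and only mimic the upsert behaviour.

```python
import uuid


class Table:
    """Toy table in which the full primary key identifies exactly one row."""

    def __init__(self):
        self.rows = {}

    def insert(self, primary_key, value):
        # Writing an existing primary key replaces the previous row (upsert).
        self.rows[primary_key] = value

    def query(self, customer_number, created_at):
        # Filter on a prefix of the primary key: partition key + first clustering column.
        return [v for k, v in self.rows.items() if k[:2] == (customer_number, created_at)]


without_uuid = Table()
without_uuid.insert((42, 1700000000), "order A")
without_uuid.insert((42, 1700000000), "order B")  # same key: overwrites "order A"

with_uuid = Table()
with_uuid.insert((42, 1700000000, uuid.uuid4()), "order A")
with_uuid.insert((42, 1700000000, uuid.uuid4()), "order B")  # distinct key: both kept

lost = without_uuid.query(42, 1700000000)
kept = sorted(with_uuid.query(42, 1700000000))
```

Querying by customerNumber and createdAt returns both rows in the second table even though uuid never appears in the query.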

How to have unique key except primary key in cassandra?

I am not good in English!
I have a table in Cassandra 3.5 in which the columns of a row do not all arrive at the same time. A row is uniquely identified by a combination of several columns, but some of them are null at first, so I cannot make them the primary key. I have added a column named id of type uuid to identify the row.
How can I have a unique key over those columns together in Cassandra?
Is my data model correct?
How can I solve this problem?
You can't. It's not a relational DB. Use clustering and/or partitioning keys to add a unique constraint.
See this answer
To store unique values, create a separate table having your unique value as a key. Check if it exists by requesting this table before inserting a row. But beware, even doing this, you cannot ensure it will be unique in your final table if you have two concurrent inserts.
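The concurrency caveat above can be reproduced with a few lines of Python. This is a deterministic toy model, not Cassandra code: the `UniqueIndex` class and the client names are hypothetical, and the interleaving of the two clients is written out by hand.

```python
class UniqueIndex:
    """Toy lookup table keyed by the value that must be unique."""

    def __init__(self):
        self.keys = {}

    def exists(self, key):
        return key in self.keys

    def insert(self, key, owner):
        self.keys[key] = owner  # a plain insert: the last write wins


index = UniqueIndex()

# Two clients interleave: both check before either one inserts.
a_sees_free = not index.exists("event-1")
b_sees_free = not index.exists("event-1")
if a_sees_free:
    index.insert("event-1", "client A")
if b_sees_free:
    index.insert("event-1", "client B")

both_proceeded = a_sees_free and b_sees_free
final_owner = index.keys["event-1"]
```

Both clients believe they own "event-1" and both go on to write to the main table. Cassandra's conditional INSERT ... IF NOT EXISTS is one way to narrow this window, at a performance cost; whether it fits this data model is not something the answer above covers.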
Basically, I would recommend using Cassandra as it really is: A data store. And find a way to implement your business logic where it belongs: in your code.

How to make Cassandra have a varying column key for a specific row key?

I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply you can have varying column keys in Cassandra for a given row key. Is that true? And if it's true, how do you allow for varying column keys?
The reason I think this might be true is because say we have a user and it can like many items and we simply want the userId to be the rowkey. We let this rowKey (userID) map to all the items that specific user might like. Each specific user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemID each user likes, then we could solve the problem that way.
Therefore, is it possible to have varying length of cassandra column keys for a specific rowKey? (and how do you do it)
Providing an example and/or some cql code would be awesome!
The thing that is confusing me is that I have seen some .cql files and they define keyspaces beforehand, and it seems pretty inflexible in terms of making it dynamic, i.e. allowing it to have additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
How can this even allow growing columns? Don't we need to specify the name beforehand anyway? Or can additional custom columns be added as the application desires?
Yes, you can have a varying number of columns per row_key. From a relational perspective, it's not obvious that tid acts as a placeholder for the variable column key: each distinct tid value stored under the same row_key becomes another column in that row. Note that the insert statements below never change the schema; only the tid values differ.
CREATE TABLE IF NOT EXISTS results (
    test text,      -- row_key (partition key)
    tid timeuuid,   -- column_key (clustering column)
    data blob,      -- payload
    PRIMARY KEY (test, tid)
);
So in your example, you need to identify the row_key, column_key, and payload of the table.
The primary key contains both the row_key and column_key.
test is your row_key.
tid is your column_key.
data is your payload.
The following inserts are all valid:
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_1', a4a70900-24e1-11df-8924-001ff3591711, textAsBlob('blob_1'));
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_1', a4a70900-24e1-11df-8924-001ff3591712, textAsBlob('blob_2'));
-- notice that the column_key changed but the row_key remained the same
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_2', a4a70900-24e1-11df-8924-001ff3591711, textAsBlob('blob_3'));
See here
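One way to picture the row_key / column_key / payload layout is as a dictionary of dictionaries. The Python sketch below is only a mental model of that layout, not of Cassandra's storage format; the `insert` helper is hypothetical, and the keys mirror the three inserts above.

```python
from collections import defaultdict

# row_key -> {column_key: payload}; each row may hold a different number of columns
results = defaultdict(dict)


def insert(row_key, column_key, payload):
    """Add (or overwrite) one column under the given row key."""
    results[row_key][column_key] = payload


insert("row_key_1", "a4a70900-24e1-11df-8924-001ff3591711", b"blob_1")
insert("row_key_1", "a4a70900-24e1-11df-8924-001ff3591712", b"blob_2")
insert("row_key_2", "a4a70900-24e1-11df-8924-001ff3591711", b"blob_3")

columns_per_row = {row_key: len(columns) for row_key, columns in results.items()}
```

Applied to the user/item case from the question, the userId would play the role of the row_key and each liked itemId the role of a column_key, so users with different numbers of liked items simply have inner dictionaries of different sizes.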
Have you thought of exploring the collection support in Cassandra for handling such relations in a colocated way (e.g. on the same data node)?
Not sure if it helps, but what about keeping user id as row key and a map containing item id as key and some value?
-Vivel

EF5 Navigation/Association Property with non-Primary Foreign Key

This is the same exact question as this, but instead for EF5.
Is it possible now?
We have a Users table that has an int PK, but in our other tables that have columns like InsertBy/UpdateBy, the desire is to use value of the LANID varchar column from the Users table, rather than the UserID.
No, it is still not possible. The FK must target the PK in the principal table because EF still doesn't support unique keys (a prerequisite for using non-PK columns).

Resources