Cassandra data modelling less then 1000 records to fit in one row - cassandra

We have some entity uniquely identified by generated UUID. We need to support find by name query. Also we need to support sorting to be by name.
We know that there will be no more than 1000 of entities of that type which can perfectly fit in one row. Is it viable idea to hardcode primary key, use name as clustering key and id as clustering key there to satisfy uniqueness. Lets say we need school entity. Here is example:
CREATE TABLE school (
constant text,
name text,
id uuid,
description text,
location text,
PRIMARY KEY ((constant), name, id)
);
Initial state would be give me all schools and then filtering by exact name will happen. Our reasoning behind this was to place all schools in single row for fast access, have name as clustering column for filtering and have id as clustering column to guaranty uniqueness. We can use constant = school as known hardcoded value to access this row.
What I like about this solution is that all values are in one row and we get fast reads. Also we can solve sorting easy by clustering column. What I do not like is hardcoded value for constant which seams odd. We could use name as PK but then we would have 1000 records spread across couple of partitions, probably find all without name would be slower and would not be sorted.
Question 1
Is this viable solution and are there any problems with it which we do not see? I did not see any example on Cassandra data modelling with hardcoded primary key probably for the reason so we are doubting this solution.
Question 2
Name is editable field, it will probably be changed rarely (someone can make typo or school can change name) but it can change. What is best way to achieve this? Delete insert inside batch (LTE can be applied to same row with conditional clause)?

Yes this is a good approach for such a small dataset. Just because Cassandra can partition large datasets across multiple nodes does not mean that you need to use that ability for every table. By using a constant for the partition key, you are telling Cassandra that you want the data to be stored on one node where you can access it quickly and in sorted order. Relational databases act on data in a single node all the time, so this is really not such an unusual thing to do.
For safety you will probably want to use a replication factor higher than one so that there are at least two copies of the single partition. In that way you will not lose access to the data if the one node where it is stored went down.
This approach could cause problems if you expect to have a lot of clients (i.e. thousands of clients) frequently reading and writing to this table, since it could become a hot spot. With only 1000 records you can probably keep all the rows cached in memory by setting the table to cache all keys and rows.
You probably won't find a lot of examples where this is done because people move to Cassandra for the support of large datasets where they want the scalability that comes from using multiple partitions. So examples are geared towards that.

Is this viable solution and are there any problems with it which we do not see? I did not see any example on Cassandra data modelling with hardcoded primary key probably for the reason so we are doubting this solution.
I briefly addressed this type of modeling solution earlier this year in my article: We Shall Have Order! This is what is known as a "dummy key," where each row has the same partition key. This is a shortcut that allows you to easily order all of your rows (on an unbound SELECT *) by clustering column(s).
Problems with this solution:
Cassandra allows a maximum of 2 billion column values per partition key. When using a dummy partition key, you will approach this limit with each value that you add.
Your data will all be stored in the same partition, which will create a "hot spot" (large groupings of data) in your cluster. This means that your data model will immediately void one of Cassandra's main benefits...data distribution. This will also complicate load balancing (the same nodes and ranges will keep serving all of your requests).
I can see that your model is designed around a SELECT * query. Cassandra works best when you can give it specific keys to query by. Unbound SELECT * queries (queries without WHERE clauses) are not a good idea to be doing with Cassandra, as they can lead to timeouts (as your data grows).
From reading through your question, I know that you're going to say that you're only using it for 1000 rows. That your dataset won't ever grow much beyond those 1000 rows, so you won't hit any of the roadblocks that I have mentioned.
So then I have to wonder, why are you using Cassandra? As a Cassandra MVP, that's a question I don't ask often. But you don't have an especially large data set (which is what Cassandra is designed to work with). Relying on that fact as a reason to use a product incorrectly is not really the best solution.
Honestly, I am going to recommend that you save yourself some complexity, and use a RDBMS instead. That will fit your use case significantly better than Cassandra will. Then you can update and order by whatever fields you wish.

Related

How can you query all the data with an empty set in cassandra db?

As the title says, I'm trying to query all the data I got with no value stored in it. I've been searching for a while, and the only operation allowed that I've found is CONTAINS, which doesn't fit my need.
consider the following table:
CREATE TABLE environment(
id uuid,
name varchar,
message text,
public Boolean,
participants set<varchar>,
PRIMARY KEY (id)
)
How can I get all entries in the table with an empty set? E.g. participants = {} or null?
Unfortunately, you really can't. Cassandra makes queries like this difficult by design, because there's no way it can be done without doing a full table scan (scanning each and every node). This is why a big part of Cassandra data modeling is understanding all the ways that table will be queried, and building it to support those queries.
The other issue that you'll have to deal with, is that (generally speaking) Cassandra does not allow filtering by nulls. Again, it's a design choice...it's much easier to query for data that exists, rather than data that does not exist. Although, when writing with lightweight transactions, there are ways around this one (using the IF clause).
If you knew all of the ids ahead of time, you could write something to iterate through them, SELECT and check for null on the app-side. Although that approach will be slow (but it won't stress the cluster). Probably the better approach, is to use a distributed OLAP layer like Apache Spark. It still wouldn't be fast, but this is probably the best way to handle a situation like this.

Should I set foreign keys in Cassandra tables?

I am new to Cassandra and coming from relational background. I learned Cassandra does not support JOINs hence no concept of foreign keys. Suppose I have two tables:
Users
id
name
Cities
id
name
In RDBMS world I should pass city_id into users table. Since there is no concept of joins and you are allowed to duplicate data, is it still work passing city_id into users table while I can create a table users_by_cities?
The main Cassandra concept is that you design tables based off of your queries (as writes to the table have no restrictions). The design is based off of the query filters. An application that queries a table by some ID is somewhat unnatural as the CITY_ID could be any value and typically is unknown (unless you ran a prior query to get it). Something more natural may be CITY_NAME. Anyway, assuming there are no indexes on the table (which are mere tables themselves), there are rules in Cassandra regarding the filters you provide and the table design, mainly that, at a minimum, one of the filters MUST be the partition key. The partition key helps direct cassandra to the correct node for the data (which is how the reads are optimized). If none of your filters are the partition key, you'll get an error (unless you use ALLOW FILTERING, which is a no-no). The other filters, if there are any, must be the clustering columns (you can't have a filter that is neither the partition key nor the clustering columns - again, unless you use ALLOW FILTERING).
These restrictions, coming from the RDBMS world, are unnatural and hard to adjust to, and because of them, you may have to duplicate data into very similar structures (maybe the only difference is the partition keys and clustering columns). For the most part, it is up to the application to manipulate each structure when changes occur, and the application must know which table to query based off of the filters provided. All of these are considered painful coming from a relational world (where you can do whatever you want to one structure). These "constraints" need to be weighed against the reasons why you chose Cassandra for your storage engine.
Hope this helps.
-Jim

Cassandra data model with multiple conditions

I'm new to Cassandra, so I read a dozen articles about it and thus I know the basics. All the tutorials show efficient data retrieval by 1 or 2 columns and a time range. What I could not find was how to correctly model your data if you have more conditions.
I have a big events normalised database, with quite a few columns, say:
Event type
time
email
User_age
user_country
user_language
and so on.
I would need to be able to query by all columns. So in RDBMS I would query:
SELECT email FROM table WHERE time > X AND user_age BETWEEN X AND X AND user_language = 'nl' etc..
I know I can make a separate table for each column, but then I would still need to combine the results. Maybe this is not a bad approach, but I doubt it since there are no subqueries.
My question is obviously, how can I model this kind of data correctly in Cassandra?
Thanks a lot!
I would need to be able to query by all columns.
Let me stop you right there. In Cassandra, you create your tables based on your anticipated query patterns, and usually a table supports a single query. In your case, you have "quite a few" columns and you will need to duplicate that data into a table designed to support each possible query. That is going to get big and ungainly, very quickly.
Could we just add the rest as secondary indexes? there could potentially still be millions of rows in the eventtype table + merchant_id + time selection.
Secondary indexes are intended to be used on middle-of-the-road cardinality columns. So both, extremely low and extremely high cardinality columns are bad for secondary indexes. The problem, is that Cassandra will have to pick one of your nodes as a coordinator, scan the index on each node (incurring lots of network time), and then build and return the result set. It's a prescription for poor performance, that flies in-the-face of the best practices for working with a distributed database.
In short, Cassandra is not a good solution for use cases like this. It sounds like you want to be able to do OLAP-type queries, and for that you should use a tool that is better-suited for that purpose.

Are dummy partition keys always bad?

I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether. By dummy, I mean a column whose only purpose is to contain the same value for all rows, thereby putting all data on 1 node and giving the lowest possible cardinality. For example:
dummy | id | name
-------------------------
0 | 01 | 'Oliver'
0 | 02 | 'James'
0 | 03 | 'Nicholls'
The two main points in regards to why you should avoid dummy partition keys are:
1) You end up with data "hot-spots". There is a lot of data stored on 1 node so there's more traffic around that node and you have poor distribution around the cluster.
2) Partition space is finite. If you put all data on one partition, it will eventually be incapable of storing any more data.
I can understand these points and I agree that you definitely want to avoid those situations, so I put this idea out of my mind and tried to think of a good partition key for my table. The table in question stores sites and there are two common ways that table gets queried in our system. Either a single site is requested or all sites are requested.
This puts me in a bit of an awkward situation, because the table is either queried on nothing or the site ID, and making a unique field the partition key would give me very high cardinality and high latency on queries that request all sites.
So I decided that I'd just choose an arbitrary field that would give relatively low cardinality, even though it doesn't reflect how the data will actually be queried, just because it's better than having a cardinality that is either excessively high or excessively low. This approach also has problems though.
I could partition my data on column x, but we have numerous clients, all of whom use our system differently, so x for 1 client could give the results I'm after, but could give awful results for another.
At this point I'm running out of options. I need a field in my table that will be consistent for all clients, however this field doesn't exist, so I'm now considering having a new field that will contain a random number from 1-3 and then partitioning on that field, which is essentially just a dummy field. The only difference is that I want to randomise the values a little bit as to avoid hot-spots and unbounded row growth.
I know this is a data-modelling question and it varies from system to system, and of course there are going to be situations where you have to choose the lesser of two evils (there is no perfect solution), but what I'm really focussed on with this question is:
Are dummy partition keys something that should outright never be a consideration in Cassandra, or are there situations in which they're seen as acceptable? If you think the former, then how would you approach this situation?
I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether.
I'm going to go out on a limb and guess that your search has yielded my article We Shall Have Order!, where I made my position on the use of "dummy" partition keys quite clear. Bearing that in mind, I'll try to provide some alternate solutions.
I see two potential problems to solve here. The first:
I need a field in my table that will be consistent for all clients, however this field doesn't exist
Typically this is solved by duplicating your data into another query table. That's the best way to serve multiple, varying query patterns. If you have one client (service?) that needs to query that table by site id, then you could have that table duplicated into a table called sites_by_id.
CREATE TABLE sites_by_id (
id BIGINT,
name TEXT,
PRIMARY KEY (id));
The other problem is this query pattern:
all sites are requested
Another common Cassandra anti-pattern is that of unbound SELECTs (SELECT query without a WHERE clause). I am sure you understand why these are bad, as they require all nodes/partitions to be read for completion (which is probably why you are looking into a "dummy" key). But as the table supporting these types of queries increases in size, they will only get slower and slower over time...regardless of whether you execute an unbound SELECT or use a "dummy" key.
The solution here is to perform a re-examination of your data model, and business requirements. Perhaps your data can be split up into sites by region or country? Maybe your client really only needs the sites that have been updated for this year? Obtaining some more details on the client's query requirements may help you find a good partitioning key for them to use. Otherwise, if they really do need all of them all of the time, then doanduyhai's suggestion of using Spark will better fit your use case.
or all sites are requested
So basically you have a full table scan scenario. Isn't Apache Spark over Cassandra a better fit for this use-case ? I suspect it's an analytics use-case, isn't it ?
As far as I understand, you want to access a single site by its id, in which case lookup by partition key is ideal. The other use-case which requires to fetch all the sites is best suited with Spark

data modelling in cassandra to optimize search results

I was just wondering if I could get some clue/pointers to our kind of simple data modelling problem.
It would be great if somebody can help me in the right direction.
So we have kind of a flat table ex. document
which has all kinds of meta data attached to a document like
UUID documentId,
String organizationId,
Integer totalPageCount,
String docType,
String acountNumber,
String branchNumber,
Double amount,
etc etc...
which we are storing in cassandra .
UUID is the rowkey and we have certain secondary indexes like organization Id.
This table is actaully suppose hold millions of records.
Placing proper indices helps with a lot of queries but with the generic queries I am stuck.
The problem is even with something like 100k records if I throw in a query like
select * from document where orgId='something' and amount > 5 and amount < 50 ...I am begining to see all Read time out problems.
The query still works (although quite slow) if I limit the no of records to something lets say 2000.
The above can be solved by probably placing certain parmas properly but there about dozens of those columns based on which we need to search.
I am still trying to scale it horizontally so to place mutiple records in a single row.
Hoping for a sense of direction.
This is a broad problem, and general solutions are hard to give. However, here's my 2 pennies:
You WANT queries to hit single partitions for quick querying. If you don't hit a rowkey in your query, it's a cluster wide operation. So select * from docs where orgId='something' and amount > 5 and amount < 50 means you will have issues. Hitting a partition key AND an index is way way better than hitting the index without the partition key.
Again, you don't want all docs in a single partition...that's an obvious hotspot, not to mention it can cause size issues - keeping a row around the 100mb mark is a good idea. Several thousand or even several hundred thousand metadata entries per row should be fine - though much of this depends on your specific data.
So we want to hit partition keys, but also want to take advantage of distribution, while preserving efficiency. Hmmm.....
You can create artificial buckets. Decide how many buckets you want, based on expected data volumes. Assuming a few hundred thousand per partition, n buckets gives you n * hundreds of thousands. Make the bucket id the row key. When querying, use something like:
select * from documents where bucketid in (...) and orgId='something' and amount > 5;
[Note: for this, you may want to make the docid the last clustering key, so you don't have to specify it when doing the range query.]
That will result in n fast queries hitting n partitions, where n is the number of buckets.
Also, consider limiting your results. Do you really need 2000 records at a time?
For some information, it may make sense to have separate tables (i.e. some information with one particular clustering scheme in one table, and another in another). Duplication of some information is often ok - but again, this depends on particular scenarios.
Again, it's hard to give a general answer. But does that help?
The problem is not in Cassandra, but in your data model. You need to shift from relation thinking, to a nosql-cassandra thinking. In Cassandra, you write your queries first if you want to get decent O(1) speed. Using secondary indexes in Cassandra is frankly a poor choice. This is due to the fact that your indexes are distributed.
If you don't know your queries upfront, use other technology but not Cassandra. Relational servers are really good, if you can fit all data on 1 server, otherwise have a look at ElasticSearch.
Other option is to use Datastax edition, which contains Solr for full text search.
Lastly, you can have several tables that duplicate information. This will allow you to query for a specific property . This process is called de-normalisation and the idea is that you take a property of your object, make it a primary key and insert it into its own table. The outcome is that you can query that particular table, for that particular property value in O(1) time. The downside is that you now have to duplicate data.

Resources