What is the recommended approach towards multi-tenant databases in Cassandra?

What is the recommended approach towards multi-tenant databases in Cassandra? - cassandra

I'm thinking of creating a multi-tenant app using Apache Cassandra.
I can think of three strategies:
All tenants in the same keyspace using tenant-specific fields for security
table per tenant in a single shared DB
Keyspace per tenant
The voice in my head is suggesting that I go with option 3.
Thoughts and implications, anyone?

There are several considerations that you need to take into account:
Option 1: In pure Cassandra this option will work only if access to database will be always through "proxy" - the API, for example, that will enforce filtering on tenant field. Otherwise, if you provide an CQL access, then everybody can read all data. In this case, you need also to create data model carefully, to have tenant as a part of composite partition key. DataStax Enterprise (DSE) has additional functionality called row-level access control (RLAC) that allows to set permissions on the table level.
Options 2 & 3: are quite similar, except that when you have a keyspace per tenant, then you have flexibility to setup different replication strategy - this could be useful to store customer's data in different data centers bound to different geographic regions. But in both cases there are limitations on the number of tables in the cluster - reasonable number of tables is around 200, with "hard stop" on more than 500. The reason - you need an additional resources, such as memory, to keep auxiliary data structures (bloom filter, etc.) for every table, and this will consume both heap & off-heap memory.

I've done this for a few years now at large-scale in the retail space. So my belief is that the recommended way to handle multi-tenancy in Cassandra, is not to. No matter how you do it, the tenants will be hit by the "noisy neighbor" problem. Just wait until one tenant runs a BATCH update with 60k writes batched to the same table, and everyone else's performance falls off.
But the bigger problem, is that there's no way you can guarantee that each tenant will even have a similar ratio of reads to writes. In fact they will likely be quite different. That's going to be a problem for options #1 and #2, as disk IOPs will be going to the same directory.
Option #3 is really the only way it realistically works. But again, all it takes is one ill-considered BATCH write to crush everyone. Also, want to upgrade your cluster? Now you have to coordinate it with multiple teams, instead of just one. Using SSL? Make sure multiple teams get the right certificate, instead of just one.
When we have new teams use Cassandra, each team gets their own cluster. That way, they can't hurt anyone else, and we can support them with fewer question marks about who is doing what.

Related

Choosing the ideal multi-tenancy architecture for an ASP.NET Core application

I am currently working on an application that will be hosted on Azure. As it does not make sense to have an instance of it running for each customer (you'll see why), it's going to be a multi-tenancy solution.
To be honest: I'm only starting to gather experience with web applications, so I apologize if the answer to my question is obvious.
Question: Which multi-tenancy concept will be most beneficial for my application, considering the following assumptions:
Many tenants (ideally hundreds or even more, we'll see...)
consisting of few user accounts per tenant (<5-10 in most cases, up to 200 for a hand full of tenants)
dealing with mostly small amounts of data (<100 entries in <20 tables)
changes in data occur a few times a day (approx. <50 changes per
user per day)
The application needs to stay responsive (of course)
My thoughts:
Database-per-Tenant: Does not make sense as the DB won't be utilized
much, therefore not cost effective at all
Table-per-Tenant: Could be a good solution, guess this should scale
pretty good?
Tenant-column within the entities: Could be a problem with scaling, right? Could be
better when using charding on the tenant id?
I would really appreciate your help and some "shared experience" in order to choose the not-so-painful path.

A good summary of the different models can be found here:
https://www.linkedin.com/pulse/database-design-multi-tenant-applications-dharmendar-kumar/

Based on my experience on Azure I would recommend CosmosDB with the following options:
partitioned collections: if tenants are evenly distributed and have similar requirements
collection per tenant: if some tenants have scale or special requirements
mix between the preceding two.
Cosmos DB has a lot of benefits e.g sharding, global distribution, performance, freedom of consistency models as well as a good sql support.

Microservices With CosmosDB on Azure

I've read a bit about microservices and the favored approach appears to be a separate database for each microservice. With regards to Azure's CosmosDB, would that mean a separate Table for each service? What's the best way to architect this?

There are a huge variety of factors to consider here which ultimately means there is no right answer to this question and it will be very specific to the nature of the application you're trying to build. As such, broad statements attempting to offer "general" advice and patterns should be taken with a huge grain of salt. With Cosmos a few of the many high level things to consider when making your decisions are as follows:
Partitioning: Cosmos collections support almost infinite scale based on the selection of an appropriate partition key. So, for example you could have a single collection and separate your services such that they each write to a distinct partition key. This would provide you with a form of service multi-tenancy which might be perfectly appropriate for your particular application. However, throughput is also scaled at the collection level so if certain services have much higher read and/or write requirements this may not work for you and could be an indication that that particular service should use it's own collection which can be scaled independently.
Cost: You're billed per collection with a minimum throughput requirement. Depending on the number and nature of your micro services this could result in exponentially higher costs for little gain.
Isolation: Again, depending on the nature of your application you might have a hard business requirement that data from different services be physically separate from each other which would force you to use separate collections.
The point that I'm trying to make here is that there is absolutely no right answer to this question. You need to weight the pros/cons very carefully in the context of the solution you are trying to build and select the approach that is right for you.

How does Azure DocumentDB scale? And do I need to worry about it?

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of the Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awful highly limited alternative - you need to structure all your data in a two-step hierarchy: PartitionKey and RowKey. Provided you do that (which is nigh well impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and done right quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DB's? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.

Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection which allows you scale-out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage, and an variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution; and is also the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.

With the latest version of DocumentDB, things have changed. There is still the 10GB limit per collection but in the past, it was up to you to figure out how to split up your data into multiple collections to avoid hitting the 10 GB limit.
Instead, you can now, specify a partition key and DocumentDB now handles the partitioning for you e.g. If you have log data, you may want to partition the data on the date value in your JSON document, so that each day a new partition is created.

You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html

Using federations to partition for multiple tenants

Given the following "facts" I have gleaned from reading around this.
Federations are separate databases from the moment they are created.
As copies of the original, they will not alter automatically if I alter the original's schema.
As separate databases you cannot cross join.
Each federation is priced as a separate db.
I will have to provide a TenantId field to each table I want to federate.
If these are correct, what are the advantages to using federation to achieve multi-tenancy over simply separate dbs? Or if there're not correct please put me straight.
Note, we have a small number of tenants, maybe 20.

Your understanding is correct.
There are a few interesting aspects of Federations that you may find useful. First it is a relatively flexible partitioning environment. For example you can group 10 tenants into the first member, and 50 in the second, based on usage patterns of your customers. Or you could simply isolate a single customer that is using the system more than the others.
Another important concept is that you can have multiple federations per database. So you could have a Customer federation and a SalesHistory federation for example.
Last but not least you may want to read this article that discusses connection pool fragmentation that occurs in traditional sharding models, but is not an issue with SQL Database Federations.

ColumnStore index benefits on Azure?

We are currently running on Azure and we have a table with hundreds of millions of rows. This table is static and will be refreshed weekly. We've looked at ColumnStore index but unfortunately it is not Azure yet so below are my questions,
Will ColumnStore index be available in Azure?
if not what other technology can we use to get the same performance
benefits as the ColumnStore index would provide?
Can we get the same query performance by using Azure Table Storage?
I'm a newbie to both Azure and Columnar databases so please bear me with me if I ask these questions.. :)

About ColumnStore, if you have bought the license, you can check with development team or ask on blogs such as ScottGu's Blog. From there only you will come to know about any feature release.
Azure Database is designed for scalability. You will need to use the Partition Key very wisely. Partition Key is like index of book, so if you want to search something in book, you can quickly refer to the index and reach the page quickly. In other words, you can group data depending upon certain criteria and store it in a single partition. So where ever you have the same criteria, your query will hit only one partition. The thing with partitions is, for a table you can any number of partition, but it is not necessary that all the partition will reside on same machine or even same farm. So when you fire a query on badly designed Azure Table, it can hit more than one server, and thus bad performance. Read about Real World: Designing a Scalable Partitioning Strategy for Windows Azure Table Storage
Hope you get what you are looking for.

As Amar pointed out, keep an eye on the team blogs for the latest in new feature announcements. The goal for SQL Azure is for it to eventually be where new features are found first. However, it will still take awhile for things to get there.
As for your peformance question, there's no simple answer for this. Windows Azure resources are designed for scale, not necessarially high performance. So its to take your scale/capacity targets into account when designing solutions. For your situation, I would encourage you to conside table storage, but this will depend on frequency access and the types of queries you need to make on the data. Just do not be surprised if you have to mave redundant copies of your data that are modelled differently, or possibly even running parrallel queries and aggregating results. This is the way table storage was designed to be used. Its cheaper then SQL Azure and its this price difference that makes redundant specialized data models possible.
This approach also has to be weighed against the cost of retraining your developers to stop thinking in RDBMS terms. :)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string