Limiting Cassandra query syntax for clients

We plan to use Cassandra 3.x, and we want to allow our customers to connect to Cassandra directly to export their data into their data warehouses.
They will connect remotely via ODBC.
Is there any way to prevent a customer from executing huge or bad SELECT statements that result in high load on all nodes? We use an extra data center in our replication strategy where only customers can connect, so the live system will not be affected. But we also want to set up some workers that will run on this shadow system. The most important thing is that a connected remote client must not have any noticeable impact on other remote connections or on our local worker jobs. There is already a materialized view, and I want to force customers to fetch data based on the primary key only (i.e. disallow usage of ALLOW FILTERING). It would also be great if one could limit the number of rows returned (e.g. to 1 million) to prevent a pull of all data.
Is there a best practice for this use case?
I know of BlackRock's video on multi-tenant strategy in C*, which advises using a tenant_id in the schema. That is what we're doing already, but how can I ensure security/isolation for tenants/customers connected via ODBC? Or do I have to write an API of my own that handles security?

I would recommend exposing access via an API, not via ODBC - at least you would have greater control over what is executed, and could enforce tenant_id and other checks, such as limits. You can try to use Cassandra's CQL parser to decompose the query and put all required restrictions back.
Theoretically, you could use Apache Calcite, for example. It has a JDBC driver implementation that could be used, plus there is an existing Cassandra adapter that you can modify to accomplish your task (mapping authentication onto tenant_ids, etc.), but this will be quite a lot of work.
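As a rough illustration of that kind of API-side gatekeeping, here is a minimal sketch in Java; the class name and the tenant_id/LIMIT conventions are assumptions taken from the question, and the string checks stand in for what should really be done with Cassandra's CQL parser or Calcite:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative gatekeeper for customer-supplied CQL. A production version
 * would parse the statement with a real CQL grammar instead of string checks.
 */
public class CustomerQueryGuard {

    private static final long MAX_ROWS = 1_000_000L;   // row cap from the question
    private static final Pattern LIMIT = Pattern.compile("\\blimit\\s+(\\d+)\\b");

    public String rewrite(String cql, String tenantId) {
        String normalized = cql.trim().toLowerCase(Locale.ROOT);

        // Only plain SELECTs are allowed, and no cross-partition scans.
        if (!normalized.startsWith("select")) {
            throw new IllegalArgumentException("only SELECT statements are allowed");
        }
        if (normalized.contains("allow filtering")) {
            throw new IllegalArgumentException("ALLOW FILTERING is not permitted");
        }
        // Require the caller to stay inside its own tenant's partitions.
        if (!normalized.contains("tenant_id = '" + tenantId.toLowerCase(Locale.ROOT) + "'")) {
            throw new IllegalArgumentException("query must restrict tenant_id to your tenant");
        }

        // Enforce an upper bound on the number of rows returned.
        Matcher m = LIMIT.matcher(normalized);
        if (m.find() && Long.parseLong(m.group(1)) <= MAX_ROWS) {
            return cql;                                 // existing LIMIT is small enough
        }
        return cql.replaceAll("(?i)\\blimit\\s+\\d+\\b", "").trim() + " LIMIT " + MAX_ROWS;
    }
}
```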

Related

Shopware 6 partitioning

Has anyone had any experience with database partitioning? We already have a lot of data, and queries on it are starting to slow down. Maybe someone has some examples? These are tables related to orders.
Shopware, since version 6.4.12.0, allows the use of database clusters, see the relevant documentation. You will have to set up a number of read-only nodes first. The load of reading data will then be distributed among the read-only nodes, while write operations are restricted to the primary node.
Note that in a cluster setup you should also use a lock storage that complements the setup.
Besides using a DB cluster, you can also try to reduce the load on the DB server.
The first thing you should do is enable the HTTP cache; better still, additionally set up a reverse proxy cache like Varnish. This will greatly decrease the number of requests that hit your web server and thus your DB server as well.
Besides that, all the measures explained here should improve the overall performance of your shop as well as decrease the load on the DB.
Additionally, you could use Elasticsearch so that costly search requests won't hit the database, use a "real" MessageQueue so that the messages are not stored in the database, and use Redis instead of the database for storing performance-critical information, as documented in the articles in this category of the official docs.
The impact of all those measures probably depends on your concrete project setup, so maybe you will see something in the DB locks that hints at one of the points I mentioned previously, which would be an indicator to start in that direction. E.g. if you see a lot of search-related queries, Elasticsearch would be a great start, but if you see a lot of DB load coming from writing/reading/deleting messages, then the MessageQueue might be a better starting point.
All in all, when you use a DB cluster with a primary and multiple replicas and use the additional services I mentioned here, your shop should be able to scale quite well without the need for partitioning the actual DB.

What is the recommended approach towards multi-tenant databases in Cassandra?

I'm thinking of creating a multi-tenant app using Apache Cassandra.
I can think of three strategies:
1) All tenants in the same keyspace, using tenant-specific fields for security
2) Table per tenant in a single shared DB
3) Keyspace per tenant
The voice in my head is suggesting that I go with option 3.
Thoughts and implications, anyone?
There are several considerations that you need to take into account:
Option 1: In pure Cassandra this option will work only if access to the database is always through a "proxy" - an API, for example, that enforces filtering on the tenant field. Otherwise, if you provide CQL access, then everybody can read all data. In this case you also need to design the data model carefully, with the tenant as part of a composite partition key. DataStax Enterprise (DSE) has additional functionality called row-level access control (RLAC) that allows setting permissions at the table level.
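To make Option 1 concrete, here is a hedged sketch using the DataStax Java driver 4.x (the keyspace, table, and column names are invented): the tenant leads the composite partition key, and the proxy/API layer injects the caller's tenant_id into every query.

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class TenantPartitionKeyExample {
    public static void main(String[] args) {
        // Assumes the DataStax Java driver 4.x and an existing keyspace "shared".
        try (CqlSession session = CqlSession.builder().build()) {
            // tenant_id leads the composite partition key, so data of different
            // tenants never shares a partition and every read has to name it.
            session.execute(
                "CREATE TABLE IF NOT EXISTS shared.orders ("
              + "  tenant_id   text,"
              + "  customer_id text,"
              + "  order_id    timeuuid,"
              + "  status      text,"
              + "  PRIMARY KEY ((tenant_id, customer_id), order_id))");

            // The API/proxy layer would always inject the caller's tenant_id.
            session.execute(
                "SELECT order_id, status FROM shared.orders "
              + "WHERE tenant_id = ? AND customer_id = ?", "acme", "c-42");
        }
    }
}
```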
Options 2 & 3 are quite similar, except that with a keyspace per tenant you have the flexibility to set up a different replication strategy - this can be useful for storing a customer's data in data centers bound to different geographic regions. But in both cases there are limitations on the number of tables in the cluster - a reasonable number of tables is around 200, with a "hard stop" at more than 500. The reason is that you need additional resources, such as memory, to keep auxiliary data structures (bloom filters, etc.) for every table, and this consumes both heap and off-heap memory.
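For Option 3, a similar sketch (again driver 4.x, with made-up data-center names) showing the per-tenant replication flexibility mentioned above:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class TenantKeyspaceExample {
    public static void main(String[] args) {
        // Assumes the DataStax Java driver 4.x; data-center names are invented.
        try (CqlSession session = CqlSession.builder().build()) {
            // Keyspace per tenant: each tenant gets its own replication
            // settings, e.g. pinned to the data center of its region.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS tenant_emea WITH replication = "
              + "{'class': 'NetworkTopologyStrategy', 'eu_dc': 3}");
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS tenant_apac WITH replication = "
              + "{'class': 'NetworkTopologyStrategy', 'ap_dc': 3}");
        }
    }
}
```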
I've done this for a few years now at large-scale in the retail space. So my belief is that the recommended way to handle multi-tenancy in Cassandra, is not to. No matter how you do it, the tenants will be hit by the "noisy neighbor" problem. Just wait until one tenant runs a BATCH update with 60k writes batched to the same table, and everyone else's performance falls off.
But the bigger problem, is that there's no way you can guarantee that each tenant will even have a similar ratio of reads to writes. In fact they will likely be quite different. That's going to be a problem for options #1 and #2, as disk IOPs will be going to the same directory.
Option #3 is really the only way it realistically works. But again, all it takes is one ill-considered BATCH write to crush everyone. Also, want to upgrade your cluster? Now you have to coordinate it with multiple teams, instead of just one. Using SSL? Make sure multiple teams get the right certificate, instead of just one.
When we have new teams use Cassandra, each team gets their own cluster. That way, they can't hurt anyone else, and we can support them with fewer question marks about who is doing what.

Apache Cassandra - Listeners [duplicate]

I wonder if it is possible to add a listener to Cassandra that provides the table and the primary key of changed entries? It would be great to have such a mechanism.
Checking the Cassandra documentation, I only find adding StateListener(s) to the Cluster instance.
Does anyone know how to do this without hacking Cassandra's data store or encapsulating the driver and doing something on my own?
Check out this future jira --
https://issues.apache.org/jira/browse/CASSANDRA-8844
If you like it vote for it : )
CDC
"In databases, change data capture (CDC) is a set of software design
patterns used to determine (and track) the data that has changed so
that action can be taken using the changed data. Also, Change data
capture (CDC) is an approach to data integration that is based on the
identification, capture and delivery of the changes made to enterprise
data sources."
-Wikipedia
As Cassandra is increasingly being used as the Source of Record (SoR)
for mission critical data in large enterprises, it is increasingly
being called upon to act as the central hub of traffic and data flow
to other systems. In order to try to address the general need, we
propose implementing a simple data logging mechanism to enable
per-table CDC patterns.
If clients need to know about changes, the world has mostly gone to the message broker model: a middleman that connects producers and consumers of arbitrary data. You can read about Kafka, RabbitMQ, and NATS here. There is an older DZone article here. In your case, the client writing to the database would also send out a change message. What's nice about this model is you can then pull whatever you need from the database.
Kafka is interesting because it can also store data. In some cases, you might be able to dispose of the database altogether.
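A minimal sketch of that pattern, assuming Kafka as the broker and the DataStax Java driver; the table, topic, and message format are invented for illustration:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class ChangePublishingWriter {

    private final CqlSession session;
    private final KafkaProducer<String, String> producer;

    public ChangePublishingWriter(CqlSession session, Properties kafkaProps) {
        // kafkaProps must contain bootstrap.servers and String serializers.
        this.session = session;
        this.producer = new KafkaProducer<>(kafkaProps);
    }

    /** Writes to Cassandra, then emits a change event naming table and key. */
    public void updateStatus(String orderId, String status) {
        session.execute(
            "UPDATE shop.orders SET status = ? WHERE order_id = ?", status, orderId);

        // Listeners subscribe to the topic and pull whatever else they need
        // from the database using the key carried in the message.
        producer.send(new ProducerRecord<>("order-changes", orderId,
            "{\"table\":\"shop.orders\",\"order_id\":\"" + orderId + "\"}"));
    }
}
```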
Are you looking for something like triggers?
https://github.com/apache/cassandra/tree/trunk/examples/triggers
A database trigger is procedural code that is automatically executed
in response to certain events on a particular table or view in a
database. The trigger is mostly used for maintaining the integrity of
the information on the database. For example, when a new record
(representing a new worker) is added to the employees table, new
records should also be created in the tables of the taxes, vacations
and salaries.
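For reference, a minimal trigger sketch against the ITrigger interface of Cassandra 3.x and later (the class name is made up, and a real trigger would typically build audit mutations instead of just logging):

```java
import java.util.Collection;
import java.util.Collections;

import org.apache.cassandra.db.Mutation;
import org.apache.cassandra.db.partitions.Partition;
import org.apache.cassandra.triggers.ITrigger;

/**
 * Minimal trigger: logs which partition was modified and adds no extra
 * mutations. Packaged as a jar in the triggers directory, it would be
 * attached with CREATE TRIGGER ... USING 'ChangeLogTrigger' on a table.
 */
public class ChangeLogTrigger implements ITrigger {
    @Override
    public Collection<Mutation> augment(Partition update) {
        // Runs on the coordinator inside the write path, so keep it cheap.
        System.out.println("change in " + update.metadata()
                + " for partition key " + update.partitionKey());
        return Collections.emptyList();
    }
}
```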

Are there any benefits of using Hazelcast over Cassandra

I want to implement session store for my web application. Here's the profile of my application.
The information associated with a session does not change a lot, but
it does change sometimes.
Session reads (session.getAttribute()) are more frequent than writes (session.setAttribute()).
I don't want to deal with a master-node-based architecture (like Redis).
Data associated with a session is small, but the number of sessions could be large.
The lookup is always in the form of key value like in hash map.
I am OK with eventual consistency.
I want to be able to specify the replication factor, i.e. the number of nodes that will hold data for a given session.
I am only looking for open source solutions that wouldn't incur license costs for the above features.
For now I want to store up to 10,000 sessions with 10 KB of data per session (on average), but eventually I want to scale to 100,000 sessions or more!
In my app Hazelcast is already being used for some other functionality, but I don't want that to be the deciding factor. Cassandra seems to fulfill all my requirements and it seems to be quite popular. Any reason I should choose Hazelcast over Cassandra?
Disclaimer: Hazelcast employee
In general I would argue that if you can exchange Hazelcast with Cassandra OR Cassandra with Hazelcast, one of the tools is misused.
We have plenty of people using them as companions, meaning Cassandra as the storage layer and Hazelcast as the caching layer; Cassandra, however, is not a cache, and Hazelcast is not a database.
If you want to persist your storage to disk, go for Cassandra (maybe add caching with Hazelcast); if you just want to distribute data, go with Hazelcast. The latter especially if it "doesn't really matter" whether you lose sessions once in a while when you (for some reason or another) restart the cluster.
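If Hazelcast alone is enough, a minimal session-map sketch, assuming Hazelcast 4.x or later; the map name and backup count are illustrative, and backup-count is what corresponds to the "replication factor" requirement in the question:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class SessionStoreExample {
    public static void main(String[] args) {
        // backup-count plays the role of the "replication factor" from the
        // question: each entry lives on its owner node plus two backups.
        Config config = new Config();
        config.addMapConfig(new MapConfig("sessions").setBackupCount(2));

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        IMap<String, byte[]> sessions = hz.getMap("sessions");

        sessions.put("session-123", new byte[0]);          // serialized attributes
        byte[] attributes = sessions.get("session-123");   // plain key/value lookup
        System.out.println(attributes.length);
    }
}
```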
We use both of them in our project. We use Cassandra as persistent storage and Hazelcast for temporary and frequently changed data (e.g. distributed queues and synchronization primitives).
Any reason I should choose Hazelcast over Cassandra?
In my opinion, Hazelcast is easier from a developer's point of view, and it does not require as much attention as Cassandra in a production environment (deep configuration and tuning, repair, restarting...), so Hazelcast is cheaper to support.

Cassandra as an embedded service and with custom consistency level

I am thinking of building an application that uses Cassandra as its data store, but which has low-latency requirements. I am aware of EmbeddedCassandraService from this blog post.
Is the following implementation possible and what are known pitfalls (defects, functional limitations)?
1) Run Cassandra as an embedded service, persisting data to disk (durable).
2) Java application interacts with the local embedded service via one of the following. What are the pros and cons of each?
TMemoryBuffer (or something more appropriate?)
StorageProxy (what are the pitfalls of using this API?)
Apache Avro? (see question #5 below)
3) Java application interacts with remote Cassandra service ("backup" nodes) via Thrift (or Avro?).
4) Write must always succeed to the local embedded Cassandra service in order to be successful, and at least one of the remote (non-embedded) Cassandra nodes. Is this possible? Is it possible to define a custom / complex consistency level?
5) Side question: Cassandra: The Definitive Guide mentions in several places that Thrift will ultimately be replaced with Avro, but it seems like that's not the case just yet?
As you might guess, I am new to Cassandra, so any direction to specific documentation pages (not the wiki homepage) or sample projects are appreciated.
Unless your entire database is sitting on the local machine (i.e. a single node), you gain nothing by this configuration. Cassandra will shard your data across the cluster, so (as mentioned in one of the comments) your writes will frequently be made to another node that owns the data. Presuming you write with a consistency level of at least one, your call will block until that other node acks the write. This negates any benefit of talking to the embedded instance since you have some network latency anyway.
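For completeness, here is how a per-statement consistency level is picked with a current DataStax Java driver (the question predates it, from the Thrift era); the table and values are invented, and note that no stock level expresses "the local embedded node plus at least one remote node":

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ConsistencyExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Stock consistency levels only count acknowledgements (ONE, QUORUM,
            // LOCAL_QUORUM, ...); there is no built-in "local node plus one
            // remote node" level.
            SimpleStatement write = SimpleStatement
                .newInstance("UPDATE app.profiles SET name = 'x' WHERE id = 1")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(write);
        }
    }
}
```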
