I am looking for documentation or general guidelines on when more Cassandra servers should be added to a ring. Should this be based on disk usage or other monitoring factors?
Currently I have some concerns about CoordinatorReadLatency, ReadLatency, and DroppedMessages.REQUEST_RESPONSE, but again I cannot find a good guide on how to interpret various components that I am monitoring. I can find good guides on performance tuning, but limited information on devops.
I understand that this question may be more relevant to Server Fault, but they don't have tags for Datastax Enterprise.
Thanks in advance
Next steps based on @bcoverston's response
Nodetool provides access to read and write latency metrics: nodetool cfhistograms
See docs here: http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCFhisto.html?scroll=toolsCFhisto#
Since we want to tie this into pretty graphs, the nodetool source code points us to the right JMX values:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/NodeTool.java#L82
Each column family (cf) has read and write latency metrics.
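As a starting point for graphing, here is a minimal sketch that simply samples nodetool output on a schedule; it assumes nodetool is on the PATH, the keyspace/table names are placeholders, and the exact cfhistograms column layout varies between Cassandra versions (pulling the same values over JMX would work too, using the MBean names visible in the NodeTool source above).

```python
# Minimal sketch: periodically sample per-table latency percentiles via nodetool
# and emit them as time-series points (e.g. for Graphite/Grafana).
# Assumes `nodetool` is on the PATH; keyspace/table names below are placeholders,
# and the exact cfhistograms output format varies between Cassandra versions.
import subprocess
import time

KEYSPACE = "my_keyspace"   # placeholder
TABLE = "my_table"         # placeholder

def sample_histograms():
    out = subprocess.run(
        ["nodetool", "cfhistograms", KEYSPACE, TABLE],
        capture_output=True, text=True, check=True,
    ).stdout
    # Keep only the percentile rows (e.g. "95% ... 1234.00 ..."); parsing is
    # intentionally loose because the column layout differs across versions.
    return [line for line in out.splitlines() if "%" in line]

if __name__ == "__main__":
    while True:
        for row in sample_histograms():
            print(int(time.time()), row)
        time.sleep(60)
```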
The question is a little open-ended, and it depends on your use case. There are a lot of things to monitor, and it can be overwhelming to look at every possible setting and decide whether you need to increase your cluster size.
The general advice here is that you should monitor your read and write latency, decide where your thresholds should be, and plan your capacity accordingly. Because there is no prescriptive hardware for running Cassandra, and your use case can be unique to whatever you're doing, there are only rules of thumb.
Sizing your cluster based on data per node can be helpful, but only if you know how big your working set is and what your latency targets are. In addition, the speed of your storage media also matters.
Sizing your cluster based on latency makes more sense. If you need to do N transactions per second, you can test your hardware against your workload and see if it meets your targets. Keep in mind that you'll want to run a long-term test to see whether those targets hold up under sustained load, and to learn how long it takes for performance under that load to degrade, if it does (a write-heavy workload will degrade over time, and you'll want to add capacity before you start missing your targets).
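As a rough illustration of such a test, here is a minimal sketch using the DataStax Python driver; the contact point, query, run duration and p99 target are all placeholder assumptions, not a prescribed benchmark (for serious testing, cassandra-stress or your real application traffic is a better driver of load).

```python
# Minimal sketch of a sustained-load latency check, assuming the DataStax
# Python driver (pip install cassandra-driver); contact points, keyspace,
# query and the 10 ms p99 target are placeholders for your own workload.
import time
from cassandra.cluster import Cluster

TARGET_P99_MS = 10.0
QUERY = "SELECT * FROM my_keyspace.my_table WHERE id = %s"  # placeholder

cluster = Cluster(["127.0.0.1"])       # placeholder contact point
session = cluster.connect()

latencies_ms = []
end = time.time() + 3600               # run for an hour (or much longer)
while time.time() < end:
    start = time.perf_counter()
    session.execute(QUERY, ("some-id",))
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p99 = latencies_ms[int(len(latencies_ms) * 0.99)]
print(f"p99 = {p99:.2f} ms, target = {TARGET_P99_MS} ms")
cluster.shutdown()
```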
Related
I know the CAP theorem:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
Cassandra is typically classified as an AP system. I have heard that it can be turned into a CA system, but I couldn't find any documentation on it.
How can Cassandra be used as a CA system?
Thanks.
Generally speaking, the 'P' in CAP is what NoSQL technologies were built to solve for. This is usually accomplished by spreading data horizontally across multiple instances.
Therefore, if you wanted Cassandra to run in a "CA" CAP configuration, running it as a single node cluster would be a good first step.
I heard yes it can turned to CA, but I didn't find the documentation.
After re-reading this, it's possible that you may have confused "CA" with "CP."
It is possible to run Cassandra as a "CP" database, or at least tune it to behave more in that regard. The way to go about this would be to set queries on the application side to use higher consistency levels, like [LOCAL_]QUORUM, EACH_QUORUM, or even ALL. Consistency could be tuned even higher by increasing the replication factor (RF) in each keyspace definition. Setting RF equal to the number of nodes and querying at ALL consistency would be about as consistent as it can be tuned.
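For illustration, here is a minimal sketch of what that tuning looks like from the application side, assuming the DataStax Python driver; the keyspace name, data centre name, RF=3 and the query are placeholders.

```python
# Minimal sketch of tuning toward "CP" behaviour with the DataStax Python
# driver; keyspace name, data centre name, RF=3 and the query are placeholders.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect()

# Higher RF means more replicas must agree when reading/writing at QUORUM/ALL.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3}
""")

# Per-statement consistency: QUORUM (or ALL) trades availability for consistency.
stmt = SimpleStatement(
    "SELECT * FROM my_ks.my_table WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(stmt, ("some-id",))
```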
However, I feel compelled to mention what a terrible, terrible idea this all is. Cassandra was engineered to be "AP." Fighting that intrinsic design is a fool's errand. I've always said, nobody wins when you try to out-Cassandra Cassandra.
If you're employing engineering time to make a datastore function in ways that are contrary to its design, then a different datastore (one you don't have to work against) might be the better choice.
What is the difference between the Scylla read path and the Cassandra read path? When I stress-test Cassandra and Scylla, Scylla's read performance is five times worse than Cassandra's, using 16 cores and a normal HDD.
I expected better read performance from Scylla compared to Cassandra on a normal HDD, because my company doesn't provide SSDs.
Can someone please confirm whether it is possible to achieve better read performance with a normal HDD or not?
If yes, what changes are required in the Scylla config? Please guide me!
Some other responses focused on write performance, but this isn't what you asked about - you asked about reads.
Uncached read performance on HDDs is bound to be poor in both Cassandra and Scylla, because each read from disk requires several seeks on the HDD, and even the best HDD cannot do more than, say, 200 of those seeks per second. Even with a RAID of several of these disks, you will rarely be able to do more than, say, 1000 requests per second. Since a modern multi-core CPU can do orders of magnitude more work than 1000 requests per second requires, you'll likely see free CPU in both the Scylla and Cassandra cases. So Scylla's main benefit, using much less CPU per request, will not even matter when the disk is the performance bottleneck. In such cases I would expect Scylla's and Cassandra's performance (I am assuming you're measuring throughput when you talk about performance?) to be roughly the same.
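A quick back-of-envelope sketch of that bound, with all figures being the rough assumptions from the paragraph above:

```python
# Back-of-envelope sketch of why an HDD array caps uncached read throughput,
# using the rough figures from the answer (all numbers are assumptions).
SEEKS_PER_SECOND_PER_DISK = 200   # a good HDD does ~200 random seeks/s
DISKS_IN_RAID = 6                 # placeholder array size
SEEKS_PER_READ = 1.2              # >1 when a read touches several sstables

max_reads_per_second = (SEEKS_PER_SECOND_PER_DISK * DISKS_IN_RAID) / SEEKS_PER_READ
print(f"~{max_reads_per_second:.0f} uncached reads/s")  # ~1000, far below what the CPUs could handle
```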
If, still, you're seeing better throughput from Cassandra than Scylla, there are several details that may explain why, beyond the general client misconfiguration issues raised in other responses:
If you have a small amount of data that fits in memory, Cassandra's caching policy is better for your workload. Cassandra uses the OS's page cache, which reads whole disk pages and may cache multiple items in one read, as well as multiple index entries. Scylla works differently and has a row cache, which only caches the specific data read. Scylla's caching is better for large volumes of data that do not fit in memory, but much worse when the data can fit in memory, until the entire data set has been cached (after everything is cached, it becomes very efficient again).
On HDDs, the details of compaction are very important for read performance: if in one setup you have more sstables to read, that increases the number of disk reads and lowers performance. This can change depending on your compaction configuration, or even randomly (depending on when compaction last ran). You can check whether this explains your performance issues by doing a major compaction ("nodetool compact") on both systems and checking the read performance afterwards. You can also switch the compaction strategy to LCS to ensure that random-access read performance is better, at the cost of more write work (on HDDs, this can be a worthwhile compromise).
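For reference, here is a minimal sketch of switching a table to LCS from the Python driver; the keyspace/table names are placeholders, and the same ALTER TABLE statement can be run directly from cqlsh instead.

```python
# Minimal sketch of switching a table to LeveledCompactionStrategy (LCS) via
# CQL from the Python driver; keyspace/table names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect()

# LCS limits the number of sstables a read must touch, at the cost of extra
# compaction (write) work.
session.execute("""
    ALTER TABLE my_ks.my_table
    WITH compaction = {'class': 'LeveledCompactionStrategy'}
""")
```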
If you are measuring scan performance (reading an entire table) instead of reading individual rows, other issues become relevant: as you may have heard, Scylla subdivides each node into shards (each shard is a single CPU). This is fantastic for CPU-bound work, but can be worse for scanning tables which aren't huge, because each sstable is now smaller and the amount of contiguous data you can read before needing to seek again is lower.
I don't know which of these differences, or something else, is causing the performance of your use case to be lower in Scylla, but please keep in mind that whatever you fix, your performance is always going to be bad with HDDs. With SSDs, we've measured in the past more than a million random-access read requests per second on a single node. HDDs cannot come anywhere close. If you really need optimum performance or performance per dollar, SSDs are really the way to go.
There can be various reasons why you are not getting the most out of your Scylla Cluster.
The number of concurrent connections from your clients/loaders is not high enough, or you're not using a sufficient number of loaders. In that case, some shards will be doing all the work while others sit mostly idle. You want to keep your parallelism high (see the sketch after this list).
Scylla likes to have a minimum of 2 connections per shard (you can see the number of shards in /etc/scylla.d/cpuset.conf).
What's the size of your dataset? Are you reading a large number of partitions or just a few? You might be hitting a hot-partition situation.
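As a rough illustration of driving enough parallelism from the client side, here is a minimal sketch using the Python driver's concurrent execution helper; the contact point, query, key range and concurrency value are placeholder assumptions you would tune for your own loaders.

```python
# Minimal sketch of driving reads with high client-side concurrency, assuming
# the Python cassandra-driver (which also works against Scylla); contact
# points, query and the concurrency value are placeholders.
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect()

query = session.prepare("SELECT * FROM my_ks.my_table WHERE id = ?")
params = [(f"key-{i}",) for i in range(100_000)]

# `concurrency` controls how many requests are in flight at once; too low a
# value leaves most shards idle.
results = execute_concurrent_with_args(session, query, params, concurrency=200)
print(sum(1 for success, _ in results if success), "successful reads")
cluster.shutdown()
```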
I strongly recommend reading the following docs that will provide you more insights:
https://www.scylladb.com/2019/03/27/best-practices-for-scylla-applications/
https://docs.scylladb.com/operating-scylla/benchmarking-scylla/
@Sateesh, I want to add to the answer by @TomerSan that both Cassandra and ScyllaDB utilize the same disk storage architecture (LSM). That means they have roughly the same disk access patterns, because the algorithms are largely the same. LSM trees were built with the idea in mind that it is not necessary to do instant in-place updates. They consist of immutable data files that are large, contiguous pieces of data on disk. That means less random IO and more sequential IO, for which HDDs work great (not counting the parallelism utilized by modern database implementations).
All of the above means that the difference you see is not caused by a difference in how those databases use the disk. It must be related to configuration differences and what happens underneath. Maybe ScyllaDB tries to utilize more parallelism, or compacts more aggressively. It depends.
In order to be able to say anything specific, please share your tests, envs, and configurations.
Both databases use LSM trees, but Scylla adds a thread-per-core architecture on top, plus we use O_DIRECT while C* uses the page cache. Scylla also has a sophisticated IO scheduler that makes sure not to overload the disk, and scylla_setup runs a benchmark automatically to tune it; check its output in io.conf.
There are far more things to review; it's better to send your data to the mailing list. In general, Scylla should perform better in this case as well, but your disk is likely to be the bottleneck in both cases.
As a summary, I would say ScyllaDB and Cassandra have the same read/write path: memtable, commit log, sstable.
However implementation is very different:
- Cassandra relies on the OS for low-level IO and networking (as most DBMSs do)
- ScyllaDB relies on its own library (Seastar) to handle IO and networking at a low level, independently of the OS page cache etc. This is why they can provide features such as workload scheduling within the same cluster that would be very hard to implement in Cassandra.
We are looking for an open-source in-memory database that supports indexes.
The use case is that we have a lot of items that are going to grow in a big way.
Each item has a few fields on which we need to query.
Currently we store the data in the application's memory. However, with increasing data, we have to think about distributing/sharding the DB.
We have looked at a few options
Redis Cluster could be used, but it does not have the concept of indexes or SQL-like queries.
Apache Ignite is both in-memory and distributed, and provides SQL queries. However, the problem is that Ignite fires all queries at all master nodes, so the final result will be as slow as the slowest of those queries. That seems like a problem, because one non-performing/slow node out of many can really slow down the application a lot. Further, in Ignite reads are done from the masters and slaves are not used, so it is difficult to scale the queries. Increasing the number of nodes will have a negative impact, as the number of sub-queries will increase and it will be even slower.
Cassandra - the in-memory option in Cassandra could be used, but it seems the max size of an in-memory table is 1 GB per node. If our table is more than 1 GB, we will have to resort to partitioning, which in turn leads Cassandra to make multiple queries (one per node), and that is a problem (same as Ignite). We are not sure whether reads on a Cassandra in-memory table can be scaled by increasing the number of slaves.
We are open to other solutions, but we wonder whether the multi-query issue will be a problem everywhere (e.g. Hazelcast).
The ideal solution for our use case would be an in-memory database with indexes whose reads could be scaled by increasing the number of slaves. Making it distributed/sharded will lead to multiple queries, and we are reluctant because one erring node could slow the whole system down.
Hazelcast supports indexes (sorted and unsorted), and, importantly, there is no multi-query problem with Hazelcast.
Hazelcast supports a PartitionPredicate that restricts the execution of a query to the node that is the primary replica of the key passed to the constructor of the PartitionPredicate. So if you know where the data resides, you can query just that node. There is no need to fix or implement anything to support this; you can use it right away.
It's probably not reasonable to use it all the time; it depends on your use case.
For complex queries that scan a lot of data but return small results, it's better to use the OBJECT inMemoryFormat. You should get excellent execution times and low latencies.
Disclaimer: I am GridGain employee and Apache Ignite committer.
Several comments on your concerns:
1) Slow nodes will lead to problems in virtually any clustered environment, so I would not consider this a disadvantage. This is a reality you should embrace and accept. It is necessary to understand why a node is slow and fix/upgrade it.
2) Ignite is able to perform reads from slaves both for regular cache operations [1] and for SQL queries executed over REPLICATED caches. In fact, using a REPLICATED cache for reference data is one of the most important features allowing Ignite to scale smoothly.
3) As you correctly mentioned, currently a query is broadcast to all data nodes. We are going to improve this. First, we will let users specify the partitions to execute a query against [2]. Second, we are going to improve our optimizer so that it tries to calculate the target data nodes in advance and avoid the broadcast [3], [4]. Both improvements will be released very soon.
4) Last, but not least, a persistence layer will be released in several months [5], meaning that Ignite will become a distributed database with both in-memory and persistence capabilities.
[1] https://ignite.apache.org/releases/mobile/org/apache/ignite/configuration/CacheConfiguration.html#isReadFromBackup()
[2] https://issues.apache.org/jira/browse/IGNITE-4523
[3] https://issues.apache.org/jira/browse/IGNITE-4509
[4] https://issues.apache.org/jira/browse/IGNITE-4510
[5] http://apache-ignite-developers.2346864.n4.nabble.com/GridGain-Donates-Persistent-Distributed-Store-To-ASF-Apache-Ignite-tc16788.html
I can give opinions on Cassandra. The max size of your table per node is configurable and tunable, so it depends on the amount of memory you are willing to pay for. Partitioning is built into Cassandra, so basically Cassandra manages it for you, and it's relatively simple to do: the first part of the primary key is the partition key, and it determines which node in the cluster the data lives on.
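As a small illustration of that, here is a minimal sketch using the Python driver; the keyspace, table and column names are placeholders.

```python
# Minimal sketch of how partition key placement works, assuming the Python
# cassandra-driver; keyspace/table/column names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect("my_ks") # assumes the keyspace already exists

# The first part of the PRIMARY KEY, (user_id), is the partition key: all rows
# with the same user_id hash to the same token and therefore the same replicas.
session.execute("""
    CREATE TABLE IF NOT EXISTS items_by_user (
        user_id text,
        item_id text,
        payload text,
        PRIMARY KEY ((user_id), item_id)
    )
""")

# Single-partition query: the coordinator can route it straight to the owning
# replicas instead of fanning out across the cluster.
rows = session.execute("SELECT * FROM items_by_user WHERE user_id = %s", ("u42",))
```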
But I also guess you are aware of this, since you mention multiple queries per node. I guess there is no nice way around it.
Just one slight remark: there are no masters and slaves in Cassandra; every node is equal. Basically, the client asks any node in the cluster; that node then becomes the coordinator node, and since it gets the partition key it knows which node to ask for the data, which it then returns to the client.
Other than that, I guess you have read up on Cassandra enough (from what I can see in your question).
Basically it comes down to the access pattern: if you know how you are going to access your data, then it's the way to go. But other databases are also pretty decent.
Indexing with Cassandra usually hides some potential performance problems. People usually avoid it because in Cassandra an index has to be built for every record there is on the whole cluster, and it's done per node. This doesn't really scale. Basically, you always have to query first, no matter how you put it with Cassandra.
Plus, the in-memory option seems to be part of DSE Cassandra (DataStax Enterprise), not the open-source or community edition. You have to take this into account as well.
I have created a two-node Cassandra cluster and tried to perform a load test. I find that one node or two nodes does not make much difference in throughput. I assumed that if 1 node can give me 2000 TPS for inserts, then two nodes should double that amount. Does it work like that?
If not, what does scaling actually mean, and how can I relate it to latency or throughput?
Cassandra is scalable. Your case is just a bit simplified, since two nodes is not really a case of high scalability. You should be aware of the token partitioning algorithm used by Cassandra. As soon as you understand it, there should not be any questions. There are plenty of presentations about it, e.g. this one: http://www.datastax.com/resources/tutorials/partitioning-and-replication
In the case of replication factor 1, everything is simple:
Each key-value pair you write to or read from Cassandra is a query to one of the Cassandra nodes in the cluster. Data is evenly distributed among the nodes (see the details of the partitioning algorithm), so the total load is spread evenly across all nodes: the more nodes you have, the more load they can carry (and it scales linearly). In this case the system should of course be configured correctly to avoid various kinds of network bottlenecks.
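To make the even-distribution point concrete, here is a toy sketch of token-based partitioning; it is not Cassandra's actual Murmur3 partitioner, just an illustration of why load spreads evenly and capacity grows roughly linearly with nodes (Cassandra's vnodes play a similar role to the multiple tokens per node used here).

```python
# Toy illustration of token-based partitioning (not Cassandra's actual Murmur3
# partitioner): each node owns slices of the hash ring, so keys, and therefore
# load, spread evenly, and adding nodes adds capacity roughly linearly (RF=1).
import bisect
import hashlib
from collections import Counter

def token(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

nodes = ["node1", "node2", "node3", "node4"]
# 128 tokens per node, loosely analogous to Cassandra's vnodes (num_tokens).
ring = sorted((token(f"{n}:{v}"), n) for n in nodes for v in range(128))
tokens = [t for t, _ in ring]

def owner(key: str) -> str:
    i = bisect.bisect(tokens, token(key)) % len(ring)  # next token clockwise
    return ring[i][1]

load = Counter(owner(f"key-{i}") for i in range(100_000))
print(load)   # roughly equal counts per node
```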
In the case of a replication factor greater than 1, the situation is a bit more complicated, but the principle is the same.
There are a lot of factors that contribute to this result.
A) Check your replication factor. Although not desirable, in your case you can set it to 1.
B) Look at the partition key in your primary key. If you are not varying it in your tests, then you are loading skewed data and the table is not scaling out to 2 nodes.
What does it mean when we say Cassandra is scalable?
There are basically two ways to scale a database.
Vertical scaling: Increasing the resources of the existing nodes in your cluster (more RAM, faster HDDs, more cores).
Horizontal scaling: Adding additional nodes to your cluster.
Vertical scaling tends to be more of a "band-aid" or temporary solution, because it has very finite limits. Your machines will only support so much RAM or so many cores, and once you max that out you really don't have anywhere to go.
Cassandra is "scalable" because it simplifies horizontal scaling. If you find that your existing nodes are maxing-out their available resources, you can simply add another node(s), adjust your replication factor, and run a nodetool repair. If you have had to do this with other database products, you will appreciate how (relatively) easy Cassandra makes it.
In your case, it's hard to know exactly what is going on without (a lot) more detail. But if your load tests are being adequately handled by your first node, then I can see why you wouldn't notice much of a difference after adding another.
If you haven't already, check out the Cassandra Stress Tool.
Additionally, be sure to check your current methods against this article, which is appropriately titled: How not to benchmark Cassandra
I'm working on a hobby project involving a rather CPU-intensive calculation. The problem is embarrassingly parallel. This calculation will need to happen on a large number of nodes (say 1000-10000). Each node can do its work almost completely independently of the others. However, the entire system will need to answer queries from outside the system. Approximately 100000 such queries per second will have to be answered. To answer the queries, the system needs some state that is sometimes shared between two nodes. The nodes need at most 128MB RAM for their calculations.
Obviously, I'm probably not going to afford to actually build this system in the scale described above, but I'm still interested in the engineering challenge of it, and thought I'd set up a small number of nodes as proof-of-concept.
I was thinking about using something like Cassandra and CouchDB to have scalable persistent state across all nodes. If I run a distributed database server on each node, it would be very lightly loaded, but it would be very nice from an ops perspective to have all nodes be identical.
Now to my question:
Can anyone suggest a distributed database implementation that would be a good fit for a cluster of a large number of nodes, each with very little RAM?
Cassandra seems to do what I want, but http://wiki.apache.org/cassandra/CassandraHardware talks about recommending at least 4G RAM for each node.
I haven't found a figure for the memory requirements of CouchDB, but given that it is implemented in Erlang, I figure maybe it isn't so bad?
Anyway, recommendation, hints, suggestions, opinions are welcome!
You should be able to do this with Cassandra, though depending on your reliability requirements, an in-memory database like Redis might be more appropriate.
Since the data set is so small (100 MB of data), you should be able to run with less than 4GB of RAM per node. Adding in Cassandra overhead, you probably need 200MB of RAM for the memtable and another 200MB of RAM for the row cache (to cache the entire data set, turn off the key cache), plus another 500MB of RAM for Java in general, which means you could get away with 2 GB of RAM per machine.
Using a replication factor of three, you probably only need a cluster on the order of tens of nodes to serve the number of reads/writes you require (especially since your data set is so small and all reads can be served from the row cache). If you need the computing power of thousands of nodes, have them talk to the tens of Cassandra nodes storing your data, rather than trying to run Cassandra across thousands of nodes.
I've not used CouchDB myself, but I am told that Couch will run in as little as 256MB with around 500K records. At a guess, that would mean each of your nodes might need ~512MB, taking into account the extra 128MB they need for their calculations. Ultimately you should download both and test each inside a VPS, but it does sound like Couch will run in less memory than Cassandra.
Okay, after doing some more reading after posting the question, and trying some things out, I decided to go with MongoDB.
So far I'm happy. I have very little load, and MongoDB is using very few system resources (~200MB at most). However, my dataset isn't nearly as large as described in the question, and I am only running 1 node, so this doesn't mean much.
CouchDB doesn't seem to support sharding out of the box, so it is not (it turns out) a good fit for the problem described in the question (I know there are add-ons for sharding).