What is the difference between scylla read path and cassandra read path? - cassandra

What is the difference between the Scylla read path and the Cassandra read path? When I stress-test Cassandra and Scylla, Scylla's read performance is about 5 times worse than Cassandra's, using 16 cores and a normal HDD.
I expect better read performance on Scylla compared to Cassandra using normal HDDs, because my company doesn't provide SSDs.
Can someone please confirm whether it is possible to achieve better read performance using a normal HDD or not?
If yes, what changes are required in the Scylla config? Please guide me!

Some other responses focused on write performance, but this isn't what you asked about - you asked about reads.
Uncached read performance on HDDs is bound to be poor in both Cassandra and Scylla, because each read from disk requires several seeks on the HDD, and even the best HDD cannot do more than, say, 200 of those seeks per second. Even with a RAID of several of these disks, you will rarely be able to do more than, say, 1000 requests per second. Since a modern multi-core machine can do orders of magnitude more CPU work than 1000 requests per second, in both the Scylla and Cassandra cases you'll likely see free CPU. So Scylla's main benefit, using much less CPU per request, will not even matter when the disk is the performance bottleneck. In such cases I would expect Scylla's and Cassandra's performance (I am assuming that you're measuring throughput when you talk about performance?) to be roughly the same.
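To put rough numbers on this, here is a minimal back-of-the-envelope sketch. The figure of 2 seeks per read and the 10-disk array are illustrative assumptions, not measurements; only the 200 seeks per second comes from the estimate above.

```python
def max_uncached_reads_per_sec(seeks_per_sec, seeks_per_read, disks=1):
    """Upper bound on uncached reads/s when every read costs several seeks.

    Assumes reads are spread evenly over the disks and nothing is cached.
    """
    return disks * seeks_per_sec / seeks_per_read


# One HDD at ~200 seeks/s, assuming 2 seeks per read (hypothetical figure).
single_disk = max_uncached_reads_per_sec(200, 2)
# Hypothetical 10-disk array with perfectly even load.
array = max_uncached_reads_per_sec(200, 2, disks=10)
print(single_disk, array)
```

Whatever the exact inputs, the ceiling stays in the hundreds to low thousands of reads per second, far below what the CPU could serve.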
If, still, you're seeing better throughput from Cassandra than Scylla, there are several details that may explain why, beyond the general client mis-configuration issues raised in other responses:
If you have a small amount of data that can fit in memory, Cassandra's caching policy is better for your workload. Cassandra uses the OS's page cache, which reads whole disk pages and may cache multiple items in one read, as well as multiple index entries. Scylla works differently: it has a row cache, which caches only the specific data read. Scylla's caching is better for large volumes of data that do not fit in memory, but much worse when the data can fit in memory, until the entire data set has been cached (after everything is cached, it becomes very efficient again).
On HDDs, the details of compaction are very important for read performance - if in one setup you have more sstables to read, it can increase the number of reads and lower the performance. This can change depending on your compaction configuration, or even randomly (depending on when compaction was run last). You can check if this explains your performance issues by doing a major compaction ("nodetool compact") on both systems and checking the read performance afterwards. You can switch the compaction strategy to LCS to ensure that random-access read performance is better, at the cost of more write work (on HDDs, this can be a worthwhile compromise).
If you are measuring scan performance (reading an entire table) instead of reading individual rows, other issues become relevant: As you may have heard, Scylla subdivides each node into shards (each shard is a single CPU). This is fantastic for CPU-bound work, but could be worse for scanning tables which aren't huge, because each sstable is now smaller and the amount of contiguous data you can read before needing to seek again is lower.
I don't know which of these differences - or something else - is causing the performance of your use case to be lower in Scylla, but please keep in mind that whatever you fix, your performance is always going to be bad with HDDs. With SSDs, we've measured in the past more than a million random-access read requests per second on a single node. HDDs cannot come anywhere close. If you really need optimum performance or performance per dollar, SSDs are really the way to go.

There can be various reasons why you are not getting the most out of your Scylla Cluster.
The number of concurrent connections from your clients/loaders is not high enough, or you're not using a sufficient number of loaders. In such a case, some shards will be doing all the work, while others will be mostly idle. You want to keep your parallelism high.
Scylla likes to have a minimum of 2 connections per shard (you can see the number of shards in /etc/scylla.d/cpuset.conf).
What's the size of your dataset? Are you reading a large number of partitions or just a few? You might be hitting a hot partition situation.
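The connection rule of thumb above can be turned into a quick sanity check. This is only a sketch: the shard and node counts are hypothetical inputs you would read off your own setup, and the factor of 2 is the minimum quoted above.

```python
def min_client_connections(shards_per_node, nodes, per_shard=2):
    """Minimum total client connections so that every shard gets `per_shard`."""
    return shards_per_node * nodes * per_shard


# Hypothetical single node running 16 shards.
needed = min_client_connections(shards_per_node=16, nodes=1)
print(needed)
```

If your loader opens fewer connections than this, some shards will sit idle no matter how fast the disks are.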
I strongly recommend reading the following docs that will provide you more insights:
https://www.scylladb.com/2019/03/27/best-practices-for-scylla-applications/
https://docs.scylladb.com/operating-scylla/benchmarking-scylla/

@Sateesh, I want to add to the answer by @TomerSan that both Cassandra and ScyllaDB use the same disk storage architecture (LSM). That means that they have roughly the same disk access patterns, because the algorithms are largely the same. LSM trees were built with the idea in mind that it is not necessary to do instant in-place updates. An LSM tree consists of immutable data buckets that are large contiguous pieces of data on disk. That means less random IO and more sequential IO, for which HDDs work great (not counting the parallelism utilized by modern database implementations).
All of the above means that the difference you see is not induced by a difference in how those databases use the disk. It must be related to configuration differences and what happens underneath. Maybe ScyllaDB tries to utilize more parallelism or does compaction more aggressively. It depends.
In order to be able to say anything specific, please share your tests, envs, and configurations.

Both databases use an LSM tree, but Scylla has a thread-per-core architecture on top, plus we use O_DIRECT while C* uses the page cache. Scylla also has a sophisticated IO scheduler that makes sure not to overload the disk, and so scylla_setup automatically runs a benchmark to tune it. Check its output in io.conf.
There are far more things to review, better to send your data to the mailing list. In general, Scylla should perform better in this case as well but your disk is likely to be the bottleneck in both cases.

As a summary, I would say ScyllaDB and Cassandra have the same read/write path: memtable, commitlog, sstable.
However, the implementation is very different:
- Cassandra relies on the OS for low-level IO and networking (most DBMSs do)
- ScyllaDB relies on its own library (Seastar) to handle IO and networking at a low level, independently of the OS page cache etc. This is why it can provide features such as workload scheduling within the same cluster, which would be very hard to implement in Cassandra.

Related

Why is it so bad to have large partitions in Cassandra?

I have seen this warning everywhere but cannot find any detailed explanation on this topic.
For starters, the maximum number of cells (rows x columns) in a single partition is 2 billion.
If you allow a partition to grow unbounded, you will eventually hit this limitation.
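As a rough illustration of how an unbounded partition approaches that ceiling, here is a small sketch. The limit is the 2 billion figure quoted above; the write rate and column count are made-up example inputs.

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000  # figure quoted above


def cells_in_partition(rows, columns_per_row):
    """Cells are counted as rows x columns."""
    return rows * columns_per_row


def days_until_limit(rows_per_day, columns_per_row):
    """Days before an append-only partition reaches the cell limit."""
    return MAX_CELLS_PER_PARTITION / (rows_per_day * columns_per_row)


# Hypothetical workload: 1 million rows/day, 10 columns each, one partition.
days = days_until_limit(rows_per_day=1_000_000, columns_per_row=10)
print(days)
```

In practice the practical limits discussed next are reached long before the theoretical one.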
Outside that theoretical limit, there are practical limitations tied to the impact large partitions have on the JVM and read times. These practical limits keep rising from version to version. They are not fixed but vary with data model, query patterns, heap size, and configuration, which makes it hard to give a straight answer on what's too large.
As of 2.1 and early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index, which marks a row every column_index_size_in_kb. You can increase key_cache_size_in_mb for reads to prevent unnecessary deserialization, but that reduces heap space and fills old gen. You can increase the column index size, but that will increase worst-case IO costs on reads. There are also many different settings for CMS and G1 to tune the impact of a huge spike in object allocations when reading these big partitions. There are active efforts on improving this, so in the future it might no longer be the bottleneck.
Repairs also only go down to (in the best case) the partition level. So if, say, you are constantly appending to a partition, and hashes of that partition on 2 nodes are compared at not exactly the same time (a distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce the impact of this, but you're still streaming massive amounts of data and fluctuating disk usage significantly, and that data will then need to be compacted together unnecessarily.
You can probably keep adding corner cases and scenarios that have issues to this list. Many times large partitions are possible to read, but the tuning and corner cases involved are not really worth it; it is better to just design the data model to be friendly to how Cassandra expects it. I would recommend targeting 100 MB, but you can go far beyond that comfortably. Into the GBs, you will need to start considering tuning for it (depending on data model, use case etc.).

Dedicated commitlog storage vs Read/Write ratio?

We are using SSD disks to provide storage for our cluster, on servers with 30 GB of memory.
There is an argument about the commitlog directory: whether to dedicate an individual disk to it or keep it on the same disk as the data.
As we are already using SSD disks, performance should be fine having both commitlogs and data on the same disk, as there is no mechanical moving head for writing.
However, there is another factor, that is the read/write ratio. How would such a ratio affect the performance of writing or reading when we have both commitlogs and data on the same disk?
Using SSD, when would it become important to dedicate a high performance disk for the commitlog directory?
A dedicated commitlog device usually makes a lot of sense when you have HDDs, but is less obvious if you're using SSDs.
Even if you asked only if it makes sense with SSDs setups, I will try to give some general hints about the subject, primarily based on my understandings and my own experience. I admit the focus is probably too much on HDDs, but HDDs allow a deep insight on how Cassandra stuff works and why backing a commitlog/data directory with an SSD can be a life saver.
Background: IOPS and OPS are not the same thing.
I will start from a (very) far point: Device Performance. Here's a start-point lecture about storage device performances in general. Even if the article's neutrality is under discussion, it can provide some insights about the general metrics and performance you can expect from some systems. Of course, your mileage may vary, depending on what device (type/brand/model etc...) and how much stress (intended as type of workload) you put on the device, but I think it is a good starting point for our discussion here.
The reason I prefer to start from IOPS is because it is the very starting point for understanding storage performance. The C* literature speaks about OPS, Operations Per Second, because people usually don't think in terms of IOPS, especially when looking at stats. This really hides a lot of details, the operation size for starters.
A Cassandra operation usually consists of multiple IO operations. The Cassandra documentation usually refers to spinning disks (even if SSDs are referenced too) and clearly states what happens when performing reads/writes. Still, people tend to ignore the fact that when their software stack (which spans from the application down to Cassandra and its data files on the storage) hits the disks, performance decreases by a huge amount, just because they have failed to recognize a random workload, even if "Cassandra is a high-performance etc... etc... etc...".
As an example, looking at the picture in the read path documentation, you can clearly see what data structures are in memory/on disk, and how the SSTable data is accessed. Further, the row cache paragraph says:
... If row cache is enabled, desired partition data is read from the row cache, potentially saving two seeks to disk for the data...
And here's where the catch starts: these two seeks are potentially saved from Cassandra's point of view. This simply means that Cassandra won't make two requests to the storage system: it will avoid requesting the partition index and the data, because everything is already in RAM. But it doesn't really translate to "the storage system will save two IO operations". Indeed, how (generic) data is retrieved from the storage device is a very different thing, and of course depends on how the files are laid out on the disk itself: are you using EXT4, XFS, or what? Assuming no cache is available (e.g. for very big data set sizes you can't really cache everything...), looking for a file consumes IOPS, and this tends to amplify the potential savings when you have data in RAM, and to amplify the penalty you perceive when your data is not.
You can't escape physics: HDDs pay some taxes, SSDs don't.
As you already know, the main "problem" (performance-wise) of HDDs is the average seek time, that is, the time the HDD needs to wait on average in order to have a target sector under the heads. Once the sector is under the heads, if the system has to read a bunch of sequential bits everything is smooth, and the throughput is proportional to the rotational speed of the HDD (to be precise, to the tangential speed of the platters under the head, which also depends on the track etc...).
In other terms, HDDs have an average fixed performance tax (the average seek time), and everything after is almost "free". If an application requests a bunch of sectors that are not "contiguous" (from the disk's point of view, e.g. a fragmented file is split across multiple sectors, but an application can't really know this), the disk will have to wait the average seek time multiple times, and this fixed tax limits its maximum throughput.
The strongest argument about storage is: every device has its own magic maximum average IOPS number. This number expresses the number of random IOPS the device can perform. You can't force an HDD to have more IOPS on average; it's a physical problem. The OS is usually smart enough to "enqueue" sector requests in an attempt to reduce seek times, e.g. ordering by ascending requested sector number (trying to exploit some sequential operations), but nothing will save performance from a random IO workload. You have X allotted IOPS and must face your problems with that. No matter what.
You need to take advantage of the allotted IOPS of your device, and you must be wise on how you use them.
Suppose you have an HDD that maxes out at 100 IOPS on average. If your application performs a bunch of small (say 4KB) file reads, you have an application that performs 100 * 4KB reads every second: the throughput will be around 400KB/s (unless some caching is involved, and in that case the cache saved you precious IOPS). Astonishing. This is simply because you keep paying the seek time multiple times. If you change your access pattern to something that reads 16MB (contiguous) files, you get a higher throughput because you won't pay the seek time so often; you are exploiting a sequential pattern. What changes under the hood is the Request Size of each operation.
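The arithmetic in this example is just IOPS multiplied by request size. A minimal sketch, using the same illustrative 100 IOPS and 4KB figures (the function name is made up for this example):

```python
def throughput_kb_per_sec(iops, request_size_kb):
    """Throughput of a random workload: requests per second x size of each."""
    return iops * request_size_kb


# 100 random reads per second of 4KB each, as in the example above.
small_random = throughput_kb_per_sec(iops=100, request_size_kb=4)
print(small_random)
```

The IOPS budget is fixed by the device, so the only lever an application has is how much useful data each request brings back.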
Now an interesting question is: how are "IOPS" and "Request Size" related? Can one request of 16MB be considered one IOPS? And what about a 128MB request? This is indeed a good question. At a lower level, the Request Size spans from 512 bytes (the minimum sector size) to 128KB (32 * 4K sectors in one request). If the operation is small, its transfer time (the time the disk needs to fetch the data) is also small. Larger request sizes obviously have higher transfer times. However, if you are able to perform 100 IOPS at 4KB, you will probably be able to perform around 80 IOPS at 8KB. The relation isn't linear, because the transfer time depends only on the rotational speed of the disk (and is negligible compared to the seek time), and since you are actually reading from two adjacent sectors, you'll hit the seek time penalty only once per request. This translates to a throughput of around 400KB/s for 4K requests and around 640KB/s for 8K requests. And so on... The larger the request size, the longer it takes to transfer data, the fewer IOPS you have, and the higher the throughput. (These are random numbers, pun intended, no measurements done! Just to let you understand. I do think they are in the ballpark, however.)
SSDs don't suffer mechanical penalties, and that's why they are capable of performing much better than HDDs. They have many more IOPS, and their limits come from the onboard electronics, bus connection etc... Having a higher-IOPS device is a big plus: the extra IOPS can be consumed by applications that are not IOPS friendly, and the user won't notice that the applications suck. However, with SSDs, the Request Size linearly influences the number of IOPS you can perform. When you look at a device rated for 100k IOPS, that figure usually refers to 4K requests. You'll be able to perform only about 6.2k requests per second if you perform 64K requests.
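The linear scaling described for SSDs can be sketched as follows. This assumes the simple inverse-proportional model stated above and a rating given at 4K requests; real devices will deviate from it.

```python
def ssd_iops_at(request_size_kb, rated_iops=100_000, rated_size_kb=4):
    """Estimate IOPS at a given request size from a rating at another size.

    Simplified model: IOPS shrink in proportion to the request size.
    """
    return rated_iops * rated_size_kb / request_size_kb


# A device rated 100k IOPS at 4K, driven with 64K requests.
large_requests = ssd_iops_at(64)
print(large_requests)
```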
Why does Cassandra have such good read performance even with HDDs, then?
Speaking from a single-node point of view (because Cassandra's cluster performance scales linearly with the number of nodes in the cluster), the problem lies in the question itself. This is only true if you model your data in this particular way:
You must fetch all your data with one query only.
Your data must be ordered.
If you can't fetch your data with one query, denormalize in order to retrieve it with one query only.
You fetch a relatively good amount of data on every read.
These are well-known Cassandra modeling rules, but the key point is that these rules do really have a reason to be applied IOPS-wise. Indeed, these rules allow Cassandra to:
Be a super fast database because it will just require the partition index and the SSTable offset index of the data: two IOPS in the best case, much more IOPS in the worst case.
Be a super fast database because it will exploit the sequential capabilities of the HDDs and will not stress the IO subsystem by issuing other IO (random) seeks.
Be a super fast database because it will just fetch more data like the point number 1.
Be a super fast database because it will exploit longer the sequential capabilities of the HDDs.
In other terms, following these basic data modeling rules allows Cassandra to be IOPS friendly when reading data back.
What happens if you screw-up your data model? Cassandra won't be IOPS friendly, and as a consequence the performances will be predictably horrible. Unless you use an SSD, which has greater IOPS and then you won't notice slowness too much.
What happens if you read/write a small amount of data (eg due to misconfigured flush sizes, small commit log etc...)? Cassandra won't be IOPS friendly, and as a consequence the performances will be predictably horrible. Unless you use an SSD, which has greater IOPS and then you won't notice slowness too much.
How can a read/write ratio pattern influence performance in a Cassandra node?
Cassandra is a complex system, with different components that interact with each other. I will try to explain, from my point of view, the main points when you put everything on one device only.
Writes/Deletes/Updates in Cassandra are fast because they are simply append-only writes to the CommitLog device. Reads, on the contrary, can be very IOPS consuming. When both CommitLog and Data are on the same physical disk (either HDD or SSD), the read/write paths interact, and they both consume IOPS.
Two important questions are:
How many IOPS does a read (using the read path) consume?
How many IOPS does a write consume?
These are important questions because you have to remember that your device can perform at most X IOPS, and your system will have to split these X IOPS among these operations.
It is quite difficult to answer the "read" question because, when you request some data, Cassandra needs to locate all the SSTables needed to satisfy the request. Assuming a very big dataset, where caching is not effective, this implies that the Cassandra read path can be very IOPS hungry. Indeed, if your data is spread across 3 different SSTables, Cassandra will have to locate all of them, and for each SSTable it will follow the read path: it will read the partition index, and then it will read the data in the SSTable. That is at least two IOPS per SSTable, and possibly more, because if your filesystem is not "collaborative" enough, locating a file and/or pointing at a file offset could require additional IOPS. In the end, in this example Cassandra is consuming at least six IOPS per read.
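The six-IOPS figure follows from two IOPS per SSTable times three SSTables. A small sketch of that lower bound, and of what it implies for a device with a fixed IOPS budget (the 100 IOPS device is the illustrative HDD used earlier; the function names are made up for this example):

```python
def min_iops_per_read(sstables_touched, iops_per_sstable=2):
    """Lower bound: one partition-index read plus one data read per SSTable."""
    return sstables_touched * iops_per_sstable


def max_reads_per_sec(device_iops, sstables_touched):
    """Uncached reads/s a device can serve if reads are the only IO."""
    return device_iops / min_iops_per_read(sstables_touched)


print(min_iops_per_read(3))       # data spread across 3 SSTables
print(max_reads_per_sec(100, 3))  # illustrative 100 IOPS HDD
```

This ignores filesystem overhead and any competing writes, so real numbers can only be lower.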
Answering the "write" question is also tricky, because compactions and flushes can be triggered. They will consume a lot of IOPS. Flushes are easy to understand: they write data from memtables to disk with a sequential pattern. Compactions, instead, read data back from different SSTables on disk, and while reading the tables they flush the result out to a new file on disk. This is a mixed read/write pattern, and on HDDs this is very disruptive, because it will force the disk to perform multiple seeks.
Mixing percentages: TL;DR
If you have a R/W ratio of 95% reads and 5% writes, having a separate CommitLog device can be a waste of resources, because writes will hardly impact your read performances, and you write so rarely that write performance may be considered not critical.
If you have a R/W ratio of 5% reads and 95% writes, having a separate CommitLog device can be again a waste of resources, because reads will hardly impact your write performances, and your read performances will hardly suffer from a bunch of sequential appends on the commitlog.
And finally, if you have a R/W ratio of 50% reads and 50% writes, having a separate CommitLog device is NOT a waste of resources, because every write performed on the CommitLog device won't produce at least two IOPS on the data drive (one for writing, and one for going back to read).
Please note that I didn't mention compactions, because regardless of your workload, when compaction kicks in, your workload will be disrupted by mixed read/write background operations on different files (consuming disk IOPS all the way), and you will suffer on both reads and writes.
All this should be clear enough for HDDs, because you run out of IOPS very fast, and when you do you notice it immediately. On SSDs, however, you don't run out of IOPS that fast, but you could if your data consists of a lot of small data rows.
The reality is that running out of IOPS on an SSD is very hard, because you'll run out of CPU resources (by a wide margin) first. But once you do, you will see your performance slowly decrease. The effect, however, won't be as dramatic as in the case of HDDs. As an example, if you have a 100 IOPS HDD and you run out of IOPS by trying to issue 500 random IOs, you clearly get a penalty. Calling this penalty P, if you have an SSD with 100k IOPS, to get the same penalty P you would have to issue 500k IOPS, which can be very difficult to do without exhausting CPU or RAM.
In general, when you run out of some type of resource in your system, you need to increase its quantity. The most important thing (to me) is not to run out of IOPS in the "Data" part of your Cassandra cluster. In the case of SSDs, it's rare that you'll hit the IOPS limit. You'll burn your CPU well before that, I think. But you will hit it if you don't tune your system, or if your workload puts too much stress on the disk subsystem (e.g. Leveled Compaction). I'd suggest using an ordinary HDD instead of a high-performance SSD for the commitlog, saving money. But if you have a lot of very small commitlog flushes, an SSD is a complete life saver, because your writers won't suffer the latency of HDDs.
Finally, in my opinion, you should go to pre-production with some sort of real data and check your IOPS requirements. If you have enough room to put the SSD there, don't worry. Go and save money. If your system gets too much pressure due to compaction, then having a separate device is suggested. Analyze your commitlog pattern, and if it's not IOPS demanding put it on a separate disk. Moreover, if you have a virtual environment you can provision a relatively small commitlog device regardless of other factors. It won't raise the cost of your solution too much.
The actual numbers will depend highly on the type of workload you have, the configuration you have, etc. You can have a look at the Netflix tech blog posts for ballpark numbers, e.g. #1, #2.
Dedicating a disk to the commitlog directory is a sort of scale-up strategy. Cassandra works well with a scale-out approach. You just add more nodes to the cluster to spread the load - the second of the linked articles has a nice graph showing near-linear scalability.

Cassandra cluster - data density (data size per node) - looking for feedback and advice

I am considering the design of a Cassandra cluster.
The use case would be storing large rows of tiny samples for time series data (using KairosDB), data will be almost immutable (very rare delete, no updates). That part is working very well.
However, after several years the data will be quite large (it will reach a maximum size of several hundred terabytes - over one petabyte considering the replication factor).
I am aware of advice not to use more than 5TB of data per Cassandra node because of high I/O loads during compactions and repairs (which is apparently already quite high for spinning disks).
Since we don't want to build an entire datacenter with hundreds of nodes for this use case, I am investigating if this would be workable to have high density servers on spinning disks (e.g. at least 10TB or 20TB per node using spinning disks in RAID10 or JBOD, servers would have good CPU and RAM so the system will be I/O bound).
The amount of reads/writes in Cassandra per second will be manageable by a small cluster without any stress. I can also mention that this is not a high-performance transactional system but a datastore for storage, retrieval and some analysis, and data will be almost immutable - so even if a compaction or a repair/reconstruction takes several days on several servers at the same time, it's probably not going to be an issue at all.
I am wondering if some people have an experience feedback for high server density using spinning disks and what configuration you are using (Cassandra version, data size per node, disk size per node, disk config: JBOD/RAID, type of hardware).
Thanks in advance for your feedback.
Best regards.
The risk of super dense nodes isn't necessarily maxing IO during repair and compaction - it's the inability to reliably resolve a total node failure. In your reply to Jim Meyer, you note that RAID5 is discouraged because the probability of failure during rebuild is too high - that same potential failure is the primary argument against super dense nodes.
In the days pre-vnodes, if you had a 20T node that died, and you had to restore it, you'd have to stream 20T from the neighboring (2-4) nodes, which would max out all of those nodes, increase their likelihood of failure, and it would take (hours/days) to restore the down node. In that time, you're running with reduced redundancy, which is a likely risk if you value your data.
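To get a feel for the "hours/days" involved, here is a minimal estimate of restore time. The aggregate streaming rate is a pure assumption for illustration; measure your own cluster before relying on any number like this.

```python
def restore_hours(data_tb, aggregate_stream_mb_per_sec):
    """Hours to stream `data_tb` terabytes at a sustained aggregate rate.

    Uses decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes) and assumes
    the stream never fails or slows down, which is optimistic.
    """
    seconds = (data_tb * 1e12) / (aggregate_stream_mb_per_sec * 1e6)
    return seconds / 3600


# 20 TB node, hypothetical 200 MB/s sustained from all neighbors combined.
hours = restore_hours(20, 200)
print(round(hours, 1))
```

A 4 TB node at the same assumed rate takes a fifth of the time, which is the core of the argument for smaller nodes.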
One of the reasons vnodes were appreciated by many people is that it distributes load across more neighbors - now, streaming operations to bootstrap your replacement node come from dozens of machines, spreading the load. However, you still have the fundamental problem: you have to get 20T of data onto the node without bootstrap failing. Streaming has long been more fragile than desired, and the odds of streaming 20T without failure on cloud networks are not fantastic (though again, it's getting better and better).
Can you run 20T nodes? Sure. But what's the point? Why not run 5 4T nodes - you get more redundancy, you can scale down the CPU/memory accordingly, and you don't have to worry about re-bootstrapping 20T all at once.
Our "dense" nodes are 4T GP2 EBS volumes with Cassandra 2.1.x (x >= 7 to avoid the OOMs in 2.1.5/6). We use a single volume, because while you suggest "cassandra now supports JBOD quite well", our experience is that relying on Cassandra's balancing algorithms is unlikely to give you quite what you think it will - IO will thundering herd between devices (overwhelm one, then overwhelm the next, and so on), they'll fill asymmetrically. That, to me, is a great argument against lots of small volumes - I'd rather just see consistent usage on a single volume.
I haven't used KairosDB, but if it gives you some control over how Cassandra is used, you could look into a few things:
See if you can use incremental repairs instead of full repairs. Since your data is an immutable time series, you won't often need to repair old SSTables, so incremental repairs would just repair recent data.
Archive old data in a different keyspace, and only repair that keyspace infrequently such as when there is a topology change. For routine repairs, only repair the "hot" keyspace you use for recent data.
Experiment with using a different compaction strategy, perhaps DateTiered. This might reduce the amount of time spent on compaction since it would spend less time compacting old data.
There are other repair options that might help; for example, I've found that the -local option speeds up repairs significantly if you are running multiple data centers. Or perhaps you could run limited repairs more frequently rather than performance-killing full repairs on everything.
I have some Cassandra clusters that use RAID5. This has worked fine so far, but if two disks in the array fail then the node becomes unusable since writes to the array are disabled. Then someone must manually intervene to fix the failed disks or remove the node from the cluster. If you have a lot of nodes, then disk failures will be a fairly common occurrence.
If no one gives you an answer about running 20 TB nodes, I'd suggest running some experiments on your own dataset. Set up a single 20 TB node and fill it with your data. As you fill it, monitor the write throughput and see if there are intolerable drops in throughput when compactions happen, and at how many TB it becomes intolerable. Then have an empty 20 TB node join the cluster and run a full repair on the new node and see how long it takes to migrate its half of the dataset to it. This would give you an idea of how long it would take to replace a failed node in your cluster.
Hope that helps.
I would recommend to think about the data model of your application and how to partition your data. For time series data it would probably make sense to use a composite key [1] which consists of a partition key + one or more columns. Partitions are distributed across multiple servers according to the hash of the partition key (depending on the Cassandra Partitioner that you use, see cassandra.yaml).
For example, you could partition your server by device that generates the data (Pattern 1 in [2]) or by a period of time (e.g., per day) as shown in Pattern 2 in [2].
You should also be aware that the max number of values per partition is limited to 2 billion [3]. So, partitioning is highly recommended. Don't store your entire time series on a single Cassandra node in a single partition.
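As a rough sketch of combining both patterns, the partition key can be built from the device and the day, so that no single partition grows without bound. The names `partition_key` and `device_id` are hypothetical, and the one-day bucket is an assumption you would tune to your sample rate.

```python
from datetime import datetime, timezone
from typing import Tuple


def partition_key(device_id: str, ts: datetime) -> Tuple[str, str]:
    """Bucket a sample by device and UTC day.

    `ts` should be timezone-aware; a naive datetime would be interpreted
    as local time by astimezone().
    """
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (device_id, day)


sample_time = datetime(2015, 3, 1, 23, 59, tzinfo=timezone.utc)
key = partition_key("sensor-1", sample_time)
print(key)
```

The timestamp itself would then serve as the clustering column inside each (device, day) partition.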
[1] http://www.planetcassandra.org/blog/composite-keys-in-apache-cassandra/
[2] https://academy.datastax.com/demos/getting-started-time-series-data-modeling
[3] http://wiki.apache.org/cassandra/CassandraLimitations

How to exhaust a machine's resources with RethinkDB?

I'm asking this question because I would like to understand how I can run RethinkDB better, which means what kind of hardware it should be running on, what kind of filesystem it should be running on, and other system configurations to maximize its throughput.
I'm trying to fill a table as fast as I can with documents that are {"n": <counter>, "rand": <Math.random()>}. I read somewhere that this is faster with batches of 200 documents, so that's what I'm inserting. I am also using soft durability. I started one nodejs process of this and I can insert on average 10k documents per second, pretty good.
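For reference, the batching described here can be sketched as below. The sketch is in Python rather than nodejs and deliberately leaves out the actual database call; it only shows how documents of this shape are grouped into batches of 200.

```python
import random
from itertools import islice


def documents():
    """Endless stream of documents shaped like {"n": <counter>, "rand": <float>}."""
    n = 0
    while True:
        yield {"n": n, "rand": random.random()}
        n += 1


def batches(docs, size=200):
    """Group any iterable into lists of at most `size` items."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


first = next(batches(documents()))
print(len(first), first[0]["n"], first[-1]["n"])
# Each batch would then be handed to the driver's insert call (omitted here).
```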
But while this is happening, rethinkdb is using about 70% of one core (I have 8 virtual cores, it's an i7-4770) and the nodejs process is using 5%. So it seems that CPU is not the bottleneck.
As soon as I start another nodejs process doing the same thing, the inserts per second on both processes drop to about 4k-5k. Again, the CPU load stays the same.
I fired up iotop and I do see a lot of action there, but not what I expected. I configured two SSDs in a RAID0, and a quick dd test says I can write and read at about 800MBps. That's far above the actual read and write speeds iotop reports (average read ~14MBps, average write ~50MBps).
So how can I exhaust my machine's resources? What does rethinkdb need to run faster? Why doesn't it use more resources and reach a higher throughput?
More information on what it's running on: It's an EX40SSD from Hetzner, two SSDs in a software RAID0, ext4 filesystem (tomorrow I'll try to mount it with noatime to see if it's better). The rethinkdb configuration is everything by default, the inserts are done to a table that has only one shard and one replica. Please feel free to ask anything else relevant I might have forgotten to mention.
Thanks in advance.
What I suspect is going on here is lock contention on the actual btrees. When you're inserting a large batch of documents, the system grabs various parts of the btree in parallel to update it with the new documents. This is a set of read-write locks -- other parts of the system can still read, but if you insert another large batch in parallel, there is a high probability that it will touch similar parts of the btree, and it therefore has to wait for the system to start unlocking as it inserts parts of the first batch. (This isn't specific to RethinkDB; it is a problem in databases in general.) This is likely why you're not hitting 100% CPU/disk throughput.
There are a few things you can try, but note that there are subtleties to various approaches. Benchmarking in general is hard.
You can try to shard the table into 32 shards and retry your benchmark. You don't actually have to create a cluster, you can shard into 32 shards on a single machine. This will result in multiple btrees, so you'll minimize contention and will be able to use more system resources. Note that while this will likely increase the throughput, increasing the number of shards also slightly increases the latency, so you might need to significantly increase parallelism before you start seeing throughput increases.
You can try not to batch writes and instead write one document at a time (which generally approximates real-world use cases a little better). Then, start hundreds of parallel clients instead of just one or two, and have all of them write one document at a time in parallel. Note that you need to make sure the clients themselves aren't a bottleneck in this situation.
You can try to rerun your benchmark and also spin up clients that read from the database in parallel with the writes. In RethinkDB reads can usually go through even when you're writing to a particular document, so this will give you the opportunity to up CPU usage and bypass contention.
Pay attention to the IDs of the documents. If the database is large enough (say, millions of documents), and the IDs you're inserting are random, you're much less likely to touch the same parts of the btree so contention becomes less of an issue.
You can combine various approaches (sharding, reading + writing, various numbers of concurrent clients) to start getting a sense for how the database behaves in various scenarios.
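To get an intuition for the point about IDs, here is a toy model (not RethinkDB's actual btree layout): the key space is cut into fixed-size ranges standing in for btree leaves, and we count how many distinct ranges a batch of 200 inserts touches. The key space size and leaf span are arbitrary assumptions.

```python
import random


def leaves_touched(ids, leaf_span):
    """Count the distinct hypothetical leaf ranges a batch of integer ids falls into."""
    return len({i // leaf_span for i in ids})


KEY_SPACE = 10_000_000  # assumed number of possible ids
LEAF_SPAN = 1_000       # assumed number of ids covered by one leaf
BATCH = 200

rng = random.Random(42)  # fixed seed so the run is repeatable
sequential_batch = range(5_000, 5_000 + BATCH)
random_batch = [rng.randrange(KEY_SPACE) for _ in range(BATCH)]

sequential_leaves = leaves_touched(sequential_batch, LEAF_SPAN)
random_leaves = leaves_touched(random_batch, LEAF_SPAN)
print(sequential_leaves, random_leaves)
```

In this model a sequential batch lands in a single range, so two clients inserting neighbouring counters would queue on the same locks, while random ids spread across many ranges.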
Note that there might be things going on that you wouldn't normally be aware of. For example, RethinkDB has a log-structured storage engine that does live compaction on disk, and this might use up some IO (and CPU) cycles that you'd be surprised by if you didn't know about live compaction. There are dozens of other components like this that might compound to surprising behavior, as these systems are typically very complex under the hood.
Hope this helps -- would love to hear about your progress on the benchmarks. We do a lot of them internally, and it's an art and a science to discover the boundaries of the system's performance on different use cases.
My guess is that the bottleneck here is the disk system, but not its throughput. What's more likely is that writes are happening in chunks that are too small to be efficient, or that there are delays due to latency between individual writes.
It's also possible that the latency between individual write queries coming from the client and their processing on the server slows the system down.
Here are a few things I recommend trying:
Increase the batch size further. Your documents are very small. Therefore I think you might get a significantly higher throughput with batches of 1,000-10,000 documents. This might work especially well in combination with the next point.
Run multiple concurrent clients. You mentioned that you have run 2 clients concurrently, but this might not be enough. I recommend running 16-32 if possible.
Check the cache size RethinkDB is using. By default, RethinkDB picks the cache size as a fraction of the available memory, but that is not always reliable. I recommend passing a --cache-size <MB> parameter to RethinkDB (or adding the cache-size=<MB> parameter to the configuration file, if you're using one). I can see that your server has 32 GB of RAM. I recommend a cache size in the range of 20000 MB (or even more). A larger cache reduces the number of reads and, up to a certain limit, also increases the amount of unsaved data that RethinkDB can accumulate in RAM to make disk writes more efficient.
Experiment with the --io-threads <THREADS> parameter. The default is 64, but you can try increasing it to e.g. 128 and see if it has any effect.
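For the batch size experiment, a small helper along these lines makes it easy to vary the batch size between runs. This is only a sketch in Python rather than the nodejs client from the question; the helper names are invented, and the actual insert call to the database driver is deliberately left out.

```python
import random
from itertools import islice


def make_documents(count):
    """Yield documents shaped like the ones in the question."""
    for n in range(count):
        yield {"n": n, "rand": random.random()}


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Each batch would be handed to one insert call of the driver.
batch_sizes = [len(b) for b in batched(make_documents(2500), 1000)]
print(batch_sizes)
```

Running the same workload with sizes such as 200, 1,000 and 10,000, and with different numbers of concurrent clients, should show where throughput stops improving.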

Distributed database, many lightly loaded nodes

I'm working on a hobby project involving a rather CPU-intensive calculation. The problem is embarrassingly parallel. This calculation will need to happen on a large number of nodes (say 1000-10000). Each node can do its work almost completely independently of the others. However, the entire system will need to answer queries from outside the system. Approximately 100000 such queries per second will have to be answered. To answer the queries, the system needs some state that is sometimes shared between two nodes. The nodes need at most 128MB RAM for their calculations.
Obviously, I'm probably not going to afford to actually build this system in the scale described above, but I'm still interested in the engineering challenge of it, and thought I'd set up a small number of nodes as proof-of-concept.
I was thinking about using something like Cassandra or CouchDB to have scalable persistent state across all nodes. If I run a distributed database server on each node, it would be very lightly loaded, but it would be very nice from an ops perspective to have all nodes be identical.
Now to my question:
Can anyone suggest a distributed database implementation that would be a good fit for a cluster of a large number of nodes, each with very little RAM?
Cassandra seems to do what I want, but http://wiki.apache.org/cassandra/CassandraHardware talks about recommending at least 4G RAM for each node.
I haven't found a figure for the memory requirements of CouchDB, but given that it is implemented in Erlang, I figure maybe it isn't so bad?
Anyway, recommendation, hints, suggestions, opinions are welcome!
You should be able to do this with cassandra, though depending on your reliability requirements, an in memory database like redis might be more appropriate.
Since the data set is so small (100 MBs of data), you should be able to run with less than 4GB of ram per node. Adding in cassandra overhead you probably need 200MB of ram for the memtable, and another 200MB of ram for the row cache (to cache the entire data set, turn off the key cache), plus another 500MB of ram for java in general, which means you could get away with 2 gigs of ram per machine.
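Spelled out as arithmetic, using the rough figures above plus the 128MB the question reserves for the calculation itself (the variable names are just for illustration):

```python
# Rough per-node memory budget in MB, taken from the estimates in this answer.
budget_mb = {
    "memtable": 200,
    "row_cache": 200,
    "java_overhead": 500,
}

cassandra_mb = sum(budget_mb.values())
calculation_mb = 128  # per-node calculation memory stated in the question
total_mb = cassandra_mb + calculation_mb
headroom_mb = 2048 - total_mb  # what is left on a 2 GB machine

print(cassandra_mb, total_mb, headroom_mb)
```

The remaining headroom is what the OS and any unplanned growth would have to fit into, which is why 2 GB is a more comfortable target than 1 GB.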
Using a replication factor of three, you probably only need a cluster on the order of tens of nodes to serve the number of reads/writes you require (especially since your data set is so small and all reads can be served from the row cache). If you need the computing power of thousands of nodes, have them talk to the tens of Cassandra nodes storing your data rather than trying to spread Cassandra across thousands of nodes.
I've not used CouchDB myself, but I am told that Couch will run in as little as 256M with around 500K records. At a guess that would mean that each of your nodes might need ~512M, taking into account the extra 128M they need for their calculations. Ultimately you should download and give each a test inside a VPS, but it does sound like Couch will run in less memory than Cassandra.
Okay, after doing some more reading after posting the question, and trying some things out, I decided to go with MongoDB.
So far I'm happy. I have very little load, and MongoDB is using very little system resources (~200MB at most). However, my dataset isn't nearly as large as described in the question, and I am only running 1 node, so this doesn't mean anything.
CouchDB doesn't seem to support sharding out-of-the-box, so is not (it turns out) a good fit for the problem described in the question (I know there are addons for sharding).

Resources