Related
What is the difference between Scylla read path and Cassandra read path? When I stress Cassandra and Scylla then Scylla read performance poor by 5 times than Cassandra using 16 core and normal HDD.
I expect better read performance on Scylla compared to Cassandra using normal HDD, because my company doesn't provide SSD's.
Can someone please confirm, is it possible to achieve better read performance using normal HDD or not?
If yes, what changes required scylla config?. Please guide me!
Some other responses focused on write performance, but this isn't what you asked about - you asked about reads.
Uncached read performance on HDDs is bound to be poor in both Cassandra and Scylla, because reads from disk each requires several seeks on the HDD, and even the best HDD cannot do more than, say, 200 of those seeks per second. Even with a RAID of several of these disks, you will rarely be able to do more than, say, 1000 requests per second. Since a modern multi-core can do orders of magnitude more CPU work than 1000 requests per second, in both Scylla and Cassandra cases, you'll likely see free CPU. So Scylla's main benefit, of using much less CPU per request, will not even matter when the disk is the performance bottleneck. In such cases I would expect Scylla's and Cassandra's performance (I am assuming that you're measuring throughput when you talk about performance?) should be roughly the same.
If, still, you're seeing better throughput from Cassandra than Scylla, there are several details that may explain why, beyond the general client mis-configuration issues raised in other responses:
If you have low amounts of data, that can fit in memory, Cassandra's caching policy is better for your workload. Cassandra uses the OS's page cache, which reads whole disk pages and may cache multiple items in one read, as well as multiple index entries. While Scylla works differently, and has a row cache - only caching the specific data read. Scylla's caching is better for large volumes of data that do not fit in memory, but much worse when the data can fit in memory, until the entire data set has been cached (after everything is cached, it becomes very efficient again).
On HDDs, the details of compaction are very important for read performance - if in one setup you have more sstables to read, it can increase the number of reads and lower the performance. This can change depending on your compaction configuration, or even randomly (depending on when compaction was run last). You can check if this explains your performance issues by doing a major compaction ("nodetool compact") on both systems and checking the read performance afterwards. You can switch the compaction strategy to LCS to ensure that random-access read performance is better, at the cost of more write work (on HDDs, this can be a worthwhile compromise).
If you are measuring scan performance (reading an entire table) instead of reading individual rows, other issues become relevant: As you may have heard, Scylla subdivides each nodes into shards (each shard is a single CPU). This is fantastic for CPU-bounded work, but could be worse for scanning tables which aren't huge, because each sstable is now smaller and the amount of contiguous data you can read before needing to seek again is lower.
I don't know which of these differences - or something else - is causing performance of your use-case to be lower in Scylla, but I please keep in mind that whatever you fix, your performance is always going to be bad with HDDs. With SDDs, we've measured in the past more than a million random-access read requests per second on a single node. HDDs cannot come anything close. If you really need optimum performance or performance per dollar, SDDs are really the way to go.
There can be various reasons why you are not getting the most out of your Scylla Cluster.
Number of concurrent connections from your clients/loaders is not high enough, or you're not using sufficient amount of loaders. In such case, some shards will be doing all the work, while others will be mostly idle. You want to keep your parallelism high.
Scylla likes have a minimum of 2 connections per shard (you can see the number of shards in /etc/scylla.d/cpuset.conf)
What's the size of your dataset? Are you reading a large amount of partitions or just a few? You might be hitting a hot partition situation
I strongly recommend reading the following docs that will provide you more insights:
https://www.scylladb.com/2019/03/27/best-practices-for-scylla-applications/
https://docs.scylladb.com/operating-scylla/benchmarking-scylla/
#Sateesh, I want to add to the answer by #TomerSan that both Cassandra and ScyllaDB utilize the same disk storage architecture (LSM). That means that they have relatively the same disk access patterns because the algorithms are largely the same. The LSM trees were built with the idea in mind that it is not necessary to do instant in-place updates. It consists of immutable data buckets that are large continuous pieces of data on disk. That means less random IO, more sequential IO for which the HDD works great (not counting utilized parallelism by modern database implementations).
All the above means that the difference that you see, is not induced by the difference in how those databases use a disk. It must be related to the configuration differences and what happens underneath. Maybe ScyllaDB tries to utilize more parallelism or more aggressively do compaction. It depends.
In order to be able to say anything specific, please share your tests, envs, and configurations.
Both databases use LSM tree but Scylla has thread-per-core architecture on top plus we use O_Direct while C* uses the page cache. Scylla also has a sophisticated IO scheduler that makes sure not to overload the disk and thus scylla_setup runs a benchmark automatically to tune. Check your output of it in io.conf.
There are far more things to review, better to send your data to the mailing list. In general, Scylla should perform better in this case as well but your disk is likely to be the bottleneck in both cases.
As a summary I would say Scylladb and cassandra have the same read / write path
memtable, commitlog, sstable.
However implementation is very different:
- cassandra rely on OS for low level IO and network (most DBMS does)
- scylladb rely on its own lib (seastar) to handle IO and network at a low level independently from OS page cache etc. This is why they can provide feature such as workload scheduling within the same cluster that would be very hard to implement in cassandra.
I am currently working on a framework for analysis application of an large scale experiment. The experiment contains about 40 instruments each generating about a GB/s with ns timestamps. The data is intended to be analysed in time chunks.
For the implemetation I would like to know how big such a "chunk" aka batch can get before Flink or Spark stop processing the data. I think it goes with out saying that I intend to recollect the processed data.
For live data analysis
In general, there is no hard limit on how much data you can process with the systems. It all depends on how many nodes you have and what kind of a query you have.
As it sounds as you would mainly want to aggregate per instrument on a given time window, your maximum scale-out is limited to 40. That's the maximum number of machines that you could throw at your problem. Then, the question arises on how big your time chunks are/how complex the aggregations become. Assuming that your aggregation requires all data of a window to be present, then the system needs to hold 1 GB per second. So if you window is one hour, the system needs to hold at least 3.6 TB of data.
If the main memory of the machines is not sufficient, data needs to be spilled to disk, which slows down processing significantly. Spark really likes to keep all data in memory, so that would be the practical limit. Flink can spill almost all data to disk, but then disk I/O becomes a bottleneck.
If you rather need to calculate small values (like sums, averages), main memory shouldn't become an issue.
For old data analysis
When analysis old data, the system can do batch processing and have much more options to handle the volume including spilling to local disk. Spark usually shines if you can keep all data of one window in main memory. If you are not certain about that or you know it will not fit into main memory, Flink is the more scalable solution. Nevertheless, I'd expect both frameworks to work well for your use case.
I'd rather look at the ecosystem and the suit for you. Which languages do you want to use? It feels like using Jupyter notebooks or Zeppelin would work best for your rather ad-hoc analysis and data exploration. Especially if you want to use Python, I'd probably give Spark a try first.
As we are using SSD disks to provide storage for our cluster on servers with 30 GB of memory.
There is an argument about the commitlog directory, whether to dedicate an individual disk or having it on the same data disk.
As we already using SSD disks, performance should be fine having both commitlogs and data on the same disk, as there is no mechanical moving head for writing.
However, there is another factor, that is the read/write ratio. How would such a ratio affect the performance of writing or reading when we have both commitlogs and data on the same disk?
Using SSD, when would it become important to dedicate a high performance disk for the commitlog directory?
A dedicated commitlog device usually makes a lot of sense when you have HDDs, but is less obvious if you're using SSDs.
Even if you asked only if it makes sense with SSDs setups, I will try to give some general hints about the subject, primarily based on my understandings and my own experience. I admit the focus is probably too much on HDDs, but HDDs allow a deep insight on how Cassandra stuff works and why backing a commitlog/data directory with an SSD can be a life saver.
Background: IOPS and OPS are not the same thing.
I will start from a (very) far point: Device Performance. Here's a start-point lecture about storage device performances in general. Even if the article's neutrality is under discussion, it can provide some insights about the general metrics and performance you can expect from some systems. Of course, your mileage may vary, depending on what device (type/brand/model etc...) and how much stress (intended as type of workload) you put on the device, but I think it is a good starting point for our discussion here.
The reason I prefer to start from IOPS is because it is the very starting point for understanding storage performance. The C* literature speaks about OPS, Operations Per Second, because people usually don't think in terms of IOPS, especially when looking at stats. This really hides a lot of details, the operation size for starters.
A Cassandra Operation usually consists of multiple IOPS. Cassandra documentation usually refers to spinning disks (even if SSDs are referenced too), clearly states what happens when performing reads/writes, and people tend to ignore the fact that when their software stack (that spans from up the application down to Cassandra and its data files on the storage) hit the disks the performance decreases by a huge amount just because they have failed to recognize a random workload, and even if "Cassandra is an high-performance etc... etc.. etc...".
As an example, looking at the picture in the read path documentation, you can clearly see what data structures are in memory/on disk, and how the SSTable data is accessed. Further, the row cache paragraph says:
... If row cache is enabled, desired partition data is read from the row cache, potentially saving two seeks to disk for the data...
And here's where the catch starts: these two seeks are potentially saved from Cassandra's point of view. This simply means that Cassandra won't make two requests to the storage system: it will avoid to request the partition index, and the data because everything is already in RAM, but it doesn't really translates to "the storage system will save two IO operations". Indeed, how (generic) data is retrieved from the storage device is a very different thing, and of course depends on how the files are layed-out on the disk itself: are you using EXT4, XFS, or what? Assuming no cache is available (eg for very big data set sizes you can't really cache everything...), looking for a file is IOPS consuming, and this tends to amplify the potentially saved seeks when you have data in RAM, and tends to amplify the penalty you perceive when your data is not.
You can't escape physics: HDDs pay some taxes, SSDs no.
As you already know, the main "problem" (performance-wise) of HDDs is the average seek time, that is the time the HDD needs to wait on average in order to have a target sector under the heads. Once the sector is under the heads, if the system have to read a bunch of sequential bits everything is smooth and the throughput is proportional to the rotational speed of the HDD (to be precise to the tangential speed of the platters under the head, which depends also on the track etc...).
In other terms, HDDs have an average fixed performance tax (the average seek time), and everything after is almost "free". If an application requests a bunch of sectors that are not "contiguous" (from the disk point of view, eg a fragmented file is splitted across multiple sectors, but an application can't really know this), the disk will have to wait the average seek time on average multiple times, and this fixed tax influences its maximum throughput.
The strongest argument about storage is: every device have its own maximum magic average IOPS number. This number express the number of random IOPS the device can perform. You can't force an HDD to have more IOPS on average, it's a physical problem. The OS is usually smart enough to "enqueue" sector requests in the attempt to reduce the seek times, eg ordering by ascending requested sector number (trying to exploit some sequential operations), but nothing will save the performances from a random IO workload. You have X allotted available IOPS and must face your problems with that. No matter what.
You need to take advantage of the allotted IOPS of your device, and you must be wise on how you use them.
Suppose you have an HDD that maxes out at 100 IOPS on average. If your application performs a bunch of small (say 4KB) file reads, you have an application that performs 100 * 4KB reads every second: the throughput will be around 400KB/s (unless some caching is involved, and in that case the cache saved you precious IOPS). Astonishing. This is simply because you keep paying the seek time multiple times. If you change your access pattern to something that reads 16MB (contiguous) files, you get an higher throughput because you won't pay the seek time so much, you are exploiting a sequential pattern. What changes under the hood is the Request Size of each operation.
Now an interesting question is: how are "IOPS" and "Request Size" related to? Does one request size of 16MB can be considered one IOPS? And what about a 128MB request size? This is indeed a good question. At lower level, the Request Size spans from 512 bytes (the minimum sector size) to 128KB (32*4K sectors in one request). If the operation has small size, its transfer time, the time the disk needs to fetch the data, is also small. Higher request sizes have higher transfer times obviously. However, if you are able to perform 100 4KB IOPS, you will probably be able to perform around 80 IOPS #8KB. The relation can't be linear, because the transfer time depends on the rotational speed of the disks only (the transfer time is negligible compared to the seek time), and since you are actually reading from two adjacent sectors, you'll hit the seek time penalty once per request. This translates to a throughput of around 400KB/s for 4K requests and 1.6MB/s for 8K requests. And so on.... The larger the request size, the longer it takes to transfer data, the lesser IOPS you have, the higher throughput you have. (These are random numbers, pun intended, no measurements done! Just to let you understand. I however think they are in the ballpark).
SSDs don't suffer mechanical penalties and that's why they are capable of performing much better than HDDs. They have much more IOPS, and their limits come from the onboard electronics, bus connection etc.... Having an higher IOPS device is a big plus, these can be consumed by applications that are not IOPS friendly, and the user won't notice that the applications suck. However, with SSDs, the Request Size linearly influences the number of IOPS you can perform. When you look at some device that have 100k IOPS, these are usually referred at 4K. You'll be able to perform only 6.2k requests if you perform 64K requests.
Why Cassandra has a such good read performances even with HDDs then?
Speaking from a single node point of view (because given the performance of a cluster Cassandra scales linearly with the number of nodes in the cluster), the problem lies in the question itself. This is only true if you model your data in this particular way:
You must fetch all your data with one query only.
Your data must be ordered.
If you can't fetch your data with one data, denormalize in order to retrieve it with one query only.
You fetch a relative good amount of data on every read
These are well-known Cassandra modeling rules, but the key point is that these rules do really have a reason to be applied IOPS-wise. Indeed, these rules allow Cassandra to:
Be a super fast database because it will just require the partition index and the SSTable offset index of the data: two IOPS in the best case, much more IOPS in the worst case.
Be a super fast database because it will exploit the sequential capabilities of the HDDs and will not stress the IO subsystem by issuing other IO (random) seeks.
Be a super fast database because it will just fetch more data like the point number 1.
Be a super fast database because it will exploit longer the sequential capabilities of the HDDs.
In other terms, following these basic data modeling rules allows Cassandra to be IOPS friendly when reading data back.
What happens if you screw-up your data model? Cassandra won't be IOPS friendly, and as a consequence the performances will be predictably horrible. Unless you use an SSD, which has greater IOPS and then you won't notice slowness too much.
What happens if you read/write a small amount of data (eg due to misconfigured flush sizes, small commit log etc...)? Cassandra won't be IOPS friendly, and as a consequence the performances will be predictably horrible. Unless you use an SSD, which has greater IOPS and then you won't notice slowness too much.
How a read/write ratio pattern can influence performance in a Cassandra node?
Cassandra is a complex system, with different components that interact each other. I will try to explain from my point of view what are the main points when you put everything on one device only.
Writes/Deletes/Updates in Cassandra are fast because they are simply append-only writes to the CommitLog device. Reads, on the contrary, can be very IOPS consuming. When both CommitLog and Data are on the same physical disk (either HDD or SSD), the read/write paths interact, and they both consume IOPS.
Two important questions are:
How many IOPS a read (using the read path) consumes?
How many IOPS a write consumes?
These are important question because you have to remember that your device can perform at most X IOPS, and your system will have to split these X IOPS among these operations.
It is quite difficult to answer to the "read" question because, when you request some data, Cassandra needs to locate all the SSTables needed to satisfy the request. Assuming a very big dataset size, where caching is not effective, this imply that the Cassandra read path can be very IOPS hungry. Indeed, if your data is spread into 3 different SSTables, Cassandra will have to locate all of them, and for each SSTable will follow the read path: will read the partition index, and then will read the data in the SSTable. These are at least two IOPS, because if your filesystem is not "collaborative" enough, locating a file and/or pointing at a file offset could require some more IOPS. In the end, in this example Cassandra is consuming at least six IOPS per read.
Answering the "write" question is also tricky, because compactions and flushes can be triggered. They will consume a lot of IOPS. Flushes are easy to understand: they write data from memtables to disk with a sequential pattern. Instead, compactions read data back from different SSTables on disk, and while reading the tables they flush the result out to a new disk file. This is a mixed read/write pattern, and on HDDs this is very disruptive, because will force the disk to perform multiple seeks.
Mixing percentages: TL;DR
If you have a R/W ratio of 95% reads and 5% writes, having a separate CommitLog device can be a waste of resources, because writes will hardly impact your read performances, and you write so rarely that write performance may be considered not critical.
If you have a R/W ratio of 5% reads and 95% writes, having a separate CommitLog device can be again a waste of resources, because reads will hardly impact your write performances, and your read performances will hardly suffer from a bunch of sequential appends on the commitlog.
And finally, if you have a R/W ratio of 50% reads and 50% writes, having a separate CommitLog device is NOT a waste of resources, because every write performed on the CommitLog device won't produce at least two IOPS on the data drive (one for writing, and one for going back to read).
Please note that I didn't mention compactions, because independently on your workload, when compaction triggers in, your workload will be disrupted by mixed read/write background operations on different files (consuming disk IOPS all the way), and you will suffer both on reads and writes.
All this should be clear enough for HDDs because you run out of IOPS very fast, and when you do you notice it immediately. On SSDs, however, you don't run out of IOPS that fast, but you could do if your data consists of a lot small data rows.
The reality is that getting out of IOPS on an SSD is very hard because you'll get out (by a far amount) of CPU resources, But once you do you will see your performance slowly decrease. The effect however won't be such dramatic as in the cases of HDDs. As an example, if you have a 100 IOPS HDD and you run-out of IOPS by trying to issue 500 random IO stuff, you cleary get a penalty. By calling this penalty P, if you have an SSD with 100k IOPS, to get the same penalty P you should issue 500k IOPS, which can be very difficult to do without exhausting CPU or RAM.
In general, when you run out of some type of resource in your system, you need to increase its quantity. The most important thing (to me) is not to run out of IOPS in the "Data" part of your Cassandra cluster. In the case of SSDs IOPS, it's rare enough that you'll get the limit. You'll burn your CPU well before I think. But you will if you don't tune your system, or if your workload put too much stress on the disk subsystem (eg Leveled Compaction). I'd suggest to put an ordinary HDD instead of an high performance SSD for the commitlog, saving money. But if you have a lot of very small commitlog flushes an SSD is a completely life saver, because your writers won't suffer the latency of HDDs.
Finally, in my opition, you should go in pre-production with some sort of real data, and check your IOPS requirements. If you have enough room to put the SSD there don't worry. Go and save money. If your system gets too much pressure due to compaction then having a separate device is suggested. Analyze your commitlog pattern, and if its not IOPS demanding put it on a separate disk. Moreover, if you have a virtual environment you can provision a relatively small commitlog device regardless of other factors. It won't rise the cost of your solution too much.
The actual numbers will depend highly on the type of workload you have the configuration you have etc. You can have a look at Netflix tech blog posts for ballpark numbers, e.g. #1, #2.
Dedicating a disk for commitlog directory is a sort of scale up strategy. Cassandra works well with scale out approach. You just add more nodes into the cluster to spread the load - 2nd from the linked articles has a nice graph showing near linear scalability.
My company and I have purchased about 80,000$ in hardware to accomplish a goal. We have about 22,000 writes/sec in multiple application database for our Cassandra Cluster. We built 2 x nodes of Dual 3.5Ghz Xeons, 128GB RAM, Areca 1883, all top of the line high throughput. We also have a SSD RAID 10 array for Commitlog/saved_caches so that is not delayed.
The issue we have is the amount of data. In about 4 days we collected 1.8TB of data. We have no intention of ever releasing data. We then got a JBOD enclosure and put 6TB Platter drives in, 10 each, 20 total for about 110TB of space. We run fine with single replication, the issue is when we run to double replication.
We would love to add more nodes, we know that is the correct way, but at 20,000$ a node its costly. My question is, is it true to say if our write speed is the issue, that adding 10 more drives in each machine should allow for double the write speeds?
Does anyone have some of similar things going on and have some tweaks they made to Cassandra.yaml?
We did run htop for a while when we were in double replication, and CPU did seem to get a bit intensive (Read 24% average but it looks pretty close to maxed). RAM is all being used, 128GBs.
ANY thoughts on the matter will be considered and investigated.
Thanks,
Ken
It is not generally true that you can increase write speed simply by increasing disks, unless you are sure that you are IO bound. Cassandra batches writes (mutations go to the commitlog first, then a table in RAM, then are batch written to sstables when that table reaches a certain threshold - linear writes, so it's generally fast, even on spinning disks). At some point, you will max out the commitlog drive, fill the memtable faster than you can flush, or simply get to the point where GC can't keep up.
There are fairly large users of Cassandra who run multiple Cassandra instances on a given server simply to get the benefits of additional nodes without "just" adding disk. By running two JVMs, you can mitigate the pause times of a single node, and still take advantage of your (oversized) hardware. This is easiest if you can assign multiple IPs to your individual servers, but running on different ports also works. This is fairly atypical, and you'll need to pay close attention to your configs to avoid stepping on each other, but it will work, and will make more efficient use of your hardware than simply running huge nodes.
If I'm reading that correctly, you only have 2 nodes total?
If you only have 2 nodes I doubt that disk bandwidth would be the problem. Cassandra is usually CPU limited more than anything else.
Writes generally go to memory, so the disk only comes into play when memtables are flushed to disk as SStables. Now the thing that will probably kill your performance is when those SStables need to be compacted. When compaction starts happening, guess what part of the system that will stress, yup, the CPU.
You will also have a problem running repairs with huge disks like that. Usually I find that sustained transaction throughput is limited by compactions and repairs more than raw write performance.
With two nodes and single replication, you'd be splitting the load between the two nodes, with half going to one and half going to the other. If you set the replication factor to two, now every write would be going to both nodes, which is like going back in time to having a single machine database.
So I think it was a bad call to buy a small number of high end machines. You would have had much better performance with more machines where each machine was less expensive. You need more machines to spread out the load and get more CPUs into the equation.
Also you mention a disk enclosure. I hope you are not trying to use network storage with Cassandra. It needs the disks to be local.
I'm asking this question because I would like to understand how I can run RethinkDB better, which means what kind of hardware should it be running on, what kind of filesystem it should be running on and other system configurations to maximize it's throughput.
I'm trying to fill a table as fast as I can with documents that are {"n": <counter>, "rand": <Math.random()>}. I read somewhere that this is faster with batches of 200 documents, so that's what I'm inserting. I am also using soft durability. I started one nodejs process of this and I can insert on average 10k documents per second, pretty good.
But while this is happening, rethinkdb is using about 70% of one core (I have 8 virtual cores, it's an i7-4770) and the nodejs process is using 5%. So it seems that CPU is not the bottleneck.
As soon as I start another nodejs process doing the same thing, the inserts per second on both processes drop to about 4k-5k. Again, the CPU load keeps the same.
I fired iotop and I do see a lot of action there, but not what I expected. I configured two SSDs in a RAID0, and a quick dd test says I can write and read at about 800MBps. That's far above what the actual read and actual write speed iotop reports (average read ~14MBps average write ~50MBps).
So how can I exaust my machine's resources ? What does rethinkdb need to run faster ? Why doesn't it spend more resources and have a higher throughput ?
More information on what it's running on: It's an EX40SSD from Hetzner, two SSDs in a software RAID0, ext4 filesystem (tomorrow I'll try to mount it with noatime to see if it's better). The rethinkdb configuration is everything by default, the inserts are done to a table that has only one shard and one replica. Please feel free to ask anything else relevant I might have forgotten to mention.
Thanks in advance.
What I suspect is going on here is lock contention on the actual btrees. When you're inserting a large batch of documents, the system grabs various parts of the btree in parallel to update it with the new documents. This is a set of read-write locks -- other parts of the system can still read, but if you insert another large batch in parallel, there is a high probability that it will touch similar parts of the btree, and therefore has to wait for the system to start unlocking as it inserts parts of the first batch. (This isn't specific to RethinkDB, but a problem in databases in general) This is likely why you're not hitting 100% CPU/disk throughput.
There are a few things you can try, but note that there are subtleties to various approaches. Benchmarking in general is hard.
You can try to shard the table into 32 shards and retry your benchmark. You don't actually have to create a cluster, you can shard into 32 shards on a single machine. This will result in multiple btrees, so you'll minimize contention and will be able to use more system resources. Note that while this will likely increase the throughput, increasing the number of shards also slightly increases the latency, so you might need to significantly increase parallelism before you start seeing throughput increases.
You can try not to batch writes and instead write one document at a time (which generally approximates real-world use cases a little better). Then, start hundreds of parallel clients instead of just one or two, and have all them write one document at a time in parallel. Note that you need to make sure the clients themselves aren't a bottleneck in this situation.
You can try to rerun your benchmark and also spin up clients that read from the database in parallel with the writes. In RethinkDB reads can usually go through even when you're writing to a particular document, so this will give you the opportunity to up CPU usage and bypass contention.
Pay attention to the IDs of the documents. If the database is large enough (say, millions of documents), and the IDs you're inserting are random, you're much less likely to touch the same parts of the btree so contention becomes less of an issue.
You can combine various approaches (sharding, reading + writing, various numbers of concurrent clients) to start getting a sense for how the database behaves in various scenarios.
Note that there might be things going on that you wouldn't normally be aware of. For example, RethinkDB has a log-structured storage engine that does live compaction on disk, and this might use up some IO (and CPU) cycles that you'd be surprised by if you didn't know about live compaction. There are dozens of other components like this that might compound to surprising behavior, as these systems are typically very complex under the hood.
Hope this helps -- would love to hear about your progress on the benchmarks. We do a lot of them internally, and it's an art and a science to discover the boundaries of the system's performance on different use cases.
My guess is that the bottleneck here is the disk system, but not its throughput. What's more likely is that writes are happening in chunks that are too small to be efficient, or that there are delays due to latency between individual writes.
It's also possible that the latency between individual write queries coming from the client and their processing on the server slows the system down.
Here are a few things I recommend trying:
Increase the batch size further. Your documents are very small. Therefore I think you might get a significantly higher throughput with batches of 1,000-10,000 documents. This might work especially well in combination with the next point.
Run multiple concurrent clients. You mentioned that you have run 2 clients concurrently, but this might not be enough. I recommend running 16-32 if possible.
Check the cache size RethinkDB is using. By default, RethinkDB picks the cache size as a fraction of the available memory, but that is not always reliable. I recommend passing a --cache-size <MB> parameter to RethinkDB (or adding the cache-size=<MB> parameter to the configuration file, if you're using one). I can see that your server has 32 GB of RAM. I recommend using in the range of 20000 MB (or even more) as the cache size. A larger cache reduces the number of reads, but up to a certain limit also increases the amount of unsaved data that RethinkDB can accumulate in RAM to make disk writes more efficient.
Experiment with the --io-threads <THREADS> parameter. The default is 64, but you can try increasing it to e.g. 128 and see if it has any effect.