Using cassandra-stress, I am generating a write workload of 50 million writes, followed by a mixed workload (1 write to 3 reads) for 20 minutes. I'm repeating this process a number of times.
After each iteration, the operation rate decreases and the total operation time increases, as shown in the image.
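For reference, a hedged sketch of the kind of cassandra-stress invocations described above (the node address, thread count, and exact options are assumptions):
# Sketch: a 50M-write load phase, then a 20-minute mixed phase at 1 write : 3 reads.
% cassandra-stress write n=50000000 -node 10.0.0.1 -rate threads=200
% cassandra-stress mixed ratio\(write=1,read=3\) duration=20m -node 10.0.0.1 -rate threads=200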
Decreased op/s usually indicates that your reads/writes take longer to execute. The root cause cannot be determined without carefully inspecting what your C* cluster is doing.
I can only be as general as the information I have, so here are two common causes:
You have wrong JVM settings, e.g. your cluster is performing poorly because C* starts experiencing long GC pauses. Check your GC log files and tune the JVM accordingly.
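For the GC case, Cassandra's GCInspector logs long pauses to the system log, so a quick first check looks something like this (the log path is an assumption; adjust for your install):
# Look for long GC pauses reported during the stress run.
% grep GCInspector /var/log/cassandra/system.log | tail -20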
You did nothing wrong. C* starts to perform more and more disk I/O for compactions, and the overall throughput decreases because you simply run out of IOPS. Use the nodetool compactionstats command to see whether compactions accumulate over time during your stress run (and only complete once the stress finishes), or complete while you're stressing the system (in which case you don't have any IOPS problem at all). Also check the concurrent_compactors setting in your YAML file.
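To check this while the stress is running, something like the following works (the 10-second interval is arbitrary):
# Watch pending compactions during the stress run; a steadily growing
# "pending tasks" count means compaction is falling behind.
% watch -n 10 nodetool compactionstats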
While running a full repair on a Cassandra cluster with 15 nodes, RF=3, and 3 racks (single datacenter) using the command ./nodetool repair -pr -full -seq, I can see multiple validation compactions running at the same time (>10). Is there any way to limit simultaneous validations in Cassandra 3.11.1, like we can limit normal compactions?
As the cluster size has increased, I limited repairs to run table by table and also used -pr and -seq to restrict load on the nodes. But now the load is very high due to concurrent validation compactions. I need a way to restrict concurrent validation compactions to reduce the load on the nodes during repairs. I'm also exploring Reaper to manage repairs, but I need some workaround for the load issues until I adopt it.
If you're seeing (validation) compactions becoming cumbersome, there are two settings that you should look at:
compaction_throughput_mb_per_sec
concurrent_compactors
compaction_throughput_mb_per_sec
This is the main tuneable setting for compaction. I mentioned this setting in a related answer here: Advise on stopping compaction to reduce slowness
I would recommend checking this setting, and then reducing it until contention is resolved. Or, you could try to set compaction throughput to 1 (the lowest setting) during the day. Then, raise it back up once business hours are over.
% bin/nodetool setcompactionthroughput 1
% bin/nodetool getcompactionthroughput
Current compaction throughput: 1 MB/s
But definitely check it first, just to see what you're running at; then consider halving that value and observing the effect.
concurrent_compactors
This defaults to the smaller of the number of disks and the number of cores, with a minimum of 2 and a maximum of 8. There is some solid advice out there around forcing this to a value of 1 if you're using spinning disks, and maybe setting it to 4 for SSDs. The default is usually fine, but if it's too high, compactions can overwhelm disk I/O.
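To see what a node is actually configured with, you can check cassandra.yaml directly (the path is an assumption; if nothing prints, the settings are commented out and the defaults apply):
# Show explicitly-set compaction settings in cassandra.yaml.
% grep -E '^(concurrent_compactors|compaction_throughput_mb_per_sec)' /etc/cassandra/cassandra.yaml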
tl;dr;
Focus on compaction throughput for now. My advice is to check it, lower it, observe it, and repeat until things improve.
I have a Cassandra 2.1 cluster using Leveled Compaction Strategy.
Based on my calculations, the cluster will run out of space before compaction kicks in automatically when it reaches the next level. For that reason, I have a cron job that runs "nodetool compact" every week to perform a full (major) compaction and remove tombstoned data points.
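A cron entry of that sort looks roughly like this (the keyspace/table names and the schedule are placeholders):
# Run a major compaction on keyspace "ks", table "tbl" every Sunday at 03:00.
0 3 * * 0 /usr/bin/nodetool compact ks tbl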
I noticed that full compaction consumes very little CPU/network resources. With a bigger data set, full compaction runs for days.
I have tried "setcompactionthroughput" with a higher number (128 MB/s instead of the default 32 MB/s), and even tried setting it to 0 (no limit), but the full compaction speed doesn't seem to change at all.
Is there anything I can tune to make it faster? Thanks in advance.
There are very few cases where you should run full compaction via nodetool compact - it causes what you're likely seeing now (a single huge data file, which never naturally compacts with other sstables, even/especially when other deletions have happened).
Recovering from the state you're in isn't trivial, but it is possible. If you have a lot of CPU/IO to spare, you can try switching from STCS to LCS: LeveledCompactionStrategy will naturally split up that huge file into thousands of tiny files, and will be much more aggressive about rewriting those files over time (so tombstones are compacted away much more regularly). This is very CPU- and IO-intensive, so don't do it if you're near the tipping point. Also, it will duplicate all data on disk for a short period, so you'll need to be under 50% disk utilization to do this.
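A minimal sketch of that switch, assuming a keyspace/table named ks.tbl (run via cqlsh; sstable_size_in_mb is shown at its common default):
# Change the table's compaction strategy to LCS; ks.tbl is a placeholder.
% cqlsh -e "ALTER TABLE ks.tbl WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};"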
If you're over 50% disk utilization, you've backed yourself into a corner, and you'll probably need to add more disk temporarily in order to recover.
The data in the commitlog is flushed to disk periodically, every 10 seconds by default (controlled by commitlog_sync_period_in_ms). If all replicas crash within those 10 seconds, will I lose all that data? Does it mean that, theoretically, a Cassandra cluster can lose data?
If a node crashed right before syncing the commit log to disk, then yes, you could lose up to ten seconds of data.
If you keep multiple replicas, by using a replication factor higher than 1 or multiple data centers, then much of the lost data would still be on other nodes, and would be restored to the crashed node when it was repaired.
Also, the commit log may be written in less than ten seconds if the write volume is high enough to hit size limits before the ten seconds elapse.
If you want more durability than this (at the cost of higher latency), then you can change the commitlog_sync setting from periodic to batch. In batch mode, Cassandra uses the commitlog_sync_batch_window_in_ms setting to control how often batches of writes are written to disk, and writes are not acked until they are on disk.
The ten second default for periodic mode is designed for spinning disks: they are slow enough that blocking acks on commit log writes would incur a noticeable performance hit. For this reason, if you use batch mode, a dedicated disk for the commit log is recommended, so that the write head doesn't need to seek and the added latency stays as low as possible.
If you are using SSDs, then you can use more aggressive timing since the latency is greatly reduced compared to a spinning disk.
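A hedged sketch of the corresponding cassandra.yaml edits (the file path is an assumption; back the file up first and restart the node afterwards):
# Switch commitlog syncing from periodic to batch for stronger durability.
% sed -i 's/^commitlog_sync: periodic$/commitlog_sync: batch/' /etc/cassandra/cassandra.yaml
% echo 'commitlog_sync_batch_window_in_ms: 2' >> /etc/cassandra/cassandra.yaml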
Cassandra's default configuration sets the commitlog_sync mode to periodic, causing the commit log to be synced every commitlog_sync_period_in_ms milliseconds, so you can potentially lose up to that much data if all replicas crash within that window of time.
I'm asking this question because I would like to understand how I can run RethinkDB better, meaning: what kind of hardware it should be running on, what kind of filesystem it should be running on, and which other system configurations would maximize its throughput.
I'm trying to fill a table as fast as I can with documents of the form {"n": <counter>, "rand": <Math.random()>}. I read somewhere that this is faster with batches of 200 documents, so that's what I'm inserting. I am also using soft durability. I started one nodejs process doing this, and I can insert on average 10k documents per second, which is pretty good.
But while this is happening, rethinkdb is using about 70% of one core (I have 8 virtual cores, it's an i7-4770) and the nodejs process is using 5%. So it seems that CPU is not the bottleneck.
As soon as I start another nodejs process doing the same thing, the inserts per second on both processes drop to about 4k-5k. Again, the CPU load stays the same.
I fired up iotop and I do see a lot of action there, but not what I expected. I configured two SSDs in a RAID0, and a quick dd test says I can write and read at about 800 MBps. That's far above the actual read and write speeds iotop reports (average read ~14 MBps, average write ~50 MBps).
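A dd test of that kind looks roughly like this (the path and sizes are placeholders; the direct flags bypass the page cache):
# Rough sequential write test, then a read test, bypassing the page cache.
% dd if=/dev/zero of=/tmp/ddtest bs=1M count=4096 oflag=direct
% dd if=/tmp/ddtest of=/dev/null bs=1M iflag=direct
% rm /tmp/ddtest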
So how can I exhaust my machine's resources? What does rethinkdb need to run faster? Why doesn't it use more resources and achieve higher throughput?
More information on what it's running on: it's an EX40SSD from Hetzner, two SSDs in a software RAID0, ext4 filesystem (tomorrow I'll try to mount it with noatime to see if it's better). The rethinkdb configuration is all defaults; the inserts are done to a table that has only one shard and one replica. Please feel free to ask about anything else relevant I might have forgotten to mention.
Thanks in advance.
What I suspect is going on here is lock contention on the actual btrees. When you're inserting a large batch of documents, the system grabs various parts of the btree in parallel to update it with the new documents. This is a set of read-write locks -- other parts of the system can still read, but if you insert another large batch in parallel, there is a high probability that it will touch similar parts of the btree, and therefore has to wait for the system to start unlocking as it inserts parts of the first batch. (This isn't specific to RethinkDB, but a problem in databases in general) This is likely why you're not hitting 100% CPU/disk throughput.
There are a few things you can try, but note that there are subtleties to various approaches. Benchmarking in general is hard.
You can try to shard the table into 32 shards and retry your benchmark. You don't actually have to create a cluster, you can shard into 32 shards on a single machine. This will result in multiple btrees, so you'll minimize contention and will be able to use more system resources. Note that while this will likely increase the throughput, increasing the number of shards also slightly increases the latency, so you might need to significantly increase parallelism before you start seeing throughput increases.
You can try not to batch writes and instead write one document at a time (which generally approximates real-world use cases a little better). Then, start hundreds of parallel clients instead of just one or two, and have all of them write one document at a time in parallel. Note that you need to make sure the clients themselves aren't a bottleneck in this situation.
You can try to rerun your benchmark and also spin up clients that read from the database in parallel with the writes. In RethinkDB reads can usually go through even when you're writing to a particular document, so this will give you the opportunity to up CPU usage and bypass contention.
Pay attention to the IDs of the documents. If the database is large enough (say, millions of documents), and the IDs you're inserting are random, you're much less likely to touch the same parts of the btree, so contention becomes less of an issue.
You can combine various approaches (sharding, reading + writing, various numbers of concurrent clients) to start getting a sense for how the database behaves in various scenarios.
Note that there might be things going on that you wouldn't normally be aware of. For example, RethinkDB has a log-structured storage engine that does live compaction on disk, and this might use up some IO (and CPU) cycles that you'd be surprised by if you didn't know about live compaction. There are dozens of other components like this that might compound to surprising behavior, as these systems are typically very complex under the hood.
Hope this helps -- would love to hear about your progress on the benchmarks. We do a lot of them internally, and it's an art and a science to discover the boundaries of the system's performance on different use cases.
My guess is that the bottleneck here is the disk system, but not its throughput. What's more likely is that writes are happening in chunks that are too small to be efficient, or that there are delays due to latency between individual writes.
It's also possible that the latency between individual write queries coming from the client and their processing on the server slows the system down.
Here are a few things I recommend trying:
Increase the batch size further. Your documents are very small. Therefore I think you might get a significantly higher throughput with batches of 1,000-10,000 documents. This might work especially well in combination with the next point.
Run multiple concurrent clients. You mentioned that you have run 2 clients concurrently, but this might not be enough. I recommend running 16-32 if possible.
Check the cache size RethinkDB is using. By default, RethinkDB picks the cache size as a fraction of the available memory, but that is not always reliable. I recommend passing a --cache-size <MB> parameter to RethinkDB (or adding the cache-size=<MB> parameter to the configuration file, if you're using one). I can see that your server has 32 GB of RAM. I recommend using in the range of 20000 MB (or even more) as the cache size. A larger cache reduces the number of reads, but up to a certain limit also increases the amount of unsaved data that RethinkDB can accumulate in RAM to make disk writes more efficient.
Experiment with the --io-threads <THREADS> parameter. The default is 64, but you can try increasing it to e.g. 128 and see if it has any effect; a sketch combining these flags follows below.
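A minimal sketch of a launch line combining these flags (the values are the ones suggested above, not universal recommendations):
# Start rethinkdb with an explicit cache size (in MB) and more IO threads.
% rethinkdb --cache-size 20000 --io-threads 128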
I ran into a very strange problem while testing Cassandra. I have a very simple column family that stores video data (keys point to time periods, and there is only one column containing ~2 MB of video for each period).
Use Case
I start loading data using the Hector API (round-robin) into 6 empty nodes (8 GB RAM for Cassandra). The load runs in 4 threads, each adding 4 rows per second.
After a while (running the load for an hour or so), roughly 100-200 GB have been added to each node (depending on the replication factor), and then one or several nodes become unreachable (they don't respond to pings; only a reboot helps).
Why Compaction
I use tiered-level compaction, and monitoring the system (Debian) I can see that it is not the writes but compaction that takes almost all the resources (disk, memory), causing the server to refuse writes and then fail.
After 30-40 minutes of the test, compaction tasks can no longer keep up and get queued. The interesting thing is that there are no deletes or updates, so compaction just reads and rewrites the data again and again without bringing me any actual value (it could just as well be compacted once in the evening).
When I slow down the pace, i.e. running 2 threads with a 1 second delay, things go better, but I wonder whether it will still work when I have 20 TB, not 100 GB, on a node.
Is Cassandra optimized for this type of workload? How are resources normally distributed between compaction and reads/writes?
Update
Updating the network driver solved the problem with the unreachable cluster.
Thanks,
Sergey.
Cassandra will use up to in_memory_compaction_limit_in_mb memory for a compaction. It is routine to have compaction running while reads and writes are served simultaneously. It is also normal that compaction can fall behind if you continue to throw writes at it as fast as possible; if your read workload requires that compaction be up to date or close to it at all times, then you'll need a larger cluster to spread the load around more machines.
Recommended amount of disk per node for online queries is up to 500GB, maybe 1TB if you're pushing it. Remember that this amount of data will have to be rebuilt if a node fails. Typical Cassandra workloads are CPU-bound or iops-bound, not disk-space bound, so you won't be able to make good use of that space anyway.
(It's also possible to do batch analytics against Cassandra, which we do with the Cassandra Filesystem, in which case higher disk:cpu ratios are desirable, but we use a custom compaction strategy for that as well.)
It's not clear from your report why a server would become unreachable. This is really an OS-level problem. (Are you swapping? Disabling swap would be a good first step.)
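If swap turns out to be involved, a minimal sketch of disabling it (standard Linux commands; also comment out any swap entries in /etc/fstab so it stays off after a reboot):
# Check whether the node is swapping, then disable swap immediately.
% free -m
% swapoff -a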