In a Cassandra database, a write must be logged in the Write Ahead Log first and then added to the memtable in memory. Since the Write Ahead Log is on disk, even though it only performs sequential writes (i.e., it is append-only), will it still be much slower than memory access and thus become the performance bottleneck for writes?
If I understand it correctly, Cassandra supports a mechanism to keep the Write Ahead Log in the OS cache and then flush it to disk at a pre-configured interval (say, every 10 seconds). However, does that mean all the data changes made within those 10 seconds could be lost if the machine crashes?
You can control how the commit log is synced using the commitlog_sync configuration setting. By default it's periodic, and the log is synced to disk every 10 seconds (controlled by the commitlog_sync_period_in_ms setting).
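As a reference point, this is roughly what the default looks like in cassandra.yaml (the values shown are the defaults):

```yaml
# cassandra.yaml -- default commit log sync behaviour
# In periodic mode, writes are acked before the log is fsynced.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000   # fsync the commit log every 10 seconds
```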
And yes, if you lose power there is a risk that data not yet synced in the commit log is lost. But Cassandra relies on the fact that you have multiple replicas, and if you set things up correctly, each replica should be in a separate rack (at a minimum; better still in additional data centers) with separate power, etc.
Related
If you are using LMDB from only a single thread, and don't care about database persistence at all, is there any reason to open and close transactions?
Will it cause a performance issue to do all operations within a single transaction? Is there a performance hit from opening and closing too many transactions?
I am finding that my LMDB database slows down dramatically once it grows larger than available RAM, but neither my SSD nor my CPU is anywhere near capacity.
If the transaction is not committed, there is no guarantee that a reader (in a different process) can read the item. Write transactions should be committed at some point so that the data becomes visible to other readers.
The database slowdown could simply be due to non-sequential writes. According to this post (https://ayende.com/blog/163330/degenerate-performance-scenario-for-lmdb), non-sequential writes take longer.
If you don't commit, your DB just grows in memory, which will result in the OS starting to swap once you run out of memory, which hits the disk, which is slow.
If you don't need persistence at all, then use an in-memory hash map; LMDB really doesn't provide you with anything in that case. If you do want persistence but don't care about losing data, then choose a reasonable commit interval (which depends on the value size, so experiment) and commit, say, after every 1000 values or so.
If you commit too infrequently, you just incur the whole cost of disk access at a single point in time, so I think it makes more sense to spread that load a bit.
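As a rough sketch of that batching pattern (using the Python lmdb binding; the path, value size, and batch size of 1000 are arbitrary placeholders to tune for your workload):

```python
import lmdb

BATCH_SIZE = 1000  # arbitrary starting point; tune for your value sizes

env = lmdb.open("/tmp/example-lmdb", map_size=2**30)  # 1 GiB map size
try:
    txn = env.begin(write=True)
    for i in range(100_000):
        key = f"key-{i:08d}".encode()   # sequential keys keep appends cheap
        txn.put(key, b"x" * 256)        # placeholder payload
        if (i + 1) % BATCH_SIZE == 0:
            txn.commit()                # pay the disk cost once per batch
            txn = env.begin(write=True) # start the next batch
    txn.commit()                        # flush the final partial batch
finally:
    env.close()
```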
The data in the commit log is synced to disk periodically, every 10 seconds by default (controlled by commitlog_sync_period_in_ms). So if all replicas crash within those 10 seconds, will I lose that data? Does it mean that, theoretically, a Cassandra cluster can lose data?
If a node crashed right before updating the commit log on disk, then yes, you could lose up to ten seconds of data.
If you keep multiple replicas, by using a replication factor higher than 1 or by having multiple data centers, then much of the lost data would also be on other nodes and would be recovered on the crashed node when it is repaired.
Also, the commit log may be synced in less than ten seconds if the write volume is high enough to hit the size limits before the ten seconds elapse.
If you want more durability than this (at the cost of higher latency), you can change the commitlog_sync setting from periodic to batch. In batch mode, writes are not acked until they have been written to disk, and the commitlog_sync_batch_window_in_ms setting controls how often batches of writes are synced to disk.
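In cassandra.yaml terms, that looks roughly like this (the window value is an illustrative example, not a recommendation):

```yaml
# cassandra.yaml -- trade write latency for durability
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2   # example value; writes are not acked until the log is fsynced
```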
The ten-second default for periodic mode is designed for spinning disks: they are so slow that there is a performance hit if you block acks waiting for commit log writes. For the same reason, if you use batch mode, a dedicated disk for the commit log is recommended so that the write head doesn't need to do any seeks, keeping the added latency as low as possible.
If you are using SSDs, then you can use more aggressive timing since the latency is greatly reduced compared to a spinning disk.
Cassandra's default configuration sets the commitlog_sync mode to periodic, causing the commit log to be synced every commitlog_sync_period_in_ms milliseconds, so you can potentially lose up to that much data if all replicas crash within that window of time.
Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound. (All writes go to the commit log, but Cassandra is so efficient in writing that the CPU is the limiting factor.)
Can somebody explain this statement to me: why is I/O not the limiting factor here? As I understand it, a write first goes to I/O and then to the CPU.
I took a look at this StackOverflow question, the Cassandra Incubator, and an Apache email chain, but it's still not clear to me.
Cassandra keeps a log of items; yes, that part is I/O. But this log is only ever appended to, so Cassandra doesn't need to wait for HDD seeks. Looking at HDD burst write speeds, which are above 100 MB/s, disk really doesn't seem like a limiting factor to me. In fact, the network would become the limit first (a gigabit link tops out around 125 MB/s, in the same ballpark). But because you probably won't reach write speeds at which the network becomes limiting, the CPU limitation kicks in.
I hope that now this part of the answer makes sense:
To process an insert, Cassandra needs to deserialize the messages from the clients, find which nodes should store the data, and send messages to those nodes. Those nodes then store the data in an in-memory data structure called a memtable.
This is almost always CPU bound initially. However, as more data is inserted, the memtables grow large and are flushed to disk and new (empty) memtables are created. The flushed memtables are stored in files known as SSTables. There is an ongoing background process called compaction that merges SSTables together into progressively larger and larger files.
by Richard from Explanation required for a statement in Cassandra documentation
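To make that flow concrete, here is a purely illustrative toy model in Python (not Cassandra code; every name and limit below is made up) of writes landing in a memtable, full memtables being flushed to SSTables, and compaction merging the results:

```python
# Toy model of the write path described above. Real Cassandra flushes memtables
# to disk as immutable SSTable files and compacts them in the background.
MEMTABLE_LIMIT = 4          # flush after this many entries (tiny, for illustration)

memtable = {}
sstables = []               # each "SSTable" here is just a sorted list of (key, value) pairs

def write(key, value):
    memtable[key] = value                       # CPU-bound: an in-memory insert
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    global memtable
    sstables.append(sorted(memtable.items()))   # a sequential write to disk in reality
    memtable = {}

def compact():
    """Merge all SSTables into one, keeping the newest value for each key."""
    merged = {}
    for table in sstables:                      # tables are ordered oldest to newest
        merged.update(dict(table))              # newer values overwrite older ones
    sstables[:] = [sorted(merged.items())]

for i in range(10):
    write(f"k{i}", f"v{i}")
compact()
print(len(sstables), "SSTable(s) after compaction")
```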
Is it safe to disable the commit log if we use replication? When a node fails it is often due to hard disk failure, so in that case the commit log would not help us with durability, but replication would. Why do we even need a commit log when we use replication?
With no commit log, data stored in the memtables on the replicas may take a long time (could be unbounded, but in practice is often minutes) to be written to disk. This means, within that window, you could lose writes. If, for example, your data center loses power, you could lose all the writes for the last few minutes on all replicas. The commit log syncs (by default) every 10 seconds so you would lose at most 10 seconds of data in the event of simultaneous failure.
However, if you're using multi-data center replication then to lose data you would need simultaneous failures across data centers.
It's a trade-off: a commit log with no replication guards against a single node crashing or having a non-destructive failure. With replication in a single DC, it guards against whole-DC failures, e.g. a power failure. With replication across multiple DCs, it guards against correlated failures. You can decide how much resilience you need based on the cost of enabling the commit log versus the cost of losing recent writes.
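For completeness, the switch being discussed can be set per keyspace via durable_writes; a minimal CQL sketch (the keyspace name and replication settings are just examples):

```sql
-- Example only: skip the commit log for writes to this keyspace.
-- Its data is then only as durable as memtable flushes plus replication.
CREATE KEYSPACE video_store
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
  AND durable_writes = false;
```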
I ran into a very strange problem while testing Cassandra. I have a very simple column family that stores video data (keys point to a time period, and there is only one column containing ~2 MB of video for that period).
Use Case
I start loading data using the Hector API (round-robin) into 6 empty nodes (8 GB RAM for Cassandra). The load runs in 4 threads, each adding 4 rows per second.
After a while (running the load for an hour or so), roughly 100-200 GB has been added to a node (depending on the replication factor), and then one or several nodes become unreachable (they don't respond to pings; only a reboot helps). For scale, 4 threads × 4 rows/s × ~2 MB is roughly 32 MB/s of raw ingest before replication.
Why Compaction
I use tiered-level compaction, and from monitoring the system (Debian) I can see that it is actually not the writes but compaction that takes almost all the resources (disk, memory) and causes the server to refuse writes and then fail.
After about 30-40 minutes of the test, compaction tasks can no longer keep up and get queued. The interesting thing is that there are no deletes or updates, so compaction just reads and rewrites the data again and again without bringing me any actual value (it could just as well be compacted once in the evening).
When I slow down the pace, i.e. running 2 threads with a 1-second delay, things go better, but will it still work when I have 20 TB rather than 100 GB on a node?
Is Cassandra optimized for this type of workload? How are resources normally distributed between compaction and reads/writes?
Update
Updating the network driver solved the problem with the unreachable cluster.
Thanks,
Sergey.
Cassandra will use up to in_memory_compaction_limit_in_mb of memory for a compaction. It is routine to have compaction running while reads and writes are served simultaneously. It is also normal for compaction to fall behind if you continue to throw writes at it as fast as possible; if your read workload requires that compaction be up to date, or close to it, at all times, then you'll need a larger cluster to spread the load across more machines.
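If you need compaction to take a smaller share of resources, the usual knobs live in cassandra.yaml; a sketch with illustrative values (not recommendations; in_memory_compaction_limit_in_mb applies to older releases):

```yaml
# cassandra.yaml -- throttle background compaction (illustrative values)
compaction_throughput_mb_per_sec: 16   # cap compaction I/O; 0 disables throttling
concurrent_compactors: 2               # limit how many compactions run in parallel
in_memory_compaction_limit_in_mb: 64   # per-compaction memory limit (older releases)
```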
Recommended amount of disk per node for online queries is up to 500 GB, maybe 1 TB if you're pushing it. Remember that this amount of data will have to be rebuilt if a node fails. Typical Cassandra workloads are CPU-bound or iops-bound, not disk-space bound, so you won't be able to make good use of that space anyway.
(It's also possible to do batch analytics against Cassandra, which we do with the Cassandra Filesystem, in which case higher disk:cpu ratios are desirable, but we use a custom compaction strategy for that as well.)
It's not clear from your report why a server would become unreachable. This is really an OS-level problem. (Are you swapping? Disabling swap would be a good first step.)