I am trying to ingest JSON records using logstash but am running into memory issues. Previously our pipeline could run with default settings (memory queue, batch size 125, one worker per core) and process 5k events per second. We added some data to the JSON records and now the heap memory goes up and gradually falls apart after one hour of ingesting.
What I have tried to improve this is:
Increase memory via options in docker-compose to "LS_JAVA_OPTS=-Xmx8g -Xms8g". The virtual machine has 16GB of memory.
Switched to a persistent queue.
Lowered pipeline batch size from 125 down to 75.
While these have helped, it just delays the time until the memory issues start to occur. Any ideas on what I should do to fix this? Should I increase the memory some more? Should I increase the size of the persistent queue? I am at my wits end!
Edit: Here is another image of memory usage after reducing pipeline works to 6 and batch size to 75:
For anybody who runs into this and is using a lot of different field names, my problem was due to an issue with logstash here that will be fixed in version 7.17. Logstash is caching field names and if your events have a lot of unique field names, it will cause out of memory errors like in my attached graphs.
I wish to estimate how many cassandra storage nodes I would need to serve a specific number of reads per second.
My node specs are 32 cores, 256GB ram, 10Gbps NIC, 10x 6TB HDDs. Obviously SSDs would be much more preferrable, but these are not available in this instance.
I have around 5x10^11 values of 1kB each = 500TB of values to serve, at a rate of 100,000 read requests per second. The distribution of these requests is completely even, ie, ram capacity caching will have no effect.
If we assume that each HDD can sustain ~100 IOps, then I could expect that I need at least ~ 100 nodes to serve this read load - correct?
I also estimate that I would need at least ~ 20 machines for the total storage with a replication factor of 2, plus overhead.
It's a really broad question - you need to try to test your machines with tools, like, NoSQLBench that was specially built for such tasks.
Typical recommendation is to store ~1Tb of data per Cassandra node (including replication). You need to take into account other factors, like, how long it will take to replace the node in the cluster, or add new one - the speed of streaming is directly proportional to size of data on disk.
HDDs are really not recommended if you want to have low latency answers. I have a client with ~150Tb spread over ~30 machines with HDDs, with a lot of writes although, and read latencies regularly are going above 0.5 second, and higher. You need to take into account that Cassandra requires random access to data, and head of HDD simply couldn't move so fast to serve requests.
As we are using SSD disks to provide storage for our cluster on servers with 30 GB of memory.
There is an argument about the commitlog directory, whether to dedicate an individual disk or having it on the same data disk.
As we already using SSD disks, performance should be fine having both commitlogs and data on the same disk, as there is no mechanical moving head for writing.
However, there is another factor, that is the read/write ratio. How would such a ratio affect the performance of writing or reading when we have both commitlogs and data on the same disk?
Using SSD, when would it become important to dedicate a high performance disk for the commitlog directory?
A dedicated commitlog device usually makes a lot of sense when you have HDDs, but is less obvious if you're using SSDs.
Even if you asked only if it makes sense with SSDs setups, I will try to give some general hints about the subject, primarily based on my understandings and my own experience. I admit the focus is probably too much on HDDs, but HDDs allow a deep insight on how Cassandra stuff works and why backing a commitlog/data directory with an SSD can be a life saver.
Background: IOPS and OPS are not the same thing.
I will start from a (very) far point: Device Performance. Here's a start-point lecture about storage device performances in general. Even if the article's neutrality is under discussion, it can provide some insights about the general metrics and performance you can expect from some systems. Of course, your mileage may vary, depending on what device (type/brand/model etc...) and how much stress (intended as type of workload) you put on the device, but I think it is a good starting point for our discussion here.
The reason I prefer to start from IOPS is because it is the very starting point for understanding storage performance. The C* literature speaks about OPS, Operations Per Second, because people usually don't think in terms of IOPS, especially when looking at stats. This really hides a lot of details, the operation size for starters.
A Cassandra Operation usually consists of multiple IOPS. Cassandra documentation usually refers to spinning disks (even if SSDs are referenced too), clearly states what happens when performing reads/writes, and people tend to ignore the fact that when their software stack (that spans from up the application down to Cassandra and its data files on the storage) hit the disks the performance decreases by a huge amount just because they have failed to recognize a random workload, and even if "Cassandra is an high-performance etc... etc.. etc...".
As an example, looking at the picture in the read path documentation, you can clearly see what data structures are in memory/on disk, and how the SSTable data is accessed. Further, the row cache paragraph says:
... If row cache is enabled, desired partition data is read from the row cache, potentially saving two seeks to disk for the data...
And here's where the catch starts: these two seeks are potentially saved from Cassandra's point of view. This simply means that Cassandra won't make two requests to the storage system: it will avoid to request the partition index, and the data because everything is already in RAM, but it doesn't really translates to "the storage system will save two IO operations". Indeed, how (generic) data is retrieved from the storage device is a very different thing, and of course depends on how the files are layed-out on the disk itself: are you using EXT4, XFS, or what? Assuming no cache is available (eg for very big data set sizes you can't really cache everything...), looking for a file is IOPS consuming, and this tends to amplify the potentially saved seeks when you have data in RAM, and tends to amplify the penalty you perceive when your data is not.
You can't escape physics: HDDs pay some taxes, SSDs no.
As you already know, the main "problem" (performance-wise) of HDDs is the average seek time, that is the time the HDD needs to wait on average in order to have a target sector under the heads. Once the sector is under the heads, if the system have to read a bunch of sequential bits everything is smooth and the throughput is proportional to the rotational speed of the HDD (to be precise to the tangential speed of the platters under the head, which depends also on the track etc...).
In other terms, HDDs have an average fixed performance tax (the average seek time), and everything after is almost "free". If an application requests a bunch of sectors that are not "contiguous" (from the disk point of view, eg a fragmented file is splitted across multiple sectors, but an application can't really know this), the disk will have to wait the average seek time on average multiple times, and this fixed tax influences its maximum throughput.
The strongest argument about storage is: every device have its own maximum magic average IOPS number. This number express the number of random IOPS the device can perform. You can't force an HDD to have more IOPS on average, it's a physical problem. The OS is usually smart enough to "enqueue" sector requests in the attempt to reduce the seek times, eg ordering by ascending requested sector number (trying to exploit some sequential operations), but nothing will save the performances from a random IO workload. You have X allotted available IOPS and must face your problems with that. No matter what.
You need to take advantage of the allotted IOPS of your device, and you must be wise on how you use them.
Suppose you have an HDD that maxes out at 100 IOPS on average. If your application performs a bunch of small (say 4KB) file reads, you have an application that performs 100 * 4KB reads every second: the throughput will be around 400KB/s (unless some caching is involved, and in that case the cache saved you precious IOPS). Astonishing. This is simply because you keep paying the seek time multiple times. If you change your access pattern to something that reads 16MB (contiguous) files, you get an higher throughput because you won't pay the seek time so much, you are exploiting a sequential pattern. What changes under the hood is the Request Size of each operation.
Now an interesting question is: how are "IOPS" and "Request Size" related to? Does one request size of 16MB can be considered one IOPS? And what about a 128MB request size? This is indeed a good question. At lower level, the Request Size spans from 512 bytes (the minimum sector size) to 128KB (32*4K sectors in one request). If the operation has small size, its transfer time, the time the disk needs to fetch the data, is also small. Higher request sizes have higher transfer times obviously. However, if you are able to perform 100 4KB IOPS, you will probably be able to perform around 80 IOPS #8KB. The relation can't be linear, because the transfer time depends on the rotational speed of the disks only (the transfer time is negligible compared to the seek time), and since you are actually reading from two adjacent sectors, you'll hit the seek time penalty once per request. This translates to a throughput of around 400KB/s for 4K requests and 1.6MB/s for 8K requests. And so on.... The larger the request size, the longer it takes to transfer data, the lesser IOPS you have, the higher throughput you have. (These are random numbers, pun intended, no measurements done! Just to let you understand. I however think they are in the ballpark).
SSDs don't suffer mechanical penalties and that's why they are capable of performing much better than HDDs. They have much more IOPS, and their limits come from the onboard electronics, bus connection etc.... Having an higher IOPS device is a big plus, these can be consumed by applications that are not IOPS friendly, and the user won't notice that the applications suck. However, with SSDs, the Request Size linearly influences the number of IOPS you can perform. When you look at some device that have 100k IOPS, these are usually referred at 4K. You'll be able to perform only 6.2k requests if you perform 64K requests.
Why Cassandra has a such good read performances even with HDDs then?
Speaking from a single node point of view (because given the performance of a cluster Cassandra scales linearly with the number of nodes in the cluster), the problem lies in the question itself. This is only true if you model your data in this particular way:
You must fetch all your data with one query only.
Your data must be ordered.
If you can't fetch your data with one data, denormalize in order to retrieve it with one query only.
You fetch a relative good amount of data on every read
These are well-known Cassandra modeling rules, but the key point is that these rules do really have a reason to be applied IOPS-wise. Indeed, these rules allow Cassandra to:
Be a super fast database because it will just require the partition index and the SSTable offset index of the data: two IOPS in the best case, much more IOPS in the worst case.
Be a super fast database because it will exploit the sequential capabilities of the HDDs and will not stress the IO subsystem by issuing other IO (random) seeks.
Be a super fast database because it will just fetch more data like the point number 1.
Be a super fast database because it will exploit longer the sequential capabilities of the HDDs.
In other terms, following these basic data modeling rules allows Cassandra to be IOPS friendly when reading data back.
What happens if you screw-up your data model? Cassandra won't be IOPS friendly, and as a consequence the performances will be predictably horrible. Unless you use an SSD, which has greater IOPS and then you won't notice slowness too much.
What happens if you read/write a small amount of data (eg due to misconfigured flush sizes, small commit log etc...)? Cassandra won't be IOPS friendly, and as a consequence the performances will be predictably horrible. Unless you use an SSD, which has greater IOPS and then you won't notice slowness too much.
How a read/write ratio pattern can influence performance in a Cassandra node?
Cassandra is a complex system, with different components that interact each other. I will try to explain from my point of view what are the main points when you put everything on one device only.
Writes/Deletes/Updates in Cassandra are fast because they are simply append-only writes to the CommitLog device. Reads, on the contrary, can be very IOPS consuming. When both CommitLog and Data are on the same physical disk (either HDD or SSD), the read/write paths interact, and they both consume IOPS.
Two important questions are:
How many IOPS a read (using the read path) consumes?
How many IOPS a write consumes?
These are important question because you have to remember that your device can perform at most X IOPS, and your system will have to split these X IOPS among these operations.
It is quite difficult to answer to the "read" question because, when you request some data, Cassandra needs to locate all the SSTables needed to satisfy the request. Assuming a very big dataset size, where caching is not effective, this imply that the Cassandra read path can be very IOPS hungry. Indeed, if your data is spread into 3 different SSTables, Cassandra will have to locate all of them, and for each SSTable will follow the read path: will read the partition index, and then will read the data in the SSTable. These are at least two IOPS, because if your filesystem is not "collaborative" enough, locating a file and/or pointing at a file offset could require some more IOPS. In the end, in this example Cassandra is consuming at least six IOPS per read.
Answering the "write" question is also tricky, because compactions and flushes can be triggered. They will consume a lot of IOPS. Flushes are easy to understand: they write data from memtables to disk with a sequential pattern. Instead, compactions read data back from different SSTables on disk, and while reading the tables they flush the result out to a new disk file. This is a mixed read/write pattern, and on HDDs this is very disruptive, because will force the disk to perform multiple seeks.
Mixing percentages: TL;DR
If you have a R/W ratio of 95% reads and 5% writes, having a separate CommitLog device can be a waste of resources, because writes will hardly impact your read performances, and you write so rarely that write performance may be considered not critical.
If you have a R/W ratio of 5% reads and 95% writes, having a separate CommitLog device can be again a waste of resources, because reads will hardly impact your write performances, and your read performances will hardly suffer from a bunch of sequential appends on the commitlog.
And finally, if you have a R/W ratio of 50% reads and 50% writes, having a separate CommitLog device is NOT a waste of resources, because every write performed on the CommitLog device won't produce at least two IOPS on the data drive (one for writing, and one for going back to read).
Please note that I didn't mention compactions, because independently on your workload, when compaction triggers in, your workload will be disrupted by mixed read/write background operations on different files (consuming disk IOPS all the way), and you will suffer both on reads and writes.
All this should be clear enough for HDDs because you run out of IOPS very fast, and when you do you notice it immediately. On SSDs, however, you don't run out of IOPS that fast, but you could do if your data consists of a lot small data rows.
The reality is that getting out of IOPS on an SSD is very hard because you'll get out (by a far amount) of CPU resources, But once you do you will see your performance slowly decrease. The effect however won't be such dramatic as in the cases of HDDs. As an example, if you have a 100 IOPS HDD and you run-out of IOPS by trying to issue 500 random IO stuff, you cleary get a penalty. By calling this penalty P, if you have an SSD with 100k IOPS, to get the same penalty P you should issue 500k IOPS, which can be very difficult to do without exhausting CPU or RAM.
In general, when you run out of some type of resource in your system, you need to increase its quantity. The most important thing (to me) is not to run out of IOPS in the "Data" part of your Cassandra cluster. In the case of SSDs IOPS, it's rare enough that you'll get the limit. You'll burn your CPU well before I think. But you will if you don't tune your system, or if your workload put too much stress on the disk subsystem (eg Leveled Compaction). I'd suggest to put an ordinary HDD instead of an high performance SSD for the commitlog, saving money. But if you have a lot of very small commitlog flushes an SSD is a completely life saver, because your writers won't suffer the latency of HDDs.
Finally, in my opition, you should go in pre-production with some sort of real data, and check your IOPS requirements. If you have enough room to put the SSD there don't worry. Go and save money. If your system gets too much pressure due to compaction then having a separate device is suggested. Analyze your commitlog pattern, and if its not IOPS demanding put it on a separate disk. Moreover, if you have a virtual environment you can provision a relatively small commitlog device regardless of other factors. It won't rise the cost of your solution too much.
The actual numbers will depend highly on the type of workload you have the configuration you have etc. You can have a look at Netflix tech blog posts for ballpark numbers, e.g. #1, #2.
Dedicating a disk for commitlog directory is a sort of scale up strategy. Cassandra works well with scale out approach. You just add more nodes into the cluster to spread the load - 2nd from the linked articles has a nice graph showing near linear scalability.
I'm getting these errors:
java.lang.IllegalArgumentException: Mutation of 16.000MiB is too large for the maximum size of 16.000MiB
in Apache Cassandra 3.x. I'm doing inserts of 4MB or 8MB blobs, but not anything greater than 8MB. Why am I hitting the 16MB limit? Is Cassandra batching up multiple writes (inserts) and creating a "mutation" that is too large? (If so, why would it do that, since the configured limit is 8MB?)
There is little documentation on mutations -- except to say that a mutation is an insert or delete. How can I prevent these errors?
you can increase the commit log size to 64 mb in cassandra.yaml
commitlog_segment_size_in_mb: 64
By default the commitLog size is 32 mb.
By design intent the maximum allowed segment size is 50% of the configured commit_log_segment_size_in_mb. This is so Cassandra avoids writing segments with large amounts of empty space.
you should investigate why the write size has suddenly increased. If it is not expected i.e. due to a planned change then it may well be a problem with the client application that needs further inspection.
What is the maximum value that one can set for transaction_buffer inside memsql cnf? I assume there is a correlation with RAM allocated on the server. My leaves have 32G each and at the moment we have transaction_buffer set to 0. We are passed designing phase on our cluster and we would like to do some performance tuning and one parameter that needs to be set up accordingly is this one.
The transaction_buffer size is an amount of memory reserved per database partition - i.e. each leaf node will need transaction_buffer size * partitions per leaf * number of databases memory. The default is 128 MB and this should be sufficient generally.
Basically, it's a balancing act - data in transaction_buffer will exist in memory before being written to disk. A transaction_buffer of 0 may save you some memory, but it's not taking full advantage of the speed of being in memory. If you have a lot of databases that are updated infrequently a low transaction_buffer may be the right balance as it is a per database cost (keeping in mind that each partition is a database itself).
Transaction_buffer may also be valuable for you as a "get out of jail free" card - since if your workload becomes more and more memory intensive it's possible to get into a situation where your OS is killing MemSQL too frequently to reduce memory consumption. Once you get stuck in a vicious cycle like that, restarting with a reduced transaction buffer can reduce memory overhead enough to keep the system from being OOM-killed long enough to troubleshoot and correct the issue on your end.
Eventually, it might become adaptive, and you'll be left without that easy way to get some wiggle-room. Which is why it is essential to make sure the maximum_memory is low enough that your system doesn't begin to OOM kill processes. https://docs.memsql.com/docs/memory-management