We have an Azure Redis Cache (Standard, 2.5 GB). We observe the following behaviour:
Every now and then, we observe large drops in memory usage. It appears that large numbers of resources are being evicted.
Things to note:
Eviction policy is LRU
Available cache size is 2.5 GB
There is no application code that would evict such large amounts of memory (the largest objects are ~80 KB and most are significantly smaller)
Observed memory drops represent tens of thousands of keys
We seldom use explicit expiry dates on cached objects, and when we do they are always < 1 hour.
My question is: apart from application logic explicitly removing keys, are there any other circumstances under which Redis would evict large numbers of keys?
The memory cleanup may not represent evictions.
You say "it appears" that lots of resources are being evicted, but if you are just relying on the reclaimed memory for that appearance, you may be chasing ghosts. Have you checked how this graph overlays with the Total Keys metric available in the Azure Portal? Overlaying the two series should allow you to see whether or not the memory reclamation really is due to eviction or if it's due to another process like Azure perhaps calling MEMORY PURGE periodically on the cache instance to clean up dirty pages?
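If you can reach the instance with a client, the INFO command's stats section exposes counters that separate evictions from TTL expirations, which is a quick way to test this theory. A minimal sketch using the redis-py client; the host name and access key below are placeholders:

    # Sketch: distinguish evictions from TTL expirations using Redis INFO counters.
    # Assumes the redis-py client; host and password are placeholders for your cache.
    import redis

    r = redis.StrictRedis(
        host="mycache.redis.cache.windows.net",  # placeholder host
        port=6380,                               # Azure's SSL port
        password="<access-key>",                 # placeholder key
        ssl=True,
    )

    stats = r.info("stats")
    print("evicted_keys:", stats.get("evicted_keys"))  # removed by the maxmemory policy
    print("expired_keys:", stats.get("expired_keys"))  # removed because their TTL elapsed
    print("keyspace:", r.info("keyspace"))             # per-db key counts, to compare with Total Keys

If evicted_keys barely moves while memory drops, the reclamation is coming from something other than the eviction policy.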
Can you change your Redis eviction policy to noeviction and see if that addresses your problem? Doing so means you will have to manage all content yourself. https://redis.io/topics/lru-cache has more details.
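For reference, a hedged sketch of how you might inspect and change the policy on a self-managed Redis instance with redis-py (on Azure Cache for Redis the CONFIG command is typically restricted, so there the maxmemory-policy is changed through the portal's settings instead):

    # Sketch for a self-managed Redis instance; connection details are placeholders.
    import redis

    r = redis.StrictRedis(host="localhost", port=6379)
    print(r.config_get("maxmemory-policy"))         # e.g. {'maxmemory-policy': 'allkeys-lru'}
    r.config_set("maxmemory-policy", "noeviction")  # reject writes instead of evicting when full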
How to pin a table in cache so it does not get swapped out of memory?
Situation: we are using MicroStrategy BI reporting and the semantic layer is built. We wanted to cache highly used tables using Spark SQL's CACHE TABLE, and we did cache them in the Spark context (Thrift server). Initially everything was in memory; now some of it is in memory and some is on disk. Reading from that local disk is relatively more expensive than reading from S3, so queries may take longer and response times are inconsistent from a user-experience perspective. When more queries run against the cached tables, copies of the cached table images are made and those copies do not stay in memory, causing reports to run longer. So how do we pin the tables so they are not swapped to disk? Spark memory management uses dynamic allocation; how can we keep these few tables pinned in memory?
There are a couple of different ways to tackle this.
First, keep in mind that in-memory space is shared between storage and execution, so big joins and similar operations that need temporary space compete with cached data for memory. You may want to look at spark.memory.storageFraction, which currently defaults to 0.5; consider raising it to 0.75, but note that this will likely slow down your queries.
Also consider applying good data engineering to the problem: reduce the amount of data that needs to be stored. Create a temp view with old records removed and unneeded columns pruned, then cache that. Consider using smaller data types for better storage efficiency; for example, ints are more space-efficient than long strings.
Lastly, consider switching to an instance type that has more memory available, or to an instance with fast local disks. In some situations disk storage isn't that much slower than memory, particularly when you're running big, complicated analytical queries where the cluster is CPU-bound rather than I/O-bound.
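As a rough illustration of the first two suggestions, here is a minimal PySpark sketch; the table and column names are made up and the 90-day filter is only an example:

    # Sketch: raise the storage fraction and cache a pruned view instead of the full table.
    # Table/column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cache-tuning")
        # Default is 0.5; a larger storage fraction protects more cached data
        # from eviction but leaves less room for execution memory.
        .config("spark.memory.storageFraction", "0.75")
        .getOrCreate()
    )

    # Drop old records and unneeded columns before caching, so less data competes for memory.
    spark.sql("""
        CREATE OR REPLACE TEMPORARY VIEW sales_recent AS
        SELECT order_id, customer_id, CAST(amount AS INT) AS amount
        FROM sales
        WHERE order_date >= date_sub(current_date(), 90)
    """)
    spark.sql("CACHE TABLE sales_recent")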
Recently I went through a tutorial about the Cassandra key cache and row cache. Can anyone help me with some real-world examples of where these caches have an impact? And what is the impact of increasing their values in the config file?
On running desc table I found this:
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
Your main concern is the memory profile of your application.
The key cache optimises the read path: on a hit it allows us to skip the partition summary and partition index and go straight to the compression offset. As for the row cache, if you get a hit, you've got your answer and don't need to go down the read path at all.
Key cache - The key cache is on by default, as it only keeps the key of the row (and its position), not the row itself. Keys are typically small relative to the rest of the row, so this cache can hold many entries before it's exhausted.
Row cache - The row cache holds an entire row and is useful when you have a fairly static querying pattern. The argument for the row cache is that if you read the same rows over and over, you can keep them in memory rather than going down to the SSTable (storage) level, bypassing an expensive seek on the read path. In practice, the memory slowdowns caused by using the row cache in non-optimal use cases make it an unpopular feature.
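If you do want to try the row cache, it is configured per table plus a global size limit. A hedged example using the Python driver on a recent Cassandra version; the keyspace, table, and the value of 100 rows per partition are made up, and the row cache also has to be enabled globally by setting row_cache_size_in_mb > 0 in cassandra.yaml (key_cache_size_in_mb is the analogous knob for the key cache):

    # Sketch: enable the row cache for one table. Keyspace/table names are hypothetical.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")

    # Cache all partition keys and up to 100 rows per partition for this table.
    session.execute("""
        ALTER TABLE users
        WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'}
    """)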
So what happens if you fill up the cache? Well, there's an eviction policy, but if you're constantly kicking things out of either cache to make room for new items, the caches won't be very useful, and the GC-related performance degradation will hurt overall performance.
What about setting very high cache values? This is where there are better alternatives, more on this below. Making the row cache huge just leads to GC issues, which, depending on what you're doing exactly, typically result in an overall net loss in performance.
One idea I've seen used relatively well is having a caching layer on top of Cassandra, such as Apache Ignite or Memcached. You load hot data into the caching layer to get fast reads, and your application writes to the cache layer and then to C* for persistence. These architectures come with plenty of their own headaches, but if you want to cache data for lower query latencies, the C* row cache isn't the best tool for the job.
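To make the shape of that architecture concrete, here is a minimal cache-aside read sketch with Memcached in front of Cassandra; the table, addresses, and the 5-minute TTL are illustrative only, and a real write path would also update or invalidate the cache on writes:

    # Sketch: cache-aside reads with Memcached in front of Cassandra.
    # Names, addresses, and TTL are placeholders.
    import json
    from cassandra.cluster import Cluster
    from pymemcache.client.base import Client as MemcacheClient

    cache = MemcacheClient(("127.0.0.1", 11211))
    session = Cluster(["127.0.0.1"]).connect("my_keyspace")

    def get_user(user_id):
        key = "user:{}".format(user_id)
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)              # hot read served from the cache layer
        row = session.execute(
            "SELECT user_id, name FROM users WHERE user_id = %s", (user_id,)
        ).one()
        if row is not None:
            value = {"user_id": str(row.user_id), "name": row.name}
            cache.set(key, json.dumps(value), expire=300)
            return value
        return None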
What is the maximum value that one can set for transaction_buffer in the memsql cnf? I assume there is a correlation with the RAM allocated on the server. My leaves have 32 GB each, and at the moment we have transaction_buffer set to 0. We are past the design phase on our cluster, we would like to do some performance tuning, and this is one parameter that needs to be set accordingly.
The transaction_buffer size is an amount of memory reserved per database partition - i.e. each leaf node will need (transaction_buffer size × partitions per leaf × number of databases) of memory. The default is 128 MB, and this should generally be sufficient.
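To make that arithmetic concrete, a small worked example with made-up numbers for the partition and database counts:

    # Worked example of the per-leaf reservation; partition/database counts are hypothetical.
    transaction_buffer_mb = 128   # the default
    partitions_per_leaf = 8       # hypothetical
    databases = 4                 # hypothetical

    reserved_mb = transaction_buffer_mb * partitions_per_leaf * databases
    print(reserved_mb, "MB reserved per leaf")  # 4096 MB, i.e. ~4 GB of a 32 GB leaf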
Basically, it's a balancing act - data in transaction_buffer will exist in memory before being written to disk. A transaction_buffer of 0 may save you some memory, but it's not taking full advantage of the speed of being in memory. If you have a lot of databases that are updated infrequently a low transaction_buffer may be the right balance as it is a per database cost (keeping in mind that each partition is a database itself).
Transaction_buffer may also be valuable for you as a "get out of jail free" card - since if your workload becomes more and more memory intensive it's possible to get into a situation where your OS is killing MemSQL too frequently to reduce memory consumption. Once you get stuck in a vicious cycle like that, restarting with a reduced transaction buffer can reduce memory overhead enough to keep the system from being OOM-killed long enough to troubleshoot and correct the issue on your end.
Eventually this setting may become adaptive, and you'll be left without that easy way to get some wiggle room. Which is why it is essential to make sure maximum_memory is low enough that your system doesn't begin to OOM-kill processes. https://docs.memsql.com/docs/memory-management
I'm asking this question because I would like to understand how I can run RethinkDB better: what kind of hardware it should be running on, what kind of filesystem, and what other system configuration would maximize its throughput.
I'm trying to fill a table as fast as I can with documents of the form {"n": <counter>, "rand": <Math.random()>}. I read somewhere that this is faster with batches of 200 documents, so that's what I'm inserting. I am also using soft durability. I started one Node.js process doing this and I can insert on average 10k documents per second, which is pretty good.
But while this is happening, RethinkDB is using about 70% of one core (I have 8 virtual cores, it's an i7-4770) and the Node.js process is using 5%. So it seems that CPU is not the bottleneck.
As soon as I start another Node.js process doing the same thing, the inserts per second on both processes drop to about 4k-5k. Again, the CPU load stays the same.
I fired up iotop and I do see a lot of activity there, but not what I expected. I configured two SSDs in a RAID 0, and a quick dd test says I can write and read at about 800 MBps. That's far above the actual read and write speeds iotop reports (average read ~14 MBps, average write ~50 MBps).
So how can I exhaust my machine's resources? What does RethinkDB need to run faster? Why doesn't it use more resources and deliver higher throughput?
More information on what it's running on: it's an EX40SSD from Hetzner, two SSDs in a software RAID 0, ext4 filesystem (tomorrow I'll try mounting it with noatime to see if that's better). The RethinkDB configuration is all defaults, and the inserts go to a table that has only one shard and one replica. Please feel free to ask about anything else relevant I might have forgotten to mention.
Thanks in advance.
What I suspect is going on here is lock contention on the actual btrees. When you're inserting a large batch of documents, the system grabs various parts of the btree in parallel to update it with the new documents. These are read-write locks -- other parts of the system can still read, but if you insert another large batch in parallel, there is a high probability that it will touch similar parts of the btree and therefore have to wait for the system to start unlocking as it inserts parts of the first batch. (This isn't specific to RethinkDB; it's a problem in databases in general.) This is likely why you're not hitting 100% CPU/disk throughput.
There are a few things you can try, but note that there are subtleties to various approaches. Benchmarking in general is hard.
You can try to shard the table into 32 shards and retry your benchmark. You don't actually have to create a cluster, you can shard into 32 shards on a single machine. This will result in multiple btrees, so you'll minimize contention and will be able to use more system resources. Note that while this will likely increase the throughput, increasing the number of shards also slightly increases the latency, so you might need to significantly increase parallelism before you start seeing throughput increases.
You can try not to batch writes and instead write one document at a time (which generally approximates real-world use cases a little better). Then start hundreds of parallel clients instead of just one or two and have all of them write one document at a time in parallel (see the sketch after these suggestions). Note that you need to make sure the clients themselves aren't a bottleneck in this situation.
You can try to rerun your benchmark and also spin up clients that read from the database in parallel with the writes. In RethinkDB reads can usually go through even when you're writing to a particular document, so this will give you the opportunity to up CPU usage and bypass contention.
Pay attention to the IDs of the documents. If the database is large enough (say, millions of documents), and the IDs you're inserting are random, you're much less likely to touch the same parts of the btree so contention becomes less of an issue.
You can combine various approaches (sharding, reading + writing, various numbers of concurrent clients) to start getting a sense for how the database behaves in various scenarios.
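As a starting point for the many-parallel-clients suggestion above, here is a minimal sketch using the RethinkDB Python driver (the JavaScript driver works the same way); the database and table names, connection details, and client count are placeholders:

    # Sketch: many parallel clients, each writing one document at a time with soft durability.
    # Connection details and db/table names are placeholders.
    import random
    from multiprocessing import Pool

    import rethinkdb as r

    def writer(n_docs):
        conn = r.connect("localhost", 28015)   # one connection per client process
        table = r.db("test").table("test")
        for i in range(n_docs):
            table.insert({"n": i, "rand": random.random()}, durability="soft").run(conn)
        conn.close()

    if __name__ == "__main__":
        clients = 100                          # hundreds of parallel clients
        with Pool(clients) as pool:
            pool.map(writer, [1000] * clients)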
Note that there might be things going on that you wouldn't normally be aware of. For example, RethinkDB has a log-structured storage engine that does live compaction on disk, and this might use up some IO (and CPU) cycles that you'd be surprised by if you didn't know about live compaction. There are dozens of other components like this that might compound to surprising behavior, as these systems are typically very complex under the hood.
Hope this helps -- would love to hear about your progress on the benchmarks. We do a lot of them internally, and it's an art and a science to discover the boundaries of the system's performance on different use cases.
My guess is that the bottleneck here is the disk system, but not its throughput. What's more likely is that writes are happening in chunks that are too small to be efficient, or that there are delays due to latency between individual writes.
It's also possible that the latency between individual write queries coming from the client and their processing on the server slows the system down.
Here are a few things I recommend trying (a combined sketch follows the list):
Increase the batch size further. Your documents are very small. Therefore I think you might get a significantly higher throughput with batches of 1,000-10,000 documents. This might work especially well in combination with the next point.
Run multiple concurrent clients. You mentioned that you have run 2 clients concurrently, but this might not be enough. I recommend running 16-32 if possible.
Check the cache size RethinkDB is using. By default, RethinkDB picks the cache size as a fraction of the available memory, but that is not always reliable. I recommend passing a --cache-size <MB> parameter to RethinkDB (or adding the cache-size=<MB> parameter to the configuration file, if you're using one). I can see that your server has 32 GB of RAM, so I recommend something in the range of 20000 MB (or even more) as the cache size. A larger cache reduces the number of reads and, up to a certain limit, also increases the amount of unsaved data that RethinkDB can accumulate in RAM to make disk writes more efficient.
Experiment with the --io-threads <THREADS> parameter. The default is 64, but you can try increasing it to e.g. 128 and see if it has any effect.
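A minimal sketch combining these suggestions, using the RethinkDB Python driver (the JavaScript driver is equivalent); the db/table names, batch size, and server flag values are placeholders to experiment with:

    # The server itself would be started with the larger cache and more I/O threads, e.g.:
    #   rethinkdb --cache-size 20000 --io-threads 128
    # Sketch: insert larger batches with soft durability; run several copies of this client.
    import random
    import rethinkdb as r

    conn = r.connect("localhost", 28015)
    table = r.db("test").table("test")

    batch_size = 5000                          # well above the original 200
    batch = [{"n": i, "rand": random.random()} for i in range(batch_size)]
    print(table.insert(batch, durability="soft").run(conn))   # reports the 'inserted' count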
There is an in-memory option introduced in Cassandra by DataStax Enterprise 4.0:
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/inMemory.html
But the size of an in-memory table is limited to 1 GB.
Does anyone know the reasoning behind the 1 GB limit? And is it possible to extend it to a much larger in-memory table, such as 64 GB?
To answer your question: today it's not possible to bypass this limitation.
In-memory tables are stored within the JVM heap, and regardless of the amount of memory available on a single node, allocating more than 8 GB to the JVM heap is not recommended.
The main reason for this limitation is that the Java garbage collector slows down when dealing with very large heaps.
However, if you consider Cassandra as a distributed system, 1 GB is not the real limitation. The effective in-memory capacity is roughly:
(nodes * allocated_memory) / replication_factor
where allocated_memory is at most 1 GB per node -- so your table may hold many GB in memory, spread across the different nodes.
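To make the formula concrete, a small worked example with made-up cluster numbers:

    # Worked example of effective in-memory capacity; cluster size and RF are hypothetical.
    nodes = 12                    # hypothetical cluster size
    allocated_memory_gb = 1       # per-node cap for the in-memory table in DSE 4.0
    replication_factor = 3

    capacity_gb = nodes * allocated_memory_gb / replication_factor
    print(capacity_gb, "GB of distinct data can be held in memory")   # 4.0 GB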
I think things will improve in the future, but dealing with 64 GB in memory could be a real problem when you need to flush data to disk. One more consideration that adds to the limitation: avoid TTLs when working with in-memory tables. A TTL creates tombstones, and a tombstone is not deallocated until the gc_grace_seconds period passes -- so with the default value of 10 days, each tombstone keeps its portion of memory busy and unavailable, possibly for a long time.
HTH,
Carlo