Redis used memory trending upwards - memory leaks

So I have implemented ELK using Redis as a caching layer.
I am using Redis 3.0.4 from an RPM I found for Red Hat EL6.
I am also running jemalloc 3.6.0.
I believe the Redis configuration is largely vanilla, with the exception of a max memory cap and a non-default eviction policy:
maxmemory 500mb
and
maxmemory-policy allkeys-random
though I feel that the eviction policy is probably not required.
Now I have verified that the store is generally empty, i.e. my Logstash indexer is doing its job well and the data is making it into Elasticsearch.
What concerns me is that the used memory for Redis continues to trend upwards, and from what I have seen, if used memory hits the max (which it has done), then Redis stops working, i.e. no more log entries flow through.
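For anyone reproducing the check, something like the following from redis-cli shows both the queue depth and the memory trend (the logstash key name is a placeholder - use whichever list key your shipper actually writes to):

redis-cli llen logstash    # queue depth; should stay near 0 if the indexer keeps up
redis-cli info memory | grep -E 'used_memory_human|used_memory_peak_human|mem_fragmentation_ratio'

Watching those fields over time shows whether it is the dataset itself or allocator fragmentation that is creeping up.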
So, what am I missing?
Am I being paranoid, and can I dismiss what I am seeing?
Should I avoid the pre-packaged RPMs?
Are there additional settings I need to change?
Everything I have read to date about Redis and ELK suggests that the out-of-the-box configuration should be fine.
Be aware that this is a lightweight implementation which I hope will provide the impetus for a broader, bullet-proof implementation, which is the reason for the 500MB limit.

Related

Usage of Redis on Azure

I'm using Redis Cache on Azure. Its pricing tier is Standard 2.5 GB. My question is: how can I see the current memory usage of the cache? In other words, how much cache storage is left for future use? I have tried to find this on the dashboard, but was unable to find it.
You can configure Redis Cache diagnostics to get this information. Please refer to How to monitor Azure Redis Cache - Available metrics and reporting intervals for more details. From this link, one of the available metrics is Used Memory, which I believe is what you're looking for:
Used Memory: The amount of cache memory used for key/value pairs in the cache, in MB, during the specified reporting interval. This value maps to used_memory from the Redis INFO command. It does not include metadata or fragmentation.
I have not used Redis Cache personally, but if my memory serves me right, you can also find this information by executing Redis commands through the Redis Console available in the portal. For more information, please see this link: https://azure.microsoft.com/en-in/documentation/articles/cache-configure/#redis-console.
Run the INFO memory command in the Redis Console and look for the used_memory_human parameter in the output.
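For example, the output looks something like this (the values are made up; the field names are what Redis reports):

INFO memory
# Memory
used_memory:31850832
used_memory_human:30.38M
used_memory_rss:35168256
used_memory_peak:33554432
used_memory_peak_human:32.00M
mem_fragmentation_ratio:1.10

used_memory_human is the current data size, so comparing it against the 2.5 GB of your tier gives a rough idea of the headroom left.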

Redis cache in Azure was cleared unexpectedly

Recently, on January 3rd, we observed some interesting behavior with Redis Cache in Azure. It happened just once, and I'm trying to make sense of it.
We got an alert that CPU went above 80% on the Redis Cache service. Looking closely, we discovered that used memory had dropped from the typical 100MB to almost 0. Then it was quickly repopulated back to normal, I assume by normal usage of the application. While it was being repopulated, there was this CPU spike.
It looked as if the cache had been reset. However, this is a production environment that very few people have access to, and we are 100% sure that nobody reset it. There were no deployments around that time, and I couldn't find anything in the diagnostic logs.
Questions:
1. Any ideas what could happen?
2. Where can I look, what to look for?
Update: We are on standard (C1) tier
No customers reported any problems; I just hate it when I don't understand what is going on.
It depends on which cache tier you are using.
The basic tier only has one node with the cache data stored in memory. Any loss of memory in that node will cause the cache data to be lost.
If you are using the Standard tier then there are 2 nodes, a primary and a secondary, with cached data being asynchronously replicated from primary to secondary. If the primary is offline then client requests are sent to the secondary. In this scenario the chance of cache data loss is low, since it basically requires both nodes to be offline at the same time, which should only happen during hardware failure (Azure ensures that routine maintenance such as OS updates is not performed on both nodes at the same time).
If you are using the premium tier then the cache data is backed by persistent storage so you should not experience cache data loss.
https://azure.microsoft.com/en-us/documentation/articles/cache-faq/#what-redis-cache-offering-and-size-should-i-use has some more information about this.

How to share Azure Redis Cache between environments?

We want to save a few bucks and share our 1GB dedicated Azure Redis Cache between Development, Test, QA and maybe even production.
Is there a better way than prefixing all keys with an environment string like "Dev_[key]", "Test_[key]", etc.?
We are using the StackExchange Redis client for .NET.
PS: We tried using the cheap 250MB (shared infrastructure) offering, but had very slow performance. Read operations were consistently between 600-800ms... without any load (for a ~300KB object). Upgrading to the dedicated 1GB service changed that to 30-40ms. See more here: StackExchange.Redis with Azure Redis is unusably slow or throws timeout errors
One approach is to use multiple Redis databases. I'm assuming this is available in your environment :)
Some advantages over prefixing your keys might be:
data is kept separate; you can FLUSHDB in Test and not touch the production data
keys are smaller and consume less memory
The main disadvantage would be not taking advantage of multiple cores, as you could if you ran multiple instances of Redis on the same server. Obviously that is not an issue in this case. Also note that this feature is not deprecated, as one of the answers suggests.
Another thing I've seen people complain about is that databases are numbered, they don't have meaningful names. Some people create a hash in database 0 that maps each number to a name.
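A quick illustration with redis-cli (the database numbers and key name are just examples; StackExchange.Redis exposes the same thing through ConnectionMultiplexer.GetDatabase(n)):

redis-cli -n 1 set user:42 "dev value"      # database 1 = Dev
redis-cli -n 2 set user:42 "test value"     # database 2 = Test
redis-cli -n 2 flushdb                      # clears only Test; Dev in database 1 is untouched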
Here is another idea to save some bucks: use separate Redis cache machines for each environment - no problems with keys, but stop them when you don't use them, such as on weekends and during the night. You are probably not using them more than 50% of the time. I think it would be easy to start and stop them with a PowerShell script; we are using AWS and it is possible there.
From what I see, Redis persistence in Azure is not enabled yet, but they have started working on it (http://feedback.azure.com/forums/169382-cache/status/191763) - it would be nice to take an RDB snapshot before stopping and load it again on start. So if you need to save some values and reload them on start, you should do it manually (with your own service).

Azure Redis Cache data loss?

I have a Node.js application that receives data via a Websocket connection and pushes each message to an Azure Redis cache. It stores a persistent array of messages in a variable for downstream use, and at regular intervals syncs that array from the cache. It's a bit convoluted, but at a later point I want to separate the half of the application that writes to the cache from the half that reads from it.
At around 02:00 GMT, based on the Azure portal stats, I appear to have started getting "cache misses" on that sync, which last for a couple of hours before I started getting "cache hits" again sometime around 05:00.
The cache misses correspond to a sudden increase in CPU usage, which peaks at around 05:00. And when I say peaks, I mean it hits 81%, vs a previous max of about 6%.
So sometime around 05:00, the CPU peaks, then drops back to normal, and the "cache misses" go away. But looking at the cache memory usage, I drop from about 37.4MB used to about 3.85MB used (which I suspect is the "empty" state), and the list that was being used by this application was emptied.
The only commands the application runs against the cache are LPUSH and LRANGE; there is nothing with any capability to remove data. And in case anybody was wondering, when the CPU ramped up the memory usage did not, so there is nothing to suggest that rogue additions of data cropped up.
It's only on the Basic plan, so I'm not expecting it to be invulnerable or anything, but even without the replication features of the Standard plan I had expected that it wouldn't be in a position to completely wipe itself - I was under the impression that Redis periodically writes itself to disk and restores from that when it recovers from an error.
All of which is my way of asking:
Does anybody have any idea what might have happened here?
If this is something that others have been able to accidentally trigger themselves, are there any gotchas I should be looking out for that I might have in other applications using the same cache that could have caused it to fail so catastrophically?
I would welcome a chorus of people telling me that the Standard plan won't suffer from this sort of issue, because I've already forked out for it and it would be nice to feel like that was the right call.
Many thanks in advance.
Here are my thoughts:
Azure Redis Cache stores information in memory. By default, it won't save a "backup" to disk, so you had information in memory, the server got restarted for some reason, and you lost your data.
PS: See this feedback item; there is no option to persist information to disk with Azure Redis Cache yet: http://feedback.azure.com/forums/169382-cache/suggestions/6022838-redis-cache-should-also-support-persistence
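For comparison, the periodic write-to-disk you were thinking of is the RDB snapshotting a self-hosted Redis gets from the save lines in the stock redis.conf - and that is exactly what the Azure Basic/Standard tiers don't do. Roughly (paths vary by install):

# Snapshot if at least 1 key changed in 900s, 10 keys in 300s, or 10000 keys in 60s
save 900 1
save 300 10
save 60 10000
# Where the RDB file ends up (directory is distro-dependent)
dbfilename dump.rdb
dir /var/lib/redis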
Make sure you don't use the Basic plan. The Basic plan doesn't provide an SLA, and in my experience it loses data quite often.
The Standard plan provides an SLA and uses 2 instances of Redis Cache. It's quite stable and it hasn't lost our data, although such a case is still possible.
Now, if you're going to use Azure Redis as a database rather than as a cache, you need to use the data persistence feature, which is already available in the Azure Redis Cache Premium tier: https://azure.microsoft.com/en-us/documentation/articles/cache-premium-tier-intro (see Redis data persistence).
James, using the Standard instance should give you much improved availability.
With the Basic tier, any Azure Fabric update to the master node (or a hardware failure) will cause you to lose all data.
Azure Redis Cache does not support persistence (writing to disk/blob) yet, even in the Standard tier. But the Standard tier does give you a replicated slave node that can take over if your master goes down.

Meteor Node Process CPU Usage Nears 100%

I'm having trouble with my Meteor app when it gets to its peak amount of traffic (peak for this is nothing, 1k visits, maybe 2,500 pageviews in a day). CPU usage spikes and never recovers, so I've taken to using Nodetime to monitor usage and I've been reloading the process (forever restart) to get things back to normal.
I'm fairly new to profiling, so finding the underlying cause has me at a loss for where to start. I'm fairly certain it has to do with my app's server code, but the profiling seems to point to the Fibers module as a "hotspot" which I understand aids in making my server code synchronous.
Below is a snippet from the profiling results. I hope someone can guide me in the right direction in troubleshooting this!
While I don't have a specific answer to your question, I have experience dealing with CPU issues for our production Meteor app, so I can give you a list of things to investigate.
Upgrade to the latest version of meteor and the appropriate node version (see the changelog). As of this writing that's meteor 0.8.2 and node 0.10.28.
Read this and this article. The latter makes a great point that you really should always try to delay activation of subscriptions until you need them. In particular you may not need to publish anything for users who are not logged in. In my experience, meteor CPU problems have everything to do with subscriptions.
Be careful with observe and observeChanges. These are expensive and are easy to abuse. In particular:
Make sure you are calling stop() on your handles when they are no longer needed (consider using a package like publish-with-relations so this is done for you) - see the sketch after this list.
Fetch only the collections and fields that you absolutely need. Observe works by continually diffing objects (requires lots of CPU). The fewer and smaller objects you have, the less there is to compute.
Consider using smart-collections, though note that it is being retired now that oplog support is in core. Use oplog tailing - this can make for a night and day difference in performance and CPU usage in your app.
Consider making some things not reactive (also mentioned in the articles above). For us that was a big win. We had one extremely expensive join that was used on two frequently accessed pages on the site. When it got to the point where the CPU was pegged at 100% about every 30 minutes I gave up on reactivity for that element and just did the join on the server and shipped the data to the client via a method call. I also created a server-side expiring cache for these results and stored them by user (special thanks to Matt DeBergalis for this suggestion).
Do a preventative nightly restart. I have a cron job that tells forever to restart our app once a day in the middle of the night. That brings the CPU down from ~10% to 1%. This seems like black magic, but the fact that the CPU usage changes after a reset leads me to believe this is a good idea.
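As promised above, here is a minimal sketch of the observe clean-up point (the collection and publication names are hypothetical):

// Publish a trimmed-down cursor and make sure the observer dies with the subscription.
Meteor.publish('recentPosts', function () {
  var self = this;
  // Only the fields the client actually needs - smaller objects mean cheaper diffs.
  var handle = Posts.find({}, { fields: { title: 1, createdAt: 1 }, limit: 20 }).observeChanges({
    added: function (id, fields) { self.added('posts', id, fields); },
    changed: function (id, fields) { self.changed('posts', id, fields); },
    removed: function (id) { self.removed('posts', id); }
  });
  self.ready();
  // Without this the observer keeps running (and burning CPU) after the client unsubscribes.
  self.onStop(function () { handle.stop(); });
});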
Updated thoughts (1/13/14)
We migrated to oplog tailing as soon as it was available (meteor 0.7) and that made a big difference. Note that in order to get access to the oplog, you'll probably need to either host your own db or run a dedicated instance on the hosting provider of your choice. I'd also recommend adding the facts package to actually tell if its working.
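Hooking the app up to the oplog is mostly a matter of environment variables; a minimal sketch with placeholder hosts and credentials (the oplog user needs read access to the local database):

export MONGO_URL="mongodb://app_user:app_pass@db.example.com:27017/myapp"
export MONGO_OPLOG_URL="mongodb://oplog_reader:oplog_pass@db.example.com:27017/local?authSource=admin"
meteor    # or start your bundled node app with the same variables set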
There was a memory leak discovered in publish-with-relations, and as of this writing the atmosphere version (v0.1.5) hasn't been bumped to reflect these changes. If you are using it in production, I strongly recommend checking out the HEAD version and running it locally.
We stopped doing nightly restarts a couple of weeks ago. So far everything has been fine (fingers crossed).
Updated thoughts (7/2/14)
A few months ago we switched over to using an Elastic Deployment on mongohq. It's affordable, the performance has been great, and they even have a blog post which tells you how to enable oplog tailing.
I'd strongly recommend checking out kadira to help diagnose performance issues in your app. Also check out the academy articles which have a number of good tips in them.
I'm also having this problem. Actually, there is an issue with 0.6.6.1; I ran meteor --release 0.6.6 and the CPU is back to normal now.

Resources