What are the effects of rolling updates in TiDB? - tidb

What are the effects of rolling updates in TiDB? Will they impact a production environment?
When the TiKV leaders transfer to other TiKV nodes, how much time does it take?
I have tried to find this in the PingCAP documentation, but I can't find the answer.

When you apply rolling updates to the TiDB services, running applications are not affected. If the Pump or Drainer service is part of the cluster, it is recommended to stop Drainer before the rolling update; when you upgrade TiDB, Pump is also upgraded. The TiKV leaders transfer to other TiKV nodes before each TiKV binary is rolled, and this process does not take long, typically a few seconds.
For more details, please check the doc: https://docs.pingcap.com/tidb/stable/upgrade-tidb-using-tiup
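If you want to watch the leader transfer yourself, one rough way is to poll PD's HTTP API for per-store leader counts while the rolling upgrade runs. The sketch below assumes PD's standard /pd/api/v1/stores endpoint and a placeholder PD address; adjust both to your deployment:

    # Sketch: poll PD for per-store leader counts during a rolling upgrade.
    # PD_ADDR is a placeholder; the /pd/api/v1/stores endpoint is PD's
    # standard stores listing, which reports a leader_count per store.
    import json
    import time
    import urllib.request

    PD_ADDR = "http://127.0.0.1:2379"

    def leader_counts():
        with urllib.request.urlopen(f"{PD_ADDR}/pd/api/v1/stores") as resp:
            data = json.load(resp)
        return {s["store"]["address"]: s["status"].get("leader_count", 0)
                for s in data["stores"]}

    if __name__ == "__main__":
        # During the upgrade you should see one store's leader count drain
        # towards 0 shortly before its TiKV binary restarts, then recover.
        for _ in range(60):
            print(leader_counts())
            time.sleep(5)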

Related

What is the maintenanceWindow attribute in the Azure Redis Cache patch schedule configuration?

The documentation doesn't explain what the maintenanceWindow is. Since a node switch happens during patching, why does the length of this window matter at all? Could it affect how long the Redis cache is unavailable?
What is the maintenance window?
When scheduling updates for Azure Redis Cache, you can choose the day of the week, the start hour (UTC), and the maintenance window.
Setting a maintenance window allows you to minimize the impact on your application and users.
Only Redis server updates are made during the scheduled maintenance window. The maintenance window does not apply to Azure updates or updates to the VM operating system.
What is the impact of setting a shorter or longer window?
The default, and minimum, maintenance window for updates is five hours. The actual time required for maintenance depends on exactly what's taking place.
You could always set it to a longer timespan, but we recommend selecting a timespan that would have the least impact on your business.
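For reference, the schedule (day of week, start hour in UTC, and the maintenance window) is stored as a patch schedule on the cache resource. The sketch below goes through the ARM REST API; the subscription, resource group, cache name, token and API version are placeholders, and you should verify the exact payload shape against the current Azure REST documentation before relying on it:

    # Sketch: set a weekly patch schedule (Saturday, 02:00 UTC, 5-hour window)
    # on an Azure Redis Cache through the ARM REST API. All identifiers,
    # the bearer token and the API version below are placeholders.
    import requests

    SUBSCRIPTION = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    CACHE_NAME = "<cache-name>"
    API_VERSION = "<current-Microsoft.Cache-api-version>"
    TOKEN = "<aad-bearer-token>"

    url = (f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
           f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.Cache"
           f"/redis/{CACHE_NAME}/patchSchedules/default"
           f"?api-version={API_VERSION}")

    body = {"properties": {"scheduleEntries": [{
        "dayOfWeek": "Saturday",
        "startHourUtc": 2,
        # ISO 8601 duration; five hours is the default and minimum window.
        "maintenanceWindow": "PT5H",
    }]}}

    resp = requests.put(url, json=body,
                        headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    print(resp.json())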

Distributed Lock Manager with Azure SQL database

We have a Web API using an Azure SQL database. The database model has Customers and Managers. Customers can add appointments. We can't allow overlapping appointments from 2 or more Customers for the same Manager. Because we are working in a distributed environment (multiple instances of the web server can insert records into the database at the same time), there is a possibility that invalid appointments will be saved. As an example, Customer 1 wants an appointment between 10:00 and 10:30, and Customer 2 wants an appointment between 10:15 and 10:45. If both requests arrive at the same time, the validation code in the Web API will not catch the error. That's why we need something like a distributed lock manager. We read about Redlock from Redis and about ZooKeeper. My question is: is Redlock or ZooKeeper a good choice for our use case, or is there some better solution?
If we used Redlock, we would go with Azure Redis Cache because we already use Azure Cloud to host our Web API. We plan to identify the shared resource (the resource we want to lock) by ManagerId + Date. This would result in a lock per Manager per date, so other locks for the same Manager on other dates would still be possible. We plan to use one instance of Azure Redis Cache; is this safe enough?
Q1: Is Redlock or ZooKeeper a good choice for our use case, or is there some better solution?
I consider Redlock not the best choice for your use case because:
a) its guarantees hold only for a specific amount of time (the TTL) set before performing the DB operation. If for some reason (ask your DevOps for surprising examples, and also check How to do distributed locking) the DB operation takes longer than the TTL, you lose the guarantee of lock validity (see lock validity time in the official documentation). You could use a large TTL (minutes), or you could try to extend its validity from another thread that monitors the duration of the DB operation, but this gets complicated quickly. With ZooKeeper (ZK), on the other hand, your lock is there until you remove it or the process dies. It can happen that your DB operation hangs, which would leave the lock hanging too, but this kind of problem is easily spotted by DevOps tools, which will kill the hanging process and thereby free the ZK lock (there is also the option of a monitoring process that does this faster and in a way more specific to your business).
b) while trying to lock, the processes must "fight" to win the lock; "fighting" means waiting and then retrying to acquire it. This can cause the retry count to overflow, resulting in a failure to get the lock. This seems to me a less important issue, but with ZK the behaviour is far better: there is no "fight"; all processes line up and wait their turn to get the lock (check the ZK lock recipe).
c) Redlock is based on time measurements, which is incredibly tricky; read at least the paragraph containing "feeling smug" in How to do distributed locking (and the Conclusion paragraph too), then think again about how large that TTL value should be for you to be confident in RedLock's time-based locking.
For these reasons I consider RedLock a risky solution, while ZooKeeper is a good solution for your use case. I don't know of a better distributed-locking solution for your case, but other distributed locking solutions do exist; for example, check Apache ZooKeeper vs. etcd3.
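To make the ZK option concrete, here is a minimal sketch using the kazoo client's lock recipe, keyed by ManagerId + Date as you planned. The ZooKeeper hosts, the node path layout and the overlaps_existing/save_appointment helpers are placeholders for your own code:

    # Sketch: serialize appointment writes per Manager per date with a
    # ZooKeeper lock (kazoo's lock recipe). Hosts, paths and the two
    # helper functions are placeholders, not part of any real API.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    def book_appointment(manager_id, date, start, end):
        lock_path = f"/locks/appointments/{manager_id}/{date}"
        lock = zk.Lock(lock_path, identifier="web-api-instance-1")
        # Contenders queue up in order; there is no retry "fight" as with Redlock.
        with lock:
            # Re-run the overlap validation while holding the lock, then insert.
            if overlaps_existing(manager_id, date, start, end):
                raise ValueError("overlapping appointment")
            save_appointment(manager_id, date, start, end)

If a web server instance dies while holding the lock, the ephemeral node behind it disappears and the lock is released automatically, which is the behaviour described in a) above.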
Q2: We plan to use one instance of Azure Redis Cache; is this safe enough?
It could be safe for your use case because the TTL seems predictable (if we really trust the time measurement; see the warning below), but only if the slave taking over a failed master can be delayed (I'm not sure that is possible; you should check the Redis configuration capabilities). If you lose the master before a lock has been synchronized to the slave, then another process could simply acquire the same lock. Redlock recommends using delayed restarts (check Performance, crash-recovery and fsync in the official documentation) with a period of at least 1 TTL. If, for the reasons in Q1 a) and c), your TTL is very long, then your system may be unable to lock for an unacceptably long period (because the only Redis master you have must be replaced by the slave in a delayed fashion).
PS: I stress again: read Martin Kleppmann's opinion on Redlock, where you'll find surprising reasons for a DB operation to be delayed (search for "before reaching the storage service") as well as good reasons not to rely on time measurements when locking (and an interesting argument against using Redlock in general).
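If you do stay with a single Azure Redis Cache instance, the usual single-node locking pattern is an atomic SET with NX and an expiry plus a random token, keyed by ManagerId + Date as you planned. A minimal sketch with redis-py follows; the host, key naming and TTL are assumptions you would adapt, and all the TTL caveats from Q1 a) and c) still apply:

    # Sketch: single-instance Redis lock (SET NX PX plus token check on release).
    # Host, key layout and TTL are placeholders; pick a TTL comfortably above
    # your worst-case DB operation time, per the caveats in Q1 a) and c).
    import uuid
    import redis

    r = redis.Redis(host="<your-cache>.redis.cache.windows.net", port=6380,
                    password="<access-key>", ssl=True)

    RELEASE_SCRIPT = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    else
        return 0
    end
    """

    def acquire(manager_id, date, ttl_ms=30000):
        key = f"lock:appointments:{manager_id}:{date}"
        token = str(uuid.uuid4())
        # NX: only set if the key does not exist; PX: auto-expire after ttl_ms.
        if r.set(key, token, nx=True, px=ttl_ms):
            return key, token
        return None

    def release(key, token):
        # Only delete the lock if we still own it (token matches).
        r.eval(RELEASE_SCRIPT, 1, key, token)

The Lua script makes the release conditional on still owning the token, so an expired lock that another process has since acquired is never deleted by mistake.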

Redis cache in Azure was cleared unexpectedly

Recently, on January 3rd, we observed some interesting behavior with Redis Cache in Azure. It happened just once, and I'm trying to make sense of it.
We got an alert that CPU went above 80% on the Redis Cache service. Looking closer, we discovered that used memory had dropped from the typical 100MB to almost 0. It was then quickly repopulated back to normal, I assume by normal application usage. While it was being repopulated, there was this CPU spike.
It looked as if the cache had been reset. However, this is a production environment with very limited access, and we are 100% sure that nobody reset it. There were no deployments around that time, and I couldn't find anything in the diagnostic logs.
Questions:
1. Any ideas what could have happened?
2. Where can I look, and what should I look for?
Update: we are on the Standard (C1) tier.
No customers reported any problems; I just hate it when I don't understand what is going on.
It depends on which cache tier you are using.
The Basic tier only has one node, with the cache data stored in memory. Any loss of memory in that node will cause the cache data to be lost.
If you are using the Standard tier, then there are 2 nodes, a primary and a secondary, with cached data being asynchronously replicated from primary to secondary. If the primary is offline, client requests are sent to the secondary. In this scenario the chance of cache data loss is low, since it basically requires both nodes to be offline at the same time, which should only happen during hardware failures (Azure ensures that routine maintenance such as OS updates is not done on both nodes at the same time).
If you are using the Premium tier, then the cache data is backed by persistent storage, so you should not experience cache data loss.
https://azure.microsoft.com/en-us/documentation/articles/cache-faq/#what-redis-cache-offering-and-size-should-i-use has some more information about this.

Azure Redis Cache data loss?

I have a Node.js application that receives data via a WebSocket connection and pushes each message to an Azure Redis cache. It stores a persistent array of messages in a variable for downstream use and, at regular intervals, syncs that array from the cache. It's a bit convoluted, but at a later point I want to separate the half of the application that writes to the cache from the half that reads from it.
At around 02:00 GMT, based on the Azure portal stats, I appear to have started getting "cache misses" on that sync, which last for a couple of hours before I started getting "cache hits" again sometime around 05:00.
The cache misses correspond to a sudden increase in CPU usage, which peaks at around 05:00. And when I say peaks, I mean it hits 81%, vs a previous max of about 6%.
So sometime around 05:00, the CPU peaks, then drops back to normal, the "cache misses" go away, but looking at the cache memory usage, I drop from about 37.4mb used to about 3.85mb used (which I suspect is the "empty" state), and the list that's being used by this application was emptied.
The only functions that the application runs against the cache are LPUSH and LRANGE; there is nothing with any capability to remove data. And in case anybody was wondering: when the CPU ramped up, memory usage did not, so there is nothing to suggest that rogue additions of data cropped up.
It's only on the Basic plan, so I'm not expecting it to be invulnerable or anything, but even without the replication features of the Standard plan I had expected that it wouldn't be in a position to completely wipe itself - I was under the impression that Redis periodically writes itself to disk and restores from that when it recovers from an error.
All of which is my way of asking:
Does anybody have any idea what might have happened here?
If this is something that others have been able to accidentally trigger themselves, are there any gotchas I should be looking out for that I might have in other applications using the same cache that could have caused it to fail so catastrophically?
I would welcome a chorus of people telling me that the Standard plan won't suffer from this sort of issue, because I've already forked out for it and it would be nice to feel like that was the right call.
Many thanks in advance.
Here are my thoughts:
Azure Redis Cache stores information in memory. By default, it won't save a "backup" to disk, so you had information in memory, the server got restarted for some reason, and you lost your data.
PS: see this feedback item; there is no option yet to persist information to disk with Azure Redis Cache: http://feedback.azure.com/forums/169382-cache/suggestions/6022838-redis-cache-should-also-support-persistence
Make sure you don't use the Basic plan. The Basic plan doesn't provide an SLA, and in my experience it loses data quite often.
The Standard plan provides an SLA and utilizes 2 instances of Redis Cache. It's quite stable and it hasn't lost our data, although such a case is still possible.
Now, if you're going to use Azure Redis as a database rather than as a cache, you need to use the data persistence feature, which is already available in the Azure Redis Cache Premium tier: https://azure.microsoft.com/en-us/documentation/articles/cache-premium-tier-intro (see Redis data persistence)
James, using the Standard instance should give you much improved availability.
With the Basic tier, any Azure Fabric update to the master node (or a hardware failure) will cause you to lose all data.
Azure Redis Cache does not support persistence (writing to disk/blob) yet, even in the Standard tier. But the Standard tier does give you a replicated slave node that can take over if your master goes down.

CouchDB: Re-running a completed replication progressing slowly

I wrote a service which does the following on startup:
Boots up CouchDB
Makes a pull _replicate (+continuous) request to CouchDB
Monitors _active_tasks for the 'progress' to reach 100 before considering itself as ready
However, the database I'm dealing with is fairly large, so the replication task takes a very long time to reach 100, even though I only turned the database off recently and it had a continuous replication task running before that, so it should be almost entirely up to date. That is, the incremental replication should be quick.
Why could it be taking so long given that it's already almost up to date, and is there anything I can do to either speed it up OR allow my service to consider itself ready before "progress" reaches 100? The latter seems unlikely, as I do want it to be fully up to date.
Thanks :)
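For reference, the startup flow described in the question boils down to roughly the following; this is only a sketch assuming CouchDB's standard /_replicate and /_active_tasks endpoints, with placeholder URLs, credentials and database names:

    # Sketch of the described startup flow: start a continuous pull
    # replication, then poll _active_tasks until every replication task
    # reports progress >= 100. Server URL and database names are placeholders.
    import time
    import requests

    COUCH = "http://admin:password@127.0.0.1:5984"

    def start_pull_replication():
        r = requests.post(f"{COUCH}/_replicate", json={
            "source": "https://remote.example.com/bigdb",
            "target": "bigdb",
            "continuous": True,
        })
        r.raise_for_status()

    def wait_until_caught_up(poll_seconds=10):
        while True:
            tasks = requests.get(f"{COUCH}/_active_tasks").json()
            repl = [t for t in tasks if t.get("type") == "replication"]
            if repl and all(t.get("progress", 0) >= 100 for t in repl):
                return
            time.sleep(poll_seconds)

    start_pull_replication()
    wait_until_caught_up()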
