Increase speed of CouchDB's _changes feed while applying a filter

I'm having trouble with poor performance on CouchDB's _changes feed when there are multiple observers.
I have CouchDB running inside a virtual machine on a laptop, and multiple iOS clients are consuming _changes?feed=continuous on one of the databases over the network, using CouchDB's HTTP API. As the number of clients increases, the speed at which the changes come through is slowed to a crawl.
N.B. I'm actually communicating with CouchDB via an Apache reverse proxy, which is compressing the responses.
And I'm also noticing that, when applying a filter to the feed, it will often go for long periods without delivering any changes to the HTTP stream, almost as if I'm waiting for it to check a batch of documents that don't meet my filter.
Are there any settings I can enable or optimisations I can make that will help speed this all up?

The increase of latency with the number of consumers of a filtered _changes feed is no surprise when you realize that, for each change, CouchDB has to ask the query server to evaluate the filter() function. Apparently it doesn't cache the results, so it has to perform this operation for each consumer.
Something you could try is dropping the filter parameter and using include_docs=true instead. This way the feed producer wouldn't have to ask the view server to evaluate the changes, which should make it more responsive. Of course, this comes at the price of significantly increasing the amount of data transferred in the feed, and you have to duplicate the filter() function logic on the client side. It's not ideal, but I think it's worth a shot.
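As a rough illustration of that approach, here is a minimal sketch in TypeScript for Node.js; the CouchDB address, database name, and the isRelevant() predicate are placeholders standing in for your own filter() logic. It consumes the continuous feed with include_docs=true and filters each change on the client:
import * as http from "http";
import * as readline from "readline";

const COUCH_URL = "http://localhost:5984"; // assumed CouchDB address
const DB = "mydb";                         // assumed database name

// Re-implements the logic of the server-side filter() on the client.
function isRelevant(doc: any): boolean {
  return doc != null && doc.type === "message"; // hypothetical filter criterion
}

http.get(`${COUCH_URL}/${DB}/_changes?feed=continuous&include_docs=true`, (res) => {
  // The continuous feed emits one JSON object per line (plus heartbeat newlines).
  const lines = readline.createInterface({ input: res });
  lines.on("line", (line) => {
    if (!line.trim()) return; // skip heartbeats
    const change = JSON.parse(line);
    if (change.doc && isRelevant(change.doc)) {
      console.log("relevant change:", change.seq, change.id);
    }
  });
});
The trade-off mentioned above still applies: every change (with its full document) crosses the wire, but CouchDB never has to call out to the query server for this feed.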

Related

MongoDB ChangeStream performance

Is it possible to use change streams for extensive use? I want to watch many collections with many documents with various parameters. The idea is to allow multiple users to watch data they are interested in. So not only to show a few real-time updates on e.g. some stock data from a single collection or whatever, but to allow a modern web application to be real-time. I've stumbled upon some discussions, e.g. this one, which suggest that the feature is not usable for such a purpose.
So imagine implementing a commonly known social network. Each user would want to have live data on (1) notifications, (2) online friends, (3) friend requests, (4) news feed, (5) comments on news feed posts (maybe one for each post?). This makes at least 5 open change streams per user. If a service had, say, 10000 connected users, that makes 50000 active change streams.
Is this mechanism ready for such load? If I understood the discussion (and some others) correctly, every change stream watcher creates one connection. Would it be okay to have tens of thousands of connections? It does not seem like a good design. It seems like it'd be better to watch each collection and do the filtering on an application server, but that is more of a database server's job.
Is there a way to handle such a load with MongoDB?
Each change stream will require a connection to the server. Assuming your 10000 active users are going to do things like log in, post things, read things, comment on other people's things, manage friend lists, etc., you may actually need more like 10 connections per user.
Each change stream is essentially an aggregation that maintains a cursor over the operations log (oplog). That should work fairly well as long as the server is sufficiently sized to handle:
100,000 simultaneous connections
state for 50,000 long running cursors
10s of thousands of queries per second for those change streams
whatever query rate the other non-changestream reads and writes will need
On MongoDB Atlas you would need at least an M140 instance just to handle that number of connections, with a price tag in the neighborhood of $10K per month.
At that price point, it would probably be more cost effective to design a pub/sub notification service that uses a total of 5 change streams to watch for the different types of changes, and deliver those to users with a push mechanism rather than having every user poll the database directly.
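As a rough sketch of that fan-out design (TypeScript with the official mongodb driver; the connection string, database and collection names, and the EventEmitter standing in for a real push layer such as WebSockets are all assumptions):
import { MongoClient } from "mongodb";
import { EventEmitter } from "events";

const hub = new EventEmitter(); // stand-in for a real push layer (e.g. WebSockets)
hub.setMaxListeners(0);         // allow many subscribers per topic

async function main() {
  const client = await MongoClient.connect("mongodb://localhost:27017"); // assumed URI
  const db = client.db("social");                                        // assumed name

  // One change stream per collection, shared by every user, instead of
  // one stream per user per collection.
  const stream = db.collection("notifications").watch(
    [{ $match: { operationType: "insert" } }],
    { fullDocument: "updateLookup" }
  );

  stream.on("change", (change: any) => {
    const doc = change.fullDocument;
    // Fan the change out to only the user it concerns.
    hub.emit(`notifications:${doc.userId}`, doc);
  });
}

// Each connected user subscribes only to their own topic, e.g.:
// hub.on(`notifications:${userId}`, (doc) => ws.send(JSON.stringify(doc)));

main().catch(console.error);
With this shape, the database sees a handful of cursors regardless of how many users are online; the connection count problem moves to the push layer, which is designed for it.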

How to avoid the database being hit hard when the API gets bursts of calls?

I have an API which allows other microservices to call on to check whether a particular product exists in the inventory. The API takes in only one parameter which is the ID of the product.
The API is served through API Gateway in Lambda and it simply queries against a Postgres RDS to check for the product ID. If it finds the product, it returns the information about the product in the response. If it doesn't, it just returns an empty response. The SQL is basically this:
SELECT * FROM inventory where expired = false and product_id = request.productId;
However, the problem is that many services are calling this particular API very heavily to check the existence of products. Not only that, the calls often come in bursts. I assume those services loop through a list of product IDs and check for their existence individually, hence the burst.
The number of concurrent calls on the API has resulted in it making many queries to the database. The rate can burst beyond 30 queries per second and there can be a few hundred thousand requests to fulfil. The queries are mostly the same, except for the product ID in the where clause. The column has been indexed and it takes an average of only 5-8ms to complete. Still, the connection to the database occasionally times out when the rate gets too high.
I'm using Sequelize as my ORM and the error I get when it times out is SequelizeConnectionAcquireTimeoutError. There is a good chance that the burst rate was too high and maxed out the pool too.
Some options I have considered:
Using a cache layer. But I have noticed that, most of the time, 90% of the product IDs in the requests are not repeated. This would mean that 90% of the time it would be a cache miss and it would still query against the database.
Auto-scaling the database. But because the calls are bursty and I don't know when they may come, the autoscaling won't complete in time to avoid the timeouts. Moreover, the query is a very simple select statement and the CPU of the RDS instance hardly crosses 80% during the bursts, so I doubt scaling would do much either.
What other techniques can I use to keep the database from being hit hard when the API is getting burst calls that are mostly unique and difficult to cache?
Use a cache loaded at boot time
You can load all necessary columns into an in-memory data store (Redis). Every update in the database (applied via a cron job) would be reflected in the cached data.
Problems: memory usage and the overhead of keeping the cache up to date
Limit DB calls
Create a buffer for IDs. Store n IDs and then make one query for all of them, or empty the buffer every m seconds (see the sketch after these options).
Problems: increased client response time, plus an extra step to split the batched query result back out per caller
Change your database
Use a NoSQL database for this data. According to this article and this one, I think choosing a NoSQL database is a better idea.
Problems: multiple data stores
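A rough sketch of the "buffer the IDs and make one query" idea from the second option (TypeScript with node-postgres; the inventory table and its columns come from the question, while the batch size, flush interval, and connection settings are placeholders):
import { Pool } from "pg";

const pool = new Pool();   // assumed: Postgres connection settings come from the environment
const BATCH_SIZE = 100;    // placeholder tuning values
const FLUSH_MS = 20;

type Waiter = { id: string; resolve: (row: any | null) => void };
let buffer: Waiter[] = [];
let timer: NodeJS.Timeout | null = null;

// Callers await a single product, but the database sees one query per batch.
export function findProduct(id: string): Promise<any | null> {
  return new Promise((resolve) => {
    buffer.push({ id, resolve });
    if (buffer.length >= BATCH_SIZE) void flush();
    else if (!timer) timer = setTimeout(flush, FLUSH_MS);
  });
}

async function flush() {
  if (timer) { clearTimeout(timer); timer = null; }
  const batch = buffer;
  buffer = [];
  if (batch.length === 0) return;

  const ids = [...new Set(batch.map((w) => w.id))];
  // One round trip answers every buffered request.
  const { rows } = await pool.query(
    "SELECT * FROM inventory WHERE expired = false AND product_id = ANY($1)",
    [ids]
  );
  const byId = new Map<string, any>();
  for (const r of rows) byId.set(r.product_id, r);
  for (const w of batch) w.resolve(byId.get(w.id) ?? null);
}
The extra few milliseconds of latency per call is the cost; in exchange, a burst of hundreds of lookups collapses into a handful of queries.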
Start with a covering index to handle your query. You might create an index like this for your table:
CREATE INDEX inv_lkup ON inventory (product_id, expired) INCLUDE (col, col, col);
Mention all the columns in your SELECT in the index, either in the main list of indexed columns or in the INCLUDE clause. Then the DBMS can satisfy your query completely from the index. It's faster.
You could start using AWS Lambda throttling to handle this problem. But for that to work, the consumers of your API will need to retry when they get 429 responses. That might be super-inconvenient.
Sorry to say, you may need to stop using lambda. Ordinary web servers have good stuff in them to manage burst workload.
They have an incoming connection (TCP/IP listen) queue. Each new request coming in lands in that queue, where it waits until the server software accepts the connection. When the server is busy, requests wait in that queue; under high load they simply wait a bit longer. In nodejs's case, if you use clustering there's just one of these incoming connection queues, and all the processes in the cluster use it.
The server software you run (to handle your API) has a pool of connections to your DBMS. That pool has a maximum number of connections in it. As your server software handles each request, it awaits a connection from the pool. If no connection is immediately available, the request handling pauses until one is, then proceeds. This too smooths out the requests to the DBMS. (Be aware that each process in a nodejs cluster has its own pool.)
Paradoxically, a smaller DBMS connection pool can improve overall performance, by avoiding too many concurrent SELECTs (or other queries) on the DBMS.
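For the Sequelize setup from the question, that pool is configurable; a minimal sketch (the DATABASE_URL environment variable and the numbers are placeholders to tune for your instance):
import { Sequelize } from "sequelize";

const sequelize = new Sequelize(process.env.DATABASE_URL as string, {
  dialect: "postgres",
  pool: {
    max: 10,        // deliberately small cap per process so bursts queue inside the app
    min: 0,
    acquire: 30000, // how long a request waits for a connection before
                    // SequelizeConnectionAcquireTimeoutError is thrown
    idle: 10000,
  },
});
A larger acquire timeout plus a modest max lets bursts queue in your process rather than piling concurrent queries onto the DBMS.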
This kind of server configuration can be scaled out: a load balancer will do. So will a server with more cores and more nodejs cluster processes. An elastic load balancer can also add new server VMs when necessary.

MongoDB + NodeJS: MapReduce or manual calculation

I am creating a REST API in NodeJS that connects to MongoDB, does a MapReduce, and stores the results in a different collection.
The code is pretty simple. It takes a User ID, gets all other users who are related to this user somehow using some algorithm, and then for each one, calculates a likeness percentage. Assuming there are 50k users in the test database, this MapReduce takes around 200-800ms. And that is ideal for me. If this were to get famous and have hundreds of concurrent requests like this, I'm pretty sure that will not be the case any more. I understand that MongoDB might need to be sharded as needed.
The other scenario is to just do a normal find(), loop over the cursor and do the same logic. It takes the same amount of time as MapReduce, mind you. However, I just thought about this to try and put the heavy lifting of the calculations on the client side (NodeJS) and not on the server side like MapReduce. Does this idea even have merit? I thought that this way, I can scale APIs horizontally behind a load balancer or something.
It would be better to keep the heavy lifting off of the server that processes each request and put it onto the database.
If you have 1000 requests and 200 of them require the calculation, the other 800 requests can be processed as normal by the server, so long as Mongo does the calculation with mapReduce or aggregation.
If you instead run the calculations manually on your node server, all requests will be affected by the server having to do the heavy lifting.
Mongo is quite efficient at aggregation for sure, and I would imagine at mapReduce as well.
I recently moved a ton of logic from my server onto mongoDB where I could and it made a world of difference.
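For illustration, here is a hedged sketch of keeping that work in the database with an aggregation pipeline (the modern alternative to mapReduce) that writes its output to a separate collection via $merge. The connection string, collection names, and the interest-overlap scoring are assumptions, since the original likeness algorithm isn't shown:
import { MongoClient } from "mongodb";

async function computeLikeness(userId: string) {
  const client = await MongoClient.connect("mongodb://localhost:27017"); // assumed URI
  const users = client.db("app").collection("users");                    // assumed names

  const me = await users.findOne({ _id: userId as any });
  const myInterests: string[] = (me?.interests as string[]) ?? [];

  await users
    .aggregate([
      { $match: { _id: { $ne: userId } } },
      {
        // Hypothetical score: share of each user's interests that overlap mine.
        $project: {
          relatedTo: { $literal: userId },
          likeness: {
            $cond: [
              { $gt: [{ $size: { $ifNull: ["$interests", []] } }, 0] },
              {
                $divide: [
                  { $size: { $setIntersection: [{ $ifNull: ["$interests", []] }, myInterests] } },
                  { $size: { $ifNull: ["$interests", []] } },
                ],
              },
              0,
            ],
          },
        },
      },
      // Store the results in a different collection instead of returning them.
      { $merge: { into: "likeness_results", whenMatched: "replace" } },
    ])
    .toArray(); // iterating the cursor is what actually runs the $merge

  await client.close();
}
The NodeJS process only issues the command; the scan, scoring, and write all stay inside MongoDB, so the API tier stays free to serve other requests.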

Is this MEAN stack design-pattern suitable at the 1,000-10,000 user scale?

Let's say that when a user logs into a webapp, he sees a list of information.
Let's say that list of information is served by one of two dynos (via heroku), but that the list of information originates from a single mongo database (i.e., the nodejs dynos are just passing the mongo information to a user when he logs into the webapp).
Question: Suppose I want to make it possible for a user to both modify and add to that list of information.
At a scale of 1,000-10,000 users, is the following strategy suitable:
User modifies/adds to data; HTTP POST sent to one of the two nodejs dynos with the updated data.
Dyno (whichever one it may be) takes modification/addition of data and makes a direct query into the mongo database to update the data.
Dyno sends confirmation back to the client that the update was successful.
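Concretely, the write path in those three steps would look something like this rough sketch (Express plus the official mongodb driver; the route, database, collection, and field names are just placeholders):
import express from "express";
import { MongoClient, ObjectId } from "mongodb";

const app = express();
app.use(express.json());

const client = new MongoClient(process.env.MONGODB_URI as string); // assumed env var
const items = client.db("app").collection("items");                // assumed names

// 1. User modifies/adds to the list; the POST lands on whichever dyno the router picks.
app.post("/items/:id", async (req, res) => {
  // 2. That dyno writes the change straight to Mongo.
  await items.updateOne(
    { _id: new ObjectId(req.params.id) },
    { $set: { text: req.body.text } },
    { upsert: true }
  );
  // 3. The dyno confirms the update to the client.
  res.json({ ok: true });
});

client.connect().then(() => app.listen(Number(process.env.PORT) || 3000));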
Is this OK? Would I likely have to add more dynos (Heroku)? I'm basically worried that if a bunch of users are trying to access a single database at once, it will be slow, or that I'm somehow risking corrupting the entire database at the 1,000-10,000 person scale. Is this fear reasonable?
Short answer: Yes, it's a reasonable fear. Longer answer: it depends.
MongoDB will queue incoming requests and handle them in the order it receives them. Depending on how much of the data is being served from memory, it may or may not be fast enough.
NodeJS follows the same design pattern: it queues requests it can't process immediately and executes them when resources become available.
The only way to tell if performance is being hindered is by monitoring it, and seeing if resources consistently hit a threshold you're uncomfortable with passing. On the upside, during your discovery phase your clients will probably only notice a few milliseconds of delay.
The proper way to implement that is to spin up a new instance as the resources get consumed to handle the traffic.
Your database likely won't corrupt, but if your data is important (and why would you collect it if it isn't?), you should be creating a replica set. I would probably go with a replica set of data before I go with a second instance of node.
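If you do go the replica set route, the application-side change is mostly the connection string; a minimal sketch with placeholder hosts, database, and set name:
import { MongoClient } from "mongodb";

// Hosts, database, and replica set name below are placeholders.
const client = new MongoClient(
  "mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017/app" +
  "?replicaSet=rs0&w=majority"
);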

Instagram real-time API POST rate

I'm building an application using tag subscriptions in the real-time API and have a question related to capacity planning. We may have a large number of users posting to a subscribed hashtag at once, so the question is how often will the API actually POST to our subscription processing endpoint? E.g., if 100 users post to #testhashtag within a second or two, will I receive 100 POSTs or does the API batch those together as one update? A related question: is there a maximum rate at which POSTs can be sent (e.g., one per second or one per ten seconds, etc.)?
The Instagram API seems to lack detailed information about both how many updates are sent and what the rate limits are. From the API docs:
Limits
Be nice. If you're sending too many requests too quickly, we'll send back a 503 error code (server unavailable).
You are limited to 5000 requests per hour per access_token or client_id overall. Practically, this means you should (when possible) authenticate users so that limits are well outside the reach of a given user.
In other words, you'll need to check for a 503 and throttle your application accordingly. There's no information I've seen on how long they might block you, but it's best to avoid that completely. I would advise you to manage this by placing a rate-limiting mechanism in your own code, such as pushing your API requests through a queue with rate control. That will also give you the benefit of a retry if you're throttled, so you won't lose any of the updates.
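A minimal sketch of that queue-with-rate-control idea (TypeScript, using the global fetch available in modern Node; the interval, back-off delay, and bare endpoint URL are placeholders, not Instagram-specific values):
type Job = () => Promise<void>;

const queue: Job[] = [];
const INTERVAL_MS = 1000; // ~1 request per second; tune to your actual quota

// Drain the queue at a fixed rate so bursts never hit the API all at once.
setInterval(() => {
  const job = queue.shift();
  if (job) void job();
}, INTERVAL_MS);

export function enqueueFetch(url: string): Promise<any> {
  return new Promise((resolve, reject) => {
    queue.push(async () => {
      try {
        let res = await fetch(url);
        if (res.status === 503) {
          // Throttled: back off, then retry once so the update isn't lost.
          await new Promise((r) => setTimeout(r, 5000));
          res = await fetch(url);
        }
        resolve(await res.json());
      } catch (err) {
        reject(err);
      }
    });
  });
}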
Moreover, a mechanism such as a queue in the case of real-time updates is further relevant because of the following from the API docs:
You should build your system to accept multiple update objects per payload - though often there will be only one included. Also, you should acknowledge the POST within a 2 second timeout--if you need to do more processing of the received information, you can do so in an asynchronous task.
Regarding the number of updates, the API can send you 1 update or many. The problem with this is you can absolutely murder your API calls because I don't think you can batch calls to specific media items, at least not using the official python or ruby clients or API console as far as I have seen.
This means that if you receive 500 updates, either as 1 request to your server or split into many, it won't matter, because either way you need to go and fetch these items. From what I observed in a real application, these seemed to count against our quota; however, the quota itself seemed to be consumed erratically. That is, sometimes we saw no calls consumed at all, other times the available calls dropped by far more than we actually made. My advice is to be conservative and take the 5000 as a best guess rather than an absolute. You can check the remaining calls by parsing one of the headers they send back.
Use common sense, don't be stupid, and a rate-limiting mechanism should keep you safe, with the added benefit of handling failures due to outages (these happen more often than you may think), network hiccups, and accidental rate limiting. You could try to be tricky and use different API keys in a pooling mechanism, but this is likely a violation of the TOS, and if they are doing anything by IP, you'd have to split this up across different machines with different IPs.
My final advice would be to restructure your application so that it doesn't rely completely on the subscription mechanism. It's less than reliable and very expensive API-wise. It's only truly useful if you just need to do something in your app that doesn't require calling back to Instagram, your number of items is small, or you can filter out the majority of items to avoid calling back to Instagram except when a specific business rule is matched.
Instead, you can do things like query the tag or the user (e.g. recent media) and scale it out that way. Normally this allows you to grab 100 items with 1 request rather than 100 items with 100 requests. If you really want to be cute, you could at least merge the subscription notifications asynchronously and combine the similar ones into a single batched request, grouping duplicate characteristics such as the tag into a single bucket. Sort of like a map/reduce, but on a small data set. You could of course do an actual map/reduce on your own data from time to time as another way of keeping things async. Again, be careful not to thrash Instagram; just use map/reduce to batch out your calls in a way that's useful to your app.
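A small sketch of that merge-and-batch idea (the 10-second window, the shape of the notification payload, and fetchRecentForTag() are all hypothetical):
// Bucket incoming subscription notifications by tag instead of acting on each one.
const pendingTags = new Set<string>();

export function onSubscriptionUpdate(update: { object_id: string }) {
  pendingTags.add(update.object_id); // for tag subscriptions this identifies the watched tag
}

// Flush periodically: one recent-media request per unique tag, not per notification.
setInterval(() => {
  const tags = [...pendingTags];
  pendingTags.clear();
  for (const tag of tags) void fetchRecentForTag(tag);
}, 10000);

async function fetchRecentForTag(tag: string) {
  // Hypothetical helper: call the tag's recent-media endpoint once and process
  // up to ~100 items from that single response.
}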
Hope that helps.
