This requires a little context, so bear with me.
Suppose you're building a chat app atop CouchDB that functions like IRC (or Slack). There's a server and some clients. But in this case, the server and the clients each have a CouchDB database and they're all bidirectionally replicating to each other -- the clients to the server, and the server to the other clients (hub-and-spoke style). Clients send messages by writing to their local instance, which then replicates to the server and out to the other clients.
Is there any way (validation function?) to prevent hostile clients from inserting a billion records and replicating those changes up to the server and other clients? Or is it a rule that you just can't give untrusted clients write access to a CouchDB instance that replicates anywhere else?
Related:
couchdb validation based on content from existing documents
Can I query views from a couchdb update or validate_doc_update function?
Can local documents be disabled in CouchDB?
For a rather simple defense against flooding, I am using the following workflow:
All public write access is allowed only through update functions.
Every document insert/update gets a unique hash generated from the req.peer field (the client's IP address) and an ISO timestamp with the final part cut off. For example, I may have 2017-11-24T14:14 as the key string, which ensures that a new unique key is generated every minute.
Calculate the hash for every write request and ensure it is unique; a given IP is then allowed to write only once every minute.
This technique works OK for small floods coming from a given set of IPs. For a more coordinated attack, a variation (or even something else completely) might be needed.
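The workflow above can be sketched roughly as follows. This is a minimal Python illustration rather than an actual CouchDB update function: `rate_limit_key` and `MinuteRateLimiter` are hypothetical names, SHA-256 is an assumed choice of hash, and the in-memory set stands in for whatever uniqueness check the database performs on the stored key.

```python
import hashlib
from datetime import datetime, timezone


def rate_limit_key(peer_ip: str, when: datetime) -> str:
    """Hash the client IP together with an ISO timestamp cut off at the minute."""
    minute = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M")
    return hashlib.sha256(f"{peer_ip}|{minute}".encode("utf-8")).hexdigest()


class MinuteRateLimiter:
    """Hypothetical in-memory stand-in for the uniqueness check on stored keys."""

    def __init__(self) -> None:
        self._seen = set()  # keys of writes that were already accepted

    def allow(self, peer_ip: str, when: datetime) -> bool:
        key = rate_limit_key(peer_ip, when)
        if key in self._seen:
            return False  # this IP already wrote during this minute
        self._seen.add(key)
        return True
```

Two writes from the same IP within the same minute produce the same key, so the second one is rejected; a write in the next minute, or from another IP, produces a different key and passes.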
Is it possible to use change streams extensively? I want to watch many collections with many documents, using various parameters. The idea is to let multiple users watch the data they are interested in. So not only to show a few real-time updates on, e.g., some stock data from a single collection, but to allow a modern web application to be real-time. I've stumbled upon some discussions, e.g. this one, which suggest that the feature is not usable for such a purpose.
So imagine implementing a commonly known social network. Each user would want live data on (1) notifications, (2) online friends, (3) friend requests, (4) news feed, (5) comments on news feed posts (maybe one for each post?). This makes at least 5 open change streams per user. If the service had, e.g., 10000 connected users, that makes 50000 active change streams.
Is this mechanism ready for such load? If I understood the discussion (and some others), every change stream watcher creates one connection. Would it be okay to have tens of thousands of connections? It does not seem like a good design. It seems like it would be better to watch each collection and do the filtering on an application server, but that is more of a database server's job.
Is there a way to handle such load with MongoDB?
Each change stream will require a connection to the server. Assuming your 10000 active users are going to do things like log in, post things, read things, comment on other people's things, manage friend lists, etc., you may actually need more like 10 connections per user.
Each change stream is essentially an aggregation that maintains a cursor over the operations log. That should work fairly well as long as the server is sufficiently sized to handle:
100,000 simultaneous connections
state for 50,000 long running cursors
10s of thousands of queries per second for those change streams
whatever query rate the other non-changestream reads and writes will need
On MongoDB Atlas you would need at least an M140 instance just to handle that number of connections, with a price tag in the neighborhood of $10K per month.
At that price point, it would probably be more cost effective to design a pub/sub notification service that uses a total of 5 change streams to watch for the different types of changes, and deliver those to users with a push mechanism rather than having every user poll the database directly.
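The fan-out side of such a pub/sub service might look something like the sketch below. It is only an in-memory illustration with no MongoDB driver calls: `ChangeFanout` is a hypothetical name, and the assumption that every change event carries a `user_id` field identifying its recipient is mine, not something the change stream guarantees. In a real service, each of the five shared watchers would call `publish` for every change it reads.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Handler = Callable[[Dict[str, Any]], None]


class ChangeFanout:
    """Routes events from a few shared watchers to per-user subscribers."""

    def __init__(self) -> None:
        # (topic, user_id) -> handlers that push matching events to that user
        self._subscribers: Dict[Tuple[str, str], List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, user_id: str, handler: Handler) -> None:
        self._subscribers[(topic, user_id)].append(handler)

    def publish(self, topic: str, event: Dict[str, Any]) -> int:
        """Deliver one change event; returns how many handlers received it."""
        # Assumption: the event names its recipient in a "user_id" field.
        handlers = self._subscribers.get((topic, event["user_id"]), [])
        for handler in handlers:
            handler(event)
        return len(handlers)
```

With this shape the database sees one watcher per change type no matter how many users are connected, and the per-user filtering happens in the application tier.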
I have functionality where a user posts data containing a few user IDs and some data related to those user IDs, and I save it into a PostgreSQL database. I want to save the returned user IDs in some object.
I just want to check whether a userid is present in this object and only then call the database. This check happens very frequently, so I cannot hit the db every time just to check whether there is any data for that userid.
The problem is that I have multiple nodejs instances running on different servers, so how can I have a common object?
I know I can use redis/riak for storing key-value pairs on a server, but I don't want to increase complexity/learning just for a single case. (I have never used redis/riak before.)
Any suggestions?
If your data is in different node.js processes on different servers, then the ONLY option is to use networking to communicate across servers with some common server to get the value. There are lots of different ways to do that.
Put the value in a database and always read the value from the common database
Designate one of your node.js instances as the master and have all the other node.js instances ask the master for the value anytime they need it
Synchronize the value to each node.js process using networking so each node.js instance always has a current value in its own process
Use a shared file system (kind of like a poor man's database)
Since you already have a database, you probably want to just store it in the database you already have and query it from there rather than introduce another data store with redis just for this one use. If possible, you can have each process cache the value over some interval of time to improve performance for frequent requests.
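A per-process cache of that kind could be sketched as below, written in Python for brevity even though the question is about node.js. `TTLCache` is a hypothetical name, `loader` stands in for the real database query, and the injectable `clock` is only there to make the behavior easy to test.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Remembers each lookup result for ttl_seconds before asking the database again."""

    def __init__(self, loader: Callable[[str], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader  # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bool]] = {}  # key -> (fetched_at, value)

    def get(self, key: str) -> bool:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no database round trip
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

The trade-off is staleness: each process may serve an answer up to `ttl_seconds` old, so the interval should be no longer than the application can tolerate.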
Let's say that when a user logs into a webapp, he sees a list of information.
Let's say that list of information is served by one of two dynos (via heroku), but that the list of information originates from a single mongo database (i.e., the nodejs dynos are just passing the mongo information to a user when he logs into the webapp).
Question: Suppose I want to make it possible for a user to both modify and add to that list of information.
At a scale of 1,000-10,000 users, is the following strategy suitable:
User modifies/adds to data; HTTP POST sent to one of the two nodejs dynos with the updated data.
Dyno (whichever one it may be) takes modification/addition of data and makes a direct query into the mongo database to update the data.
Dyno sends confirmation back to the client that the update was successful.
Is this OK? Would I have to likely add more dynos (heroku)? I'm basically worried that if a bunch of users are trying to access a single database at once, it will be slow, or I'm somehow risking corrupting the entire database at the 1,000-10,000 person scale. Is this fear reasonable?
Short answer: Yes, it's a reasonable fear. Longer answer: it depends.
MongoDB will queue the requests and handle them in the order it receives them. Depending on how much of the data is being served from memory, it may or may not be fast enough.
NodeJS has the same design pattern: it will queue requests it can't process yet and execute them when resources become available.
The only way to tell if performance is being hindered is by monitoring it, and seeing if resources consistently hit a threshold you're uncomfortable with passing. On the upside, during your discovery phase your clients will probably only notice a few milliseconds of delay.
The proper way to implement that is to spin up a new instance as the resources get consumed to handle the traffic.
Your database likely won't corrupt, but if your data is important (and why would you collect it if it isn't?), you should be creating a replica set. I would probably go with a replica set of data before I go with a second instance of node.
I want to achieve this:
I have a Couchbase instance with buckets and documents. As soon as the TTL of a key or document is about to expire, the Couchbase server makes a call (POST request) to another server with the key and its data, and that server saves it in another Couchbase instance.
So there are two questions:
1) How can I configure Couchbase to make a POST request to another server with the key and the data it contains?
2) Is there a better way in Couchbase to achieve this? I mean, so that I don't have to build a REST API for Couchbase to send data to, and it can somehow save data to another server by itself, just through configuration?
The simple answer to your question is that this is not possible.
First, Couchbase doesn't evict things from the data set the instant they expire. Rather, it has a background process that trims the expired items out periodically, or expired items are removed when they are accessed, whichever occurs first.
Next, I am not sure it makes sense to have data expire if you want to keep it. Couchbase offers an efficient disk-storage mechanism. Keep in mind that only the most frequently-accessed data is kept in RAM, should the data size exceed the RAM capacity; furthermore, on node startup, data is loaded in order of most frequent/recent to less frequent/older.
If your data must be stored in two separate databases, it is up to your application logic to make that happen when saving the data.
I'm building a REST web service that receives a request and must return "Ok" if the operation was done correctly. How could I deal with the possibility of losing the connection while returning this "Ok" message?
For example, a system like Amazon SimpleDB.
1) It receives a request.
2) Processes the request (stores and replicates the content).
3) Returns a confirmation message.
If the connection is lost between phases 2 and 3, the client thinks the operation was not successful and submits it again.
Thanks!
A system I reviewed earlier this year had a process similar to this. The solution they implemented was to have the client reply to the commit message, and clear a flag on the record at that point. There was a periodic process that checked every N minutes, and if an entry existed that was completed, but that the client hadn't acknowledged, that transaction was rolled back. This allowed a client to repost the transaction, but not have 2 'real' records committed on the server side.
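A rough sketch of that acknowledge-or-roll-back scheme follows. I don't know how the reviewed system was actually built, so every name here (`PendingAckStore`, `commit`, `acknowledge`, `sweep`) is hypothetical, and the in-memory dict stands in for the flagged database records.

```python
from typing import Dict, List


class PendingAckStore:
    """Keeps committed entries until the client acknowledges them or they time out."""

    def __init__(self, ack_timeout: float) -> None:
        self._timeout = ack_timeout
        self._entries: Dict[str, dict] = {}

    def commit(self, txn_id: str, payload: str, now: float) -> None:
        # The record is stored with its "awaiting acknowledgement" flag set.
        self._entries[txn_id] = {"payload": payload, "at": now, "acked": False}

    def acknowledge(self, txn_id: str) -> bool:
        entry = self._entries.get(txn_id)
        if entry is None:
            return False  # unknown, or already rolled back
        entry["acked"] = True  # the client's reply clears the flag
        return True

    def sweep(self, now: float) -> List[str]:
        """Periodic job: roll back entries the client never acknowledged in time."""
        stale = [tid for tid, e in self._entries.items()
                 if not e["acked"] and now - e["at"] >= self._timeout]
        for tid in stale:
            del self._entries[tid]
        return stale

    def exists(self, txn_id: str) -> bool:
        return txn_id in self._entries
```

The sweep is what lets a client safely repost: an unacknowledged entry disappears after the timeout, so the retry does not leave two real records behind.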
In the event of the timeout scenario, you could do the following:
Send a client generated unique id with the initial request in a header.
If the client doesn't get a response, then it can resend the request with the same id.
The server can keep a list of ids successfully processed and return an OK, rather than repeating the action.
The only issue with this is that the server will need to eventually remove the client ids. So there would need to be a time window for the server to keep the ids before purging them.
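The steps above, including the purge window, could be sketched like this. It is a minimal in-memory illustration: `IdempotentProcessor` is a hypothetical name, `action` stands in for the real store-and-replicate work, and a production server would keep the ids in shared, durable storage rather than in a dict.

```python
import time
from typing import Any, Callable, Dict, Tuple


class IdempotentProcessor:
    """Runs each client-supplied request id at most once within a time window."""

    def __init__(self, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window_seconds
        self._clock = clock
        self._done: Dict[str, Tuple[float, Any]] = {}  # id -> (finished_at, result)

    def handle(self, request_id: str, action: Callable[[], Any]) -> Any:
        self.purge()
        if request_id in self._done:
            # Retry of a request that already succeeded: answer without redoing it.
            return self._done[request_id][1]
        result = action()
        self._done[request_id] = (self._clock(), result)
        return result

    def purge(self) -> None:
        """Forget ids older than the window so the list cannot grow forever."""
        cutoff = self._clock() - self._window
        for rid in [r for r, (ts, _) in self._done.items() if ts < cutoff]:
            del self._done[rid]
```

The window has to be longer than the longest delay after which a client might still retry; once an id has been purged, a late retry would be treated as a new request.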
Depends on the type of web service. The whole nature of HTTP and REST is that it's basically stateless.
E.g., in the SimpleDB case, suppose you're simply requesting a value for a given key. If the client connection is dropped while the value is being returned, the client can simply re-request the data later. That data is likely to have been cached by the db engine or the operating system disk cache anyway.
If you're storing or updating a value and the data is identical then quite often the database engines know the data hasn't changed and so the update won't take very long at all.
Even complex queries can run quicker the second time on some database engines.
In short, I wouldn't worry about it unless you can prove there is a performance problem. In which case, start caching the results of some recent queries yourself. Some REST based frameworks will do this for you. I suspect you won't even find it to be an issue in practice though.