I'm currently considering CouchDB for a project (still in the research phase). What I want is many users, each with its own database (for authentication purposes), as well as one big read-only database to which they're all replicated, so I can generate reports and make a dashboard. For now let's assume all the databases are running on the same machine and in the same process.
Normally, only a small fraction of the human users will be online and doing things, so I should be able to keep the max_dbs_open setting a lot lower than the total number of users. CouchDB should have no problem determining which ones are idle.
However, I'm worried the extra database will throw a wrench into this. Will having continuous replication from all user databases into the one big database keep them all awake all the time? (And if so, is that bad? Will actually logged-in users lose any priority advantage and get thrashed at bad times?)
Or is CouchDB smart enough to consider a database idle as long as it's not being queried or written to, even if there's an ongoing continuous replication session? Note the replication would only be in one direction: the big database only watches for changes from the user databases.
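For concreteness, here's roughly how I picture wiring that up: just a sketch using Node 18+ (built-in fetch), with made-up database names (`userdb-alice`, `reports`) and credentials, not a final design.

```typescript
// Sketch only: database names, port, and credentials are made up.
// One replication document per user database, pulling changes one-way
// into the big read-only "reports" database.
const COUCH = "http://localhost:5984";
const AUTH =
  "Basic " + Buffer.from(process.env.COUCH_AUTH ?? "admin:secret").toString("base64");

async function addOneWayReplication(userDb: string): Promise<void> {
  const doc = {
    _id: `repl-${userDb}-to-reports`,
    source: `${COUCH}/${userDb}`, // the user database (never written to by this job)
    target: `${COUCH}/reports`,   // the one big combined database
    continuous: true,             // keep following the user db's _changes feed
  };
  const res = await fetch(`${COUCH}/_replicator/${doc._id}`, {
    method: "PUT",
    headers: { "Content-Type": "application/json", Authorization: AUTH },
    body: JSON.stringify(doc),
  });
  if (!res.ok) throw new Error(`replication doc rejected: ${res.status}`);
}

addOneWayReplication("userdb-alice").catch(console.error);
```

So the question is really what thousands of documents like that in `_replicator` do to database idleness.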
Suggestions for other ways to approach this problem are also welcome. I'm vaguely aware that if I'm going to have enough users for this to be a problem, that one database is going to be an unwieldy beast.
Thanks!
I've come across many clients who aren't really able to provide real production data about a website's peak usage. I often do not get peak pageviews per hour, etc.
In these circumstances, besides just guessing or going with what "feels right" (i.e. making it all up), how exactly does one come up with a realistic workload model with an appropriate # of virtual users and a good pacing value?
I use LoadRunner for my performance/load testing.
Ask for the logs for a month.
Find the stats for session duration, then count the number of distinct IPs in blocks of one session duration; this gives you active users over time and points you to the highest-volume hour.
Once you have the high volume hour, count the number of page instances. Business processes will typically have a termination page which is distinct and allows you to understand how many times a particular action takes place, such as request new password, update profile, business process 1, etc...
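If it helps, the arithmetic from there is roughly this: a sketch using Little's Law, with illustrative numbers rather than anything from your system.

```typescript
// Illustrative numbers only; substitute what you actually pulled from the logs.
const sessionsInPeakHour = 3000; // distinct IPs seen in the busiest hour
const avgSessionSeconds = 600;   // average session duration from the logs
const iterationSeconds = 540;    // time one scripted business process takes to run

// Little's Law: concurrency = arrival rate x time in system.
const arrivalsPerSecond = sessionsInPeakHour / 3600;                   // ~0.83/s
const virtualUsers = Math.ceil(arrivalsPerSecond * avgSessionSeconds); // ~500 VUsers

// Pacing: each virtual user starts a new iteration on a fixed interval so the
// hourly session count is reproduced, then waits out whatever time the
// iteration itself did not consume.
const startToStartSeconds = (3600 * virtualUsers) / sessionsInPeakHour;         // 600 s
const pacingDelaySeconds = Math.max(0, startToStartSeconds - iterationSeconds); // 60 s

console.log({ virtualUsers, startToStartSeconds, pacingDelaySeconds });
```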
With this you will have a measurement of users and actions. You will want your stakeholder to take ownership of this data. As quality assurance, we should not own both the requirement and the test against it; we should own one, but not both. If your client will not own the requirement and cascade it down to the rest of the organization, assume you will be left out in the cold with a result they do not like, i.e., defects that need to be addressed before deployment to production.
Now comes your largest challenge, which is something that needs to be addressed as a process issue by your client: you are about to test against requirements that no other part of the organization (architecture, development, platform engineering) had when they built the solution. Even if your recovered requirements are perfect, plus some amount for growth, any defects you find will be challenged aggressively.
Your test will not match any assumptions or requirements used by any other portion of the organization.
And, in a sense, these other orgs will be correct in aggressively challenging your results. It really isn't fair to hold their designed solution to a set of requirements which were not in place when they made decisions which impacted scalability and response times for the system. You would be wise to call this out with your clients before the first execution of any performance test.
You can buy yourself some time. If the client does have a demand for a particular response time, such as an adoption of the Google RAIL model, then you can implement a gate before accepting any code for multi-user performance testing: the code SHALL BE compliant for a single user. It is not going to get any faster for two or more users. Implementing this hard gate will solve about 80% of your performance issues, because the changes required to bring code into compliance for a single user most often have benefits on the multi-user front as well.
You can buy yourself some time in a second way as well. Take a look at their current site using tools such as Google Lighthouse and GTmetrix. Most of us are creatures of habit, and that includes architects, developers, and ops personnel. We design, build, and deploy to patterns we know and are comfortable with, usually the same ones over and over again until we are forced to make a change. It is highly likely that the performance antipatterns pulled from Lighthouse and GTmetrix will be carried forward into a future release unless they are called out for mitigation. Begin citing defects directly off of these tools before you even run a performance test. You will need management support, but you might consider not even accepting a build for multi-user performance testing until GTmetrix scores at least a B across the board and Lighthouse reports a score of 90 or better.
This should leave the edge cases for when you do get to multi-user performance testing, such as allocating a resource too early, holding onto resources too long, allocating too large a resource, hitting something too often, or lock contention on a shared resource. An architectural review might pick up on these, where someone might say, "we are pre-allocating this because...," or "Marketing says we need to hold the cart for 30 minutes before de-allocation," or "...." Well, you get the idea.
Don't forget to have the database profiler running while functional testing is going on. You are likely to pick up a few missing indexes or high cost queries here which should be addressed before multi-user performance testing as well.
You are probably wondering why I am pointing out all of these things before your performance test takes place. Darnit, you were hired to run a performance test! The test you are about to conduct is very high risk politically. Even if it finds something ugly, because the other parts of the organization did not benefit from the requirements, the result is likely to be rejected until the issue shows up in production. By shifting the focus to objective measures even before you need to run two users in anger together, there are many avenues to finding and fixing performance issues which are far less politically volatile. Food for thought.
Newer developer here. I'm creating a Node.js application with MongoDB. When do you write user inputs to the database? Is it immediately when they want to perform a CRUD action? Or do you wait until they end their session to update their changes (showing them a "fake" updated view in the meantime)? I would think writing to the database every time would be less than ideal, but I also wouldn't want to make the user think their changes were saved to the database, and then some error occurs where it didn't actually happen. How's this handled in the real world?
The user inputs should be written to the database as soon as the user wants to perform the CRUD operations.
If they are not, and you wait for the user to terminate their session, there may be other parts of the application that try to change the data that was supposed to be updated. Or you may want to take certain actions in your application based on the current user data from the database, but your database reflects older data, and your application may behave incorrectly.
One may argue that you can maintain the current state of your application, but in the case of backend code, the database should always be your single source of truth.
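As a rough illustration only (this assumes Express with the official MongoDB driver; the route, database, collection, and field names are made up), the handler awaits the write and only reports success once the database has acknowledged it:

```typescript
import express from "express";
import { MongoClient } from "mongodb";

// Hypothetical setup: adjust the connection string, database, and collection names.
const client = new MongoClient("mongodb://localhost:27017");
const app = express();
app.use(express.json());

app.post("/profile", async (req, res) => {
  try {
    // Write the user's input as soon as the request arrives...
    const result = await client
      .db("myapp")
      .collection("profiles")
      .insertOne({ ...req.body, updatedAt: new Date() });
    // ...and only claim success once MongoDB has acknowledged the write,
    // so the UI never says "saved" when it wasn't.
    res.status(201).json({ id: result.insertedId });
  } catch (err) {
    // If the write fails, the user finds out now, not at the end of the session.
    res.status(500).json({ error: "could not save your changes" });
  }
});

client.connect().then(() => app.listen(3000));
```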
This is what's known in the "real world" (as you referred to) as a design decision. It's not something for which there's anything even remotely resembling a rule-of-thumb or a hard-and-fast rule.
Instead, it's important to consider all possible factors relating to this design prior to committing to it:
User expectations - will the users of this application expect that their input is stored immediately? Only when they click the "Save" button? Or do they expect their input to be discarded unless they explicitly save it?
Data retention - are there requirements to retain user input prior to its formal submission? (This is useful in applications where partially entered work needs to survive, for example as a draft.)
Infrastructure - can the underlying infrastructure handle the increased workload? When this application is scaled, will the infrastructure demands exceed capacity?
Cost/benefit - will the addition of this feature trigger development/testing times that exceed acceptable levels for the benefit the feature provides?
These are just some of the considerations you might have. I'm sure with additional time most people could come up with at least ten more.
We're working on a big school project with twenty people. The case is a decentralized, anonymous chat platform, so we're not allowed to set up a central server. We were therefore looking into distributed databases and found Cassandra to be the best fit for our project.
This means that everybody who is running the application will also be a Cassandra node. This raises many concerns for me, mainly malicious nodes. If everybody runs a Cassandra node on their computer, how can we prevent them from manipulating, vandalizing, or even outright deleting data?
I was doing some research and I'm starting to conclude that Cassandra (and the other distributed databases I looked into) is meant for corporate solutions where the company owns, runs, and maintains the databases. This is not true in our case, because as soon as the application launches there won't be an "owner". Every user is equally part of the system.
I know one (or maybe the only) way to prevent malicious nodes in a decentralized/distributed system is to have nodes keep each other in check. I found no way to do this in Cassandra, hence my question: can we prevent data vandalism and malicious nodes from being a threat?
As you mentioned, the design of Cassandra assumes that you'll have control of all the nodes; once any third party has access to a copy of your data, you lose control of what they can do with it, much like anything posted on the internet.
One option to ensure that only "authorized" nodes join the cluster is to enforce SSL internode encryption, which can give you some control, but there are some caveats:
if a node goes rogue or is compromised after it was given access, it will be very difficult to kick it out.
a node that is using an expired certificate will be able to continue interacting with the cluster until the service gets restarted.
administration of SSL certificates adds another layer of complexity.
Regarding the statement "I know one (or maybe the only) way to prevent malicious nodes in a decentralized/distributed system is to have nodes keep each other in check": Cassandra already uses a gossip mechanism to keep each node in check with the others.
I have been reading up a lot about CouchDB (and PouchDB) and am still unsure what the best option would be for a project of mine.
I do have a possible way to solve the project in my head based on what I have read so far, but I am unsure about things like performance and would love to get some insights. Or perhaps there's a better place to ask this question? Please let me know if that's the case! (Already tried their IRC channel and the mailing list, but no answers there as of yet)
So the project is basically an 'offline-first' mobile application. The users are device installers. They get assigned a few locations and devices to install every day. They need to walk around buildings and update the data (e.g. device X has been installed at location Y, or property A of device B at location C has been changed to D, etc.).
Some more info about the basic data.
There are users, they are the device installers. They need to log into the app.
There are locations, all the places that the device installers need to visit.
There are devices, all the different devices that can be installed by the users.
There are todos, basically a planned installation for a specific user at a specific location for specific devices.
Of course I have tried to simplify the data, but this should contain the gist.
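To make that a bit more concrete, here is roughly how I picture the documents at the moment (all field names are provisional, nothing is final):

```typescript
// Provisional document shapes; every field name here is just my current guess.
interface User {
  _id: string;          // e.g. "user:alice"
  type: "user";
  name: string;
}

interface Location {
  _id: string;          // e.g. "location:building-7"
  type: "location";
  address: string;
}

interface Device {
  _id: string;          // e.g. "device:meter-1234"
  type: "device";
  model: string;
  installed: boolean;
  properties: Record<string, string>; // the properties installers can change on site
}

interface Todo {
  _id: string;          // e.g. "todo:2024-05-13:building-7"
  type: "todo";
  date: string;         // planned day (ISO date)
  locationId: string;   // -> Location._id
  deviceIds: string[];  // -> Device._id
  assignedTo: string[]; // -> User._id; an array because installers can share a todo
}
```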
Now, these are important characteristics of the application:
Users, locations and devices can be changed by an administrator (back-end software).
Todos can be planned by an administrator (back-end software).
App user (device installer) only sees his/her own todos/planning for today + 1 week ahead.
Multiple app users (device installers) might be assigned to the same location and/or todos, because for a big building there might be multiple installers at work.
Automatic synchronization between the data in each app in use and the global database.
Secure, it should only be possible for user X to request his/her own todos/planning.
Taking into account these characteristics I currently have the following in mind:
One global 'master' database containing all users, locations, devices, todos.
Filtered replication/sync using a selector object, which for every user replicates only the data that this specific user may access (see the sketch right after this list).
Ionic application using PouchDB, which does full/normal replication/sync with the user's own database.
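Here's the sketch I mean for the last two points, reusing the made-up assignedTo field from above; the server side would be one `_replicator` document per user, and the app side a normal PouchDB live sync. Names, URLs, and credentials are all placeholders.

```typescript
import PouchDB from "pouchdb";

// Server side (sketch): one replication document per user, filtering with a
// Mango selector (CouchDB 2.x+) so only that user's documents reach their
// personal database. This would be PUT into /_replicator with admin credentials.
// (In this sketch only todo-style docs carrying assignedTo would replicate.)
const masterToUserDb = {
  _id: "master-to-userdb-alice",
  source: "http://localhost:5984/master",
  target: "http://localhost:5984/userdb-alice",
  continuous: true,
  selector: { assignedTo: { $elemMatch: { $eq: "alice" } } },
};

// App side (Ionic + PouchDB): plain two-way live sync with the user's own database.
const local = new PouchDB("todos");
const remote = new PouchDB("https://example.com/couchdb/userdb-alice", {
  auth: { username: "alice", password: "app-password" },
});

local
  .sync(remote, { live: true, retry: true })
  .on("change", (info) => console.log("synced", info.direction))
  .on("error", (err) => console.error("sync error", err));
```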
Am I correct in assuming the following?
The user of the application using PouchDB will have full read access on his own user database which has been filtered server-side?
For updating data I can make use of validate_doc_update to check whether the user may or may not modify something? (See the sketch right after these questions.)
Any changes done on the PouchDB database will be replicated to the 'user' database?
These changes will then also be replicated from the 'user' database to the global 'master' database?
Any changes done on the global 'master' database will be replicated to the 'user' database, but only if required (only if there have been new/changed(/deleted) documents for this user)?
These changes will then also be replicated from the 'user' database to the PouchDB database for the mobile app?
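Regarding the validate_doc_update question above, this is the sort of design-document function I'm picturing on each 'user' database (again just a sketch, reusing the made-up assignedTo field):

```typescript
// Sketch of a validate_doc_update function. CouchDB stores it as a string in a
// design document and runs it on every write to the database it lives in.
const authDesignDoc = {
  _id: "_design/auth",
  validate_doc_update: function (newDoc: any, oldDoc: any, userCtx: any) {
    if (userCtx.roles.indexOf("_admin") !== -1) {
      return; // admins and server-side processes may write anything
    }
    // For deletions newDoc mostly carries _deleted, so fall back to the old doc.
    const doc = newDoc._deleted ? oldDoc : newDoc;
    if (!doc || !doc.assignedTo || doc.assignedTo.indexOf(userCtx.name) === -1) {
      throw { forbidden: "You may only modify todos assigned to you." };
    }
  }.toString(),
};
```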
If all this holds true, then it might be a good fit for this project. At least I think so? (Correct me if I'm wrong!) But I did read about some performance problems regarding filtered replication. Suppose there are hundreds of users (device installers); there aren't that many right now, but there might be in the future. Would it be a problem to have this filtered replication running for hundreds of 'user' databases? I did read about CouchDB 2.0 and 2.1 having a selector object to do filtered replication instead of the usual JavaScript filter function, which is supposed to be up to 10x faster. But my question is still: does this work well, even for hundreds (or even thousands) of 'filtered' databases? I don't know enough about the underlying algorithms and limitations, but I am wondering whether any change to the global 'master' database does or does not require expensive calculations to decide which 'filtered' databases to replicate to. And if it does... does it matter in practice?
Please, any advice would be welcome. I did also consider using other databases. My first approach would actually have been to use a relational database. But one of the required characteristics of this app is real-time synchronization. In the past I have been able to handle this myself using revision fields in an RDBMS and with a lot of code, but I would really prefer something as elegant as CouchDB/PouchDB for the synchronization. This is really an area that would save me a lot of headaches. Keeping this in mind, what are my options? Am I on a reasonable path, or could performance become an issue down the road?
Note that I have also thought about having separate databases for each user ('one database per user'), but I think it might not be the best fit for this project, because some todos might be assigned to multiple users, and when one user updates something for a todo, it must be updated for the other user as well.
Hopefully some CouchDB experts can shed some light on my questions. Much appreciated!
I understand there might be some debate but I am only interested in the facts and expertise of others.
We are currently in the process of organising a student conference.
The issue is that we offer several different events at the same time over the course of a week. The conference runs the whole day.
It has currently been operating on a first-come, first-served basis; however, this has led to dramatic problems in the past, namely the server crashing almost immediately as 1000+ students all try to get into the best events as quickly as they can.
Is anyone aware of a way to best handle this so that each user has the best chance of enrolling in the events they wish to attend, firstly without the server crashing and secondly with people registering for events which have a maximum capacity, all within a few minutes? Perhaps somehow staggering the registration process or something similar?
I'm aware this is a very broad question, however I'm not sure where to look to when trying to solve this problem...
Broad questions have equally broad answers. There are, broadly, two ways to handle it:
Write more performant code so that a single server can handle the load.
Optimize backend code; cache data; minimize DB queries; optimize DB queries; optimize third party calls; consider storing intermediate things in memory; make judicious use of transactions trading off consistency with performance if possible; partition DB.
Horizontally scale - deploy multiple servers. Put a load balancer in front of your multiple front end servers. Horizontally scale DB by introducing multiple read slaves.
There are no quick fixes. It all starts with analysis first - which parts of your code are taking most time and most resources and then systematically attacking them.
Some quick fixes are possible; e.g. search results may be cached. The cache might be stale, so there would be situations where the search page shows that there are seats available when in reality the event is full. You handle such cases at registration time. For caching web pages, use a caching proxy.
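For that registration-time check, the usual trick is to make the seat check and decrement a single atomic database operation, so a stale cache can never oversell an event. A rough sketch with the MongoDB driver (database, collection, and field names are hypothetical):

```typescript
import { MongoClient } from "mongodb";

// Hypothetical schema for an event with limited capacity.
interface EventDoc {
  _id: string;          // e.g. "workshop-3"
  name: string;
  seatsLeft: number;
  attendees: string[];
}

const client = new MongoClient("mongodb://localhost:27017");

async function register(eventId: string, studentId: string): Promise<boolean> {
  const events = client.db("conference").collection<EventDoc>("events");
  // Decrement only if a seat is still left, in one atomic update: there is no
  // separate read-then-write, so concurrent registrations cannot oversell.
  const result = await events.updateOne(
    { _id: eventId, seatsLeft: { $gt: 0 } },
    { $inc: { seatsLeft: -1 }, $push: { attendees: studentId } }
  );
  // Zero modified documents means the event was already full,
  // i.e. the cached search page was stale.
  return result.modifiedCount === 1;
}

async function main() {
  await client.connect();
  const ok = await register("workshop-3", "student-42");
  console.log(ok ? "registered" : "sorry, this event is full");
  await client.close();
}
main().catch(console.error);
```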