Design Question: Which is better? Many Databases or Many Collections? - ArangoDB

I've been using the ArangoDB Community version for some time and will soon be starting a new commercial project. As I ponder and work on the system and database design, I'm stuck on this performance question: should I create one database per customer or group customers into large shared collections instead?
To give a bit of background, the project objective is to have a bot that executes tasks given or set by the users. My concern is that each user's bot has to constantly read that user's collection data and execute or follow up on tasks every few seconds, hence the great attention to designing for performance and speed.
Furthermore, the deployment will be on cloud servers, hence the need to gauge memory usage, CPU, and disk space. Clustering is not required, as each user entity is totally separate, and I will do load balancing by directing a pool of users to their dedicated cloud server.
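To make the access pattern concrete, here is a minimal sketch of the per-user polling loop under the database-per-customer option, assuming the arangojs driver; the URL, database, collection, and attribute names are placeholders:

```typescript
import { Database, aql } from "arangojs";

// Connection to the server; one scoped Database handle per customer database.
const conn = new Database({ url: "http://localhost:8529" }); // placeholder URL

async function pollUserTasks(userDbName: string): Promise<void> {
  const userDb = conn.database(userDbName); // e.g. "user_123" under database-per-customer
  // Fetch the tasks that are due; "tasks", "status", and "dueAt" are assumed names.
  const cursor = await userDb.query(aql`
    FOR t IN tasks
      FILTER t.status == "pending" AND t.dueAt <= DATE_NOW()
      RETURN t
  `);
  for await (const task of cursor) {
    // execute or follow up on the task here
  }
}

// Under the shared-collections option, the same query would instead run against
// one big collection and add a FILTER t.userId == @userId clause.
setInterval(() => pollUserTasks("user_123").catch(console.error), 5000);
```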
Any advice will be greatly appreciated.
Thanks!

Related

Learning DDD and CQRS

I'm new to DDD and CQRS and I'm planning to build a simple application to improve my skills a bit.
What I'm planning to do is a simple Taxi Corp application.
Requirements:
Client orders a taxi.
Client can have only one order at a time.
Driver picks an order.
Driver can have only one order at a time.
Driver goes to client.
Client enters cab.
Course starts.
Course finishes.
Client is charged and driver is paid.
And so on.
I can see there can be three aggregates: Client, Order and Driver. I want to split them into separate microservices. Do you think it's a good idea, or should I start with one microservice?
I'm currently focused on ordering a taxi. First of all, I need to check that the client doesn't already have a course assigned; later on, I can create an order. After the order is created, I need to assign it to the client. As only one aggregate can be updated/created during a single request, I wonder how to do it correctly. I've read something about Process Managers and I think one would be very useful in this case. I even drew a schema of the communication. Can anyone tell me if my approach is correct and give me some tips on how to go further?
Process of creating an order
Do you think it's a good idea, or should I start with one microservice?
I refer you to the wisdom of John Gall
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.
Instead of worrying about microservices, give your attention to messages.
Someone said: "If you have more microservices than customers, you are doing it wrong".
And if you really follow the CQRS/ES approach, the resulting system is much easier to split apart than a traditional ORM monolith.
So focus on the domain first and start with a monolith.
If you start with a microservices design, even in a flawed way, you get better insight into the desired architecture, because problems in a microservices design show themselves very soon.
Client and driver are both users of the system and have some commonalities, so you can consider them one domain and one microservice for both.
Consider an order manager microservice to assign a client and a driver to a trip by their IDs. The order database may include a trips table with two ID keys, driver_id and client_id, and some columns for the different states. After finishing each trip you can remove it from the trips table and insert it into an archive table; alternatively, you can leave it there and partition the table daily to keep database performance high.
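For illustration, the trip record described above might look like the following sketch; the state names are assumptions, not a prescribed schema:

```typescript
// Lifecycle states for a trip; adjust to your real flow.
type TripState = "ordered" | "accepted" | "inProgress" | "finished";

interface Trip {
  tripId: string;
  driverId: string;  // ID owned by the driver/user service
  clientId: string;  // ID owned by the client/user service
  state: TripState;
  createdAt: Date;
  finishedAt?: Date; // set when the row moves to the archive table
}
```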
Consider an accounting microservice for keeping payments and transactions. It's OK to use NoSQL databases for the other microservices, but do use a SQL database for your transactions.
You may also need a microservice for reporting and dashboards; mirror the other databases into a new one for reporting.
You also need an API gateway to route requests to the microservices and to handle authentication.
Your process is a set of events. You will certainly expand the system later and may have some long-running tasks, so it is better to have a message broker and implement your flow as an event/task flow, using patterns like event sourcing.
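As a sketch of what such an event/task flow could look like (the event names and the bus interface are illustrative assumptions, not a prescribed design):

```typescript
// The ordering process expressed as a set of events on a message broker.
type TaxiEvent =
  | { type: "OrderPlaced"; orderId: string; clientId: string }
  | { type: "OrderAccepted"; orderId: string; driverId: string }
  | { type: "TripStarted"; orderId: string }
  | { type: "TripFinished"; orderId: string }
  | { type: "ClientCharged"; orderId: string; amount: number };

interface EventBus {
  publish(event: TaxiEvent): Promise<void>;
  subscribe<T extends TaxiEvent["type"]>(
    type: T,
    handler: (event: Extract<TaxiEvent, { type: T }>) => Promise<void>
  ): void;
}

// Each microservice subscribes to the events it cares about; for example,
// the accounting service charges the client when a trip finishes.
function wireAccounting(bus: EventBus): void {
  bus.subscribe("TripFinished", async (event) => {
    const amount = 0; // ...compute the fare here...
    await bus.publish({ type: "ClientCharged", orderId: event.orderId, amount });
  });
}
```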
I can see there can be three aggregates: Client, Order and Driver. I want to split them into separate microservices. Do you think it's a good idea, or should I start with one microservice?
They all belong to the same bounded context. A bounded context translates nicely to a microservice (see Eric Evans' video: https://www.infoq.com/news/2015/06/dddx-microservices-boundaries). But don't start by designing a microservice; you would be doing it in the wrong order. Design your bounded context first; then, if it makes sense, create a microservice around it using the hexagonal architecture.
After the order is created, I need to assign it to the client. As only one aggregate can be updated/created during a single request, I wonder how to do it correctly.
This is the perfect example of why you need to do it all in the same process.
But if you do want to go with multiple microservices, think about eventual consistency (https://en.wikipedia.org/wiki/Eventual_consistency) and create a message-driven architecture between your services. It might be too much work in my opinion, but for learning purposes it can be a good idea.
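As a rough sketch of what a process manager could look like under eventual consistency (the command names and interfaces are illustrative assumptions):

```typescript
// Each handler reacts to one event and issues the next command, so only one
// aggregate is touched per step; failures trigger a compensating command.
interface CommandSender {
  send(command: object): Promise<void>;
}

class OrderProcessManager {
  constructor(private readonly commands: CommandSender) {}

  // A PlaceOrder request came in: first reserve the client aggregate.
  async onPlaceOrderRequested(clientId: string): Promise<void> {
    await this.commands.send({ type: "ReserveClient", clientId }); // rejected if a course is already assigned
  }

  // The Client service confirmed the client is free: create the order.
  async onClientReserved(clientId: string): Promise<void> {
    await this.commands.send({ type: "CreateOrder", clientId });
  }

  // The Order aggregate was created: assign it back to the client.
  async onOrderCreated(orderId: string, clientId: string): Promise<void> {
    await this.commands.send({ type: "AssignOrderToClient", orderId, clientId });
  }

  // Any step failed: compensate by releasing the reservation.
  async onStepFailed(clientId: string): Promise<void> {
    await this.commands.send({ type: "ReleaseClient", clientId });
  }
}
```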

Choosing the ideal multi-tenancy architecture for an ASP.NET Core application

I am currently working on an application that will be hosted on Azure. As it does not make sense to have an instance of it running for each customer (you'll see why), it's going to be a multi-tenancy solution.
To be honest: I'm only starting to gather experience with web applications, so I apologize if the answer to my question is obvious.
Question: Which multi-tenancy concept will be most beneficial for my application, considering the following assumptions:
Many tenants (ideally hundreds or even more, we'll see...)
consisting of few user accounts per tenant (<5-10 in most cases, up to 200 for a handful of tenants)
dealing with mostly small amounts of data (<100 entries in <20 tables)
changes in data occurring a few times a day (approx. <50 changes per user per day)
The application needs to stay responsive (of course)
My thoughts:
Database-per-tenant: does not make sense, as each DB won't be utilized much, and is therefore not cost-effective at all
Table-per-tenant: could be a good solution; I guess this should scale pretty well?
Tenant column within the entities: could be a problem with scaling, right? Could it be better when using sharding on the tenant ID? (see the sketch after this list)
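A minimal sketch of the tenant-column idea, with sharding on the tenant ID; all names here are illustrative assumptions:

```typescript
// Every table carries a TenantId column, and every query is forced through a
// helper that scopes it to the current tenant.
interface TenantContext {
  tenantId: string; // resolved per request, e.g. from the auth token
}

function scopedQuery(ctx: TenantContext, baseSql: string) {
  return {
    sql: `${baseSql} WHERE TenantId = @tenantId`,
    params: { tenantId: ctx.tenantId },
  };
}

// For scaling, the tenant ID can also pick the physical shard, so hot
// tenants can later be moved to their own database.
function shardFor(tenantId: string, shardCount: number): number {
  let hash = 0;
  for (const ch of tenantId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % shardCount;
}
```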
I would really appreciate your help and some "shared experience" in order to choose the not-so-painful path.
A good summary of the different models can be found here:
https://www.linkedin.com/pulse/database-design-multi-tenant-applications-dharmendar-kumar/
Based on my experience on Azure I would recommend CosmosDB with the following options:
partitioned collections: if tenants are evenly distributed and have similar requirements
collection per tenant: if some tenants have scale or special requirements
a mix of the preceding two.
Cosmos DB has a lot of benefits, e.g. sharding, global distribution, performance, and freedom of consistency models, as well as good SQL support.
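A minimal sketch of the partitioned-collection option, assuming the @azure/cosmos SDK; the endpoint, key, and names are placeholders:

```typescript
import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({
  endpoint: "https://<account>.documents.azure.com", // placeholder
  key: "<key>",
});

async function setupAndQuery() {
  const { database } = await client.databases.createIfNotExists({ id: "app" });
  // One container shared by all tenants, partitioned on the tenant ID, so
  // each tenant's documents live in their own logical partition.
  const { container } = await database.containers.createIfNotExists({
    id: "tenant-data",
    partitionKey: { paths: ["/tenantId"] },
  });
  // Queries that filter on the partition key stay within a single partition.
  const { resources } = await container.items
    .query({
      query: "SELECT * FROM c WHERE c.tenantId = @tenant",
      parameters: [{ name: "@tenant", value: "tenant-42" }],
    })
    .fetchAll();
  return resources;
}
```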

What type of Azure resource should I use for hosting many databases?

I have a project where I need to host many databases (500 and up),
and I am trying to find the best option to manage everything, considering all the options available these days and the price.
In the past I would have a virtual server with SQL Server on it, and I would create the databases on my own, and that was all.
But today I host my current project on Azure: a simple web server, with SQL Server, with one database.
And I do not know which Azure resource to choose.
Is it SQL Data Warehouse? Or do I need to get a virtual machine?
Or any other option?
I read all the information I could find online, but it mostly confused me.
I hope someone can help me; I would like to learn from your experience.
Thank you in advance.
It all very much depends on the size and load of the databases. You have three options. First, you can get a VM yourself and run SQL Server there. You are pretty much in control of what is happening and you can host as many DBs as you want; however, you'll be in charge of backups, updates, and maintenance. On the other hand, this is pretty much a fixed price.
Another option is Azure SQL Database: you don't need to think much about backups, encryption, updates, and other boring stuff. You can have up to 5,000 DBs per logical server, and you can choose the size and performance tier of each database. However, that can be expensive, as you are charged per DB.
The third option is an Elastic Pool: basically a pile of DBs that share the same resources. It can be useful if you have a lot of small DBs with a small load, and at your scale it will work out cheaper than just paying per DB. However, it might not work well if the load on some DBs is very uneven: they can consume all the DTUs and starve the rest of your DBs of processing power.
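As a back-of-envelope illustration of that trade-off (all numbers below are made up, not real limits or prices):

```typescript
// 500 small tenant DBs sharing one pool versus a few hot DBs hogging it.
const dbCount = 500;
const avgDtuPerDb = 0.1;  // mostly idle tenant databases
const peakDtuPerDb = 40;  // one busy tenant at its peak
const poolEDtu = 100;     // capacity shared by every DB in the pool

// Pools are sized for aggregate average load, not the sum of per-DB peaks:
console.log(`steady aggregate demand: ${dbCount * avgDtuPerDb} eDTU of ${poolEDtu}`);

// The starvation case: a handful of hot DBs can consume the whole pool.
const hotDbs = 3;
if (hotDbs * peakDtuPerDb >= poolEDtu) {
  console.log("hot tenants can starve the remaining databases");
}
```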
So what to do is up to you, based on your conditions. Personally, I would not go with a VM (too much hassle). I would recommend considering, based on the DBs' load, a combination of an Elastic Pool and stand-alone DBs.

The _replicator database is not scalable or my design needs tweaking

I think it is important that I elaborate on where I am coming from so that you can understand my use case, please bear with me.
Background: I’m looking to migrate my app from CouchDB 1 to 2 and this migration is going to take a decent amount of work. I just want to double check that I’m not reinventing the wheel and make sure that there isn’t a better design to what I will elaborate on below, especially since CouchDB 2 appears to have some awesome new features.
Consider the following simplified use case for an app that allows students to submit quiz answers digitally. Each student should be able to submit her/his quiz answers, and the teacher should be able to view all the answers. This design needs to work with PouchDB, as PouchDB speaks directly to the DB; this saves us a lot of time, as otherwise an elaborate set of APIs would need to be written.
My chosen design consists of one database per student and one database per teacher, i.e. a database per user. Only the owner of a database can edit it, and this is enforced via CouchDB roles. When a student submits an answer, it is synced with her/his database via PouchDB. The answers are then replicated to the teacher's database. This in turn allows the students to quickly load their answers in the app and the teachers to load all the answers for all their students. Of course, there are views in the teacher databases that segment the answers by class, quiz, etc., so that a teacher doesn't have to load the answers for all their students at once. If we didn't have the teacher database, then a teacher would need access to all the students' databases and would have to sync with every one of them.
At first glance, the _replicator database appears to be the obvious way to replicate the data from the student databases to a single teacher database. The big gotcha is that when you use continuous replication, each replication consumes a file handle and a database connection, which means that you can very quickly starve the server of resources. For example, if we have say 10,000 students, then we need 10,000 concurrent file handles and database connections just for the replications. This is pretty crazy considering that it is unlikely that even 100 of these 10,000 students would be using the app simultaneously.
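For reference, continuous replications are created by inserting documents like the following into _replicator (a sketch; the URLs and naming scheme are placeholders). CouchDB keeps each such replication, and its file handle and connection, alive for as long as the document exists:

```typescript
// Document to be POSTed to /_replicator, one per student->teacher pair.
const replicationDoc = {
  _id: "replicate_student_123_to_teacher_45",
  source: "https://couch.example.com/student_123",
  target: "https://couch.example.com/teacher_45",
  continuous: true, // this is what holds the handle and connection open
};
```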
Instead, I developed a service that listens to the _db_updates feed and then only replicates a database when there is a change to that specific database. With this method, we only consume resources when there are changes, and as a result we end up with plenty of free file handles and database connections.
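A condensed sketch of that service, using CouchDB's HTTP API directly; the server URL, the student-to-teacher mapping, and the filtering logic are placeholders:

```typescript
const COUCH = "https://couch.example.com"; // placeholder server URL

// Maps a student database to its teacher database; a stand-in for whatever
// lookup the real app performs.
function teacherDbFor(studentDb: string): string {
  return "teacher_" + studentDb.replace("student_", "");
}

async function listenAndReplicate(): Promise<void> {
  // One continuous connection for database-level change events, instead of
  // one continuous replication per student database.
  const res = await fetch(`${COUCH}/_db_updates?feed=continuous&since=now`);
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    let newline;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (!line) continue;
      const event = JSON.parse(line); // e.g. { db_name: "student_123", type: "updated" }
      if (event.type !== "updated" || !event.db_name.startsWith("student_")) continue;
      // One-shot (non-continuous) replication: the file handle and connection
      // are released as soon as the replication completes.
      await fetch(`${COUCH}/_replicate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          source: `${COUCH}/${event.db_name}`,
          target: `${COUCH}/${teacherDbFor(event.db_name)}`,
        }),
      });
    }
  }
}
```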
I’ve briefly experimented with CouchDB 2 and it appears that the _replicator database is just as greedy with resources as it was in CouchDB 1.
Is this database-per-user design for both students and teachers the best solution or is there a better solution? If it is the best solution, is there a better way of replicating this data that doesn’t consume as many resources?
I've open sourced my solution, called Spiegel, which provides the missing piece: scalable CouchDB replication and change listening. Spiegel is currently being used in production with a db-per-user design and is efficiently handling the replication of over 10,000 databases for Quizster.

New Azure SQL Database services: how scalable are they, and what are DTUs?

The new Azure SQL Database services look good. However, I am trying to work out how scalable they really are.
So, for example, assume a 200 concurrent user system.
For Standard
Workgroup and cloud applications with "multiple" concurrent transactions
For Premium
Mission-critical, high transactional volume with "many" concurrent users
What does "Multiple" and "Many" mean?
Also Standard/S1 offers 15 DTUs while Standard/S2 offers 50 DTUs. What does this mean?
Going back to my 200 user example, what option should I be going for?
Azure SQL Database Link
Thanks
EDIT
Useful page on definitions
However, what is "max sessions"? Is this the number of concurrent connections?
There are some great MSDN articles on Azure SQL Database; this one in particular is a great starting point for DTUs: http://msdn.microsoft.com/en-us/library/azure/dn741336.aspx, and see also http://channel9.msdn.com/Series/Windows-Azure-Storage-SQL-Database-Tutorials/Scott-Klein-Video-02
In short, it's a way to understand the resources powering each performance level. One of the things we know from talking with Azure SQL Database customers is that they are a varied group. Some are most comfortable with the absolute details: cores, memory, IOPS. Others are after a much more summarized level of information. There is no one-size-fits-all. DTUs are meant for this latter group.
Regardless, one of the benefits of the cloud is that it's easy to start with one service tier and performance level and iterate. In Azure SQL Database specifically, you can change the performance level while your application is up. During the change there is typically less than a second of elapsed time during which DB connections are dropped. The internal workflow in our service for moving a DB between service tiers/performance levels follows the same pattern as the workflow for failing over nodes in our data centers, and node failovers happen all the time, independent of service tier changes. In other words, you shouldn't notice any difference in this regard relative to your past experience.
If DTUs aren't your thing, we also have a more detailed benchmark workload that may appeal: http://msdn.microsoft.com/en-us/library/azure/dn741327.aspx
Thanks Guy
It is really hard to tell without doing a test. By 200 users I assume you mean 200 people sitting at their computers at the same time doing stuff, not 200 users who log on twice a day. S2 allows about 49 transactions per second, which sounds about right, but you need to test. Also, doing a lot of caching can't hurt.
Check out the new Elastic DB offering (Preview) announced at Build today. The pricing page has been updated with Elastic DB price information.
DTUs are based on a blended measure of CPU, memory, reads, and writes. As DTUs increase, the power offered by the performance level increases. Azure has different limits on concurrent connections, memory, IO, and CPU usage. Which tier to pick really depends upon:
Number of concurrent users
Log rate
IO rate
CPU usage
Database size
For example, if you are designing a system where many users read and only a few write, and if your application's middle tier can cache the data as much as possible so that only selective queries or an application restart hit the database, then you may not need to worry too much about IO and CPU usage.
Concurrent users: if many users hit the database at the same time, you may hit the concurrent connection limit and requests will be throttled. If you can control the user requests coming to the database in your application, then this shouldn't be a problem.
Log rate: this depends upon the volume of data changes (including additional data pumped into the system). I have seen applications steadily pumping data in versus pumping it all at once. Selecting the right DTU again depends upon whether you can throttle at the application end and achieve a steady rate.
Database size: Basic, Standard, and Premium have different maximum allowed sizes, and this is another deciding factor. Using features like table compression helps reduce the total size, and hence the total IO.
Memory: tuning expensive queries (joins, sorts, etc.) and enabling lock escalation / nolock scans help control memory usage.
The very common mistake people make with database systems is scaling up the database instead of tuning the queries and application logic. So testing and monitoring the resources and queries under different DTU limits is the best way of dealing with this.
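A minimal monitoring sketch, assuming the Node "mssql" package and a placeholder connection string; sys.dm_db_resource_stats reports recent CPU, IO, and log utilization as percentages of the current tier's limits:

```typescript
import sql from "mssql";

async function resourceHeadroom(connectionString: string) {
  const pool = await sql.connect(connectionString);
  // Each row covers a 15-second window from roughly the last hour.
  const result = await pool.request().query(`
    SELECT TOP (30)
           end_time,
           avg_cpu_percent,
           avg_data_io_percent,
           avg_log_write_percent
    FROM sys.dm_db_resource_stats
    ORDER BY end_time DESC`);
  return result.recordset; // sustained values near 100% suggest a higher tier
}
```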
If you choose the wrong DTU, don't worry: you can always scale up or down in SQL DB, and it is a completely online operation.
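The scale itself is a single T-SQL statement, run against the logical server's master database (a sketch reusing the mssql setup above; the database name and target tier are placeholders):

```typescript
import sql from "mssql";

// Request the new tier; the change completes in the background while the
// database stays online.
async function scaleToS2(masterConnectionString: string) {
  const pool = await sql.connect(masterConnectionString);
  await pool.request().query(
    "ALTER DATABASE [MyAppDb] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S2')"
  );
}
```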
Also, unless you have a strong reason not to, migrate to V12 to get even better performance and features.
