I have a monolith backend application written in Node.js, with Sequelize and PostgreSQL as the database, which serves millions of requests a day. Since ours is a tenant-based application, I am planning to shard my database so that I have x thousand tenants in one shard, x thousand in another, and so on. I use AWS RDS (PostgreSQL) as the database server.
On the infrastructure side it is pretty much straightforward to create a new shard: creating a new RDS database server with the same configuration as my primary database would be sufficient.
The main problem I am facing now is how to manage the shards.
For example, I have the following requirement:
All my queries for tenant_id < 10000 should go to meta_database
All my queries for tenant_id >= 10000 and < 30000 should go to shard_1
All my queries for tenant_id >= 30000 and < 60000 should go to shard_2
I tried the following tools:
Sequelize -
It seems this is more or less impossible with Sequelize, since it still does not support sharding. I can create multiple Sequelize connections, one for each shard, and map a tenant_id to a particular shard manually in code. But that requires fetching the models each time by passing in the tenant's tenant_id, which is neither a good nor a readable approach; a sketch of what I mean is below.
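For reference, this is roughly the manual mapping I have in mind. The connection URLs, shard names, and the Tenant model are placeholders for illustration, not my real schema:

```js
// shard-map.js - manual tenant_id -> shard routing (connection URLs are placeholders)
const { Sequelize, DataTypes } = require('sequelize');

const SHARDS = [
  { name: 'meta_database', url: process.env.META_DB_URL, min: 0,     max: 10000 },
  { name: 'shard_1',       url: process.env.SHARD_1_URL, min: 10000, max: 30000 },
  { name: 'shard_2',       url: process.env.SHARD_2_URL, min: 30000, max: 60000 },
];

// One Sequelize instance (and model set) per shard, created once at startup.
const connections = SHARDS.map((shard) => {
  const sequelize = new Sequelize(shard.url, { dialect: 'postgres', logging: false });
  const models = {
    Tenant: sequelize.define('Tenant', { tenantId: DataTypes.INTEGER, name: DataTypes.STRING }),
    // ...define the rest of the models against this connection
  };
  return { ...shard, sequelize, models };
});

// Every single data access has to go through this lookup, which is the readability problem.
function getShardForTenant(tenantId) {
  const shard = connections.find((s) => tenantId >= s.min && tenantId < s.max);
  if (!shard) throw new Error(`No shard configured for tenant ${tenantId}`);
  return shard;
}

// Usage: const { models } = getShardForTenant(tenantId);
//        await models.Tenant.findAll({ where: { tenantId } });

module.exports = { getShardForTenant };
```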
pgbouncer-rr -
I tried pgbouncer-rr and dropped it, since putting logic at the query-routing level to extract the tenant_id from the query and check its value with a regex is not a good approach, and it can also cause some unexpected errors.
postgres_fdw - Foreign Data Wrapper
I was able to create an FDW server and route my queries to the foreign server by following a few articles. The problem is that all the records are still being inserted into my primary meta database tables. It seems I was only able to route reads through the data wrappers, while the data still resides on the coordinator database. In addition, I can partition my table and place a few partitions on the foreign servers, but when a record is inserted it still gets written to the main database table and is then reflected in my foreign tables (the setup I tried looks roughly like the sketch below). How can I have my foreign servers handle all my reads and writes completely independently of the meta database (the meta database should only do the routing and should not have any data persisted)?
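For context, this is approximately the partitioned setup I experimented with, run as raw DDL through a Sequelize connection to the meta database. The table name, shard server names, and range bounds are placeholders; the postgres_fdw servers (shard_1, shard_2) are assumed to already exist (CREATE SERVER / CREATE USER MAPPING), as are the matching tables on each shard:

```js
// fdw-partitions.js - rough sketch of the partitioned setup described above.
// Table names, server names, and ranges are placeholders.
const { Sequelize } = require('sequelize');

const meta = new Sequelize(process.env.META_DB_URL, { dialect: 'postgres', logging: false });

async function setup() {
  // Parent table on the meta database, partitioned by tenant_id. The parent itself holds no rows.
  await meta.query(`
    CREATE TABLE IF NOT EXISTS orders (
      id        bigint  NOT NULL,
      tenant_id integer NOT NULL,
      payload   jsonb
    ) PARTITION BY RANGE (tenant_id);
  `);

  // Local partition for the meta range.
  await meta.query(`
    CREATE TABLE IF NOT EXISTS orders_meta PARTITION OF orders
      FOR VALUES FROM (0) TO (10000);
  `);

  // Foreign-table partitions: rows routed to these ranges are stored on the shard servers.
  await meta.query(`
    CREATE FOREIGN TABLE IF NOT EXISTS orders_shard_1 PARTITION OF orders
      FOR VALUES FROM (10000) TO (30000) SERVER shard_1;
  `);
  await meta.query(`
    CREATE FOREIGN TABLE IF NOT EXISTS orders_shard_2 PARTITION OF orders
      FOR VALUES FROM (30000) TO (60000) SERVER shard_2;
  `);
}

setup().then(() => meta.close()).catch((err) => { console.error(err); process.exit(1); });
```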
PL/Proxy -
I read a few articles on PL/Proxy; it requires me to write a function for every read and insert. I guess it is more useful for managing table partitions than for managing shards.
I am not sure how to proceed with tenant-based sharding. If anyone has achieved sharding with Node.js, Postgres and Sequelize, kindly help!
I am even okay with having a proxy in front of the database that takes care of the query routing based on tenant_id. I tried Citus for this purpose, to use it as such a proxy, but it recently dropped its support for AWS.
Related
I am re-designing a project I built a year ago, when I was just starting to learn how to code. I used the MEAN stack back then and want to convert it to a PERN stack now. My AWS knowledge has also grown a bit and I'd like to expand on these new skills.
The application receives real-time data from an API, which I clean up and write to a database, as well as broadcast to connected clients.
To better conceptualize this question I will refer to the following items:
api-m1: receives the incoming data and passes it through my schema; I then send it to my socket-server.
socket-server: handles the WSS connection to the application's front-end clients. It also writes the data it gets from the scraper and api-m1 to a Postgres database. I would like to turn this into clusters eventually, since I am using Node.js, and will incorporate Redis; then I will run it behind an ALB using sticky sessions etc. for multiple EC2 instances.
RDS: the Postgres table which socket-server writes incoming scraper and api-m1 data to. RDS is used to fetch the most recent data stored, along with user profile config data. NOTE: the main RDS data table will have at most 120-150 UID records with 6-7 columns.
From a database perspective, what would be the quickest way to write my data to RDS, assuming that during peak times we have 20-40 records/s from api-m1 plus another 20-40 records/s from the scraper? At the end of each day I tear down the database using a Lambda function and start again (the data is only temporary and does not need to be kept for any prolonged period of time).
1. Should I INSERT each record using a SERIAL id, then from the frontend fetch the most recent rows based on the uid?
2.a Should I UPDATE each UID, so I'd have a fixed N rows of data which I just search and update? (I can see this bottlenecking with my Postgres client.)
2.b Still use UPDATE but do BATCHED updates? (What issues will I run into if I run multiple clusters, i.e. will I hit concurrency problems where an older value overwrites a more recent value for record XYZ because I'm using BATCH UPDATE with Node clusters?)
My concern is that UPDATEs are slower than INSERTs, and I want to make this as fast as possible. This section of the application isn't CPU heavy, and the rt-data isn't that intensive.
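For what it's worth, option 2.b can be expressed as a single batched upsert keyed on the uid, which avoids issuing one UPDATE per record. This is only a sketch using node-postgres; the table name, columns, and the idea of carrying an event timestamp on each record are assumptions:

```js
// batched-upsert.js - sketch of option 2.b with node-postgres: one round trip per batch.
// Assumed table: CREATE TABLE rt_data (uid text PRIMARY KEY, payload jsonb, updated_at timestamptz);
const { Pool } = require('pg');

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// records: [{ uid, payload, ts }, ...] collected over e.g. a 250 ms window,
// where ts is the event timestamp coming from api-m1 / the scraper.
// Assumes each batch contains at most one record per uid (dedupe before calling if needed).
async function upsertBatch(records) {
  const uids = records.map((r) => r.uid);
  const payloads = records.map((r) => JSON.stringify(r.payload));
  const times = records.map((r) => r.ts);

  await pool.query(
    `INSERT INTO rt_data (uid, payload, updated_at)
     SELECT u, p::jsonb, ts
     FROM unnest($1::text[], $2::text[], $3::timestamptz[]) AS t(u, p, ts)
     ON CONFLICT (uid)
     DO UPDATE SET payload = EXCLUDED.payload, updated_at = EXCLUDED.updated_at
     WHERE rt_data.updated_at < EXCLUDED.updated_at`,
    [uids, payloads, times]
  );
}

module.exports = { upsertBatch };
```

The WHERE guard on updated_at is what keeps a stale batch from one cluster worker from overwriting a newer row written by another, which is the concurrency concern raised in 2.b.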
To make my comments an answer:
You don't seem to need SQL semantics for anything here, so I'd just toss RDS and use e.g. Redis (or DynamoDB, I guess) for that data store.
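If Redis ends up being the store, keeping only the latest record per UID is a single hash; a minimal sketch with node-redis v4 (the key name and record shape are made up):

```js
// redis-latest.js - keep only the most recent record per UID in a Redis hash (node-redis v4).
const { createClient } = require('redis');

const redis = createClient({ url: process.env.REDIS_URL });
const ready = redis.connect();

// On every incoming record, overwrite that UID's slot.
async function storeLatest(record) {
  await ready;
  await redis.hSet('rt:latest', record.uid, JSON.stringify(record));
}

// On client connect, fetch the whole snapshot (120-150 fields is tiny).
async function fetchSnapshot() {
  await ready;
  const all = await redis.hGetAll('rt:latest');
  return Object.values(all).map((v) => JSON.parse(v));
}

module.exports = { storeLatest, fetchSnapshot };
```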
I'm building an API that is meant to query multiple databases. All of these databases are isolated instances with the same data structure, so the idea is for the request to indicate which database to point to. The number of databases is dynamic. Is there a way for a module to set up a varying number of databases when starting up?
I've tried using TypeORM and connecting to the specific DB when asked, but that adds some time to the request, so I wanted to know whether there is a way to keep them all connected.
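One pattern that avoids paying the connection cost on every request is to create one DataSource per database lazily and cache it. A rough sketch with TypeORM 0.3+; the host, credentials, and entity list are placeholders:

```js
// datasource-cache.js - lazily create and reuse one TypeORM DataSource per database.
const { DataSource } = require('typeorm');

const sources = new Map(); // dbName -> Promise<DataSource>

function getDataSource(dbName) {
  if (!sources.has(dbName)) {
    const ds = new DataSource({
      type: 'postgres',
      host: process.env.DB_HOST,
      port: 5432,
      username: process.env.DB_USER,
      password: process.env.DB_PASSWORD,
      database: dbName,                 // same schema, different database per request
      entities: [/* shared entity classes */],
    });
    // Cache the initialize() promise so concurrent requests for the same DB share one pool.
    sources.set(dbName, ds.initialize());
  }
  return sources.get(dbName);
}

// Usage in a request handler (SomeEntity is a placeholder):
// const ds = await getDataSource(req.params.db);
// const rows = await ds.getRepository(SomeEntity).find();

module.exports = { getDataSource };
```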
I'm looking to get some opinions on what the best approach is for the following scenario:
Our product requires connections to our users' Postgres databases via our Node Express server. They provide their credentials once, we store them encrypted in our internal operations DB, and we reference them when access is needed. A user can perform actions in our app UI like creating a table, deleting a table, etc., and can view table sizes, min/max values of a column, and so on.
These actions come to our server as authenticated API calls, and we query their databases via Sequelize as needed and return the results to the frontend.
My question is: when there are N users with N databases on different SQL instances that need to be connected to whenever an API is called to query the respective database, what is the best approach to maintaining that?
Should we create a new Sequelize connection instance each time an API is called, run the query, return the response, and close the connection? Or create a new Sequelize instance for a DB when an API is called, keep the instance for a certain amount of time, close it if it is inactive for that long, and recreate it the next time?
If there are better and more efficient ways of doing this, I would love to hear about it. Thanks.
Currently, I've tried creating a new Sequelize instance at the beginning of each API request, running the query, and then closing the connection. It works OK, but that's just locally with 2 DBs, so I can't tell what production would be like.
Edit: Anatoly suggested a connection pool; in that case, what are the things that need to be considered for the config?
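A middle ground between the two options is to cache one Sequelize instance per customer database and evict instances that sit idle too long; Sequelize's built-in pool handles connection reuse within each instance. The numbers below are illustrative, not recommendations:

```js
// tenant-connections.js - cache one Sequelize instance per customer database,
// close instances that sit idle too long. Numbers are illustrative only.
const { Sequelize } = require('sequelize');

const IDLE_TTL_MS = 10 * 60 * 1000;      // close a customer's pool after 10 min of inactivity
const cache = new Map();                 // cacheKey -> { sequelize, lastUsed }

function getConnection(creds) {
  const key = `${creds.host}:${creds.port}/${creds.database}/${creds.username}`;
  let entry = cache.get(key);
  if (!entry) {
    const sequelize = new Sequelize(creds.database, creds.username, creds.password, {
      host: creds.host,
      port: creds.port,
      dialect: 'postgres',
      logging: false,
      pool: {
        max: 5,         // cap per customer DB so one tenant cannot exhaust their server
        min: 0,         // let the pool drain to zero when idle
        acquire: 30000, // ms to wait for a connection before erroring
        idle: 10000,    // ms a connection may sit idle before being released
      },
    });
    entry = { sequelize, lastUsed: Date.now() };
    cache.set(key, entry);
  }
  entry.lastUsed = Date.now();
  return entry.sequelize;
}

// Periodically close instances nobody has used recently.
setInterval(async () => {
  for (const [key, entry] of cache) {
    if (Date.now() - entry.lastUsed > IDLE_TTL_MS) {
      cache.delete(key);
      await entry.sequelize.close();
    }
  }
}, 60 * 1000).unref();

module.exports = { getConnection };
```

The main things to watch in the pool config are max (so the total across all cached instances stays below each customer server's max_connections), min: 0 so idle tenants hold no connections, and the acquire/idle timeouts.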
The ASP.NET MVC application runs on EC2 and interacts with RDS (SQL Server). The application sends bulk GET requests (API calls) to RDS via NHibernate to get items. The application's performance is very slow, as it sometimes makes around 500 GET API calls to fetch 500 items from the DB (note: getting items from the DB has its own stored procedure/logic).
I was referring to these to understand scaling RDS: https://aws.amazon.com/blogs/database/scaling-your-amazon-rds-instance-vertically-and-horizontally/ and https://aws.amazon.com/premiumsupport/knowledge-center/requests-rds-read-replicas/
However, they didn't give me much of a clue that fits my business scenario.
My questions are (considering the above scenario):
Is there any way to distribute my GET requests to RDS (SQL Server) so that it can return the 500 items quickly?
Is it possible to achieve this without any code or architecture changes (on both the .NET and SQL side)?
What are the different ways I should try out to improve this performance?
What are the pricing details for a read replica?
Note: The application does both reads and writes, and I'm mostly concerned about these particular GET API calls.
Thanks.
Is there any way to distribute my GET requests to RDS (SQL Server) so that it can return the 500 items quickly?
You will need a router in your application that routes these read requests to the read replicas (there can be many of them).
You can also provision a read replica with a different, more capable instance type for that use case.
You can try a memory cache as well; it can reduce response times and offload read traffic from the database.
Is it possible to achieve this without any code or architecture changes (on both the .NET and SQL side)?
Based on the documentation, "applications can connect to a read replica just as they would to any DB instance", which means your application will require additional modification to support this use case.
What are the different ways I should try out to improve this performance?
A memory cache and a read replica with an enhanced-capacity instance type (the same suggestions as above).
What are the pricing details for a read replica?
It depends on the instance type that you provision.
I have a NodeJS project that uses MongoDB as its main database.
Normally I just use one database to hold all the information (users, organizations, messages, ...).
But now I need to store one more thing - log data - which grows very, very fast.
So I am considering storing the logs in a separate database, to keep the current database safe and fast.
Does anyone have experience with this? Is it better than a single database?
Not a real question, the mods will certainly say. You have a few options, depending on your log data and how, and how often, you want to access it:
Capped collections, if you don't need to store the logs for a long time (there is a sketch of this after the list)
Something like Redis to delay writing to the log and keep the app responding fast
Use a replica set to distribute the database load.
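For the capped-collection option, putting the logs in a separate database on the same (or a different) server is only a few lines with the official Node driver; the database/collection names and size limits below are just examples:

```js
// logs.js - separate log database with a capped collection (official mongodb driver).
// Names and sizes are examples only.
const { MongoClient } = require('mongodb');

const client = new MongoClient(process.env.MONGO_URL);

async function initLogs() {
  await client.connect();
  const logDb = client.db('app_logs');   // separate database, same server (or point at a different URL entirely)

  // Capped collections keep a fixed size on disk and silently drop the oldest documents.
  const existing = await logDb.listCollections({ name: 'events' }).toArray();
  if (existing.length === 0) {
    await logDb.createCollection('events', {
      capped: true,
      size: 1024 * 1024 * 512,           // keep at most ~512 MB of log data
      max: 1000000,                      // or at most 1M documents, whichever comes first
    });
  }
  return logDb.collection('events');
}

// Usage: const events = await initLogs();
//        await events.insertOne({ level: 'info', msg: 'user logged in', at: new Date() });

module.exports = { initLogs };
```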