Node.js: keep a small in-memory database

I have an API service in Node.js; basically, it gets an id from the request, reads the record with that id from the database, and returns it in the response.
While there are many clients with different ids, usually only about 10-20 of them are used in a given timespan.
Is it a good idea to create an object with ids as keys and store the resulting record along with a last_requested time, to emulate a small fast-access database? Whenever a record is requested I will update its last_requested field with new Date(). I would also create a setInterval() to delete keys that have not been used for some time.
Records in the database do not change often, and when they do I can restart the service (there are several instances running simultaneously via PM2, so they can be gracefully restarted).
If the required id is not found in this "database", a request to the real database will be performed and the result will be stored in the object under a new key.
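Something like this is what I have in mind (fetchFromDb, the eviction interval, and the idle timeout are placeholders):

```js
// Sketch of the cache described above; fetchFromDb stands in for the real DB query.
const cache = {}; // id -> { record, last_requested }
const MAX_IDLE_MS = 10 * 60 * 1000; // evict after 10 minutes without a request (arbitrary)

async function getRecord(id) {
  if (cache[id]) {
    cache[id].last_requested = new Date();
    return cache[id].record;
  }
  const record = await fetchFromDb(id); // fall back to the real database
  cache[id] = { record, last_requested: new Date() };
  return record;
}

// Periodically drop records that have not been requested for a while.
setInterval(() => {
  const now = Date.now();
  for (const id of Object.keys(cache)) {
    if (now - cache[id].last_requested.getTime() > MAX_IDLE_MS) {
      delete cache[id];
    }
  }
}, 60 * 1000);
```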

You're talking about caching. And it's very useful if:
You have a lot of reads but not a lot of writes, i.e. lots of people request a record and it changes rarely.
You have a lot of free memory, or not many records.
You have a good indication of when to invalidate the cache.
For trivial use cases (i.e. under 50 requests/second), you probably don't need an in-memory cache for the database. Moreover, database access is very fast if you use the tools the database gives you (like persistent connection pools, parameterized queries, the query cache, etc.).
It all depends on your specific use case. But I wouldn't do it until I actually started encountering performance problems and determined that the database is the bottleneck.
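For example, pooled, parameterized access might look like this (shown with node-postgres as one possible driver; the connection string, table, and column are placeholders):

```js
// Sketch using node-postgres (pg); any driver with pooling works similarly.
const { Pool } = require('pg');
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 10 });

async function getRecord(id) {
  // Parameterized query: the driver handles escaping, and the pool
  // reuses connections instead of opening a new one per request.
  const { rows } = await pool.query('SELECT * FROM records WHERE id = $1', [id]);
  return rows[0];
}
```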

It's not just a good idea; caching is a necessity at several levels of a computing system. It starts at the CPU level (L1, L2, L3) and the OS level, and goes up to the application level, which must be handled by the developer.
Even if you have a well-structured database with good indexes, there is still overhead for the TCP/IP communication between your app and the database. So if you are going to access some rows frequently, it's a must to have them in your app's process.
The good news is that a Node.js app is a single process resident in memory (unlike PHP or other scripting programs which come and go), so you can keep frequently required data loaded and skip the database access.
The best mechanism for storing the records is an LRU (least-recently-used) cache. There are several LRU cache packages available for Node.js:
https://github.com/adzerk/node-lru-native
https://github.com/isaacs/node-lru-cache
https://www.npmjs.com/package/simple-lru-cache
In an LRU cache you can define how much memory the cache can use, the expiry age of each item, and how many items it can store, or you can write your own (see the sketch below).
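If you prefer to write your own, a minimal version built on a Map's insertion order might look like this (the capacity and key names are arbitrary):

```js
// Minimal LRU cache built on a Map's insertion order.
class LRUCache {
  constructor(maxItems = 100) {
    this.maxItems = maxItems;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert to mark this key as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxItems) {
      // The first key in the Map is the least recently used one.
      this.map.delete(this.map.keys().next().value);
    }
  }
}

const cache = new LRUCache(500);
cache.set('user:42', { name: 'Alice' });
console.log(cache.get('user:42')); // { name: 'Alice' }
```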

Related

Most efficient way to get the user's country in Next.js/Node.js?

In a Next.js app, what would be the most efficient (fastest) way to retrieve the user's country?
Among other things, I would use it to determine which scripts are loaded using next/script.
I looked into node-geoip and fast-geoip, but even though fast-geoip has a very thorough explanation (quoted below), I do not understand the mechanisms behind Next.js/Node.js well enough to evaluate the methods properly.
Concretely, what geoip-lite does is that, on startup, it reads the whole database from disk, parses it and puts it all on memory, thus this results in the startup time being increased by about ~233 ms along with an increase of memory being used by the process of around ~110 MB, in exchange for any new queries being resolved with low sub-millisecond latencies (~0.02 ms).
This works if you have a long-running process that will need to geolocate a lot of IPs and don't care about the increases in memory usage nor startup time, but if, for example, your use-case requires only geolocating a single IP, these trade-offs don't make much sense as only a small part of the database is needed to answer that query, not all of it.
This library tries to provide a solution for these use-cases by separating the database into chunks and building an indexing tree around them, so that IP lookups only have to read the parts of the database that are needed for the query at hand. This results in the first query taking around 9ms and subsequent ones that hit the disk cache taking 0.7 ms, while memory consumption is kept at around 0.7MB.
Wrapping it up, geoip-lite has huge overhead costs but sub-millisecond queries whereas this library doesn't have any overhead costs but its queries are slower (0.7-9 ms).
As geoip would be called for every visitor, I assume it would have to read the whole database on each initialization, thereby making fast-geoip the best choice?
Or is there some built-in mechanism that keeps the data in memory across subsequent requests when it is loaded frequently, hence making node-geoip the best choice?
Or am I approaching the problem the wrong way, and should I rather see if there is some way I can get the location via the user's browser?
Would appreciate any feedback, even if there is a completely different path worth exploring :-)
I read the documentation for fast-geoip. It's designed for "serverless" cloud services such as AWS Lambda, GCP Cloud Functions, and CF Workers, where RAM is limited and expensive.
Note the package author's emphasis on low steady-state RAM use in the figures quoted above.
In summary, assuming a cloud VM or bare-metal deployment and the need to call the IP-to-location lookup on every page request, there is probably no compelling reason to use the above package.
PS: Check whether the above packages require you to rotate a DB file on disk every few weeks (or rebuild and redeploy your Node app) to keep the data up to date. There are commercial REST APIs, such as the one in my bio (I am the developer), that may mitigate this hassle; YMMV.
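For a long-running process, the in-memory style the quoted explanation describes looks roughly like this with geoip-lite (a sketch; only the documented lookup() call is assumed):

```js
// Sketch of an in-memory lookup with geoip-lite in a long-running process.
// The database is loaded once at startup; subsequent lookups hit memory.
const geoip = require('geoip-lite');

function countryForIp(ip) {
  const geo = geoip.lookup(ip); // null if the IP isn't in the database
  return geo ? geo.country : null;
}

console.log(countryForIp('8.8.8.8')); // e.g. 'US'
```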

Best persistent data storage system for an alternative to global variables?

I am building a Node.js application which uses a few global variables to track data such as online users and statuses, information about other servers, and ongoing events, but having this information lost in the event of a server restart/crash is not ideal.
As these things are frequently read and modified, I figure it would not be a good idea to put that extra strain on my existing MySQL database. I have looked into Redis, but unfortunately my application is hosted on a Windows server, so I would have to use an old unsupported version of it, which isn't ideal.
I'm currently considering setting up a NoSQL database such as MongoDB, but I'm not sure whether this is an efficient solution and whether it would be too much for my relatively weak server to run an application and two different databases.
What would be the best solution for persistent storage of data that needs to be frequently accessed and updated by an application?
Making my comments into an answer...
If it's a reasonable amount of data, you can just write JSON to a single data file; no database required. Just overwrite the file with a new block of JSON to save the new state. This is fast, efficient, and simple. I've used this before as a quick and easy way to regularly save snapshots of state that you want to be able to reload if your server restarts: read the state into memory upon server start, then use it from memory, and regularly save a new snapshot to disk however often your application desires.
If some data changes a lot and some data doesn't change very much, you can break the data into multiple files so you're writing less data at the more frequent interval. Obviously, there is a threshold of data volume, write frequency, or complexity of data access where a database would be warranted, but you should at least consider the simpler option first and only add a database when you think you really need it.
If you cluster your servers in the future, that would speak to a multi-user database (one with appropriate concurrency-management features) being your master keeper of state. But you're going to have other design issues to work through if you're trying to share multi-user state (like online status) across all clustered servers: you can no longer keep that state in memory on any one server unless all state changes are broadcast to every server so they can update their in-memory copy, or unless you make users sticky to a particular server (which complicates load balancing). That does somewhat call for a Redis-like central store that all clustered servers can access.
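A rough sketch of that snapshot-to-file approach (the file name, interval, and temp-file rename are just one way to do it):

```js
// Sketch: load state at startup, keep it in memory, snapshot it periodically.
const fs = require('fs');
const STATE_FILE = './state.json'; // arbitrary path

let state = {};
try {
  state = JSON.parse(fs.readFileSync(STATE_FILE, 'utf8'));
} catch (err) {
  // First run or unreadable file: start with empty state.
}

function saveSnapshot() {
  const tmp = STATE_FILE + '.tmp';
  // Write to a temp file, then rename, so a crash mid-write
  // never leaves a truncated state.json behind.
  fs.writeFileSync(tmp, JSON.stringify(state));
  fs.renameSync(tmp, STATE_FILE);
}

setInterval(saveSnapshot, 30 * 1000); // snapshot every 30 seconds
process.on('exit', saveSnapshot);     // best-effort snapshot on shutdown
```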

MongoDB Performance: single collection vs multiple collections for concurrent read/writes

I'm utilizing a local database on my web server to sync certain data from external APIs. The local database would be used to serve the web application. The data I'm syncing is different for each user visiting the web app. Since the sync job is periodically but continuously writing to the DB while users are accessing their data from the web page, I'm wondering what would give me the best performance here.
Since the sync job is continuously writing to the DB, I believe the collection is locked until it's done. I'm thinking that having multiple collections would help here, since the lock would be on the particular collection being written to rather than on a single collection every time.
Is my thinking correct here? I basically don't want reads to get throttled since the write operation is continuously locking up one collection.
Collection-level locking was never a thing in MongoDB. Before the WiredTiger storage engine arrived (introduced in MongoDB 3.0 and the default since 3.2), there were plenty of occasions when the whole database would lock.
Nowadays, with WiredTiger, writing to a single collection from multiple threads and/or processes is extremely efficient. The right way to distribute a very heavy write load in MongoDB is to shard your collection.
To compare sharded vs. unsharded setups, you can easily spin up both configurations in parallel with MongoDB Atlas.
There is extensive information about lock granularity and locking in MongoDB in general in the official MongoDB documentation.
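For reference, sharding a collection looks roughly like this in mongosh (the database, collection, and shard key below are placeholders):

```js
// Run in mongosh against a sharded cluster (names below are placeholders).
sh.enableSharding("appdb");
// A hashed shard key spreads writes across shards.
sh.shardCollection("appdb.userSyncData", { userId: "hashed" });
```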
In general, writing to multiple collections (for a small-to-medium value of "multiple", and assuming all of the collections are created in advance) can be faster than using a single collection. The cost is that queries become awkward, and potentially slow, if you have to perform joins via the aggregation pipeline instead of a single collection/index scan, for example.
If you have so many collections that the number of open files causes either the DB or the OS to start evicting files from their respective caches, performance will start dropping again.
Creating collections can also be relatively slow, so if that happens under load it may not be good for performance.

How should I keep temporary data for socket.io interactions in node.js?

I am building a simple game in node.js using socket.io. My web experience with node.js has typically involved saving everything to a relational database and keeping nothing in memory. I set up a relational database for the state of a game. I am using sqlite3 for development and I might use something like PostgreSQL or MySQL for production.
My concern is that every time an event is emitted over the socket, the whole game state is loaded from the database into memory on the server. I feel that in practice this will be less efficient than just keeping all of the game-state data in memory. Events will probably be emitted every 5 seconds or so during a game. All of the game data is temporary; none of it will be needed after the game is over. A game state consists of a set of about 120 groups of small strings and integers (about 10 per group, but subject to change).
Is it good practice to keep this type of data in memory?
If not, should I stick with relational databases or switch to a third option like a file-based storage structure?
Should I avoid loading the whole game state for every event, even though that will lead to a lot more reads/writes (at least triple)?
I would not keep this data in the memory of your Node.js application. It's best to avoid storing state in your app server. If you really need faster read access than SQL provides, consider using a cache like Redis or Memcached as a layer between your app and the DB.
All that being said, it's best not to prematurely optimize your code. Most SQL engines have their own form of caching, and optimizing your SQL queries is a better place to start if you're experiencing performance issues (see, e.g., PostgreSQL query optimization).
But don't worry about it until it's an actual problem (because most likely it never will be).
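As a sketch of the Redis-as-a-layer idea (node-redis v4-style client; the key names, TTL, and loadGameStateFromSql are placeholders):

```js
// Sketch of Redis as a cache layer between the app and the SQL database.
const { createClient } = require('redis');

const redis = createClient(); // defaults to localhost:6379
redis.on('error', (err) => console.error('Redis error', err));

async function getGameState(gameId) {
  const key = `game:${gameId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const state = await loadGameStateFromSql(gameId); // placeholder for the SQL query
  await redis.set(key, JSON.stringify(state), { EX: 60 }); // 60 s TTL, arbitrary
  return state;
}

async function main() {
  await redis.connect();
  console.log(await getGameState(123));
}

main();
```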
It sounds like a relational, SQL-type database is a huge overhead for your specifics. Do you have an idea how big your data is and how many users you'd like to handle? Then you could compare that with your server's capacity. If the result is negative (it couldn't fit in memory), then I'd go with some quick NoSQL store, like Mongo. For your example it sounds like the best choice: it'll be faster to get data for a single session, easier to dump, and more elastic in structure.

CouchDB: safety of arbitrary design docs

I am creating a CouchDB database per user of my application, in which the application is granted database-admin privileges. This is done so that the application can sync design docs, but I do not want to expose my server to any risks.
There is no legitimate reason for a user to run a view on my server (they only use the server for two-way syncing), so it shouldn't be hard to filter out requests that attempt to query views?
Are there other security risks or DoS attacks I'm missing?
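For example, I imagine the filtering could be done in a proxy in front of CouchDB, something like this (Express + http-proxy-middleware; the port, target, and path pattern are assumptions):

```js
// Sketch: reject view queries before proxying sync traffic to CouchDB.
const express = require('express');
const { createProxyMiddleware } = require('http-proxy-middleware');

const app = express();

// Block direct view queries; replication/sync endpoints pass through untouched.
app.use((req, res, next) => {
  if (/\/_design\/[^/]+\/_view\//.test(req.path)) {
    return res.status(403).json({ error: 'views are not exposed' });
  }
  next();
});

app.use('/', createProxyMiddleware({ target: 'http://localhost:5984', changeOrigin: true }));

app.listen(3000);
```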
Every user that has read access to your database is able to run views. That's not an issue on its own, since a view index is built once and then updated incrementally.
But database admins can create whatever new views they like. Views can't consume much CPU time, since CouchDB limits their execution with a timeout (5 seconds by default), but they can consume a lot of disk space, especially if the full doc content is emitted from the view; this can make a single view index bigger than the whole database.
Moreover, database admins can run database and view-index compactions. These operations are very heavy on disk I/O (and sometimes CPU too), especially for large databases (100 GiB+). They may significantly slow down your server (a single compaction probably won't, but multiple ones easily will) if they run at the peak of your users' activity.
Things get worse if you're using a custom view server without a sandbox feature (Python, Erlang, etc.): in effect, they let your DB admins execute custom code on your server through CouchDB. In that case, losing all your databases and finding a remote shell on your server are just the tip of the iceberg of possibilities.
To sum up: don't make people you cannot trust database admins, and you'll be safe.
