High Disk usage when writing to sqlite database

High Disk usage when writing to sqlite database - multithreading

I are running a local server which is using sqlite as database. When insert or update operations are happening disk usage goes very high(around 95%). The device becomes unusable. All database requests come from multiple threads which perform atomic db operations.
I have tried different pragma operations which improved speed of execution but the disk usage is still very high. Since request come from multiple threads I cannot use transactions.
Any help is appreciated.

Related

How to find optimal size of connection pool for single mongo nodejs driver

I am using official mongo nodejs driver with default settings, but was digging deeper into options today and apparently there is an option of maxPoolSize that is set to 100 by default.
My understanding of this is that single nodejs process can establish up to 100 connections, thus allowing mongo to handle 100 reads/writes simultaneously in paralel?
If so, it seems that setting this number higher could only benefit the performance, but I am not sure hence decided to ask here.
Assuming default setup with no indexes, is there a way to determine (based on cpu's and memory of the db) what the optimal connection number for pool should be?
We can also assume that nodejs process itself is not a bottleneck (i.e can be scaled horizontally).

Good question =)
it seems that setting this number higher could only benefit the performance
It does indeed. I mean it seems, and it would be the case for an abstract nodejs process in a vacuum with unlimited resources. Connections are not free, so there are things to consider:
limited connection quota on the server. Atlas in particular, but even self-hosted cluster has only 65k sockets. Remember the driver keeps them open to reuse, and the default timeout per cursor is 30 minutes of inactivity.
single thread clientside. BSON serialisation blocks event loop and is quite expensive, e.g. see the flamechart in this answer https://stackoverflow.com/a/72264469/1110423 . Blocking the loop, you increase time cursors from the previous point remain open, and in worst case get performance degradation.
limited RAM. Each connection require ~1 MB serverside.
Assuming default setup with no indexes
You have at least _id, and you should have more if we are talking about performance
is there a way to determine what the optimal connection number for pool should be?
I'd love to know that too. There are too many factors to consider, not only CPA/RAM, but also data shape, query patterns, etc. This is what dbops are for. Mongo cluster requires some attention, monitoring and adjustments for optimal operations. In many cases it's more cost efficient to scale up the cluster than optimise the app.
We can also assume that nodejs process itself is not a bottleneck (i.e can be scaled horizontally).
This is quite wild assumption. The process cannot scale horisontally. It's on the OS level. Once you have a process descriptor, it's locked to it till the death. You can use a node cluster to utilise all CPU cores, can even have multiple servers running the same nodejs and balance the load, but none of them will share connections from the pool. The pool is local to nodejs process.

In node.js, why one would want to use pools when connecting through node-postgres?

With node-postgres npm package, I'm given two connection options: with using Client or with using Pool.
What would be the benefit of using a Pool instead of a Client, what problem will it solve for me in the context of using node.js, which is a) async, and b) won't die and disconnect from Postgres after every HTTP request (as PHP would do, for example).
What would be the technicalities of using a single instance of Client vs using a Pool from within a single container running a node.js server? (e.g. Next.js, or Express, or whatever).
My understanding is that with server-side languages like PHP (classic sync php), Pool would benefit me by saving time on multiple re-connections. But a Node.js server connects once and maintains an open connection to Postgres, so why would I want to use a Pool?

PostgreSQL's architecture is specifically built for pooling. Its developers decided that forking a process for each connection to the database was the safest choice and this hasn't been changed since the start.
Modern middleware that sits between the client and the database (in your case node-postgres) opens and closes virtual connections while administering the "physical" connection to the Postgres database can be held as efficient as possible.
This means connection time can be reduced a lot, as closed connections are not really closed, but returned to a pool, and opening a new connection returns the same physical connection back to the pool after use, reducing the actual forking going on the database side.
Node-postgres themselves write about the pros on their website, and they recommend you always use pooling:
Connecting a new client to the PostgreSQL server requires a handshake
which can take 20-30 milliseconds. During this time passwords are
negotiated, SSL may be established, and configuration information is
shared with the client & server. Incurring this cost every time we
want to execute a query would substantially slow down our application.
The PostgreSQL server can only handle a limited number of clients at a
time. Depending on the available memory of your PostgreSQL server you
may even crash the server if you connect an unbounded number of
clients. note: I have crashed a large production PostgreSQL server
instance in RDS by opening new clients and never disconnecting them in
a python application long ago. It was not fun.
PostgreSQL can only process one query at a time on a single connected
client in a first-in first-out manner. If your multi-tenant web
application is using only a single connected client all queries among
all simultaneous requests will be pipelined and executed serially, one
after the other. No good!
https://node-postgres.com/features/pooling
I think it was clearly expressed in this snippet.
"But a Node.js server connects once and maintains an open connection to Postgres, so why would I want to use a Pool?"
Yes, but the number of simultaneous connections to the database itself is limited, and when too many browsers try to connect at the same time, the database's handling of it is not elegant. A pool can better mitigate this by virtualizing and outsourcing from the database itself the queuing and error handling that no databases are specialized in.
"What exactly is not elegant and how is it more elegant with pooling?"
A database stops responding, a connection times out, without any feedback to the end user (and even often with few clues to the server admin). The database is dependent on hardware to a higher extent than a javascript program. The risk of failure is higher. Those are my main "not elegant" arguments.
Pooling is better because:
a) As node-postgres wrote in my link above: "Incurring the cost of a db handshake every time we want to execute a query would substantially slow down our application."
b) Postgres can only process one query at a time on a single connected client (which is what Node would do without the pool) in a first-in first-out manner. All queries among all simultaneous requests will be pipelined and executed serially, one after the other. Recipe for disaster.
c) A node-based pooling component is in my opinion a better interface for enhancements, like request queuing, logging and error handling compared to a single-threaded connection.
Background:
According to Postgres themselves pooling IS needed, but deliberately not built into Postgres itself. They write:
"If you look at any graph of PostgreSQL performance with number of connections on the x axis and tps on the y access (with nothing else changing), you will see performance climb as connections rise until you hit saturation, and then you have a "knee" after which performance falls off. A lot of work has been done for version 9.2 to push that knee to the right and make the fall-off more gradual, but the issue is intrinsic -- without a built-in connection pool or at least an admission control policy, the knee and subsequent performance degradation will always be there.
The decision not to include a connection pooler inside the PostgreSQL server itself has been taken deliberately and with good reason:
In many cases you will get better performance if the connection pooler is running on a separate machine;
There is no single "right" pooling design for all needs, and having pooling outside the core server maintains flexibility;
You can get improved functionality by incorporating a connection pool into client-side software; and finally
Some client side software (like Java EE / JPA / Hibernate) always pools connections, so built-in pooling in PostgreSQL would then be wasteful duplication.
Many frameworks do the pooling in a process running on the the database server machine (to minimize latency effects from the database protocol) and accept high-level requests to run a certain function with a given set of parameters, with the entire function running as a single database transaction. This ensures that network latency or connection failures can't cause a transaction to hang while waiting for something from the network, and provides a simple way to retry any database transaction which rolls back with a serialization failure (SQLSTATE 40001 or 40P01).
Since a pooler built in to the database engine would be inferior (for the above reasons), the community has decided not to go that route."
And continue with their top reasons for performance failure with many connections to Postgres:
Disk contention. If you need to go to disk for random access (ie your data isn't cached in RAM), a large number of connections can tend to force more tables and indexes to be accessed at the same time, causing heavier seeking all over the disk. Seeking on rotating disks is massively slower than sequential access so the resulting "thrashing" can slow systems that use traditional hard drives down a lot.
RAM usage. The work_mem setting can have a big impact on performance. If it is too small, hash tables and sorts spill to disk, bitmap heap scans become "lossy", requiring more work on each page access, etc. So you want it to be big. But work_mem RAM can be allocated for each node of a query on each connection, all at the same time. So a big work_mem with a large number of connections can cause a lot of the OS cache to be periodically discarded, forcing more accesses to disk; or it could even put the system into swapping. So the more connections you have, the more you need to make a choice between slow plans and trashing cache/swapping.
Lock contention. This happens at various levels: spinlocks, LW locks, and all the locks that show up in pg_locks. As more processes compete for the spinlocks (which protect LW locks acquisition and release, which in turn protect the heavyweight and predicate lock acquisition and release) they account for a high percentage of CPU time used.
Context switches. The processor is interrupted from working on one query and has to switch to another, which involves saving state and restoring state. While the core is busy swapping states it is not doing any useful work on any query. Context switches are much cheaper than they used to be with modern CPUs and system call interfaces but are still far from free.
Cache line contention. One query is likely to be working on a particular area of RAM, and the query taking its place is likely to be working on a different area; causing data cached on the CPU chip to be discarded, only to need to be reloaded to continue the other query. Besides that the various processes will be grabbing control of cache lines from each other, causing stalls. (Humorous note, in one oprofile run of a heavily contended load, 10% of CPU time was attributed to a 1-byte noop; analysis showed that it was because it needed to wait on a cache line for the following machine code operation.)
General scaling. Some internal structures allocated based on max_connections scale at O(N^2) or O(N*log(N)). Some types of overhead which are negligible at a lower number of connections can become significant with a large number of connections.
Source

DB for Pi Zero running multiple NodeJs processes

I need a local DB on a pi zero, with multiple processes running that need to write and read data. That kind of rules SQLite out (I think). From my experience SQLite only allows one connection at a time and is tricky with multiple processes trying to do database work. All of my data transmission would be JSON driven so NOSQL makes sense but I need something light weight to store a few configs and to store data that will synced up to the server. But what NOSQL options would be best to run on a pi with great NODE support?

SQLite is generally fine when using it with multiple concurrent processes. From the SQLite FAQ:
We are aware of no other embedded SQL database engine that supports as much concurrency as SQLite. SQLite allows multiple processes to have the database file open at once, and for multiple processes to read the database at once. When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business. Other embedded SQL database engines typically only allow a single process to connect to the database at once.
For the majority of applications, that should be fine. If only one of your processes is doing writes, and the other only reads, it should have no impact at all.
If you're looking for something that's NoSQL-specific, you can also consider LevelDB, which is used in Google Chrome. With Node, the best way to access it is through the levelup library.

Loading Streaming Data from RabbitMQ to Postgres in Parallel

I'm still somewhat new to Node.js, so I'm not as conversant in how parallelism works with concurrent I/O operations as I'd like to be.
I'm planning a Node.js application to load streaming data from RabbitMQ to Postgres. These loads will happen during system operation, so it is not a bulk load.
I expect throughput requirements to be fairly low to start (maybe 50-100 records per minute). But I'd like to plan the application so it can scale up to higher volumes as the requirements emerge.
I'm trying to think through how parallelism would work. My first impressions of flow and how parallelism would be introduced is:
Message read from the queue
Query to load data into Postgres kicked off, which pushes callback to the Node stack
Event loop free to read another message from the queue, if available, which will launch another query
Repeat
I believe the queries kicked off in this fashion will run in parallel up to the number of connections in my PG connection pool. Is this a good assumption?
With this simple flow, the limit on parallel queries would seem to be the size of the Postgres connection pool. I could make that as big as required for throughput (and that the server and backend database can handle) and that would be the limiting factor on how many messages I could process in parallel. Does that sound right?
I haven't located a great reference on how many parallel I/Os Node will instantiate. Will Node eventually block as my event loop generates too many I/O requests that aren't yet resolved (if not, I assume pg will put my query on the callback stack when I have to wait for a connection)? Are there dials I can turn to affect these limits by setting switches when I launch Node? Am I assuming correctly that libuv and the "pg" lib will in fact run these queries in parallel within one Node.js process? If those assumptions are correct, I'd think I'd hit connection pool size limits before I'd run into libuv parallelism limits (or possibly at the same time if I size my connection pool to the number of cores on the server).
Also, related to the discussion above about Node launching parallel I/O requests, how do I prevent Node from pulling messages off the queue as quick as they come in and queuing up I/O requests? I'd think at some point this could cause problems with memory consumption. This relates back to my question about startup parameters to limit the amount of parallel I/O requests created. I don't understand this too well at this point, so maybe it's not a concern (maybe by default Node won't create more parallel I/O requests than cores, providing a natural limit?).
The other thing I'm wondering is when/how running multiple copies of this program in parallel would help? Does it even matter on one host since the Postgres connection pool seems to be the driver of parallelism here? If that's the case, I'd probably only run one copy per host and only run additional copies on other hosts to spread the load.
As you can see, I'm trying to get some basic assumptions right before I start down this road. Insight and pointers to good reference doc would be appreciated.

I resolved this with a test of the prototype I wrote. A few observations:
If I don't set pre-fetch on the RabbitMQ channel, Node will pull ALL the messages off the queue in seconds. I did a test with 100K messages off the queue and Node pulled all 100K off in seconds, though it took many minutes to actually process the messages.
The behavior mentioned in #1 above is not desireable, because then Node must cache all the messages in memory. In my test, Node took up 2GB when pulling down all those message quickly, whereas if I set pre-fetch to match the number of database connections, Node took up only 80 MB and drained the queue slowly, as it finished processing the messages and sent back ACKs.
A single instance of Node running this program kept my CPUs 100% utilized.
So, the morals of the story seem to be:
Node can spawn any number of async I/O handlers (limited by available memory)
In a case like this, you want to limit how many async I/O requests Node spawns to avoid excessive memory usage.
Creating additional child processes for this workload made no difference. The unit of parallelism was the size of the database connection pool. If my workload did more in JavaScript instead of just delegating to Postgres, additional child processes would help. But in this case, it's all I/O (and thankfully I/O that doesn't need the Node threadpool), so the additional child processes do nothing.

Hardware importance on asynchronous JVM server performance

I am running a Finatra server (https://github.com/capotej/finatra) which a sinatra inspired web framework for scala built on top of Finagle (an asynchronous RPC system).
The application should be design to receive something between 10 and 50 requests concurrently. Each request is quite CPU intensive, mostly due to parsing and serializing large JSON's and operation on arrays, like sorting, grouping, etc...
Now I am wondering what is the impact of the following parameters on performance and how to combine them :
RAM of the server
Number of cores of the server
JVM heap size
Number of threads run on parallel in my Future Pool
As a partial response, I would say :
JVM heapsize should me tuned depending on RAM
Having multiple cores improves performance under concurrent workload but does not really speed up processing of a single request.
Having large RAM, on the contrary, can notably speed up execution of a single request
Number of threads in my Future Pool must be tuned according to my number of cores.
EDIT
I want to compare performance regardless of the the code, only focusing on hardware/threading model. Let's assume the code is already optimized. Additional information :
I am building a data reporting API. Processing time of a request largely depends of the dataset I am manipulating. For big datasets, it can hit 10 seconds max.
I retrieve most of the data from third party API but I am also accessing a MySQL database with a c3po connection pooling mechanism. Execution of the request is additionally delegated to a Future Pool to prevent blocking.
No disk IO excluding MySQL
I don't want to cache anything on the server side because I need to
work with fresh data.
Thanks !!!

The performance and overall behaviour will still depend on your own code, outside of the framework you are using. In other words, you have correctly listed the major factors which will influence performance, but your own code will have such a significant impact on it that it's almost impossible to tell in advance.
Offhand, I'd say that you need to characterize some things about your application in more detail:
You say that each request will be CPU intensive, but what do you mean by it? Will each request take 1 ms? 10 ms? 100 ms?
Do you access a database? What are the characteristics of your database?
Either with the database or without it, do you have any disk IO? How significant is it?
... but if your application is really simple, does not hit the disk much (or at all.. your request may be read-only and everything gets cached), and you are CPU-bound, simply sticking enough CPU cores in your server will be the most significant thing you can do.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string