How to Log Data in a Realtime Linux Application?

I am working with the 4.4.12-rt19 realtime (PREEMPT_RT) Linux kernel patch.
I have a realtime application written in C with separate processes running on separate cores, taking in data from the network, computing on that data, and then logging the results. I am attempting to log on the order of 10 KB of data to file per 1 ms tick.
The logging process has access to all of the incoming data in shared memory. Right now, I am using sqlite3 and sqlite3async to buffer the database to memory in one thread of the logging process and then commit the in-memory instance to file every second with a call to sqlite3async_run().
The problem is that during part of the sqlite3async_run() execution, the sqlite3_step() command to write to the in-memory database buffer hangs and violates my 1ms timing guarantee.
I am not sure whether the problem comes from how threaded processes behave in a realtime environment or from how sqlite3async works. As far as I can tell, sqlite3async is supposed to buffer the database in memory using the sqlite3 virtual file system and then handle the actual file write with a background thread (as detailed here). I have tried changing the scheduling priorities and niceness values of each thread, to no avail.
Any help or suggestions would be greatly appreciated!

Using sqlite3async does not remove the delays associated with writing; it just defers them until later, when you can afford them.
Consider using WAL mode. There, you have the same delay when doing a checkpoint, but the WAL is stored on disk, so you can defer the checkpoint for arbitrarily long times without running out of memory (at the cost of the WAL file becoming arbitrarily large).
If writing in WAL mode is still too slow, you have to implement your own FIFO, and let another thread continuously empty it. (If that thread moves the data out of the FIFO before actually writing, the FIFO is never locked for a long time.)
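To make the FIFO suggestion concrete, here is a minimal sketch in C, assuming pthreads and the sqlite3 C API (the record layout, flush interval, and INSERT statement are placeholders, not the poster's actual schema). The realtime thread only links a record into an in-memory list while briefly holding a mutex; a separate writer thread detaches the whole list before doing the slow database work.

    /* Sketch: realtime producer appends to an in-memory FIFO; a writer thread
     * detaches the whole list under the lock, then inserts and commits
     * without holding it. */
    #include <pthread.h>
    #include <sqlite3.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    struct record {
        struct record *next;
        size_t len;
        char payload[10 * 1024];              /* ~10 KB per 1 ms tick */
    };

    static struct record *fifo_head, *fifo_tail;
    static pthread_mutex_t fifo_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Realtime path: bounded work; the lock protects only two pointer updates. */
    void log_record(const void *data, size_t len)
    {
        struct record *r = malloc(sizeof *r); /* for hard realtime, preallocate a pool instead */
        if (!r)
            return;
        r->len = len < sizeof r->payload ? len : sizeof r->payload;
        memcpy(r->payload, data, r->len);
        r->next = NULL;

        pthread_mutex_lock(&fifo_lock);
        if (fifo_tail)
            fifo_tail->next = r;
        else
            fifo_head = r;
        fifo_tail = r;
        pthread_mutex_unlock(&fifo_lock);
    }

    /* Writer thread: swap the list out, then do the slow work unlocked. */
    void *writer_thread(void *arg)
    {
        sqlite3 *db = arg;
        for (;;) {
            pthread_mutex_lock(&fifo_lock);
            struct record *batch = fifo_head;
            fifo_head = fifo_tail = NULL;
            pthread_mutex_unlock(&fifo_lock);

            sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
            while (batch) {
                /* ... bind batch->payload to a prepared INSERT and sqlite3_step() it ... */
                struct record *next = batch->next;
                free(batch);
                batch = next;
            }
            sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
            usleep(1000 * 1000);              /* or wait on a condition variable */
        }
        return NULL;
    }

Because the writer detaches the list head under the lock and only then runs the inserts and the commit, the 1 ms path never waits on disk I/O, only on a pointer swap.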

Related

Is reading the same file blocking in NodeJS?

We have a backend expressjs server that will read off of the disk for many files whenever a front-end client connects.
At the OS level, are these reads blocking?
I.E., if two people connect at the same time, will whoever gets scheduled second have to wait to read the file until the first person who is currently reading it finishes?
We are just using fs.readFile to read files.
EDIT: I'm implementing caching anyway (it's a legacy codebase, don't hate me), I'm just curious if these reads are blocking and this might improve response time from not having to wait until the file is free to read.
fs.readFile() is not blocking for nodejs. It's a non-blocking, asynchronous operation. While one fs.readFile() operation is in progress, other nodejs code can run.
If two fs.readFile() calls are in operation at the same time, they will both proceed in parallel.
Nodejs itself uses a native OS thread pool with a default size of 4 for file operations, so it will support up to 4 file operations in parallel. Beyond 4, it queues additional operations; when one of the 4 finishes, the next one in line starts to execute.
Within the OS, these different threads are time-sliced to achieve parallel operation. But at the disk controller itself, for a spinning drive, only one read operation can occur at a time, because the disk head can only be on one track at any given moment. So the underlying read operations, if they target different parts of a spinning disk, will eventually be serialized at the disk controller as it moves the disk head to read from a given track.
But, if two separate reads are trying to read from the same file, the OS will typically cache that info so the 2nd read won't have to read from the disk again, it will just get the data from an OS cache.
I inherited this codebase and am going to implement some caching anyway, but was just curious if caching would also improve response time since we would be reading from non-blocking process memory instead of (potentially) blocking filesystem memory.
OS file caching is heavily, heavily optimized (it's a problem operating systems have spent decades working on). Implementing your own level of caching on top of the OS isn't where I would think you'd find the highest bang for the buck for improving performance. While there may be a temporary lock used in the OS file cache, that lock would only exist for the duration of a memory copy from the cache to the target read location, which is really, really short. Probably not something anyone would notice. And that temporary lock is not blocking nodejs at all.

Why does reading with createReadStream take varying amounts of time on a local node?

I am using the createReadStream API to read huge amounts of data from a file.
Sometimes it takes 7 seconds to read, while other times it takes 30 seconds.
I would like to understand why reading the same file with the same data takes more time in some instances and less time in others. Why is the time not fixed, given that I am reading the same file with the same data?
Here are some possible reasons:
1. Disk caching. The OS has a disk cache and uses it when it knows it is safe to use the cache. This will often make the first read of some data slower (because it's being read directly from the disk) and later reads of the same data faster (if the OS thinks it can be successfully cached and is not too large to be cached and it stays in the cache).
2. Nodejs event loop unpredictability. Reading a large set of data will necessarily require reading many blocks of the file, and each of those read operations will go through the nodejs event loop. If there are other events also being inserted into the event loop, the disk-read-related events may sometimes have to wait their turn.
3. Garbage collector unpredictability. If you're dealing with large amounts of data (even if not all in memory at once), you may be creating lots of objects in the nodejs heap, many of which need to be garbage collected. Eventually, the garbage collector will have to run and may introduce a bit of a pause in the execution of your code. If this happens multiple times during an operation, this could become noticeable.
4. Disk busyness variability. A disk read/write head (assuming this is a spinning disk and the OS is reading from the actual disk) can only be on one track at a time. If it's busy reading something else that the OS asked it to read, your request may have to wait for some prior requests to finish. This wouldn't typically add up to many seconds, but it can lead to some variability. As an example of a worst case, the OS could be running a defrag operation on your hard drive which your disk operations would have to interleave with.
5. OS/CPU busyness. If the OS or CPU is busy doing something else, your app may not be getting full cycles to run.
6. Nodejs threadpool busy. Nodejs uses a threadpool with a default size of 4 for disk operations. If you happen to have multiple disk operations (or other operations that use the thread pool) in flight at the same time and max out the threadpool, then your operation may have to wait for some previous operation to finish before you get allocated a thread to run your disk operation in. The size of the threadpool is customizable, but making it larger than the number of actual CPU cores you have is probably not helpful.

Poor performance from SQLite, big writes bring little reads to a crawl

Related question: How to use SQLite in a multi-threaded application.
I've been trying to get decent performance out of SQLite3 in a multi-threaded program. I've been very impressed with its performance except for write latency. That's not its fault; it has to wait for the disk to spin to commit the data. But having reads blocked during those writes, even if they could be served from cache, is pretty intolerable.
My use case involves a large number of small read operations to get one tiny object by an indexed field, but latency is important for these operations because there are a lot of them. Writes are large and are accumulated into a single transaction. I don't want reads to have huge latency due to completing writes.
I first just used a single connection with a mutex to protect it. However, while the writing thread is waiting for the transaction to complete, readers are blocked on disk I/O because they can't acquire the mutex until the writer releases it. I tried using multiple connections, but then I get SQLITE_LOCKED from sqlite3_step, which means having to redesign all the reading code.
My write logic currently looks like this:
Acquire connection mutex.
START TRANSACTION
Do all writes. (Typically 10 to 100 small ones.)
END TRANSACTION -- here's where it blocks
Release mutex.
Is there some solution I'm not aware of? Is there an easy way to keep my readers from having to wait for the disk to finish rotating if the entry is in cache without having to rewrite all my reading code to handle SQLITE_LOCKED, reset, and retry?
To allow multiple readers and one writer to access the database concurrently, enable write-ahead logging.
WAL works well with small transactions, so you don't need to accumulate writes.
Please note that WAL does not work with networked file systems, and for optimal performance, requires regular checkpointing.
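For reference, a sketch of what that might look like with the C API (an assumed setup, not the asker's code; everything beyond journal_mode is optional tuning):

    /* Sketch: enable write-ahead logging and control checkpointing explicitly.
     * "app.db" is a placeholder path. */
    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("app.db", &db) != SQLITE_OK)
            return 1;

        /* Readers and the single writer no longer block each other. */
        sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);

        /* Optional: don't fsync on every commit (safe against app crashes,
         * may lose the last transactions on power failure). */
        sqlite3_exec(db, "PRAGMA synchronous=NORMAL;", NULL, NULL, NULL);

        /* Optional: turn off automatic checkpoints and run them yourself
         * from a background thread at a convenient time. */
        sqlite3_exec(db, "PRAGMA wal_autocheckpoint=0;", NULL, NULL, NULL);
        int in_wal = 0, checkpointed = 0;
        sqlite3_wal_checkpoint_v2(db, NULL, SQLITE_CHECKPOINT_PASSIVE,
                                  &in_wal, &checkpointed);
        printf("WAL frames: %d, checkpointed: %d\n", in_wal, checkpointed);

        sqlite3_close(db);
        return 0;
    }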
First of all, SQLite offers multi-threaded support on its own. You do not have to use your own mutexes; they only slow the entire program down. Consult the SQLite threading options if you have any doubts.
Using the write-ahead log may solve your problems, but it is a double-edged sword. As long as there is a read ongoing, the inserted data will not be written to the main database file and the WAL journal will grow. This is covered in detail in Write-Ahead Logging.
I am using sqlite in WAL mode in one of my applications. For small amounts of data it works well. However, when there is a lot of data (several hundred inserts per second, in peaks even more) I experience some issues which I don't seem to be able to fix through any meddling with sqlite configuration.
What you may consider is using several database files, each assigned to a certain time span. This will be applicable only when your queries depend on time.
I am probably getting too far ahead of myself, though. The WAL journal should help :)
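As a sketch of letting SQLite handle the locking itself rather than wrapping one shared connection in an application mutex (one connection per thread is assumed; the filename and flag choice are illustrative):

    /* Sketch: open one connection per thread and let SQLite's own mutexes
     * and busy handler deal with contention.  "app.db" is a placeholder. */
    #include <sqlite3.h>

    sqlite3 *open_per_thread_connection(void)
    {
        sqlite3 *db = NULL;
        /* SQLITE_OPEN_NOMUTEX: "multi-thread" mode, safe as long as each
         * connection is used by only one thread at a time.  Use
         * SQLITE_OPEN_FULLMUTEX ("serialized") if connections may be shared. */
        int rc = sqlite3_open_v2("app.db", &db,
                                 SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE |
                                 SQLITE_OPEN_NOMUTEX,
                                 NULL);
        if (rc != SQLITE_OK) {
            sqlite3_close(db);
            return NULL;
        }
        /* Wait up to a second instead of failing immediately with
         * SQLITE_BUSY when another connection briefly holds a lock. */
        sqlite3_busy_timeout(db, 1000);
        return db;
    }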

Reducing seek times when reading many small files

I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
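A rough sketch of that approach in C (file handling, sizes, and error checks are simplified; compile with -lrt):

    /* Sketch: queue one aio_read per file up front, then collect completions.
     * NFILES and FILE_SIZE are arbitrary illustration values. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NFILES    64              /* reads kept in flight at once */
    #define FILE_SIZE (500 * 1024)    /* ~500 KB per file, per the question */

    int main(int argc, char **argv)
    {
        static struct aiocb cbs[NFILES];
        static const struct aiocb *list[NFILES];

        int n = (argc - 1 < NFILES) ? argc - 1 : NFILES;
        for (int i = 0; i < n; i++) {
            memset(&cbs[i], 0, sizeof cbs[i]);
            cbs[i].aio_fildes = open(argv[i + 1], O_RDONLY);
            cbs[i].aio_buf    = malloc(FILE_SIZE);
            cbs[i].aio_nbytes = FILE_SIZE;
            cbs[i].aio_offset = 0;
            aio_read(&cbs[i]);        /* queue the read; returns immediately */
            list[i] = &cbs[i];
        }

        for (int done = 0; done < n; ) {
            aio_suspend(list, n, NULL);   /* sleep until something completes */
            for (int i = 0; i < n; i++) {
                if (list[i] && aio_error(&cbs[i]) != EINPROGRESS) {
                    ssize_t got = aio_return(&cbs[i]);
                    /* ... process `got` bytes of 4 KB records here ... */
                    (void)got;
                    close(cbs[i].aio_fildes);
                    list[i] = NULL;   /* NULL entries are ignored by aio_suspend */
                    done++;
                }
            }
        }
        return 0;
    }

In a real program you would refill the in-flight window from the remaining worklist as reads complete rather than stopping at NFILES.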
A very simple approach, although no results guaranteed. Open as many of the files at once as you can and read all of them at once - either using threads or asynchronous I/O. This way the disk scheduler knows what you read and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
The alternative is to try to do the heavy lifting yourself. Unfortunately this involves a difficult step: getting the mapping of the files to physical blocks. There is no standard interface to do that; you could probably extract the logic from something like e2fsprogs or the kernel FS driver. And it involves reading the physical device underlying a mounted filesystem, while the filesystem may be writing to it at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
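For what it's worth, on Linux one non-portable way to get at least a rough version of that mapping (not mentioned in the answer above) is the FIBMAP ioctl; a sketch, assuming root privileges and ignoring sparse files and extents:

    /* Sketch: print the physical block behind each logical block of a file
     * using the Linux-specific, root-only FIBMAP ioctl. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>             /* FIBMAP, FIGETBSZ */

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;

        int blocksize = 0;
        ioctl(fd, FIGETBSZ, &blocksize);     /* filesystem block size */
        if (blocksize <= 0)
            return 1;

        off_t size = lseek(fd, 0, SEEK_END);
        long nblocks = (size + blocksize - 1) / blocksize;

        for (long i = 0; i < nblocks; i++) {
            int block = (int)i;              /* in: logical block index */
            ioctl(fd, FIBMAP, &block);       /* out: physical block number */
            printf("%s: logical %ld -> physical %d\n", argv[1], i, block);
        }
        close(fd);
        return 0;
    }

Sorting the files by the physical block of their first logical block would then give a read order that roughly follows the on-disk layout.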
Could I recommend using an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.
Since the operations are similar and the data are independent, you can try using a thread pool to submit jobs, where each job works on a number of files (possibly a single file). An idle thread then completes a single job. This might help overlap I/O operations with computation.
A simple way would be to keep the original program but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache (a unix/linux system uses all "free" memory as disk buffer).
The main task will stay a few files behind it (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1. fetch filename from worklist
2. if empty, goto 7 (drain the internal queue)
3. (maybe) fork a worker process or thread
4. add to prefetch queue
5. add to internal queue
6. if fewer than XXX items on internal queue, goto 1
7. fetch filename from internal queue
8. process it
9. goto 1
For the slave processes:
1. fetch from queue
2. if empty: quit
3. prefetch file
4. loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or to use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads/processes.
As a simplification: if seeking within the files is not required (only sequential access), the slave processes could consist of the equivalent of dd if=name bs=500K, which could be wrapped into a popen() or a pipe+fork().
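A sketch of what one of those slave prefetchers could look like in C (the equivalent of the dd invocation; the buffer size is arbitrary, and errors are simply ignored):

    /* Sketch: read a file and throw the data away, purely to populate the
     * OS page cache before the main task gets to it. */
    #include <fcntl.h>
    #include <unistd.h>

    static void prefetch_file(const char *path)
    {
        char buf[64 * 1024];
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return;
        while (read(fd, buf, sizeof buf) > 0)
            ;                         /* discard; the page cache keeps it */
        close(fd);
    }

On Linux, readahead() or posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) could achieve a similar effect without copying the data to user space.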

Transferring data between process calls

I have a Linux process that is being called numerous times, and I need to make this process as fast as possible.
The problem is that I must maintain a state between calls (load data from previous call and store it for the next one), without running another process / daemon.
Can you suggest fast ways to do so? I know I can use files for I/O, and would like to avoid it, for obvious performance reasons. Should (can?) I create a named pipe to read/write from and by that avoid real disk I/O?
Pipes aren't appropriate for this. Use posix shared memory or a posix message queue if you are absolutely sure files are too slow - which you should test first.
In the shared memory case your program creates the segment with shm_open() if it doesn't exist or opens it if it does. You mmap() the memory and make whatever changes and exit. You only shm_unlink() when you know your program won't be called anymore and no longer needs the shared memory.
With message queues, just set up the queue. Your program reads the queue, makes whatever changes, writes the queue, and exits. Call mq_unlink() when you no longer need the queue.
Both methods have kernel persistence so you lose the shared memory and the queue on a reboot.
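For the shared-memory variant, a minimal sketch (the segment name and the state struct are placeholders; link with -lrt on older glibc):

    /* Sketch: keep state across short-lived invocations in a POSIX shared
     * memory segment. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct state { long counter; };

    int main(void)
    {
        int fd = shm_open("/myprog_state", O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return 1;
        ftruncate(fd, sizeof(struct state));         /* sizes it on first creation */

        struct state *st = mmap(NULL, sizeof *st, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
        if (st == MAP_FAILED)
            return 1;

        st->counter++;               /* load previous state, do the work, store new state */
        printf("invocation #%ld\n", st->counter);

        munmap(st, sizeof *st);
        close(fd);
        /* call shm_unlink("/myprog_state") only when the state is no longer needed */
        return 0;
    }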
It sounds like you have a process that is continuously executed by something.
Why not create a factory that spawns the worker threads?
The factory could provide the workers with any information needed.
... I can use files for I/O, and would like to avoid it, for obvious performance reasons.
I wonder what are these reasons please...
Linux caches files in kernel memory in the page cache. Writes go to the page cache first; in other words, a write() syscall is a kernel call that only copies the data from user space to the page cache (it is a bit more complicated when the system is under stress). Some time later, pdflush writes the data to disk asynchronously.
File read() first checks the page cache to see if the data is already available in memory to avoid a disk read. What it means is that if one program writes data to files and another program reads it, these two programs are effectively communicating via kernel memory as long as the page cache keeps those files.
If you want to avoid disk writes entirely, that is, the state does not need to be persisted across OS reboots, those files can be put in /dev/shm or in /tmp, which are normally the mount points of in-memory filesystems.
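A minimal sketch of that idea (the path and the state struct are made up for illustration): ordinary file I/O, but against a tmpfs mount, so nothing ever touches the disk:

    /* Sketch: persist state between invocations in a file on tmpfs. */
    #include <stdio.h>

    struct state { long counter; };

    int main(void)
    {
        const char *path = "/dev/shm/myprog.state";  /* tmpfs: memory-backed */
        struct state st = {0};

        FILE *f = fopen(path, "rb");                 /* load previous state, if any */
        if (f) { fread(&st, sizeof st, 1, f); fclose(f); }

        st.counter++;                                /* do the real work here */

        f = fopen(path, "wb");                       /* store state for the next call */
        if (f) { fwrite(&st, sizeof st, 1, f); fclose(f); }
        printf("call #%ld\n", st.counter);
        return 0;
    }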
