how to mmap() a queue? - multithreading

I have the following problem:
I have created a queue. The addition of elements (malloc) is done by the main() function, and I have created a thread which processes the elements/data and frees them.
This is a continuous process, and it will continue till I kill the process.
Now, if I kill the process the data in the queue will be lost, so I was thinking about implementing mmap() on it, so that the queue is also stored in a regular file, and when I restart the process the data will be reloaded into memory for further processing by the thread...
Since I am malloc'ing and free'ing memory, I suppose the mmapped file size will grow or shrink continuously.
Now, is this possible to implement, or should I consider other options?
Thanks.
EDIT1: Can I use lseek() or ftruncate() to resize the file?

You can certainly put a queue (or any other data structure) into mmap()ed memory instead of heap memory, but you will run into several problems which you must overcome:
You will have to do all of the memory management in the memory block corresponding to the mmap()ed file yourself. Unless your queue data structure is one monolithic block of memory, it probably has nodes and pointers which can get created, removed, and relocated. With heap memory you can delegate the task of allocating and freeing small blocks of memory to malloc() and free(), including reusing memory that has been freed for use by new nodes in your data structure. In your mmap()ed file you will have to do all of this yourself.
You won't be able to use pointers inside your mmap()ed block of memory, only offsets from the start of the block. This is because if the block is detached and reattached in another process it probably won't be at the same memory address. Since data structure access is done using pointers, you will have the overhead of constantly transforming offsets to pointers and back by adding or subtracting the mmap() block's base address (see the sketch after this list).
If you want to be able to recover by reattaching the block in another process after the first process has been killed, you will have to be prepared for the case where the first process was killed in the middle of a critical section while one or more invariants of the data structure were temporarily violated. In other words, you might be reattaching a corrupt or inconsistent data structure. To support this fully correctly you will have to be very careful with the kinds of mutations you perform on your data structure.
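To illustrate the second problem, here is a minimal sketch of offset-based addressing in C; the type names and the fixed payload size are illustrative, and this is nowhere near a complete allocator:

#include <stddef.h>
#include <stdint.h>

/* All "pointers" inside the mapped file are byte offsets from the
 * base of the mapping; 0 serves as the null offset. */
typedef uint64_t file_off;

struct node {
    file_off next;          /* offset of the next node, 0 if none */
    char     payload[256];
};

/* Translate between offsets and pointers using the mapping's base
 * address, which may differ on every attach. */
static inline struct node *off_to_ptr(void *base, file_off off)
{
    return off ? (struct node *)((char *)base + off) : NULL;
}

static inline file_off ptr_to_off(void *base, struct node *p)
{
    return p ? (file_off)((char *)p - (char *)base) : 0;
}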
All in all, my recommendation is that it's not worth it. Use a fast, efficient, easy-to-use data structure in heap memory, and once in a while save a serialized snapshot to a regular file. If you have to recover, recover from the last known good snapshot.
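A common way to make such snapshots crash-safe, sketched below, is to write to a temporary file, fsync(), then rename() into place; rename() is atomic on POSIX filesystems, so a crash mid-write never clobbers the previous good snapshot (file names are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int save_snapshot(const void *buf, size_t len)
{
    int fd = open("queue.snap.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* Write and flush the whole snapshot before it becomes visible. */
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink("queue.snap.tmp");
        return -1;
    }
    close(fd);
    /* rename() is atomic: readers see either the old or the new file. */
    return rename("queue.snap.tmp", "queue.snap");
}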

Related

How to synchronize `vkMapMemory`?

vkMapMemory states:
vkMapMemory does not check whether the device memory is currently in use before returning the host-accessible pointer. The application must guarantee that any previously submitted command that writes to this range has completed before the host reads from or writes to that range, and that any previously submitted command that reads from that range has completed before the host writes to that region.
It links to this site which sadly doesn't seem to exist yet. I am wondering how I would synchronize this?
Basically I need to worry about two things:
Only one thread accesses the same range at the same time.
The GPU isn't currently trying to read the range.
The only real way that I see to synchronize this is with a thread-safe list. Every time you want to write/read to/from that buffer, you have to add the memory range you are trying to read or write to that thread-safe list.
That means when you want to access that buffer, you need to lock that list and search for the range you are trying to access.
Is that how you would synchronize vkMapMemory or are there other ways to do this?
The only time the GPU will try to access the mapped memory is when a command buffer accessing that memory has been submitted. That memory will be in use until the associated vkFence has been signaled.
A fully general solution would be to track every memory access by the GPU and surround each CPU mapped-memory access with a begin/end pair that waits on the appropriate fences and calls flush/invalidate as needed. This is a lot of state tracking and a plethora of potentially blocking calls.
However, for persistent mesh/texture data you only need to write to a mapped staging buffer and then copy to a device-local, non-host-visible buffer. You shouldn't need to do this often, so a single fence tracking whether a copy from the staging buffer is in flight is enough. For data that needs to survive only a single frame (per-object transforms), you can use a ring buffer, and the same goes for readback of GPU occlusion-test or compute results.
I hope you can see the pattern emerging. Use just a few mapped ring buffers, be very conscious about when they are used by the GPU, and then you just need to keep a small array of vkFence+offset+size entries per ring buffer to ensure no data hazard occurs.
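As a rough illustration of that bookkeeping (not from any particular engine; all names are made up), a per-region record and an overlap check might look like:

#include <stddef.h>
#include <stdint.h>
#include <vulkan/vulkan.h>

/* One in-flight region of a persistently mapped ring buffer,
 * guarded by the fence of the submission that uses it. */
struct region {
    VkFence      fence;   /* VK_NULL_HANDLE once the region is free */
    VkDeviceSize offset;
    VkDeviceSize size;
};

/* Before the CPU reuses [off, off+len), wait on any overlapping
 * region that is still in flight on the GPU. */
static void wait_for_region(VkDevice dev, struct region *r, size_t n,
                            VkDeviceSize off, VkDeviceSize len)
{
    for (size_t i = 0; i < n; i++) {
        if (r[i].fence == VK_NULL_HANDLE)
            continue;                       /* already reusable */
        if (off < r[i].offset + r[i].size && r[i].offset < off + len) {
            /* Overlap: block until the submission using it is done. */
            vkWaitForFences(dev, 1, &r[i].fence, VK_TRUE, UINT64_MAX);
            r[i].fence = VK_NULL_HANDLE;
        }
    }
}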

Is it necessary to synchronize access to shared memory if one process is read-only?

I have two processes, one of which reads and writes data to a shared memory segment, while the second one only reads (and it's not critical if the data changes during reading). Is there any danger in not using synchronization between them?

C++/Linux: Using c++11 atomic to avoid partial read on dual-mapped mmap region

I have a program with two threads. One thread (Writer Thread) writes to a file while the other (Reader Thread) consumes the data from the first. In the program, the same region of the file is mapped twice: once with read & write permission for the Writer Thread, and once with read-only permission for the Reader Thread. (The two mapped regions have different pointers/virtual addresses from mmap, as expected.) I attempt to use a C++11 atomic to control the memory order.
Here is what I have in my mind:
Writer Thread:
Create the data content (fixed size) in the memory mapped region with write permission.
Update the atomic variable with release memory order.
Reader Thread:
Continuously poll the atomic variable with acquire memory order until there are new messages.
If there is an outstanding message, read the data from the read only memory mapped region.
Questions
Even though the read-only mmap region and the writable mmap region refer to the same file region, they have different virtual memory addresses. Could the atomic variable protect against a partial read here? (I.e., if the reader thread sees that the atomic variable has been updated, with acquire semantics, could the read-only memory region still contain only a partial message, or no visible message at all?) (It seems to me that if the two virtual ranges are mapped to the same physical memory page(s), it should work.)
What if the Reader Thread uses the read() system call instead of the read-only mmap region? Could the atomic variable still prevent partial reads?
I have written a test program that seems to work. However, I would like to be advised by a more experienced programmer/Linux expert whether it should work. Thanks!
Using different virtual memory ranges does not change things here. For proof, note that atomic operations work just fine between two processes using the same shared memory, where each process may have it mapped at a different virtual address.
What is important is that it references the same piece of physical memory.
The read() system call does not do anything to lock memory or access it atomically. It is simply a memcpy() done in the kernel from the file cache to user space. If another CPU is writing into that memory it can get a partial read.
Your scenario is perfectly valid and safe. The release in the writer thread and the acquire in the reader thread guarantee a happens-before relation. To quote the standard:
29.3/2: An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.
As a side note, your bottleneck will most probably be the file operations, so using atomics instead of a mutex probably won't have an observable effect on performance.
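For concreteness, here is a sketch of the publish/consume protocol described in the question, written with C11 atomics (same release/acquire semantics as the C++11 ones discussed). The message size and names are illustrative, and the sequence counter is an ordinary process global, since both threads share one address space:

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MSG_SIZE 64   /* illustrative fixed message size */

/* wr_base: read/write mmap view; rd_base: read-only mmap view of
 * the same file region. */
static _Atomic uint64_t seq;

void writer_publish(char *wr_base, const char *msg, uint64_t n)
{
    memcpy(wr_base + n * MSG_SIZE, msg, MSG_SIZE);            /* 1. data  */
    atomic_store_explicit(&seq, n + 1, memory_order_release); /* 2. flag  */
}

/* Returns 1 and copies one message into out if something new was
 * published, 0 otherwise; *seen counts messages consumed so far. */
int reader_poll(const char *rd_base, char *out, uint64_t *seen)
{
    uint64_t n = atomic_load_explicit(&seq, memory_order_acquire);
    if (n == *seen)
        return 0;                       /* nothing new */
    memcpy(out, rd_base + *seen * MSG_SIZE, MSG_SIZE);
    (*seen)++;
    return 1;
}

The acquire load that observes seq == n+1 synchronizes with the release store, so every memcpy() the writer performed before that store is visible to the reader, even through the second mapping.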
It does not seem like you need atomics here at all. What you need is a volatile variable. Atomicity would be ensured by the OS itself, since the memory is backed by a file.
EDIT.
I see a bunch of people downvoted this answer, and in the absence of meaningful comments I will assume people didn't understand it and just reacted to the suggested usage of volatile in the context of a multithreaded application. I will try to explain my position.
Reads and writes to file-backed memory are atomic as long as the corresponding read() and write() system calls are atomic, and those are atomic as long as the buffer size does not exceed PIPE_BUF (4K on Linux, if memory serves). They also guarantee ordering. So, as long as you are memcpying chunks of less than 4K, you are good, provided the access actually makes it past compiler optimizations.
Volatile is needed exactly for this: to prevent the compiler from optimizing away reads and writes to this memory. It is used exactly as prescribed.
On a side note, with exactly the same design on AIX, we've seen a huge performance degradation compared to a slightly modified design where the writer uses write() to update the memory-mapped file directly. Not sure if it is an AIX quirk, but if performance is important, you might want to do some benchmarking.

Reducing seek times when reading many small files

I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all the reads in the filesystem read queue at once, and the filesystem implementation is then free to complete them in a way that minimizes seeks.
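A sketch of that idea with POSIX AIO (batch size and file size are illustrative, error handling is trimmed, and glibc needs -lrt):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFILES 100          /* batch size, tune to taste */
#define FILESZ (500 * 1024) /* ~500KB per file           */

int read_batch(char *names[])
{
    static struct aiocb cbs[NFILES];

    /* Queue every read at once; the kernel may complete them in
     * whatever order minimizes seeks. */
    for (int i = 0; i < NFILES; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes = open(names[i], O_RDONLY);
        cbs[i].aio_buf    = malloc(FILESZ);
        cbs[i].aio_nbytes = FILESZ;
        aio_read(&cbs[i]);
    }

    /* Collect completions and process each buffer. */
    for (int i = 0; i < NFILES; i++) {
        const struct aiocb *p = &cbs[i];
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(&p, 1, NULL);
        /* ... process aio_return(&cbs[i]) bytes at cbs[i].aio_buf ... */
        free((void *)cbs[i].aio_buf);
        close(cbs[i].aio_fildes);
    }
    return 0;
}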
A very simple approach, although results aren't guaranteed: open as many of the files at once as you can and read all of them at once, either using threads or asynchronous I/O. This way the disk scheduler knows what you read and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
The alternative is to do the heavy lifting yourself. Unfortunately this involves a difficult step: getting the mapping of the files to physical blocks. There is no standard interface for that; you could probably extract the logic from something like e2fsprogs or the kernel FS driver. It also involves reading the physical device underlying a mounted filesystem, which can be written to at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
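For what it's worth, Linux does expose a non-standard interface for part of this: the FIEMAP ioctl reports a file's physical extents without touching the raw device. A Linux-specific sketch that fetches the first physical block as a sort key:

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Return the physical byte address of the file's first extent, or
 * 0 on failure; sort the files by this key, then read in that order. */
uint64_t first_physical_block(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;

    /* Room for the header plus exactly one extent record. */
    struct fiemap *fm = calloc(1, sizeof *fm + sizeof(struct fiemap_extent));
    if (!fm) {
        close(fd);
        return 0;
    }
    fm->fm_length = ~0ULL;      /* map the whole file     */
    fm->fm_extent_count = 1;    /* we only want extent 0  */

    uint64_t phys = 0;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
        phys = fm->fm_extents[0].fe_physical;

    free(fm);
    close(fd);
    return phys;
}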
May I recommend using an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.
Since the operations are similar and the data are independent, you can use a thread pool to submit jobs that each work on a number of files (or a single file), and have an idle thread complete a job at a time. This may help overlap I/O operations with computation.
A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache (a unix/linux system uses all "free" memory as disk buffer).
The main task will stay a few files behind (say, ten). The hard part is keeping things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1. fetch filename from worklist
2. if empty, goto 7
3. (maybe) fork a worker process or thread
4. add to prefetch queue
5. add to internal queue
6. if fewer than XXX items on internal queue, goto 1
7. fetch filename from internal queue
8. process it
9. goto 1
For the slave processes:
1. fetch from queue
2. if empty: quit
3. prefetch file
4. loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case), or to use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads/processes.
As a simplification: if seeking within the files is not required (only sequential access), the slave processes could consist of the equivalent of

dd if=name bs=500K

which could be wrapped into a popen() or a pipe+fork().
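A sketch of the popen() variant: the prefetcher simply drains dd's output, so the file lands in the page cache and the main process's later read hits memory instead of the disk (buffer sizes are illustrative):

#include <stdio.h>

void prefetch(const char *name)
{
    char cmd[4096], buf[64 * 1024];
    snprintf(cmd, sizeof cmd, "dd if='%s' bs=500K 2>/dev/null", name);
    FILE *p = popen(cmd, "r");
    if (!p)
        return;
    while (fread(buf, 1, sizeof buf, p) > 0)
        ;                   /* discard; the point is the caching */
    pclose(p);
}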

Transferring data between process calls

I have a Linux process that is being called numerous times, and I need to make this process as fast as possible.
The problem is that I must maintain a state between calls (load data from previous call and store it for the next one), without running another process / daemon.
Can you suggest fast ways to do so? I know I can use files for I/O, but I would like to avoid that for obvious performance reasons. Should (can?) I create a named pipe to read/write from, and thereby avoid real disk I/O?
Pipes aren't appropriate for this. Use POSIX shared memory or a POSIX message queue if you are absolutely sure files are too slow, which you should test first.
In the shared memory case, your program creates the segment with shm_open() if it doesn't exist, or opens it if it does. You mmap() the memory, make whatever changes you need, and exit. You only shm_unlink() when you know your program won't be called anymore and no longer needs the shared memory.
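A sketch of that flow (the segment name and size are illustrative; on older glibc you may need to link with -lrt):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define STATE_SIZE 4096   /* illustrative fixed size */

/* Create the segment on the first call, reopen it on later calls;
 * the state survives between invocations until reboot (or until
 * shm_unlink() is called). */
void *attach_state(void)
{
    int fd = shm_open("/mystate", O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    ftruncate(fd, STATE_SIZE);            /* no-op once already sized */
    void *p = mmap(NULL, STATE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                            /* the mapping stays valid  */
    return p == MAP_FAILED ? NULL : p;
}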
With message queues, just set up the queue. Your program reads the queue, makes whatever changes, writes the queue back, and exits. Call mq_unlink() when you no longer need the queue.
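And a sketch of the message-queue variant (again with illustrative names; mq_open() may also require -lrt). The state is read at the start of a run, if a previous run stored any, and written back at the end:

#include <fcntl.h>
#include <mqueue.h>
#include <stddef.h>

int roundtrip_state(char *state, size_t len)
{
    struct mq_attr attr = { .mq_maxmsg = 1, .mq_msgsize = (long)len };
    mqd_t q = mq_open("/mystate", O_RDWR | O_CREAT | O_NONBLOCK,
                      0600, &attr);
    if (q == (mqd_t)-1)
        return -1;
    mq_receive(q, state, len, NULL);   /* fails silently if empty  */
    /* ... update state here ... */
    mq_send(q, state, len, 0);         /* store state for next run */
    mq_close(q);
    return 0;
}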
Both methods have kernel persistence so you lose the shared memory and the queue on a reboot.
It sounds like you have a process that is continuously executed by something.
Why not create a factory that spawns the worker threads?
The factory could provide the workers with any information needed.
... I can use files for I/O, and would like to avoid it, for obvious performance reasons.
I wonder what these reasons are, please...
Linux caches files in kernel memory in the page cache. Writes go to the page cache first; in other words, a write() syscall only copies the data from user space to the page cache (it is a bit more complicated when the system is under stress). Some time later, pdflush writes the data to disk asynchronously.
A read() first checks the page cache to see if the data is already available in memory, to avoid a disk read. This means that if one program writes data to files and another program reads it, the two programs are effectively communicating via kernel memory, as long as the page cache keeps those files.
If you want to avoid disk writes entirely, that is, if the state does not need to persist across OS reboots, those files can be put in /dev/shm or /tmp, which are normally mount points of in-memory filesystems.
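For example, a minimal sketch that keeps a counter in a tmpfs-backed file (the path is illustrative); the read and write never touch the disk, only the page cache:

#include <stdio.h>

#define STATE_PATH "/dev/shm/myapp.state"

int load_counter(long *v)
{
    FILE *f = fopen(STATE_PATH, "r");
    if (!f)
        return -1;                     /* first invocation: no state yet */
    int ok = fscanf(f, "%ld", v) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

int save_counter(long v)
{
    FILE *f = fopen(STATE_PATH, "w");
    if (!f)
        return -1;
    fprintf(f, "%ld\n", v);
    fclose(f);
    return 0;
}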
