How to mmap() a large file without risking the OOM killer?

I've got an embedded ARM Linux box with a limited amount of RAM (512MB) and no swap space, on which I need to create and then manipulate a fairly large file (~200MB). Loading the entire file into RAM, modifying the contents in-RAM, and then writing it back out again would sometimes invoke the OOM-killer, which I want to avoid.
My idea to get around this was to use mmap() to map this file into my process's virtual address space; that way, reads and writes to the mapped memory-area would go out to the local flash-filesystem instead, and the OOM-killer would be avoided since if memory got low, Linux could just flush some of the mmap()'d memory pages back to disk to free up some RAM. (That might make my program slow, but slow is okay for this use-case)
However, even with the mmap() call, I'm still occasionally seeing processes get killed by the OOM-killer while performing the above operation.
My question is, was I too optimistic about how Linux would behave in the presence of both a large mmap() and limited RAM? (i.e. does mmap()-ing a 200MB file and then reading/writing to the mmap()'d memory still require 200MB of available RAM to accomplish reliably?) Or should mmap() be clever enough to page out mmap'd pages when memory is low, but I'm doing something wrong in how I use it?
FWIW my code to do the mapping is here:
void FixedSizeDataBuffer :: TryMapToFile(const std::string & filePath, bool createIfNotPresent, bool autoDelete)
{
const int fd = open(filePath.c_str(), (createIfNotPresent?(O_CREAT|O_EXCL|O_RDWR):O_RDONLY)|O_CLOEXEC, S_IRUSR|(createIfNotPresent?S_IWUSR:0));
if (fd >= 0)
{
if ((autoDelete == false)||(unlink(filePath.c_str()) == 0)) // so the file will automatically go away when we're done with it, even if we crash
{
const int fallocRet = createIfNotPresent ? posix_fallocate(fd, 0, _numBytes) : 0;
if (fallocRet == 0)
{
void * mappedArea = mmap(NULL, _numBytes, PROT_READ|(createIfNotPresent?PROT_WRITE:0), MAP_SHARED, fd, 0);
if (mappedArea != MAP_FAILED) // mmap() reports failure as MAP_FAILED, not NULL
{
printf("FixedSizeDataBuffer %p: Using backing-store file [%s] for %zu bytes of data\n", this, filePath.c_str(), _numBytes);
_buffer = (uint8_t *) mappedArea;
_isMappedToFile = true;
}
else printf("FixedSizeDataBuffer %p: Unable to mmap backing-store file [%s] to %zu bytes (%s)\n", this, filePath.c_str(), _numBytes, strerror(errno));
}
else printf("FixedSizeDataBuffer %p: Unable to pad backing-store file [%s] out to %zu bytes (%s)\n", this, filePath.c_str(), _numBytes, strerror(fallocRet));
}
else printf("FixedSizeDataBuffer %p: Unable to unlink backing-store file [%s] (%s)\n", this, filePath.c_str(), strerror(errno));
close(fd); // no need to hold this anymore AFAIK, the memory-mapping itself will keep the backing store around
}
else printf("FixedSizeDataBuffer %p: Unable to create backing-store file [%s] (%s)\n", this, filePath.c_str(), strerror(errno));
}
I can rewrite this code to just use plain-old-file-I/O if I have to, but it would be nice if mmap() could do the job (or if not, I'd at least like to understand why not).

After much further experimentation, I determined that the OOM-killer was visiting me not because the system had run out of RAM, but because RAM would occasionally become sufficiently fragmented that the kernel couldn't find a set of physically-contiguous RAM pages large enough to meet its immediate needs. When this happened, the kernel would invoke the OOM-killer to free up some RAM to avoid a kernel panic, which is all well and good for the kernel but not so great when it kills a process that the user was relying on to get his work done. :/
After trying and failing to find a way to convince Linux not to do that (I think enabling a swap partition would avoid the OOM-killer, but doing that is not an option for me on these particular machines), I came up with a hack work-around; I added some code to my program that periodically checks the amount of memory fragmentation reported by the Linux kernel, and if the memory fragmentation starts looking too severe, preemptively orders a memory-defragmentation to occur, so that the OOM-killer will (hopefully) not become necessary. If the memory-defragmentation pass doesn't appear to be improving matters any, then after 20 consecutive attempts, we also drop the VM Page cache as a way to free up contiguous physical RAM. This is all very ugly, but not as ugly as getting a phone call at 3AM from a user who wants to know why their server program just crashed. :/
The gist of the work-around implementation is below; note that DefragTick(Milliseconds) is expected to be called periodically (preferably once per second).
// Returns how safe we are from the fragmentation-based-OOM-killer visits.
// Returns -1 if we can't read the data for some reason.
static int GetFragmentationSafetyLevel()
{
int ret = -1;
FILE * fpIn = fopen("/sys/kernel/debug/extfrag/extfrag_index", "r");
if (fpIn)
{
char buf[512];
while(fgets(buf, sizeof(buf), fpIn))
{
const char * dma = (strncmp(buf, "Node 0, zone", 12) == 0) ? strstr(buf+12, "DMA") : NULL;
if (dma)
{
// dma= e.g.: "DMA -1.000 -1.000 -1.000 -1.000 0.852 0.926 0.963 0.982 0.991 0.996 0.998 0.999 1.000 1.000"
const char * s = dma+4; // skip past "DMA ";
ret = 0; // ret now becomes a count of "safe values in a row"; a safe value is any number less than 0.500, per me
while((s)&&((*s == '-')||(*s == '.')||(isdigit(*s))))
{
const float fVal = atof(s);
if (fVal < 0.500f)
{
ret++;
// Advance (s) to the next number in the list
const char * space = strchr(s, ' '); // to the next space
s = space ? (space+1) : NULL;
}
else break; // oops, a dangerous value! Run away!
}
}
}
fclose(fpIn);
}
return ret;
}
// should be called periodically (e.g. once per second)
void DefragTick(Milliseconds current_time_in_milliseconds)
{
if ((current_time_in_milliseconds-m_last_fragmentation_check_time) >= Milliseconds(1000))
{
m_last_fragmentation_check_time = current_time_in_milliseconds;
const int fragmentationSafetyLevel = GetFragmentationSafetyLevel();
if (fragmentationSafetyLevel < 9)
{
m_defrag_pending = true; // trouble seems to start at level 8
m_fragged_count++; // note that we still seem fragmented
}
else m_fragged_count = 0; // we're in the clear!
if ((m_defrag_pending)&&((current_time_in_milliseconds-m_last_defrag_time) >= Milliseconds(5000)))
{
if (m_fragged_count >= 20)
{
// FogBugz #17882
FILE * fpOut = fopen("/proc/sys/vm/drop_caches", "w");
if (fpOut)
{
const char * warningText = "Persistent Memory fragmentation detected -- dropping filesystem PageCache to improve defragmentation.";
printf("%s (fragged count is %i)\n", warningText, m_fragged_count);
fprintf(fpOut, "3");
fclose(fpOut);
m_fragged_count = 0;
}
else
{
const char * errorText = "Couldn't open /proc/sys/vm/drop_caches to drop filesystem PageCache!";
printf("%s\n", errorText);
}
}
FILE * fpOut = fopen("/proc/sys/vm/compact_memory", "w");
if (fpOut)
{
const char * warningText = "Memory fragmentation detected -- ordering a defragmentation to avoid the OOM-killer.";
printf("%s (fragged count is %i)\n", warningText, m_fragged_count);
fprintf(fpOut, "1");
fclose(fpOut);
m_defrag_pending = false;
m_last_defrag_time = current_time_in_milliseconds;
}
else
{
const char * errorText = "Couldn't open /proc/sys/vm/compact_memory to trigger a memory-defragmentation!";
printf("%s\n", errorText);
}
}
}
}
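The number-parsing loop in GetFragmentationSafetyLevel() can also be sketched as a small standalone function built on strtod(), which reports where each number ended and so removes the need to search for the next space by hand. The name count_safe_orders() and the threshold parameter are hypothetical; the 0.5 cutoff and the sample values are taken from the code above.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how many leading values on one zone line are below `threshold`.
 * `values` points just past the zone name, e.g. at "-1.000 -1.000 0.852 ...".
 * Counting stops at the first value that is >= threshold or at the end of
 * the numbers, whichever comes first. */
int count_safe_orders(const char *values, double threshold)
{
    assert(values != NULL);

    int safe = 0;
    const char *s = values;
    for (;;) {
        char *end;
        double v = strtod(s, &end);   /* skips leading whitespace itself */
        if (end == s) break;          /* nothing parsed: no more numbers */
        if (v >= threshold) break;    /* first "dangerous" value: stop counting */
        safe++;
        s = end;
    }
    return safe;
}
```

As in the original loop, the -1.000 entries count as safe because they are below the threshold.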

Related

Linux initrd optimization

I am investigating Linux initrd mechanism. I learned the following code:
bool __init initrd_load(void)
{
if (mount_initrd) {
create_dev("/dev/ram", Root_RAM0);
/*
* Load the initrd data into /dev/ram0. Execute it as initrd
* unless /dev/ram0 is supposed to be our actual root device,
* in that case the ram disk is just set up here, and gets
* mounted in the normal path.
*/
if (rd_load_image("/initrd.image") && ROOT_DEV != Root_RAM0) {
init_unlink("/initrd.image");
handle_initrd();
return true;
}
}
init_unlink("/initrd.image");
return false;
}
int __init rd_load_image(char *from)
{
// ...
out_file = filp_open("/dev/ram", O_RDWR, 0);
in_file = filp_open(from, O_RDONLY, 0);
// ...
for (i = 0; i < nblocks; i++) {
// ...
kernel_read(in_file, buf, BLOCK_SIZE, &in_pos);
kernel_write(out_file, buf, BLOCK_SIZE, &out_pos);
// ...
}
// ...
}
Now I understand that the ramdisk content is read from "/initrd.image" and written to the "/dev/ram" device (a RAM-simulated disk?).
Here are my questions:
Where is the implementation of file_operations for device "/dev/ram" and "/initrd.image"?
How is the device "/dev/ram" used later? I couldn't find anywhere else that "/dev/ram" is used.
From the above logic, the file content is first read from "/initrd.image" and then written to "/dev/ram". That means there are two memory copies. I am wondering whether one of them can be eliminated in order to improve boot performance.
Thanks in advance for any reply!
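To make the two copies concrete, here is a user-space analogue of the loop in rd_load_image(), using read() and write() where the kernel code uses kernel_read() and kernel_write(). It is only an illustration of the data flow, not the kernel implementation; the name copy_blocks() and the COPY_BLOCK_SIZE constant are hypothetical.

```c
#include <assert.h>
#include <unistd.h>

#define COPY_BLOCK_SIZE 1024   /* assumption: stands in for the kernel's BLOCK_SIZE */

/* Copies everything from in_fd to out_fd one block at a time.
 * Returns the number of bytes copied, or -1 on error. */
long copy_blocks(int in_fd, int out_fd)
{
    assert(in_fd >= 0 && out_fd >= 0);

    char buf[COPY_BLOCK_SIZE];
    long total = 0;
    for (;;) {
        ssize_t got = read(in_fd, buf, sizeof(buf));   /* copy 1: source -> buf */
        if (got < 0) return -1;
        if (got == 0) break;                           /* end of input */

        ssize_t done = 0;
        while (done < got) {                           /* cope with short writes */
            ssize_t put = write(out_fd, buf + done, (size_t)(got - done)); /* copy 2: buf -> destination */
            if (put < 0) return -1;
            done += put;
        }
        total += got;
    }
    return total;
}
```

Every byte passes through buf, which is the intermediate copy the question asks about.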

/dev/ram has a block_device_operations struct in brd.c https://elixir.bootlin.com/linux/v5.4.210/source/drivers/block/brd.c#L327 while initrd.image depends on the filesystem that contains it.
/dev/ram consists of blocks of memory starting at rd_image_start.
How do you know whether the /dev/ram area is an acceptable DMA target? How do you decompress?

What's the exact equivalent of Windows' WaitOnAddress() on Linux?

Using shared memory obtained with the shmget() system call, my C++ program fetches a bid price from the Internet through a server written in Rust, so that each time the value changes, I perform a financial transaction.
Server pseudocode
Shared_struct.price = new_price
Client pseudocode
Infinite_loop_label:
Wait until memory address pointed by Shared_struct.price changes.
Launch_transaction(Shared_struct.price*1.13)
Goto Infinite_loop_label
Since launching a transaction involves paying transaction fees, I want to create a transaction only once per buy price change.
Using a semaphore or a futex, I can do the reverse, that is, wait for a variable to reach a specific value, but how do I wait until a variable is no longer equal to its current value?
On Windows, by contrast, I can do something like this on the address of the shared segment:
ULONG g_TargetValue; // global, accessible to all process
ULONG CapturedValue;
ULONG UndesiredValue;
UndesiredValue = 0;
CapturedValue = g_TargetValue;
while (CapturedValue == UndesiredValue) {
WaitOnAddress(&g_TargetValue, &UndesiredValue, sizeof(ULONG), INFINITE);
CapturedValue = g_TargetValue;
}
Is there a way to do this on Linux? Or a straight equivalent?

You can use a futex. (I assume "var" is in shared memory.)
/* Client */
while (1) {
int prv = var;
/* futex() stands for syscall(SYS_futex, ...); glibc provides no wrapper */
int ret = futex(&var, FUTEX_WAIT, prv, NULL, NULL, 0);
/* Spurious wake-up */
if (!ret && var == prv) continue;
doTransaction();
}
/* Server */
int prv = NOT_CACHED;
while(1) {
var = updateVar();
/* note ==, not =: compare against the sentinel instead of assigning it */
if (var != prv || prv == NOT_CACHED)
futex(&var, FUTEX_WAKE, 1, NULL, NULL, 0);
prv = var;
}
It requires the server side to call futex as well to notify client(s).
Note that the same holds true for WaitOnAddress.
According to MSDN:
Any thread within the same process that changes the value at the address on which threads are waiting should call WakeByAddressSingle to wake a single waiting thread or WakeByAddressAll to wake all waiting threads.
(Added)
A higher-level synchronization method for this problem is a condition variable.
It is also implemented on top of futex.
See link
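Since glibc provides no futex() wrapper, the calls above have to go through syscall(). The following is a minimal sketch of that, assuming Linux and GCC or Clang (for the __atomic builtins). The names futex_call(), wait_until_changed() and publish() are hypothetical helpers, not an existing API. FUTEX_WAIT sleeps only while the word still holds the expected value and otherwise fails with EAGAIN, which is what gives the "wait until it is no longer equal" behaviour.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call. */
static long futex_call(uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Blocks until *addr differs from `old`, then stores the new value in *out.
 * Returns 0 on success, -1 on an unexpected futex error. */
int wait_until_changed(uint32_t *addr, uint32_t old, uint32_t *out)
{
    assert(addr != NULL && out != NULL);
    for (;;) {
        uint32_t now = __atomic_load_n(addr, __ATOMIC_ACQUIRE);
        if (now != old) { *out = now; return 0; }
        /* Sleeps only if *addr still equals `old`. EAGAIN means the value
         * changed before we slept; EINTR means a signal arrived. Both cases,
         * like a spurious wake-up, are handled by re-checking in the loop. */
        if (futex_call(addr, FUTEX_WAIT, old) == -1 &&
            errno != EAGAIN && errno != EINTR)
            return -1;
    }
}

/* Stores a new value and wakes every waiter. Returns the number woken, or -1. */
long publish(uint32_t *addr, uint32_t value)
{
    assert(addr != NULL);
    __atomic_store_n(addr, value, __ATOMIC_RELEASE);
    return futex_call(addr, FUTEX_WAKE, INT_MAX);
}
```

The client would remember the last price it acted on and pass it as old; the server would call publish() only when the price really changed. The non-private FUTEX_WAIT and FUTEX_WAKE operations are used here because the word lives in memory shared between processes.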

Windows Filtering Platform Network Slowdown Due to Spinlock

I am writing a Windows Filtering Platform kernel-mode driver. The goal of the driver is to capture all traffic on a particular layer and pass it down to user mode so that it can be analyzed further. The driver never needs to block any traffic; the classifyOut is always set to FWP_ACTION_CONTINUE.
The following code is used in my Classify function to queue up the packets that are received.
classifyOut->actionType = FWP_ACTION_CONTINUE;
do
{
if ((classifyOut->rights & FWPS_RIGHT_ACTION_WRITE) == 0)
{
break;
}
if (layerData != NULL)
{
PNET_BUFFER_LIST netBufferList = (PNET_BUFFER_LIST) layerData;
PNET_BUFFER netBuffer = NET_BUFFER_LIST_FIRST_NB(netBufferList);
if (packetQueueSize >= 2048)
{
ExInterlockedRemoveHeadList(&packetQueue, &packetQueueLock);
packetQueueSize--;
}
ULONG netBufferSize = NET_BUFFER_DATA_LENGTH(netBuffer);
PACKET_ITEM* allocatedPacket = InitalizePacketItem(
netBuffer,
netBufferSize
);
if (allocatedPacket == NULL)
{
classifyOut->actionType = FWP_ACTION_BLOCK;
classifyOut->rights &= ~FWPS_RIGHT_ACTION_WRITE;
break;
}
ExInterlockedInsertTailList(
&packetQueue,
&allocatedPacket->listEntry,
&packetQueueLock
);
allocatedPacket = NULL;
packetQueueSize++;
}
} while (FALSE);
The PACKET_ITEM struct is defined as the following
typedef struct _PACKET_ITEM {
LIST_ENTRY listEntry;
PVOID data;
ULONG dataLen;
} PACKET_ITEM;
I am using the inverted call model to communicate this packet data from kernel mode to user mode. The following code is used in the kernel driver once it detects the correct IOCTL has been sent.
status = WdfRequestRetrieveOutputBuffer(request, 0, &buffer, &bufferSize);
if (!NT_SUCCESS(status))
{
break;
}
PLIST_ENTRY listEntry = ExInterlockedRemoveHeadList(&packetQueue, &packetQueueLock);
if (listEntry == NULL)
{
break;
}
PACKET_ITEM* packetItem = CONTAINING_RECORD(
listEntry,
struct _PACKET_ITEM,
listEntry
);
RtlCopyMemory(
buffer,
packetItem->data,
packetItem->dataLen);
status = STATUS_SUCCESS;
WdfRequestCompleteWithInformation(
request,
status,
packetItem->dataLen
);
FreePacketItem(packetItem);
This code seems to slow the network down greatly after a short while, causing timeouts when trying to load websites in a web browser, for example.
I assume this is being caused by the spinlocks and the sheer volume of packets being transferred across the network that are being captured by this driver.
My questions are the following:
Is the spinlock likely to be causing my problems here? If so:
Is it possible to set classifyOut->actionType immediately and return before allocating any memory to copy the data into my queue? I assume this would prevent the slowdown from happening.
What else should I be doing differently to prevent this?
If not:
What is causing the slowdown here?

How to ensure two different applications running on the same machine attach to the same starting address with shmat

I am working on two completely separate applications that will need to use System V shared memory as a means of IPC. After reading the Linux man page, it seems that I will have to provide both applications with an address hint in order to guarantee that they point to the exact same memory location. I will be able to (almost) guarantee that they both have the same shmid, as described below. So I was wondering: 1. If NULL is passed as the second parameter and 0 as the third, can I be 100% certain that the system will point both applications to the same starting location in memory, given the same shmid? 2. If not, is there a way to figure out at runtime which addresses the system is using for shared memory, so that both applications use an address hint that won't cause shmat to fail?
Example of code being used:
typedef struct
{
uint8_t dataBuffer[SHARED_MEM_BUFFER_SIZE]; //8 byte char array
} SharedData;
typedef struct
{
int32_t dataIndex;
SharedData data;
} SharedDataStructure;
bool initialize()
{
//Parse JSON file for key gen file path and char
auto keyGenFilePath = ...//Parsed file path
auto keyGenChar = ...//Parsed char
//Both applications will be reading the exact same json file, to ensure
//they both receive the same key.
key_t sharedMemKey = ftok(keyGenFilePath.c_str(), keyGenChar[0]);
if (sharedMemKey == -1)
{
//Log error
return false;
}
//m_shMemId is an int, m_params is a std::vector<SharedDataStructure>
m_shMemId = shmget(sharedMemKey, m_params.size() * sizeof(SharedData), IPC_CREAT | 0666);
if (m_shMemId == -1)
{
//Log error
return false;
}
//m_attachedSharedMem is a SharedData pointer
m_attachedSharedMem = (SharedData *)shmat(m_shMemId, NULL, 0);
if (m_attachedSharedMem == (void *)-1)
{
//Log error
return false;
}
//Zero out shared memory
return true;
}
Also, please note that both applications will initialize their shared memory this way (Only one will zero the memory). This is also going on a very barebones system, so these two applications WILL be the only two applications using shared memory outside of the OS. Also, using POSIX shared memory is not an option, not because of system limitations, but due to other factors.
I apologize for not being able to provide a copy-paste compilable example; the application(s) need to be highly configurable to avoid having to change code in the future.
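As a small experiment on the address question: the shmat man page says that when shmaddr is NULL the system chooses a suitable unused address, so the returned value is a property of each attachment rather than of the segment. The sketch below attaches one segment twice inside a single process to show two different addresses backed by the same bytes. The function name attach_twice_and_compare() is hypothetical, and IPC_PRIVATE is used only to keep the example self-contained (the code above uses an ftok() key).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Attaches one segment twice with a NULL address hint. Returns 1 when the two
 * attachments sit at different addresses yet see the same bytes, 0 when they
 * do not, and -1 if a system call fails. */
int attach_twice_and_compare(size_t size)
{
    assert(size > 100);

    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1) return -1;

    uint8_t *a = shmat(id, NULL, 0);
    uint8_t *b = shmat(id, NULL, 0);
    shmctl(id, IPC_RMID, NULL);   /* the segment is destroyed after the last detach */

    if (a == (void *)-1 || b == (void *)-1) {
        if (a != (void *)-1) shmdt(a);
        if (b != (void *)-1) shmdt(b);
        return -1;
    }

    a[100] = 42;                                   /* write through one attachment */
    int ok = (a != b) && (b[100] == 42);           /* read through the other */

    shmdt(a);
    shmdt(b);
    return ok;
}
```

If the data placed in the segment refers to other locations by offset from the start of the segment (as the dataIndex field above could), it stays valid whatever address each attachment receives.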

Embedded Linux poll() returns constantly

I have a particular problem: poll() keeps returning when I know there is nothing to read.
The setup is as follows. I have two file descriptors that form part of an fd set that poll watches. One is for a pin high-to-low change (GPIO). The other is for a proxy input. The problem occurs with the proxy input.
The order of processing is: start main functions; it will then poll; write data to proxy; poll will break; accept the data; send the data over SPI; the receiving slave device signals that it wants to send an ACK by dropping the GPIO low; poll() senses this drop and reacts;
Infinite POLLINs :(
If I have no timeout on the poll call, the program works perfectly. The moment I include a timeout, poll returns continuously. I am not sure what I am doing wrong here.
while(1)
{
memset((void*)fdset, 0, sizeof(fdset));
fdset[0].fd = gpio_fd;
fdset[0].events = POLLPRI; // POLLPRI - There is urgent data to read
fdset[1].fd = proxy_rx;
fdset[1].events = POLLIN; // POLLIN - There is data to read
rc = poll(fdset, nfds, 1000);//POLL_TIMEOUT);
if (rc < 0) // Error
{
printf("\npoll() failed/Interrupted!\n");
}
else if (rc == 0) // Timeout occurred
{
printf(" poll() timeout\n");
}
else
{
if (fdset[1].revents & POLLIN)
{
printf("fdset[1].revents & POLLIN\n");
if ((resultR =read(fdset[1].fd,command_buf,10))<0)
{
printf("Failed to read Data\n");
}
if (fdset[0].revents & POLLPRI)
//if( (gpio_fd != -1) && (FD_ISSET(gpio_fd, &err)))
{
lseek(fdset[0].fd, 0, SEEK_SET); // Read from the start of the file
len = read(fdset[0].fd, reader, 64);
}
}
}
}
So that is the gist of my code.
I have also used GDB and while debugging, I found that the GPIO descriptor was set with revents = 0x10, which means that an error occurred and that POLLPRI also occurred.
In this question, something similar was addressed. But I do read every time I get POLLIN. It is a bit surprising that this problem only occurs when I include the timeout; if I replace the poll timeout with -1, it works perfectly.
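When inspecting revents in a debugger, it can be easier to decode the bits with the POLL* macros from <poll.h> than to interpret the number by hand, since the numeric values are platform-defined. A minimal sketch follows; the helper name revents_to_string() is hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>

/* Writes the names of the bits set in `revents` into `out`, separated by '|'.
 * Returns `out` so the call can be used directly in a printf(). */
const char *revents_to_string(short revents, char *out, size_t out_size)
{
    static const struct { short bit; const char *name; } table[] = {
        { POLLIN,  "POLLIN"  }, { POLLPRI, "POLLPRI" }, { POLLOUT,  "POLLOUT"  },
        { POLLERR, "POLLERR" }, { POLLHUP, "POLLHUP" }, { POLLNVAL, "POLLNVAL" },
    };
    assert(out != NULL && out_size > 0);

    out[0] = '\0';
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (revents & table[i].bit) {
            size_t used = strlen(out);
            /* snprintf() truncates safely if the buffer is too small */
            snprintf(out + used, out_size - used, "%s%s", used ? "|" : "", table[i].name);
        }
    }
    return out;
}
```

For example, printf("%s\n", revents_to_string(fdset[0].revents, buf, sizeof(buf))) right after poll() returns shows which conditions were actually reported for the GPIO descriptor.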

When poll fails (returning -1) you should do something with errno, perhaps through perror; also, your nfds (the second argument to poll) is never set, but it should be the constant 2.
Probably the GCC compiler would have given a warning, at least with all warnings enabled (-Wall), about nfds not being set.
(I'm guessing that nfds being uninitialized might be some "random" large value.... So the kernel might be polling other "random" file descriptors, those in your fdset after index 2...)
BTW, you could strace your program. And using the fdset name is a bit confusing (it could refer to select(2)).
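The points above can be sketched as follows: nfds is derived from the array itself so that it cannot be left unset, and a failed poll() is reported through perror(). The function name poll_once() and its parameters are hypothetical; the event masks mirror those in the question.

```c
#include <assert.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Polls a GPIO-style fd for POLLPRI and an input fd for POLLIN.
 * Returns poll()'s result. When the input fd is readable, reads up to
 * buf_size bytes into buf and stores the byte count in *n_read. */
int poll_once(int gpio_fd, int input_fd, int timeout_ms,
              char *buf, size_t buf_size, ssize_t *n_read)
{
    assert(buf != NULL && n_read != NULL);

    struct pollfd fds[2] = {
        { .fd = gpio_fd,  .events = POLLPRI },
        { .fd = input_fd, .events = POLLIN  },
    };
    const nfds_t nfds = sizeof(fds) / sizeof(fds[0]);   /* always matches the array */

    *n_read = 0;
    int rc = poll(fds, nfds, timeout_ms);
    if (rc < 0) {
        perror("poll");            /* prints the reason taken from errno */
        return rc;
    }
    if (rc > 0 && (fds[1].revents & POLLIN)) {
        *n_read = read(input_fd, buf, buf_size);
        if (*n_read < 0) perror("read");
    }
    return rc;
}
```

Handling of the GPIO descriptor (the lseek() and read() from the question) is left out to keep the sketch short.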

Assuming I fixed your formatting properly in your question, it looks like you have a missing } after the POLLIN block and the next if() that checks the POLLPRI. It would possibly work better this way:
if (fdset[1].revents & POLLIN)
{
printf("fdset[1].revents & POLLIN\n");
if ((resultR =read(fdset[1].fd,command_buf,10))<0)
{
printf("Failed to read Data\n");
}
}
if (fdset[0].revents & POLLPRI)
//if( (gpio_fd != -1) && (FD_ISSET(gpio_fd, &err)))
{
lseek(fdset[0].fd, 0, SEEK_SET); // Read from the start of the file
len = read(fdset[0].fd, reader, 64);
}
Although you can do whatever you want with indentation in C/C++/Java/JavaScript, not doing it right can bite you really hard. Hopefully, I'm wrong and your original code was correct.
Another one I often see: People not using the { ... } at all and end up writing code like:
if(expr) do_a; do_b;
and of course, do_b; will be executed all the time, whether expr is true or false... and although you could fix the above with a comma like so:
if(expr) do_a, do_b;
the only safe way to do it right is to use braces:
if(expr)
{
do_a;
do_b;
}
Always make sure your indentation is perfect and write small functions so you can see that it is indeed perfect.
