Linux initrd optimization - linux

I am investigating Linux initrd mechanism. I learned the following code:
bool __init initrd_load(void)
{
if (mount_initrd) {
create_dev("/dev/ram", Root_RAM0);
/*
* Load the initrd data into /dev/ram0. Execute it as initrd
* unless /dev/ram0 is supposed to be our actual root device,
* in that case the ram disk is just set up here, and gets
* mounted in the normal path.
*/
if (rd_load_image("/initrd.image") && ROOT_DEV != Root_RAM0) {
init_unlink("/initrd.image");
handle_initrd();
return true;
}
}
init_unlink("/initrd.image");
return false;
}
int __init rd_load_image(char *from)
{
// ...
out_file = filp_open("/dev/ram", O_RDWR, 0);
in_file = filp_open(from, O_RDONLY, 0);
// ...
for (i = 0; i < nblocks; i++) {
// ...
kernel_read(in_file, buf, BLOCK_SIZE, &in_pos);
kernel_write(out_file, buf, BLOCK_SIZE, &out_pos);
// ...
}
// ...
}
Now I know ramdisk content read from device "/initrd.image" to device "/dev/ram" device (a RAM simulated disk?).
Here are my questions:
Where is the implementation of file_operations for device "/dev/ram" and "/initrd.image"?
How is the device "/dev/ram" used later? I didn't find anywhere else "/dev/ram" is used.
From the above logic, file content is first READ from "/initrd.image", and the WRITE to "/dev/ram". That means there are 2 memory copies. I am wondering if it is possible to exclude one of the memory copy so as to improve the boot performance?
Thanks in advance for any reply!

/dev/ram has a block_device_operations struct in brd.c https://elixir.bootlin.com/linux/v5.4.210/source/drivers/block/brd.c#L327 while initrd.image depends on the filesystem that contains it.
/dev/ram consists of blocks of memory starting at rd_image_start
How do you know whether the /dev/ram area is an acceptable DMA target? How do you do decompress?

Related

How to mmap() a large file without risking the OOM killer?

I've got an embedded ARM Linux box with a limited amount of RAM (512MB) and no swap space, on which I need to create and then manipulate a fairly large file (~200MB). Loading the entire file into RAM, modifying the contents in-RAM, and then writing it back out again would sometimes invoke the OOM-killer, which I want to avoid.
My idea to get around this was to use mmap() to map this file into my process's virtual address space; that way, reads and writes to the mapped memory-area would go out to the local flash-filesystem instead, and the OOM-killer would be avoided since if memory got low, Linux could just flush some of the mmap()'d memory pages back to disk to free up some RAM. (That might make my program slow, but slow is okay for this use-case)
However, even with the mmap() call, I'm still occasionally seeing processes get killed by the OOM-killer while performing the above operation.
My question is, was I too optimistic about how Linux would behave in the presence of both a large mmap() and limited RAM? (i.e. does mmap()-ing a 200MB file and then reading/writing to the mmap()'d memory still require 200MB of available RAM to accomplish reliably?) Or should mmap() be clever enough to page out mmap'd pages when memory is low, but I'm doing something wrong in how I use it?
FWIW my code to do the mapping is here:
void FixedSizeDataBuffer :: TryMapToFile(const std::string & filePath, bool createIfNotPresent, bool autoDelete)
{
const int fd = open(filePath.c_str(), (createIfNotPresent?(O_CREAT|O_EXCL|O_RDWR):O_RDONLY)|O_CLOEXEC, S_IRUSR|(createIfNotPresent?S_IWUSR:0));
if (fd >= 0)
{
if ((autoDelete == false)||(unlink(filePath.c_str()) == 0)) // so the file will automatically go away when we're done with it, even if we crash
{
const int fallocRet = createIfNotPresent ? posix_fallocate(fd, 0, _numBytes) : 0;
if (fallocRet == 0)
{
void * mappedArea = mmap(NULL, _numBytes, PROT_READ|(createIfNotPresent?PROT_WRITE:0), MAP_SHARED, fd, 0);
if (mappedArea)
{
printf("FixedSizeDataBuffer %p: Using backing-store file [%s] for %zu bytes of data\n", this, filePath.c_str(), _numBytes);
_buffer = (uint8_t *) mappedArea;
_isMappedToFile = true;
}
else printf("FixedSizeDataBuffer %p: Unable to mmap backing-store file [%s] to %zu bytes (%s)\n", this, filePath.c_str(), _numBytes, strerror(errno));
}
else printf("FixedSizeDataBuffer %p: Unable to pad backing-store file [%s] out to %zu bytes (%s)\n", this, filePath.c_str(), _numBytes, strerror(fallocRet));
}
else printf("FixedSizeDataBuffer %p: Unable to unlink backing-store file [%s] (%s)\n", this, filePath.c_str(), strerror(errno));
close(fd); // no need to hold this anymore AFAIK, the memory-mapping itself will keep the backing store around
}
else printf("FixedSizeDataBuffer %p: Unable to create backing-store file [%s] (%s)\n", this, filePath.c_str(), strerror(errno));
}
I can rewrite this code to just use plain-old-file-I/O if I have to, but it would be nice if mmap() could do the job (or if not, I'd at least like to understand why not).
After much further experimentation, I determined that the OOM-killer was visiting me not because the system had run out of RAM, but because RAM would occasionally become sufficiently fragmented that the kernel couldn't find a set of physically-contiguous RAM pages large enough to meet its immediate needs. When this happened, the kernel would invoke the OOM-killer to free up some RAM to avoid a kernel panic, which is all well and good for the kernel but not so great when it kills a process that the user was relying on to get his work done. :/
After trying and failing to find a way to convince Linux not to do that (I think enabling a swap partition would avoid the OOM-killer, but doing that is not an option for me on these particular machines), I came up with a hack work-around; I added some code to my program that periodically checks the amount of memory fragmentation reported by the Linux kernel, and if the memory fragmentation starts looking too severe, preemptively orders a memory-defragmentation to occur, so that the OOM-killer will (hopefully) not become necessary. If the memory-defragmentation pass doesn't appear to be improving matters any, then after 20 consecutive attempts, we also drop the VM Page cache as a way to free up contiguous physical RAM. This is all very ugly, but not as ugly as getting a phone call at 3AM from a user who wants to know why their server program just crashed. :/
The gist of the work-around implementation is below; note that DefragTick(Milliseconds) is expected to be called periodically (preferably once per second).
// Returns how safe we are from the fragmentation-based-OOM-killer visits.
// Returns -1 if we can't read the data for some reason.
static int GetFragmentationSafetyLevel()
{
int ret = -1;
FILE * fpIn = fopen("/sys/kernel/debug/extfrag/extfrag_index", "r");
if (fpIn)
{
char buf[512];
while(fgets(buf, sizeof(buf), fpIn))
{
const char * dma = (strncmp(buf, "Node 0, zone", 12) == 0) ? strstr(buf+12, "DMA") : NULL;
if (dma)
{
// dma= e.g.: "DMA -1.000 -1.000 -1.000 -1.000 0.852 0.926 0.963 0.982 0.991 0.996 0.998 0.999 1.000 1.000"
const char * s = dma+4; // skip past "DMA ";
ret = 0; // ret now becomes a count of "safe values in a row"; a safe value is any number less than 0.500, per me
while((s)&&((*s == '-')||(*s == '.')||(isdigit(*s))))
{
const float fVal = atof(s);
if (fVal < 0.500f)
{
ret++;
// Advance (s) to the next number in the list
const char * space = strchr(s, ' '); // to the next space
s = space ? (space+1) : NULL;
}
else break; // oops, a dangerous value! Run away!
}
}
}
fclose(fpIn);
}
return ret;
}
// should be called periodically (e.g. once per second)
void DefragTick(Milliseconds current_time_in_milliseconds)
{
if ((current_time_in_milliseconds-m_last_fragmentation_check_time) >= Milliseconds(1000))
{
m_last_fragmentation_check_time = current_time_in_milliseconds;
const int fragmentationSafetyLevel = GetFragmentationSafetyLevel();
if (fragmentationSafetyLevel < 9)
{
m_defrag_pending = true; // trouble seems to start at level 8
m_fragged_count++; // note that we still seem fragmented
}
else m_fragged_count = 0; // we're in the clear!
if ((m_defrag_pending)&&((current_time_in_milliseconds-m_last_defrag_time) >= Milliseconds(5000)))
{
if (m_fragged_count >= 20)
{
// FogBugz #17882
FILE * fpOut = fopen("/proc/sys/vm/drop_caches", "w");
if (fpOut)
{
const char * warningText = "Persistent Memory fragmentation detected -- dropping filesystem PageCache to improve defragmentation.";
printf("%s (fragged count is %i)\n", warningText, m_fragged_count);
fprintf(fpOut, "3");
fclose(fpOut);
m_fragged_count = 0;
}
else
{
const char * errorText = "Couldn't open /proc/sys/vm/drop_caches to drop filesystem PageCache!";
printf("%s\n", errorText);
}
}
FILE * fpOut = fopen("/proc/sys/vm/compact_memory", "w");
if (fpOut)
{
const char * warningText = "Memory fragmentation detected -- ordering a defragmentation to avoid the OOM-killer.";
printf("%s (fragged count is %i)\n", warningText, m_fragged_count);
fprintf(fpOut, "1");
fclose(fpOut);
m_defrag_pending = false;
m_last_defrag_time = current_time_in_milliseconds;
}
else
{
const char * errorText = "Couldn't open /proc/sys/vm/compact_memory to trigger a memory-defragmentation!";
printf("%s\n", errorText);
}
}
}
}

Windows Filtering Platform Network Slowdown Due to Spinlock

I am writing a Windows Filtering Platform Kernel Mode Driver, the goal of the driver is to capture all traffic on a particular layer, and communicate this traffic back down to user-mode so that it can be further analyses. The driver never needs to block any traffic, the classifyOut is always set to FWP_ACTION_CONTINUE.
The following code is used in my Classify function to queue up the packets that are received.
classifyOut->actionType = FWP_ACTION_CONTINUE;
do
{
if ((classifyOut->rights & FWPS_RIGHT_ACTION_WRITE) == 0)
{
break;
}
if (layerData != NULL)
{
PNET_BUFFER_LIST netBufferList = (PNET_BUFFER_LIST) layerData;
PNET_BUFFER netBuffer = NET_BUFFER_LIST_FIRST_NB(netBufferList);
if (packetQueueSize >= 2048)
{
ExInterlockedRemoveHeadList(&packetQueue, &packetQueueLock);
packetQueueSize--;
}
ULONG netBufferSize = NET_BUFFER_DATA_LENGTH(netBuffer);
PACKET_ITEM* allocatedPacket = InitalizePacketItem(
netBuffer,
netBufferSize
);
if (allocatedPacket == NULL)
{
classifyOut->actionType = FWP_ACTION_BLOCK;
classifyOut->rights &= ~FWPS_RIGHT_ACTION_WRITE;
break;
}
ExInterlockedInsertTailList(
&packetQueue,
&allocatedPacket->listEntry,
&packetQueueLock
);
allocatedPacket = NULL;
packetQueueSize++;
}
} while (FALSE);
The PACKET_ITEM struct is defined as the following
typedef struct _PACKET_ITEM {
LIST_ENTRY listEntry;
PVOID data;
ULONG dataLen;
} PACKET_ITEM;
I am using the inverted call model to communicate this packet data from kernel mode to user mode. The following code is used in the kernel driver once it detects the correct IOCTL has been sent.
status = WdfRequestRetrieveOutputBuffer(request, 0, &buffer, &bufferSize);
if (!NT_SUCCESS(status))
{
break;
}
PLIST_ENTRY listEntry = ExInterlockedRemoveHeadList(&packetQueue, &packetQueueLock);
if (listEntry == NULL)
{
break;
}
PACKET_ITEM* packetItem = CONTAINING_RECORD(
listEntry,
struct _PACKET_ITEM,
listEntry
);
RtlCopyMemory(
buffer,
packetItem->data,
packetItem->dataLen);
status = STATUS_SUCCESS;
WdfRequestCompleteWithInformation(
request,
status,
packetItem->dataLen
);
FreePacketItem(packetItem);
This code seems to slow the network down greatly after a short while, causing timeouts when trying to load websites in a web browser, for example.
I assume this is being caused by the spinlocks and the sheer volume of packets being transferred across the network that are being captured by this driver.
My questions are the following
Is it likely the spinlock is definitely causing my problems here? If so
Is it possible to set the classifyOut->actionType immediately and return this value before allocating any memory to copy the data into my queue. I assume this would prevent the slow down from happening?
What else should I be doing differently to prevent this?
If not,
What is causing the issue with the slow down here?

How to ensure two different applications running on the same machine attach to the same starting address with shmat

I am working on two completely separate applications that will need to use System V shared memory as a means of IPC. After reading The Linux man page, it seems like I will have to provide both applications with an address hint in order to guarantee that they point to the exact same memory location. I will be able to (almost) guarantee that they both have the same shmid, as described below. So I was wondering, 1. If NULL is passed as the second param and 0 as the third, will I be able to be 100% certain that the system will point both applications to the same starting location in memory if given the same shmid? And 2. If not, is there is a way to, at runtime, figure out what addresses the system is using for shared memory to make sure both applications use an address hint that won't cause the shmat to fail?
Example of code being used:
typedef struct
{
uint8_t dataBuffer[SHARED_MEM_BUFFER_SIZE]; //8 byte char array
} SharedData;
typedef struct
{
int32_t dataIndex;
SharedData data;
} SharedDataStructure;
bool initialize()
{
//Parse JSON file for key gen file path and char
auto keyGenFilePath = ...//Parsed file path
auto keyGenChar = ...//Parsed char
//Both applications will be reading the exact same json file, to ensure
//they both receive the same key.
key_t sharedMemKey = ftok(keyGenFilePath.c_str(), keyGenChar[0]);
if (sharedMemKey == -1)
{
//Log error
return false;
}
//m_shMemId is an int, m_params is a std::vector<SharedDataStructure>
m_shMemId = shmget(sharedMemKey, m_params.size() * sizeof(SharedData), IPC_CREAT | 0666);
if (m_shMemId == -1)
{
//Log error
return false;
}
//m_attachedSharedMem is a SharedData pointer
m_attachedSharedMem = (SharedData *)shmat(m_shMemId, NULL, 0);
if (m_attachedSharedMem == (void *)-1)
{
//Log error
return false;
}
//Zero out shared memory
return true;
}
Also, please note that both applications will initialize their shared memory this way (Only one will zero the memory). This is also going on a very barebones system, so these two applications WILL be the only two applications using shared memory outside of the OS. Also, using POSIX shared memory is not an option, not because of system limitations, but due to other factors.
I apologize for not being able to provide a copy-paste compilable example, the application(s) need to be highly configurable to avoid having the change code in the future.

Linux kernel module to read out GPS device via USB

I'm writing a Linux kernel module to read out a GPS device (a u-blox NEO-7) via USB by using the book Linux Device Drivers.
I already can probe and read out data from the device successfully. But, there is a problem when reading the device with multiple applications simultaneously (I used "cat /dev/ublox" to read indefinitely). When the active/reading applications is cancelled via "Ctrl + C", the next reading attempt from the other application fails (exactly method call usb_submit_urb(...) returns -EINVAL).
I use following ideas for my implementation:
The kernel module methods should be re-entrant. Therefore, I use a mutex to protect critical sections. E.g. allowing only one reader simultaneously.
To safe ressources, I reuse the struct urb for different reading requests (see an explanation)
Device-specific data like USB endpoint address and so on is held in a device-specific struct called ublox_device.
After submitting the USB read request, the calling process is sent to sleep until the asynchronous complete handler is called.
I verified that the ideas are implemented correctly: I have run two instances of "cat /dev/ublox" simultaneously and I got the correct output (only one instance accessed the critical read section at a time). And also reusing the "struct urb" is working. Both instances read out data alternatively.
The problem only occurs if the currently active instance is cancelled via "Ctrl + C". I can solve the problem by not reusing the "struct urb" but I would like to avoid that. I.e. by allocating a new "struct urb" for each read request via usb_alloc_urb(...) (usually it is allocated once when probing the USB device).
My code follows the USB skeleton driver from Greg Kroah-Hartman who also reuse the "struct urb" for different reading requests.
Maybe someone has a clue what's going wrong here.
The complete code can be found on pastebin. Here is a small excerpt of the read method and the USB request complete handler.
static ssize_t ublox_read(struct file *file, char *buffer, size_t count, loff_t *pos)
{
struct ublox_device *ublox_device = file->private_data;
...
return_value = mutex_lock_interruptible(&ublox_device->bulk_in_mutex);
if (return_value < 0)
return -EINTR;
...
retry:
usb_fill_bulk_urb(...);
ublox_device->read_in_progress = 1;
/* Next call fails if active application is cancelled via "Ctrl + C" */
return_value = usb_submit_urb(ublox_device->bulk_in_urb, GFP_KERNEL);
if (return_value) {
printk(KERN_ERR "usb_submit_urb(...) failed!\n");
ublox_device->read_in_progress = 0;
goto exit;
}
/* Go to sleep until read operation has finished */
return_value = wait_event_interruptible(ublox_device->bulk_in_wait_queue, (!ublox_device->read_in_progress));
if (return_value < 0)
goto exit;
...
exit:
mutex_unlock(&ublox_device->bulk_in_mutex);
return return_value;
}
static void ublox_read_bulk_callback(struct urb *urb)
{
struct ublox_device *ublox_device = urb->context;
int status = urb->status;
/* Evaluate status... */
...
ublox_device->transferred_bytes = urb->actual_length;
ublox_device->read_in_progress = 0;
wake_up_interruptible(&ublox_device->bulk_in_wait_queue);
}
Now, I allocate a new struct urb for each read request. This avoids the problem with the messed up struct urb after an active read request is cancelled by the calling application. The allocated struct is freed in the complete handler.
I will come back to LKML when I optimize my code. For now, it is okay to allocate a new struct urb for each single read request. The complete code of the kernel module is on pastebin.
static ssize_t ublox_read(struct file *file, char *buffer, size_t count, loff_t *pos)
{
struct ublox_device *ublox_device = file->private_data;
...
retry:
ublox_device->bulk_in_urb = usb_alloc_urb(0, GFP_KERNEL);
...
usb_fill_bulk_urb(...);
...
return_value = usb_submit_urb(ublox_device->bulk_in_urb, GFP_KERNEL);
...
}
static void ublox_read_bulk_callback(struct urb *urb)
{
struct ublox_device *ublox_device = urb->context;
...
usb_free_urb(ublox_device->bulk_in_urb);
...
}

Kernel driver, ioremap necessary in MMU-less system?

So, I am a total newbie when it comes to kernel drivers and have a question regarding ioremap function.
I am writing a driver for accessing some registers defined in a custom VHDL-module on a SoC with a ARM Cortex-M3 and FPGA fabric.
Looking at examples I figured I should use ioremap, but since the Cortex-M3 does not have a MMU, I don't really see the point, as per the following example:
/* Physical addresses */
static u32* rcu_trig_recv_physaddr = ((u32 *) 0x50040000);
static int rcu_trig_recv_size = 0x10; // size of 16 for testing
/* Virtual addresses */
static u32* rcu_trig_recv_virtbase = NULL;
/*removed code not relevant for the question*/
static int __init rcumodule_init(void)
{
int iResult = 0; // holding result of operations
u32 buffer;
// Register the driver
iResult = register_chrdev(rcuc_majorID, "rcuc", &rcuc_fops);
if (iResult < 0) {
printk(KERN_INFO "module init: can't register driver\n");
}
else{
printk(KERN_INFO "module init: success!\n");
}
// Map physical address to virtual address
if(rcu_trig_recv_size){
rcu_trig_recv_virtbase = (u32*) ioremap_nocache( (u32 *)rcu_trig_recv_physaddr, rcu_trig_recv_size );
printk("Remapped TRGRECV from 0x%p to 0x%p\n", rcu_trig_recv_physaddr, rcu_trig_recv_virtbase);
}
// try to read some stuff, expecting 0x17240f09
buffer = readl(rcu_trig_recv_virtbase);
printk("read %lx, at 0x%p\n", buffer, rcu_trig_recv_virtbase);
return iResult;
}
This then return, when I insmod the driver:
# insmod trigger.ko
module init: success!
Remapped TRGRECV from 0x50040000 to 0x50040000
read 17240f09, at 0x50040000
According to this, I would just be better off reading the physical address instead. Or is that a bad idea and I should be messing with my registers in a better way?
It's possible that you can get away with this if you know your code will never need to be used on another device, but you're much safer sticking with using ioremap(). Basing your code around obtaining and using the pointers provided by memory-mapped IO will make your code more portable and maintainable than utilizing hard-coded physical addresses.
Even if you don't plan on taking this code to a different device, using physical addresses could potentially break your code when simply upgrading to a newer chip in the same line.

Resources