Understanding /proc/sys/vm/lowmem_reserve_ratio - linux

I am not able to understand the meaning of the variable "lowmem_reserve_ratio" from the explanation in Documentation/sysctl/vm.txt.
I have also tried to google it, but all the explanations I found simply repeat what vm.txt says.
It would be really helpful if somebody could explain it or point to a link about it.
Here goes the original explanation:
The lowmem_reserve_ratio is an array. You can see them by reading this file.
-
% cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32
-
Note: the number of elements is one fewer than the number of zones, because the highest zone's value is not needed for the following calculation.
These values are not used directly, however. The kernel calculates the number of protection pages for each zone from them. These are shown as an array of protection pages in /proc/zoneinfo, as follows (this example is from an x86-64 box).
Each zone has an array of protection pages like this.
-
Node 0, zone      DMA
  pages free     1355
        min      3
        low      3
        high     4
        :
        :
    numa_other   0
        protection: (0, 2004, 2004, 2004)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  pagesets
    cpu: 0 pcp: 0
        :
-
These protection values are added to the watermark score used to judge whether a zone should be used for page allocation or should be reclaimed.
In this example, if normal pages (index=2) are requested from this DMA zone and watermark[WMARK_HIGH] is used as the watermark, the kernel judges that this zone should not be used, because pages_free (1355) is smaller than watermark + protection[2] (4 + 2004 = 2008). If this protection value were 0, this zone would be used for the normal-page request. If the request is for the DMA zone (index=0), protection[0] (=0) is used.
zone[i]'s protection[j] is calculated by the following expression:
(i < j):
  zone[i]->protection[j]
    = (total sum of present_pages from zone[i+1] to zone[j] on the node)
      / lowmem_reserve_ratio[i];
(i = j):
  0 (a zone does not need to be protected from allocations of its own class);
(i > j):
  not used, but reads as 0.
The default values of lowmem_reserve_ratio[i] are:
256 (if zone[i] is the DMA or DMA32 zone)
32 (others).
As the expression shows, these values are reciprocals of ratios: 256 means 1/256, so the number of protection pages is about 0.39% of the total present pages of the higher zones on the node.
If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%).

Having the same problem as you, I googled (a lot) and stumbled upon this page, which might (or might not) be more understandable than the kernel doc.
(I do not quote it here because it would be unreadable.)

I found the wording in that document really confusing too. Looking at the source in mm/page_alloc.c helped to clear it up, so let me try my hand at a more straightforward explanation:
As the page you quoted says, these numbers are "reciprocals of ratios". Worded differently: these numbers are divisors. So when calculating the reserve pages for a given zone in a node, you take the sum of pages in that node in zones higher than that one, divide it by the given divisor, and that's how many pages you're reserving for that zone.
Example: let's assume a 1 GiB node with 768 MiB in zone Normal and 256 MiB in zone HighMem (assume no zone DMA). Let's assume the default highmem reserve "ratio" (divisor) of 32. And let's assume the typical 4 KiB page size. Now we can calculate the reserve area for zone Normal:
Sum of zones "higher" than zone Normal (just HighMem): 256 MiB * (1024 KiB / 1 MiB) * (1 page / 4 KiB) = 65536 pages
Area reserved in zone Normal for this node: 65536 pages / 32 = 2048 pages = 8 MiB.
The concept stays the same when you add more zones and nodes. Just remember that the reserved size is in pages---you never reserve a fraction of a page.
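If it helps, here is the same arithmetic as a tiny user-space C sketch; the zone size, divisor, and page size are just the assumptions from the example above:
#include <stdio.h>

int main(void)
{
    /* Assumptions from the example: 256 MiB of HighMem above zone
     * Normal, 4 KiB pages, lowmem_reserve_ratio (divisor) of 32. */
    unsigned long higher_pages = 256UL * 1024 * 1024 / 4096; /* 65536 */
    unsigned long divisor = 32;
    unsigned long reserve = higher_pages / divisor;          /* 2048  */

    printf("reserved in zone Normal: %lu pages = %lu MiB\n",
           reserve, reserve * 4096 / (1024 * 1024));
    return 0;
}
This prints 2048 pages = 8 MiB, matching the calculation above.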

I find that the kernel source code explains it very well and clearly.
/*
 * setup_per_zone_lowmem_reserve - called whenever
 *	sysctl_lowmem_reserve_ratio changes.  Ensures that each zone
 *	has a correct pages reserved value, so an adequate number of
 *	pages are left in the zone after a successful __alloc_pages().
 */
static void setup_per_zone_lowmem_reserve(void)
{
	struct pglist_data *pgdat;
	enum zone_type j, idx;

	for_each_online_pgdat(pgdat) {
		for (j = 0; j < MAX_NR_ZONES; j++) {
			struct zone *zone = pgdat->node_zones + j;
			unsigned long managed_pages = zone->managed_pages;

			zone->lowmem_reserve[j] = 0;

			idx = j;
			while (idx) {
				struct zone *lower_zone;

				idx--;

				if (sysctl_lowmem_reserve_ratio[idx] < 1)
					sysctl_lowmem_reserve_ratio[idx] = 1;

				lower_zone = pgdat->node_zones + idx;
				lower_zone->lowmem_reserve[j] = managed_pages /
					sysctl_lowmem_reserve_ratio[idx];
				managed_pages += lower_zone->managed_pages;
			}
		}
	}

	/* update totalreserve_pages */
	calculate_totalreserve_pages();
}
And it even includes a demo:
/*
 * results with 256, 32 in the lowmem_reserve sysctl:
 *	1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
 *	1G machine -> (16M dma, 784M normal, 224M high)
 *	NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
 *	HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
 *	HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA
 *
 * TBD: should special case ZONE_DMA32 machines here - in those we normally
 * don't need any ZONE_NORMAL reservation
 */
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
#ifdef CONFIG_ZONE_DMA
	256,
#endif
#ifdef CONFIG_ZONE_DMA32
	256,
#endif
#ifdef CONFIG_HIGHMEM
	32,
#endif
	32,
};
In short, the expressions look like:
zone[1]->lowmem_reserve[2] = zone[2]->managed_pages / sysctl_lowmem_reserve_ratio[1]
zone[0]->lowmem_reserve[2] = (zone[1]->managed_pages + zone[2]->managed_pages) / sysctl_lowmem_reserve_ratio[0]
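To tie the code and the comment together, here is a small user-space re-implementation (a sketch, not kernel code) of that nested loop, run on the 1G example from the comment (16M DMA, 784M Normal, 224M HighMem, 4 KiB pages):
#include <stdio.h>

int main(void)
{
    const char *name[] = { "DMA", "Normal", "HighMem" };
    /* MiB -> 4 KiB pages (1 MiB = 256 pages) */
    unsigned long managed[] = { 16UL * 256, 784UL * 256, 224UL * 256 };
    unsigned long ratio[] = { 256, 32 }; /* sysctl_lowmem_reserve_ratio */
    unsigned long reserve[3][3] = { { 0 } };

    /* Same accumulation as setup_per_zone_lowmem_reserve() above */
    for (int j = 0; j < 3; j++) {
        unsigned long pages = managed[j];
        for (int idx = j - 1; idx >= 0; idx--) {
            reserve[idx][j] = pages / ratio[idx];
            pages += managed[idx];
        }
    }
    for (int i = 0; i < 3; i++)
        printf("%-7s protection: (%lu, %lu, %lu)\n",
               name[i], reserve[i][0], reserve[i][1], reserve[i][2]);
    return 0;
}
This prints 784 reserve pages in DMA against Normal allocations, and 1792 pages in Normal plus 1008 pages in DMA against HighMem allocations, matching the 784M/256, 224M/32, and (224M+784M)/256 figures in the kernel comment.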

Related

How does Linux use values for PCIDs?

I'm trying to understand how Linux uses PCIDs (aka ASIDs) on the Intel architecture. While investigating the Linux kernel's source code and patches, I found this define with the following comment:
/*
 * 6 because 6 should be plenty and struct tlb_state will fit in two cache
 * lines.
 */
#define TLB_NR_DYN_ASIDS	6
Here, I suppose, it is said that Linux uses only 6 PCID values, but what about this comment:
/*
 * The x86 feature is called PCID (Process Context IDentifier). It is similar
 * to what is traditionally called ASID on the RISC processors.
 *
 * We don't use the traditional ASID implementation, where each process/mm gets
 * its own ASID and flush/restart when we run out of ASID space.
 *
 * Instead we have a small per-cpu array of ASIDs and cache the last few mm's
 * that came by on this CPU, allowing cheaper switch_mm between processes on
 * this CPU.
 *
 * We end up with different spaces for different things. To avoid confusion we
 * use different names for each of them:
 *
 * ASID  - [0, TLB_NR_DYN_ASIDS-1]
 *         the canonical identifier for an mm
 *
 * kPCID - [1, TLB_NR_DYN_ASIDS]
 *         the value we write into the PCID part of CR3; corresponds to the
 *         ASID+1, because PCID 0 is special.
 *
 * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
 *         for KPTI each mm has two address spaces and thus needs two
 *         PCID values, but we can still do with a single ASID denomination
 *         for each mm. Corresponds to kPCID + 2048.
 *
 */
As it is said in the previous comment, I suppose that Linux uses only 6 values for PCIDs, so in brackets we see just single values (not arrays). So ASID here can be only 0 and 5, kPCID can be only 1 and 6 and uPCID can only be 2049 and 2048 + 6 = 2054, right?
At this moment I have a few questions:
Why are there only 6 values for PCIDs? (Why is it plenty?)
Why will tlb_state structure fit in two cache lines if we choose 6 PCIDs?
Why does Linux use exactly these values for ASID, kPCID, and uPCID (I'm referring to the second comment)?
As it is said in the previous comment I suppose that Linux uses only 6 values for PCIDs so in brackets we see just single values (not arrays)
No, this is wrong, those are ranges. [0, TLB_NR_DYN_ASIDS-1] means from 0 to TLB_NR_DYN_ASIDS-1 inclusive. Keep reading for more details.
There are a few things to consider:
1. The difference between ASID (Address Space IDentifier) and PCID (Process-Context IDentifier) is just nomenclature: Linux calls this feature ASID across all architectures, while Intel calls its implementation PCID. Linux ASIDs start at 0; Intel's PCIDs start at 1, because 0 is special and means "no PCID".
2. On x86 processors that support the feature, PCIDs are 12-bit values, so technically 4095 different PCIDs are possible (1 through 4095, as 0 is special).
3. Due to Kernel Page-Table Isolation, Linux nonetheless needs two different PCIDs per task. The distinction between kPCID and uPCID is made for this reason: each task effectively has two different virtual address spaces whose address translations need to be cached separately, and thus use different PCIDs. So we are down to 2047 usable pairs of PCIDs (plus one final unpaired value that simply goes unused).
4. Any normal system can easily exceed 2047 tasks on a single CPU, so no matter how many bits you use, you will never have enough PCIDs for all existing tasks. On systems with a lot of CPUs you will also not have enough PCIDs for all active tasks.
5. Due to 4, you cannot implement PCID support as a simple assignment of a unique value to each existing/active task (e.g. as is done for PIDs). Multiple tasks will need to "share" the same PCID sooner or later (not at the same time, but at different points in time). The logic to manage PCIDs therefore needs to be different.
6. The choice made by the Linux developers was to use PCIDs as a way to optimize access to the most recently used mms (struct mm). This was implemented with a per-CPU array (cpu_tlbstate.ctxs) that is linearly scanned on each mm switch, so even small values of TLB_NR_DYN_ASIDS can easily trash performance instead of improving it. Apparently 6 was a good number to choose, as it provided a decent performance improvement. This means that only the 6 most recently used mms will use non-zero PCIDs (OK, technically the 6 most recently used user/kernel mm pairs).
You can see this reasoning explained more concisely in the commit message of the patch that implemented PCID support.
Why will tlb_state structure fit in two cache lines if we choose 6 PCIDs?
Well that's just simple math:
struct tlb_state {
	struct mm_struct *         loaded_mm;             /*     0     8 */
	union {
		struct mm_struct * last_user_mm;          /*     8     8 */
		long unsigned int  last_user_mm_spec;     /*     8     8 */
	};                                                /*     8     8 */
	u16                        loaded_mm_asid;        /*    16     2 */
	u16                        next_asid;             /*    18     2 */
	bool                       invalidate_other;      /*    20     1 */

	/* XXX 1 byte hole, try to pack */

	short unsigned int         user_pcid_flush_mask;  /*    22     2 */
	long unsigned int          cr4;                   /*    24     8 */
	struct tlb_context         ctxs[6];               /*    32    96 */

	/* size: 128, cachelines: 2, members: 8 */
	/* sum members: 127, holes: 1, sum holes: 1 */
};
(information extracted through pahole from a kernel image with debug symbols)
The array of struct tlb_context is used to keep track of ASIDs; it holds TLB_NR_DYN_ASIDS (6) entries of 16 bytes each. Those 96 bytes plus the 32 bytes of members before the array come to exactly 128 bytes, i.e. two 64-byte cache lines.
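For intuition, the mm-switch logic over that array is essentially a linear scan with round-robin recycling. The following is a simplified user-space sketch of the idea, not the actual kernel code (which lives in choose_new_asid() in arch/x86/mm/tlb.c and also compares TLB generations):
#define TLB_NR_DYN_ASIDS 6

struct tlb_context {
    unsigned long ctx_id; /* identifies an mm */
};

static struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
static unsigned short next_asid;

/* Pick an ASID for the mm identified by ctx_id. On a hit, TLB entries
 * tagged with that ASID are still usable; on a miss we recycle a slot
 * round-robin and must flush, since stale translations for the evicted
 * mm share the recycled ASID. */
static unsigned short choose_asid(unsigned long ctx_id, int *need_flush)
{
    for (unsigned short asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
        if (ctxs[asid].ctx_id == ctx_id) {
            *need_flush = 0;
            return asid;
        }
    }
    /* Not among the 6 most recently seen mms: evict a slot. */
    unsigned short asid = next_asid;
    next_asid = (next_asid + 1) % TLB_NR_DYN_ASIDS;
    ctxs[asid].ctx_id = ctx_id;
    *need_flush = 1;
    return asid;
}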

NtQueryObject returns wrong insufficient required size via WOW64, why?

I am using the NT native API NtQueryObject()/ZwQueryObject() from user mode (and I am aware of the risks in general and I have written kernel mode drivers for Windows in the past in my professional capacity).
Generally when one uses the typical "query information" function (of which there are a few) the protocol is first to ask with a too small buffer to retrieve the required size with STATUS_INFO_LENGTH_MISMATCH, then allocate a buffer of said size and query again -- this time using the buffer and previously returned size.
In order to get the list of object types (67 on my build) on the system I am doing just that:
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
And in Size I get 8280 (WOW64) and 8968 (x64). I then proceed to allocate the buffer with calloc() and query again:
ULONG Size2 = 0;
BYTE* Buf = (BYTE*)::calloc(1, Size);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Size, &Size2);
NB: ObjectTypesInformation is 3. It isn't declared in winternl.h, but Nebbett (as ObjectAllTypesInformation) and others describe it. Since I am not querying for a particular object's traits but the system-wide list of object types, I pass NULL for the object handle.
Curiously, on WOW64, i.e. 32-bit, the value in Size2 upon return from the second query is 16 bytes bigger (8280 + 16 = 8296) than the previously returned required size.
As far as alignment is concerned, I'd expect at most 8 bytes for this sort of thing, and indeed neither 8280 nor 8296 is on a 16-byte alignment boundary, but both are on an 8-byte one.
Certainly I could add some slack space on top of the returned required size (e.g. ALIGN_UP to the next 32-byte alignment boundary), but this seems highly irregular, to be honest. And I'd rather understand what's going on than implement a workaround that breaks because I missed something crucial.
The practical issue for the code is that in Debug configurations it tells me there's a corrupted heap somewhere upon freeing Buf. Which suggests that NtQueryObject() was indeed writing these extra 16 bytes beyond the buffer I provided.
Question: Any idea why it is doing that?
As usual for NT native API the sources of information are scarce. The x64 version of the exact same code returns the exact number of bytes required. So my thinking here is that WOW64 is the issue. A somewhat cursory look into wow64.dll with IDA didn't reveal any immediate points for suspicion regarding what goes wrong in translating the results to 32-bit here.
PS: Windows 10 (10.0.19043, ntdll.dll "timestamp" 77755782)
PPS: this may be related: https://wj32.org/wp/2012/11/30/obquerytypeinfo-and-ntqueryobject-buffer-overrun-in-windows-8/ Tested it, by checking that OBJECT_TYPE_INFORMATION::TypeName.Length + sizeof(WCHAR) == OBJECT_TYPE_INFORMATION::TypeName.MaximumLength in all returned items, which was the case.
The only part of ObjectTypesInformation that's public is the first field defined in winternl.h header in the Windows SDK:
typedef struct __PUBLIC_OBJECT_TYPE_INFORMATION {
    UNICODE_STRING TypeName;
    ULONG Reserved [22];    // reserved for internal use
} PUBLIC_OBJECT_TYPE_INFORMATION, *PPUBLIC_OBJECT_TYPE_INFORMATION;
For x86 this is 96 bytes, and for x64 this is 104 bytes (assuming you have the right packing mode enabled). The difference is the pointer in UNICODE_STRING which changes the alignment in x64.
Any additional memory space should be related to the TypeName buffer.
UNICODE_STRING accounts for 8 bytes of the difference between 8280 and 8296. The function uses sizeof(ULONG_PTR) for alignment of the returned string, plus an extra WCHAR, so that could easily account for the remaining 8 bytes.
AFAIK: The public use of NtQueryObject is supposed to be limited to kernel-mode use which of course means it always matches the OS native bitness (x86 code can't run as kernel in x64 native OS), so it's probably just a quirk of using the NT functions via the WOW64 thunk.
Alright, I think I figured out the issue with the help of WinDbg and a thorough look at wow64.dll using IDA.
NB: the wow64.dll I have has the same build number, but differs slightly in data only (checksum, security directory entry, pieces from version resources). The code is identical, which was to be expected, given deterministic builds and how they affect the PE timestamp.
There's an internal function called whNtQueryObject_SpecialQueryCase (according to PDBs), which covers the ObjectTypesInformation class queries.
For the above wow64.dll I used the following points of interest in WinDbg, from a 32 bit program which calls NtQueryObject(NULL, ObjectTypesInformation, ...) (the program itself is irrelevant, though):
0:000> .load wow64exts
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B0E0
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B14E
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B1A7
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B24A
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B252
Explanation of the above points of interest:
+B0E0: computing length required for 64 bit query, based on passed length for 32 bit
+B14E: call to NtQueryObject()
+B1A7: loop body for copying 64 to 32 bit buffer contents, after successful NtQueryObject() call
+B24A: computing written length by subtracting current (last + 1) entry from base buffer address
+B252: downsizing returned (64 bit) required length to 32 bit
The logic of this function in regards to just ObjectTypesInformation is roughly as follows:
Common steps
Take the ObjectInformationLength (32-bit query!) argument and size it up to fit the 64-bit info
Align the resulting size up to the next 16-byte boundary
If necessary, allocate the resulting amount from some PEB::ProcessHeap and store it in TLS slot 3; otherwise reuse the existing buffer there as scratch space
Call NtQueryObject(), passing the buffer and length from the two previous steps
The length passed to NtQueryObject() is the one from step 1, not the one aligned to a 16-byte boundary. There seems to be some sort of header to this scratch space, so perhaps that's where the 16-byte alignment comes from?
Case 1: buffer size too small (here: 4), just querying required length
The up-sized length in this case equals 4, which is too small, and consequently NtQueryObject() returns STATUS_INFO_LENGTH_MISMATCH. The required size is reported as 8968.
Down-size the 64-bit required length to 32 bit, ending up 16 bytes too short
Return the status from NtQueryObject() and the down-sized required length from the previous step
Case 2: buffer size supposedly (!) sufficient
Copy OBJECT_TYPES_INFORMATION::NumberOfTypes from the queried buffer to the 32-bit one
Step to the first entry (OBJECT_TYPE_INFORMATION) of the source (64-bit) and target (32-bit) buffers, 8- and 4-byte aligned respectively
For each entry up to OBJECT_TYPES_INFORMATION::NumberOfTypes:
Copy UNICODE_STRING::Length and UNICODE_STRING::MaximumLength for the TypeName member
memcpy() UNICODE_STRING::Length bytes from the source to the target UNICODE_STRING::Buffer (target entry + sizeof(OBJECT_TYPE_INFORMATION32))
Add a terminating zero (WCHAR) past the memcpy'd string
Copy the individual members past TypeName from the 64-bit to the 32-bit struct
Compute the pointer to the next source entry by aligning UNICODE_STRING::MaximumLength up to an 8-byte boundary (i.e. the ULONG_PTR alignment mentioned in the other answer) and adding sizeof(OBJECT_TYPE_INFORMATION64) (already 8-byte aligned!)
The next target entry (32 bit) gets 4-byte aligned instead
At the end, compute the required (32-bit) length by subtracting the value we arrived at for the "next" entry (i.e. one past the last) from the base address of the buffer passed by the WOW64 program (32 bit) to NtQueryObject()
In my debugged scenario these were: 0x008ce050 - 0x008cbfe8 = 0x00002068 (= 8296), which is 16 bytes larger than the buffer length we were told during case 1 (8280)!
The issue
That crucial last step differs between merely querying and actually getting the buffer filled. There is no further bounds checking in that loop I described for case 2.
And this means it will just overrun the passed buffer and return a written length bigger than the buffer length passed to it.
Possible solutions and workarounds
I'll have to approach this mathematically after some sleep. The workaround is obviously to top up the required length returned in case 1 in order to avoid the buffer overrun. The easiest method is to apply my up_size_from_32bit() from the example below to the returned required size. This way you are allocating enough for the 64-bit buffer while querying the 32-bit one, which should never overrun during the copy loop.
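Applied to the snippet from the question, the workaround would look roughly like this (a sketch only; up_size_from_32bit() is the constexpr helper from the example program further down):
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
// Pad the reported size up to what the 64-bit variant of the data needs,
// so the WOW64 copy loop cannot run past the end of the buffer.
ULONG Padded = up_size_from_32bit(Size);
BYTE* Buf = (BYTE*)::calloc(1, Padded);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Padded, &Size);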
However, the fix in wow64.dll is a little more involved, I guess. While adding bounds checking to the loop would avert the overrun, it would mean that the caller would have to query the required size twice, because the first time around it lies to us.
This means the query-only case (1) would have to allocate that internal buffer after querying the required length for 64 bit, get it filled, and then walk the entries (just like the copy loop), skipping over the last entry, to compute the required length the same way it is now computed after the copy loop.
Example program demonstrating the "static" computation by wow64.dll
Build for x64, just the way wow64.dll was!
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <cstdio>
typedef struct
{
    ULONG JustPretending[24];
} OBJECT_TYPE_INFORMATION32;

typedef struct
{
    ULONG JustPretending[26];
} OBJECT_TYPE_INFORMATION64;

constexpr ULONG size_delta_3264 = sizeof(OBJECT_TYPE_INFORMATION64) - sizeof(OBJECT_TYPE_INFORMATION32);

constexpr ULONG down_size_to_32bit(ULONG len)
{
    return len - size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION64));
}

constexpr ULONG up_size_from_32bit(ULONG len)
{
    return len + size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION32));
}

// Trying to mimic the wdm.h macro
constexpr size_t align_up_by(size_t address, size_t alignment)
{
    return (address + (alignment - 1)) & ~(alignment - 1);
}
constexpr auto u32 = 8280UL;
constexpr auto u64 = 8968UL;
constexpr auto from_64 = down_size_to_32bit(u64);
constexpr auto from_32 = up_size_from_32bit(u32);
constexpr auto from_32_16_byte_aligned = (ULONG)align_up_by(from_32, 16);
int wmain()
{
    wprintf(L"32 to 64 bit: %u -> %u -(16-byte-align)-> %u\n", u32, from_32, from_32_16_byte_aligned);
    wprintf(L"64 to 32 bit: %u -> %u\n", u64, from_64);
    return 0;
}
static_assert(sizeof(OBJECT_TYPE_INFORMATION32) == 96, "Size for 32 bit struct does not match.");
static_assert(sizeof(OBJECT_TYPE_INFORMATION64) == 104, "Size for 64 bit struct does not match.");
static_assert(u32 == from_64, "Must match (from 64 to 32 bit)");
static_assert(u64 == from_32, "Must match (from 32 to 64 bit)");
static_assert(from_32_16_byte_aligned % 16 == 0, "16 byte alignment failed");
static_assert(from_32_16_byte_aligned > from_32, "We're aligning up");
This does not mimic the computation that happens in case 2, though.

How to pass efficiently the hugepages-backed buffer to the BM DMA device in Linux?

I need to provide a huge circular buffer (a few GB) for the bus-mastering DMA PCIe device implemented in FPGA.
The buffer should not be reserved at boot time. Therefore, it may not be contiguous.
The device supports scatter-gather (SG) operation, but for performance reasons, the addresses and lengths of consecutive contiguous segments of the buffer are stored inside the FPGA.
Therefore, usage of standard 4KB pages is not acceptable (there would be up to 262144 segments for each 1GB of the buffer).
The right solution should allocate a buffer consisting of 2MB hugepages in user space (reducing the maximum number of segments by a factor of 512).
The virtual address of the buffer should be transferred to the kernel driver via ioctl. Then the addresses and the length of the segments should be calculated and written to the FPGA.
In theory, I could use get_user_pages to create the list of the pages, and then call sg_alloc_table_from_pages to obtain the SG list suitable to program the DMA engine in FPGA.
Unfortunately, in this approach I must prepare an intermediate list of page structures, with 262144 pages per 1GB of buffer. This list is stored in RAM, not in the FPGA, so it is less problematic, but it would still be good to avoid it.
In fact I don't need to keep the pages mapped for the kernel, as the hugepages are protected against swapping out, and they are mapped for the user-space application that will process the received data.
So what I'm looking for is a function, say sg_alloc_table_from_user_hugepages, that could take such a user-space address of a hugepages-backed memory buffer and turn it directly into the right scatterlist, without performing unnecessary and memory-consuming mapping for the kernel.
Of course such a function should verify that the buffer indeed consists of hugepages.
I have found and read these posts: (A), (B), but couldn't find a good answer.
Is there any official method to do it in the current Linux kernel?
At the moment I have a very inefficient solution based on get_user_pages_fast:
int sgt_prepare(const char __user *buf, size_t count,
                struct sg_table *sgt, struct page ***a_pages,
                int *a_n_pages)
{
    int res = 0;
    int n_pages;
    struct page **pages = NULL;
    const unsigned long offset = ((unsigned long)buf) & (PAGE_SIZE - 1);

    //Calculate number of pages
    n_pages = (offset + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
    printk(KERN_ALERT "n_pages: %d", n_pages);

    //Allocate the table for pages
    pages = vzalloc(sizeof(*pages) * n_pages);
    printk(KERN_ALERT "pages: %p", pages);
    if (pages == NULL) {
        res = -ENOMEM;
        goto sglm_err1;
    }

    //Now pin the pages
    res = get_user_pages_fast(((unsigned long)buf & PAGE_MASK), n_pages, 0, pages);
    printk(KERN_ALERT "gupf: %d", res);
    if (res < n_pages) {
        int i;
        for (i = 0; i < res; i++)
            put_page(pages[i]);
        res = -ENOMEM;
        goto sglm_err1;
    }

    //Now create the sg-list
    res = sg_alloc_table_from_pages(sgt, pages, n_pages, offset, count, GFP_KERNEL);
    printk(KERN_ALERT "satf: %d", res);
    if (res < 0)
        goto sglm_err2;

    *a_pages = pages;
    *a_n_pages = n_pages;
    return res;

sglm_err2:
    //Here we jump if we know that the pages are pinned
    {
        int i;
        for (i = 0; i < n_pages; i++)
            put_page(pages[i]);
    }
sglm_err1:
    if (sgt)
        sg_free_table(sgt);
    if (pages)
        vfree(pages); //pages came from vzalloc(), so vfree(), not kfree()
    *a_pages = NULL;
    *a_n_pages = 0;
    return res;
}

void sgt_destroy(struct sg_table *sgt, struct page **pages, int n_pages)
{
    int i;

    //Free the sg list
    if (sgt->sgl)
        sg_free_table(sgt);

    //Unpin pages
    for (i = 0; i < n_pages; i++) {
        set_page_dirty(pages[i]);
        put_page(pages[i]);
    }
}
The sgt_prepare function builds the sg_table sgt structure that I can use to create the DMA mapping. I have verified that it contains a number of entries equal to the number of hugepages used.
Unfortunately, it requires that the list of pages be created (allocated and returned via the a_pages pointer argument) and kept as long as the buffer is used.
Therefore, I really dislike that solution. Right now I have 256 2MB hugepages used as a DMA buffer, which means I have to create and keep 128*1024 unnecessary page structures. I also waste 512 MB of kernel address space for the unnecessary kernel mapping.
The interesting question is whether a_pages may be kept only temporarily (until the sg-list is created)? In theory it should be possible, as the pages are still locked...
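For what it's worth, the page pointers can be recovered from the sg_table itself, so the a_pages array should only need to live until sg_alloc_table_from_pages() returns. An untested sketch of a teardown that walks the table instead of the array:
#include <linux/scatterlist.h>
#include <linux/mm.h>

/* Untested sketch: unpin the buffer by walking the sg_table itself.
 * With this, sgt_prepare() could vfree() the pages array right after
 * sg_alloc_table_from_pages() succeeds. */
void sgt_destroy_from_table(struct sg_table *sgt)
{
    struct sg_page_iter piter;

    /* Visit every page backing the table, even where contiguous pages
     * were coalesced into a single scatterlist entry. */
    for_each_sg_page(sgt->sgl, &piter, sgt->orig_nents, 0) {
        struct page *page = sg_page_iter_page(&piter);

        set_page_dirty(page);
        put_page(page);
    }
    sg_free_table(sgt);
}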

Is there any limit in setting zram device disksize on Linux?

I'm trying to create a zram device on my target device. My target can not allocate memory if the zram disksize is above 100GB, but it's okay with the disksize of 50GB or less.
Is there any limit in setting zram device disksize on Linux? My target device only has 2GB of RAM memory.
I guess you can give a number up to UINT64_MAX - 4095 = 18446744073709547520 on a 64-bit platform.
https://github.com/torvalds/linux/blob/master/drivers/block/zram/zram_drv.h#L101
https://github.com/torvalds/linux/blob/master/drivers/block/zram/zram_drv.c#L1506
https://github.com/torvalds/linux/blob/master/drivers/block/zram/zram_drv.c#L901
So what we have:
... disksize_store(...) {
    u64 disksize;
    ...
    // ok, we can pass at least UINT64_MAX here
    disksize = memparse(...); // memparse() returns an unsigned long long
    // PAGE_ALIGN, with PAGE_SIZE = 1<<12
    disksize = PAGE_ALIGN(disksize)
             = (((disksize) + ((PAGE_SIZE) - 1)) & (~((typeof(disksize))(PAGE_SIZE) - 1)))
             = (disksize + ((1 << 12) - 1)) & (~((1 << 12) - 1))
             = (disksize + 4095) & 0xfffffffffffff000
    //          ^^^^^^^^^^^^^^^ this can overflow;
    // the max number that doesn't overflow is UINT64_MAX - 4095,
    // for anything above that this macro will return 0
    ...
    if (!zram_meta_alloc(..., disksize)) {
        ...
        return ...;
    }
    ...
    zram->disksize = disksize;
    ...
}
So let's look into zram_meta_alloc:
... zram_meta_alloc(..., disksize) {
    ...
    num_pages = disksize >> PAGE_SHIFT;
    // max num_pages = UINT64_MAX >> PAGE_SHIFT = 0xfffffffffffff
    ... = vzalloc(num_pages * sizeof(*zram->table));
    //            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ this can overflow
    ...
}
vzalloc takes an unsigned long argument, and ULONG_MAX equals UINT64_MAX on a 64-bit platform. sizeof(*zram->table) is sizeof(unsigned long) + sizeof(unsigned long) [optionally + sizeof(ktime_t)] + padding (see here). Without padding, and assuming a 64-bit platform with sizeof(unsigned long) = 8, that is 8+8[+8] = 16 or 24 bytes. Either way, the maximum num_pages is UINT64_MAX >> 12, so to overflow the 64-bit multiplication we would need sizeof(*zram->table) = 2^PAGE_SHIFT = 4096, which shouldn't happen (unless the compiler decides to stuff over 4000 bytes of padding into the zram->table struct). So we are left with UINT64_MAX - 4095.
So the maximum disksize is UINT64_MAX - 4095. If you give a disksize equal to UINT64_MAX - x, where 0 <= x < 4095, then because of the PAGE_ALIGN macro the disksize will effectively be set to 0. This should probably be brought up to the kernel developers so they can modify the PAGE_ALIGN macro to handle such numbers.
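The overflow is easy to demonstrate in user space by re-implementing the macro (a sketch assuming PAGE_SIZE = 4096):
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096ULL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
    uint64_t ok  = UINT64_MAX - 4095; /* largest value that survives   */
    uint64_t bad = UINT64_MAX - 100;  /* x < 4095: wraps, aligns to 0  */

    printf("PAGE_ALIGN(UINT64_MAX - 4095) = %llu\n",
           (unsigned long long)PAGE_ALIGN(ok));
    printf("PAGE_ALIGN(UINT64_MAX - 100)  = %llu\n",
           (unsigned long long)PAGE_ALIGN(bad));
    return 0;
}
The first call returns UINT64_MAX - 4095 unchanged; the second wraps around during the addition and the mask then yields 0.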
Six days ago, a call to array_size() was added around the vzalloc() arguments to protect against this overflow, with this commit.
There is no limit, but there is an overhead.
"Note that zram uses about 0.1% of the size of the disk when not in use so a huge zram is wasteful."
https://www.kernel.org/doc/Documentation/blockdev/zram.txt
Also, disksize is a purely virtual size: it is the maximum uncompressed capacity, presented as the general disk parameters, while the actual memory used depends on the input and on the compression ratio achieved by the chosen algorithm.
The only 'actual' control is via mem_limit, which is the compressed size plus disk & zram overheads.
The compression ratio is completely dependent on the algorithm chosen from /proc/crypto: zlib & zstd are far more effective but far slower. It is also very dependent on the input; with text, zlib & zstd can achieve over double what lzo & lz4 will.
If the input is already compressed, any algorithm might achieve little to zero compression, and without a mem_limit zram could grab much precious memory from the system.
mem_limit is the maximum you are prepared to let zram grab from the system, and a disksize any larger than the expected compression ratio applied to mem_limit is likely a waste:
it will never get used, but it will be part of the 0.1% empty-creation overhead.
Maybe try https://github.com/StuartIanNaylor/zram-config

Different memory allocations in linux and windows?

I have a tree (T*Tree: a binary tree with many elements per node) implemented in C++.
I want to insert around 5,000,000 integer values into it (say, from 1 to 5,000,000). The tree size should be around 8 * 5,000,000 bytes, or 41MB, in memory (according to my implementation, which is reasonable).
When I display the size of the tree (in my program, by summing the size of every node), it is 41MB as expected. However, when I checked the Windows 32-bit "Task Manager", I found the memory taken was 732MB!
I checked that there is no extra malloc in my code. Even after I freed the tree, traversing from node to node and deleting them (and the keys inside), the size in "Task Manager" only went down to 513MB!
After that, I compiled the same code on Linux Ubuntu 32-bit (a virtual machine on another PC) and ran the program. Again, the tree size in my program is 41MB as expected, but in the "System Monitor" the memory is 230MB, and when freeing the tree nodes in my program the memory in "System Monitor" stays at 230MB.
In both Windows and Linux, if I free & reinitialize the tree and insert 5,000,000 integer values again, the memory doubles, as if the previous space were not freed but used somewhere (where, I have not been able to find).
The questions:
1) Why are there such huge memory differences between Windows and Linux although the code & input data are the same?
2) Why doesn't freeing the tree nodes reduce the memory to some reasonable value like 10MB?
code: https://drive.google.com/open?id=0ByKaCojxzNa9dEt6cEJNeDI4eXc
below are some snippets:
struct Keylist {
    unsigned int k;
    struct Keylist *next_ptr;
};
typedef struct Keylist Keylist;

struct TstarTreeNode {
    //Binary Node specific
    struct TstarTreeNode *left;
    struct TstarTreeNode *right;
    //Bool rightVisitedDuringInsert;
    //AVL Node specific
    int height;
    //T Node specific
    int length; //length of keys array for easy locating
    struct Keylist *keys; //later you deal with it like a one dimensional array
    int max; //max key
    int min; //min key
    //T* Node specific
    struct TstarTreeNode *successor;
};
typedef struct TstarTreeNode TstarTreeNode;

/*****************************************************************************
*                                                                            *
*  Define a structure for binary trees.                                      *
*                                                                            *
*****************************************************************************/
struct TstarTree {
    int size;     //number of elements (not number of nodes) in the tree
    int MinCount; //min count of elements in a node
    int MaxCount; //max count of elements in a node
    TstarTreeNode *root;
    //Functions for comparing elements and checking ranges
    int (*compare)(int key1, int key2);        // -1 smaller, 0 equal, 1 bigger
    int (*inRange)(int key, int min, int max); // -1 smaller, 0 in range, 1 bigger
};
typedef struct TstarTree TstarTree;
The tree's insert function uses dynamic allocation, i.e. malloc.
Update
According to what John Zwinck pointed out (thanks, John), I have two things now:
1) The huge memory usage on Windows was because of the compile options in Visual Studio, which I think enabled debugging and a lot of extra things. When I compiled on Windows using Cygwin without those options, i.e. "gcc main.c tstarTree.c -o main", I got the same result as on Linux. The size in the Windows "Task Manager" now becomes 230MB.
2) If the OS is 64-bit, then the size is calculated as follows (as John said, with my modifications):
5 million unsigned int k. 20 MB.
5 million 4-byte pads (after k to align next_ptr). 20 MB.
5 million 8-byte next_ptr. 40 MB.
5 million times the overhead of malloc(). I think for a 64-bit OS it is 32 bytes each (according to the link John provided), so 160 MB.
N TstarTreeNodes, each of which is 48 bytes in the full code.
N times the overhead of malloc() (I think, 32 bytes each).
N is the number of nodes. I have a resulting balanced complete tree of height 16, so I assume the number of nodes is 2^17-1, making the last two items 6.2MB (i.e. 2^17 * 48) + 4.1MB (i.e. 2^17 * 32) = 10MB.
So the total is: 20+20+40+160+10 = 250MB, which is somewhat reasonable and close to 230MB.
However, I have 32-bit Windows/Linux, where it will be (I think):
5 million unsigned int k. 20 MB.
5 million 4-byte next_ptr. 20 MB.
5 million times the overhead of malloc(). I think for a 32-bit OS it is 16 bytes each, so 80 MB.
N TstarTreeNodes, each of which is 32 bytes in the full code.
N times the overhead of malloc() (I think, 16 bytes each).
N is the number of nodes. I have a resulting balanced complete tree of height 16, so I assume the number of nodes is 2^17-1, making the last two items 4.1MB (i.e. 2^17 * 32) + 2MB (i.e. 2^17 * 16) = 6MB.
So the total is: 20+20+80+6 = 126MB, which is quite far from the 230MB I get in "Task Manager" (if you know why, please tell me).
Currently the remaining important question is: why isn't the tree freed from memory when I free all the nodes and keys in the tree using this code:
void freekeys(struct Keylist ** keys){
    if ((*keys) == NULL)
    {
        return;
    }
    freekeys(&(*keys)->next_ptr);
    (*keys)->next_ptr = NULL;
    free((*keys));
    (*keys) = NULL;
}

void freeTree(struct TstarTreeNode ** tree){
    if ((*tree) == NULL)
    {
        return;
    }
    freeTree(&(*tree)->left);
    freeTree(&(*tree)->right);
    freekeys(&(*tree)->keys);
    (*tree)->keys = NULL;
    (*tree)->left = NULL;
    (*tree)->right = NULL;
    (*tree)->successor = NULL;
    free((*tree));
    (*tree) = NULL;
}
and in main():
TstarTree * tree;
...
freeTree(&tree->root);
free(tree);
Note:
The tree works perfectly (insert, update, delete, lookup, display...), but when I try to free it from memory, nothing changes in its reported size.
You say your data takes:
8 * 5,000,000 byte or 41MB in memory
But that is not correct. Looking at your code there are two main structures:
struct Keylist {
    unsigned int k;
    Keylist *next_ptr;
};

struct TstarTreeNode {
    TstarTreeNode *left, *right;
    Keylist *keys;
    TstarTreeNode *successor;
};
Let's say we have 5 million integers to store, as in your example. What will we need?
5 million unsigned int k. 20 MB.
5 million 4-byte pads (after k to align next_ptr). 20 MB.
5 million 8-byte next_ptr. 40 MB.
5 million times the overhead of malloc(). Likely 16 bytes each. 80 MB.
N TstarTreeNodes, each of which is 48 bytes in the full code.
N times the overhead of malloc() (again, 16 bytes each).
If N is 500,000 (for example, I don't know the real value but you do), those last two items add up to 32 MB. That brings the total to at least 192 MB as a bare minimum. Therefore, seeing 230 MB of memory usage in Linux is not surprising.
Some systems, especially when optimization is not fully enabled at build time, will add more bookkeeping and debugging information to each block allocated with malloc(). Are you building with optimization fully enabled?
One way you can save a lot of overhead is to stop using Keylist and just store the integers in plain arrays (created with malloc(), but only one per TstarTreeNode).
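A sketch of what that could look like (hypothetical names; one allocation per node replaces one per key, eliminating the per-key next_ptr, padding, and allocator overhead):
#include <stdlib.h>

struct ArrayNode {
    struct ArrayNode *left, *right, *successor;
    int height;
    int length;         /* keys currently stored      */
    int capacity;       /* size of the keys array     */
    unsigned int *keys; /* replaces the Keylist chain */
};

static struct ArrayNode *node_create(int capacity)
{
    struct ArrayNode *n = calloc(1, sizeof *n);
    if (!n)
        return NULL;
    n->keys = malloc((size_t)capacity * sizeof *n->keys); /* one malloc per node */
    if (!n->keys) {
        free(n);
        return NULL;
    }
    n->capacity = capacity;
    return n;
}

static void node_destroy(struct ArrayNode *n)
{
    if (!n)
        return;
    free(n->keys); /* one free per node instead of one per key */
    free(n);
}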
