How does Linux use values for PCIDs? - linux

I'm trying to understand how Linux uses PCIDs (aka ASIDs) on Intel architecture. While I was investigating the Linux kernel's source code and patches I found such a define with the comment:
/*
* 6 because 6 should be plenty and struct tlb_state will fit in two cache
* lines.
*/
#define TLB_NR_DYN_ASIDS 6
Here is, I suppose, said that Linux uses only 6 PCID values, but what about this comment:
/*
* The x86 feature is called PCID (Process Context IDentifier). It is similar
* to what is traditionally called ASID on the RISC processors.
*
* We don't use the traditional ASID implementation, where each process/mm gets
* its own ASID and flush/restart when we run out of ASID space.
*
* Instead we have a small per-cpu array of ASIDs and cache the last few mm's
* that came by on this CPU, allowing cheaper switch_mm between processes on
* this CPU.
*
* We end up with different spaces for different things. To avoid confusion we
* use different names for each of them:
*
* ASID - [0, TLB_NR_DYN_ASIDS-1]
* the canonical identifier for an mm
*
* kPCID - [1, TLB_NR_DYN_ASIDS]
* the value we write into the PCID part of CR3; corresponds to the
* ASID+1, because PCID 0 is special.
*
* uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
* for KPTI each mm has two address spaces and thus needs two
* PCID values, but we can still do with a single ASID denomination
* for each mm. Corresponds to kPCID + 2048.
*
*/
As it is said in the previous comment, I suppose that Linux uses only 6 values for PCIDs, so in brackets we see just single values (not arrays). So ASID here can be only 0 and 5, kPCID can be only 1 and 6 and uPCID can only be 2049 and 2048 + 6 = 2054, right?
At this moment I have a few questions:
Why are there only 6 values for PCIDs? (Why is it plenty?)
Why will tlb_state structure fit in two cache lines if we choose 6 PCIDs?
Why does Linux use exactly these values for ASID, kPCID, and uPCID (I'm referring to the second comment)?

As it is said in the previous comment I suppose that Linux uses only 6 values for PCIDs so in brackets we see just single values (not arrays)
No, this is wrong, those are ranges. [0, TLB_NR_DYN_ASIDS-1] means from 0 to TLB_NR_DYN_ASIDS-1 inclusive. Keep reading for more details.
There are a few things to consider:
The difference between ASID (Address Space IDentifier) and PCID (Process-Context IDentifier) is just nomenclature: Linux calls this feature ASID across all architectures. Intel calls its implementation PCID. Linux ASIDs start at 0, Intel's PCIDs start at 1 because 0 is special and means "no PCID".
On x86 processors that support the feature, PCIDs are 12-bit values, so technically 4095 different PCIDs are possible (1 through 4095, as 0 is special).
Due to Kernel Page-Table Isolation Linux will nonetheless need two different PCIDs per task. The distinction between kPCID and uPCID is made for this reason, as each task effectively has two different virtual address spaces whose address translations need to be cached separately thus using different PCID. So we are down to 2047 usable pairs of PCIDs (plus the last single one that would just be unused).
Any normal system can easily exceed 2047 tasks on a single CPU, so no matter how many bits you use, you will never be able to have enough PCIDs for all existing tasks. On systems with a lot of CPUs you will also not have enough PCIDs for all active tasks.
Due to 4, you cannot implement PCID support as a simple assignment of a unique value for each existing/active task (e.g. like it is done for PIDs). Multiple tasks will need to "share" the same PCID sooner or later (not at the same time, but at different points in time). The logic to manage PCIDs will therefore need to be different.
The choice made by Linux developers was to use PCIDs as a way to optimize accesses to the most recently used mms (struct mm). This was implemented using a global per-CPU array (cpu_tlbstate.ctxs) that is linearly scanned on each mm-switch. Even small values of TLB_NR_DYN_ASIDS can easily trash performance instead of improving it. Apparently, 6 was a good number to choose as it provided a decent performance improvement. This means that only the 6 most-recently-used mms will use non-zero PCIDs (OK, technically the 6 most-recently-used user/kernel mm pairs).
You can see this reasoning explained more concisely in the commit message of the patch that implemented PCID support.
Why will tlb_state structure fit in two cache lines if we choose 6 PCIDs?
Well that's just simple math:
struct tlb_state {
struct mm_struct * loaded_mm; /* 0 8 */
union {
struct mm_struct * last_user_mm; /* 8 8 */
long unsigned int last_user_mm_spec; /* 8 8 */
}; /* 8 8 */
u16 loaded_mm_asid; /* 16 2 */
u16 next_asid; /* 18 2 */
bool invalidate_other; /* 20 1 */
/* XXX 1 byte hole, try to pack */
short unsigned int user_pcid_flush_mask; /* 22 2 */
long unsigned int cr4; /* 24 8 */
struct tlb_context ctxs[6]; /* 32 96 */
/* size: 128, cachelines: 2, members: 8 */
/* sum members: 127, holes: 1, sum holes: 1 */
};
(information extracted through pahole from a kernel image with debug symbols)
The array of struct tlb_context is used to keep track of ASIDs and it holds TLB_NR_DYN_ASIDS (6) entries.

Related

NtQueryObject returns wrong insufficient required size via WOW64, why?

I am using the NT native API NtQueryObject()/ZwQueryObject() from user mode (and I am aware of the risks in general and I have written kernel mode drivers for Windows in the past in my professional capacity).
Generally when one uses the typical "query information" function (of which there are a few) the protocol is first to ask with a too small buffer to retrieve the required size with STATUS_INFO_LENGTH_MISMATCH, then allocate a buffer of said size and query again -- this time using the buffer and previously returned size.
In order to get the list of object types (67 on my build) on the system I am doing just that:
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
And in Size I get 8280 (WOW64) and 8968 (x64). I then proceed to allocate the buffer with calloc() and query again:
ULONG Size2 = 0;
BYTE* Buf = (BYTE*)::calloc(1, Size);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Size, &Size2);
NB: ObjectTypesInformation is 3. It isn't declared in winternl.h, but Nebbett (as ObjectAllTypesInformation) and others describe it. Since I am not querying for a particular object's traits but the system-wide list of object types, I pass NULL for the object handle.
Curiously on WOW64, i.e. 32-bit, the value in Size2 upon return from the second query is 16 Bytes (= 8296) bigger than the previously returned required size.
As far as alignment is concerned, I'd expect at most 8 Bytes for this sort of thing and indeed neither 8280 nor 8296 are at a 16 Byte alignment boundary, but on an 8 Byte one.
Certainly I can add some slack space on top of the returned required size (e.g. ALIGN_UP to the next 32 Byte alignment boundary), but this seems highly irregular to be honest. And I'd rather want to understand what's going on than to implement a workaround that breaks, because I miss something crucial.
The practical issue for the code is that in Debug configurations it tells me there's a corrupted heap somewhere, upon freeing Buf. Which suggests that NtQueryObject() was indeed writing these extra 16 Bytes beyond the buffer I provided.
Question: Any idea why it is doing that?
As usual for NT native API the sources of information are scarce. The x64 version of the exact same code returns the exact number of bytes required. So my thinking here is that WOW64 is the issue. A somewhat cursory look into wow64.dll with IDA didn't reveal any immediate points for suspicion regarding what goes wrong in translating the results to 32-bit here.
PS: Windows 10 (10.0.19043, ntdll.dll "timestamp" 77755782)
PPS: this may be related: https://wj32.org/wp/2012/11/30/obquerytypeinfo-and-ntqueryobject-buffer-overrun-in-windows-8/ Tested it, by checking that OBJECT_TYPE_INFORMATION::TypeName.Length + sizeof(WCHAR) == OBJECT_TYPE_INFORMATION::TypeName.MaximumLength in all returned items, which was the case.
The only part of ObjectTypesInformation that's public is the first field defined in winternl.h header in the Windows SDK:
typedef struct __PUBLIC_OBJECT_TYPE_INFORMATION {
UNICODE_STRING TypeName;
ULONG Reserved [22]; // reserved for internal use
} PUBLIC_OBJECT_TYPE_INFORMATION, *PPUBLIC_OBJECT_TYPE_INFORMATION;
For x86 this is 96 bytes, and for x64 this is 104 bytes (assuming you have the right packing mode enabled). The difference is the pointer in UNICODE_STRING which changes the alignment in x64.
Any additional memory space should be related to the TypeName buffer.
UNICODE_STRING accounts for 8 bytes of the difference between 8280 and 8296. The function uses the sizeof(ULONG_PTR) for alignment of the returned string plus an extra WCHAR, so that could easily account for the remaining 8 bytes.
AFAIK: The public use of NtQueryObject is supposed to be limited to kernel-mode use which of course means it always matches the OS native bitness (x86 code can't run as kernel in x64 native OS), so it's probably just a quirk of using the NT functions via the WOW64 thunk.
Alright, I think I figured out the issue with the help of WinDbg and a thorough look at wow64.dll using IDA.
NB: the wow64.dll I have has the same build number, but differs slightly in data only (checksum, security directory entry, pieces from version resources). The code is identical, which was to be expected, given deterministic builds and how they affect the PE timestamp.
There's an internal function called whNtQueryObject_SpecialQueryCase (according to PDBs), which covers the ObjectTypesInformation class queries.
For the above wow64.dll I used the following points of interest in WinDbg, from a 32 bit program which calls NtQueryObject(NULL, ObjectTypesInformation, ...) (the program itself is irrelevant, though):
0:000> .load wow64exts
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B0E0
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B14E
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B1A7
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B24A
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B252
Explanation of the above points of interest:
+B0E0: computing length required for 64 bit query, based on passed length for 32 bit
+B14E: call to NtQueryObject()
+B1A7: loop body for copying 64 to 32 bit buffer contents, after successful NtQueryObject() call
+B24A: computing written length by subtracting current (last + 1) entry from base buffer address
+B252: downsizing returned (64 bit) required length to 32 bit
The logic of this function in regards to just ObjectTypesInformation is roughly as follows:
Common steps
Take the ObjectInformationLength (32 bit query!) argument and size it up to fit the 64 bit info
Align the retrieved size up to the next 16 byte boundary
If necessary allocate the resulting amount from some PEB::ProcessHeap and store in TLS slot 3; otherwise using this as a scratch space
Call NtQueryObject() passing the buffer and length from the two previous steps
The length passed to NtQueryObject() is the one from step 1, not the one aligned to a 16 byte boundary. There seems to be some sort of header to this scratch space, so perhaps that's where the 16 byte alignment comes from?
Case 1: buffer size too small (here: 4), just querying required length
The up-sized length in this case equals 4, which is too small and consequently NtQueryObject() returns STATUS_INFO_LENGTH_MISMATCH. Required size is reported as 8968.
Down-size from the 64 bit required length to 32 bit and end up 16 bytes too short
Return the status from NtQueryObject() and the down-sized required length form the previous step
Case 2: buffer size supposedly (!) sufficient
Copy OBJECT_TYPES_INFORMATION::NumberOfTypes from queried buffer to 32 bit one
Step to the first entry (OBJECT_TYPE_INFORMATION) of source (64 bit) and target (32 bit) buffer, 8 and 4 byte aligned respectively
For for each entry up to OBJECT_TYPES_INFORMATION::NumberOfTypes:
Copy UNICODE_STRING::Length and UNICODE_STRING::MaximumLength for TypeName member
memcpy() UNICODE_STRING::Length bytes from source to target UNICODE_STRING::Buffer (target entry + sizeof(OBJECT_TYPE_INFORMATION32)
Add terminating zero (WCHAR) past the memcpy'd string
Copy the individual members past the TypeName from 64 to 32 bit struct
Compute pointer of next entry by aligning UNICODE_STRING::MaximumLength up to an 8 byte boundary (i.e. the ULONG_PTR alignment mentioned in the other answer) + sizeof(OBJECT_TYPE_INFORMATION64) (already 8 byte aligned!)
The next target entry (32 bit) gets 4 byte aligned instead
At the end compute required (32 bit) length by subtracting the value we arrived at for the "next" entry (i.e. one past the last) from the base address of the buffer passed by the WOW64 program (32 bit) to NtQueryObject()
In my debugged scenario these were: 0x008ce050 - 0x008cbfe8 = 0x00002068 (= 8296), which is 16 bytes larger than the buffer length we were told during case 1 (8280)!
The issue
That crucial last step differs between merely querying and actually getting the buffer filled. There is no further bounds checking in that loop I described for case 2.
And this means it will just overrun the passed buffer and return a written length bigger than the buffer length passed to it.
Possible solutions and workarounds
I'll have to approach this mathematically after some sleep, the workaround is obviously to top up the required length returned from case 1 in order to avoid the buffer overrun. The easiest method is to use my up_size_from_32bit() from the example below and use that on the returned required size. This way you are allocating enough for the 64 bit buffer, while querying the 32 bit one. This should never overrun during the copy loop.
However, the fix in wow64.dll is a little more involved, I guess. While adding bounds checking to the loop would help avert the overrun, it would mean that the caller would have to query for the required size twice, because the first time around it lies to us.
Which means the query-only case (1) would have to allocate that internal buffer after querying the required length for 64 bit, then get it filled and then walk the entries (just like the copy loop), skipping over the last entry to compute the required length the same as it is now done after the copy loop.
Example program demonstrating the "static" computation by wow64.dll
Build for x64, just the way wow64.dll was!
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <cstdio>
typedef struct
{
ULONG JustPretending[24];
} OBJECT_TYPE_INFORMATION32;
typedef struct
{
ULONG JustPretending[26];
} OBJECT_TYPE_INFORMATION64;
constexpr ULONG size_delta_3264 = sizeof(OBJECT_TYPE_INFORMATION64) - sizeof(OBJECT_TYPE_INFORMATION32);
constexpr ULONG down_size_to_32bit(ULONG len)
{
return len - size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION64));
}
constexpr ULONG up_size_from_32bit(ULONG len)
{
return len + size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION32));
}
// Trying to mimic the wdm.h macro
constexpr size_t align_up_by(size_t address, size_t alignment)
{
return (address + (alignment - 1)) & ~(alignment - 1);
}
constexpr auto u32 = 8280UL;
constexpr auto u64 = 8968UL;
constexpr auto from_64 = down_size_to_32bit(u64);
constexpr auto from_32 = up_size_from_32bit(u32);
constexpr auto from_32_16_byte_aligned = (ULONG)align_up_by(from_32, 16);
int wmain()
{
wprintf(L"32 to 64 bit: %u -> %u -(16-byte-align)-> %u\n", u32, from_32, from_32_16_byte_aligned);
wprintf(L"64 to 32 bit: %u -> %u\n", u64, from_64);
return 0;
}
static_assert(sizeof(OBJECT_TYPE_INFORMATION32) == 96, "Size for 64 bit struct does not match.");
static_assert(sizeof(OBJECT_TYPE_INFORMATION64) == 104, "Size for 64 bit struct does not match.");
static_assert(u32 == from_64, "Must match (from 64 to 32 bit)");
static_assert(u64 == from_32, "Must match (from 32 to 64 bit)");
static_assert(from_32_16_byte_aligned % 16 == 0, "16 byte alignment failed");
static_assert(from_32_16_byte_aligned > from_32, "We're aligning up");
This does not mimic the computation that happens in case 2, though.

Where do we define the type of structure returned by the kernel when using the "perf_event_open" system call using mmap?

I'm trying to use the syscall perf_event_open to get some performance data from the system.
I am currently working on periodic data retrieval using shared memory with a ring buffer.
But I can't find what structure is returned in each section of the ring buffer. The manual page enumerate all possibilities, but that's all.
I can't figure out which member of the perf_event_attr structure to fill in to control what type of structure will be returned to the ring buffer.
If you have some informations about that, I'll be happy to read it !
The https://github.com/torvalds/linux/blob/master/tools/perf/design.txt documentation has description of mmaped ring, and perf script / perf script -D can decode ring data when it is saved as perf.data file. Some parts of the doc are outdated, but it is still useful for perf_event_open syscall description.
First mmap page is metadata page, rest 2^n pages are filled with events where every event has header struct perf_event_header of 8 bytes.
Like stated, asynchronous events, like counter overflow or PROT_EXEC
mmap tracking are logged into a ring-buffer. This ring-buffer is
created and accessed through mmap().
The mmap size should be 1+2^n pages, where the first page is a
meta-data page (struct perf_event_mmap_page) that contains various
bits of information such as where the ring-buffer head is.
/*
* Structure of the page that can be mapped via mmap
*/
struct perf_event_mmap_page {
__u32 version; /* version number of this structure */
__u32 compat_version; /* lowest version this is compat with */
...
}
The following 2^n pages are the ring-buffer which contains events of the form:
#define PERF_RECORD_MISC_KERNEL (1 << 0)
#define PERF_RECORD_MISC_USER (1 << 1)
#define PERF_RECORD_MISC_OVERFLOW (1 << 2)
struct perf_event_header {
__u32 type;
__u16 misc;
__u16 size;
};
enum perf_event_type
The design.txt doc has incorrect values for enum perf_event_type, check actual perf_events kernel subsystem source codes - https://github.com/torvalds/linux/blob/master/include/uapi/linux/perf_event.h#L707. That uapi/linux/perf_event.h file also has some struct hints in comments, like
* #
* # The RAW record below is opaque data wrt the ABI
* #
* # That is, the ABI doesn't make any promises wrt to
* # the stability of its content, it may vary depending
* # on event, hardware, kernel version and phase of
* # the moon.
* #
* # In other words, PERF_SAMPLE_RAW contents are not an ABI.
* #
*
* { u32 size;
* char data[size];}&& PERF_SAMPLE_RAW
*...
* { u64 size;
* char data[size];
* u64 dyn_size; } && PERF_SAMPLE_STACK_USER
*...
PERF_RECORD_SAMPLE = 9,

Different memory allocations in linux and windows?

I have a tree (T*Tree: binary tree with many elements in the node) implemented in C++.
I want to insert around 5,000,000 integer values in it (let's say from 1 till 5,000,000). The tree size should be around 8 * 5,000,000 byte or 41MB in memory (according to my implementation which is reasonable).
When I display the size of the tree(in my program by calculating the size of every node), it is 41MB as normal. However when I checked in Windows 32bit>>"Task Manager" I found the memory taken is 732MB!!
I checked that there is no extra malloc in my code. Even after I freed the tree by traversing from node to node and deleting them(and the keys inside) the size in "Task Manager" becomes 513MB only!!
After that I compiled same code in Linux Ubuntu 32bit(virtual machine on another PC) and ran the program. Again tree size does not change in my program i.e. 41MB as normal, but in "System Monitor" memory is 230MB and when freeing the tree nodes in my program the memory in "System Monitor" remains same 230MB.
And in both Windows and Linux if I freed & reinitialized the tree and insert again 5,000,0000 integer values, the memory is increased by double like if the previous space is not freed and used somewhere (which I am not able to find where).
The question:
1) why are those huge memory differences in Windows and Linux although the code & input data is same?
2) why freeing the Tree nodes doesn't reduce the memory to some reasonable value like 10MB.
code: https://drive.google.com/open?id=0ByKaCojxzNa9dEt6cEJNeDI4eXc
below are some snippets:
typedef struct Keylist {
unsigned int k;
struct Keylist *next_ptr;
};
typedef struct Keylist Keylist;
typedef struct TstarTreeNode {
//Binary Node specific
struct TstarTreeNode *left;
struct TstarTreeNode *right;
//Bool rightVisitedDuringInsert;
//AVL Node specific
int height;
//T Node specific
int length; //length of keys array for easy locating
struct Keylist *keys; //later you deal with it like one dimentional array
int max; //max key
int min; //min key
//T* Node specific
struct TstarTreeNode *successor;
};
typedef struct TstarTreeNode TstarTreeNode;
/*****************************************************************************
* *
* Define a structure for binary trees. *
* *
*****************************************************************************/
typedef struct TstarTree {
int size; //number of element(not number of nodes) in a tree
int MinCount; //Min Count of elements in a Node
int MaxCount; //Max Count of elements in a Node
TstarTreeNode *root;
//Provide functions for comarison elements and destroying elements
int (*compare)(int key1, int key2); //// -1 smaller, 0 equal, 1 bigger
int (*inRange)(int key, int min, int max); // -1 smaller, 0 in range, 1 bigger
} ;
typedef struct TstarTree TstarTree;
Insert function of the tree uses dynamic allocation i.e. malloc.
Update
according to what "John Zwinck" pointed out (thanks John), I have two things now:
1) The huge memory taken in Windows was because of the compiling options in Visual Studio, which I think enabled debugging and a lot of extra things. When I compiled in Windows using Cygwin without that options i.e. "gcc main.c tstarTree.c -o main" I got same result as in Linux. The size now in Windows>>"Task Manager" becomes 230MB
2) If OS is 64bit then let's see how the size is calculated (as John said and as I modified):
5 million unsigned int k. 20 MB.
5 million 4-byte pads (after k to align next_ptr). 20 MB.
5 million 8-byte next_ptr. 40 MB.
5 million times the overhead of malloc(). I think for 64bit OS it is 32 bytes each (according to John provided link). so 160 MB.
N TstarTreeNodes, each of which is 48 bytes in the full code.
N times the overhead of malloc() (I think, 32 bytes each).
N is the number of nodes. I have a resulting balanced complete tree of height 16 so I assume the number of nodes are 2^17-1. so the last two items become 6.2MB(i.e. 2^17 * 48) + 4.1MB(i.e. 2^17 * 32) =10MB
So the total is: 20+20+40+160+10= 250MB which is somehow reasonable and close to 230MB.
However I have Windows/Linux 32bit it will be (I think):
5 million unsigned int k. 20 MB.
5 million 4-byte next_ptr. 20 MB.
5 million times the overhead of malloc(). I think for 32bit OS it is 16 bytes each. so 80 MB.
N TstarTreeNodes, each of which is 32 bytes in the full code.
N times the overhead of malloc() (I think, 16 bytes each).
N is the number of nodes. I have a resulting balanced complete tree of height 16 so I assume the number of nodes are 2^17-1. so the last two items become 4.1MB(i.e. 2^17 * 32) + 2MB(i.e. 2^17 * 16) =6MB
So the total is: 20+20+80+6= 126MB it is a little far from 230MB which I get in "Task Manager" (if you know why please tell me?)
Currently the remaining important question is, why isn't the tree freed from memory when I am freeing all the nodes and keys in the tree using this code:
void freekeys(struct Keylist ** keys){
if ((*keys) == NULL)
{
return;
}
freekeys(&(*keys)->next_ptr);
(*keys)->next_ptr = NULL;
free((*keys));
(*keys) = NULL;
}
void freeTree(struct TstarTreeNode ** tree){
if ((*tree) == NULL)
{
return;
}
freeTree(&(*tree)->left);
freeTree(&(*tree)->right);
freekeys(&(*tree)->keys);
(*tree)->keys = NULL;
(*tree)->left = NULL;
(*tree)->right = NULL;
(*tree)->successor = NULL;
free((*tree));
(*tree) = NULL;
}
and in main():
TstarTree * tree;
...
freeTree(&tree->root);
free(tree);
Note:
The tree is working perfectly (insert, update, delete, lookup, display...) but when trying to free the tree from memory nothing changed in its size
You say your data takes:
8 * 5,000,000 byte or 41MB in memory
But that is not correct. Looking at your code there are two main structures:
struct Keylist {
unsigned int k;
Keylist *next_ptr;
};
struct TstarTreeNode {
TstarTreeNode *left, *right;
Keylist *keys;
TstarTreeNode *successor;
};
Let's say we have 5 million integers to store, as in your example. What will we need?
5 million unsigned int k. 20 MB.
5 million 4-byte pads (after k to align next_ptr). 20 MB.
5 million 8-byte next_ptr. 40 MB.
5 million times the overhead of malloc(). Likely 16 bytes each. 80 MB.
N TstarTreeNodes, each of which is 48 bytes in the full code.
N times the overhead of malloc() (again, 16 bytes each).
If N is 500,000 (for example, I don't know the real value but you do), those last two items add up to 32 MB. That brings the total to at least 192 MB as a bare minimum. Therefore, seeing 230 MB of memory usage in Linux is not surprising.
Some systems, especially when optimization is not fully enabled at build time, will add more bookkeeping and debugging information to each block allocated with malloc(). Are you building with optimization fully enabled?
One way you can save a lot of overhead is to stop using Keylist and just store the integers in plain arrays (created with malloc(), but only one per TstarTreeNode).

Understanding /proc/sys/vm/lowmem_reserve_ratio

I am not able to understand the meaning of the variable "lowmem_reserve_ratio" by reading the explanation from Documentation/sysctl/vm.txt.
I have also tried to google it but all the explanations found are exactly similar as present in vm.txt.
It will be really helpful if sb explains it or mention some link about it.
Here goes the original explanation:-
The lowmem_reserve_ratio is an array. You can see them by reading this file.
-
% cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32
-
Note: # of this elements is one fewer than number of zones. Because the highest
zone's value is not necessary for following calculation.
But, these values are not used directly. The kernel calculates # of protection
pages for each zones from them. These are shown as array of protection pages
in /proc/zoneinfo like followings. (This is an example of x86-64 box).
Each zone has an array of protection pages like this.
-
Node 0, zone DMA
pages free 1355
min 3
low 3
high 4
:
:
numa_other 0
protection: (0, 2004, 2004, 2004)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pagesets
cpu: 0 pcp: 0
:
-
These protections are added to score to judge whether this zone should be used
for page allocation or should be reclaimed.
In this example, if normal pages (index=2) are required to this DMA zone and
watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
not be used because pages_free(1355) is smaller than watermark + protection[2]
(4 + 2004 = 2008). If this protection value is 0, this zone would be used for
normal page requirement. If requirement is DMA zone(index=0), protection[0]
(=0) is used.
zone[i]'s protection[j] is calculated by following expression.
(i < j):
zone[i]->protection[j]
= (total sums of present_pages from zone[i+1] to zone[j] on the node)
/ lowmem_reserve_ratio[i];
(i = j):
(should not be protected. = 0;
(i > j):
(not necessary, but looks 0)
The default values of lowmem_reserve_ratio[i] are
256 (if zone[i] means DMA or DMA32 zone)
32 (others).
As above expression, they are reciprocal number of ratio.
256 means 1/256. # of protection pages becomes about "0.39%" of total present
pages of higher zones on the node.
If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%).
having the same problem as you, I googled (a lot) and stumbled apon this page which might (or might not) be more understandable than the kernel doc.
(I do not quote here because it will be unreadable)
I found the wording in that document really confusing too. Looking at the source in mm/page_alloc.c helped to clear it up, so let me try my hand at a more straightforward explanation:
As is said in the page you quoted, these numbers "are reciprocal number of ratio". Worded differently: these numbers are divisors. So when calculating the reserve pages for a given zone in a node, you take the sum of pages in that node in zones higher than that one, divide it by the provided divisor, and that's how many pages you're reserving for that zone.
Example: let's assume a 1 GiB node with 768 MiB in zone Normal and 256 MiB in zone HighMem (assume no zone DMA). Let's assume the default highmem reserve "ratio" (divisor) of 32. And let's assume the typical 4 KiB page size. Now we can calculate the reserve area for zone Normal:
Sum of "higher" zones than zone Normal (just HighMem): 256 MiB = (1024 KiB / 1 MiB) * (1 page / 4 KiB) = 65536 pages
Area reserved in zone Normal for this node: 65536 pages / 32 = 2048 pages = 8 MiB.
The concept stays the same when you add more zones and nodes. Just remember that the reserved size is in pages---you never reserve a fraction of a page.
I find the kernel source code that explain very well and clear.
/*
* setup_per_zone_lowmem_reserve - called whenever
* sysctl_lowmem_reserve_ratio changes. Ensures that each zone
* has a correct pages reserved value, so an adequate number of
* pages are left in the zone after a successful __alloc_pages().
*/
static void setup_per_zone_lowmem_reserve(void)
{
struct pglist_data *pgdat;
enum zone_type j, idx;
for_each_online_pgdat(pgdat) {
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long managed_pages = zone->managed_pages;
zone->lowmem_reserve[j] = 0;
idx = j;
while (idx) {
struct zone *lower_zone;
idx--;
if (sysctl_lowmem_reserve_ratio[idx] < 1)
sysctl_lowmem_reserve_ratio[idx] = 1;
lower_zone = pgdat->node_zones + idx;
lower_zone->lowmem_reserve[j] = managed_pages /
sysctl_lowmem_reserve_ratio[idx];
managed_pages += lower_zone->managed_pages;
}
}
}
/* update totalreserve_pages */
calculate_totalreserve_pages();
}
And here even list an demo.
/*
* results with 256, 32 in the lowmem_reserve sysctl:
* 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
* 1G machine -> (16M dma, 784M normal, 224M high)
* NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
* HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
* HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA
*
* TBD: should special case ZONE_DMA32 machines here - in those we normally
* don't need any ZONE_NORMAL reservation
*/
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
#ifdef CONFIG_ZONE_DMA
256,
#endif
#ifdef CONFIG_ZONE_DMA32
256,
#endif
#ifdef CONFIG_HIGHMEM
32,
#endif
32,
};
In a word, the expression looks like,
zone[1]->lowmem_reserve[2] = zone[2]->managed_pages / sysctl_lowmem_reserve_ratio[1]
zone[0]->lowmem_reserve[2] = (zone[1] + zone[2])->managed_pages / sysctl_lowmem_reserve_ratio[0]

In Perl module Proc::ProccessTable, why does pctcpu sometimes return 'inf', 'nan', or a value greater than 100?

The Perl module Proc::ProcessTable occasionally observes that the pctcpu attribute as 'inf', 'nan', or a value greater then 100. Why does it do this? And are there any guidelines on how to deal with this kind of information?
We have observed this on various platforms including Linux 2.4 running on 8 logical processors.
I would guess that 'inf' or 'nan' is the result of some impossibly large value or a divide by zero.
For values greater then 100, could this possibly mean that more then one processor was used?
And for dealing with this information, is the best practice merely marking the data point as untrustworthy and normalizing to 100%?
I do not know why that happens and I cannot stress test the module right now trying to generate such cases.
However, a principle I have followed all my research is not to replace data I know to be non-sense with something that looks reasonable. You basically have missing observations and you should treat them as such. I would not attach a numerical value at all so as not to pretend I have information when I in fact do not.
Then, your statistics for the non-missing points will be meaningful and you can look at any patterns in the missing observations separately.
UPDATE: Looking at the calc_prec() function in the source code:
/* calc_prec()
*
* calculate the two cpu/memory precentage values
*/
static void calc_prec(char *format_str, struct procstat *prs, struct obstack *mem_pool)
{
float pctcpu = 100.0f * (prs->utime / 1e6) / (time(NULL) - prs->start_time);
/* calculate pctcpu - NOTE: This assumes the cpu time is in microsecond units! */
sprintf(prs->pctcpu, "%3.2f", pctcpu);
field_enable(format_str, F_PCTCPU);
/* calculate pctmem */
if (system_memory > 0) {
sprintf(prs->pctmem, "%3.2f", (float) prs->rss / system_memory * 100.f);
field_enable(format_str, F_PCTMEM);
}
}
First, IMHO, it would be better to just divide by 1e4 rather than multiplying by 100.0f after the division. Second, it is possible (if polled immediately after process start) for the time delta to be 0. Third, I would have just done the whole thing in double.
As an aside, this function looks like a good example of why you should not have comments in code.
#include <stdio.h>
#include <time.h>
volatile float calc_percent(
unsigned long utime,
time_t now,
time_t start
) {
return 100.0f * ( utime / 1e6) / (now - start);
}
int main(void) {
printf("%3.2f\n", calc_percent(1e6, time(NULL), time(NULL)));
printf("%3.2f\n", calc_percent(0, time(NULL), time(NULL)));
return 0;
}
This outputs inf in the first case and nan in the second case when compiled with Cygwin gcc-4 on Windows. I do not know if this behavior is standard or just what happens with this particular combination of OS+compiler.

Resources