BUILD_BUG_ON(dt_virt_base % SZ_2M) gives me error when NR_CPUS is reduced to 2 - linux

When I set NR_CPUS to 2, I get this error while building linux-5.10.0rc (when NR_CPUS is 4, it does not).
arch/arm64/mm/mmu.c: In function 'fixmap_remap_fdt':
././include/linux/compiler_types.h:315:38: error: call to
'__compiletime_assert_404' declared with attribute error: BUILD_BUG_ON
failed: dt_virt_base % SZ_2M 315 | _compiletime_assert(condition,
msg, _compiletime_assert, COUNTER)
| ^
The function fixmap_remap_fdt looks like this and the line BUILD_BUG_ON(dt_virt_base % SZ_2M); seems to generate error.
void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
{
const u64 dt_virt_base = __fix_to_virt(FIX_FDT);
int offset;
void *dt_virt;
/*
* skip some comments
*/
BUILD_BUG_ON(MIN_FDT_ALIGN < 8);
if (!dt_phys || dt_phys % MIN_FDT_ALIGN)
return NULL;
/*
* skip (some comments)
*/
BUILD_BUG_ON(dt_virt_base % SZ_2M); // <=== line generating error
BUILD_BUG_ON(__fix_to_virt(FIX_FDT_END) >> SWAPPER_TABLE_SHIFT !=
__fix_to_virt(FIX_BTMAP_BEGIN) >> SWAPPER_TABLE_SHIFT);
I know this BUILD_BUG_ON is supposed to generate error when some condition is true during build time and this is possible because some conditions can be known at build time. But in this case, this dt_virt_base is some virtual address of fixmap (predefined virtual addresses for special purposes in kernel) and it is aligned to SZ_2M. See the definition in arch/arm64/include/asm/fixmap.h.
enum fixed_addresses {
FIX_HOLE,
/*
* Reserve a virtual window for the FDT that is 2 MB larger than the
* maximum supported size, and put it at the top of the fixmap region.
* The additional space ensures that any FDT that does not exceed
* MAX_FDT_SIZE can be mapped regardless of whether it crosses any
* 2 MB alignment boundaries.
*
* Keep this at the top so it remains 2 MB aligned.
*/
#define FIX_FDT_SIZE (MAX_FDT_SIZE + SZ_2M)
FIX_FDT_END,
FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
FIX_EARLYCON_MEM_BASE,
FIX_TEXT_POKE0,
And this indexs (enum type) means number of pages from the fixmap top virtual address. So if we have this index, we can get the virtual address. __fix_to_virt is defined in include/asm-generic/fixmap.h as below.
#define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT))
So, when I reduce NR_CPUS from 4 to 2, I get this build time compile error. But I can't understand why this is giving me the error. Could anyone tell me what is causing the error?

Related

Where do we define the type of structure returned by the kernel when using the "perf_event_open" system call using mmap?

I'm trying to use the syscall perf_event_open to get some performance data from the system.
I am currently working on periodic data retrieval using shared memory with a ring buffer.
But I can't find what structure is returned in each section of the ring buffer. The manual page enumerate all possibilities, but that's all.
I can't figure out which member of the perf_event_attr structure to fill in to control what type of structure will be returned to the ring buffer.
If you have some informations about that, I'll be happy to read it !
The https://github.com/torvalds/linux/blob/master/tools/perf/design.txt documentation has description of mmaped ring, and perf script / perf script -D can decode ring data when it is saved as perf.data file. Some parts of the doc are outdated, but it is still useful for perf_event_open syscall description.
First mmap page is metadata page, rest 2^n pages are filled with events where every event has header struct perf_event_header of 8 bytes.
Like stated, asynchronous events, like counter overflow or PROT_EXEC
mmap tracking are logged into a ring-buffer. This ring-buffer is
created and accessed through mmap().
The mmap size should be 1+2^n pages, where the first page is a
meta-data page (struct perf_event_mmap_page) that contains various
bits of information such as where the ring-buffer head is.
/*
* Structure of the page that can be mapped via mmap
*/
struct perf_event_mmap_page {
__u32 version; /* version number of this structure */
__u32 compat_version; /* lowest version this is compat with */
...
}
The following 2^n pages are the ring-buffer which contains events of the form:
#define PERF_RECORD_MISC_KERNEL (1 << 0)
#define PERF_RECORD_MISC_USER (1 << 1)
#define PERF_RECORD_MISC_OVERFLOW (1 << 2)
struct perf_event_header {
__u32 type;
__u16 misc;
__u16 size;
};
enum perf_event_type
The design.txt doc has incorrect values for enum perf_event_type, check actual perf_events kernel subsystem source codes - https://github.com/torvalds/linux/blob/master/include/uapi/linux/perf_event.h#L707. That uapi/linux/perf_event.h file also has some struct hints in comments, like
* #
* # The RAW record below is opaque data wrt the ABI
* #
* # That is, the ABI doesn't make any promises wrt to
* # the stability of its content, it may vary depending
* # on event, hardware, kernel version and phase of
* # the moon.
* #
* # In other words, PERF_SAMPLE_RAW contents are not an ABI.
* #
*
* { u32 size;
* char data[size];}&& PERF_SAMPLE_RAW
*...
* { u64 size;
* char data[size];
* u64 dyn_size; } && PERF_SAMPLE_STACK_USER
*...
PERF_RECORD_SAMPLE = 9,

Is it possible to write to a non 4-bytes aligned address with HLSL compute shader?

I am trying to convert an existing OpenCL kernel to an HLSL compute shader.
The OpenCL kernel samples each pixel in an RGBA texture and writes each color channel to a tighly packed array.
So basically, I need to write to a tightly packed uchar array in a pattern that goes somewhat like this:
r r r ... r g g g ... g b b b ... b a a a ... a
where each letter stands for a single byte (red / green / blue / alpha) that originates from a pixel channel.
going through the documentation for RWByteAddressBuffer Store method, it clearly states:
void Store(
in uint address,
in uint value
);
address [in]
Type: uint
The input address in bytes, which must be a multiple of 4.
In order to write the correct pattern to the buffer, I must be able to write a single byte to a non aligned address. In OpenCL / CUDA this is pretty trivial.
Is it technically possible to achieve that with HLSL?
Is this a known limitation? possible workarounds?
As far as I know it is not possible to write directly to a non aligned address in this scenario. You can, however, use a little trick to achieve what you want. Below you can see the code of the entire compute shader which does exactly what you want. The function StoreValueAtByte in particular is what you are looking for.
Texture2D<float4> Input;
RWByteAddressBuffer Output;
void StoreValueAtByte(in uint index_of_byte, in uint value) {
// Calculate the address of the 4-byte-slot in which index_of_byte resides
uint addr_align4 = floor(float(index_of_byte) / 4.0f) * 4;
// Calculate which byte within the 4-byte-slot it is
uint location = index_of_byte % 4;
// Shift bits to their proper location within its 4-byte-slot
value = value << ((3 - location) * 8);
// Write value to buffer
Output.InterlockedOr(addr_align4, value);
}
[numthreads(20, 20, 1)]
void CSMAIN(uint3 ID : SV_DispatchThreadID) {
// Get width and height of texture
uint tex_width, tex_height;
Input.GetDimensions(tex_width, tex_height);
// Make sure thread does not operate outside the texture
if(tex_width > ID.x && tex_height > ID.y) {
uint num_pixels = tex_width * tex_height;
// Calculate address of where to write color channel data of pixel
uint addr_red = 0 * num_pixels + ID.y * tex_width + ID.x;
uint addr_green = 1 * num_pixels + ID.y * tex_width + ID.x;
uint addr_blue = 2 * num_pixels + ID.y * tex_width + ID.x;
uint addr_alpha = 3 * num_pixels + ID.y * tex_width + ID.x;
// Get color of pixel and convert from [0,1] to [0,255]
float4 color = Input[ID.xy];
uint4 color_final = uint4(round(color.x * 255), round(color.y * 255), round(color.z * 255), round(color.w * 255));
// Store color channel values in output buffer
StoreValueAtByte(addr_red, color_final.x);
StoreValueAtByte(addr_green, color_final.y);
StoreValueAtByte(addr_blue, color_final.z);
StoreValueAtByte(addr_alpha, color_final.w);
}
}
I hope the code is self explanatory since it is hard to explain, but I'll try anyway.
The fist thing the function StoreValueAtByte does is to calculate the address of the 4-byte-slot enclosing the byte you want to write to. After that the position of the byte inside the 4-byte-slot is calculated (is it the fist, second, third or the fourth byte in the slot). Since the byte you want to write is already inside an 4-byte variable (namely value) and occupies the rightmost byte, you then just have to shift the byte to its proper position inside the 4-byte variable. After that you just have to write the variable value to the buffer at the 4-byte-aligned address. This is done using bitwise OR because multiple threads write to the same address interfering each other leading to write-after-write-hazards. This of course only works if you initialize the entire output buffer with zeros before issuing the dispatch-call.

How to identify read or write operations of page fault when using sigaction handler on SIGSEGV?(LINUX)

I use sigaction to handle page fault exception, and the handler function is defind like this:
void sigaction_handler(int signum, siginfo_t *info, void *_context)
So it's easy to get page fault address by reading info->si_addr.
The question is, how to know whether this operation is memory READ or WRITE ?
I found the type of _context parameter is ucontext_t defined in /usr/include/sys/ucontext.h
There is a cr2 field defined in mcontext_t, but unforunately, it is only avaliable when x86_64 is not defind, thus I could not used cr2 to identify read/write operations.
On anotherway, there is a struct named sigcontext defined in /usr/include/bits/sigcontext.h
This struct contains cr2 field. But I don't know where to get it.
You can check this in x86_64 by referring to the ucontext's mcontext struct and the err register:
void pf_sighandler(int sig, siginfo_t *info, ucontext_t *ctx) {
...
if (ctx->uc_mcontext.gregs[REG_ERR] & 0x2) {
// Write fault
} else {
// Read fault
}
...
}
Here is the generation of SIGSEGV from the kernel arch/x86/mm/fault.c, __bad_area_nosemaphore() function:
http://lxr.missinglinkelectronics.com/linux+v3.12/arch/x86/mm/fault.c#L760
760 tsk->thread.cr2 = address;
761 tsk->thread.error_code = error_code;
762 tsk->thread.trap_nr = X86_TRAP_PF;
763
764 force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);
There is error_code field, and it values are defined in arch/x86/mm/fault.c too:
http://lxr.missinglinkelectronics.com/linux+v3.12/arch/x86/mm/fault.c#L23
23/*
24 * Page fault error code bits:
25 *
26 * bit 0 == 0: no page found 1: protection fault
27 * bit 1 == 0: read access 1: write access
28 * bit 2 == 0: kernel-mode access 1: user-mode access
29 * bit 3 == 1: use of reserved bit detected
30 * bit 4 == 1: fault was an instruction fetch
31 */
32enum x86_pf_error_code {
33
34 PF_PROT = 1 << 0,
35 PF_WRITE = 1 << 1,
36 PF_USER = 1 << 2,
37 PF_RSVD = 1 << 3,
38 PF_INSTR = 1 << 4,
39};
So, exact information about access type is stored in the thread_struct.error_code: http://lxr.missinglinkelectronics.com/linux+v3.12/arch/x86/include/asm/processor.h#L470
The error_code field is not exported into siginfo_t struct as I see (it is defined in
http://man7.org/linux/man-pages/man2/sigaction.2.html .. search for si_signo).
So you can
Hack the kernel to export tsk->thread.error_code (or check, is it exported already or not, for example in ptrace)
Get the memory address, read /proc/self/maps, parse them and check access bits on the page. If the page is present and read-only, the only possible fault is from writing, if page is not present both kinds of access are possible, and if... there should be no write-only pages.
Also you can try to find the address of failed instruction, read it and disassemble.
The error_code information can be accessed through:
err = ((ucontext_t*)context)->uc_mcontext.gregs[REG_ERR]
It is passed by the hardware on the stack, which is then passed to the signal handler by the kernel, since the kernel passes the entire `frame'. Then
bool write_fault = !(err & 0x2);
will be true if the access was a write access, and false otherwise.

Understanding /proc/sys/vm/lowmem_reserve_ratio

I am not able to understand the meaning of the variable "lowmem_reserve_ratio" by reading the explanation from Documentation/sysctl/vm.txt.
I have also tried to google it but all the explanations found are exactly similar as present in vm.txt.
It will be really helpful if sb explains it or mention some link about it.
Here goes the original explanation:-
The lowmem_reserve_ratio is an array. You can see them by reading this file.
-
% cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32
-
Note: # of this elements is one fewer than number of zones. Because the highest
zone's value is not necessary for following calculation.
But, these values are not used directly. The kernel calculates # of protection
pages for each zones from them. These are shown as array of protection pages
in /proc/zoneinfo like followings. (This is an example of x86-64 box).
Each zone has an array of protection pages like this.
-
Node 0, zone DMA
pages free 1355
min 3
low 3
high 4
:
:
numa_other 0
protection: (0, 2004, 2004, 2004)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pagesets
cpu: 0 pcp: 0
:
-
These protections are added to score to judge whether this zone should be used
for page allocation or should be reclaimed.
In this example, if normal pages (index=2) are required to this DMA zone and
watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
not be used because pages_free(1355) is smaller than watermark + protection[2]
(4 + 2004 = 2008). If this protection value is 0, this zone would be used for
normal page requirement. If requirement is DMA zone(index=0), protection[0]
(=0) is used.
zone[i]'s protection[j] is calculated by following expression.
(i < j):
zone[i]->protection[j]
= (total sums of present_pages from zone[i+1] to zone[j] on the node)
/ lowmem_reserve_ratio[i];
(i = j):
(should not be protected. = 0;
(i > j):
(not necessary, but looks 0)
The default values of lowmem_reserve_ratio[i] are
256 (if zone[i] means DMA or DMA32 zone)
32 (others).
As above expression, they are reciprocal number of ratio.
256 means 1/256. # of protection pages becomes about "0.39%" of total present
pages of higher zones on the node.
If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%).
having the same problem as you, I googled (a lot) and stumbled apon this page which might (or might not) be more understandable than the kernel doc.
(I do not quote here because it will be unreadable)
I found the wording in that document really confusing too. Looking at the source in mm/page_alloc.c helped to clear it up, so let me try my hand at a more straightforward explanation:
As is said in the page you quoted, these numbers "are reciprocal number of ratio". Worded differently: these numbers are divisors. So when calculating the reserve pages for a given zone in a node, you take the sum of pages in that node in zones higher than that one, divide it by the provided divisor, and that's how many pages you're reserving for that zone.
Example: let's assume a 1 GiB node with 768 MiB in zone Normal and 256 MiB in zone HighMem (assume no zone DMA). Let's assume the default highmem reserve "ratio" (divisor) of 32. And let's assume the typical 4 KiB page size. Now we can calculate the reserve area for zone Normal:
Sum of "higher" zones than zone Normal (just HighMem): 256 MiB = (1024 KiB / 1 MiB) * (1 page / 4 KiB) = 65536 pages
Area reserved in zone Normal for this node: 65536 pages / 32 = 2048 pages = 8 MiB.
The concept stays the same when you add more zones and nodes. Just remember that the reserved size is in pages---you never reserve a fraction of a page.
I find the kernel source code that explain very well and clear.
/*
* setup_per_zone_lowmem_reserve - called whenever
* sysctl_lowmem_reserve_ratio changes. Ensures that each zone
* has a correct pages reserved value, so an adequate number of
* pages are left in the zone after a successful __alloc_pages().
*/
static void setup_per_zone_lowmem_reserve(void)
{
struct pglist_data *pgdat;
enum zone_type j, idx;
for_each_online_pgdat(pgdat) {
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long managed_pages = zone->managed_pages;
zone->lowmem_reserve[j] = 0;
idx = j;
while (idx) {
struct zone *lower_zone;
idx--;
if (sysctl_lowmem_reserve_ratio[idx] < 1)
sysctl_lowmem_reserve_ratio[idx] = 1;
lower_zone = pgdat->node_zones + idx;
lower_zone->lowmem_reserve[j] = managed_pages /
sysctl_lowmem_reserve_ratio[idx];
managed_pages += lower_zone->managed_pages;
}
}
}
/* update totalreserve_pages */
calculate_totalreserve_pages();
}
And here even list an demo.
/*
* results with 256, 32 in the lowmem_reserve sysctl:
* 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
* 1G machine -> (16M dma, 784M normal, 224M high)
* NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
* HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
* HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA
*
* TBD: should special case ZONE_DMA32 machines here - in those we normally
* don't need any ZONE_NORMAL reservation
*/
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
#ifdef CONFIG_ZONE_DMA
256,
#endif
#ifdef CONFIG_ZONE_DMA32
256,
#endif
#ifdef CONFIG_HIGHMEM
32,
#endif
32,
};
In a word, the expression looks like,
zone[1]->lowmem_reserve[2] = zone[2]->managed_pages / sysctl_lowmem_reserve_ratio[1]
zone[0]->lowmem_reserve[2] = (zone[1] + zone[2])->managed_pages / sysctl_lowmem_reserve_ratio[0]

What is the return value of sched_find_first_bit if it doesn't find anything?

The kernel is 2.4.
On a side note, does anybody knows a good place where I can search for that kind of information? Searching Google for function definitions is frustrating.
If you plan on spending any significant time searching through or understanding the Linux kernel, I recommend downloading a copy and using Cscope.
Using Cscope on large projects (example: the Linux kernel)
I found the following in a copy of the Linux kernel 2.4.18.
The key seems to be the comment before this last piece of code below. It appears that the return value of sched_find_first_bit is undefined if no bit is set.
From linux-2.4/include/linux/sched.h:185
/*
* The maximum RT priority is configurable. If the resulting
* bitmap is 160-bits , we can use a hand-coded routine which
* is optimal. Otherwise, we fall back on a generic routine for
* finding the first set bit from an arbitrarily-sized bitmap.
*/
#if MAX_PRIO 127
#define sched_find_first_bit(map) _sched_find_first_bit(map)
#else
#define sched_find_first_bit(map) find_first_bit(map, MAX_PRIO)
#endif
From linux-2.4/include/asm-i386/bitops.h:303
/**
* find_first_bit - find the first set bit in a memory region
* #addr: The address to start the search at
* #size: The maximum size to search
*
* Returns the bit-number of the first set bit, not the number of the byte
* containing a bit.
*/
static __inline__ int find_first_bit(void * addr, unsigned size)
{
int d0, d1;
int res;
/* This looks at memory. Mark it volatile to tell gcc not to move it around */
__asm__ __volatile__(
"xorl %%eax,%%eax\n\t"
"repe; scasl\n\t"
"jz 1f\n\t"
"leal -4(%%edi),%%edi\n\t"
"bsfl (%%edi),%%eax\n"
"1:\tsubl %%ebx,%%edi\n\t"
"shll $3,%%edi\n\t"
"addl %%edi,%%eax"
:"=a" (res), "=&c" (d0), "=&D" (d1)
:"1" ((size + 31) >> 5), "2" (addr), "b" (addr));
return res;
}
From linux-2.4/include/asm-i386/bitops.h:425
/*
* Every architecture must define this function. It's the fastest
* way of searching a 140-bit bitmap where the first 100 bits are
* unlikely to be set. It's guaranteed that at least one of the 140
* bits is cleared.
*/
static inline int _sched_find_first_bit(unsigned long *b)
{
if (unlikely(b[0]))
return __ffs(b[0]);
if (unlikely(b[1]))
return __ffs(b[1]) + 32;
if (unlikely(b[2]))
return __ffs(b[2]) + 64;
if (b[3])
return __ffs(b[3]) + 96;
return __ffs(b[4]) + 128;
}
From linux-2.4/include/asm-i386/bitops.h:409
/**
* __ffs - find first bit in word.
* #word: The word to search
*
* Undefined if no bit exists, so code should check against 0 first.
*/
static __inline__ unsigned long __ffs(unsigned long word)
{
__asm__("bsfl %1,%0"
:"=r" (word)
:"rm" (word));
return word;
}

Resources