Who zeroes pages while calling calloc() in Linux?

Who zeroes pages while calling calloc() in Linux? - linux

I am aware that an implementer has a choice of whether he wants to zero a malloc page or let OS give him a zeroed page (for more optimization purposes).
My question is simple - in Ubuntu 14.04 LTS which comes with linux kernel 3.16 and gcc 4.8.4, who will zero my pages? Is it in user land or kernel land?

It can depend on where the memory came from. The calloc code is userland, and will zero a memory page that gets re-used by a process. This happens when the memory is previously used and then freed, but not returned to the OS. However, if the page is newly allocated to the process, it will come already cleared to 0 by the OS (for security purposes), and so does not need to be cleared by calloc. This means calloc can potentially be faster than calling malloc followed by memset, since it can skip the memset if it knows it will already by zeroed.

That depends on the implementer of your standard library, not on the host system. It is not possible to give a specific answer for a particular OS, since it may be the build target of multiple compilers and their libraries - including on other systems, if you consider the possibility of cross-compiling (building on one type of system to target another).
Most implementations I've seen of calloc() use a call of malloc() followed by either a call of memset() or (with some implementations that target unix) a legacy function called bzero() - which is, itself, sometimes replaced by a macro call that expands to a call of memset() in a number of recent versions of libraries.
memset() is often hand-optimised. But, again, it is up to the implementer of the library.

Related

Is it possible to force a range of virtual addresses?

I have an Ada program that was written for a specific (embedded, multi-processor, 32-bit) architecture. I'm attempting to use this same code in a simulation on 64-bit RHEL as a shared object (since there are multiple versions and I have a requirement to choose a version at runtime).
The problem I'm having is that there are several places in the code where the people who wrote it (not me...) have used Unchecked_Conversions to convert System.Addresses to 32-bit integers. Not only that, but there are multiple routines with hard-coded memory addresses. I can make minor changes to this code, but completely porting it to x86_64 isn't really an option. There are routines that handle interrupts, CPU task scheduling, etc.
This code has run fine in the past when it was statically-linked into a previous version of the simulation (consisting of Fortran/C/C++). Now, however, the main executable starts, then loads a shared object based on some inputs. This shared object then checks some other inputs and loads the appropriate Ada shared object.
Looking through the code, it's apparent that it should work fine if I can keep the logical memory addresses between 0 and 2,147,483,647 (32-bit signed int). Is there a way to either force the shared object loader to leave space in the lower ranges for the Ada code or perhaps make the Ada code "think" that it's addresses are between 0 and 2,147,483,647?

Is there a way to either force the shared object loader to leave space in the lower ranges for the Ada code
The good news is that the loader will leave the lower ranges untouched.
The bad news is that it will not load any shared object there. There is no interface you could use to influence placement of shared objects.
That said, dlopen from memory (which we implemented in our private fork of glibc) would allow you to do that. But that's not available publicly.
Your other possible options are:
if you can fit the entire process into 32-bit address space, then your solution is trivial: just build everything with -m32.
use prelink to relocate the library to desired address. Since that address should almost always be available, the loader is very likely to load the library exactly there.
link the loader with a custom mmap implementation, which detects the library of interest through some kind of side channel, and does mmap syscall with MAP_32BIT set, or
run the program in a ptrace sandbox. Such sandbox can again intercept mmap syscall, and or-in MAP_32BIT when desirable.
or perhaps make the Ada code "think" that it's addresses are between 0 and 2,147,483,647?
I don't see how that's possible. If the library stores an address of a function or a global in a 32-bit memory location, then loads that address and dereferences it ... it's going to get a 32-bit truncated address and a SIGSEGV on dereference.

Linux kernel assembly and logic

My question is somewhat weird but I will do my best to explain.
Looking at the languages the linux kernel has, I got C and assembly even though I read a text that said [quote] Second iteration of Unix is written completely in C [/quote]
I thought that was misleading but when I said that kernel has assembly code I got 2 questions of the start
What assembly files are in the kernel and what's their use?
Assembly is architecture dependant so how can linux be installed on more than one CPU architecture
And if linux kernel is truly written completely in C than how can it get GCC needed for compiling?
I did a complete find / -name *.s
and just got one assembly file (asm-offset.s) somewhere in the /usr/src/linux-headers-`uname -r/
Somehow I don't think that is helping with the GCC working, so how can linux work without assembly or if it uses assembly where is it and how can it be stable when it depends on the arch.
Thanks in advance

1. Why assembly is used?
Because there are certain things then can be done only in assembly and because assembly results in a faster code. For eg, "you can get access to unusual programming modes of your processor (e.g. 16 bit mode to interface startup, firmware, or legacy code on Intel PCs)".
Read here for more reasons.
2. What assembly file are used?
From: https://www.kernel.org/doc/Documentation/arm/README
"The initial entry into the kernel is via head.S, which uses machine
independent code. The machine is selected by the value of 'r1' on
entry, which must be kept unique."
From https://www.ibm.com/developerworks/library/l-linuxboot/
"When the bzImage (for an i386 image) is invoked, you begin at ./arch/i386/boot/head.S in the start assembly routine (see Figure 3 for the major flow). This routine does some basic hardware setup and invokes the startup_32 routine in ./arch/i386/boot/compressed/head.S. This routine sets up a basic environment (stack, etc.) and clears the Block Started by Symbol (BSS). The kernel is then decompressed through a call to a C function called decompress_kernel (located in ./arch/i386/boot/compressed/misc.c). When the kernel is decompressed into memory, it is called. This is yet another startup_32 function, but this function is in ./arch/i386/kernel/head.S."
Apart from these assembly files, lot of linux kernel code has usage of inline assembly.
3. Architecture dependence?
And you are right about it being architecture dependent, that's why the linux kernel code is ported to different architecture.
Linux porting guide
List of supported arch

Things written mainly in assembly in Linux:
Boot code: boots up the machine and sets it up in a state in which it can start executing C code (e.g: on some processors you may need to manually initialize caches and TLBs, on x86 you have to switch to protected mode, ...)
Interrupts/Exceptions/Traps entry points/returns: there you need to do very processor-specific things, e.g: saving registers and reenabling interrupts, and eventually restoring registers and properly returning to user mode. Some exceptions may be handled entirely in assembly.
Instruction emulation: some CPU models may not support certain instructions, may not support unaligned data access, or may not have an FPU. An option is using emulation when getting the corresponding exception.
VDSO: the VDSO is a virtual library that the kernel maps into userspace. It allows e.g: selecting the optimal syscall sequence for the current CPU (on x86 use sysenter/syscall instead of int 0x80 if available), and implementing certain system calls without requiring a context switch (e.g: gettimeofday()).
Atomic operations and locks: Maybe in a future some of these could be written using C11 support for atomic operations.
Copying memory from/to user mode: Besides using an optimized copy, these check for out-of-bounds access.
Optimized routines: the kernel has optimized version of some routines, e.g: crypto routines, memset, clear_page, csum_copy (checksum and copy to another place IP data in one pass), ...
Support for suspend/resume and other ACPI/EFI/firmware thingies
BPF JIT: newer kernels include a JIT compiler for BPF expressions (used for example by tcpdump, secmode mode 2, ...)
...
To support different architectures, Linux has assembly code (re-)written for each architecture it supports (and sometimes, there are several implementations of some code for different platforms using the same CPU architecture). Just look at all the subdirectories under arch/

Assembly is needed for a couple of reasons.
There are many instructions that are needed for the operation of an operating system that have no C equivalent, at least on most processors. A good example on Intel x86/64 processors is the iret instruciton, which returns from hardware/software interrupts. These interrupts are key to handling hardware events (like a keyboard press) and system calls from programs on older processors.
A computer does not start up in a state that is immediately ready for execution of C code. For an Intel example, when execution gets to the startup routine the processor may not be in 32-bit mode (or 64-bit mode), and the stack required by C also may not be ready. There are some other features present in some processors (like paging) which need to be turned on from assembly as well.
However, most of the Linux kernel is written in C, which interfaces with some platform specific C/assembly code through standardized interfaces. By separating the parts in this way, most of the logic of the Linux kernel can be shared between platforms. The build system simply compiles the platform independent and dependent parts together for specific platforms, which results in different executable kernel files for different platforms (and kernel configurations for that matter).

Assembly code in the kernel is generally used for low-level hardware interaction that can't be done directly from C. They're like a platform- specific foundation that's used by higher-level parts of the kernel that are written in C.
The kernel source tree contains assembly code for a variety of systems. When you compile a kernel for a particular type of system (such as an x86 PC), only the appropriate assembly code for that platform is included in the build process.

Linux is not the second version of Unix (or Unix in general). It is Unix compatible, but Unix and Linux have separate histories and, in terms of code base (of their kernels), are completely separate. Linus Torvald's idea was to write an open source Unix.
Some of the lower level things like some of the architecture dependent parts of memory management are done in assembly. The old (but still available) Linux kernel API for x86, int 0x80, is implemented in assembly. There are probably other places in the kernel that are implemented in assembly, but I don't know any others.
When you compile the kernel, you select an architecture to target. Depending on the target, the right assembly files for that architecture are included in the build.
The reason you don't find anything is because you're searching the headers, not the sources. Download a tar ball from kernel.org and search that.

In linux (or POSIX) function similar to win32 mem api

I'm writing interpreted language on windows, and I use PAGE_GUARD to implement stack and HeapCreate / HeapAlloc for dynamic allocation of my language.
Maybe I'll need to port my lang to other OS.. So, In linux (or POSIX standard..), What is similar to these win32 api? (I hope they is not very different to use..)
Ok, see below if you don't know these win32 APIs:
HeapCreate - simple. Create a new heap:
void *mem = malloc(123); // alloc from default heap
HANDLE hHeap = HeapCreate(...); // create a new heap
void *mem2 = HeapAlloc(hHeap, some_flag, 123); // alloc from new heap
PAGE_GUARD - bit complex; it is used to implement stack. For example, there is a stack, whose max size is 5-pages. For saving memory, I'll alloc only one page and just "reserve" virtual memory address of 4 pages.
---------
| alloc |
---------
|reserve|
---------
|reserve|
---------
|reserve|
---------
|reserve|
---------
When first page of stack is used entirely, and program is about to use more stack, access violation occur. Then, I "commit" second page and continue program.
PAGE_GUARD is just helper to do this. (In win95, there isn't page guard so win95 perform without this helper) If I commit and mark the second page "guarded" ahead-of-time, and program use more stack, then GUARD-exception occur and OS unmark the page automatically. I have to only commit and mark the next page.

Read Advanced Linux Programming. Don't seek an exact equivalent in Linux for each functionality of Win32 that you know or want. Learn to natively think in Linux terms. Study free software similar to yours (see freecode or sourceforge to find some).
And yes, Posix or Linux vs Windows is very different, notably for their notion of processes, etc...
You probably want mmap(2) and mprotect(2); I don't know at all Windows (so I have no idea of what HeapCreate does).
Maybe using the lower layer of cross-platform toolkits like Qt (i.e. QtCore...) or Glib (from Gtk ...) might help you.
Linux C standard library is often GNU libc (but you could use some other, e.g. MUSL libc, which is very readable IMHO). It use syscalls listed in syscalls(2) and implemented by the Linux kernel (in particular, malloc(3) is generally built above mmap(2)...).
Take the habit of studying the source code of free software if that helps you.
BTW, for an interpreter, you could consider using Boehm's conservative garbage collector...

how come an x64 OS can run a code compiled for x86 machine

Basically, what I wonder is how come an x86-64 OS can run a code compiled for x86 machine. I know when first x64 Systems has been introduced, this wasn't a feature of any of them. After that, they somehow managed to do this.
Note that I know that x86 assembly language is a subset of x86-64 assembly language and ISA's is designed in such a way that they can support backward compatibility. But what confuses me here is stack calling conventions. These conventions differ a lot depending on the architecture. For example, in x86, in order to backup frame pointer, proceses pushes where it points to stack(RAM) and pops after it is done. On the other hand, in x86-64, processes doesn't need to update frame pointer at all since all the references is given via stack pointer. And secondly, While in x86 architecture arguments to functions is passed by stack in x86-64, registers are used for that purpose.
Maybe this differences between stack calling conventions of x86-64 and x64 architecture may not affect the way program stack grows as long as different conventions are not used at the same time and this is mostly the case because x32 functions are called by other x32's and same for x64. But, at one point, a function (probably a system function) will call a function whose code is compiled for a x86-64 machine with some arguments, at this point, I am curious about how OS(or some other control unit) handle to get this function work.
Thanks in advance.

Part of the way that the i386/x86-64 architecture is designed is that the CS and other segment registers refer to entries in the GDT. The GDT entries have a few special bits besides the base and limit that describe the operating mode and privilege level of the current running task.
If the CS register refers to a 32-bit code segment, the processor will run in what is essentially i386 compatibility mode. Likewise 64-bit code requires a 64-bit code segment.
So, putting this all together.
When the OS wants to run a 32-bit task, during the task switch into it, it loads a value into CS which refers to a 32-bit code segment. Interrupt handlers also have segment registers associated with them, so when a system call occurs or an interrupt occurs, the handler will switch back to the OS's 64-bit code segment, (allowing the 64-bit OS code to run correctly) and the OS then can do its work and continue scheduling new tasks.
As a follow up with regards to calling convention. Neither i386 or x86-64 require the use of frame pointers. The code is free to do as it pleases. In fact, many compilers (gcc, clang, VS) offer the ability to compile 32-bit code without frame pointers. What is important is that the calling convention is implemented consistently. If all the code expects arguments to be passed on the stack, that's fine, but the called code better agree with that. Likewise, passing via registers is fine too, just everyone has to agree (at least at the library interface level, internal functions can generally do as they please).
Beyond that, just keep in mind that the difference between the two isn't really an issue because every process gets its own private view of memory. A side consequence though is that 32-bit apps can't load 64-bit dlls, and 64-bit apps can't load 32-bit dlls, because a process either has a 32-bit code segment or a 64-bit code segment. It can't be both.

The processor in put into legacy mode, but that requires everything executing at that time to be 32bit code. This switching is handled by the OS.
Windows : It uses WoW64. WoW64 is responsible for changing the processor mode, it also provides the compatible dll and registry functions.
Linux : Until recently Linux used to (like windows) shift to running the processor in legacy mode when ever it started executing 32bit code, you needed all the 32bit glibc libraries installed, and it would break if it tried to work together with 64bit code. Now there are implementing the X32 ABI which should make everything run like smoother and allow 32bit applications to access x64 feature like increased no. of registers. See this article on the x32 abi
PS : I am not very certain on the details of things, but it should give you a start.
Also, this answer combined with Evan Teran's answer probably give a rough picture of everything that is happening.

Does mprotect flush the instruction cache on ARM Linux?

I am writing a JIT on ARM Linux that executes an instruction set that contains self-modifying code. The instruction set does not have any cache flush instructions (similar to x86 in that respect).
If I write out some code to a page and then call mprotect on that page, is that sufficient to invalidate the instruction cache? Or do I also need to use the cacheflush syscall on those pages?

You'd expect that the mmap/mprotect syscalls would establish mappings that are updated immediately, and need no further interaction to use the memory ranges as specified. I see that the kernel does indeed flush caches on mprotect. In that case, no cache flush would be required.
However, I also see that some versions of libc do call cacheflush after mprotect, which would imply that some environments would need the caches flushed (or have previously). I'd take a guess that this is a workaround to a bug.
You could always add the call to cacheflush; although it's extra code, it shouldn't be to harmful - at worst, the caches will already be flushed. You could always write a quick test and see what happens...

In Linux specifically, mprotect DOES cacheflush all caches since at least version 2.6.39 (and even before that for sure). You can see that in the code:
https://elixir.bootlin.com/linux/v2.6.39.4/source/mm/mprotect.c#L122 .
If you are writing a POSIX portable code, I would call cacheflush as the standard C library is not demanding such behavior from the kernel, nor from the implementation.
Edit: You should also be carefull and check what flush_cache_range does in the specific architecture you are implementing for, as in some architecture (like ARM64) this function does nothing...

I believe you do not have to explicitly flush the cache.
Which processor is this? ARMv5? ARMv7?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string