Change default CPU affinity - linux

I would like to know whether it's possible the default affinity for linux processes. The default value is ~0 (truncated to the number of CPUs available) but I'd like to be able to set it for all the process of the system. It would be also nice to do this at boot time so I could effectively prevent any process from using certain CPUs (unless explicitely set by a syscall).
Thanks!
David

From C program:
#define _GNU_SOURCE
#include <sched.h>
int sched_setaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);
see man sched_setaffinity for further information.
From the shell:
taskset <mask> <command> <args>
or
taskset -p <pid> <mask>
where <mask> is, for example, 0x00000001 for the first CPU.

Related

HugePages on Raspberry Pi 4

I need help about managing Hugepages on raspberry pi 4 running raspberry pi OS 64 bit.
I did not find much reliable information online.
First I recompiled the kernel source enabling Memory Management options --->Transparent Hugepage Support option.
When I run the command:
grep -i huge /proc/meminfo
The output is:
AnonHugePages: 319488 kB
ShmemHugePages: 0 kB
FileHugePages: 0 k
and running the command:
cat /sys/kernel/mm/transparent_hugepage/enabled
the output is:
[always] madvise never
So I think Transparent Huge Pages (AnonHugePages) should be set.
I need to use HugePages to map the largest contiguous memory chunk using mmap function, c code.
mem = mmap(NULL,buf_size,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0);
Looking at https://www.man7.org/linux/man-pages/man2/mmap.2.html there are two flags to manage the hugepages: MAP_HUGETLB flag and MAP_HUGE_2MB, MAP_HUGE_1GB flag.
My question is: To use HugePages should I map in this way?
mem = mmap(NULL,buf_size,PROT_READ|PROT_WRITE,MAP_SHARED,MAP_HUGETLB,fd,0);
Kernel configuration:
CONFIG_SYS_SUPPORTS_HUGETLBFS=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_TRANSPARENT_HUGE_PAGECACHE=y
# CONFIG_HUGETLBFS is not set
Huge pages are a way to enhance the performances of the applications by reducing the number of TLB misses. The mechanism coalesces contiguous standard physical pages (typical size of 4 KB) into a big one (e.g. 2 MB). Linux implements this feature in two flavors: Transparent Huge pages and explicit huge pages.
Transparent Huge Pages
Transparent huge pages (THP) are managed transparently by the kernel. The user space applications have no control on them. The kernel makes its best to allocate huge pages whenever it is possible but it is not guaranteed. Moreover, THP may introduce overhead as an underlying "garbage collector" kernel daemon named khugepaged is in charge of the coalescing of the physical pages to make huge pages. This may consume CPU time with undesirable effects on the performances of the running applications. In systems with time critical applications, it is generally advised to deactivate THP.
THP can be disabled on the boot command line (cf. the end of this answer) or from the shell in sysfs:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
N.B.: Some interesting papers exist on the performance evaluation/issues of the THP:
Transparent Hugepages: measuring the performance impact;
Settling the Myth of Transparent HugePages for Databases.
Explicit huge pages
If the huge pages are required at application level (i.e. from user space). HUGETLBFS kernel configuration must be set to activate the hugetlbfs pseudo-filesystem (the menu in the kernel configurator is something like: "File systems" --> "Pseudo filesystems" --> "HugeTLB file system support"). In the kernel source tree this parameter is in fs/Kconfig:
config HUGETLBFS
bool "HugeTLB file system support"
depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
SYS_SUPPORTS_HUGETLBFS || BROKEN
help
hugetlbfs is a filesystem backing for HugeTLB pages, based on
ramfs. For architectures that support it, say Y here and read
<file:Documentation/admin-guide/mm/hugetlbpage.rst> for details.
If unsure, say N.
For example, on an Ubuntu system, we can check:
$ cat /boot/config-5.4.0-53-generic | grep HUGETLBFS
CONFIG_HUGETLBFS=y
N.B.: On Raspberry Pi, it is possible to configure the apparition of /proc/config.gz and do the same with zcat to check the parameter. To make it, the configuration menu is: "General setup" --> "Kernel .config support" + "Enable access to .config through /proc/config.gz"
When this parameter is set, hugetlbfs pseudo-filesystem is added into the kernel build (cf. fs/Makefile):
obj-$(CONFIG_HUGETLBFS) += hugetlbfs/
The source code of hugetlbfs is located in fs/hugetlbfs/inode.c. At startup, the kernel will mount internal hugetlbfs file systems to support all the available huge page sizes for the architecture it is running on:
static int __init init_hugetlbfs_fs(void)
{
struct vfsmount *mnt;
struct hstate *h;
int error;
int i;
if (!hugepages_supported()) {
pr_info("disabling because there are no supported hugepage sizes\n");
return -ENOTSUPP;
}
error = -ENOMEM;
hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
sizeof(struct hugetlbfs_inode_info),
0, SLAB_ACCOUNT, init_once);
if (hugetlbfs_inode_cachep == NULL)
goto out;
error = register_filesystem(&hugetlbfs_fs_type);
if (error)
goto out_free;
/* default hstate mount is required */
mnt = mount_one_hugetlbfs(&hstates[default_hstate_idx]);
if (IS_ERR(mnt)) {
error = PTR_ERR(mnt);
goto out_unreg;
}
hugetlbfs_vfsmount[default_hstate_idx] = mnt;
/* other hstates are optional */
i = 0;
for_each_hstate(h) {
if (i == default_hstate_idx) {
i++;
continue;
}
mnt = mount_one_hugetlbfs(h);
if (IS_ERR(mnt))
hugetlbfs_vfsmount[i] = NULL;
else
hugetlbfs_vfsmount[i] = mnt;
i++;
}
return 0;
out_unreg:
(void)unregister_filesystem(&hugetlbfs_fs_type);
out_free:
kmem_cache_destroy(hugetlbfs_inode_cachep);
out:
return error;
}
A hugetlbfs file system is a sort of RAM file system into which the kernel creates files to back the memory regions mapped by the applications.
The amount of needed huge pages can be reserved by writing the number of needed huge pages into /sys/kernel/mm/hugepages/hugepages-hugepagesize/nr_hugepages.
Then, mmap() is able to map some part of the application address space onto huge pages. Here is an example showing how to do it:
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#define HP_SIZE (2 * 1024 * 1024) // <-- Adjust with size of the supported HP size on your system
int main(void)
{
char *addr, *addr1;
// Map a Huge page
addr = mmap(NULL, HP_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED| MAP_HUGETLB, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap()");
return 1;
}
printf("Mapping located at address: %p\n", addr);
pause();
return 0;
}
In the preceding program, the memory pointed by addr is based on huge pages. Example of usage:
$ gcc alloc_hp.c -o alloc_hp
$ ./alloc_hp
mmap(): Cannot allocate memory
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
0
$ sudo sh -c "echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
1
$ ./alloc_hp
Mapping located at address: 0x7f7ef6c00000
In another terminal, the process map can be observed to verify the size of the memory page (it is blocked in pause() system call):
$ pidof alloc_hp
13009
$ cat /proc/13009/smaps
[...]
7f7ef6c00000-7f7ef6e00000 rw-s 00000000 00:0f 331939 /anon_hugepage (deleted)
Size: 2048 kB
KernelPageSize: 2048 kB <----- The page size is 2MB
MMUPageSize: 2048 kB
[...]
In the preceding map, the file name /anon_hugepage for the huge page region is made internally by the kernel. It is marked deleted because the kernel removes the associated memory file which will make the file disappear as soon as there are no longer references on it (e.g. when the calling process ends, the underlying file is closed upon exit(), the reference counter on the file drops to 0 and the remove operation finishes to make it disappear).
Allocation of other huge page sizes
On Raspberry Pi 4B, the default huge page size is 2MB but the card supports several other huge page sizes:
$ ls -l /sys/kernel/mm/hugepages
total 0
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-1048576kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-2048kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-32768kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-64kB
To use them, it is necessary to mount a hugetlbfs type file system corresponding to the size of the desired huge page. The kernel documentation provides details on the available mount options. For example, to mount a hugetlbfs file system on /mnt/huge with 8 Huge Pages of size 64KB, the command is:
mount -t hugetlbfs -o pagesize=64K,size=512K,min_size=512K none /mnt/huge
Then it is possible to map huge pages of 64KB in a user program. The following program creates the /tmp/hpfs directory on which it mounts a hugetlbfs file system with a size of 4 huge pages of 64KB. A file named /memfile_01 is created and extended to the size of 2 huge pages. The file is mapped into memory thanks to mmap() system call. It is not passed MAP_HUGETLB flag as the provided file descriptor is for a file created on a hugetlbfs filesystem. Then, the program calls pause() to suspend its execution in order to make some observations in another terminal:
#include <sys/types.h>
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <fcntl.h>
#define ERR(fmt, ...) do { \
fprintf(stderr, \
"ERROR#%s#%d: "fmt, \
__FUNCTION__, __LINE__, ## __VA_ARGS__); \
} while(0)
#define HP_SIZE (64 * 1024)
#define HPFS_DIR "/tmp/hpfs"
#define HPFS_SIZE (4 * HP_SIZE)
int main(void)
{
void *addr;
char cmd[256];
int status;
int rc;
char mount_opts[256];
int fd;
rc = mkdir(HPFS_DIR, 0777);
if (0 != rc && EEXIST != errno) {
ERR("mkdir(): %m (%d)\n", errno);
return 1;
}
snprintf(mount_opts, sizeof(mount_opts), "pagesize=%d,size=%d,min_size=%d", HP_SIZE, 2*HP_SIZE, HP_SIZE);
rc = mount("none", HPFS_DIR, "hugetlbfs", 0, mount_opts);
if (0 != rc) {
ERR("mount(): %m (%d)\n", errno);
return 1;
}
fd = open(HPFS_DIR"/memfile_01", O_RDWR|O_CREAT, 0777);
if (fd < 0) {
ERR("open(%s): %m (%d)\n", "memfile_01", errno);
return 1;
}
rc = ftruncate(fd, 2 * HP_SIZE);
if (0 != rc) {
ERR("ftruncate(): %m (%d)\n", errno);
return 1;
}
addr = mmap(NULL, 2 * HP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
if (MAP_FAILED == addr) {
ERR("mmap(): %m (%d)\n", errno);
return 1;
}
// The file can be closed
rc = close(fd);
if (0 != rc) {
ERR("close(%d): %m (%d)\n", fd, errno);
return 1;
}
pause();
return 0;
} // main
The preceding program must be run as root as it calls mount():
$ gcc mount_tlbfs.c -o mount_tlbfs
$ cat /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepages
0
$ sudo sh -c "echo 8 > /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepages"
$ cat /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepages
8
$ sudo ./mount_tlbfs
In another terminal, the /proc/[pid]/smaps file can be displayed to check the huge page allocation. As soon as the program writes into the huge pages, the Lazy allocation mechanism triggers the effective allocation of the huge pages.
Cf. This article for future details
Early reservation
The huge pages are made with consecutive physical memory pages. The reservation should be done early in the system startup (especially on heavy loaded systems) as the physical memory may be so fragmented that it is sometimes impossible to allocate huge pages afterward. To reserve as early as possible, this can be done on the kernel boot command line:
hugepages=
[HW] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
If this is the first HugeTLB parameter on the command
line, it specifies the number of pages to allocate for
the default huge page size. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: <integer>
hugepagesz=
[HW] The size of the HugeTLB pages. This is used in
conjunction with hugepages (above) to allocate huge
pages of a specific size at boot. The pair
hugepagesz=X hugepages=Y can be specified once for
each supported huge page size. Huge page sizes are
architecture dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
transparent_hugepage=
[KNL]
Format: [always|madvise|never]
Can be used to control the default behavior of the system
with respect to transparent hugepages.
See Documentation/admin-guide/mm/transhuge.rst
for more details.
On Raspberry Pi, the boot command line can typically be updated in /boot/cmdline.txt and the current boot command line used by the running kernel can be seen in /proc/cmdline.
N.B.:
This recipe is explained in more details here and here
There is a user space library called libhugetlbfs which offers a layer of abstraction on top of the kernel's hugetlbfs mechanism described here. It comes with library services like get_huge_pages() and accompanying tools like hugectl. The goal of this user space service is to map the heap and text+data segments of STATICALLY linked executables into huge pages (the mapping of dynamically linked programs is not supported). All of this relies on the kernel features described in this answer.

Finding original MAC address from hardware itself

Is it possible to read the MAC address from the NIC directly? I have the code below but it just reads from the layer above but not the card itself.
I'm trying to figure out how to find the original MAC address of an Ethernet NIC on my Linux box. I understand how to find the current MAC address using ifconfig.
But the address can be changed, say by using
ifconfig eth0 hw ether uu:vv:ww:yy:xx:zz
or setting it "permanently" using /etc/sysconfig/network-scripts/ifcfg-eth0.
How do I find the original MAC address? There must be a way to find it, because it is still burned permanently into the card, but I can't find a tool to read the burned in address.
Is there any utility or command for that?
I suppose it should be possible to write C code for it, below code gives my current MAC but not the original MAC:
#include <stdio.h> /* Standard I/O */
#include <stdlib.h> /* Standard Library */
#include <errno.h> /* Error number and related */
#define ENUMS
#include <sys/socket.h>
#include <net/route.h>
#include <net/if.h>
#include <features.h> /* for the glibc version number */
#if __GLIBC__ >= 2 && __GLIBC_MINOR >= 1
#include <netpacket/packet.h>
#include <net/ethernet.h> /* the L2 protocols */
#else
#include <asm/types.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h> /* The L2 protocols */
#endif
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/un.h>
#include <sys/ioctl.h>
#include <netdb.h>
int main( int argc, char * argv[] ){
unsigned char mac[IFHWADDRLEN];
int i;
get_local_hwaddr( argv[1], mac );
for( i = 0; i < IFHWADDRLEN; i++ ){
printf( "%02X:", (unsigned int)(mac[i]) );
}
}
int get_local_hwaddr(const char *ifname, unsigned char *mac)
{
struct ifreq ifr;
int fd;
int rv; // return value - error value from df or ioctl call
/* determine the local MAC address */
strcpy(ifr.ifr_name, ifname);
fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
if (fd < 0)
rv = fd;
else {
rv = ioctl(fd, SIOCGIFHWADDR, &ifr);
if (rv >= 0) /* worked okay */
memcpy(mac, ifr.ifr_hwaddr.sa_data, IFHWADDRLEN);
}
return rv;
}
OS: Red Hat Linux, 2.6.18.8-1
Certainly in ethtool you can just print the address:
-P --show-permaddr
Queries the specified network device for permanent hardware address.
e.g.
ethtool -P eth0
produces
Permanent address: 94:de:80:6a:21:25
Try cat /sys/class/net/eth0/address or cat /sys/class/net/em1/address if using Fedora. It should work.
The original answer is here: Notes of a Systems Admin
The only way to find the original MAC address is to use the same method the network card driver does - unfortunately, I don't believe there is a generic way to tell the driver to provide its MAC address "as provided by the hardware". Of course, there are cases where there isn't a hardware network card for that particular interface - virtual network drivers for virtualization and when using bridges and software switches for example.
And of course, the hardware may be such that you can't actually read the "original" MAC address when it has been overwritten by software, because there is only one set of registers for the MAC address itself.
I had a quick look at the pcnet32.c drivers (because it's one of the models of network card that I have a rough idea how it works and where the different registers are, etc, so I can see what it does). As far as I can see, it supports no method of actually asking "what is your PROM Ethernet address" - the MAC address is read out during the "probe1" section of the module initialization, and stored away. No further access to those hardware registers is made.
Well, the old ethernet address remains in the first bytes of the card eeprom (at least for some types of cards), so it is possible to extract it using ethtool
bash$ sudo ethtool -e eth1
Offset Values
------ ------
0x0000 tt uu ww xx yy zz 79 03
0x....
where tt:uu:ww:xx:yy:zz is old mac address
This may not be the programmatic way, but why not search dmesg. All of my machines' NICs spit out the MAC address at detection time.
Try something like this:
dmesg|grep eth0
Different NICs display the MAC address differently, but the log will always contain the kernel given name of the adapter (in most cases eth0 or wlan0).
In Ubuntu 18.4 LTS
First you need to find the network interface names on the computer by command
ls /sys/class/net
My output is
docker0 enp0s31f6 lo
Now you can use the command as discussed above, dont forget the sudo
sudo ethtool -P enp0s31f6
This command lists all the ethernet devices and original HW addresses.
dmesg | grep eth | grep IRQ | awk {'print "permanent address of " $5 " " $9'} |tr "," " "

How to monitor number of syscalls executed by kernel?

I need to monitor amount of system calls executed by Linux.
I'm aware that vmstat has ability to show this for BSD and AIX systems, but for Linux it can't (according to man page).
Is there any counter in /proc? Or is there any other way to monitor it?
I wrote a simple SystemTap script(based on syscalls_by_pid.stp).
It produces output like this:
ProcessName #SysCalls
munin-graph 38609
munin-cron 8160
fping 4502
check_http_demo 2584
check_nrpe 2045
sh 1836
nagios 886
sendmail 747
smokeping 649
check_http 571
check_nt 376
pcscd 216
ping 108
check_ping 100
crond 87
stapio 69
init 56
syslog-ng 27
sshd 17
ntpd 9
hp-asrd 8
hald-addon-stor 7
automount 6
httpd 4
stap 3
flow-capture 2
gam_server 2
Total 61686
The script itself:
#! /usr/bin/env stap
#
# Print the system call count by process name in descending order.
#
global syscalls
probe begin {
print ("Collecting data... Type Ctrl-C to exit and display results\n")
}
probe syscall.* {
syscalls[execname()]++
}
probe end {
printf ("%-20s %-s\n\n", "ProcessName", "#SysCalls")
summary = 0
foreach (procname in syscalls-) {
printf("%-20s %-10d\n", procname, syscalls[procname])
summary = summary + syscalls[procname]
}
printf ("\n%-20s %-d\n", "Total", summary)
}
You can use pstrace as said Jeff Foster to trace the system call.
Also, you can use strace and ltrace
strace - trace system calls and signals
ltrace - A library call tracer
You can use ptrace to monitor all syscalls (see here)
I believe OProfile can do this.
I am not aware of a centralized way to monitor syscalls throughout the entire OS. Maybe do a ptrace on the init process and follow all children? But I don't know if that will work.
Your best bet is to write a patch to the kernel itself to do this. The closest thing to this that I've seen is a cgroup implementation for enforcing permissions on what syscalls can be executed at runtime. You can find the patch here:
https://github.com/luksow/syscalls-cgroup
It shouldn't be too much more work to throw a counter in there, from a kernel programming perspective.

Linux: Detect 64-bit kernel (long mode) from 32-bit user mode program

What's the best and most reliable way to detect if a 32-bit user mode program is running on a 64-bit kernel or not (i.e. if the system is in 'long mode')? I'd rather not call external programs if possible (or have to load any kernel modules).
Note: I want to detect whether a 64-bit kernel is being used (or really, whether the CPU is in long mode), not simply if a 64-bit capable processor is present (/proc/cpuinfo tells me that but not whether the 64-bit capability is being used).
The kernel fakes a 32-bit processor if uname is compiled 32-bit or if setarch i686 is used.
Call the uname() function and check the returned machine string, which will be x86_64 for a 64-bit Intel platform.
One way of reversing the effect of the use of setarch is to reset the personality:
#include <stdio.h>
#include <sys/utsname.h>
#include <sys/personality.h>
int main()
{
struct utsname u;
personality(PER_LINUX);
uname(&u);
puts(u.machine);
return 0;
}
This shows the right results when compiled in 32-bit mode and run on a 64-bit system:
$ gcc -m32 -o u u.c
$ ./u
x86_64
$ setarch i686 ./u
x86_64
EDIT: Fixed code to reverse effect of setarch.
Reference.
Assuming that uname() is cheating, there are still several mechanisms. One way is to check the width of the address of any of the kernel symbols.
#include <stdio.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
int main(int argc, char **argv) {
char *inputline = malloc(1024);
char *oinputline = inputline;
int fd = open("/proc/kallsyms", O_RDONLY);
int numnibbles = 0;
if (fd == -1) {
perror("open");
free(inputline);
exit(1);
}
read(fd, inputline, 1024);
close(fd);
while(!isspace(*inputline)) {
numnibbles++;
inputline++;
}
printf("%dbit\n", numnibbles*4);
free(oinputline);
exit (0);
}
If the kernel is configured for it, you can read the kernel config from /proc/config.gz
zcat /proc/config.gz | grep CONFIG_64BIT
# CONFIG_64BIT is not set
I'm not sure how portable you need it to be- it doesn't seem like a super common config option.

How is stack size of Linux process related to pthread, fork and exec

I have a question about the stack size of a process on Linux. Is this stack size determined at linkage time and is coded in the ELF file?
I wrote a program which prints its stack size by
pthread_attr_getstacksize(&attr, &stacksize);
And if I run this program directly from a shell, it gives a value of about 10MB. But when I exec it from a thread which belongs to a multi-thread program, it gives a value of about 2MB.
So I want to know what factors affect the stack size of a process which is fork and exec-ed from some parent process. And is it possible to set the stack size of a process in its parent at run time before fork and exec the child?
As the manpage for pthread_create(3) says:
"On Linux/x86-32, the default stack size for a new thread is 2 megabytes", Unless the RLIMIT_STACK resource limit (ulimit -s) is set: in that case, "it determines the default stack size of new threads".
You can check this fact by retrieving the current value of RLIMIT_STACK with getrlimit(2), as in the following program:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
int main()
{
/* Warning: error checking removed to keep the example small */
pthread_attr_t attr;
size_t stacksize;
struct rlimit rlim;
pthread_attr_init(&attr);
pthread_attr_getstacksize(&attr, &stacksize);
getrlimit(RLIMIT_STACK, &rlim);
/* Don't know the exact type of rlim_t, but surely it will
fit into a size_t variable. */
printf("%zd\n", (size_t) rlim.rlim_cur);
printf("%zd\n", stacksize);
pthread_attr_destroy(&attr);
return 0;
}
These are the results when trying to run it (compiled to a.out) from the command line:
$ ulimit -s
8192
$ ./a.out
8388608
8388608
$ ulimit -s unlimited
$ ./a.out
-1
2097152
$ ulimit -s 4096
$ ./a.out
4194304
4194304
According to the man page for fork(), "The child process is created with a single thread—the one that called fork()."
So, the stack size of the main thread for the child process will be the stack size of the thread that calls fork().
But, when one of the exec() functions is called (ultimately calling execve() to do the real work), the process image is replaced with the new program. At that time, the stack is re-created according to the stack size soft limit (kernel 2.6.23 and later), which can be seen by calling getrlimit(RLIMIT_STACK, &rlimitStruct).
You can control this before calling exec by setting the soft limit using setrlimit(RLIMIT_STACK, &rlimitStruct) (provided you don't try to increase the hard limit or set the soft limit higher than the hard limit).

Resources