Twice as many page faults when reading from a large malloc'ed array instead of just storing? - linux

I am doing a simple test on monitoring page faults with the code below. What I don't understand is how one line of code doubled my page fault count.
If I use
ptr[i+4096] = 'A';
I get 25,722 page-faults with the perf tool, which is what I expected,
but if I use
tmp = ptr[i+4096];
instead, the page-faults double to 51,322.
I don't know how to explain it. Below is the complete code. Thanks!
#include <stdlib.h>

void do_something() {
    int i;
    char *ptr;
    char tmp;
    int j;

    ptr = malloc(100*1024*1024);
    for (i = 0; i < 100*1024*1024; i += 4096) {
        //ptr[i+4096] = 'A';   /* write one page ahead: one fault per page */
        tmp = ptr[i+4096];     /* read one page ahead: doubles the fault count */
                               /* (note: the last iteration touches one byte past the allocation) */
        for (j = 0; j < 4096; j++)
            ptr[i+j] = (char)(i & 0xff); // page fault
    }
    free(ptr);
}

int main(int argc, char *argv[]) {
    do_something();
    return 0;
}
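The counts above were collected with perf; the invocation would look something like this (a sketch, assuming the compiled binary is ./a.out):

$ perf stat -e page-faults ./a.out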
Machine Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
Stepping: 2
CPU MHz: 3096.188
BogoMIPS: 6197.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
3.10.0-514.32.3.el7.x86_64 #1

malloc() will often satisfy requests for memory by asking the OS for new pages, e.g., via mmap. Such pages are generally allocated lazily: no physical page is allocated until the first access.
What happens then depends on the type of that first access: on a read, Linux maps in a shared read-only copy-on-write (COW) page of zeros to satisfy it; if you later write, a second fault is taken to allocate a private writable page.
When you do the write first, that first step is skipped. That's the usual case, since code generally isn't reading from newly allocated memory, whose contents are undefined (at least when you get it from malloc).
Note that the above describes how newly allocated pages work in Linux. With malloc there is another layer: malloc will generally try to satisfy requests from blocks the process freed earlier, rather than continually requesting new memory. When memory is re-used that way, it will generally already be paged in and the above won't apply. Of course, for your initial big allocation of 100 MiB there is no memory to re-use, so you can be sure the allocator is getting it from the OS.
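A quick way to observe the two-faults-per-page effect without perf is to read the process's minor-fault counter from getrusage around each access pattern. This is a minimal sketch, assuming glibc serves the 100 MiB request with mmap and returns it to the OS on free (so the second malloc gets fresh pages again):

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    size_t len = 100 * 1024 * 1024;
    volatile char tmp;
    size_t i;

    char *p = malloc(len);              /* fresh, lazily allocated pages */
    long before = minor_faults();
    for (i = 0; i < len; i += 4096)
        tmp = p[i];                     /* read first: zero page mapped in */
    for (i = 0; i < len; i += 4096)
        p[i] = 'A';                     /* then write: a second (COW) fault per page */
    printf("read-then-write: %ld faults\n", minor_faults() - before);
    free(p);

    char *q = malloc(len);              /* assumed to be a fresh mapping again */
    before = minor_faults();
    for (i = 0; i < len; i += 4096)
        q[i] = 'A';                     /* write first: one fault per page */
    printf("write-only:      %ld faults\n", minor_faults() - before);
    free(q);
    return 0;
}

The first pattern should report roughly twice as many faults as the second, matching the 51,322 vs 25,722 numbers in the question.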

Related

Can the logical erase block size of an MTD device be increased?

The minimum erase block size for jffs2 (mtd-utils version 1.5.0, mkfs.jffs2 revision 1.60) seems to be 8KiB:
Erase size 0x1000 too small. Increasing to 8KiB minimum
However I am running Linux 3.10 with an at25df321a,
m25p80 spi32766.0: at25df321a (4096 Kbytes),
and the erase block size is only 4KiB:
mtd5
Name: spi32766.0
Type: nor
Eraseblock size: 4096 bytes, 4.0 KiB
Amount of eraseblocks: 1024 (4194304 bytes, 4.0 MiB)
Minimum input/output unit size: 1 byte
Sub-page size: 1 byte
Character device major/minor: 90:10
Bad blocks are allowed: false
Device is writable: true
Is there a way to make the mtd system treat multiple erase blocks as one? Maybe some ioctl or module parameter?
If I flash a jffs2 image with larger erase block size, I get lots of kernel error messages, missing files and sometimes panic.
workaround
I found that flash_erase --jffs2 results in a working filesystem in spite of the 4KiB erase block size. So I hacked the mkfs.jffs2.c file and the resulting image seems to work fine. I'll give it some testing.
diff -rupN orig/mkfs.jffs2.c new/mkfs.jffs2.c
--- orig/mkfs.jffs2.c 2014-10-20 15:43:31.751696500 +0200
+++ new/mkfs.jffs2.c 2014-10-20 15:43:12.623431400 +0200
@@ -1659,11 +1659,11 @@ int main(int argc, char **argv)
 			}
 			erase_block_size *= units;
 
-			/* If it's less than 8KiB, they're not allowed */
-			if (erase_block_size < 0x2000) {
-				fprintf(stderr, "Erase size 0x%x too small. Increasing to 8KiB minimum\n",
+			/* If it's less than 4KiB, they're not allowed */
+			if (erase_block_size < 0x1000) {
+				fprintf(stderr, "Erase size 0x%x too small. Increasing to 4KiB minimum\n",
 					erase_block_size);
-				erase_block_size = 0x2000;
+				erase_block_size = 0x1000;
 			}
 			break;
 		}
http://lists.infradead.org/pipermail/linux-mtd/2010-September/031876.html
JFFS2 should be able to fit at least one node into an eraseblock. The
maximum node size is 4KiB plus a few bytes. This is why the minimum
eraseblock size is 8KiB.
But in practice, even 8KiB is bad because you end up wasting a lot of
space at the end of eraseblocks.
You should join several eraseblocks into one virtual eraseblock of 64 or
128 KiB and use that - it will be more optimal.
Some drivers have already implemented this. I know about the
MTD_SPI_NOR_USE_4K_SECTORS
Linux configuration option. It has to be set to "n" to enable large erase sectors of size 0x00010000.
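Assuming a kernel where the spi-nor driver provides that option (in recent kernels the symbol lives in drivers/mtd/spi-nor/Kconfig), disabling it in the kernel configuration would look like this:

# CONFIG_MTD_SPI_NOR_USE_4K_SECTORS is not set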

Accessing large memory (32 GB) using /dev/zero

I want to use /dev/zero for storing lots of temporary data (32 GB or around that). I am doing this:
fd = open("/dev/zero", O_RDWR );
// <Exit on error>
vbase = (uint64_t*) mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
// <Exit on error>
ftruncate(fd, (off_t) MEMSIZE);
I am changing MEMSIZE from 1 GB to 32 GB (performing a memtest) to see if I can really access all that range. I run out of memory at 1 GB.
Is there something I am missing? Am I mmap'ing correctly?
Or am I running into some system limit? How can I check whether that is happening?
P.S.: I run many programs that generate many gigs of data within a single file, so I don't know if there is an artificial upper limit; I just seem to be running into something.
I have to admit I'm confused about what you're actually trying to do. Anyway, a couple of reasons why what you're doing might not work:
From the mmap(2) manpage: "MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero. The fd and offset arguments are ignored;"
From the null(4) manpage: "Data written to a null or zero special file is discarded."
So anyway: before MAP_ANONYMOUS existed, mmap'ing /dev/zero was sometimes used to get anonymous (i.e. not backed by any file) memory. There is no need to do both. In either case, actually writing to all that memory implies that you need some kind of backing store for it, either physical memory or swap space. If you cannot guarantee that, maybe it's better to mmap() a real file on a filesystem with enough space?
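A minimal sketch of the anonymous-mapping route, dropping the /dev/zero fd and the ftruncate entirely (MEMSIZE and the page-touch loop are illustrative assumptions; this requires a 64-bit build):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

#define MEMSIZE ((size_t)32 * 1024 * 1024 * 1024)   /* 32 GiB */

int main(void) {
    /* MAP_ANONYMOUS ignores fd and offset, so pass -1 and 0. */
    uint64_t *vbase = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (vbase == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Touch one word per 4 KiB page so the pages are actually instantiated
       (and backing store is actually needed). */
    for (size_t i = 0; i < MEMSIZE / sizeof(uint64_t); i += 4096 / sizeof(uint64_t))
        vbase[i] = i;
    munmap(vbase, MEMSIZE);
    return 0;
}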
Look into the Linux kernel mmap implementation; the call chain is:
vm_mmap -> vm_mmap_pgoff -> do_mmap_pgoff -> mmap_region -> file->f_op->mmap(file, vma)
In the function do_mmap_pgoff, it checks max_map_count:
if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;
root> sysctl -a | grep map_count
vm.max_map_count = 65530
In the function mmap_region, it checks the process virtual address space limit (RLIMIT_AS, whether it is unlimited):
int may_expand_vm(struct mm_struct *mm, unsigned long npages)
{
        unsigned long cur = mm->total_vm; /* pages */
        unsigned long lim;

        lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;

        if (cur + npages > lim)
                return 0;
        return 1;
}
root> ulimit -a | grep virtual
virtual memory (kbytes, -v) unlimited
In the Linux kernel, the init task has this rlimit set by default:
[RLIMIT_AS] = { RLIM_INFINITY, RLIM_INFINITY }, \
#ifndef RLIM_INFINITY
# define RLIM_INFINITY (~0UL)
#endif
To verify this, use the test_mem program shown below:
tmp> ./test_mem
RLIMIT_AS limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
RLIMIT_DATA limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    int ret;

    ret = getrlimit(RLIMIT_AS, &rl);  /* RLIMIT_DATA is queried the same way */
    if (ret == 0) {
        printf("RLIMIT_AS limit got successfully:\n");
        printf("soft_limit=%lld, hard_limit=%lld\n",
               (long long)rl.rlim_cur, (long long)rl.rlim_max);
    }
    return 0;
}
That means "unlimited" is reported as 0xFFFFFFFF for a 32-bit app on a 64-bit OS. Change the shell's virtual address limit and it is reflected correctly:
root> ulimit -v 1024000
tmp> ./test_mem
RLIMIT_AS limit got successfully:
soft_limit=1048576000, hard_limit=1048576000
RLIMIT_DATA limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
In mmap_region, there is also an accountable-mapping check; the call chain is:
accountable_mapping -> security_vm_enough_memory_mm -> cap_vm_enough_memory -> __vm_enough_memory -> overcommit/swap/admin and user reserve handling
Work through these three checks (max_map_count, RLIMIT_AS, and overcommit accounting) to see which limit you are hitting.
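For the overcommit accounting in particular, the current policy can be inspected with sysctl (a sketch; 0 and 50 are the usual defaults):

root> sysctl vm.overcommit_memory vm.overcommit_ratio
vm.overcommit_memory = 0
vm.overcommit_ratio = 50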

Poor memcpy performance in user space for mmap'ed physical memory in Linux

Of the 192GB RAM installed in my computer, I have the 188GB above 4GB (at hardware address 0x100000000) reserved by the Linux kernel at boot time (mem=4G memmap=188G$4G). A data acquisition kernel module accumulates data into this large area, used as a ring buffer, via DMA. A user space application mmaps this ring buffer into user space, then copies blocks from the ring buffer at the current location for processing once they are ready.
Copying these 16MB blocks from the mmap'ed area using memcpy does not perform as I expected. It appears that the performance depends on the size of the memory reserved at boot time (and later mmap'ed into user space). http://www.wurmsdobler.org/files/resmem.zip contains the source code for a kernel module which implements the mmap file operation:
module_param(resmem_hwaddr, ulong, S_IRUSR);
module_param(resmem_length, ulong, S_IRUSR);
//...
static int resmem_mmap(struct file *filp, struct vm_area_struct *vma) {
    remap_pfn_range(vma, vma->vm_start,
                    resmem_hwaddr >> PAGE_SHIFT,
                    resmem_length, vma->vm_page_prot);
    return 0;
}
and a test application, which does in essence (with the checks removed):
#define BLOCKSIZE ((size_t)16*1024*1024)
int resMemFd = ::open(RESMEM_DEV, O_RDWR | O_SYNC);
unsigned long resMemLength = 0;
::ioctl(resMemFd, RESMEM_IOC_LENGTH, &resMemLength);
void* resMemBase = ::mmap(0, resMemLength, PROT_READ | PROT_WRITE, MAP_SHARED, resMemFd, 4096);
char* source = ((char*)resMemBase) + RESMEM_HEADER_SIZE;
char* destination = new char[BLOCKSIZE];
struct timeval start, end;
gettimeofday(&start, NULL);
memcpy(destination, source, BLOCKSIZE);
gettimeofday(&end, NULL);
float time = (end.tv_sec - start.tv_sec)*1000.0f + (end.tv_usec - start.tv_usec)/1000.0f;
std::cout << "memcpy from mmap'ed to malloc'ed: " << time << "ms (" << BLOCKSIZE/1000.0f/time << "MB/s)" << std::endl;
I have carried out memcpy tests of a 16MB data block for the different sizes of reserved RAM (resmem_length) on Ubuntu 10.04.4, Linux 2.6.32, on a SuperMicro 1026GT-TF-FM109:
|       | 1GB                   | 4GB                    | 16GB                   | 64GB                  | 128GB                 | 188GB
| run 1 | 9.274ms (1809.06MB/s) | 11.503ms (1458.51MB/s) | 11.333ms (1480.39MB/s) | 9.326ms (1798.97MB/s) | 213.892ms (78.43MB/s) | 206.476ms (81.25MB/s)
| run 2 | 4.255ms (3942.94MB/s) | 4.249ms (3948.51MB/s)  | 4.257ms (3941.09MB/s)  | 4.298ms (3903.49MB/s) | 208.269ms (80.55MB/s) | 200.627ms (83.62MB/s)
My observations are:
From the first to the second run, memcpy from the mmap'ed to the malloc'ed buffer seems to benefit from the contents already being cached somewhere.
There is a significant performance degradation for reservations above 64GB, which shows up whenever memcpy is used.
I would like to understand why that is so. Perhaps somebody in the Linux kernel developers group thought: 64GB should be enough for anybody (does this ring a bell?)
Kind regards,
peter
Based on feedback from SuperMicro, the performance degradation is due to NUMA, non-uniform memory access. The SuperMicro 1026GT-TF-FM109 uses the X8DTG-DF motherboard with one Intel 5520 Tylersburg chipset at its heart, connected to two Intel Xeon E5620 CPUs, each of which has 96GB RAM attached.
If I lock my application to CPU0, I can observe different memcpy speeds depending on which memory area was reserved and consequently mmap'ed. If the reserved memory area is off-CPU, then mmap struggles for some time to do its work, and any subsequent memcpy to and from the "remote" area consumes more time (data block size = 16MB):
resmem=64G$4G (inside CPU0 realm): 3949MB/s
resmem=64G$96G (outside CPU0 realm): 82MB/s
resmem=64G$128G (outside CPU0 realm): 3948MB/s
resmem=92G$4G (inside CPU0 realm): 3966MB/s
resmem=92G$100G (outside CPU0 realm): 57MB/s
It nearly makes sense. Only the third case, 64G$128G, i.e. the uppermost 64GB, also yields good results. This somewhat contradicts the theory.
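One way to make the placement explicit is to pin both the process and its allocations to one NUMA node with numactl (a sketch; node numbers depend on your topology, and the binary name is assumed):

numactl --cpunodebind=0 --membind=0 ./test_app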
Regards,
peter
Your CPU probably doesn't have enough cache to deal with it efficiently. Either use lower memory, or get a CPU with a bigger cache.

RES != CODE + DATA in the output of the top command, why?

What 'man top' says is: RES = CODE + DATA
q: RES -- Resident size (kb)
   The non-swapped physical memory a task has used.
   RES = CODE + DATA.
r: CODE -- Code size (kb)
   The amount of physical memory devoted to executable code, also known as the 'text resident set' size or TRS.
s: DATA -- Data+Stack size (kb)
   The amount of physical memory devoted to other than executable code, also known as the 'data resident set' size or DRS.
But when I run 'top -p 4258', I get the following:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ CODE DATA COMMAND
258 root 16 0 3160 1796 1328 S 0.0 0.3 0:00.10 476 416 bash
1796 != 476+416
Why?
P.S.:
linux distribution:
linux-iguu:~ # lsb_release -a
LSB Version: core-2.0-noarch:core-3.0-noarch:core-2.0-ia32:core-3.0-ia32:desktop-3.1-ia32:desktop-3.1-noarch:graphics-2.0-ia32:graphics-2.0-noarch:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: SUSE LINUX
Description: SUSE Linux Enterprise Server 9 (i586)
Release: 9
Codename: n/a
kernel version:
linux-iguu:~ # uname -a
Linux linux-iguu 2.6.16.60-0.21-default #1 Tue May 6 12:41:02 UTC 2008 i686 i686 i386 GNU/Linux
I'll explain this with the help of an example of what happens when a program allocates and uses memory. Specifically, this program:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main(){
    int *data, size, count, i;

    printf("fyi: your ints are %zu bytes large\n", sizeof(int));
    printf("Enter number of ints to malloc: ");
    scanf("%d", &size);

    data = malloc(sizeof(int) * size);
    if (!data) {
        perror("failed to malloc");
        exit(EXIT_FAILURE);
    }

    printf("Enter number of ints to initialize: ");
    scanf("%d", &count);
    for (i = 0; i < count; i++) {
        data[i] = 1337;
    }

    printf("I'm going to hang out here until you hit <enter>");
    while (getchar() != '\n');  /* consume the newline left over by scanf */
    while (getchar() != '\n');  /* wait for the user to hit <enter> */

    exit(EXIT_SUCCESS);
}
This is a simple program that asks you how many integers to allocate, allocates them, asks how many of those integers to initialize, and then initializes them. For a run where I allocate 1250000 integers and initialize 500000 of them:
$ ./a.out
fyi: your ints are 4 bytes large
Enter number of ints to malloc: 1250000
Enter number of ints to initialize: 500000
Top reports the following information:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP CODE DATA COMMAND
<program start>
11129 xxxxxxx 16 0 3628 408 336 S 0 0.0 0:00.00 3220 4 124 a.out
<allocate 1250000 ints>
11129 xxxxxxx 16 0 8512 476 392 S 0 0.0 0:00.00 8036 4 5008 a.out
<initialize 500000 ints>
11129 xxxxxxx 15 0 8512 2432 396 S 0 0.0 0:00.00 6080 4 5008 a.out
The relevant information is:
                          DATA  CODE  RES   VIRT
before allocation:         124     4   408   3628
after 5MB allocation:     5008     4   476   8512
after 2MB initialization: 5008     4  2432   8512
After I malloc'd 5MB of data, both VIRT and DATA increased by ~5MB, but RES did not. RES did increase after I touched 2MB of the integers I allocated, but DATA and VIRT stayed the same.
VIRT is the total amount of virtual memory used by the process, including what is shared and what is over-committed. DATA is the amount of virtual memory used that isn't shared and isn't code-text. I.e., it is the virtual stack and heap of the process. RES is not virtual: it is a measurement of how much memory the process is actually using at that specific time.
So in your case, the large inequality CODE+DATA < RES is likely the shared libraries included by the process. In my example (and yours), SHR+CODE+DATA is a closer approximation to RES.
Hope this helps.
There's a lot of hand-waving and voodoo associated with top and ps. There are many articles (rants?) online about the discrepancies. E.g., this and this.
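If you want to cross-check top's arithmetic yourself, VIRT, RES, SHR, CODE and DATA are derived from /proc/<pid>/statm. Here is a minimal sketch that prints roughly the same numbers for the current process (field order as documented in proc(5); values are in pages, converted to kB via sysconf):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long size, resident, shared, text, lib, data, dirty;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f) { perror("fopen"); return 1; }
    fscanf(f, "%ld %ld %ld %ld %ld %ld %ld",
           &size, &resident, &shared, &text, &lib, &data, &dirty);
    fclose(f);

    long page_kb = sysconf(_SC_PAGESIZE) / 1024;
    printf("VIRT=%ldkB RES=%ldkB SHR=%ldkB CODE=%ldkB DATA=%ldkB\n",
           size * page_kb, resident * page_kb, shared * page_kb,
           text * page_kb, data * page_kb);
    return 0;
}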
This explanation resolved some of my questions, thanks! Meanwhile, let me try to add what I have picked up about Linux memory management; if I have misunderstood anything, please correct me!
Modern OS process concepts are based on virtual memory, and the virtual memory system includes RAM + swap. So most memory concepts related to processes refer to virtual memory, except where noted otherwise.
Any virtual address (page) allocated to a process is in one of these states:
a) allocated, but not yet mapped to any physical memory (something like COW)
b) allocated and mapped to physical memory
c) allocated and mapped to swap space
The fields output by the top command are then:
a) VIRT -- all virtual memory the process has the right to access, whether it is mapped to physical memory, mapped to swap, or not mapped at all.
b) RES -- the virtual addresses that are mapped to physical memory and still in RAM.
c) SWAP -- the virtual addresses that were mapped to physical memory and have since been swapped out to swap space.
d) SHR -- the shared memory available to the process.
e) CODE + DATA -- CODE can be in state b) or c), while DATA can be in any of the three states. (The mapped portions, states b) and c) together, are also reported in a field named "USED".)
So the calculations look roughly like:
a) VIRT = RES (VM in RAM) + SWAP (VM in swap) + unmapped VM (DATA, SHR?)
b) USED = RES + SWAP
c) SWAP = CODE (VM in swap) + DATA (VM in swap) + SHR (VM in swap?)
d) RES = CODE (VM in RAM) + DATA (VM in RAM) + SHR (VM in RAM?)
At least the DATA segment still has an unmapped portion; this can be observed in the malloc example above. That differs from the manpage of the top command, which says "DATA: The amount of physical memory devoted to other than executable code, also known as the Data Resident Set size or DRS". Thanks again.
So the sum CODE + DATA + SHR is usually larger than RES, because at least the unmapped portion of DATA is counted in "DATA", contrary to what the manpage claims.
Regards,

max thread per process in linux

I wrote a simple program to calculate the maximum number of threads that a process can have in Linux (CentOS 5). Here is the code:
#include <iostream>
#include <unistd.h>
#include <pthread.h>

using namespace std;

void* thread(void* i)
{
    sleep(100); // keep the thread alive
    return 0;
}

int main()
{
    pthread_t thrd[400];
    for (int i = 0; i < 400; i++)
    {
        int err = pthread_create(&thrd[i], NULL, thread, (void*)i);
        if (err != 0)
            cout << "thread creation failed: " << i << " error code: " << err << endl;
    }
    return 0;
}
I found that the max number of threads is only 300!? What if I need more than that?
I should mention that pthread_create returns 12 (ENOMEM) as the error code.
Thanks in advance.
There is a thread limit in Linux, and it can be modified at runtime by writing the desired limit to /proc/sys/kernel/threads-max. The default value is computed from the available system memory. In addition to that limit, there's also another one: /proc/sys/vm/max_map_count, which limits the maximum number of mapped segments, and at least recent kernels will mmap memory per thread. It should be safe to increase that limit a lot if you hit it.
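For example (a sketch; run as root, and pick values suited to your RAM):

root> echo 100000 > /proc/sys/kernel/threads-max
root> echo 262144 > /proc/sys/vm/max_map_count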
However, the limit you're hitting is lack of virtual memory in a 32-bit operating system. Install a 64-bit Linux if your hardware supports it and you'll be fine. I can easily start 30,000 threads with a stack size of 8MB. The system is a single Core 2 Duo + 8 GB of system memory (I'm using 5 GB for other stuff at the same time) running 64-bit Ubuntu with kernel 2.6.32. Note that memory overcommit (/proc/sys/vm/overcommit_memory) must be allowed, because otherwise the system would need at least 240 GB of committable memory (the sum of real memory and swap space).
If you need lots of threads and cannot use a 64-bit system, your only choice is to minimize the memory usage per thread to conserve virtual memory. Start by requesting as little stack as you can live with.
Your system limits may not be allowing you to map the stacks of all the threads you require. Look at /proc/sys/vm/max_map_count, and see this answer. I'm not 100% sure this is your problem, because most people run into problems at much larger thread counts.
I had also encountered the same problem when my number of threads crossed some threshold.
It was because of the user-level limit (the number of processes a user can run at a time) being set to 1024 in /etc/security/limits.conf.
So check your /etc/security/limits.conf and look for an entry like:
username - nproc 1024
(the second field may be "-", "soft", or "hard"). Change it to a larger value, e.g. 100k (requires sudo/root privileges), and it should work for you.
To learn more about the security policy, see http://linux.die.net/man/5/limits.conf.
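You can check the limit currently in effect for your shell with ulimit (a sketch; the value shown reflects the limits.conf entry above):

$ ulimit -u
1024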
Check the stack size per thread with ulimit; in my case, on Red Hat Linux 2.6:
ulimit -a
...
stack size (kbytes, -s) 10240
Each of your threads will get this amount of memory (10MB) assigned for its stack. With a 32-bit program and a maximum address space of 4GB, that is a maximum of only 4096MB / 10MB = 409 threads!!! Minus program code, minus heap space, that probably leads to your observed max of 300 threads.
You should be able to raise this by compiling a 64-bit application or by setting ulimit -s 8192 or even ulimit -s 4096. But whether this is advisable is another discussion...
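For example (a sketch; the value is in KiB and only affects processes started from that shell):

$ ulimit -s 4096
$ ./a.out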
You will run out of memory too unless you shrink the default thread stack size. It's 10MB on our version of Linux.
EDIT:
Error code 12 = out of memory, so I think the 1MB stack is still too big for you. Compiled for 32-bit, I can get a 100k stack to give me 30k threads. Beyond 30k threads I get error code 11, which means no more threads allowed. A 1MB stack gives me about 4k threads before error code 12. 10MB gives me 427 threads. 100MB gives me 42 threads. 1GB gives me 4... We have a 64-bit OS with 64GB of RAM. Is your OS 32-bit? When I compile for 64-bit, I can use any stack size I want and get the full limit of threads.
Also, I noticed that if I turn the profiling stuff (Tools|Profiling) on in NetBeans and run from the IDE... I can only get 400 threads. Weird. NetBeans also dies if you use up all the threads.
Here is a test app you can run:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <signal.h>
#include <sched.h>

// this prevents the compiler from reordering code over this COMPILER_BARRIER
// (it emits no machine instructions itself)
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")

sigset_t _fSigSet;
volatile int _cActive = 0;
pthread_t thrd[1000000];

void *thread(void *i)
{
    int nSig, cActive;

    cActive = __sync_fetch_and_add(&_cActive, 1);
    COMPILER_BARRIER(); // make sure the active count is incremented before sigwait

    // sigwait is a handy way to sleep a thread and wake it on command
    sigwait(&_fSigSet, &nSig); // keep the thread alive

    COMPILER_BARRIER(); // make sure the active count is decremented after sigwait
    cActive = __sync_fetch_and_add(&_cActive, -1);
    //printf("%d(%d) ", i, cActive);
    return 0;
}

int main(int argc, char **argv)
{
    pthread_attr_t attr;
    int cThreadRequest, cThreads, i, err, cActive, cbStack;

    cbStack = (argc > 1) ? atoi(argv[1]) : 0x100000;
    cThreadRequest = (argc > 2) ? atoi(argv[2]) : 30000;

    sigemptyset(&_fSigSet);
    sigaddset(&_fSigSet, SIGUSR1);
    sigaddset(&_fSigSet, SIGSEGV);

    printf("Start\n");

    pthread_attr_init(&attr);
    if ((err = pthread_attr_setstacksize(&attr, cbStack)) != 0)
        printf("pthread_attr_setstacksize failed: err: %d %s\n", err, strerror(err));

    for (i = 0; i < cThreadRequest; i++)
    {
        if ((err = pthread_create(&thrd[i], &attr, thread, (void*)i)) != 0)
        {
            printf("pthread_create failed on thread %d, error code: %d %s\n",
                   i, err, strerror(err));
            break;
        }
    }
    cThreads = i;
    printf("\n");

    // wait for threads to all be created, although we might not wait for
    // all threads to make it through sigwait
    while (1)
    {
        cActive = _cActive;
        if (cActive == cThreads)
            break;
        printf("Waiting A %d/%d,", cActive, cThreads);
        sched_yield();
    }

    // wake em all up so they exit
    for (i = 0; i < cThreads; i++)
        pthread_kill(thrd[i], SIGUSR1);

    // wait for them all to exit, although we might be able to exit before
    // the last thread returns
    while (1)
    {
        cActive = _cActive;
        if (!cActive)
            break;
        printf("Waiting B %d/%d,", cActive, cThreads);
        sched_yield();
    }

    printf("\nDone. Threads requested: %d. Threads created: %d. StackSize=%lfmb\n",
           cThreadRequest, cThreads, (double)cbStack/0x100000);
    return 0;
}
