Unexpected behavior of Linux malloc

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

#define BLOCKSIZE 1024*1024
// #define BLOCKSIZE 4096

int main (int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;
    while (1)
    {
        myblock = malloc(BLOCKSIZE);
        if (!myblock){
            puts("error"); break;
        }
        memset(myblock, 1, BLOCKSIZE);
        count++;
    }
    printf("Currently allocated %d \n", count);
    printf("end");
    exit(0);
}
When BLOCKSIZE is 1024*1024, everything is fine: malloc eventually returns NULL, the loop breaks, and the program prints its message and exits.
When BLOCKSIZE is 4096, malloc never returns NULL and the program crashes: out of memory, killed by the kernel.
Why?

It's pitch black, you are likely to be eaten by an OOM killer.
Linux has this thing called an OOM killer which wanders about killing off processes when it finds memory allocation is very heavy. The selection of which process(es) to kill is based on certain properties of each process (such as one allocating a lot of memory being a prime candidate).
It does this, partly due to its optimistic memory allocation strategy (it will generally give you address space whether or not there's enough backing memory on devices for it, something known as overcommit).
It's likely in this case that, when allocating 1M at a time, an allocation fails before the OOM killer finds you. With 4K, you're discovered before the allocation routines decide you've had enough.
You can configure the OOM killer to leave you alone if that's your desire, by writing an adjustment value of -17 to your oom_adj entry in procfs. It's not advisable unless you know what you're doing, since it puts other (perhaps more important) processes at risk. Other values from -16 to +15 adjust the likelihood that your process will be selected.
You can also turn off overcommit altogether by writing vm.overcommit_memory=2 to /etc/sysctl.conf but that again can present problems in your environment.
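For illustration, here is a minimal sketch of a process lowering its own OOM-killer priority by writing the -17 value mentioned above to the legacy /proc/self/oom_adj file (newer kernels prefer /proc/self/oom_score_adj with a -1000..+1000 range; lowering the score usually requires elevated privileges):
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Legacy interface: /proc/self/oom_adj, values -17 (never kill) to +15.
       Newer kernels prefer /proc/self/oom_score_adj (-1000 to +1000). */
    FILE *f = fopen("/proc/self/oom_adj", "w");
    if (!f) {
        perror("fopen /proc/self/oom_adj");
        return EXIT_FAILURE;
    }
    if (fprintf(f, "-17\n") < 0 || fclose(f) != 0) {
        perror("writing oom_adj");
        return EXIT_FAILURE;
    }
    /* From here on, the OOM killer should prefer other processes. */
    return EXIT_SUCCESS;
}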

Related

perf stat time elapsed for sys and user [duplicate]

$ time foo
real 0m0.003s
user 0m0.000s
sys 0m0.004s
$
What do real, user and sys mean in the output of time? Which one is meaningful when benchmarking my app?
Real, User and Sys process time statistics
One of these things is not like the other. Real refers to actual elapsed time; User and Sys refer to CPU time used only by the process.
Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).
User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.
Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process. See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.
User+Sys will tell you how much actual CPU time your process used. Note that this is across all CPUs, so if the process has multiple threads (and this process is running on a computer with more than one processor) it could potentially exceed the wall clock time reported by Real (as commonly happens for CPU-bound multithreaded programs). Note that in the output these figures include the User and Sys time of all child processes (and their descendants) as well when they could have been collected, e.g. by wait(2) or waitpid(2), although the underlying system calls return the statistics for the process and its children separately.
Origins of the statistics reported by time(1)
The statistics reported by time are gathered from various system calls. 'User' and 'Sys' come from wait(2) (POSIX) or times(2) (POSIX), depending on the particular system. 'Real' is calculated from a start and end time gathered from the gettimeofday(2) call. Depending on the version of the system, various other statistics such as the number of context switches may also be gathered by time.
On a multi-processor machine, a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time - as different threads or processes may run in parallel. Also, the time statistics reported come from different origins, so times recorded for very short running tasks may be subject to rounding errors, as the example given by the original poster shows.
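To see where those numbers come from in practice, here is a minimal sketch (my own illustration, not part of the original answer) that burns some user-mode CPU, then some kernel-mode CPU, and reads its own accounting with the POSIX times() call mentioned above:
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/times.h>
#include <unistd.h>

int main(void)
{
    /* Some user-mode work ... */
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; ++i)
        x = x * 3 + 1;

    /* ... and some kernel-mode work: repeated read(2) calls from /dev/zero. */
    char buf[4096];
    int fd = open("/dev/zero", O_RDONLY);
    if (fd == -1) { perror("open"); return EXIT_FAILURE; }
    for (int i = 0; i < 200000; ++i)
        if (read(fd, buf, sizeof buf) < 0)
            break;
    close(fd);

    /* times(2) reports user and system CPU time in clock ticks. */
    struct tms t;
    if (times(&t) == (clock_t)-1) { perror("times"); return EXIT_FAILURE; }
    long tck = sysconf(_SC_CLK_TCK);
    printf("user: %.2fs sys: %.2fs\n",
           (double)t.tms_utime / tck, (double)t.tms_stime / tck);
    return EXIT_SUCCESS;
}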
A brief primer on Kernel vs. User mode
On Unix, or any protected-memory operating system, 'Kernel' or 'Supervisor' mode refers to a privileged mode that the CPU can operate in. Certain privileged actions that could affect security or stability can only be done when the CPU is operating in this mode; these actions are not available to application code. An example of such an action might be manipulation of the MMU to gain access to the address space of another process. Normally, user-mode code cannot do this (with good reason), although it can request shared memory from the kernel, which could be read or written by more than one process. In this case, the shared memory is explicitly requested from the kernel through a secure mechanism and both processes have to explicitly attach to it in order to use it.
The privileged mode is usually referred to as 'kernel' mode because the kernel is executed by the CPU running in this mode. In order to switch to kernel mode you have to issue a specific instruction (often called a trap) that switches the CPU to running in kernel mode and runs code from a specific location held in a jump table. For security reasons, you cannot switch to kernel mode and execute arbitrary code - the traps are managed through a table of addresses that cannot be written to unless the CPU is running in supervisor mode. You trap with an explicit trap number and the address is looked up in the jump table; the kernel has a finite number of controlled entry points.
The 'system' calls in the C library (particularly those described in Section 2 of the man pages) have a user-mode component, which is what you actually call from your C program. Behind the scenes, they may issue one or more system calls to the kernel to do specific services such as I/O, but they still also have code running in user-mode. It is also quite possible to directly issue a trap to kernel mode from any user space code if desired, although you may need to write a snippet of assembly language to set up the registers correctly for the call.
More about 'sys'
There are things that your code cannot do from user mode - things like allocating memory or accessing hardware (HDD, network, etc.). These are under the supervision of the kernel, and it alone can do them. Some operations like malloc or fread/fwrite will invoke these kernel functions and that then will count as 'sys' time. Unfortunately it's not as simple as "every call to malloc will be counted in 'sys' time". The call to malloc will do some processing of its own (still counted in 'user' time) and then somewhere along the way it may call the function in kernel (counted in 'sys' time). After returning from the kernel call, there will be some more time in 'user' and then malloc will return to your code. As for when the switch happens, and how much of it is spent in kernel mode... you cannot say. It depends on the implementation of the library. Also, other seemingly innocent functions might also use malloc and the like in the background, which will again have some time in 'sys' then.
To expand on the accepted answer, I just wanted to provide another reason why real ≠ user + sys.
Keep in mind that real represents actual elapsed time, while user and sys values represent CPU execution time. As a result, on a multicore system, the user and/or sys time (as well as their sum) can actually exceed the real time. For example, on a Java app I'm running for class I get this set of values:
real 1m47.363s
user 2m41.318s
sys 0m4.013s
• real: The actual time spent in running the process from start to finish, as if it was measured by a human with a stopwatch
• user: The cumulative time spent by all the CPUs during the computation
• sys: The cumulative time spent by all the CPUs during system-related tasks such as memory allocation.
Notice that sometimes user + sys might be greater than real, as multiple processors may work in parallel.
Minimal runnable POSIX C examples
To make things more concrete, I want to exemplify a few extreme cases of time with some minimal C test programs.
All programs can be compiled and run with:
gcc -ggdb3 -o main.out -pthread -std=c99 -pedantic-errors -Wall -Wextra main.c
time ./main.out
and have been tested in Ubuntu 18.10, GCC 8.2.0, glibc 2.28, Linux kernel 4.18, ThinkPad P51 laptop, Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB).
sleep
Non-busy sleeping, as done by sleep(3) (which uses the nanosleep(2) syscall under the hood), only counts towards real, not user or sys.
For example, a program that sleeps for a second:
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    sleep(1);
    return EXIT_SUCCESS;
}
GitHub upstream.
outputs something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
The same holds for programs blocked on IO becoming available.
For example, the following program waits for the user to enter a character and press enter:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("%c\n", getchar());
    return EXIT_SUCCESS;
}
GitHub upstream.
And if you wait for about one second before pressing enter, it outputs, just like the sleep example, something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
For this reason time can help you distinguish between CPU and IO bound programs: What do the terms "CPU bound" and "I/O bound" mean?
Multiple threads
The following example does niters iterations of useless purely CPU-bound work on nthreads threads:
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

uint64_t niters;

void* my_thread(void *arg) {
    uint64_t *argument, i, result;
    argument = (uint64_t *)arg;
    result = *argument;
    for (i = 0; i < niters; ++i) {
        result = (result * result) - (3 * result) + 1;
    }
    *argument = result;
    return NULL;
}

int main(int argc, char **argv) {
    size_t nthreads;
    pthread_t *threads;
    uint64_t rc, i, *thread_args;

    /* CLI args. */
    if (argc > 1) {
        niters = strtoll(argv[1], NULL, 0);
    } else {
        niters = 1000000000;
    }
    if (argc > 2) {
        nthreads = strtoll(argv[2], NULL, 0);
    } else {
        nthreads = 1;
    }
    threads = malloc(nthreads * sizeof(*threads));
    thread_args = malloc(nthreads * sizeof(*thread_args));

    /* Create all threads */
    for (i = 0; i < nthreads; ++i) {
        thread_args[i] = i;
        rc = pthread_create(
            &threads[i],
            NULL,
            my_thread,
            (void*)&thread_args[i]
        );
        assert(rc == 0);
    }

    /* Wait for all threads to complete */
    for (i = 0; i < nthreads; ++i) {
        rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
        printf("%" PRIu64 " %" PRIu64 "\n", i, thread_args[i]);
    }

    free(threads);
    free(thread_args);
    return EXIT_SUCCESS;
}
GitHub upstream + plot code.
Then we plot wall, user and sys as a function of the number of threads, for a fixed 10^10 iterations, on my 8-hyperthread CPU (plot image: https://i.stack.imgur.com/qAfEe.png).
Plot data.
From the graph, we see that:
for a CPU intensive single core application, wall and user are about the same
for 2 cores, user is about 2x wall, which means that user time is counted across all threads:
user basically doubled, while wall stayed the same.
this continues up to 8 threads, which matches the number of hyperthreads on my computer.
After 8, wall starts to increase as well, because we don't have any extra CPUs to put more work in a given amount of time!
The ratio plateaus at this point.
Note that this graph is only so clear and simple because the work is purely CPU-bound: if it were memory bound, then we would get a fall in performance much earlier with less cores because the memory accesses would be a bottleneck as shown at What do the terms "CPU bound" and "I/O bound" mean?
Quickly checking that wall < user is a simple way to determine that a program is multithreaded, and the closer the user/wall ratio is to the number of cores, the more effective the parallelization is, e.g.:
multithreaded linkers: Can gcc use multiple cores when linking?
C++ parallel sort: Are C++17 Parallel Algorithms implemented already?
Sys heavy work with sendfile
The heaviest sys workload I could come up with was to use sendfile, which does the file copy in kernel space: Copy a file in a sane, safe and efficient way
So I imagined that this in-kernel memcpy would be a CPU-intensive operation.
First I initialize a large 10GiB random file with:
dd if=/dev/urandom of=sendfile.in.tmp bs=1K count=10M
Then run the code:
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv) {
    char *source_path, *dest_path;
    int source, dest;
    struct stat stat_source;

    if (argc > 1) {
        source_path = argv[1];
    } else {
        source_path = "sendfile.in.tmp";
    }
    if (argc > 2) {
        dest_path = argv[2];
    } else {
        dest_path = "sendfile.out.tmp";
    }

    source = open(source_path, O_RDONLY);
    assert(source != -1);
    dest = open(dest_path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    assert(dest != -1);
    assert(fstat(source, &stat_source) != -1);
    assert(sendfile(dest, source, 0, stat_source.st_size) != -1);
    assert(close(source) != -1);
    assert(close(dest) != -1);
    return EXIT_SUCCESS;
}
GitHub upstream.
which, as expected, gives mostly system time:
real 0m2.175s
user 0m0.001s
sys 0m1.476s
I was also curious to see if time would distinguish between syscalls of different processes, so I tried:
time ./sendfile.out sendfile.in1.tmp sendfile.out1.tmp &
time ./sendfile.out sendfile.in2.tmp sendfile.out2.tmp &
And the result was:
real 0m3.651s
user 0m0.000s
sys 0m1.516s
real 0m4.948s
user 0m0.000s
sys 0m1.562s
The sys time is about the same for each as for a single process, but the wall time is larger, likely because the processes are competing for disk read access.
So it does seem that time accounts for which process started a given piece of kernel work.
Bash source code
When you do just time <cmd> on Ubuntu, it uses the Bash keyword, as can be seen from:
type time
which outputs:
time is a shell keyword
So we grep the Bash source code for the output string:
git grep '"user\b'
which leads us to execute_cmd.c function time_command, which uses:
gettimeofday() and getrusage() if both are available
times() otherwise
all of which are POSIX functions backed by Linux system calls. A rough sketch of this approach follows.
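As an illustration of that approach (my own sketch using the same gettimeofday()/getrusage() calls, not Bash's actual code), a minimal 'time'-like wrapper could look like this:
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return EXIT_FAILURE;
    }
    struct timeval start, end;
    gettimeofday(&start, NULL);

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return EXIT_FAILURE; }
    if (pid == 0) {
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }
    int status;
    waitpid(pid, &status, 0);
    gettimeofday(&end, NULL);

    /* CPU time of the (reaped) child and its waited-for descendants. */
    struct rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru);

    double real = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    printf("real %.3fs\nuser %.3fs\nsys  %.3fs\n", real, user, sys);
    return EXIT_SUCCESS;
}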
GNU time source code
If we call it as:
/usr/bin/time
then it uses the GNU time implementation (a separate GNU package, not part of Coreutils).
This one is a bit more complex, but the relevant source seems to be at resuse.c and it does:
a non-POSIX BSD wait3 call if that is available
times and gettimeofday otherwise
Real shows the total turnaround time of a process,
while User shows the execution time of user-mode instructions,
and Sys is the time spent executing system calls.
Real time also includes waiting time (waiting for I/O etc.).
In very simple terms, I like to think about it like this:
real is the actual amount of time it took to run the command (as if you had timed it with a stopwatch)
user and sys are how much 'work' the CPU had to do to execute the command. This 'work' is expressed in units of time.
Generally speaking:
user is how much work the CPU did to run the command's code
sys is how much work the CPU had to do to handle 'system overhead' type tasks (such as allocating memory, file I/O, etc.) in order to support the running command
Since these last two times are counting 'work' done, they don't include time a thread might have spent waiting (such as waiting on another process or for disk I/O to finish).
real, however, is a measure of actual runtime and not 'work', so it does include any time spent waiting (which is why sometimes real > usr+sys).
And finally, sometimes the reverse is true (usr+sys > real) for multi-threaded applications. This also arises because we are comparing 'work-time' to actual time. For example, if 3 processors each run continuously for 10 minutes to execute a command, you will get real = 10m but usr = 30m.
I want to mention another scenario where the real time is much, much bigger than user + sys. I've created a simple server which responds after a long time:
real 4.784
user 0.01s
sys 0.01s
The point is that in this scenario the process spends its time waiting for the response, which counts neither as user nor as sys time.
Something similar happens when you run the find command: in that case, the time is spent mostly on requesting and waiting for responses from the SSD.
I should mention that, at least on my AMD Ryzen CPU, user is always larger than real for multi-threaded programs (or single-threaded programs compiled with -O3).
eg.
real 0m5.815s
user 0m8.213s
sys 0m0.473s

How to easily diagnose problems due to access to unmapped mmap regions?

I've recently found a segfault that neither Valgrind, nor Address Sanitizer could give any useful info about. It happened because the faulty program munmapped a file and then tried to access the formerly mmapped region.
The following example demonstrates the problem:
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main()
{
    const int fd=open("/tmp/test.txt", O_RDWR);
    if(fd<0) abort();
    const char buf[]="Hello";
    if(write(fd, buf, sizeof buf) != sizeof buf) abort();
    char*const volatile ptr=mmap(NULL,sizeof buf,PROT_READ,MAP_SHARED,fd,0);
    if(ptr==MAP_FAILED) abort();   /* mmap reports failure with MAP_FAILED, not NULL */
    printf("1%c\n", ptr[0]);
    if(close(fd)<0) abort();
    printf("2%c\n", ptr[0]);
    if(munmap(ptr, sizeof buf)<0) abort();
    printf("3%c\n", ptr[0]); // Cause a segfault
}
With Address Sanitizer I get the following output:
1H
2H
AddressSanitizer:DEADLYSIGNAL
=================================================================
==8503==ERROR: AddressSanitizer: SEGV on unknown address 0x7fe7d0836000 (pc 0x55bda425c055 bp 0x7ffda5887210 sp 0x7ffda5887140 T0)
==8503==The signal is caused by a READ memory access.
#0 0x55bda425c054 in main /tmp/test/test1.c:22
#1 0x7fe7cf64fb96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
#2 0x55bda425bcd9 in _start (/tmp/test/test1+0xcd9)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/test/test1.c:22 in main
And here's the relevant part of output with Valgrind:
1H
2H
==8863== Invalid read of size 1
==8863== at 0x108940: main (test1.c:22)
==8863== Address 0x4029000 is not stack'd, malloc'd or (recently) free'd
==8863==
==8863==
==8863== Process terminating with default action of signal 11 (SIGSEGV)
==8863== Access not within mapped region at address 0x4029000
==8863== at 0x108940: main (test1.c:22)
Compare this with the case when a malloced region is accessed after free. Test program:
#include <stdio.h>
#include <string.h>
#include <malloc.h>

int main()
{
    const char buf[]="Hello";
    char*const volatile ptr=malloc(sizeof buf);
    if(!ptr)
    {
        fprintf(stderr, "malloc failed");
        return 1;
    }
    memcpy(ptr,buf,sizeof buf);
    printf("1%c\n", ptr[0]);
    free(ptr);
    printf("2%c\n", ptr[0]); // Cause a segfault
}
Output with Address Sanitizer:
1H
=================================================================
==7057==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000000010 at pc 0x55b8f96b5003 bp 0x7ffff5179b70 sp 0x7ffff5179b60
READ of size 1 at 0x602000000010 thread T0
#0 0x55b8f96b5002 in main /tmp/test/test1.c:17
#1 0x7f4298fd8b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
#2 0x55b8f96b4c49 in _start (/tmp/test/test1+0xc49)
0x602000000010 is located 0 bytes inside of 6-byte region [0x602000000010,0x602000000016)
freed by thread T0 here:
#0 0x7f42994b3b4f in free (/usr/lib/x86_64-linux-gnu/libasan.so.5+0x10bb4f)
#1 0x55b8f96b4fca in main /tmp/test/test1.c:16
#2 0x7f4298fd8b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
previously allocated by thread T0 here:
#0 0x7f42994b3f48 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0x10bf48)
#1 0x55b8f96b4e25 in main /tmp/test/test1.c:8
#2 0x7f4298fd8b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
Output with Valgrind:
1H
==6888== Invalid read of size 1
==6888== at 0x108845: main (test1.c:17)
==6888== Address 0x522d040 is 0 bytes inside a block of size 6 free'd
==6888== at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==6888== by 0x108840: main (test1.c:16)
==6888== Block was alloc'd at
==6888== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==6888== by 0x1087D2: main (test1.c:8)
My question: is there any way to make Valgrind or a Sanitizer, or some other Linux-compatible tool output useful diagnostic about the context of access to munmapped region (like where it had been mmapped and munmapped), similar to the above given output for the access-after-free?
Valgrind (and I guess ASan does the same) can output a 'use after free' error because it maintains a list of 'recently freed' blocks.
Such blocks are logically freed, but they are not returned (directly) to the pool of usable memory for further malloc calls; instead they are marked unaddressable.
The size of this 'recently freed' block list can be tuned using
--freelist-vol=<number> volume of freed blocks queue [20000000]
--freelist-big-blocks=<number> releases first blocks with size>= [1000000]
It would be possible to use a similar technique for munmap-ed memory:
rather than physically unmap it, it could be kept in a list of recently
unmapped blocks, be logically unmapped, but marked unaddressable.
Note that you could simulate that in your program by having a function my_unmap that does not really do the unmap, but rather uses Valgrind's client requests to mark this memory as unaddressable.
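A sketch of that simulation (assuming the Valgrind client-request header valgrind/memcheck.h is installed; VALGRIND_MAKE_MEM_NOACCESS expands to a no-op when the program is not running under Valgrind, and my_unmap is the hypothetical wrapper named above):
#include <sys/mman.h>
#include <valgrind/memcheck.h>

/* Hypothetical wrapper: instead of really unmapping, keep the mapping
   around but tell memcheck the range is unaddressable, so later accesses
   are reported with full context instead of a raw SIGSEGV. */
int my_unmap(void *addr, size_t length)
{
    VALGRIND_MAKE_MEM_NOACCESS(addr, length);
    /* Optionally remember (addr, length) so the region can be truly
       munmap()ed later, e.g. at program exit. */
    return 0;
}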
is there any way to make Valgrind or a Sanitizer, or some other Linux-compatible tool output useful diagnostic
I know of no such tool, although it would be relatively easy to make one.
Your problem is sufficiently different from heap corruption problems which require specialized tools, and probably doesn't need such a tool.
The major difference is the "action at a distance" aspect: with heap corruption, the code in which the problem manifests is often very far removed from the code in which the problem originates. Hence the need to track memory state, to have red zones, etc.
In your case, the access to munmapped memory results in an immediate crash. So if you just log every mmap and munmap that your program performs, you'll only have to look back for the last munmap that "covered" the address on which you crashed.
In addition, most programs perform relatively few mmap and munmap operations. If your program performs so many that you can't log them all, it's likely that it shouldn't actually do that (mmap and munmap are relatively very expensive system calls).
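One simple way to get such a log without touching the program is an LD_PRELOAD interposer. This is only a rough sketch of my own (assuming glibc and dlsym(RTLD_NEXT, ...); compile with -shared -fPIC -ldl and run the target with LD_PRELOAD pointing at the resulting .so):
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/mman.h>

/* Toy interposer that logs every mmap/munmap, so a crash address can
   later be matched against the munmap that covered it. A robust version
   would cache the dlsym results and avoid stdio inside the hooks. */

typedef void *(*mmap_fn)(void *, size_t, int, int, int, off_t);
typedef int (*munmap_fn)(void *, size_t);

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
{
    mmap_fn real_mmap = (mmap_fn)dlsym(RTLD_NEXT, "mmap");
    void *p = real_mmap(addr, length, prot, flags, fd, offset);
    fprintf(stderr, "mmap   %p +%zu -> %p\n", addr, length, p);
    return p;
}

int munmap(void *addr, size_t length)
{
    munmap_fn real_munmap = (munmap_fn)dlsym(RTLD_NEXT, "munmap");
    fprintf(stderr, "munmap %p +%zu\n", addr, length);
    return real_munmap(addr, length);
}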

Program with a memory leak on purpose?

For a presentation about memory leaks, I'd like to show a program where a memory leak is easy to produce and has visible effects. How could I do this?
I don't want to use any language in particular, though C, Java or Python would be preferred.
Thank you.
This produces a memory leak: the loop keeps allocating one byte at a time and never frees anything.
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *c;
    for(;;)
        c = malloc(sizeof(char));   /* never freed: each iteration leaks one byte */
    return 0;
}
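For something more visible in a presentation, here is a sketch of my own (not from the original answer) that leaks 10 MiB per second and touches every page, so the growth shows up clearly in the process's RSS in top(1) or in free:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (10 * 1024 * 1024)   /* 10 MiB per iteration */

int main(void)
{
    size_t total = 0;
    for (;;) {
        char *p = malloc(CHUNK);
        if (!p) {
            puts("malloc failed");
            break;
        }
        memset(p, 0xAA, CHUNK);   /* touch the pages so RSS really grows */
        total += CHUNK;
        printf("leaked %zu MiB so far\n", total / (1024 * 1024));
        sleep(1);                 /* slow enough to watch in top(1) */
        /* p is deliberately never freed */
    }
    return 0;
}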

High CPU usage - simple packet receiver on Linux

I'm writing a simple application under Linux that gathers all packets from the network. I'm using a blocking receive by calling recvfrom(). When I generate a big network load with hping3 (~100k raw frames per second, 130 bytes each), top shows high CPU usage for my process, about 37-38%. That is a lot for me. When I decrease the number of packets, usage is lower, for example top shows 3% at 4k frames per second.
I've checked DC++: when it downloads at ~10 MB/s its process doesn't use 38% of the CPU but 5%. Is there any programmatic way in C to reduce CPU usage and still receive a lot of frames?
My CPU:
Intel i5-2400 CPU @ 3.10GHz
My system:
Ubuntu 11.04 kernel 3.6.6 with PREEMPT-RT patch
And here is my code:
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <linux/if_arp.h>
#include <arpa/inet.h>

/* Socket descriptor. */
int mainSocket;

/* Buffer for frame. */
unsigned char* buffer;

int main(int argc, char* argv[])
{
    /** Create socket. **/
    mainSocket = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (mainSocket == -1) {
        printf("Error: cannot create socket!\n");
    }

    /** Create buffer for frame **/
    buffer = malloc(ETH_FRAME_LEN);

    printf("Listening...");
    while(1) {
        // Length of received packet
        int length = recvfrom(mainSocket, buffer, ETH_FRAME_LEN, 0, NULL, NULL);
        if(length > 0)
        {
            // ... do something ...
        }
    }
}
I don't know if this will help, but looking on Google I see that:
Raw socket, Packet socket and Zero copy networking in Linux, as well as http://lxr.linux.no/linux+v2.6.36/Documentation/networking/packet_mmap.txt, talk about using PACKET_MMAP and mmap() to improve the performance of raw sockets (see the sketch after this list)
The Overview of Packet Reception suggests setting your process's affinity to match the CPU to which you bind the NIC using RPS.
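As a rough, untested sketch of the PACKET_MMAP idea from the first link (TPACKET_V1 layout, error handling mostly omitted, ring parameters chosen arbitrarily as examples; requires the same privileges as the original raw socket), the receive path polls a shared ring instead of making one recvfrom() call per frame:
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd == -1) { perror("socket"); return 1; }

    /* Example ring geometry: 64 blocks of 4 KiB, two 2 KiB frame slots each. */
    struct tpacket_req req;
    memset(&req, 0, sizeof req);
    req.tp_block_size = 4096;
    req.tp_frame_size = 2048;
    req.tp_block_nr   = 64;
    req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof req) == -1) {
        perror("setsockopt PACKET_RX_RING");
        return 1;
    }

    size_t ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    unsigned char *ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned int frame = 0;
    for (;;) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(ring + (size_t)frame * req.tp_frame_size);
        if (!(hdr->tp_status & TP_STATUS_USER)) {
            /* Nothing ready: block in poll() instead of spinning. */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);
            continue;
        }
        unsigned char *data = (unsigned char *)hdr + hdr->tp_mac;
        (void)data; /* ... process hdr->tp_snaplen bytes here ... */
        hdr->tp_status = TP_STATUS_KERNEL;   /* hand the slot back to the kernel */
        frame = (frame + 1) % req.tp_frame_nr;
    }
}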
Does DC++ do a promiscuous receive? I wouldn't have guessed so. So instead of comparing your performance to DC++, perhaps you should compare your performance to the performance of a utility like libpcap.
It may be because the TCP/IP processing for DC++ is handled by the kernel's network stack (and partly offloaded to the NIC), so your processor does little per-packet work for it. In your case you fetch every raw frame yourself, with one system call per frame in a tight loop, so you end up doing a lot more processing per packet and the CPU usage spikes.

Linux shared memory allocation on x86_64

I have 64-bit RHEL Linux: Linux ipms-sol1 2.6.32-71.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
RAM size = ~38GB
I changed the default shared memory limits as follows in /etc/sysctl.conf and reloaded them with sysctl -p:
kernel.shmmni=81474836
kernel.shmmax=32212254720
kernel.shmall=7864320
Just as an experiment I raised shmmax to 32GB and tried allocating 10GB in code using shmget() as given below, but it fails to get 10GB of shared memory in a single shot. When I reduce my request to 8GB it succeeds. Any clue as to where I am possibly going wrong?
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

#define SHMSZ 10737418240

main()
{
    char c;
    int shmspaceid;
    key_t key;
    char *shm, *s;
    struct shmid_ds shmid;

    key = 5678;
    fprintf(stderr,"Changed code\n");

    if ((shmspaceid = shmget(key, SHMSZ, IPC_CREAT | 0666)) < 0) {
        fprintf(stderr,"ERROR memory allocation failed\n");
        return 1;
    }
    shmctl(shmspaceid, IPC_RMID, &shmid);
    return 0;
}
I'm not sure that this solution is applicable to shared memory as well, but I know this phenomenon from normal malloc() calls.
It's pretty usual that you cannot allocate very large blocks of memory as you try it here. What the function call means is "Allocate me a block of contiguous memory of 10737418240 bytes". Often, even if the total system memory could theoretically satisfy this need, the implied "block of contiguous memory" forces the limit of allocatable memory to be much lower.
The in-memory layout of the program and the number of other programs loaded can both contribute to blocking certain areas of memory, leaving no 10 contiguous gigabytes that can be allocated.
I have often found that a reboot changes that (as programs get loaded to different positions). You can probe your maximum allocatable block size with something like this:
size_t i = 1024;   /* size_t so the size can grow past 2 GiB */
int error = 0;
while(!error) {
    char *a = malloc(i);
    error = (a == NULL);
    if(!error)
        printf("Successfully allocated %zu.\n", i);
    i *= 2;
}
Hope this helps or is applicable here. I found this out while checking why I could not allocate close to maximum system memory to a JVM.
Shot in the dark: you don't have enough swap space. Shared memory, by default, requires reserving space in swap. You can disable this behavior using SHM_NORESERVE (see http://linux.die.net/man/2/shmget; a small sketch follows the quoted excerpt below):
SHM_NORESERVE (since Linux 2.6.15) This flag serves the same purpose
as the mmap(2) MAP_NORESERVE flag. Do not reserve swap space for this
segment. When swap space is reserved, one has the guarantee that it is
possible to modify the segment. When swap space is not reserved one
might get SIGSEGV upon a write if no physical memory is available. See
also the discussion of the file /proc/sys/vm/overcommit_memory in
proc(5).
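A sketch of the suggested change, using the key and size from the question (SHM_NORESERVE may need _GNU_SOURCE or a manual define depending on the libc headers; the fallback value below is the kernel's):
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_NORESERVE
#define SHM_NORESERVE 010000   /* from <linux/shm.h>: skip swap reservation */
#endif

#define SHMSZ 10737418240ULL   /* 10 GiB */

int main(void)
{
    int id = shmget(5678, SHMSZ, IPC_CREAT | SHM_NORESERVE | 0666);
    if (id < 0) {
        fprintf(stderr, "shmget failed (%d: %s)\n", errno, strerror(errno));
        return 1;
    }
    printf("got segment id %d without reserving swap\n", id);
    shmctl(id, IPC_RMID, NULL);
    return 0;
}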
I was just looking at this and I recommend printing out the exact errno value and description for the problem, rather than just noting that it failed. For example:
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

//#define SHMSZ 10737418240
#define SHMSZ 8589934592

int main()
{
    int shmspaceid;
    key_t key = 5678;
    struct shmid_ds shmid;

    if ((shmspaceid = shmget(key, SHMSZ, IPC_CREAT | 0666)) < 0) {
        fprintf(stderr,"ERROR with shmget (%d: %s)\n", (int)(errno), strerror(errno));
        return 1;
    }
    shmctl(shmspaceid, IPC_RMID, &shmid);
    return 0;
}
I tried to reproduce your problem with an 8 GB block and 8 GB shmmax and shmall on my 16 GB system, but I could not; it worked fine. I do recommend using ipcs -m to look for other shared blocks that might prevent your 10 GB allocation from being honored. And definitely look closely at the exact error code that shmget() returns through errno.
