Linux shared memory allocation on x86_64

I have 64-bit RHEL Linux: Linux ipms-sol1 2.6.32-71.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
RAM size = ~38 GB
I changed the default shared memory limits as follows in /etc/sysctl.conf and loaded the changed file with sysctl -p:
kernel.shmmni=81474836
kernel.shmmax=32212254720
kernel.shmall=7864320
Just as an experiment, I changed shmmax to 32 GB and tried allocating 10 GB in code using shmget(), as shown below. It fails to get 10 GB of shared memory in a single shot, but when I reduce my request to 8 GB it succeeds. Any clue as to where I am possibly going wrong?
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#define SHMSZ 10737418240 /* 10 GB */

int main(void)
{
    int shmspaceid;
    key_t key = 5678;
    struct shmid_ds shmid;

    fprintf(stderr, "Changed code\n");
    if ((shmspaceid = shmget(key, SHMSZ, IPC_CREAT | 0666)) < 0) {
        fprintf(stderr, "ERROR memory allocation failed\n");
        return 1;
    }
    shmctl(shmspaceid, IPC_RMID, &shmid);
    return 0;
}
Regards
Himanshu

I'm not sure that this solution is applicable to shared memory as well, but I know this phenomenon from normal malloc() calls.
It's pretty common that you cannot allocate very large blocks of memory like this. What the function call means is "allocate me a contiguous block of 10737418240 bytes". Often, even if the total system memory could theoretically satisfy the request, the implied "contiguous block" pushes the limit of allocatable memory much lower.
The in-memory layout of the program and the number of libraries loaded can both fragment the address space, leaving no 10 contiguous gigabytes allocatable.
I have often found that a reboot changes this (as programs get loaded at different positions). You can probe your maximum allocatable block size with something like this:
size_t i = 1024;            /* size_t avoids the int overflow past 2 GB */
int error = 0;
while (!error) {
    char *a = malloc(i);
    error = (a == NULL);    /* NULL, not null */
    if (!error) {
        printf("Successfully allocated %zu.\n", i);
        free(a);            /* release before probing a larger block */
    }
    i *= 2;
}
Hope this helps or is applicable here. I found this out while checking why I could not allocate close to maximum system memory to a JVM.

Shot in the dark: you don't have enough swap space. Shared memory, by default, requires reserving space in swap. You can disable this behavior using SHM_NORESERVE:
http://linux.die.net/man/2/shmget
SHM_NORESERVE (since Linux 2.6.15): This flag serves the same purpose as the mmap(2) MAP_NORESERVE flag. Do not reserve swap space for this segment. When swap space is reserved, one has the guarantee that it is possible to modify the segment. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5).
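For reference, a minimal sketch of the same request with SHM_NORESERVE set (the 5678 key and 10 GB size are taken from the question; the fallback #define is only a guard in case the libc headers don't expose the flag):
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

#ifndef SHM_NORESERVE
#define SHM_NORESERVE 010000 /* value from <linux/shm.h> */
#endif

int main(void)
{
    /* Request the segment without reserving swap space up front. */
    int id = shmget(5678, 10737418240ULL, IPC_CREAT | SHM_NORESERVE | 0666);
    if (id < 0) {
        fprintf(stderr, "shmget: %s\n", strerror(errno));
        return 1;
    }
    printf("Got segment id %d\n", id);
    shmctl(id, IPC_RMID, NULL); /* clean up the experiment */
    return 0;
}
Note that with SHM_NORESERVE the failure merely moves from allocation time to first touch: a write can deliver SIGSEGV (or wake the OOM killer) if no memory is available then.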

I was just looking at this and I recommend printing out the exact errno value and description for the problem, rather than just noting that it failed. For example:
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
//#define SHMSZ 10737418240
#define SHMSZ 8589934592

int main(void)
{
    int shmspaceid;
    key_t key = 5678;
    struct shmid_ds shmid;

    if ((shmspaceid = shmget(key, SHMSZ, IPC_CREAT | 0666)) < 0) {
        fprintf(stderr, "ERROR with shmget (%d: %s)\n", (int)errno, strerror(errno));
        return 1;
    }
    shmctl(shmspaceid, IPC_RMID, &shmid);
    return 0;
}
I tried to reproduce your problem with an 8 GB block and 8 GB shmmax and shmall on my 16 GB system, but I could not. It worked fine. I do recommend using ipcs -m to look for other shared blocks that might prevent your 10 GB allocation from being honored. And definitely look closely at the exact error code that shmget() returns through errno.

Related

SIDT instruction returns wrong base address in a Linux user-space process

I made the following x86-64 program to view where the base address of the Interrupt Descriptor Table starts:
#include <stdio.h>
#include <inttypes.h>
typedef struct __attribute__((packed)) {
    uint16_t limit;
    uint64_t base;
} idt_data_t;

static inline void store_idt(idt_data_t *idt_data)
{
    asm volatile("sidt %0" : "=m" (*idt_data));
}

int main(void)
{
    idt_data_t idt_data;
    store_idt(&idt_data);
    printf("IDT Limit : 0x%X\n", idt_data.limit);
    printf("IDT Base  : 0x%lX\n", idt_data.base);
    return 0;
}
And it prints the following:
IDT Limit : 0xFFF
IDT Base : 0xFFFFFE0000000000
The base address doesn't seem to be correct because the address should always be a physical address, am I right?
Also, I'm not sure but the limit seems to be too high. What am I doing wrong?
It's a linear address, not necessarily a physical address. In other words, it's subject to the page table like most other addresses. It has to be in pages that are never paged to disk (the CPU could not handle a page fault while fetching the IDT), but its virtual address can differ from its physical address.
On x86-64, each entry of the IDT is 16 bytes long. There are 256 interrupt vectors. 256 * 16 = 4096 = 0x1000. The IDTR limit is a "less than or equal" check, so it's typical to use 0xFFF.
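To make that arithmetic concrete, here is a sketch of one 16-byte gate entry (the field names are my own; the layout follows the Intel SDM):
#include <stdint.h>

/* One x86-64 IDT gate descriptor: 256 of these 16-byte entries
 * make the 0x1000-byte table that the 0xFFF limit describes. */
typedef struct __attribute__((packed)) {
    uint16_t offset_low;   /* handler address bits 0..15 */
    uint16_t selector;     /* code segment selector */
    uint8_t  ist;          /* interrupt stack table index */
    uint8_t  type_attr;    /* gate type, DPL, present bit */
    uint16_t offset_mid;   /* handler address bits 16..31 */
    uint32_t offset_high;  /* handler address bits 32..63 */
    uint32_t reserved;
} idt_gate_t;

_Static_assert(sizeof(idt_gate_t) == 16, "gate entries are 16 bytes");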
SIDT is a privileged instruction on newer CPUs if the OS enables a certain feature (CR4.UMIP, User-Mode Instruction Prevention), so it's advisable not to use it in user mode unless you're writing an exploit PoC or something. It's also possible that an OS lies about the answer rather than raising an exception, but I don't know of one that does.

How to easily diagnose problems due to access to unmapped mmap regions?

I've recently found a segfault that neither Valgrind, nor Address Sanitizer could give any useful info about. It happened because the faulty program munmapped a file and then tried to access the formerly mmapped region.
The following example demonstrates the problem:
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
int main(void)
{
    const int fd = open("/tmp/test.txt", O_RDWR);
    if (fd < 0) abort();
    const char buf[] = "Hello";
    if (write(fd, buf, sizeof buf) != sizeof buf) abort();
    char *const volatile ptr = mmap(NULL, sizeof buf, PROT_READ, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) abort(); /* mmap reports failure as MAP_FAILED, not NULL */
    printf("1%c\n", ptr[0]);
    if (close(fd) < 0) abort();
    printf("2%c\n", ptr[0]);
    if (munmap(ptr, sizeof buf) < 0) abort();
    printf("3%c\n", ptr[0]); // Cause a segfault
}
With Address Sanitizer I get the following output:
1H
2H
AddressSanitizer:DEADLYSIGNAL
=================================================================
==8503==ERROR: AddressSanitizer: SEGV on unknown address 0x7fe7d0836000 (pc 0x55bda425c055 bp 0x7ffda5887210 sp 0x7ffda5887140 T0)
==8503==The signal is caused by a READ memory access.
#0 0x55bda425c054 in main /tmp/test/test1.c:22
#1 0x7fe7cf64fb96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
#2 0x55bda425bcd9 in _start (/tmp/test/test1+0xcd9)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/test/test1.c:22 in main
And here's the relevant part of output with Valgrind:
1H
2H
==8863== Invalid read of size 1
==8863== at 0x108940: main (test1.c:22)
==8863== Address 0x4029000 is not stack'd, malloc'd or (recently) free'd
==8863==
==8863==
==8863== Process terminating with default action of signal 11 (SIGSEGV)
==8863== Access not within mapped region at address 0x4029000
==8863== at 0x108940: main (test1.c:22)
Compare this with the case when a malloced region is accessed after free. Test program:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    const char buf[] = "Hello";
    char *const volatile ptr = malloc(sizeof buf);
    if (!ptr) {
        fprintf(stderr, "malloc failed");
        return 1;
    }
    memcpy(ptr, buf, sizeof buf);
    printf("1%c\n", ptr[0]);
    free(ptr);
    printf("2%c\n", ptr[0]); // Cause a segfault
}
Output with Address Sanitizer:
1H
=================================================================
==7057==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000000010 at pc 0x55b8f96b5003 bp 0x7ffff5179b70 sp 0x7ffff5179b60
READ of size 1 at 0x602000000010 thread T0
#0 0x55b8f96b5002 in main /tmp/test/test1.c:17
#1 0x7f4298fd8b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
#2 0x55b8f96b4c49 in _start (/tmp/test/test1+0xc49)
0x602000000010 is located 0 bytes inside of 6-byte region [0x602000000010,0x602000000016)
freed by thread T0 here:
#0 0x7f42994b3b4f in free (/usr/lib/x86_64-linux-gnu/libasan.so.5+0x10bb4f)
#1 0x55b8f96b4fca in main /tmp/test/test1.c:16
#2 0x7f4298fd8b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
previously allocated by thread T0 here:
#0 0x7f42994b3f48 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0x10bf48)
#1 0x55b8f96b4e25 in main /tmp/test/test1.c:8
#2 0x7f4298fd8b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
Output with Valgrind:
1H
==6888== Invalid read of size 1
==6888== at 0x108845: main (test1.c:17)
==6888== Address 0x522d040 is 0 bytes inside a block of size 6 free'd
==6888== at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==6888== by 0x108840: main (test1.c:16)
==6888== Block was alloc'd at
==6888== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==6888== by 0x1087D2: main (test1.c:8)
My question: is there any way to make Valgrind or a Sanitizer, or some other Linux-compatible tool output useful diagnostic about the context of access to munmapped region (like where it had been mmapped and munmapped), similar to the above given output for the access-after-free?
Valgrind (and I guess ASan does the same) can output a 'use after free' error because it maintains a list of 'recently freed' blocks. Such blocks are logically freed, but they are not returned (directly) to the usable memory for further malloc calls; instead they are marked unaddressable. The size of this 'recently freed' block list can be tuned using:
--freelist-vol=<number> volume of freed blocks queue [20000000]
--freelist-big-blocks=<number> releases first blocks with size>= [1000000]
It would be possible to use a similar technique for munmap-ed memory: rather than physically unmapping it, it could be kept in a list of recently unmapped blocks, be logically unmapped, but marked unaddressable.
Note that you can simulate this in your program by having a function my_unmap that does not really do the unmap, but instead uses Valgrind's client requests to mark the memory as unaddressable.
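A minimal sketch of such a my_unmap, assuming you build with the Valgrind headers available (VALGRIND_MAKE_MEM_NOACCESS and RUNNING_ON_VALGRIND are real client requests from valgrind/memcheck.h; the wrapper itself is hypothetical):
#include <sys/mman.h>
#include <valgrind/memcheck.h>

/* Under Valgrind, keep the pages mapped but mark them unaddressable,
 * so Memcheck flags any later access; otherwise really unmap. */
static int my_unmap(void *addr, size_t len)
{
    if (RUNNING_ON_VALGRIND) {
        VALGRIND_MAKE_MEM_NOACCESS(addr, len);
        return 0;
    }
    return munmap(addr, len);
}
This deliberately keeps the mapping alive under Valgrind, exactly like Memcheck's own freelist does for heap blocks.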
is there any way to make Valgrind or a Sanitizer, or some other Linux-compatible tool output useful diagnostic
I know of no such tool, although it would be relatively easy to make one.
Your problem is sufficiently different from the heap corruption problems that require specialized tools, and it probably doesn't need such a tool.
The major difference is the "action at a distance" aspect: with heap corruption, the code in which the problem manifests is often very far removed from the code in which the problem originates. Hence the need to track memory state, to have red zones, etc.
In your case, the access to munmapped memory results in immediate crash. So if you just log every mmap and munmap that your program performs, you'll only have to look back for the last munmap that "covered" the address on which you crashed.
In addition, most programs perform relatively few mmap and munmap operations. If your program performs so many that you can't log them all, it likely shouldn't be doing that in the first place (mmap and munmap are relatively expensive system calls). A sketch of such logging wrappers follows.
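For illustration, assuming you can route all map/unmap calls through wrappers (log_mmap and log_munmap are hypothetical names):
#include <stdio.h>
#include <sys/mman.h>

static void *log_mmap(void *addr, size_t len, int prot, int flags,
                      int fd, off_t off)
{
    void *p = mmap(addr, len, prot, flags, fd, off);
    fprintf(stderr, "mmap   -> %p (len %zu)\n", p, len);
    return p;
}

static int log_munmap(void *addr, size_t len)
{
    fprintf(stderr, "munmap    %p (len %zu)\n", addr, len);
    return munmap(addr, len);
}
After a crash, the last munmap line whose [addr, addr+len) range covers the faulting address tells you which unmap is to blame.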

unexpected behavior of linux malloc

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BLOCKSIZE 1024*1024
// #define BLOCKSIZE 4096

int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;
    while (1) {
        myblock = malloc(BLOCKSIZE);
        if (!myblock) {
            puts("error");
            break;
        }
        memset(myblock, 1, BLOCKSIZE);
        count++;
    }
    printf("Currently allocated %d\n", count);
    printf("end");
    exit(0);
}
When BLOCKSIZE is 1024*1024, all is OK: malloc returns NULL, the loop breaks, and the program prints its text and exits.
When BLOCKSIZE is 4096, malloc never returns NULL and the program crashes: out of memory, killed by the kernel.
Why?
It's pitch black, you are likely to be eaten by an OOM killer.
Linux has this thing called an OOM killer which wanders about killing off processes when it finds memory allocation is very heavy. The selection of which process(es) to kill is based on certain properties of each process (such as one allocating a lot of memory being a prime candidate).
It does this, partly due to its optimistic memory allocation strategy (it will generally give you address space whether or not there's enough backing memory on devices for it, something known as overcommit).
It's likely in this case that, when allocating 1M at a time, an allocation fails before the OOM killer finds you. With 4K, you're discovered before the allocation routines decide you've had enough.
You can configure the OOM killer to leave you alone if that's your desire, by writing an adjustment value of -17 to your oom_adj entry in procfs. It's not advisable unless you know what you're doing, since it puts other (perhaps more important) processes at risk. Other values from -16 to +15 adjust the likelihood that your process will be selected.
You can also turn off overcommit altogether by adding vm.overcommit_memory=2 to /etc/sysctl.conf, but that again can present problems in your environment. A sketch of the oom_adj approach follows.
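For illustration, doing the same from inside the process (writing to /proc/self/oom_adj needs root; newer kernels prefer the oom_score_adj interface with a -1000..1000 range):
#include <stdio.h>

/* Ask the kernel to never select this process for OOM killing. */
static int oom_protect_self(void)
{
    FILE *f = fopen("/proc/self/oom_adj", "w");
    if (!f)
        return -1;
    fprintf(f, "-17\n");
    return fclose(f);
}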

High CPU usage - simple packet receiver on Linux

I'm writing a simple application under Linux that gathers all packets from the network. I'm using blocking receives with recvfrom(). When I generate a big network load with hping3 (~100k raw frames per second, 130 bytes each), top shows high CPU usage for my process: about 37-38%. That is a big value for me. When I decrease the number of packets, the usage is lower; for example, top shows 3% for 4k frames per second.
I've checked DC++ when it downloads ~10 MB/s, and its process doesn't use 38% of the CPU but 5%. Is there any programmable way in C to reduce CPU usage and still receive a lot of frames?
My CPU:
Intel i5-2400 CPU @ 3.10GHz
My system:
Ubuntu 11.04 kernel 3.6.6 with PREEMPT-RT patch
And here is my code:
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <linux/if_arp.h>
#include <arpa/inet.h>
/* Socket descriptor. */
int mainSocket;

/* Buffer for frame. */
unsigned char* buffer;

int main(int argc, char* argv[])
{
    /* Create socket. */
    mainSocket = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (mainSocket == -1) {
        printf("Error: cannot create socket!\n");
        return 1;
    }

    /* Create buffer for frame. */
    buffer = malloc(ETH_FRAME_LEN);

    printf("Listening...");
    while (1) {
        // Length of received packet
        int length = recvfrom(mainSocket, buffer, ETH_FRAME_LEN, 0, NULL, NULL);
        if (length > 0) {
            // ... do something ...
        }
    }
}
I don't know if this will help, but looking on Google I see that:
"Raw socket, Packet socket and Zero copy networking in Linux" as well as http://lxr.linux.no/linux+v2.6.36/Documentation/networking/packet_mmap.txt talk about using PACKET_MMAP and mmap() to improve the performance of raw sockets (a sketch follows below).
The "Overview of Packet Reception" suggests setting your process's affinity to match the CPU to which you bind the NIC, using RPS.
Does DC++ do a promiscuous receive? I wouldn't have guessed so. So instead of comparing your performance to DC++, perhaps you should compare it to the performance of a utility like libpcap.
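A rough sketch of the PACKET_RX_RING setup those documents describe (TPACKET_V1 layout; the ring geometry numbers are arbitrary examples, not tuned values):
#include <stdio.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* Ring geometry: 64 blocks of 16 KiB, 8 frames of 2 KiB per block. */
    struct tpacket_req req = {
        .tp_block_size = 16384,
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = 64 * 8,
    };
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof req) < 0) {
        perror("PACKET_RX_RING");
        return 1;
    }

    /* Map the ring once; the kernel writes frames into it directly. */
    size_t ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    unsigned char *ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned frame = 0;
    for (;;) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(ring + (size_t)frame * req.tp_frame_size);
        if (!(hdr->tp_status & TP_STATUS_USER)) {
            /* No frame ready: block in poll() instead of spinning. */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);
            continue;
        }
        /* Frame data: hdr->tp_len bytes at (unsigned char *)hdr + hdr->tp_mac. */
        /* ... process the frame here ... */
        hdr->tp_status = TP_STATUS_KERNEL; /* hand the slot back */
        frame = (frame + 1) % req.tp_frame_nr;
    }
}
The win over recvfrom() is that frames land in shared memory without a per-packet copy or a per-packet system call.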
Maybe it is because the TCP/IP work is offloaded to the NIC and DC++ gets its stream of data after that processing, so your processor is not doing any TCP/IP work. But in your case you are trying to get raw data straight off the NIC, so it is processed by your CPU rather than offloaded, and since you fetch data in a tight infinite loop, you are doing a lot of processing, so CPU usage spikes.

How to access physical addresses from user space in Linux?

On an ARM-based system running Linux, I have a device that's memory-mapped to a physical address. From a user space program, where all addresses are virtual, how can I read content from this address?
busybox devmem
busybox devmem is a tiny CLI utility that mmaps /dev/mem.
You can get it in Ubuntu with: sudo apt-get install busybox
Usage: read 4 bytes from the physical address 0x12345678:
sudo busybox devmem 0x12345678
Write 0x9abcdef0 to that address:
sudo busybox devmem 0x12345678 w 0x9abcdef0
Source: https://github.com/mirror/busybox/blob/1_27_2/miscutils/devmem.c#L85
mmap MAP_SHARED
When mmapping /dev/mem, you likely want to use:
open("/dev/mem", O_RDWR | O_SYNC);
mmap(..., PROT_READ | PROT_WRITE, MAP_SHARED, ...)
MAP_SHARED makes writes go to physical memory immediately, which makes it easier to observe, and makes more sense for hardware register writes.
CONFIG_STRICT_DEVMEM and nopat
To use /dev/mem to view and modify regular RAM on kernel v4.9, you must first:
disable CONFIG_STRICT_DEVMEM (set by default on Ubuntu 17.04)
pass the nopat kernel command line option for x86
IO ports still work without those.
See also: mmap of /dev/mem fails with invalid argument for virt_to_phys address, but address is page aligned
Cache flushing
If you try to write to RAM instead of a register, the memory may be cached by the CPU: How to flush the CPU cache for a region of address space in Linux? and I don't see a very portable / easy way to flush it or mark the region as uncacheable:
How to write kernel space memory (physical address) to a file using O_DIRECT?
How to flush the CPU cache for a region of address space in Linux?
Is it possible to allocate, in user space, a non cacheable block of memory on Linux?
So maybe /dev/mem can't be used reliably to pass memory buffers to devices?
This can't be observed in QEMU unfortunately, since QEMU does not simulate caches.
How to test it out
Now for the fun part. Here are a few cool setups:
Userland memory
allocate a volatile variable in a userland process
get the physical address with /proc/<pid>/maps + /proc/<pid>/pagemap
modify the value at the physical address with devmem, and watch the userland process react
Kernelland memory
allocate kernel memory with kmalloc
get the physical address with virt_to_phys and pass it back to userland
modify the physical address with devmem
query the value from the kernel module
IO mem and QEMU virtual platform device
create a platform device with known physical register addresses
use devmem to write to the register
watch printfs come out of the virtual device in response
Bonus: determine the physical address for a virtual address
Is there any API for determining the physical address from virtual address in Linux?
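As a sketch of that pagemap step (format per Documentation/vm/pagemap.txt: bits 0-54 hold the PFN, bit 63 the present flag; reading the PFN needs root on recent kernels):
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

/* Translate one virtual address of the calling process to a physical
 * address via /proc/self/pagemap. Returns 0 on success. */
static int virt_to_phys_user(uintptr_t vaddr, uint64_t *paddr)
{
    FILE *f = fopen("/proc/self/pagemap", "rb");
    if (!f)
        return -1;
    long psize = sysconf(_SC_PAGE_SIZE);
    uint64_t entry;
    /* One 8-byte entry per virtual page, indexed by page number. */
    if (fseeko(f, (off_t)(vaddr / psize) * 8, SEEK_SET) != 0 ||
        fread(&entry, sizeof entry, 1, f) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    if (!(entry & (1ULL << 63)))  /* page not present */
        return -1;
    *paddr = (entry & ((1ULL << 55) - 1)) * psize + vaddr % psize;
    return 0;
}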
You can map a device file into a user process's memory using the mmap(2) system call. Usually, device files are mappings of physical memory into the file system.
Otherwise, you have to write a kernel module which creates such a file or provides a way to map the needed memory to a user process.
Another way is remapping parts of /dev/mem to user memory.
Edit:
Example of mmapping /dev/mem (this program must have access to /dev/mem, e.g. have root rights):
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
    if (argc < 3) {
        printf("Usage: %s <phys_addr> <length>\n", argv[0]);
        return 0;
    }
    off_t offset = strtoul(argv[1], NULL, 0);
    size_t len = strtoul(argv[2], NULL, 0);

    // Truncate offset to a multiple of the page size, or mmap will fail.
    size_t pagesize = sysconf(_SC_PAGE_SIZE);
    off_t page_base = (offset / pagesize) * pagesize;
    off_t page_offset = offset - page_base;

    // This program only reads; for register writes use O_RDWR,
    // PROT_WRITE and MAP_SHARED as discussed above.
    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0) {
        perror("Can't open /dev/mem");
        return -1;
    }

    unsigned char *mem = mmap(NULL, page_offset + len, PROT_READ,
                              MAP_SHARED, fd, page_base);
    if (mem == MAP_FAILED) {
        perror("Can't map memory");
        return -1;
    }

    size_t i;
    for (i = 0; i < len; ++i)
        printf("%02x ", (int)mem[page_offset + i]);
    return 0;
}
