shm_open() fails with EINVAL when creating shared memory in subdirectory of /dev/shm - linux

I have a GNU/Linux application which uses a number of shared memory objects. It could, potentially, be run a number of times on the same system. To keep things tidy, I first create a directory in /dev/shm for each set of shared memory objects.
The problem is that on newer GNU/Linux distributions, I no longer seem to be able to create these in a sub-directory of /dev/shm.
The following is a minimal C program which illustrates what I'm talking about:
/*****************************************************************************
* shm_minimal.c
*
* Test shm_open()
*
* Expect to create shared memory file in:
* /dev/shm/
* └── my_dir
*    └── shm_name
*
* NOTE: Only visible on filesystem during execution. I try to be nice, and
* clean up after myself.
*
* Compile with:
* $ gcc shm_minimal.c -o shm_minimal -lrt
*
******************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
int main(int argc, const char* argv[]) {
int shm_fd = -1;
char* shm_dir = "/dev/shm/my_dir";
char* shm_file = "/my_dir/shm_name"; /* does NOT work */
//char* shm_file = "/my_dir_shm_name"; /* works */
// Create directory in /dev/shm
mkdir(shm_dir, 0777);
// make shared memory segment
shm_fd = shm_open(shm_file, O_RDWR | O_CREAT, 0600);
if (-1 == shm_fd) {
switch (errno) {
case EINVAL:
/* Confirmed on:
* kernel v3.14, GNU libc v2.19 (ArchLinux)
* kernel v3.13, GNU libc v2.19 (Ubuntu 14.04 Beta 2)
*/
perror("FAIL - EINVAL");
return 1;
default:
printf("Some other problem not being tested\n");
return 2;
}
} else {
/* Confirmed on:
* kernel v3.8, GNU libc v2.17 (Mint 15)
* kernel v3.2, GNU libc v2.15 (Xubuntu 12.04 LTS)
* kernel v3.1, GNU libc v2.13 (Debian 6.0)
* kernel v2.6.32, GNU libc v2.12 (RHEL 6.4)
*/
printf("Success !!!\n");
}
// clean up
close(shm_fd);
shm_unlink(shm_file);
rmdir(shm_dir);
return 0;
}
/* vi: set ts=2 sw=2 ai expandtab:
*/
When I run this program on a fairly new distribution, the call to shm_open() returns -1, and errno is set to EINVAL. However, when I run on something a little older, it creates the shared memory object in /dev/shm/my_dir as expected.
For the larger application, the solution is simple. I can use a common prefix instead of a directory.
If you could help enlighten me to this apparent change in behavior it would be very helpful. I suspect someone else out there might be trying to do something similar.

So it turns out the issue stems from how GNU libc validates the shared memory name. Specifically, the shared memory object MUST now be at the root of the shmfs mount point.
This was changed in glibc git commit b20de2c3d9 as the result of bug BZ #16274.
Specifically, the change is the line:
if (name[0] == '\0' || namelen > NAME_MAX || strchr (name, '/') != NULL)
This now disallows '/' anywhere in the name (not counting the leading '/').
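The common-prefix workaround from the question can be sketched as a small helper that folds a directory-style name into a single flat name before it reaches shm_open(). This is only an illustration: `shm_flat_name` is a hypothetical name, and the sketch assumes the caller keeps the result within NAME_MAX, which it does not check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: turn "/my_dir/shm_name" into "/my_dir_shm_name".
 * Returns 0 on success, -1 if the input is unusable or the buffer is too small. */
int shm_flat_name(const char *name, char *out, size_t outlen)
{
    assert(name != NULL && out != NULL);

    size_t len = strlen(name);
    /* Need a leading '/', at least one more character, and room for the NUL. */
    if (name[0] != '/' || len < 2 || len + 1 > outlen)
        return -1;

    memcpy(out, name, len + 1);
    /* Keep the leading slash; replace every later slash with an underscore. */
    for (size_t i = 1; i < len; i++) {
        if (out[i] == '/')
            out[i] = '_';
    }
    return 0;
}
```

The flattened name can then be passed to shm_open() and shm_unlink() in place of the original one, with no directory needed under /dev/shm.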

If you have a third-party tool that was broken by this shm_open change, a brilliant coworker found a workaround: preload a library that overrides the shm_open call and swaps slashes for underscores. It does the same for shm_unlink, so the application can properly free shared memory when needed.
deslash_shm.cc :
#include <dlfcn.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <algorithm>
#include <string>
// function used in place of the standard shm_open() function
extern "C" int shm_open(const char *name, int oflag, mode_t mode)
{
// keep a function pointer to the real shm_open() function
static int (*real_open)(const char *, int, mode_t) = NULL;
// the first time in, ask the dynamic linker to find the real shm_open() function
if (!real_open) real_open = (int (*)(const char *, int, mode_t)) dlsym(RTLD_NEXT,"shm_open");
// take the name we were given and replace all slashes with underscores instead
std::string n = name;
std::replace(n.begin(), n.end(), '/', '_');
// call the real open function with the patched path name
return real_open(n.c_str(), oflag, mode);
}
// function used in place of the standard shm_unlink() function
extern "C" int shm_unlink(const char *name)
{
// keep a function pointer to the real shm_unlink() function
static int (*real_unlink)(const char *) = NULL;
// the first time in, ask the dynamic linker to find the real shm_unlink() function
if (!real_unlink) real_unlink = (int (*)(const char *)) dlsym(RTLD_NEXT, "shm_unlink");
// take the name we were given and replace all slashes with underscores instead
std::string n = name;
std::replace(n.begin(), n.end(), '/', '_');
// call the real unlink function with the patched path name
return real_unlink(n.c_str());
}
To compile this file:
c++ -fPIC -shared -o deslash_shm.so deslash_shm.cc -ldl
And preload it before starting a process that tries to use non-standard slash characters in shm_open:
in bash:
export LD_PRELOAD=/path/to/deslash_shm.so
in tcsh:
setenv LD_PRELOAD /path/to/deslash_shm.so

Related

What does lseek() mean for a directory file descriptor?

According to strace, lseek(fd, 0, SEEK_END) = 9223372036854775807 when fd refers to a directory. Why is this syscall succeeding at all? What does lseek() mean for a dir fd?
On my test system, if you use opendir(), and readdir() through all the entries in the directory, telldir() then returns the same value:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <dirent.h>
int main(int argc, char *argv[]) {
int fd = open(".", O_RDONLY);
if (fd < 0) {
perror("open");
return 1;
}
off_t o = lseek(fd, 0, SEEK_END);
if (o == (off_t)-1) {
perror("lseek");
return 1;
}
printf("Via lseek: %ld\n", (long)o);
close(fd);
DIR *d = opendir(".");
if (!d) {
perror("opendir");
return 1;
}
while (readdir(d)) {
}
printf("via telldir: %ld\n", telldir(d));
closedir(d);
return 0;
}
outputs
Via lseek: 9223372036854775807
via telldir: 9223372036854775807
Quoting from the telldir(3) man page:
In early filesystems, the value returned by telldir() was a simple file offset within a directory. Modern filesystems use tree or hash structures, rather than flat tables, to represent directories. On such filesystems, the value returned by telldir() (and used internally by readdir(3)) is a "cookie" that is used by the implementation to derive a position within a directory. Application programs should treat this strictly as an opaque value, making no assumptions about its contents.
It's a magic number that indicates that the position within the directory's contents is at the end. Don't count on the number always being the same, or being portable. It's a black box. Stick with the dirent API for traversing directory contents unless you really know exactly what you're doing. (Under the hood on Linux with glibc, opendir(3) calls openat(2) on the directory, readdir(3) fetches information about its contents with getdents(2), and seekdir(3) calls lseek(2), but those are just implementation details.)

What is the cause of the hard limit on the directory nesting depth returned by getcwd on macOS and how can it be circumvented?

On Linux and macOS, directories can be nested to seemingly arbitrary depth, as demonstrated by the following C program. However, on macOS but not on Linux, there seems to be a hard limit on the nesting level that getcwd can handle, specifically a nesting level of 256. When that limit is reached, getcwd fails with ENOENT, a rather strange error code. Where does this limit come from? Is there a way around it?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
void fail(char *msg) { perror(msg); exit(1); }
void create_nested_dirs(int n) {
int i;
char name[10];
char cwd[10000];
if (chdir("/tmp") < 0) fail("chdir(\"/tmp\")");
for (i=2; i<=n; i++) {
sprintf(name, "%09d", i);
printf("%s\n",name);
if (mkdir(name, 0777) < 0 && errno != EEXIST) fail("mkdir");
if (chdir(name) < 0) fail("chdir(name)");
if (getcwd(cwd, sizeof(cwd)) == NULL) fail("getcwd");
printf("cwd = \"%s\" strlen(cwd)=%zu\n", cwd, strlen(cwd));
}
}
int main() {
long ret = pathconf("/", _PC_PATH_MAX);
printf("PATH_MAX is %ld\n", ret);
create_nested_dirs(300);
return 0;
}
Update
The above program was updated to print the value returned by pathconf("/", _PC_PATH_MAX) and to print the length of the path returned by getcwd.
On my machine running macOS Mojave 10.14, the PATH_MAX is 1024 and the longest string correctly returned by getcwd is 2542 characters long. Then a 2552 character long directory of nesting depth 256 is created by mkdir and then after a successful chdir to that directory a getcwd fails with ENOENT.
If the sprintf(name, "%09d", i); is changed to sprintf(name, "%03d", i); the paths are considerably shorter but the getcwd still fails when the directory nesting depth reaches 256.
So the limiting factor here is the nesting depth, not PATH_MAX.
My understanding of the source code here is that the meat of the work is done by the call fcntl(fd, F_GETPATH, b) so the problem may be in fcntl.

can a program read its own elf section?

I would like to use ld's --build-id option in order to add build information to my binary. However, I'm not sure how to make this information available inside the program. Assume I want to write a program that writes a backtrace every time an exception occurs, and a script that parses this information. The script reads the symbol table of the program and searches for the addresses printed in the backtrace (I'm forced to use such a script because the program is statically linked and backtrace_symbols is not working). In order for the script to work correctly I need to match build version of the program with the build version of the program which created the backtrace. How can I print the build version of the program (located in the .note.gnu.build-id elf section) from the program itself?
How can I print the build version of the program (located in the .note.gnu.build-id elf section) from the program itself?
You need to read the ElfW(Ehdr) (at the beginning of the file) to find program headers in your binary (.e_phoff and .e_phnum will tell you where program headers are, and how many of them to read).
You then read program headers, until you find PT_NOTE segment of your program. That segment will tell you offset to the beginning of all the notes in your binary.
You then need to read the ElfW(Nhdr) and skip the rest of the note (total size of the note is sizeof(Nhdr) + .n_namesz + .n_descsz, properly aligned), until you find a note with .n_type == NT_GNU_BUILD_ID.
Once you find NT_GNU_BUILD_ID note, skip past its .n_namesz, and read the .n_descsz bytes to read the actual build-id.
You can verify that you are reading the right data by comparing what you read with the output of readelf -n a.out.
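The note-walking steps above can be sketched as follows, operating on a buffer that already holds the contents of a PT_NOTE segment. `find_build_id` and `note_align4` are hypothetical names, and the sketch assumes the usual 4-byte note padding and a note stream in host byte order; it is not the code of any particular tool.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Round a note field size up to the 4-byte note alignment. */
static size_t note_align4(size_t n) { return (n + 3u) & ~(size_t)3u; }

/* Hypothetical helper: scan the bytes of a PT_NOTE segment and return a
 * pointer to the build-id bytes, storing their count in *len.
 * Returns NULL if no NT_GNU_BUILD_ID note owned by "GNU" is found. */
const unsigned char *find_build_id(const unsigned char *notes, size_t size,
                                   size_t *len)
{
    assert(notes != NULL && len != NULL);

    size_t off = 0;
    while (off <= size && size - off >= sizeof(Elf32_Nhdr)) {
        Elf32_Nhdr nhdr;   /* Elf32_Nhdr and Elf64_Nhdr have the same layout */
        memcpy(&nhdr, notes + off, sizeof nhdr);

        size_t name_off = off + sizeof nhdr;
        size_t desc_off = name_off + note_align4(nhdr.n_namesz);
        if (desc_off > size || nhdr.n_descsz > size - desc_off)
            break;         /* truncated or malformed note */

        if (nhdr.n_type == NT_GNU_BUILD_ID && nhdr.n_namesz == 4 &&
            memcmp(notes + name_off, "GNU", 4) == 0) {
            *len = nhdr.n_descsz;
            return notes + desc_off;
        }
        /* Skip header, padded name, and padded descriptor. */
        off = desc_off + note_align4(nhdr.n_descsz);
    }
    return NULL;
}
```

Checking the owner name as well as the type avoids matching an unrelated note that happens to use the same type number.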
P.S.
If you are going to go through the trouble to decode build-id as above, and if your executable is not stripped, it may be better for you to just decode and print symbol names instead (i.e. to replicate what backtrace_symbols does) -- it's actually easier to do than decoding ELF notes, because the symbol table contains fixed-sized entries.
Basically, this is the code I've written based on the answer given to my question. In order to compile the code I had to make some changes, and I hope it will work on as many platforms as possible. However, it was tested on only one build machine. One of my assumptions was that the program is built on the machine that runs it, so there is no point in checking endianness compatibility between the program and the machine.
user#:~/$ uname -s -r -m -o
Linux 3.2.0-45-generic x86_64 GNU/Linux
user#:~/$ g++ test.cpp -o test
user#:~/$ readelf -n test | grep Build
Build ID: dc5c4682e0282e2bd8bc2d3b61cfe35826aa34fc
user#:~/$ ./test
Build ID: dc5c4682e0282e2bd8bc2d3b61cfe35826aa34fc
#include <elf.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#if __x86_64__
# define ElfW(type) Elf64_##type
#else
# define ElfW(type) Elf32_##type
#endif
/*
detecting build id of a program from its note section
http://stackoverflow.com/questions/17637745/can-a-program-read-its-own-elf-section
http://www.scs.stanford.edu/histar/src/pkg/uclibc/utils/readelf.c
http://www.sco.com/developers/gabi/2000-07-17/ch5.pheader.html#note_section
*/
int main (int argc, char* argv[])
{
char *thefilename = argv[0];
FILE *thefile;
struct stat statbuf;
ElfW(Ehdr) *ehdr = 0;
ElfW(Phdr) *phdr = 0;
ElfW(Nhdr) *nhdr = 0;
if (!(thefile = fopen(thefilename, "r"))) {
perror(thefilename);
exit(EXIT_FAILURE);
}
if (fstat(fileno(thefile), &statbuf) < 0) {
perror(thefilename);
exit(EXIT_FAILURE);
}
ehdr = (ElfW(Ehdr) *)mmap(0, statbuf.st_size,
PROT_READ|PROT_WRITE, MAP_PRIVATE, fileno(thefile), 0);
phdr = (ElfW(Phdr) *)(ehdr->e_phoff + (size_t)ehdr);
while (phdr->p_type != PT_NOTE)
{
++phdr;
}
nhdr = (ElfW(Nhdr) *)(phdr->p_offset + (size_t)ehdr);
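/* Notes are padded to 4-byte boundaries, so round both sizes up when skipping.
   This still assumes the first PT_NOTE segment holds the build ID. */
while (nhdr->n_type != NT_GNU_BUILD_ID)
{
nhdr = (ElfW(Nhdr) *)((size_t)nhdr + sizeof(ElfW(Nhdr))
+ ((nhdr->n_namesz + 3) & ~(size_t)3) + ((nhdr->n_descsz + 3) & ~(size_t)3));
}
unsigned char * build_id = (unsigned char *)malloc(nhdr->n_descsz);
memcpy(build_id, (void *)((size_t)nhdr + sizeof(ElfW(Nhdr)) + ((nhdr->n_namesz + 3) & ~(size_t)3)), nhdr->n_descsz);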
printf(" Build ID: ");
for (int i = 0 ; i < nhdr->n_descsz ; ++i)
{
printf("%02x",build_id[i]);
}
free(build_id);
printf("\n");
return 0;
}
Yes, a program can read its own .note.gnu.build-id. The important piece is the dl_iterate_phdr function.
I've used this technique in Mesa (the OpenGL/Vulkan implementation) to read its own build-id for use with the on-disk shader cache.
I've extracted those bits into a separate project[1] for easy use by others.
[1] https://github.com/mattst88/build-id
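For reference, a minimal sketch of the dl_iterate_phdr approach, not taken from Mesa or the linked project. It assumes glibc on Linux, an executable linked with a build ID, and that the first object reported by dl_iterate_phdr is the main program (as the glibc manual page describes). The names `own_build_id`, `build_id_cb` and `struct build_id_result` are invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct build_id_result {
    unsigned char bytes[64];
    size_t len;                       /* 0 means "not found" */
};

static size_t align_up(size_t n, size_t a) { return (n + a - 1) & ~(a - 1); }

/* Called by dl_iterate_phdr() for each loaded object. */
static int build_id_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    struct build_id_result *res = data;
    (void)size;

    for (size_t i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_NOTE)
            continue;

        size_t align = (ph->p_align == 8) ? 8 : 4;
        const unsigned char *seg =
            (const unsigned char *)(uintptr_t)(info->dlpi_addr + ph->p_vaddr);
        size_t seg_len = ph->p_memsz, off = 0;

        while (off <= seg_len && seg_len - off >= sizeof(ElfW(Nhdr))) {
            ElfW(Nhdr) nh;
            memcpy(&nh, seg + off, sizeof nh);
            size_t desc_off = off + align_up(sizeof nh + nh.n_namesz, align);
            if (desc_off > seg_len || nh.n_descsz > seg_len - desc_off)
                break;                /* malformed note: give up on this segment */
            if (nh.n_type == NT_GNU_BUILD_ID && nh.n_namesz == 4 &&
                memcmp(seg + off + sizeof nh, "GNU", 4) == 0 &&
                nh.n_descsz <= sizeof res->bytes) {
                memcpy(res->bytes, seg + desc_off, nh.n_descsz);
                res->len = nh.n_descsz;
                return 1;
            }
            off = align_up(desc_off + nh.n_descsz, align);
        }
    }
    return 1;   /* non-zero stops iteration after the first object */
}

/* Copy this program's build ID into out; returns its length, or 0 if absent. */
size_t own_build_id(unsigned char *out, size_t outlen)
{
    assert(out != NULL);
    struct build_id_result res = { {0}, 0 };
    dl_iterate_phdr(build_id_cb, &res);
    if (res.len == 0 || res.len > outlen)
        return 0;
    memcpy(out, res.bytes, res.len);
    return res.len;
}
```

Because the notes are read from the already mapped image, nothing has to be opened or mmap'ed by path. The bytes can be printed as hex and compared with the output of readelf -n.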

How can I get the source code for the linux utility tail?

The tail command is really useful, but where can I get its source code to see what is going on inside?
The tail utility is part of the coreutils on linux.
Source tarball: ftp://ftp.gnu.org/gnu/coreutils/coreutils-7.4.tar.gz
Source file: https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/tail.c
I've always found FreeBSD to have far clearer source code than the gnu utilities. So here's tail.c in the FreeBSD project:
http://svnweb.freebsd.org/csrg/usr.bin/tail/tail.c?view=markup
Poke around the uclinux site. Since they distributed the software, they are required to make the source available one way or another.
Or, you could read man fseek and guess at how it might be done.
NB-- See William's comments below, there are cases when you can't use seek.
You might find it an interesting exercise to write your own. The vast majority of the Unix command-line tools are a page or so of fairly straightforward C code.
To just look at the code, the GNU CoreUtils sources are easily found on gnu.org or your favorite Linux mirror site.
/* This example implements the -n option of the tail command. */
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <getopt.h>
#define BUFF_SIZE 4096
FILE *openFile(const char *filePath)
{
FILE *file;
file= fopen(filePath, "r");
if(file == NULL)
{
fprintf(stderr,"Error opening file: %s\n",filePath);
exit(errno);
}
return(file);
}
void printLine(FILE *file, off_t startline)
{
int fd;
fd= fileno(file);
int nread;
char buffer[BUFF_SIZE];
lseek(fd,(startline + 1),SEEK_SET);
while((nread= read(fd,buffer,BUFF_SIZE)) > 0)
{
write(STDOUT_FILENO, buffer, nread);
}
}
void walkFile(FILE *file, long nlines)
{
off_t fposition;
fseek(file,0,SEEK_END);
fposition= ftell(file);
off_t index= fposition;
off_t end= fposition;
long countlines= 0;
int cbyte; /* int, so that EOF is distinguishable from a valid byte */
for(; index >= 0; index --)
{
cbyte= fgetc(file);
if (cbyte == '\n' && (end - index) > 1)
{
countlines ++;
if(countlines == nlines)
{
break;
}
}
fposition--;
fseek(file,fposition,SEEK_SET);
}
printLine(file, fposition);
fclose(file);
}
int main(int argc, char *argv[])
{
FILE *file;
file= openFile(argv[2]);
walkFile(file, atol(argv[1]));
return 0;
}
/* Note: keep in mind that I did not write code to parse input options and arguments, nor code to check that the line-count argument is really a number. */

Getting stack traces on Unix systems, automatically

What methods are there for automatically getting a stack trace on Unix systems? I don't mean just getting a core file or attaching interactively with GDB, but having a SIGSEGV handler that dumps a backtrace to a text file.
Bonus points for the following optional features:
Extra information gathering at crash time (eg. config files).
Email a crash info bundle to the developers.
Ability to add this in a dlopened shared library
Not requiring a GUI
FYI,
the suggested solution (using backtrace_symbols in a signal handler) is dangerously broken. DO NOT USE IT.
Yes, backtrace and backtrace_symbols will produce a backtrace and translate it to symbolic names, however:
backtrace_symbols allocates memory using malloc and you use free to free it. If you're crashing because of memory corruption, your malloc arena is very likely to be corrupt and cause a double fault.
malloc and free protect the malloc arena with an internal lock. You might have faulted in the middle of a malloc/free with the lock taken, which will cause these functions, or anything that calls them, to deadlock.
You use puts, which uses the standard stream, which is also protected by a lock. If you faulted in the middle of a printf, you once again have a deadlock.
On 32-bit platforms (e.g. your normal PC of 2 years ago), the kernel will plant a return address to an internal glibc function instead of your faulting function in your stack, so the single most important piece of information you are interested in, namely in which function the program faulted, will actually be corrupted on those platforms.
So, the code in the example is the worst kind of wrong: it LOOKS like it's working, but it will really fail you in unexpected ways in production.
BTW, interested in doing it right? check this out.
Cheers,
Gilad.
If you are on a system with the BSD backtrace functionality available (Linux, OS X 10.5, BSD of course), you can do this programmatically in your signal handler.
For example (backtrace code derived from IBM example):
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
void sig_handler(int sig)
{
void * array[25];
int nSize = backtrace(array, 25);
char ** symbols = backtrace_symbols(array, nSize);
for (int i = 0; i < nSize; i++)
{
puts(symbols[i]);
}
free(symbols);
signal(sig, &sig_handler);
}
void h()
{
kill(0, SIGSEGV);
}
void g()
{
h();
}
void f()
{
g();
}
int main(int argc, char ** argv)
{
signal(SIGSEGV, &sig_handler);
f();
}
Output:
0 a.out 0x00001f2d sig_handler + 35
1 libSystem.B.dylib 0x95f8f09b _sigtramp + 43
2 ??? 0xffffffff 0x0 + 4294967295
3 a.out 0x00001fb1 h + 26
4 a.out 0x00001fbe g + 11
5 a.out 0x00001fcb f + 11
6 a.out 0x00001ff5 main + 40
7 a.out 0x00001ede start + 54
This doesn't get bonus points for the optional features (except not requiring a GUI), however, it does have the advantage of being very simple, and not requiring any additional libraries or programs.
Here is an example of how to get some more info using a demangler. As you can see this one also logs the stacktrace to file.
#include <iostream>
#include <sstream>
#include <string>
#include <fstream>
#include <cstdlib>
#include <csignal>
#include <cxxabi.h>
#include <execinfo.h>
void sig_handler(int sig)
{
std::stringstream stream;
void * array[25];
int nSize = backtrace(array, 25);
char ** symbols = backtrace_symbols(array, nSize);
for (int i = 0; i < nSize; i++) {
int status;
char *realname;
std::string current = symbols[i];
size_t start = current.find("(");
size_t end = current.find("+");
realname = NULL;
if (start != std::string::npos && end != std::string::npos) {
std::string symbol = current.substr(start+1, end-start-1);
realname = abi::__cxa_demangle(symbol.c_str(), 0, 0, &status);
}
if (realname != NULL)
stream << realname << std::endl;
else
stream << symbols[i] << std::endl;
free(realname);
}
free(symbols);
std::cerr << stream.str();
std::ofstream file("/tmp/error.log");
if (file.is_open()) {
if (file.good())
file << stream.str();
file.close();
}
signal(sig, &sig_handler);
}
Derek's solution is probably the best, but here's an alternative anyway:
Recent Linux kernel versions allow you to pipe core dumps to a script or program. You could write a script to catch the core dump, collect any extra information you need and mail everything back.
This is a global setting though, so it'd apply to any crashing program on the system. It will also require root rights to set up.
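It can be configured through the /proc/sys/kernel/core_pattern file. Set that to something like '|/home/myuser/bin/my-core-handler-script'; the pipe symbol must be the first character, immediately followed by the absolute path of the program.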
The Ubuntu people use this feature as well.
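As an illustration, the handler named in core_pattern receives the core image on its standard input. Below is a minimal sketch of the part that saves it to disk, written in C rather than as a script. `save_core` is a hypothetical helper, and a real handler would add a main() that picks an output path and calls save_core(STDIN_FILENO, path) before gathering any extra information.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Copy everything readable from in_fd into the file at out_path.
 * Returns the number of bytes copied, or -1 on error. */
long long save_core(int in_fd, const char *out_path)
{
    assert(out_path != NULL);

    int out = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out < 0)
        return -1;

    char buf[65536];
    long long total = 0;
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            break;                          /* end of the core image */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(out);
            return -1;
        }
        /* write() may be partial, so loop until the chunk is flushed. */
        for (ssize_t done = 0; done < n; ) {
            ssize_t w = write(out, buf + done, (size_t)(n - done));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                close(out);
                return -1;
            }
            done += w;
        }
        total += n;
    }
    return close(out) == 0 ? total : -1;
}
```

The file is created with mode 0600 because a core image can contain sensitive data from the crashed process.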
