shm_unlink from the shell? - linux

I'm working on a program that creates a shared memory object with shm_open. It tries to release the object with shm_unlink, but sometimes a programming error causes it to crash before it can make that call. In that case, I need to unlink the shared memory object "by hand," and I'd like to be able to do it in a normal shell, without writing any C code -- i.e. with standard Linux utilities.
Can it be done? This question seems to suggest that using unlink(1) on /dev/shm/path_passed_to_shm_open works, but the manpages aren't clear.

Unlinking the file in /dev/shm will delete the shared memory object if no other process has the object mapped.
Unlike SYSV shared memory, which is implemented in the kernel, POSIX shared memory objects are simply "files in disguise".
When you call shm_open and mmap, you can see the following in the process's memory map (using pmap -X):
Address Perm Offset Device Inode Mapping
b7737000 r--s 00000000 00:0d 2267945 test_object
The device major and minor number correspond to the tmpfs mounted at /dev/shm (some systems mount this at /run, and then symlink /dev/shm to /run/shm).
A listing of the directory shows the same inode number:
$ ls -li /dev/shm/
2267945 -rw------- 1 mikel mikel 1 Apr 14 13:36 test_object
Like any other inode, the space is freed only when all references to it are removed. If we close the only program referencing the object, the memory is still in use, because the entry in /dev/shm still counts as a reference:
$ cat /proc/meminfo | grep Shmem
Shmem: 700 kB
Once we remove the last reference (the entry in /dev/shm), the space is freed:
$ rm /dev/shm/test_object
$ cat /proc/meminfo | grep Shmem
Shmem: 696 kB
If you're curious, you can look at the corresponding files in the glibc sources. shm_directory.c and shm_directory.h generate the filename as /dev/shm/your_shm_name. The implementations of shm_open and shm_unlink simply open and unlink this file, so it should be easy to see that rm /dev/shm/your_shm_name performs the same operation.
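To reproduce the listing above, here is a minimal sketch of a program that creates such an object (the name test_object and the one-byte size match the listing; error handling is omitted for brevity, and on older glibc you may need to link with -lrt):

#include <fcntl.h>      /* O_CREAT, O_RDWR */
#include <sys/mman.h>   /* shm_open, mmap */
#include <unistd.h>     /* ftruncate, pause */

int main(void)
{
    /* Creates /dev/shm/test_object on a typical Linux system. */
    int fd = shm_open("/test_object", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 1);   /* one byte, as in the listing above */
    mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    pause();            /* keep running: inspect with pmap -X and ls -li */
    return 0;
}

While it runs, rm /dev/shm/test_object has the same effect as the program calling shm_unlink("/test_object") itself.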

Related

df -h giving fake data?

When I run df -h on my instance I get this data:
Filesystem Size Used Avail Use% Mounted on
devtmpfs 7.7G 0 7.7G 0% /dev
tmpfs 7.7G 0 7.7G 0% /dev/shm
tmpfs 7.7G 408K 7.7G 1% /run
tmpfs 7.7G 0 7.7G 0% /sys/fs/cgroup
/dev/nvme0n1p1 32G 24G 8.5G 74% /
tmpfs 1.6G 0 1.6G 0% /run/user/1000
but when I run sudo du -sh / I get:
11G /
So df -h says 24G is used on /, but du -sh on the same directory says 11G.
I'm trying to free up some space on my instance and can't find the files that account for the difference.
What am I missing? Is df -h really giving fake data?
This question comes up quite often. The filesystem allocates disk blocks to record its own bookkeeping data in addition to file contents. This bookkeeping data is referred to as metadata and is not visible to most user-level programs (such as du). Examples of metadata are inodes, disk maps, indirect blocks, and superblocks.
The du command is a user-level program that isn't aware of filesystem metadata, while df looks at the filesystem disk allocation maps and is aware of file system metadata. df obtains true filesystem statistics, whereas du sees only a partial picture.
There are many reasons why the disk usage reported by du and df can differ.
Perhaps the most common is deleted files. Files that have been deleted may still be open by at least one process. The entry for such a file is removed from the associated directory, which makes it inaccessible by name. Therefore du, which only counts visible files, does not take it into account and comes up with a smaller value. As long as a process still holds the deleted file open, however, its blocks are not yet released, so df, which reads the filesystem's own accounting, correctly displays them as occupied. You can find out if this is the case by running the following:
lsof | grep '(deleted)'
The fix for this issue would be to restart the services that still have those deleted files open.
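If lsof is not available, the same check can be approximated by hand. Here is a rough sketch in C that walks /proc/<pid>/fd and reports descriptors whose target has been deleted (run it as root to see every process):

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    DIR *proc = opendir("/proc");
    struct dirent *p;

    if (!proc)
        return 1;
    while ((p = readdir(proc)) != NULL) {
        if (p->d_name[0] < '0' || p->d_name[0] > '9')
            continue;                       /* skip non-pid entries */

        char fddir[64];
        snprintf(fddir, sizeof fddir, "/proc/%s/fd", p->d_name);
        DIR *fds = opendir(fddir);          /* fails without privileges */
        if (!fds)
            continue;

        struct dirent *f;
        while ((f = readdir(fds)) != NULL) {
            char link[128], target[4096];
            snprintf(link, sizeof link, "%s/%s", fddir, f->d_name);
            ssize_t n = readlink(link, target, sizeof target - 1);
            if (n < 0)
                continue;                   /* "." and ".." land here */
            target[n] = '\0';
            /* the kernel appends " (deleted)" to unlinked targets */
            if (n > 10 && strcmp(target + n - 10, " (deleted)") == 0)
                printf("pid %s fd %s -> %s\n", p->d_name, f->d_name, target);
        }
        closedir(fds);
    }
    closedir(proc);
    return 0;
}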
The second most common cause is a partition or drive mounted on top of a directory that already contains data. For example, suppose you have a directory under / called backup which contains data, and you then mount a new, empty drive on top of that directory as /backup: the space used by the now-hidden files still shows up in df, even though du shows no files.
To determine if there are any files or directories hidden under an active mount point, you can use a bind mount of your / filesystem, which lets you inspect underneath other mount points. Note that this is recommended only for experienced system administrators.
mkdir /tmp/tmpmnt
mount -o bind / /tmp/tmpmnt
du /tmp/tmpmnt
After you have confirmed that this is the issue, the bind mount can be removed by running:
umount /tmp/tmpmnt/
rmdir /tmp/tmpmnt
Another possible cause might be filesystem corruption. If this is suspected, please make sure you have good backups, and at your convenience, please unmount the filesystem and run fsck.
Again, this should be done by experienced system administrators.
You can also check the calculation by running:
strace -e statfs df /
This will give you output similar to:
statfs("/", {f_type=XFS_SB_MAGIC, f_bsize=4096, f_blocks=20968699, f_bfree=17420469,
f_bavail=17420469, f_files=41942464, f_ffree=41509188, f_fsid={val=[64769, 0]},
f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/vda1 83874796 14192920 69681876 17% /
+++ exited with 0 +++
Notice the difference between f_bfree and f_bavail? These are the free blocks in the filesystem vs. the free blocks available to an unprivileged user (filesystems typically reserve a few percent for root). The Used column is f_blocks - f_bfree, and Use% is computed against Used plus f_bavail rather than the total, which is why reserved blocks can push the percentage up.
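For illustration, here is a minimal sketch that redoes df's arithmetic through the POSIX statvfs(3) wrapper (which exposes the same counters as the statfs call above):

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs s;
    if (statvfs("/", &s) != 0) {
        perror("statvfs");
        return 1;
    }

    /* f_frsize is the unit the block counts are expressed in. */
    unsigned long long bs    = s.f_frsize;
    unsigned long long total = s.f_blocks * bs / 1024;             /* 1K-blocks */
    unsigned long long used  = (s.f_blocks - s.f_bfree) * bs / 1024;
    unsigned long long avail = s.f_bavail * bs / 1024;             /* unprivileged */

    /* df's Use% divides by used + avail, not by total, so any
       root-reserved blocks inflate the percentage slightly. */
    printf("%llu 1K-blocks, %llu used, %llu available, %.0f%% used\n",
           total, used, avail, 100.0 * used / (used + avail));
    return 0;
}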
Hope this makes things clearer. Let me know if you still have any doubts.

How to select huge page sizes for DPDK and malloc?

We develop a Linux application that uses DPDK and which must also be heavily optimised for speed.
We must specify huge pages for use by DPDK and also for general dynamic memory allocation. For the latter we use the libhugetlbfs library, with a hugetlbfs mount:
sudo mount -t hugetlbfs none /mnt/hugetlbfs
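(For context, here is a minimal sketch of how a program can get huge-page-backed memory directly from the kernel via MAP_HUGETLB -- this is separate from the libhugetlbfs malloc override; the size and flags are illustrative:)

#define _GNU_SOURCE     /* for MAP_HUGETLB on some toolchains */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2UL * 1024 * 1024;   /* one 2 MB huge page (illustrative) */

    /* MAP_HUGETLB asks for hugetlb-backed memory; it fails if no huge
       pages of the default size are reserved by the kernel. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    munmap(p, len);
    return 0;
}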
We specify the huge pages on the boot command line as follows:
hugepagesz=1G hugepages=20 default_hugepagesz=1G
We are using CentOS 7 and the full boot cmdline is:
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet hugepagesz=1G hugepages=20 default_hugepagesz=1G irqaffinity=0,1 isolcpus=4-23 nosoftlockup mce=ignore_ce idle=poll
These values are fairly arbitrary. With these values, I see:
$ free -m
total used free shared buff/cache available
Mem: 47797 45041 2468 9 287 2393
Swap: 23999 0 23999
So about 2.4 GB of RAM is free out of 48 GB. A very large amount of memory is allocated to huge pages, and I want to reduce this.
My question is what would be sensible values for them?
I am confused by the interpretation of the parameters. I see:
$ cat /proc/meminfo
<snip>
HugePages_Total: 43
HugePages_Free: 43
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
and also:
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
43
Why are 43 pages reported when my parameters only specify 20 pages of 1G?
I would like some guidelines on:
huge page size/quantity that I might need for DPDK?
huge page size/quantity that I might need for malloc?
I know these are highly application dependent, but some guidelines would be helpful. Also, how could I detect if the huge pages were insufficient for the application?
Additional info:
$ cat /proc/mounts | grep huge
cgroup /sys/fs/cgroup/hugetlb cgroup rw,seclabel,nosuid,nodev,noexec,relatime,hugetlb 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,seclabel,relatime 0 0
Update 4 March:
My boot cmdline is now:
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet hugepagesz=1G hugepages=20 default_hugepagesz=1G irqaffinity=0,1 isolcpus=4-23 nosoftlockup mce=ignore_ce idle=poll transparent_hugepage=never
and transparent hugepages are disabled (I activated a custom tuned profile):
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
but I still see 43 hugepages:
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
43
whereas I have only specified 20 on the cmdline. Why is this?

How to release hugepages from the crashed application

I have an application that uses hugepages, and it suddenly crashed due to some bug.
After the crash, since the application did not release the hugepages properly, the free hugepage count in the sys filesystem does not go back up.
$ sudo cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
0
$ sudo cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
1024
Is there a way to release the hugepages by force?
Sometimes you need to check every directory where hugetlbfs has been mounted. So:
find the mounted directories with mount | grep huge.
check every such directory, not just /dev/hugepages.
delete all 2M-sized files (2M being the hugepage size).
Use ipcs -m to list the shared memory segments.
Use ipcrm to remove the left over shared memory segments.
Edit on 06/24/2019:
Ok, so, the above answer, while correct as far as it goes, was a bit brief. In particular, if you have a host with multiple DB instances and only one has crashed, how can you determine which (if any) memory segments should be cleaned up?
Well, this too can be done. For each running instance, connect / as sysdba, then do oradebug setmypid (any pid will do, as all Oracle PIDs connect to the SGA). Then do oradebug ipc. That will (hopefully) return the IPC information written to the trace file. So, go to the udump (or diag_dest) directory and look for your trace file. It will contain all the IPC information for the instance, including the ShmId. Look through the file for the ShmId(s) that this instance is using. Now look at the output of ipcs -m.
When you have done that for all the running instances, any memory segment output by ipcs -m that shows non-zero memory allocation, and that you cannot account for in the oradebug ipc information from any running instance, must be the left over memory segments from the crashed instance. Use ipcrm to remove it/them.
When doing this on a host with multiple running instances, this can be a bit fraught. Please proceed with caution. You don't want to remove the SGA of a running instance!
Hope that helps....
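As an aside, ipcrm ultimately boils down to a shmctl(IPC_RMID) call; here is a minimal sketch in C (the segment id 123456 is a placeholder -- substitute the shmid column from ipcs -m):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    int shmid = 123456;  /* placeholder: take this from ipcs -m */

    /* Marks the segment for removal; it is actually destroyed once
       the last attached process detaches (nattch reaches 0). */
    if (shmctl(shmid, IPC_RMID, NULL) != 0) {
        perror("shmctl(IPC_RMID)");
        return 1;
    }
    return 0;
}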
HugeTLB can either be used for shared memory (and Mark J. Bobak's answer would deal with that) or the app mmaps files created in a hugetlb filesystem. If the app crashes without removing those files they survive and keep corresponding memory 'allocated'.
Check hugeTLB filesystem and see if there are any leftover files from the app. Removing them would release the memory.
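For illustration, this is roughly the pattern that leaves such files behind; a minimal sketch, where the mount point /dev/hugepages and the filename myapp_pool are assumptions:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Create a backing file in a hugetlbfs mount (path is an assumption). */
    int fd = open("/dev/hugepages/myapp_pool", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    size_t len = 2UL * 1024 * 1024;   /* one 2 MB huge page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* If the process crashes here, the file -- and the huge pages
       backing it -- stay allocated until someone removes the file. */

    munmap(p, len);
    close(fd);
    unlink("/dev/hugepages/myapp_pool");   /* the cleanup a crash skips */
    return 0;
}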
If you follow the instructions below, you can get rid of the allocated hugepages:
1) Let's check the hugepages which were free at restart
dpdk@dpdkvm:~$ ls /mnt/huge/
empty
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total: 256
HugePages_Free: 256
...
2) Starting a dpdk application with wrong parameters, producing an error
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ sudo ./build/kni -c 0x03 -n 2 -- -P -p 0x03 --config="(0,0,1),(1,0,1)"
...
EAL: Error - exiting with code: 1
Cause: No supported Ethernet device found
3) When I check the hugepages, none are free
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total: 256
HugePages_Free: 0
...
4) Now, when I check the mounted hugepage directory, I can see the files which were not given back to the OS by the dpdk application.
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ ls /mnt/huge/
...
rtemap_0 rtemap_137 rtemap_176 rtemap_214 rtemap_253 rtemap_62
...
5) Finally, if you remove the files starting with rtemap, you give the hugepages back:
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ sudo rm /mnt/huge/*
[sudo] password for dpdk:
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total: 256
HugePages_Free: 256
...
Your hugetlb pages may be used by shared memory or mmapped files.
Try removing the shared memory segments or unmounting the hugetlbfs filesystem.

Is there a way to tell in Linux if a running binary program matches the file on disk?

Suppose a binary executable program is running:
For example:
ps -eaf | grep someServer
shows that someServer is running.
Is it possible to tell if the someServer executable on disk (e.g. /usr/bin/someServer) matches the program that was actually started?
Yes: use the symlink /proc/$pid/exe to get the path which was used to load the code.
Look into /proc/$pid/maps. It will look like this (for /sbin/getty):
00400000-00407000 r-xp 00000000 08:01 3145779 /sbin/getty
00606000-00607000 r--p 00006000 08:01 3145779 /sbin/getty
00607000-00608000 rw-p 00007000 08:01 3145779 /sbin/getty
... lots more ...
Filter the file using the path that you got from the soft link to find the lines that are interesting for you.
The last number (3145779) is the inode of the file. When you create a new file on disk, it gets a new inode.
To see the inode of a file, use ls --inode /sbin/getty:
3145779 /sbin/getty
Since the two numbers are still identical, the executable on disk is the same as in RAM.
Background: Linux doesn't load processes into RAM at once. Instead, the executable file is memory-mapped into RAM using the virtual memory subsystem. This means parts of the executable which you never use will never be loaded into memory. It also means that the kernel uses the executable on disk as a "cache".
When you overwrite the executable on disk, the original inode is not changed. Your existing process hangs on to it. Instead, a new inode is created and the directory node (which contains the file name and a pointer to the inode with the data) is updated. This is why you can overwrite files that are currently in use on Linux.
The original inode will be cleaned up when the last process which uses it dies.
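If you want to script the inode comparison, here is a minimal sketch using stat(2) (the pid 1234 and the path /usr/bin/someServer are placeholders):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat exe, disk;

    /* stat() follows the /proc symlink, so it reports the inode of
       the executable image the process is actually running. */
    if (stat("/proc/1234/exe", &exe) != 0 ||      /* placeholder pid */
        stat("/usr/bin/someServer", &disk) != 0) {
        perror("stat");
        return 1;
    }

    if (exe.st_dev == disk.st_dev && exe.st_ino == disk.st_ino)
        puts("running image matches the file on disk");
    else
        puts("file on disk is not the running image (replaced?)");
    return 0;
}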
I am not sure what you mean exactly.
use pid matching (compare ps output and getpid() in the server)
use /proc/$pid/environ and find _=path_to_someServer (that is the binary's path)
use diff /proc/$pid/exe someServer (run type someServer to get the full path)
If you want to know if any running processes were spawned from a particular executable inode:
# find -L /proc/[0-9]*/exe -samefile /usr/bin/someServer
This command will output a list of /proc/<pid>/exe pathnames, one for each process whose executable image was mapped into memory from the same inode as is presently linked at /usr/bin/someServer.
Note that the command will NOT find processes that were spawned from inodes that were formerly linked at /usr/bin/someServer but have since been unlinked, such as if a newer version of the executable has replaced a running version.

du -skh * in / returns vastly different size from df on centos 5.5

I have a VPS slice running CentOS 5.5. I am supposed to have 15 gigs of disk space, but according to df my disk usage seems doubled.
When I run du -skh * in / as root I get:
[root@yardvps1 /]# du -skh *
0 aquota.group
0 aquota.user
5.2M bin
4.0K boot
4.0K dev
4.9M etc
2.5G home
12M lib
14M lib64
4.0K media
4.0K mnt
299M opt
0 proc
692K root
23M sbin
4.0K selinux
4.0K srv
0 sys
48K tmp
2.0G usr
121M var
This is consistent with what I have uploaded to the machine, and adds up to about 5 gigs.
But when I run df I get:
[root@yardvps1 /]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/simfs 15728640 11659048 4069592 75% /
none 262144 4 262140 1% /dev
It is showing almost 12 gigs used already.
What is causing this discrepancy, and is there anything I can do about it? I planned the server out based on 15 gigs, but now it is basically only letting me have about 7 gigs of stuff on it.
Thanks.
The most common cause of this effect is open files that have been deleted.
The kernel will only free the disk blocks of a deleted file if it is not in use at the time of its deletion. Otherwise that is deferred until the file is closed, or the system is rebooted.
A common Unix-world trick to ensure that no temporary files are left around is the following:
A process creates and opens a temporary file
While still holding the open file descriptor, the process unlinks (i.e. deletes) the file
The process reads and writes to the file normally using the file descriptor
The process closes the file descriptor when it's done, and the kernel frees the space
If the process (or the system) terminates unexpectedly, the temporary file is already deleted and no clean-up is necessary.
As a bonus, deleting the file reduces the chances of naming collisions when creating temporary files and it also provides an additional layer of obscurity over the running processes - for anyone but the root user, that is.
This behaviour ensures that processes don't have to deal with files that are suddenly pulled from under their feet, and also that processes don't have to consult each other in order to delete a file. It is unexpected behaviour for those coming from Windows systems, though, since there you are not normally allowed to delete a file that is in use.
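Here is a minimal sketch of that trick in C (mkstemp supplies the unique name; the /tmp/demo prefix is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char path[] = "/tmp/demo.XXXXXX";
    int fd = mkstemp(path);       /* step 1: create and open a unique file */
    if (fd < 0)
        return 1;

    unlink(path);                 /* step 2: delete the name immediately */

    /* step 3: keep using the descriptor; du no longer sees the file,
       but df still counts its blocks until the descriptor is closed. */
    write(fd, "scratch data\n", 13);

    close(fd);                    /* step 4: last reference gone, space freed */
    return 0;
}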
The lsof command, when run as root, will show all open files and will specifically flag files that have been deleted:
# lsof 2>/dev/null | grep deleted
bootlogd 2024 root 1w REG 9,3 58 917506 /tmp/init.0W2ARi (deleted)
bootlogd 2024 root 2w REG 9,3 58 917506 /tmp/init.0W2ARi (deleted)
Stopping and restarting the guilty processes, or just rebooting the server should solve this issue.
Deleted files could also be held open by the kernel if, for example, it's a mounted filesystem image. In this case unmounting the filesystem or rebooting the server should do the trick.
In your case, judging by the size of the "missing" space I'd look for any references to the file that you used to set up the VPS e.g. the Centos DVD image that you deleted after installing.
Another case which I've come across, although it doesn't appear to be your issue, is when you mount a partition "on top" of existing files.
If you do so you effectively hide existing files that exist in the directory on the mounted-to partition (the mount point) from the mounted partition.
To fix: stop any processes with open files on the mounted partition, unmount partition, find and move/remove any files that now appear in mount point directory.
I had the same trouble with a FreeBSD server. A reboot helped.
