Under Linux, is it possible to gcore a process whose executable has been deleted?

Programming on CentOS 6.6, I deleted an executable (whoops, make clean) while it was running in a screen session.
Now, unrelated, I want to gcore the process to debug something. I have rebuilt the executable, but gcore doesn't accept the replaced file. It knows the original file was deleted and won't let me dump core.
# gcore 15659
core.YGsoec:4: Error in sourced command file:
/home/dev/bin/daemon/destinyd (deleted): No such file or directory.
gcore: failed to create core.15659
# ls -l /proc/15659/exe
lrwxrwxrwx. 1 root root 0 Mar 12 21:33 /proc/15659/exe -> /home/dev/bin/daemon/destinyd (deleted)
# ln -s /proc/15659/exe /home/dev/bin/daemon/destinyd
ln: creating symbolic link `/home/dev/bin/daemon/destinyd': File exists
# rm /proc/15659/exe
rm: remove symbolic link `/proc/15659/exe'? y
rm: cannot remove `/proc/15659/exe': Permission denied
FreeBSD's gcore has an optional argument "executable" which looks promising (as if I could specify a binary to use that is not /proc/15659/exe), but that's of no use to me as Linux's gcore does not have any such argument.
Are there any workarounds? Or will I just have to restart the process (using the recreated executable) and wait for the bug I'm tracking to reproduce itself?

Despite the output of ls -l /proc/15659/exe, the original executable is in fact still available through that path.
So, not only was I able to restore the original file with a simple cp (though this was not enough to restore the link and get gcore to work), but I was able to attach GDB to the process using this path as executable:
# gdb -p 15659 /proc/15659/exe
and then run the "generate-core-file" command, followed by "detach".
Then, I became free to examine the core file as needed:
# gdb /proc/15659/exe core.15659
In truth I had forgotten about the ability of GDB to generate core files, plus I was anxious about actually attaching GDB to the process because timing was quite important: generating the core file at precisely the right time to catch that bug.
But nos steered me back onto this path and, my fears apparently unfounded, GDB was able to produce a lovely core.15659 for me.
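The GDB workaround above can be wrapped in a small shell function. This is a minimal sketch, not a drop-in tool: the name `dump_core_deleted` is hypothetical, and it assumes GDB is installed and that you are allowed to ptrace the target (for example as root, as in the question). Like the answer, it passes `/proc/PID/exe` as the executable, so GDB reads the original binary even though its path shows "(deleted)".

```shell
# Hypothetical helper: dump a core of a running process whose executable
# was deleted, by pointing GDB at /proc/PID/exe instead of the original path.
dump_core_deleted() {
    pid=$1
    # Require a purely numeric PID argument.
    case $pid in
        ''|*[!0-9]*) echo "usage: dump_core_deleted PID" >&2; return 2 ;;
    esac
    # /proc/PID/exe still reaches the original binary even if it shows "(deleted)".
    if [ ! -e "/proc/$pid/exe" ]; then
        echo "no such process, or /proc/$pid/exe not accessible" >&2
        return 1
    fi
    # Attach, write core.PID in the current directory, then detach so the
    # process keeps running.
    gdb -q -batch -p "$pid" \
        -ex "generate-core-file core.$pid" \
        -ex detach \
        "/proc/$pid/exe"
}
```

For example, `dump_core_deleted 15659` writes core.15659 in the current directory. Because /proc/15659/exe disappears once the process exits, it may be worth copying the binary first (`cp /proc/15659/exe destinyd.orig`, a hypothetical file name) so that `gdb destinyd.orig core.15659` still works later.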


wget -o command output generates more than one file, is it possible to get only one?

I am executing "wget -o" and, because the output is bigger than expected, it is split into more than one file. Is there a way to get only one file? If possible, I would prefer to use only the wget command.
The wget command I am executing is:
$ wget -o neighborhoods.json https://raw.githubusercontent.com/mongodb/docs-assets/geospatial/neighborhoods.json
And the multiple output is:
-rw-rw-r-- 1 ubuntu ubuntu 6652 Mar 4 01:15 neighborhoods.json
-rw-rw-r-- 1 ubuntu ubuntu 4137081 Mar 4 01:15 neighborhoods.json.1
Look carefully at the wget output and you will see what it is doing. wget does not split long files; instead, it avoids overwriting files that already exist, creating a new file instead of touching the existing one. Note also that lowercase -o names wget's log file, while uppercase -O names the downloaded document: with -o neighborhoods.json, wget first writes its log to neighborhoods.json and then, finding that name taken, saves the download as neighborhoods.json.1.
Delete neighborhoods.json and neighborhoods.json.1, then run wget again with -O neighborhoods.json; make sure it finishes without problems, and it will create the single file you asked for. If a download without -O is interrupted and you restart it, wget will create a new file (appending .1 and so on).
You can pass the -c option to continue an interrupted download; most of the time it works well (though not always).
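The clean-up-and-retry advice above can be sketched as two small helpers. Both names are hypothetical, not part of wget: `remove_numbered_copies` deletes NAME and the numbered copies (NAME.1, NAME.2, ...) left by earlier runs, and `fetch_single` downloads with uppercase `-O`, which in GNU wget names the output document (lowercase `-o` names the log file).

```shell
# Hypothetical helper: remove NAME and any numbered copies (NAME.1, NAME.2, ...)
# left behind by earlier wget runs.
remove_numbered_copies() {
    name=$1
    [ -n "$name" ] || { echo "usage: remove_numbered_copies NAME" >&2; return 2; }
    rm -f -- "$name"
    for f in "$name".[0-9]*; do
        # If the glob matched nothing, it stays literal; skip it.
        [ -e "$f" ] || continue
        case ${f#"$name".} in
            *[!0-9]*) ;;            # suffix is not purely numeric: leave it alone
            *) rm -f -- "$f" ;;
        esac
    done
}

# Hypothetical helper: download URL into exactly one file, OUT.
# -O (uppercase) sets the output document; -o (lowercase) would set the log file.
fetch_single() {
    url=$1 out=$2
    if [ -z "$url" ] || [ -z "$out" ]; then
        echo "usage: fetch_single URL OUT" >&2
        return 2
    fi
    wget -O "$out" "$url"
}
```

For example, `remove_numbered_copies neighborhoods.json` followed by `fetch_single <URL> neighborhoods.json`. Adding `-c` to the wget call is one way to resume an interrupted download, as described above.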

Linux File descriptors

I have a Java program that, after running for about 2 weeks on average, becomes stuck and produces the following error:
Caused by: java.net.SocketException: Too many open files
at sun.nio.ch.Net.socket0(Native Method)
at sun.nio.ch.Net.socket(Net.java:415)
at sun.nio.ch.Net.socket(Net.java:408)
at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:105)
That hints to me that many sockets are opened but never closed.
Before diving into programmatic instrumentation, I started to inspect what information I could get from Linux itself. I am using Red Hat.
A few questions came up:
Why do the following commands not give the same output?
See:
[ec2-user#ip-172-22-28-102 ~]$ sudo ls /proc/32085/fd | wc -l
592
[ec2-user#ip-172-22-28-102 ~]$ sudo lsof -a -p 32085 | wc -l
655
Is there a way to know from the proc stat info which thread created which file descriptor?
It seems there is not, because if I do the following, I get the same information:
[ec2-user#ip-172-22-28-102 ~]$ sudo ls /proc/32085/task/22386/fd | wc -l
592
[ec2-user#ip-172-22-28-102 ~]$ sudo ls /proc/32085/fd | wc -l
592
The same happens if I go to the thread directly under /proc/.
Thx
Is there a way to know from the proc stat info which thread created which file descriptor?
I am pretty sure the answer here is "no". File descriptors belong to the process, not to the thread that opened them: all threads of the same process share one descriptor table, so every descriptor is visible to all of them.
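To see the shared descriptor table directly, the loop below prints the descriptor count for every thread of a process. A sketch; `fd_count_per_thread` is a hypothetical name. Since threads created with the usual pthreads calls share one table, every line should normally show the same number, apart from descriptors opened or closed while the loop runs.

```shell
# Hypothetical helper: print "TID COUNT" for each thread (task) of process PID.
fd_count_per_thread() {
    pid=$1
    [ -d "/proc/$pid/task" ] || { echo "no such process: $pid" >&2; return 1; }
    for task in /proc/"$pid"/task/*; do
        tid=${task##*/}
        # Each task directory exposes the process's (shared) table via fd/.
        # $(( )) strips any padding that wc may add.
        count=$(( $(ls "$task/fd" 2>/dev/null | wc -l) ))
        echo "$tid $count"
    done
}
```

For example, `sudo sh -c '. ./fdthreads.sh; fd_count_per_thread 32085'` (with the function saved in a hypothetical fdthreads.sh) would list one line per thread of the Java process.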
Why the following commands do not give the same output?
First, the -a argument to lsof appears to be a no-op in this case. Specifically, the man page says that it "causes list selection options to be ANDed, as described above", and you gave only one selection option (-p). So you are really just running:
sudo lsof -p 32085
And that will print things other than open file descriptors (such as memory-mapped files, the current working directory, etc.), while /proc/<PID>/fd contains only open file descriptors. So you're getting different results because you're asking for different information.
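If you want lsof to report roughly the same thing as /proc/PID/fd, one option is lsof's machine-readable field output (`-Ff`), keeping only entries whose FD field is numeric; this drops the cwd, rtd, txt and mem rows. A sketch; both helper names are hypothetical, and the two counts can still differ slightly if descriptors change between the two commands.

```shell
# Hypothetical helper: count numeric file descriptors of PID as seen by lsof.
# -Ff prints one "f<FD>" line per open file: numeric FDs look like "f3",
# while non-descriptor entries look like "fcwd", "frtd", "ftxt", "fmem".
lsof_fd_count() {
    pid=$1
    lsof -p "$pid" -Ff 2>/dev/null | grep -c '^f[0-9]'
}

# Descriptors as listed by the kernel, for comparison.
proc_fd_count() {
    pid=$1
    echo $(( $(ls "/proc/$pid/fd" | wc -l) ))
}
```

For example, `sudo sh -c '. ./fdcount.sh; lsof_fd_count 32085; proc_fd_count 32085'` (hypothetical file name) should print two close numbers, unlike the 592/655 pair above.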
Getting that message means the process has reached its limit on open files, and the usual reason is that files are opened and never closed after use: you have a file descriptor leak in your Java application. Java programmers normally don't worry about memory, because the garbage collector copes with unreferenced objects, but open files are a different matter. If you keep open descriptors in some data structure, or don't close files after use, you can reach the maximum allowed for a process (this limit is set per process and can be changed with the ulimit shell command).
But if your problem is a file descriptor leak, raising the ulimit will only delay the problem for a while. File descriptors must be closed, or you'll run into trouble.
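To check whether descriptors really accumulate toward the limit, you can compare the current descriptor count of the process with its soft limit. A sketch; `fd_usage` is a hypothetical name. It reads the "Max open files" row of /proc/PID/limits, whose columns are the soft limit, the hard limit and the unit.

```shell
# Hypothetical helper: print "OPEN/SOFT_LIMIT" for process PID.
fd_usage() {
    pid=$1
    [ -r "/proc/$pid/limits" ] || { echo "cannot read limits for $pid" >&2; return 1; }
    # Number of entries in the descriptor table ($(( )) strips wc padding).
    open=$(( $(ls "/proc/$pid/fd" | wc -l) ))
    # Row looks like: "Max open files   1024   4096   files"; field 4 is the soft limit.
    soft=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "$open/$soft"
}
```

For example, `fd_usage 32085` could be run every few minutes (say, from a `while sleep 300` loop); a count that keeps climbing toward the limit over days is consistent with the leak described above.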
I ran across this difference just today; the explanation is that lsof takes more types of files into account, such as memory-mapped objects, run-time libraries, etc.

Busybox init does not start /etc/init.d/rcS

I'm trying to build an embedded system using Buildroot. Everything seems to work: all modules are starting and the system is stable. The problem is that /etc/init.d/rcS does not run during system initialization. If I run it manually, everything is OK. I have it in my inittab file.
# /etc/inittab
#
# Copyright (C) 2001 Erik Andersen <andersen#codepoet.org>
#
# Note: BusyBox init doesn't support runlevels. The runlevels field is
# completely ignored by BusyBox init. If you want runlevels, use
# sysvinit.
#
# Format for each entry: <id>:<runlevels>:<action>:<process>
#
# id == tty to run on, or empty for /dev/console
# runlevels == ignored
# action == one of sysinit, respawn, askfirst, wait, and once
# process == program to run
# Startup the system
null::sysinit:/bin/mount -t proc proc /proc
null::sysinit:/bin/mount -o remount,rw /
null::sysinit:/bin/mkdir -p /dev/pts
null::sysinit:/bin/mkdir -p /dev/shm
null::sysinit:/bin/mount -a
null::sysinit:/bin/hostname -F /etc/hostname
# now run any rc scripts
::sysinit:/etc/init.d/rcS
# Put a getty on the serial port
ttyFIQ0::respawn:/sbin/getty -L -n ttyFIQ0 115200 vt100 # GENERIC_SERIAL
# Stuff to do for the 3-finger salute
::ctrlaltdel:/sbin/reboot
# Stuff to do before rebooting
null::shutdown:/etc/init.d/rcK
null::shutdown:/bin/umount -a -r
null::shutdown:/sbin/swapoff -a
Any idea what could be wrong?
/bin/init needs to be on your filesystem.
/bin/sh needs to be on your filesystem.
/etc/init.d/rcS needs to be executable and have #!/bin/sh as its first line.
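These requirements can be checked against the target root filesystem before booting it. A sketch; `check_rcs` and `exists_in_root` are hypothetical names, and ROOT is assumed to be an extracted copy of the target filesystem. Symlinks (such as /bin/sh pointing at /bin/busybox) are accepted without being resolved, because their absolute targets only make sense on the target. It looks for init at /init, /sbin/init or /bin/init, the locations mentioned in this thread.

```shell
# True if the path exists, or is a symlink (possibly dangling on the host).
exists_in_root() {
    [ -e "$1" ] || [ -L "$1" ]
}

# Hypothetical helper: check ROOT for init, /bin/sh, and an executable
# /etc/init.d/rcS with a #! line whose interpreter exists in ROOT.
# Prints one line per problem; returns 0 only if everything looks right.
check_rcs() {
    root=${1%/}
    rcs="$root/etc/init.d/rcS"
    ok=0
    if ! exists_in_root "$root/init" && ! exists_in_root "$root/sbin/init" \
        && ! exists_in_root "$root/bin/init"; then
        echo "missing: init (/init, /sbin/init or /bin/init)"
        ok=1
    fi
    exists_in_root "$root/bin/sh" || { echo "missing: /bin/sh"; ok=1; }
    if [ ! -f "$rcs" ]; then
        echo "missing: /etc/init.d/rcS"
        return 1
    fi
    [ -x "$rcs" ] || { echo "not executable: /etc/init.d/rcS"; ok=1; }
    first=$(head -n 1 "$rcs")
    case $first in
        '#!'*)
            # Split "#!/bin/sh" or "#!/bin/busybox sh" into words; $1 is the interpreter.
            set -- ${first#??}
            if [ -z "$1" ] || ! exists_in_root "$root$1"; then
                echo "interpreter not found in root: $1"
                ok=1
            fi
            ;;
        *)
            echo "no #! line in /etc/init.d/rcS"
            ok=1
            ;;
    esac
    return $ok
}
```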
Init
Are you sure you were invoking Busybox init? What was the kernel command line? If no init= option was supplied to the kernel, the kernel will look for an executable at /init.
For instance, if your busybox binary resides in /bin/busybox, you need to create the following symlink:
ln -s /bin/busybox /init
If you want your init to reside in /sbin, to comply with the inittab, also create a symlink there. Note that the kernel will not respect the init= setting if you don't mount root and your busybox only runs in an initramfs.
ln -s /bin/busybox /sbin/init
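When the root filesystem is prepared on a build host, the two symlinks above can be created inside the staging directory rather than on the host itself. A sketch; `make_init_links` is a hypothetical name, and it refuses an empty root or / so it cannot replace the host's own init by accident. The link target is the absolute path /bin/busybox on purpose: it is resolved on the target at boot, not on the host.

```shell
# Hypothetical helper: create /init and /sbin/init -> /bin/busybox inside ROOT.
make_init_links() {
    root=${1%/}
    # An empty result means "" or "/" was given: refuse to touch the host.
    [ -n "$root" ] || { echo "usage: make_init_links ROOT (not /)" >&2; return 2; }
    mkdir -p "$root/sbin" || return 1
    ln -sf /bin/busybox "$root/init" || return 1
    ln -sf /bin/busybox "$root/sbin/init"
}
```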
Inittab
Also, you could try not using an inittab. The things you run from inittab might very well fit in rcS and any scripts it calls. From the same source you found your example inittab:
# Note: BusyBox init works just fine without an inittab. If no inittab is
# found, it has the following default behavior:
# ::sysinit:/etc/init.d/rcS
# ::askfirst:/bin/sh
# ::ctrlaltdel:/sbin/reboot
# ::shutdown:/sbin/swapoff -a
# ::shutdown:/bin/umount -a -r
# ::restart:/sbin/init
# tty2::askfirst:/bin/sh
# tty3::askfirst:/bin/sh
# tty4::askfirst:/bin/sh
rcS
Make sure /etc/init.d/rcS is executable:
chmod +x /etc/init.d/rcS
And try with:
#!/bin/busybox sh
echo "Hello world!"
Please note that this sentence can get buried between kernel log messages, so you might want to pass the quiet kernel command line option to see if it appears.
Busybox symlinks
Are the symlinks installed into the file system or not? If not, it is not a disaster. Make sure that /etc/init.d/rcS starts with:
#!/bin/busybox sh
mkdir -pv /sbin
/bin/busybox --install -s
In addition to the scripts themselves being executable and having a correct shebang line, the kernel also needs to be compiled with the CONFIG_BINFMT_SCRIPT option enabled.
CONFIG_BINFMT_SCRIPT:
Say Y here if you want to execute interpreted scripts starting with
#! followed by the path to an interpreter.
You can build this support as a module; however, until that module
gets loaded, you cannot run scripts. Thus, if you want to load this
module from an initramfs, the portion of the initramfs before loading
this module must consist of compiled binaries only.
Most systems will not boot if you say M or N here. If unsure, say Y.
Without this option, you may receive the error message can't run '/etc/init.d/rcS': Exec format error.
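Whether the option is built in can be checked from the kernel configuration. A sketch; `binfmt_script_status` is a hypothetical name, and it assumes you have either the kernel build's .config or, on a running kernel built with CONFIG_IKCONFIG_PROC, /proc/config.gz. It prints `built-in`, `module` or `missing`, and succeeds only for `built-in`, since per the help text above a module cannot help until it is loaded.

```shell
# Hypothetical helper: report how CONFIG_BINFMT_SCRIPT is set in a kernel config.
# Accepts a plain .config or a gzip-compressed one (name ending in .gz).
binfmt_script_status() {
    cfg=$1
    [ -r "$cfg" ] || { echo "cannot read $cfg" >&2; return 2; }
    # Matches "CONFIG_BINFMT_SCRIPT=y/m" and "# CONFIG_BINFMT_SCRIPT is not set".
    pattern='^(# )?CONFIG_BINFMT_SCRIPT[= ]'
    case $cfg in
        *.gz) line=$(gzip -dc "$cfg" | grep -E "$pattern") ;;
        *)    line=$(grep -E "$pattern" "$cfg") ;;
    esac
    case $line in
        CONFIG_BINFMT_SCRIPT=y) echo built-in ;;
        CONFIG_BINFMT_SCRIPT=m) echo module; return 1 ;;
        *) echo missing; return 1 ;;
    esac
}
```

For example, `binfmt_script_status /proc/config.gz` on the running target.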
From the information given, everything looks correct.
Some things to try:
Check ownership of your rcS script.
Comment out everything from rcS, and add something very simple:
echo "This worked" > /tmp/test
There might be something in your script related to a startup race condition that is causing it to exit. I am also curious whether your script starts syslogd.

Why is gdb requiring root permission to debug user programs?

I have been using gdb quite successfully for a while, but I recently upgraded my version of Ubuntu, and now it seems that I can only get gdb to successfully run my program if I run as root. That is,
~ % gdb -q sleep -ex 'run 60'
Reading symbols from /bin/sleep...(no debugging symbols found)...done.
Starting program: /bin/sleep 60
tcsh: Permission denied.
During startup program exited with code 1.
(gdb)
fails, whereas
~ % sudo gdb -q sleep -ex 'run 60'
Reading symbols from /bin/sleep...(no debugging symbols found)...done.
Starting program: /bin/sleep 60
Running .tcshrc
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7adada0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:82
82 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb)
works. One clue is that in the first case, the gdb startup doesn't run my .tcshrc file, whereas in the second case it does.
It seems that this is a simple permissions issue, which I must have fixed at one time, because in the past, I have never needed to run gdb as root. After much googling, however, I wasn't able to find what I might have done (if I did in fact do something). One possible fix - set ptrace permissions - didn't seem to work.
Is there something that needs to be done to allow gdb to run programs without root privileges? I know in OSX, gdb has to be codesigned. Is there something similar for Ubuntu/Linux?
Here are some ideas for debugging the gdb problem. Comments aren't really feasible for such things, so I put them into an answer.
Try the -n option to make sure no init file is loaded.
Use the echo program instead of sleep 60 to make debugging simpler (the SIGINT thing in your example is probably specific to the sleep program).
Run gdb -batch and put the rest into ~/.gdbinit:
file /bin/echo
run
Add set verbose on.
Don't forget to clean up ~/.gdbinit when done.
I changed my login shell to bash and gdb no longer needs root permission to debug. Here is the latest:
My .gdbinit file:
(bash) ~ % more .gdbinit
show environment SHELL
file /bin/echo
run 'running .gdbinit'
(bash) ~ %
and the results of running gdb :
(bash) ~ % gdb -q -batch
SHELL = /bin/bash
running .gdbinit
[Inferior 1 (process 3174) exited normally]
(bash) ~ %
I still don't understand why tcsh didn't work, though, and am curious to know. So if anyone has a possible explanation, please comment.
This isn't a complete answer, but things are becoming clearer. The tip above was very helpful. I created the following .gdbinit file
show environment SHELL
file /bin/echo
run 'Goodbye'
and the results were interesting. If SHELL=/bin/tcsh, I get a permissions error, i.e.
~ % setenv SHELL /bin/tcsh
~ % gdb -q -batch
SHELL = /bin/tcsh
tcsh: Permission denied.
/home/calhoun/.gdbinit:12: Error in sourced command file:
During startup program exited with code 1.
Unsetting the SHELL environment variable works:
~ % unsetenv SHELL
~ % gdb -q -batch
Environment variable "SHELL" not defined.
Goodbye
[Inferior 1 (process 6992) exited normally]
In this case, run uses /bin/sh to expand the argument list. Setting SHELL to /bin/bash or /bin/dash will use those shells to expand the argument list, e.g.
~ % setenv SHELL /bin/bash
~ % gdb -q -batch
SHELL = /bin/bash
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
Goodbye
[Inferior 1 (process 7280) exited normally]
Oddly, the "no loadable sections" warning only appears when the SHELL variable is explicitly set. Another mystery.
Why /bin/tcsh doesn't work is still baffling. In my case, the permissions on /bin/tcsh are:
~ % ls -lh /bin/tcsh
lrwxrwxrwx 1 root root 13 Oct 14 2011 /bin/tcsh -> /usr/bin/tcsh
~ % ls -lh /usr/bin/tcsh
-rwxr-xr-x 1 root root 382K Oct 14 2011 /usr/bin/tcsh
The problem could also be something in my .tcshrc file that is causing the shell to crash in this non-interactive mode.
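The experiments above suggest that GDB's `run` starts the program through the shell named in `$SHELL`, so if that shell fails to start (here, tcsh exits with "Permission denied"), the program never runs. A minimal workaround sketch, assuming a POSIX /bin/sh exists: override SHELL for the GDB process only. The wrapper name `gdb_sh` is hypothetical. GDB also has a `set startup-with-shell off` setting that skips the shell entirely, at the cost of shell-style argument expansion.

```shell
# Hypothetical wrapper: run GDB with SHELL=/bin/sh, so that "run" starts the
# program (and expands its arguments) with sh instead of the login shell.
gdb_sh() {
    env SHELL=/bin/sh gdb "$@"
}
```

For example, `gdb_sh -q -batch -ex run --args /bin/echo Goodbye`.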
Check whether your program has execute permission.
ls -l .
-rw-r--r-- 1 opt opt 30010 Aug 16 16:13 test
A file like 'test' above, with no execute bit, could cause that error when you debug it.
I specified Linux in the launch.json file, via the LaunchCompleteCommand...
Microsoft provides three platform-specific sections for this, one per operating system: windows, linux and osx.
From https://code.visualstudio.com/docs/cpp/launch-json-reference
"launchCompleteCommand": "exec-run",
"linux": {
"MIMode": "gdb",
"miDebuggerPath": "/usr/bin/gdb"
},
In my launch.json file, this LaunchCompleteCommand section was pasted in after the setupCommands section.
I faced this issue before: the /usr/bin/gdb* files didn't have execute permission. This happened after I installed peda, pwndbg and gef.
# chmod +x /usr/bin/gdb-*
I had the same problem, and the reason was that someone had set the setuid and setgid bits on the gdb executable:
cruiz> ls -l /usr/bin/gdb
-rwsr-sr-x 1 root root 4190760 2010-05-05 07:55 /usr/bin/gdb*
I changed it (chmod 755 /usr/bin/gdb) and now it works.
Before:
cruiz> gdb
...
(gdb) shell
csh: Permission denied.
After the change:
cruiz> gdb
(gdb) shell
cruiz>
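The two permission problems described in these answers (gdb helper scripts without an execute bit, and a gdb binary with setuid/setgid bits) can be spotted with a short loop. A sketch; `audit_gdb_perms` is a hypothetical name, and the default directory /usr/bin is an assumption about where gdb is installed.

```shell
# Hypothetical helper: list gdb* files in DIR (default /usr/bin) that are not
# executable, or that carry setuid/setgid bits.
audit_gdb_perms() {
    dir=${1:-/usr/bin}
    for f in "$dir"/gdb*; do
        # Skip the literal pattern (no match) and anything that is not a file.
        [ -f "$f" ] || continue
        if [ ! -x "$f" ]; then
            echo "not executable: $f"
        fi
        if [ -u "$f" ] || [ -g "$f" ]; then
            echo "setuid/setgid: $f"
        fi
    done
}
```

A file reported as "not executable" can be fixed with `chmod +x`, and one reported as "setuid/setgid" with `chmod 755`, as in the answers above.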

Buildroot - built a file system, how to login? boot hangs

Can someone help me understand how I need to configure Buildroot, so that I will be able to successfully boot my own file system and log in to it?
I have a (seemingly) working kernel, and now I created my own file system (I didn't really change any settings in Buildroot, except setting the console to ttyAMA0), but the boot process just seems to hang at this point:
....
[ 3.130000] VFS: Mounted root (ext3 filesystem) on device 179:2.
[ 3.140000] Freeing init memory: 144K
Starting logging: OK
Starting network...
ip: RTNETLINK answers: Operation not permitted
ip: SIOCSIFFLAGS: Permission denied
Whole boot log is visible here: http://paste.ubuntu.com/1364407/
I understand that /etc/inittab controls the boot process, the contents looks like this:
# Startup the system
null::sysinit:/bin/mount -t proc proc /proc
null::sysinit:/bin/mount -o remount,rw / # REMOUNT_ROOTFS_RW
null::sysinit:/bin/mkdir -p /dev/pts
null::sysinit:/bin/mkdir -p /dev/shm
null::sysinit:/bin/mount -a
null::sysinit:/bin/hostname -F /etc/hostname
# now run any rc scripts
::sysinit:/etc/init.d/rcS
# Put a getty on the serial port
ttyAMA0::respawn:/sbin/getty -L ttyAMA0 115200 vt100 # GENERIC_SERIAL
# Stuff to do for the 3-finger salute
::ctrlaltdel:/sbin/reboot
# Stuff to do before rebooting
null::shutdown:/etc/init.d/rcK
null::shutdown:/bin/umount -a -r
null::shutdown:/sbin/swapoff -a
Any advice on what is wrong in my configuration ?
Any tips on where I could get a good overview of "the usual necessary configurations" needed when creating my own linux system ?
This problem was raised by the submitter on the Buildroot mailing list. The cause was that the submitter was using the contents of Buildroot's output/target directory directly as the root filesystem, even though the Buildroot documentation explicitly says not to do so. This is because Buildroot does not run as root, and therefore cannot create device files or adjust permissions and ownership properly in output/target. These steps are done when creating the root filesystem images, thanks to a magic tool called fakeroot.
Therefore, if you want to extract the root filesystem onto an SD card partition or something similar, ask Buildroot to generate a tar image, and then extract it as root onto the SD card partition.
Since this problem was quite common, we have now added a file in output/target called THIS_IS_NOT_YOUR_ROOT_FILESYSTEM which contains details about this issue. See http://git.buildroot.net/buildroot/commit/?id=9226a9907c4eb0fffab777f50e88b74aa14d1737.
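Extracting the tar image as root could look like the sketch below. It assumes the tar filesystem image is enabled in Buildroot (which typically writes output/images/rootfs.tar) and that the SD card partition is already mounted; the function name and the mount point /mnt/sdcard are hypothetical.

```shell
# Hypothetical helper: extract a Buildroot root filesystem tarball onto a
# mounted partition. Must run as root so that device nodes, ownership and
# permissions recorded in the tarball are recreated.
extract_rootfs() {
    tarball=$1 dest=$2
    if [ -z "$tarball" ] || [ -z "$dest" ]; then
        echo "usage: extract_rootfs ROOTFS_TAR MOUNTPOINT" >&2
        return 2
    fi
    [ -f "$tarball" ] || { echo "no such tarball: $tarball" >&2; return 1; }
    [ -d "$dest" ] || { echo "no such directory: $dest" >&2; return 1; }
    [ "$(id -u)" -eq 0 ] || { echo "run as root" >&2; return 1; }
    # -p keeps permissions; when run as root, GNU tar also restores ownership.
    tar -C "$dest" -xpf "$tarball"
}
```

For example, as root: `extract_rootfs output/images/rootfs.tar /mnt/sdcard`.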
