Advanced pgrep-like process search in bash - linux

I need to find the pid of a certain java process in bash on linux.
If there's only one java process,
PID=$(pgrep java)
works.
For multiple java processes it becomes more complicated. Manually, I run pstree, find the ancestor of the java process that I need first, then find the java process in question. Is it possible to do this in bash? Basically I need the functionality that in pseudo-code looks like:
Having `processname1` and `processname2`
and knowing that `processname2` is in the subtree of `processname1`,
find the pid of `processname2`.
In this example the java process will be processname2.

Reformulating your pseudo-code question: find all processname2 processes which have a processname1 process as parent. This can be directly expressed using the following nested pgrep call:
pgrep -P $(pgrep -d, processname1) processname2
Here's the documentation for those flags straight from the pgrep(1) manpage:
-d delimiter
Sets the string used to delimit each process ID in the output
(by default a newline).
-P ppid,...
Only match processes whose parent process ID is listed.
Note that this will only work if processname2 is an immediate child process of processname1.
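When processname2 sits deeper in the subtree rather than directly under processname1, one option is to walk the tree recursively with pgrep -P. The following is only a sketch: the function names descendants and find_in_subtree are made up for illustration, and names are compared against the kernel's short command name as reported by ps -o comm=, which Linux truncates to 15 characters.

```shell
#!/usr/bin/env bash
# descendants PID: print the PID of every descendant of PID, depth-first.
# (Hypothetical helper, not part of pgrep.)
descendants() {
    local child
    for child in $(pgrep -P "$1"); do
        echo "$child"
        descendants "$child"
    done
}

# find_in_subtree ANCESTOR TARGET: print PIDs of processes named TARGET that
# have a process named ANCESTOR somewhere above them in the process tree.
find_in_subtree() {
    local ancestor pid
    for ancestor in $(pgrep -x "$1"); do
        for pid in $(descendants "$ancestor"); do
            if [ "$(ps -o comm= -p "$pid")" = "$2" ]; then
                echo "$pid"
            fi
        done
    done | sort -un   # a PID can be reached via several nested ancestors
}

# Example call with the placeholder names from the question:
find_in_subtree processname1 processname2
```

Because the tree is read in several steps, short-lived processes can appear or vanish between calls, so the result is a snapshot rather than a guarantee.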

Related

How to reference a process reliably (using a tag or something similar)?

I have multiple processes (web-scrapers) running in the background (one scraper for each website). The processes are python scripts that were spawned/forked a few weeks ago. I would like to control (they listen on sockets to enable IPC) them from one central place (kinda like a dispatcher/manager python script), while the processes (scrapers) remain individual unrelated processes.
I thought about using the PID to reference each process, but that would require storing the PID whenever I (re)launch one of the scrapers because there is no semantic relation between a number and my use case. I just want to supply some text-tag along with the process when I launch it, so that I can reference it later on.
pgrep -f searches all processes by their name and calling pattern (including arguments).
E.g. if you spawned a process as python myscraper --scrapernametag=uniqueid01 then you can run:
TAG=uniqueid01; pgrep -f "scrapernametag=$TAG"
to discover the PID of a process later down the line.

How can a shell have more than one job in Linux

I'm a beginner in Linux and have some questions about jobs and process groups.
My textbook says: 'Unix shells use the abstraction of a job to represent the processes that are created as a result of evaluating a single command line. At any point in time, there is at most one foreground job and zero or more background jobs.'
Let's say we have this simple shell code (I've left out some unimportant code, e.g. setting up argv):
When we type the first command, for example: ./exampleProgram &
Q1 - Is a job created? If yes, what processes does the job contain?
The main() of shellex.c is invoked, so when line 15 executes, fork() creates a new child process. Let's say the parent process is p1 and the newly created child process is c1.
If it is a background job, we can type another command at the prompt, ./anotherProgram &, and press Enter.
Q2 - I'm pretty sure there are just p1 and c1 at the moment, and p1 is executing the second command. When it executes line 15 (fork()) again,
we have p1, c1 and the newly created c2. Is my understanding correct?
Q3 - How many jobs are there? Is it just one job that contains p1, c1, and c2?
Q4 - If it is only one job, then as we keep typing new commands we will only ever have one job containing one parent process p1 and many child processes c1, c2, c3, c4...
So why does my textbook say that a shell can have more than one job? And why is there at most one foreground job and zero or more background jobs?
There is quite a bit to say on this topic, some of which can fit in an answer, and most of which will require further reading.
For Q1, I would say conceptually yes, but jobs are not automatic, and job tracking and control are not magical. I don't see any logic in the code snippets you've shown that, e.g., establishes and maintains a jobs table. I understand it's just a sample, so maybe the job control logic is elsewhere. Job control is a feature of common, existing Unix shells, but if a person writes a new Unix shell from scratch, job control features would need to be added, as code / logic.
For Q2, the way you've put it is not how I would put it. After the first call to fork(), yes there is a p1 and a c1, but recognize that at first, p1 and c1 are different instances of the same program (shellex); only after the call to execve() is exampleProgram running. fork() creates a child instance of shellex, and execve() causes the child instance of shellex to be replaced (in RAM) by exampleProgram (assuming that's the value of argv[0]).
There is no real sense in which the parent is "executing" the child, nor the process that replaces the child upon execve(), except just to get them going. The parent starts the child and might wait for the child execution to complete, but really a parent and its whole hierarchy of child processes are all executing each on its own, being executed by the kernel.
But yes, if told that the program to run should be run in the background, then shellex will accept further input, and upon the next call to fork(), there will be the parent shellex with two child processes. And again, at first the child c2 will be an instance of shellex, quickly replaced via execve() by whatever program has been named.
(Regarding running in the background, whether or not & has that effect depends upon the logic inside the function named parseline() in the sample code. Shells I'm familiar with use & to say "run this in the background", but there is nothing special nor magical about that. A newly-written Unix shell can do it some other way, with a trailing +, or a leading BG:, or whatever the shell author decides to do.)
For Q3 and Q4, the first thing to recognize is that the parent you are calling p1 is the shell program that you've shown. So, no, p1 would not be part of the job.
In Unix, a job is a collection of processes that execute as part of a single pipeline. Thus a job can consist of one process or many. Such processes remain attached to the terminal from which they are run, but might be in the foreground (running and interactive), suspended, or in the background (running, not interactive).
one process, foreground : ls -lR
one process, background : ls -lR &
one process, background : ls -lR, then CTRL-Z, then bg
many processes, foreground : ls -lR | grep perl | sed 's/^.*\.//'
many processes, background : ls -lR | grep perl | sed 's/^.*\.//' &
To see jobs vs. processes empirically, run a pipeline in the background (the 5th of the 5 examples above), and while it is running use ps to show you the process IDs and the process group IDs. e.g., on my Mac's version of bash, that's:
$ ls -lR | grep perl | sed 's/^.*\.//' &
[1] 2454 <-- job 1, PID of the sed is 2454
$ ps -o command,pid,pgid
COMMAND PID PGID
vim 2450 2450 <-- running in a different tab
ls -lR 2452 2452 }
grep perl 2453 2452 }-- 3 PIDs, 1 PGID
sed s/^.*\.// 2454 2452 }
In contrast to this attachment to the shell and the terminal, a daemon detaches from both. When starting a daemon, the parent uses fork() to start a child process, but then exits, leaving only the child running, and now with a parent of PID 1. The child closes stdin, stdout, and stderr, which are meaningless for a daemon because it runs "headless".
But in a shell, the parent -- which, again, is the shell -- stays running either wait()ing (foreground child program), or not wait()ing (background child program), and the child typically retains use of stdin, stdout, and stderr (although, these might be redirected to files, etc.)
And a shell can invoke sub-shells, and of course any program that is run can fork() its own child processes, and so on. So the hierarchy of processes can become quite deep. Without specific action otherwise, a child process will be in the same process group as its parent.
Here are some articles for further reading:
What is difference between a job and a process in Unix?
https://unix.stackexchange.com/questions/4214/what-is-the-difference-between-a-job-and-a-process
https://unix.stackexchange.com/questions/363126/why-is-process-not-part-of-expected-process-group
Bash Reference Manual; Job Control
Bash Reference Manual; Job Control Basics
A job is not a Linux thing, it's not a background process, it's something your particular shell defines to be a "job".
Typically a shell introduces the notion of "job" to do job control. This normally includes a way to identify a job and perform actions on it, like
bring into foreground
put into background
stop
resume
kill
If a shell has no means to do any of this, it makes little sense to talk about jobs.
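For a shell that does implement job control, such as bash, the actions listed above can be tried out with a short script. This is a sketch, not a reference: it assumes bash, turns job control on explicitly with set -m (it is normally enabled only in interactive shells), and uses sleep as a stand-in for real work. The variable name job_count is just illustrative.

```shell
#!/usr/bin/env bash
set -m                          # enable job control in a non-interactive shell

sleep 100 &                     # becomes job 1
sleep 200 &                     # becomes job 2

# Count the jobs the shell is currently tracking.
job_count=0
for pid in $(jobs -p); do
    job_count=$((job_count + 1))
done
echo "jobs tracked: $job_count"

jobs -l                         # list the jobs together with their PIDs

kill -STOP %1                   # stop (suspend) job 1
kill -CONT %1                   # resume job 1 in the background
kill %1 %2                      # terminate both jobs
wait                            # reap them
```

In an interactive session the same jobs could also be moved with fg %1 and bg %1; those two builtins need a controlling terminal, which is why the sketch leaves them out.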

How can I identify a program by pid

My title is not very explicit, so feel free to change it (I don't really know how to name it).
I use a PHP script to check whether a list of PIDs is running. My issue is that identifying a process by PID is not enough: some other program can get the same PID number later on, once mine has finished.
So, is there something I can do to verify that a PID is the one I need to check and not another process?
I thought about hashing /proc/<pid>/cmdline, but even that is not 100% safe (another process can be the same software with the same parameters; it's rare but possible).
If an example is needed:
I run several instances of wget
one of them has PID number 8426
some time later…
I check whether PID 8426 is running. It is, so my PHP script reacts accordingly and doesn't check the downloaded file. But in fact the wget with PID 8426 has finished, and another program is now running with PID 8426.
If the new program runs for a long time (e.g. a service), I can wait a long time before my PHP script checks the downloaded file.
Have you tried employing an object-oriented paradigm, where you could encapsulate the specific PID number into its specific object (i.e., specific program)? To accomplish this, you need to create a class (say you give it the arbitrary name "SOURCE") from which these programs can be obtained as objects belonging to that class. Doing so will encapsulate any information (e.g., PID), including the methods of that specific program to that program alone and, therefore, provide a safer way than doing a hash. Similar methods can be found in the object-oriented programming paradigm of Python.
You can read the binary file that /proc/<pid>/exe points to. The following is done in a shell, but the same can probably be done in any language, including PHP:
$ readlink "/proc/$$/exe"
/bin/bash
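The executable path alone cannot tell two runs of the same program apart, which is the wget case from the question. One possible refinement is to store the process start time together with the PID: on Linux, field 22 of /proc/<pid>/stat is the start time in clock ticks since boot, so a recycled PID shows a different value. The helper name proc_identity below is hypothetical and the whole block is a sketch under that assumption.

```shell
#!/usr/bin/env bash
# proc_identity PID: print "PID:STARTTIME", or fail if the process is gone.
# Hypothetical helper; Linux-only because it reads /proc.
proc_identity() {
    local pid=$1 stat rest
    stat=$(cat "/proc/$pid/stat" 2>/dev/null) || return 1
    # The second field (comm) is parenthesised and may itself contain spaces
    # or ')', so cut everything up to the last ") " before splitting.
    rest=${stat##*") "}
    set -- $rest
    # Overall field 22 (starttime) is the 20th field once pid and comm are removed.
    echo "$pid:${20}"
}

# Record the identity at launch time ...
sleep 60 &
saved=$(proc_identity "$!")

# ... and later compare it before trusting the PID.
if [ "$(proc_identity "${saved%%:*}")" = "$saved" ]; then
    echo "still the same process"
else
    echo "process gone or PID reused"
fi
kill "${saved%%:*}"
```

The same two values can be read from PHP by parsing /proc/<pid>/stat in the same way; the stored string then replaces the bare PID in the list being checked.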

How to set process ID in Linux for a specific program

I was wondering if there is some way to force Linux to use a specific process ID for an application before running it. I need to know the process ID in advance.
Actually, there is a way to do this. Since kernel 3.3 with CONFIG_CHECKPOINT_RESTORE set (which is set in most distros), there is /proc/sys/kernel/ns_last_pid, which contains the last PID generated by the kernel. So, if you want to set the PID for a forked program, you need to perform these actions:
Open /proc/sys/kernel/ns_last_pid and get fd
flock it with LOCK_EX
write PID-1
fork
Voilà! Child will have PID that you wanted.
Also, don't forget to unlock (flock with LOCK_UN) and close ns_last_pid.
You can check out the C code on my blog here.
As many already suggested you cannot set directly a PID but usually shells have facilities to know which is the last forked process ID.
For example, in bash you can launch an executable in the background (appending &) and find its PID in the variable $!.
Example:
$ lsof >/dev/null &
[1] 15458
$ echo $!
15458
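On CentOS 7.2 you can simply do the following from a root shell. (Prefixing only echo with sudo is not enough, because the redirection into ns_last_pid is performed by the calling shell, not by sudo.)
Let's say you want to execute the sleep command with a PID of 1894.
echo 1893 > /proc/sys/kernel/ns_last_pid; sleep 1000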
(However, keep in mind that if by chance another process executes in the extremely brief amount of time between the echo and sleep command you could end up with a PID of 1895+. I've tested it hundreds of times and it has never happened to me. If you want to guarantee the PID you will need to lock the file after you write to it, execute sleep, then unlock the file as suggested in Ruslan's answer above.)
There's no way to force a specific PID for a process. As Wikipedia says:
Process IDs are usually allocated on a sequential basis, beginning at
0 and rising to a maximum value which varies from system to system.
Once this limit is reached, allocation restarts at 300 and again
increases. In Mac OS X and HP-UX, allocation restarts at 100. However,
for this and subsequent passes any PIDs still assigned to processes
are skipped
You could just repeatedly call fork() to create new child processes until you get a child with the desired PID. Remember to call wait() often, or you will hit the per-user process limit quickly.
This method assumes that the OS assigns new PIDs sequentially, which appears to be the case eg. on Linux 3.3.
The advantage over the ns_last_pid method is that it doesn't require root permissions.
Every process on a linux system is generated by fork() so there should be no way to force a specific PID.
From Linux 5.5 you can pass an array of PIDs to the clone3 system call to be assigned to the new process, up to one for each nested PID namespace, from the inside out. This requires CAP_SYS_ADMIN or (since Linux 5.9) CAP_CHECKPOINT_RESTORE over the PID namespace.
If you are not concerned with PID namespaces, use an array of size one.

Confusion with pid's and processes on linux

From reading docs and online, most people say that to kill a process in Linux, only the command kill "pid" is needed.
For example, to kill memcached you would run kill $(cat memcached.pid)
But for pretty much every process that I've tried to kill, including the one above, this would not work. I managed to get it to work with a different command:
ps aux | grep (process name here)
That command, for whatever reason, would give a different pid, which would work when killing the program.
I guess my question is: why are there different pids? Isn't the point of an id to be unique? Why do celery, memcached, and other processes all have a different pid when using the ps aux | grep command versus the pid in the .pid file? Is this some kind of error in my configuration, or is it meant to be like this?
Also, where is it possible to get all arguments and descriptions for an executable in Linux?
I know the "man" command is useful for some programs, but it won't work for many executables, like celery for example.
Thanks!
The process ID (pid) is assigned by the operating system on-the-fly when a process starts up. It's unique in the sense that no two running processes have the same ID. However, the actual value is not guaranteed to be the same from one run of the process to another. The best way to think of it is like those "now serving" tickets.
You are correct that you can look up an ID via ps and grep, though you may find it easier to just use:
pgrep (process name here)
Also, if you just want to kill the process, you can even skip the above step and use:
pkill (process name here)
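Since the value in a .pid file can be left over from an earlier run, it may help to check whether it still refers to a live process before passing it to kill. The following is a minimal sketch; the function name pidfile_status is made up, and the memcached.pid path is simply the one used in the question.

```shell
#!/usr/bin/env bash
# pidfile_status FILE: print "missing", "stale" or "running" for a PID file.
# Hypothetical helper for illustration.
pidfile_status() {
    local pidfile=$1 pid
    pid=$(cat "$pidfile" 2>/dev/null) || { echo "missing"; return 0; }
    case $pid in
        ''|*[!0-9]*) echo "stale"; return 0 ;;   # empty or not a number
    esac
    if ps -p "$pid" >/dev/null 2>&1; then
        echo "running"
    else
        echo "stale"
    fi
}

# Usage with the PID file from the question (the path is an assumption):
pidfile_status memcached.pid
```

A result of "running" only says that some process currently has that PID; comparing it with the output of pgrep for the expected name shows whether it is the process you meant.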
