Why is it allowed to modify argv[0]?

I have been working on a project that uses PIDs, /proc and command line analysis to validate processes on a system. My code had to be checked by the security guys, who managed to break it with a single line... embarrassing!
#!/usr/bin/env perl
$0="I am running wild"; # I had no clue you can do this!
system("cat /proc/$$/cmdline");
print("\n");
system("ps -ef | grep $$");
# do bad stuff here...
My questions:
1. I see some use cases for the above, like hiding passwords given on the command line (also bad practice), but I see a lot more problems when one can hide processes and spoof cmdline. Is there a reason it is allowed? Isn't it a system vulnerability?
2. How can I prevent or detect this? I have looked into /proc mount options. I also know that one can use lsof to identify spoofed processes based on unexpected behavior, but this won't work in my case. At the moment I am using a simple method that checks whether the cmdline contains at least one null (\0) character, which assumes that at least one argument is present. To bypass that check, the spaces in the above code would need to be replaced with nulls, which I couldn't work out how to do in Perl: it only writes up to the first \0.

To answer 1:
It's because of how starting a new process actually works.
You fork() to spawn a duplicate instance of your current process, and then you exec() to start the new thing - it replaces your current process with the 'new' process, and as a result - it has to rewrite $0.
This is actually quite useful though when running parallel code - I do it fairly frequently when fork()ing code, because it makes it easy to spot which 'thing' is getting stuck/running hot.
E.g.:
use Parallel::ForkManager;
my $manager = Parallel::ForkManager -> new ( 10 );
foreach my $server ( @list_of_servers ) {
$manager -> start and next;
$0 = "$0 child: ($server)";
#do stuff;
$manager -> finish;
}
You can instantly see in your ps list what's going on. You'll see this sort of behaviour with many multiprocessing services like httpd.
But it isn't a vulnerability, unless you assume it's something that it's not (as you do). No more than being able to 'mv' a binary anyway.
Anyway, to answer 2... prevent or detect what? I mean, you can't tell what a process is doing from the command line, but you can't tell what it's doing otherwise anyway (there's plenty of ways to make a piece of code do something 'odd' behind the scenes).
The answer is - trust your security model. Don't run untrusted code in a privileged context, and it's largely irrelevant what they call it. Sure, you could write rude messages in the process list, but it's pretty obvious who's doing it.
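If the goal is only to notice a rewritten command line, one thing that may help on Linux is to compare /proc/PID/cmdline with the /proc/PID/exe symlink, which the kernel sets at exec time and which assigning to $0 does not touch. The sketch below is an illustration, not a complete check: the function name show_proc_identity is made up here, readlink is assumed to be available, and for a script the exe link names the interpreter (for example perl), not the script itself.

```shell
#!/bin/sh
# Print the kernel-recorded executable and the process-writable
# command line for a PID. Linux-only: relies on /proc.
show_proc_identity() {
    pid=$1
    # exe is a symlink maintained by the kernel; it may be unreadable
    # for processes owned by other users.
    exe=$(readlink "/proc/$pid/exe" 2>/dev/null) || exe="(unreadable)"
    # cmdline is NUL-separated; turn NULs into spaces for display.
    cmd=$( { tr '\0' ' ' < "/proc/$pid/cmdline"; } 2>/dev/null ) || cmd=""
    printf 'pid=%s exe=%s cmdline=%s\n' "$pid" "$exe" "$cmd"
}

# Demo: inspect the shell running this script.
show_proc_identity "$$"
```

A mismatch between the two fields is only a hint that something deserves a closer look. As the answer says, neither field tells you what the process is actually doing.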

Related

NodeJS exec without creating a child process

In NodeJS, how does one make the C-level exec call, like other languages and runtimes allow (e.g. Python's os.exec*() functions or even Bash's exec), please?
Unlike the child_process module (and all the wrappers I've found for it so far), this would be meant to completely replace the current NodeJS process by the execed process.
What I am aiming for is to create a NodeJS CLI adaptor of sorts: you provide it with one set of configuration options and use it to dispatch different tools that share the same configuration, but each has different expectations about the format (some expect ENV VARs, some command-line arguments, some need a file, ...).
The reason why I am looking for such an exec call is that, once the tool gets dispatched, I have no expectation of the NodeJS process sticking around -- there's no reason for it to continue existing. At the same time, I need the tool to become the foreground process (accept all signals, control characters, ...). It seems to me that it would be possible to forward all such events from the parent NodeJS to the child tool (if I used child_process), but it seems a bit too much to implement from scratch...
Cheers!

OK, so I think I have finally got it. I am posting this as an answer as it actually shows a real code example:
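const child_process = require("child_process");

// Run psql inside an interactive bash that shares this process's terminal.
child_process.spawnSync(
  "bash",
  ["-ic", "psql"], // args must be plain strings, not a nested array
  { stdio: "inherit" }
);
process.exit();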
This will spin up bash (with control tty) and inside it psql. Pressing ctrl-c does not kill the node process; it is, as expected, sent to psql. I have no idea if it would be possible to rewrite this code in a way where the node process terminates before psql and bash, but for my use case, this does it. Maybe it would be even possible to get it done without bash, but I think that bash -i is actually the key here.
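For comparison, this is what process replacement looks like with the shell's own exec builtin, which the question mentions: the program started by exec takes over the existing process, so the PID stays the same and no parent is left behind to forward signals. This is a small sketch only; the function name demo_exec_keeps_pid is made up, and it does not show a Node API.

```shell
#!/bin/sh
# The outer sh prints its PID, then replaces itself with a second sh
# via exec. The second sh prints its own PID, which is the same number
# because no new process was created.
demo_exec_keeps_pid() {
    sh -c 'echo "$$"; exec sh -c "echo \$\$"'
}

demo_exec_keeps_pid
```

In the spawnSync workaround, passing "exec psql" instead of "psql" to bash -ic might let psql replace the intermediate bash in the same way. That is an untested assumption, and the node process would still wait for it to finish.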

How to attach the same console as output for one process and input for another process?

I am trying to use the suckless ii IRC client. I can listen to a channel by running tail -f on the out file. However is it also possible for me to input into the same console by starting an echo or cat command?
If I background the process, it actually displays the output in this console but that doesn't seem to be right way? Logically, I think I need to get the fd of the console (but how to do that) and then force the tail output to that fd and probably background it. And then use the present bash to start a cat > in.
Is it actually fine to do this, or am I creating a lot of process overhead for a simple task? In other words, piping a lot of stuff is nice, but it creates overhead that would ideally stay in a single process if you are going to repeat the task a lot?

However is it also possible for me to input into the same console by starting an echo or cat command?
Simply NO! cat writes the current content. cat has no idea that the content will grow later. echo writes variables and results from the given command line. echo itself is not made for writing the content of files.
If I background the process, it actually displays the output in this console but that doesn't seem to be right way?
If you do not redirect the output, the output goes to the console. That is the way it is designed :-)
Logically, I think I need to get the fd of the console (but how to do that) and then force the tail output to that fd and probably background it.
As I understand it, that is the opposite direction. If you want to write to the stdin of a process, you can simply use a pipe. The (useless) example below shows cat writing to the pipe and the next command reading from it. You can extend this to any other pipe read/write scenario. See the links given below.
Example:
cat main.cpp | cat /dev/stdin
cat main.cpp | tail -f
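The last one is meant to keep waiting for more content on the pipe, which never arrives. Note that POSIX says tail ignores -f when standard input is a pipe, so some implementations exit at end of input instead.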
Is it actually fine to do this, or am I creating a lot of process overhead for a simple task? In other words, piping a lot of stuff is nice, but it creates overhead that would ideally stay in a single process if you are going to repeat the task a lot?
I have no idea how time critical your job is, but I believe that the overhead is quite low. Doing the same things in a self-written program is not necessarily faster. If everything is done in a single process and no access to the file system is required, it will be much faster. But if you also use system calls, e.g. file system access, I believe it will not be much faster. You always have to pay for the work you get.
For IO redirection please read:
http://www.tldp.org/LDP/abs/html/io-redirection.html
If your scenario is more complex, you can think of named pipes instead of IO redirection. For that you can have a look at:
http://www.linuxjournal.com/content/using-named-pipes-fifos-bash
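A named pipe can be tried out in a few lines. The sketch below uses made-up names (demo_fifo, and the files in and out inside a temporary directory) and assumes mktemp -d is available: a background cat stands in for a program that reads its input from a FIFO, while the foreground shell writes to it.

```shell
#!/bin/sh
# Minimal named-pipe round trip: one background reader, one writer.
demo_fifo() {
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/in" || return 1

    # Reader: blocks until a writer opens the FIFO, then copies
    # everything it receives into a regular file.
    cat "$dir/in" > "$dir/out" &
    reader=$!

    # Writer: this is what you would type in the console.
    printf 'hello channel\n' > "$dir/in"

    # Closing the write end gives the reader EOF, so it exits.
    wait "$reader"
    cat "$dir/out"
    rm -rf "$dir"
}

demo_fifo
```

Assuming ii exposes in as a FIFO and out as a regular file per channel, the same idea in one console would be tail -f out & to show the channel in the background, followed by cat > in to send whatever you type.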

How to implement multithreaded access to file-based queue in bash script

I have a bit of a problem with designing a multiprocess bash script that goes through websites, follows found links and does some processing on every new page (it actually gathers email addresses, but that's an unimportant detail to the problem).
The script is supposed to work like this:
1) Downloads a page
2) Parses out all links and adds them to queue
3) Does some unimportant processing
4) Pops a URL from queue and starts from 1)
That itself would be quite simple to program; the problem arises from two restrictions and a feature the script needs to have.
The script must not process one URL twice
The script must be able to process n (supplied as argument) pages at once
The script must be POSIX compliant (with the exception of curl) -> so no fancy locks
Now, I've managed to come up with an implementation that uses two files for queues, one of which stores all URLs that have already been processed, and the other one URLs that have been found but not yet processed.
The main process simply spawns a bunch of child processes that all share the queue files. Each child, in a loop until the URLs-to-be-processed queue is empty, pops the top URL from the URLs-to-be-processed queue, processes the page, and tries to add every newly found link to the URLs-already-processed queue; if that succeeds (the URL is not already there), it adds the link to the URLs-to-be-processed queue as well.
The issue lies in the fact that you can't (AFAIK) make the queue-file operations atomic and therefore locking is necessary. And locking in a POSIX compliant way is ... terror... slow terror.
The way I do it is following:
# Pops the first element from a file ($1) and prints it to stdout; if the file is empty, prints nothing and returns 1
fuPop(){
    if [ -s "$1" ]; then
        sed -n '1p' "$1"
        sed -i '1d' "$1"   # -i edits in place (GNU extension, not POSIX)
        return 0
    else
        return 1
    fi
}
# Appends a line ($1) to a file ($2) and returns 0 if it's not in it yet; if it is, just returns 1
fuAppend(){
    if grep -Fxq "$1" < "$2"; then
        return 1
    else
        echo "$1" >> "$2"
        return 0
    fi
}
# There are multiple processes running this function.
prcsPages(){
    while [ -s "$todoLinks" ]; do
        luAckLock "$linksLock"
        linkToProcess="$(fuPop "$todoLinks")"
        luUnlock "$linksLock"
        prcsPage "$linkToProcess"
        ...
    done
    ...
}
# prcsPage downloads the page, does some magic and then calls prcsNewLinks and prcsNewEmails, which both get the list of new emails / new urls in $1
# $doneEmails, ..., contain file path, $mailLock, ..., contain dir path
prcsNewEmails(){
    luAckLock "$mailsLock"
    for newEmail in $1; do
        if fuAppend "$newEmail" "$doneEmails"; then
            echo "$newEmail"
        fi
    done
    luUnlock "$mailsLock"
}
prcsNewLinks(){
    luAckLock "$linksLock"
    for newLink in $1; do
        if fuAppend "$newLink" "$doneLinks"; then
            fuAppend "$newLink" "$todoLinks"
        fi
    done
    luUnlock "$linksLock"
}
The problem is that my implementation is slow (like really slow), almost so slow that it doesn't make sense to use more than 10 child processes (decreasing the lock waiting time helped a great deal). You can actually disable the locks (just comment out the luAckLock and luUnlock bits) and it works quite OK (and much faster), but there are race conditions every once in a while with the sed -i calls and it just doesn't feel right.
The worst part (as I see it) is the locking in prcsNewLinks, as it takes quite a lot of time (most of the run time, basically) and practically prevents other processes from starting to process a new page (as that requires popping a new URL from the currently locked $todoLinks queue).
Now my question is, how to do it better, faster, and nicer?
The whole script is here (it contains some signal magic, a lot of debug outputs, and not that good code generally).
BTW: Yes, you're right, doing this in bash - and what's more in POSIX compliant way - is insane! But it's university assignment so I kinda have to do it
//Though I feel it's not really expected of me to resolve these problems (as the race conditions arise more frequently only when having 25+ threads which is probably not something a sane person would test).
Notes to the code:
Yes, the wait should have (and already has) a random time. The code shared here was just a proof of concept written during a real analysis lecture.
Yes, the number of debug outputs and their formatting is terrible and there should be a standalone logging function. That's, however, not the point of my problem.

First of all, do you need to implement your own HTML/HTTP spidering? Why not let wget or curl recurse through a site for you?
You could abuse the filesystem as a database, and have your queues be directories of one-line files. (or empty files where the filename is the data). That would give you producer-consumer locking, where producers touch a file, and consumers move it from the incoming to the processing/done directory.
The beauty of this is that multiple threads touching the same file Just Works. The desired result is the url appearing once in the "incoming" list, and that's what you get when multiple threads create files with the same name. Since you want de-duplication, you don't need locking when writing to the incoming list.
1) Downloads a page
2) Parses out all links and adds them to queue
For each link found,
grep -ql "$url" already_checked || : > "incoming/$url"
or
[[ -e "done/$url" ]] || : > "incoming/$url"
3) Does some unimportant processing
4) Pops an URL from queue and starts from 1)
# mostly untested, you might have to debug / tweak this
local inc=( incoming/* )
# edit: this can make threads exit sooner than desired.
# See the comments for some ideas on how to make threads wait for new work
while [[ $inc != "incoming/*" ]]; do
    # $inc is shorthand for "${inc[0]}", the first array entry
    # If mv fails, another thread got that link; refresh the list before retrying
    mv "$inc" "done/" || { rm -f "$inc"; inc=( incoming/* ); continue; }
    url=${inc#incoming/}
    download "$url";
    for newurl in $(link_scan "$url"); do
        [[ -e "done/$newurl" ]] || : > "incoming/$newurl"
    done
    process "$url"
    inc=( incoming/* )
done
edit: encoding URLs into strings that don't contain / is left as an exercise for the reader. Although probably urlencoding / to %2F would work well enough.
I was thinking of moving URLs to a "processing" list per thread, but actually if you don't need to be able to resume from interruption, your "done" list can instead be a "queued & done" list. I don't think it's actually ever useful to mv "$url" "threadqueue.$$/" or something.
The "done/" directory will get pretty big, and start to slow down with maybe 10k files, depending on what filesystem you use. It's probably more efficient to maintain the "done" list as a file of one url per line, or a database if there's a database CLI interface that's fast for single commands.
Maintaining the done list as a file isn't bad, because you never need to remove entries. You can probably get away without locking it, even for multiple processes appending it. (I'm not sure what happens if thread B writes data at EOF between thread A opening the file and thread A doing a write. Thread A's file position might be the old EOF, in which case it would overwrite thread B's entry, or worse, overwrite only part of it. If you do need locking, maybe flock(1) would be useful. It gets a lock, then executes the commands you pass as args.)
If broken files from lack of write locking don't happen, then you might not need write locking. The occasional duplicate entry in the "done" list will be a tiny slowdown compared to having to lock for every check/append.
If you need strictly-correct avoidance of downloading the same URL multiple times, you need readers to wait for the writer to finish. If you can just sort -u the list of emails at the end, it's not a disaster for a reader to read an old copy while the list is being appended. Then writers only need to lock each other out, and readers can just read the file. If they hit EOF before a writer manages to write a new entry, then so be it.
I'm not sure whether it matters if a thread adds entries to the "done" list before or after it removes them from the incoming list, as long as it adds them to "done" before processing. I was thinking that one way or the other might make races more likely to cause duplicate done entries, and less likely to cause duplicate downloads / processing, but I'm not sure.
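Since the question asks for POSIX sh, here is a minimal sketch of the directory-as-queue idea without bash arrays or [[ ]]. It relies on rename being atomic within one filesystem: when several workers try to mv the same file, only one succeeds and the others find the source already gone. The names claim_next and demo_queue and the directory layout are made up for illustration, mktemp -d is assumed to be available, and the entries are plain tokens (encoding real URLs into file names is not handled).

```shell
#!/bin/sh
# Claim one entry from the incoming directory ($1) by moving it into
# the done directory ($2). Prints the claimed name; returns 1 if
# nothing could be claimed.
claim_next() {
    for f in "$1"/*; do
        [ -e "$f" ] || continue      # unmatched glob, or already taken
        if mv "$f" "$2/" 2>/dev/null; then
            basename "$f"
            return 0
        fi
    done
    return 1
}

demo_queue() {
    q=$(mktemp -d) || return 1
    mkdir "$q/incoming" "$q/done"
    : > "$q/incoming/a"
    : > "$q/incoming/b"
    : > "$q/incoming/a"              # enqueueing twice is harmless
    while name=$(claim_next "$q/incoming" "$q/done"); do
        echo "processing $name"
    done
    rm -rf "$q"
}

demo_queue
```

A real worker would also skip entries that already exist in the done directory before enqueueing them, as in the [[ -e "done/$url" ]] check above, and would need some way to wait for new work instead of exiting as soon as the incoming directory is empty.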

Is it OK to use a non-zero return code for a process that executed successfully?

I'm implementing a simple job scheduler, which spawns a new process for every job to run. When a job exits, I'd like it to report the number of actions executed to the scheduler.
The simplest way I could find is to exit with the number of actions as a return code. The process would for example exit with return code 3 for "3 actions executed".
But since the standard (AFAIK) is to use return code 0 when a process exited successfully, and any other value when there was an error, does this approach risk creating any problems?
Note: the child process is not an executable script, but a fork of the parent, so not accessible from the outside world.

What you are looking for is inter process communication - and there are plenty ways to do it:
Sockets
Shared memory
Pipes
Exclusive file descriptors (to some extent, rather go for something else if you can)
...
The return code convention is not something a regular programmer should dare to violate.

The only risk is confusing a calling script. What you describe makes sense, since what you want really is the count. As Joe said, use negative values for failures, and you should consider including a --help option that explains the return values ... so you can figure out what this code is doing when you try to use it next month.

I would use logs for it: log the number of actions executed to the scheduler. This way you can also log datetimes and other extra info.
I would not change the return convention...

If the scheduler spawns a child and you are writing that too, you could also open a pipe per child, or a named pipe, or maybe Unix domain sockets, and use that for inter-process communication, writing the number of processed jobs there.
I would stick with conventions, namely returning 0 for success, especially if your program is visible to or usable by other people, or otherwise document those decisions well.
Anyway apart from conventions there are also standards.
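One practical limit is worth seeing once: only the low 8 bits of an exit status reach the parent, so a count above 255 cannot be reported that way. The sketch below shows the wrap-around and a pipe-based alternative in which the count travels on stdout while the exit status keeps its usual meaning. run_job is a hypothetical stand-in for a real job; on Linux shells such as bash and dash the first line typically shows 44 (300 modulo 256), but POSIX leaves exit values above 255 unspecified.

```shell
#!/bin/sh
# A status above 255 does not survive the trip to the parent.
status=0
sh -c 'exit 300' || status=$?
echo "exit status seen by parent: $status"

# Hypothetical job: reports its action count on stdout and keeps the
# exit status for success (0) or failure (non-zero).
run_job() {
    echo 300
    return 0
}

# The command substitution is a pipe from the child to the parent.
if count=$(run_job); then
    echo "job succeeded, actions executed: $count"
else
    echo "job failed" >&2
fi
```

The same split works for a forked child in other languages: open a pipe before forking, write the count from the child, and keep 0 / non-zero for success / failure.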

Is there an equivalent of `tail -f` in Perl?

I'm writing a script that listens for changes to a log file and acts appropriately. At the moment I'm using open my $fh, "tail -f $logfile |"; but ideally I don't want to use system calls.
I've tried File::Tail but it has a minimum interval of 1 second (I've tried passing less but it defaults to 1 second, even if I pass 0). I've checked the source for it and it seems to be using sleep() which takes an integer. Before I try to write my own, are there any other options?
Thanks.

Did you know that tail -f also uses a 1-second sleep by default? It's true! (At least for GNU tail...)
File::Tail actually uses Time::HiRes's sleep function, which means that the sleep time parameter isn't an integer; you can set it to any fractional number of seconds that your system can deal with.

There's a lot of good stuff in the documentation, including an answer to this exact question. From perlfaq5's answer to How do I do a "tail -f" in perl?
First try
seek(GWFILE, 0, 1);
The statement seek(GWFILE, 0, 1) doesn't change the current position, but it does clear the end-of-file condition on the handle, so that the next <GWFILE> makes Perl try again to read something.
If that doesn't work (it relies on features of your stdio implementation), then you need something more like this:
for (;;) {
for ($curpos = tell(GWFILE); <GWFILE>; $curpos = tell(GWFILE)) {
# search for some stuff and put it into files
}
# sleep for a while
seek(GWFILE, $curpos, 0); # seek to where we had been
}
If this still doesn't work, look into the clearerr method from IO::Handle, which resets the error and end-of-file states on the handle.
There's also a File::Tail module from CPAN.

CPAN has File::ChangeNotify

In my opinion there is nothing wrong with using tail -f pipe like you already do. In many situations going for language purity like "no system calls" is counterproductive.
Also, I've seen problems with File::Tail in long-running processes. Unfortunately I cannot corroborate this claim with any hard evidence: it happened quite some time ago and I did not debug the problems; I simply replaced File::Tail with a tail -F pipe.
