How to make a mod_perl interpreter sticky by some conventions?

As it seems that mod_perl only manages Perl interpreters per VHOST, is
there any way I can influence which cloned interpreter mod_perl
selects to process a request? I've read through the configurable
scopes and had a look at "modperl_interp_select" in the source, and I
could see that if a request already has an interpreter associated with
it, that one is selected by mod_perl.
else if (r) {
    if (is_subrequest && (scope == MP_INTERP_SCOPE_REQUEST)) {
        [...]
    }
    else {
        p = r->pool;
        get_interp(p);
    }
I would like to add some kind of handler that runs before mod_perl
selects an interpreter to process a request, and then select an
interpreter and assign it to the request myself, based on different
criteria included in the request.
But I'm having trouble understanding whether such a handler can exist
at all, or whether everything regarding a request is already processed
by an interpreter selected by mod_perl.
Additionally, I can see the APR::Pool API, but it doesn't seem to provide
the capability to set user data on the current pool object, which
is what mod_perl reads in "get_interp".
Could anyone help me with that? Thanks!
A bit of background: I have a directory structure in cgi-bin like the following:
cgi-bin
    software1
        customer1
            *.cgi
            *.pm
        customer2
            *.cgi
            *.pm
    software2
        customer1
            *.cgi
            *.pm
        customer2
            *.cgi
            *.pm
Each customer uses a private copy of the software, and the software uses itself, e.g. software1 of customer1 may talk to software2 of customer1 by loading some special client libs of software2 into its own Perl interpreter. To make things more complicated, software2 may even bring general/common parts of software1 with its own private installation by using svn:externals. So I have a lot of the same software with the same Perl packages in one VHOST, and I can't guarantee that all of those private installations are always at the same version level.
It's quite a mix-up, but one that is known to work, under the rules we have, within the same Perl interpreter.
But now comes mod_perl, which clones interpreters as needed and reuses them for requests to whichever subdirectory of cgi-bin it likes, and in this case things will break: an interpreter that has already processed software1 of customer1 may now have to process software2 of customer2, which uses common packages of software1. Because those packages were already loaded by the interpreter before, they are found in %INC and used instead of the private packages of software2, and so on.
Yes, there are different ways to deal with that, like VHOSTs and subdomains per software or customer or whatever, but I would like to explore ways of keeping one VHOST and the current directory structure, just by using what mod_perl or Apache httpd provides. One way would be if I could tell mod_perl to always use the same Perl interpreter for requests to the same directory. That way mod_perl would create its pool of interpreters, and I would be responsible for selecting one of them per directory.

What I've learned so far is that it's not easily possible to influence mod_perl's choice of interpreter. If one wants to, it seems one would really need to patch mod_perl at the C level or provide one's own C handler for httpd as a hook that runs before mod_perl. In the end mod_perl is only a combination of handlers for httpd itself, so placing one in front of it to do some special things is possible. Whether it's wise to do so is another question, because one would have to deal with some mod_perl internals, such as the fact that no interpreter is available yet at that point, and in my case I would also need a map somewhere from interpreters to the directories they should process.
In the end it's not that easy, and I don't want to patch mod_perl or start writing low-level C for an httpd handler/hook.
For documentation purposes I want to mention two possible workarounds which came to mind:
Pool of Perl threads
The problem with the current mod_perl approach in my case is that it clones Perl interpreters at a low level in C, and those are run by threads from httpd's pool: every thread can run any interpreter at any given time, as long as it is not already in use by another thread. With this approach it seems impossible to access the interpreters from within Perl itself without using low-level XS as well. In particular, it's not possible to manage the interpreters and threads with Perl's threads API, simply because they aren't Perl threads; they are Perl interpreters executed by httpd threads. In the end both behave much the same, though, because creating a Perl thread also clones the current interpreter and creates an associated OS thread. But with Perl's threads you have more influence over the shared data and the like.
So a workaround for my current problem could be to not let mod_perl and its interpreters process a request, but instead create my own pool of Perl threads directly at startup in a VHOST, using PerlModule or the like. Those threads can be managed entirely within Perl; one could create queues to dispatch work in the form of absolute paths to the requested CGI applications. Besides the thread pool itself, a handler would be needed which gets called instead of e.g. ModPerl::Registry to function as a dispatcher: it would decide, based on some criteria, which thread to use and put the requested path into that thread's queue, and the thread itself could ultimately just create new instances of ModPerl::Registry to process the given file. Of course some glue would be needed here and there...
There are some downsides to this approach, of course: it sounds like a fair amount of work, duplicates some of the functionality already implemented by mod_perl, especially regarding pool maintenance, and doubles the number of threads and the memory used, because the mod_perl interpreters and threads would only be used to execute the dispatcher handler, and in addition one would have the Perl threads and their interpreters to process the requests. The number of threads shouldn't be a huge problem, though: the mod_perl thread would just sleep and wait for the Perl thread to finish its work.
@INC hook with source code changes
Another, and I guess easier, approach would be to make use of @INC hooks for Perl's require, again in combination with a custom mod_perl handler extending ModPerl::Registry. The key point is that the handler is the first place where the requested file is read, and its source can be changed before it is compiled. Every
use XY::Z;
XY::Z->new(...);
could be changed to
use SomePrefix::XY::Z;
SomePrefix::XY::Z->new(...);
where SomePrefix would simply be the full path of the parent directory of the requested file, changed to be a valid Perl package name. ModPerl::Registry already does something similar when it automatically transforms a requested CGI script into a mod_perl handler, so this works in general, and ModPerl::Registry already provides some logic to generate the package name. As a result of the change, Perl won't find the packages automatically anymore, simply because they don't exist under the new name in any place known to Perl; that's where the @INC hook comes in.
The hook is responsible for recognizing such changed packages, either by the name of SomePrefix or by a marker prefix in front of SomePrefix or whatever, and for mapping them to a file in the file system, so that it can provide a handle to the requested file which Perl can load during "require". Additionally, the hook provides a callback which Perl calls for each line read from the file and which functions as a source code filter, again changing each "package", "use" or "require" statement to have SomePrefix in front of it. This in turn results in the hook being responsible for providing file handles to those packages, etc.
The key point here is changing the source code once at runtime: "XY/Z.pm", which Perl would normally require, is available n times in my directory structure but would be stored only once, as "XY/Z.pm", in %INC. Instead, one lets Perl require "SomePrefix/XY/Z.pm", which is stored in %INC and is unique within every Perl interpreter used by mod_perl, because SomePrefix reflects the unique installation directory of the requested file. There's no room anymore for Perl to think it had already loaded XY::Z just because it processed a request from another directory before.
Of course this only works for simple "use ...;" statements; things like "eval("require $package");" will make things a bit more complicated.
Comments welcome... :-)

"PerlOptions" is a DIR-scoped directive, not limited to VirtualHosts, so new interpreter pools can be created for any location. These directives can even be placed in .htaccess files for ease of configuration, but something like this in your httpd.conf should give you the desired effect:
<Location /cgi-bin/software1/customer1>
    PerlOptions +Parent
</Location>
<Location /cgi-bin/software1/customer2>
    PerlOptions +Parent
</Location>
<Location /cgi-bin/software2/customer1>
    PerlOptions +Parent
</Location>
<Location /cgi-bin/software2/customer2>
    PerlOptions +Parent
</Location>

Related

How to get the last process that modified a particular file?

Hi,
Say I have a file called something.txt. I would like to find the most recent program to modify it, specifically the full path to said program (eg. /usr/bin/nano). I only need to worry about files modified while my program is running, so I can add an event listener at program startup and find out what program modified it when my program was running.
Thanks!

auditd on Linux can log modifications to a file, including which executable made them.
See the following URI: xmodulo.com/how-to-monitor-file-access-on-linux.html
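As a rough sketch of the auditd approach, assuming the audit daemon is running and you have root: "auditctl -w" places a watch on a path, "-p wa" limits it to writes and attribute changes, and "-k" tags matching records with a key that "ausearch -k" can look up later. The function names and the key "something-watch" are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the function names and AUDIT_KEY are hypothetical.
# Both auditctl and ausearch need root and a running auditd.

AUDIT_KEY=something-watch

# watch_writes FILE: log every write or attribute change to FILE under AUDIT_KEY.
watch_writes() {
    auditctl -w "$1" -p wa -k "$AUDIT_KEY"
}

# last_writer: print the exe= field of the most recent record logged
# under AUDIT_KEY, i.e. the program that last touched the watched file.
last_writer() {
    ausearch -k "$AUDIT_KEY" 2>/dev/null | grep -o 'exe=[^ ]*' | tail -n 1
}
```

You would call watch_writes /path/to/something.txt once at program startup and last_writer whenever you notice a change.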

Something like this generally isn't going to be possible for arbitrary processes. If these aren't arbitrary processes, then you could use some sort of network bus (e.g. redis) to publish "write" messages. Otherwise your only other bet would be to implement your own filesystem using FUSE. Even with FUSE though, you may not always have access to the pid depending on who/what is writing to the file and the security setup of your OS.

Linux non-su script indirectly triggering su script?

I'd like to create an auto-testing/grading script for students on a Linux system such that:
Any student user can initiate the script at any time.
A separate script (with root privileges) copies student code to a non-student-accessible file space, using non-student-accessible unit tests, etc.
The user receives limited feedback in the form of a text file generated by the grading script.
In short, I'm looking to create something similar to programming contest submission systems, but allowing richer feedback without revealing all teacher unit testing.
I would imagine that a spooling behavior between one initiating script and one root-permission cron script might be in order. Are there any models/examples of how one might best structure communication between a user-initiated script and a separate root-initiated script for such purposes?

There are many options.
The first things I would mention:
Don't use su; use sudo. There are several reasons for this, the main one being that to use su you need the password of the user you want to become, and with sudo you don't;
Scripts can't be suid; you must use binaries or just a normal script that is started using sudo (of course the students must have a sudoers entry that allows them to run the script);
Cron may not be as fast as you need: it runs tasks at most once a minute; please consider using inotify;
To communicate between components of your system you need something that will react in realtime; there are many opensource components/libraries/frameworks that could help you, but I would recommend you to take a look at ZeroMQ and Redis;
Results of the scripts' executions/tests can be written either to a filesystem (I think it would be better), or to a DBMS.

If you want to stick to shell scripting, the method I suggest for communicating between processes would be to have the root script continually check a named pipe for input (i.e. keep opening it after each eof) and send each input through whatever various tests must be done. Have part of the input be a 'return address' - where to send the result.
This should allow the tests to be performed in a privileged space without exposing any control over the privileged space to the students. The students don't need sudo, and you don't need to pull in libraries. Just have the students pipe their code into a non-privileged script that adds the return address and whatever other markup you may need, which then gives it to the named pipe.
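A minimal sketch of the privileged side of this, under several assumptions: the "return address" is simply the student's user name on the first line of the message, feedback lands in a hypothetical FEEDBACK_ROOT directory, and run_tests is a placeholder for the real unit tests.

```shell
#!/bin/sh
# Sketch only: FEEDBACK_ROOT, the message layout and run_tests are assumptions.

FEEDBACK_ROOT=${FEEDBACK_ROOT:-/var/spool/grader/feedback}

# valid_user NAME: accept only plain account names, so a student cannot
# smuggle a path such as "../root" into the return address.
valid_user() {
    case $1 in
        ''|*[!a-z0-9_]*) return 1 ;;
        *) return 0 ;;
    esac
}

# run_tests: placeholder for the real unit tests. It reads the submitted
# code on stdin and prints the limited feedback the student may see.
run_tests() {
    printf 'received %s lines\n' "$(wc -l | tr -d ' ')"
}

# serve_once FIFO: block until a writer opens the pipe, then handle one
# submission. Line 1 is the return address, the rest is the code.
serve_once() {
    {
        IFS= read -r user || return 1
        valid_user "$user" || return 1
        run_tests > "$FEEDBACK_ROOT/$user.txt"
    } < "$1"
}

# serve_forever FIFO: reopen the pipe after every EOF.
serve_forever() {
    while :; do
        serve_once "$1" || :
    done
}
```

The unprivileged wrapper would then do something like { id -un; cat code.py; } > /path/to/fifo. If several students submit at the same moment their writes could interleave, so a real setup would need some locking around that write.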

Multiple Process InitScript Logic

I am developing initscripts for some of our software, and am having difficulty deciding how I should use it for a particular piece.
We have homegrown software responsible for passing data around our network; it's built on a standard pubsub model. There is a publisher process (two, actually, for two different use cases), a broker process, and a subscriber process. Any combination of these processes, and even multiple instances of the same process, can run simultaneously on a given box. I'm having trouble deciding how best to allow this to be configured. Since it can vary from box to box, the configuration will likely go into /etc/sysconfig/pubsub, which will be read in by the initscript.
The only things I will have to make configurable are (1) the process name, which is one of log_publish, dir_publish, broker, subscribe, and (2) the configuration file that corresponds to that particular process.
I wish to avoid telling people how to modify the initscript per box in order to change the list of running processes, so this unique configuration file per box is the best way I can come up with to accomplish that.
I assume this also means that I will have to have some kind of unique identifier per process on the box, as I intend to use the touch /var/lock/subsys/* method that most RedHat initscripts use already to lock a process from running twice. Knowing this, I know the identifier can't always be random, otherwise it will never be effective in order to prevent duplicate processes with the same configuration file (because, again, I need to be able to run multiple processes with different configuration files).
I have no idea how best to represent this in configuration.

I've implemented this similarly to how VNC does it when run as an initscript.
If you look at your distro's configuration file for vnc init (ex. RedHat/CentOS: /etc/sysconfig/vncservers), you see this:
# The VNCSERVERS variable is a list of display:user pairs.
#
# Uncomment the line below to start a VNC server on display :1
# as my 'myusername' (adjust this to your own). You will also
# need to set a VNC password; run 'man vncpasswd' to see how
# to do that.
#
# DO NOT RUN THIS SERVICE if your local area network is
# untrusted! For a secure way of using VNC, see
# <URL:http://www.uk.research.att.com/vnc/sshvnc.html>.
# VNCSERVERS="1:myusername"
# VNCSERVERARGS[1]="-geometry 800x600"
Pretty straightforward. You define a screen number, and parameters to match if necessary.
So now, I have, for example:
PUBSUBPROCS="1:publish 2:broker 3:subscribe"
PUBSUBARGS[1]="/config/publish.cfg"
PUBSUBARGS[2]="/config/broker.cfg"
PUBSUBARGS[3]="/config/subscribe.cfg"
And almost all of the logic for parsing this was also ripped out of the vncserver initscript, which I will not post here for length reasons.
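For readers who want a starting point, here is a small sketch of how such an id:name list can be split in an initscript. It is not the vncserver logic itself; the function names and the lock file naming are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: the lock file naming is an assumption, not the vncserver logic.

# each_proc LIST: split "1:publish 2:broker" into one "id name" line per entry.
each_proc() {
    for pair in $1; do
        printf '%s %s\n' "${pair%%:*}" "${pair#*:}"
    done
}

# lock_file ID: per-instance lock path, so the same id cannot be started twice.
lock_file() {
    printf '/var/lock/subsys/pubsub-%s\n' "$1"
}

# list_instances LIST: show what a "start" action would launch and lock.
# In a bash initscript the config file would come from "${PUBSUBARGS[$id]}".
list_instances() {
    each_proc "$1" | while read -r id name; do
        printf '%s -> %s\n' "$name" "$(lock_file "$id")"
    done
}
```

Because the id comes from the configuration rather than being random, the same entry always maps to the same lock file, while two entries with different ids can run side by side.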

I'd say have multiple initscripts, one per process type, and then let the configuration for each determine how many of that process to spawn.

Automatically adjusting process priorities under Linux

I'm trying to write a program that automatically sets process priorities based on a configuration file (basically path - priority pairs).
I thought the best solution would be a kernel module that replaces the execve() system call. Too bad, the system call table isn't exported in kernel versions > 2.6.0, so it's not possible to replace system calls without really ugly hacks.
I do not want to do the following:
- Replace binaries with shell scripts that start and renice the binaries.
- Patch/recompile my stock Ubuntu kernel.
- Do ugly hacks like reading kernel executable memory and guessing the syscall table location.
- Poll running processes.
I really want to be:
- Able to control the priority of any process based on its executable path and a configuration file. Rules apply to any user.
Does anyone of you have any ideas on how to complete this task?

If you've settled for a polling solution, most of the features you want to implement already exist in the Automatic Nice Daemon. You can configure nice levels for processes based on process name, user and group. It's even possible to adjust process priorities dynamically based on how much CPU time it has used so far.

Sometimes polling is a necessity, and even more optimal in the end -- believe it or not. It depends on a lot of variables.
If the polling overhead is low enough, polling easily beats the added complexity, cost, and RISK of developing your own kernel hooks to get notified of the changes you need. That said, when hooks or notification events are available, or can be easily injected, they should certainly be used if the situation calls for it.
This is classic programmer 'perfection' thinking. As engineers, we strive for perfection. This is the real world though and sometimes compromises must be made. Ironically, the more perfect solution may be the less efficient one in some cases.
I develop a similar 'process and process priority optimization automation' tool for Windows called Process Lasso (not an advertisement, it's free). I had a similar choice to make and have a hybrid solution in place. Kernel mode hooks are available for certain process related events in Windows (creation and destruction), but they not only aren't exposed at user mode, but also aren't helpful for monitoring other process metrics. I don't think any OS is going to natively inform you of any change to any process metric. The overhead for that many different hooks might be much greater than simple polling.
Lastly, considering the HIGH frequency of process changes, it may be better to handle all changes at once (polling at interval) vs. notification events/hooks, which may have to be processed many more times per second.
You are RIGHT to stay away from scripts. Why? Because they are slow(er). Of course, the Linux scheduler does a fairly good job of handling CPU bound threads by downgrading their priority and rewarding (upgrading) the priority of I/O bound threads -- so even under high load a script should be responsive I guess.

There's another point of attack you might consider: replace the system's dynamic linker with a modified one which applies your logic. (See this paper for some nice examples of what's possible from the largely neglected art of linker hacking).
Where this approach will have problems is with purely statically linked binaries. I doubt there's much on a modern system which actually doesn't link something dynamically (things like busybox-static being the obvious exceptions, although you might regard the ability to get a minimal shell outside of your controls as a feature when it all goes horribly wrong), so this may not be a big deal. On the other hand, if the priority policies are intended to bring some order to an overloaded shared multi-user system then you might see smart users preparing static-linked versions of apps to avoid linker-imposed priorities.

Sure, just iterate through /proc/nnn/exe to get the pathname of the running image. Only use the ones with slashes, the others are kernel procs.
Check to see if you have already processed that one, otherwise look up the new priority in your configuration file and use renice(8) to tweak its priority.
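A polling pass along those lines could look roughly like this. The rules file format (one "path nice" pair per line, paths without spaces) and the function names are assumptions, not something from the question.

```shell
#!/bin/sh
# Sketch only: the rules file format ("path nice" per line) is an assumption.

# priority_for EXE RULES: print the nice value configured for EXE, if any.
priority_for() {
    awk -v exe="$1" '$1 == exe { print $2; exit }' "$2"
}

seen=" "   # pids already handled, space separated

# scan_once RULES: renice every not-yet-seen process that has a rule.
scan_once() {
    for link in /proc/[0-9]*/exe; do
        # readlink fails for kernel threads and for processes we may not inspect
        exe=$(readlink "$link" 2>/dev/null) || continue
        pid=${link#/proc/}
        pid=${pid%/exe}
        case $seen in *" $pid "*) continue ;; esac
        seen="$seen$pid "
        nice=$(priority_for "$exe" "$1")
        if [ -n "$nice" ]; then
            # on util-linux, -n sets an absolute nice value by default;
            # ignore failures for processes that exited in the meantime
            renice -n "$nice" -p "$pid" >/dev/null 2>&1 || :
        fi
    done
}
```

Calling scan_once /etc/autonice.rules (a hypothetical path) from a loop with a sleep gives the polling daemon. Note that the seen list never forgets a pid, so a long-running version would have to drop pids that no longer exist, since the kernel reuses them.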

If you want to do it as a kernel module then you could look into making your own binary loader. See the following kernel source files for examples:
$KERNEL_SOURCE/fs/binfmt_elf.c
$KERNEL_SOURCE/fs/binfmt_misc.c
$KERNEL_SOURCE/fs/binfmt_script.c
They can give you a first idea where to start.
You could just modify the ELF loader to check for an additional section in ELF files and when found use its content for changing scheduling priorities. You then would not even need to manage separate configuration files, but simply add a new section to every ELF executable you want to manage this way and you are done. See objcopy/objdump of the binutils tools for how to add new sections to ELF files.

Does anyone of you have any ideas on how to complete this task?
As an idea, consider using apparmor in complain-mode. That would log certain messages to syslog, which you could listen to.

If the processes in question are started by executing an executable file with a known path, you can use the inotify mechanism to watch for events on that file. Executing it will trigger an IN_OPEN and an IN_ACCESS event.
Unfortunately, this won't tell you which process caused the event to trigger, but you can then check which /proc/*/exe are a symlink to the executable file in question and renice the process id in question.
E.g. here is a crude implementation in Perl using Linux::Inotify2 (which, on Ubuntu, is provided by the liblinux-inotify2-perl package):
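perl -MLinux::Inotify2 -e '
    use warnings;
    use strict;
    my $x = shift(@ARGV);              # path of the executable to watch
    my $w = Linux::Inotify2->new;
    $w->watch($x, IN_ACCESS, sub
    {
        # find every process currently running $x
        for (glob("/proc/*/exe"))
        {
            if (-r $_ && readlink($_) eq $x && m#^/proc/(\d+)/#)
            {
                system(@ARGV, $1)      # remaining arguments plus the pid
            }
        }
    });
    1 while $w->poll
' /bin/ls renice -n 10 -p
The remaining arguments form the command to run, with the process id appended; renice needs a priority as well as a pid, so the example passes a nice value of 10 purely for illustration.
You can of course save the Perl code to a file, say onexecuting, prepend a first line #!/usr/bin/env perl, make the file executable, put it on your $PATH, and from then on use onexecuting /bin/ls renice -n 10 -p.
Then you can use this utility as a basis for implementing various policies for renicing executables (or doing other things).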

Short script, long module or long script, short module?

I was just wondering what is best for performance when making a web service in Perl.
Is it best to have as short a .pl script as possible and put as much code as possible in a module that is used by the .pl script, or does it not affect performance if I don't use a module at all?
I am using mod_perl on a CentOS Linux box with perl 5.8.8.

Since you're using mod_perl, there is no appreciable performance penalty in spreading your code across as many files as you like. Using modules, small, testable functions, and an object-oriented structure for your code will make it much more maintainable, reusable, and extensible. And you don't need a .pl script at all, since you're using mod_perl. Just do something like:
httpd.conf
    PerlModule My::WebApp
    <Location /app>
        SetHandler perl-script
        PerlResponseHandler My::WebApp
    </Location>
My/WebApp.pm
    package My::WebApp;
    use strict;
    use warnings;
    use Apache2::Const -compile => qw(OK);
    sub handler {
        my $r = shift; # apache request object
        # do stuff
        return Apache2::Const::OK;
    }
    1; # a module must return a true value
Using a web application framework makes it even easier. CGI::Application has excellent mod_perl support built in.

There is a performance benefit to writing modules with mod_perl. You can have Apache load modules at startup. That way they're compiled and ready to go when it forks off a new child. They can also do work at startup and share that work rather than each child having to do it over again. It also has the potential for the compiled code to reside in shared memory reducing memory footprint.
Here's some info about Apache 2.x and Apache 1.x. In Apache 2 your strategy differs some based on what worker model you're using.
But more important is that modules are easier to test, document and reuse.
Caveat: it's been a while since I've done performance optimization of mod_perl.

No matter what you are doing, you should separate how people run your program from all of its functionality. Separate your code into modules that are not tightly coupled to anything else you are doing. The Apache-specific portions of your application should only be a thin layer that handles the connection between the request and the rest of your code.

You'd better think in terms of maintainability. Go for modules. When you are using mod_perl, there is no need to worry about performance issues this may cause.
