Short script, long module or long script, short module? - linux

I was just wondering what is best for performance when making a web service in Perl.
Is it best to keep the .pl script as short as possible and put as much code as possible in a module that the script uses, or does it not affect performance if I don't use a module at all?
I am using mod_perl on a CentOS Linux box with perl 5.8.8.

Since you're using mod_perl, there is no appreciable performance penalty in spreading your code across as many files as you like. Using modules, small, testable functions, and an object-oriented structure for your code will make it much more maintainable, reusable, and extensible. And you don't need a .pl script at all, since you're using mod_perl. Just do something like:
httpd.conf
PerlModule My::WebApp
<Location /app>
    SetHandler perl-script
    PerlHandler My::WebApp
</Location>
My/WebApp.pm
package My::WebApp;
use strict;
use warnings;
use Apache2::Const -compile => qw(OK);
sub handler {
    my $r = shift; # apache request object
    # do stuff
    return Apache2::Const::OK;
}
1; # a module file must end with a true value
Using a web application framework makes it even easier. CGI::Application has excellent mod_perl support built in.

There is a performance benefit to writing modules with mod_perl. You can have Apache load modules at startup. That way they're compiled and ready to go when it forks off a new child. They can also do work at startup and share that work rather than each child having to do it over again. It also has the potential for the compiled code to reside in shared memory reducing memory footprint.
Here's some info about Apache 2.x and Apache 1.x. In Apache 2 your strategy differs some based on what worker model you're using.
But more important is that modules are easier to test, document and reuse.
Caveat: it's been a while since I've done performance optimization of mod_perl.

No matter what you are doing, you should separate how people run your program from all of its functionality. Separate your code into modules that are not tightly coupled to anything else that you are doing. The Apache-specific portions of your application should only be a thin layer that handles the connection between the request and the rest of your code.
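As a minimal sketch of that separation (not the poster's code): My::Greeter below is a hypothetical module that knows nothing about Apache, and My::WebApp::handler only translates between the request object and that module. The handler assumes a request object offering args, content_type and print, as mod_perl 2 provides once Apache2::RequestRec and Apache2::RequestIO are loaded. The query-string parsing is deliberately naive.

```perl
use strict;
use warnings;

# Plain module: no Apache dependency, so it can be tested from a normal script.
package My::Greeter;

sub greeting {
    my ($name) = @_;
    $name = 'world' unless defined $name && length $name;
    return "Hello, $name!";
}

# Thin web layer: only moves data between the request and My::Greeter.
package My::WebApp;

sub handler {
    my $r = shift;    # request object (assumed to behave like Apache2::RequestRec)

    # Naive extraction of ?name=... (no URL decoding; illustration only).
    my ($name) = ($r->args || '') =~ /(?:^|&)name=([^&]*)/;

    $r->content_type('text/plain');
    $r->print(My::Greeter::greeting($name));
    return 0;         # numeric value of Apache2::Const::OK
}

1;
```

Because the logic lives in My::Greeter, it can be exercised with an ordinary test script and no running web server; only the few lines in handler depend on the server API.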

You had better think in terms of maintainability. Go for modules. When you are using mod_perl, there is no need to worry about any performance issues this may cause.


How to Synchronize object between multiple instance of Node Js application

Is there any way to lock an object in a Node.js application?
If multiple instances of the application are running, some functions shouldn't run concurrently. When the function in instance A completes, it should unlock the object, key, or other identifier, and instance B should check whether it is unlocked before running its own function.
Any object or key can be used to identify the lock.
How can I do that in a Node.js application that has multiple instances?
As mentioned above, Redis may be your answer; however, it really depends on the resources available to you. There are some other possibilities, less complicated and certainly less powerful, which may also do the trick.
node-cache may also do the trick, if you set it up correctly. It is nowhere near as powerful as Redis, but on the bright side it does not require as much setup and interaction with your environment.
So there is Redis and node-cache for memory locks. I should mention there are quite a few NPM packages which do the cache. Depends on what you need, and how intricate your cache needs to be.
However, there are less elegant ways to do what you want, though less elegant is not necessarily worse.
You could use a JSON file based system and hold locks on the files for a TTL. lockfile or proper-lockfile will accomplish the task. You can read the information from the files when needed, delete when required, give them a TTL. Basically a cache system to disk.
The memory system is obviously faster. The file system requires just as much planning in your code as the memory system.
There is yet another way. This is possibly the most dangerous one, and you would have to think long and hard on the consequences in terms of security and need.
Node.js has its own process.env. As most know, this holds the environment variables, which you read by simply writing process.env.foo, where foo has been declared as an environment variable. A package such as dotenv allows you to add to these variables by way of a .env text file. Thus if you put sam=mongoDB in that file, then process.env.sam in your code will evaluate to mongoDB. Tons of variables can be set up here.
So what good does that do, you may ask? Well, these variables can be changed in mid-flight, so if you need to lock the variables and then change them, this is a simple way to do it. Beware of the gotchas here, though. An assignment to process.env changes only the environment of the process that makes it (and of children it spawns afterwards), so other instances that are already running will not see the change. And once the system goes down, or all processes stop, and everything is started again, your environment variables return to the defaults in the .env file.
Additionally, unless you are running a system which is somewhat safe on AWS or Azure etc. I would not feel secure in having my .env file open to the world. There is a way around this one too. You can use a hash to encrypt all variables and put the hash in the file. When you call it, decrypt before actually requesting use of the full variable.
There are probably many more ways to lock and unlock, not the least of which is to use the native Node.js structure: combine File System events with Crypto. But this demands a much deeper understanding of the actual Node.js library and structures.
Hope some of this helped.
I strongly recommend Redis in your case.
There are several ways to create an object shared between applications or processes; using locks is one of them, as you mentioned.
But they're all complicated. Unless you really need to do that yourself, Redis will be good enough: it gives you atomic operations across multiple processes, transactions, and so on.
Old thread, but I didn't want to use Redis, so I made my own open source solution which uses websocket connections:
https://github.com/OneAndonlyFinbar/sync-cache

Synchronous read or write on Nodejs?

Why and when should we prefer read/writeFileSync to the asynchronous ones on Nodejs in particular for server applications?
Because they are blocking functions, you should always prefer the async versions in production environments.
That said, it can make sense to use them during the bootstrap of your application. For example, if you load configuration from files, you may not want fully initialized components to interact with partially initialized ones, nor want to design them to cope with each other in such a state.
I cannot see any other meaningful use for them.
I prefer to use writeFileSync or readFileSync if it does not make any sense for the program to continue in the event of a failure and there is no other asynchronous work to do. Typically this is at the beginning/end of a program, but I have also used them while "checkpointing" a long running process to make it clear that nothing is changing while the checkpoint is being written to disk.
Of course you can implement this with the asynchronous versions too, it is just more verbose and prone to mistakes.

How to make a mod_perl interpreter sticky by some conventions?

As it seems that mod_perl only manages Perl interpreters per VHOST, is
there any way I can influence which cloned interpreter mod_perl
selects to process a request? I've read through the configurable
scopes and had a look into "modperl_interp_select" in the source and I
could see that if a request already has an interpreter associated, that
one is selected by mod_perl.
else if (r) {
    if (is_subrequest && (scope == MP_INTERP_SCOPE_REQUEST)) {
        [...]
    }
    else {
        p = r->pool;
        get_interp(p);
    }
I would like to add some kind of handler before mod_perl selects an
interpreter to process a request and then select an interpreter to
assign it to the request myself, based on different criteria included in
the request.
But I'm having trouble understanding if such a handler can exist at
all or if everything regarding a request is already processed by a
selected interpreter of mod_perl.
Additionally, I can see APR::Pool-API, but it doesn't seem to provide
the capability to set some user data on a current pool object, which
is what mod_perl reads by "get_interp".
Could anyone help me on that? Thanks!
A bit on the background: I have a dir structure in cgi-bin like the following:
cgi-bin
    software1
        customer1
            *.cgi
            *.pm
        customer2
            *.cgi
            *.pm
    software2
        customer1
            *.cgi
            *.pm
        customer2
            *.cgi
            *.pm
Each customer uses a private copy of the software, and the software uses itself: e.g. software1 of customer1 may talk to software2 of customer1 by loading some special client libs of software2 into its own Perl interpreter. To make things more complicated, software2 may even bring general/common parts of software1 along in its own private installation by using svn:externals. So I have a lot of the same software with the same Perl packages in one VHOST, and I can't guarantee that all of those private installations always have the same version level.
It's quite a mix-up, but one that is known to work, under the rules we have, within a single Perl interpreter.
But now comes mod_perl, which clones interpreters as needed and reuses them for requests into whichever subdirectory of cgi-bin it likes, and in this case things break: an interpreter that has already processed software1 of customer1 may now have to process software2 of customer2, which uses common packages of software1. Those packages were already loaded by the interpreter and, because of %INC, are used instead of the private packages of software2, and so on.
Yes, there are different ways to deal with that, like VHOSTs and subdomains per software or customer or whatever, but I would like to check different ways of keeping one VHOST and the current directory structure, just by using what mod_perl or Apache httpd provides. One way would be if I could tell mod_perl to always use the same Perl interpreter for requests to the same directory. This way mod_perl would create its pool of interpreters and I would be responsible for selecting one of them per directory.
What I've learned so far is that it's not easily possible to influence mod_perl's decision about which interpreter it selects. If one wants to, it seems that one would need to patch mod_perl at the C level or provide one's own C handler for httpd as a hook to run before mod_perl. In the end mod_perl is only a combination of handlers for httpd itself, so placing one in front of it to do some special things is possible. Whether it's wise to do so is another question, because one would have to deal with some mod_perl internals, like the fact that no interpreter is available yet at that point, and in my case I would also need a map somewhere of interpreters and the directories they are associated with.
In the end it's not that easy and I don't want to patch mod_perl or start on low level C for a httpd handler/hook.
For documentation purposes I want to mention two possible workarounds which came into my mind:
Pool of Perl threads
The problem with the current mod_perl approach in my case is that it clones Perl interpreters at a low level in C, and those are run by threads from httpd's pool: every thread can run any interpreter at any given time, as long as that interpreter is not already in use by another thread. With this approach it seems impossible to access the interpreters from within Perl itself without using low-level XS as well. In particular, it's not possible to manage the interpreters and threads with Perl's threads API, simply because they aren't Perl threads; they are Perl interpreters executed by httpd threads. In the end both behave the same, though, because creating a Perl thread also clones the current interpreter and creates an associated OS thread. But with Perl's threads you have more influence over the shared data and so on.
So a workaround for my current problem could be to not let mod_perl and its interpreters process a request, but instead to create my own pool of Perl threads directly while starting up in a VHOST, using PerlModule or similar. Those threads can be managed entirely within Perl, and one could create some queues to dispatch work in the form of absolute paths to the requested CGI applications. Besides the thread pool itself, a handler would be needed which gets called instead of e.g. ModPerl::Registry and functions as a dispatcher: it would decide, based on some criteria, which thread to use and put the requested path into that thread's queue, and the thread itself could ultimately just create new instances of ModPerl::Registry to process the given file. Of course there would be some glue needed here and there...
There are some downsides to this approach, of course: it sounds like a fair amount of work, it duplicates some of the functionality already implemented by mod_perl, especially regarding pool maintenance, and it doubles the number of threads and the memory used, because the mod_perl interpreters and threads would only be used to execute the dispatcher handler, while additional threads and interpreters would process the requests within the Perl threads. The number of threads shouldn't be a huge problem; the mod_perl thread would just sleep and wait for the Perl thread to finish its work.
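A rough sketch of the dispatching idea only, assuming a Perl built with thread support: one dedicated Perl thread and one Thread::Queue per installation directory, and a dispatch function that routes a requested path to the thread owning its directory. The directory names are hypothetical, the workers only record which thread saw which script, and nothing here shows how a response would travel back to the httpd thread.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical installation directories; each gets one dedicated worker thread.
my @install_dirs = (
    '/cgi-bin/software1/customer1',
    '/cgi-bin/software2/customer2',
);

my $results   = Thread::Queue->new;
my %queue_for = map { $_ => Thread::Queue->new } @install_dirs;

# Start the pool once, e.g. at server startup.
my @pool = map {
    my $queue = $queue_for{$_};
    threads->create(sub {
        while (defined(my $script = $queue->dequeue)) {
            # A real worker would compile and run $script here.
            $results->enqueue(threads->tid() . " $script");
        }
    });
} @install_dirs;

# Dispatcher: route a requested path to the thread that owns its directory.
sub dispatch {
    my ($path) = @_;
    my ($dir) = grep { index($path, "$_/") == 0 } @install_dirs;
    return 0 unless defined $dir;
    $queue_for{$dir}->enqueue($path);
    return 1;
}

dispatch('/cgi-bin/software1/customer1/a.cgi');
dispatch('/cgi-bin/software1/customer1/b.cgi');
dispatch('/cgi-bin/software2/customer2/a.cgi');

$_->enqueue(undef) for values %queue_for;    # sentinel: tell each worker to stop
$_->join for @pool;

# Collect "<thread id> <script>" lines recorded by the workers.
my @lines;
while (defined(my $line = $results->dequeue_nb)) {
    push @lines, $line;
}
print "$_\n" for @lines;
```

Both customer1 scripts are reported with the same thread id, which is the stickiness being asked for: packages loaded for one directory never meet requests for another.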
@INC hook with source code changes
Another, and I guess easier, approach would be to make use of @INC hooks for Perl's require, again in combination with my own mod_perl handler extending ModPerl::Registry. The key point is that the handler is the first place where the requested file is read, and its source can be changed before it is compiled. Every
use XY::Z;
XY::Z->new(...);
could be changed to
use SomePrefix::XY::Z;
SomePrefix::XY::Z->new(...);
where SomePrefix would simply be the full path of the parent directory of the requested file, changed to be a valid Perl package name. ModPerl::Registry already does something similar when it automatically transforms a requested CGI script into a mod_perl handler, so this works in general, and ModPerl::Registry already provides some logic to generate the package name. As a result of the change, Perl won't find the packages automatically anymore, simply because they don't exist under the new name in any place known to Perl. That's where the @INC hook comes in.
The hook is responsible for recognizing such changed packages, simply by the name of SomePrefix or by a marker prefix in front of SomePrefix or whatever, and for mapping them to a file in the file system, so that it can provide a handle to the requested file which Perl can load during "require". Additionally, the hook will provide a callback which Perl calls for each line read from the file and which functions as a source code filter, again changing each "package", "use" or "require" statement so that it has SomePrefix in front. This in turn results in the hook being responsible for providing file handles to those packages, and so on.
The key point here is changing the source code once at runtime. Normally Perl would require "XY/Z.pm", which exists n times in my directory structure and would be saved as "XY/Z.pm" in %INC. Instead, one lets Perl require "SomePrefix/XY/Z.pm", which is what gets stored in %INC and is unique within every Perl interpreter used by mod_perl, because SomePrefix reflects the unique installation directory of the requested file. There's no room anymore for Perl to think it has already loaded XY::Z just because it processed a request from another directory before.
Of course this only works for easy "use ...;" statements, things like "eval("require $package");" will make things a bit more complicated.
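To make the mechanism concrete, here is a minimal sketch of such an @INC hook, under simplifying assumptions. It uses a fixed, hypothetical prefix Inst1 instead of a name derived from the directory, a temporary directory standing in for one private installation, and a single regex that only rewrites statements naming packages under XY::. Instead of a per-line filter callback it rewrites the whole file in memory and hands require a filehandle on that string, which is simpler but is a swapped-in technique, not the one described above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Demo setup: a private installation directory containing XY/Z.pm.
my $install_dir = tempdir(CLEANUP => 1);
mkdir "$install_dir/XY" or die "mkdir: $!";
open my $out, '>', "$install_dir/XY/Z.pm" or die "open: $!";
print {$out} "package XY::Z;\nsub where { return __PACKAGE__ }\n1;\n";
close $out or die "close: $!";

# Hypothetical marker prefix; a real one would encode the installation path.
my $prefix = 'Inst1';

# @INC hook: require calls it with the hook itself and the wanted file name,
# e.g. 'Inst1/XY/Z.pm'.
unshift @INC, sub {
    my ($self, $file) = @_;
    return unless $file =~ m{^\Q$prefix\E/(.+)$};    # not ours: let require go on

    open my $in, '<', "$install_dir/$1" or return;
    my $source = do { local $/; <$in> };             # slurp the real file
    close $in;

    # Put the prefix in front of package/use/require statements for XY::*.
    $source =~ s/^(\s*(?:package|use|require)\s+)(XY::)/$1${prefix}::$2/mg;

    open my $fh, '<', \$source or return;            # in-memory filehandle
    return $fh;
};

require Inst1::XY::Z;
print Inst1::XY::Z->where, "\n";
print exists $INC{'Inst1/XY/Z.pm'} ? "prefixed entry in %INC\n" : "no entry\n";
```

After the require, %INC holds an entry for Inst1/XY/Z.pm and none for XY/Z.pm, so a second installation loaded under a different prefix would not be mistaken for this one.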
Comments welcome... :-)
"PerlOptions" is a DIR scoped variable, not limited to VirtualHosts, so new interpreter pools can be created for any location. These directives can even be placed in .htaccess files for ease of configuration, but something like this in your httpd.conf should give you the desired effect:
<Location /cgi-bin/software1/customer1>
    PerlOptions +Parent
</Location>
<Location /cgi-bin/software1/customer2>
    PerlOptions +Parent
</Location>
<Location /cgi-bin/software2/customer1>
    PerlOptions +Parent
</Location>
<Location /cgi-bin/software2/customer2>
    PerlOptions +Parent
</Location>

Automatically adjusting process priorities under Linux

I'm trying to write a program that automatically sets process priorities based on a configuration file (basically path - priority pairs).
I thought the best solution would be a kernel module that replaces the execve() system call. Too bad, the system call table isn't exported in kernel versions > 2.6.0, so it's not possible to replace system calls without really ugly hacks.
I do not want to do the following:
-Replace binaries with shell scripts, that start and renice the binaries.
-Patch/recompile my stock Ubuntu kernel
-Do ugly hacks like reading kernel executable memory and guessing the syscall table location
-Polling of running processes
I really want to be:
-Able to control the priority of any process based on its executable path, and a configuration file. Rules apply to any user.
Does anyone of you have any ideas on how to complete this task?
If you've settled for a polling solution, most of the features you want to implement already exist in the Automatic Nice Daemon. You can configure nice levels for processes based on process name, user and group. It's even possible to adjust process priorities dynamically based on how much CPU time it has used so far.
Sometimes polling is a necessity, and even more optimal in the end -- believe it or not. It depends on a lot of variables.
If the polling overhead is low enough, polling easily beats the added complexity, cost, and RISK of developing your own kernel hooks to get notified of the changes you need. That said, when hooks or notification events are available, or can be easily injected, they should certainly be used if the situation calls for it.
This is classic programmer 'perfection' thinking. As engineers, we strive for perfection. This is the real world though and sometimes compromises must be made. Ironically, the more perfect solution may be the less efficient one in some cases.
I develop a similar 'process and process priority optimization automation' tool for Windows called Process Lasso (not an advertisement, it's free). I had a similar choice to make and have a hybrid solution in place. Kernel mode hooks are available for certain process-related events in Windows (creation and destruction), but they not only aren't exposed in user mode, they also aren't helpful for monitoring other process metrics. I don't think any OS is going to natively inform you of any change to any process metric. The overhead for that many different hooks might be much greater than simple polling.
Lastly, considering the HIGH frequency of process changes, it may be better to handle all changes at once (polling at interval) vs. notification events/hooks, which may have to be processed many more times per second.
You are RIGHT to stay away from scripts. Why? Because they are slow(er). Of course, the Linux scheduler does a fairly good job of handling CPU-bound threads by downgrading their priority and rewarding (upgrading) the priority of I/O-bound threads -- so even under high load a script should be responsive, I guess.
There's another point of attack you might consider: replace the system's dynamic linker with a modified one which applies your logic. (See this paper for some nice examples of what's possible from the largely neglected art of linker hacking).
Where this approach will have problems is with purely statically linked binaries. I doubt there's much on a modern system which actually doesn't link something dynamically (things like busybox-static being the obvious exceptions, although you might regard the ability to get a minimal shell outside of your controls as a feature when it all goes horribly wrong), so this may not be a big deal. On the other hand, if the priority policies are intended to bring some order to an overloaded shared multi-user system then you might see smart users preparing static-linked versions of apps to avoid linker-imposed priorities.
Sure, just iterate through /proc/nnn/exe to get the pathname of the running image. Only use the ones with slashes, the others are kernel procs.
Check to see if you have already processed that one, otherwise look up the new priority in your configuration file and use renice(8) to tweak its priority.
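A minimal sketch of that polling pass, assuming Linux with /proc mounted. The %rules hash is a hypothetical stand-in for the configuration file, and the renice call is left commented out so the script only reports what it would do.

```perl
use strict;
use warnings;

# Hypothetical rules: executable path => nice value (stands in for the config file).
my %rules = (
    '/usr/bin/gzip' => 10,
    '/usr/bin/make' => 5,
);

# Return { pid => executable path } for every process whose /proc/<pid>/exe
# link can be read. Kernel threads have no readable link and are skipped.
sub running_executables {
    my %exe_of;
    for my $link (glob '/proc/[0-9]*/exe') {
        my $path = readlink $link;
        next unless defined $path && $path =~ m{^/};
        my ($pid) = $link =~ m{^/proc/(\d+)/};
        $exe_of{$pid} = $path;
    }
    return \%exe_of;
}

my %seen;    # pids handled in an earlier pass

# One polling pass: returns [pid, path, nice] for each newly seen process
# whose executable has a rule.
sub poll_once {
    my @todo;
    my $exe_of = running_executables();
    for my $pid (sort { $a <=> $b } keys %$exe_of) {
        next if $seen{$pid}++;
        my $nice = $rules{ $exe_of->{$pid} };
        push @todo, [ $pid, $exe_of->{$pid}, $nice ] if defined $nice;
    }
    return @todo;
}

for my $job (poll_once()) {
    my ($pid, $path, $nice) = @$job;
    print "would run: renice -n $nice -p $pid   # $path\n";
    # system('renice', '-n', $nice, '-p', $pid);    # uncomment to apply
}
```

A daemon would call poll_once in a loop with a sleep between passes. Because pids are reused, a long-running version should also drop entries from %seen once a pid disappears from /proc.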
If you want to do it as a kernel module then you could look into making your own binary loader. See the following kernel source files for examples:
$KERNEL_SOURCE/fs/binfmt_elf.c
$KERNEL_SOURCE/fs/binfmt_misc.c
$KERNEL_SOURCE/fs/binfmt_script.c
They can give you a first idea where to start.
You could just modify the ELF loader to check for an additional section in ELF files and when found use its content for changing scheduling priorities. You then would not even need to manage separate configuration files, but simply add a new section to every ELF executable you want to manage this way and you are done. See objcopy/objdump of the binutils tools for how to add new sections to ELF files.
Does anyone of you have any ideas on how to complete this task?
As an idea, consider using apparmor in complain-mode. That would log certain messages to syslog, which you could listen to.
If the processes in question are started by executing an executable file with a known path, you can use the inotify mechanism to watch for events on that file. Executing it will trigger an IN_OPEN and an IN_ACCESS event.
Unfortunately, this won't tell you which process caused the event to trigger, but you can then check which /proc/*/exe are a symlink to the executable file in question and renice the process id in question.
E.g. here is a crude implementation in Perl using Linux::Inotify2 (which, on Ubuntu, is provided by the liblinux-inotify2-perl package):
perl -MLinux::Inotify2 -e '
    use warnings;
    use strict;
    my $x = shift(@ARGV);           # executable to watch
    my $w = Linux::Inotify2->new;
    $w->watch($x, IN_ACCESS, sub
    {
        # find every process currently running the watched executable
        for (glob("/proc/*/exe"))
        {
            if (-r $_ && readlink($_) eq $x && m#^/proc/(\d+)/#)
            {
                system(@ARGV, $1)   # run the remaining arguments with the pid appended
            }
        }
    });
    1 while $w->poll
' /bin/ls renice
You can of course save the Perl code to a file, say onexecuting, prepend a first line #!/usr/bin/env perl, make the file executable, put it on your $PATH, and from then on use onexecuting /bin/ls renice.
Then you can use this utility as a basis for implementing various policies for renicing executables. (or doing other things).

Designing a perl script with multithreading and data sharing between threads

I'm writing a perl script to run some kind of a pipeline. I start by reading a JSON file with a bunch of parameters in it. I then do some work - mainly building some data structures needed later and calling external programs that generate some output files I keep references to.
I usually use a subroutine for each of these steps. Each such subroutine will usually write some data to a unique place that no other subroutine writes to (i.e. a specific key in a hash) and reads data that other subroutines may have generated.
These steps can take a good couple of minutes if done sequentially, but most of them can be run in parallel with some simple logic of dependencies that I know how to handle (using threads and a queue). So I wonder how I should implement this to allow sharing data between the threads. What would you suggest the framework to be? Perhaps use an object (of which I will have only one instance) and keep all the shared data in $self? Perhaps
a simple script (no objects) with some "global" shared variables? ...
I would obviously prefer a simple, neat solution.
Read threads::shared. By default, as perhaps you know, Perl variables are not shared between threads. But if you place the shared attribute on them, they are.
my %repository: shared;
Then if you want to synchronize access to them, the easiest way is to
{ lock( %repository );
$repository{JSON_dump} = $json_dump;
}
# %repository will be unlocked at the end of scope.
However, you could use Thread::Queue, which is supposed to be muss-free, and do this as well:
$repo_queue->enqueue( JSON_dump => $json_dump );
Then your consumer thread could just:
my ( $key, $value ) = $repo_queue->dequeue( 2 );
$repository{ $key } = $value;
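Put together, the pieces above might look like the following sketch, assuming a Perl built with thread support. The three workers and their squared-number "results" are placeholders for the real pipeline steps, and a pair of undef values is used as a stop signal, which is a convention chosen here rather than part of Thread::Queue.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my %repository :shared;
my $repo_queue = Thread::Queue->new;

# Consumer: the only thread that writes to %repository.
my $consumer = threads->create(sub {
    while (1) {
        my ($key, $value) = $repo_queue->dequeue(2);    # blocks until a pair arrives
        last unless defined $key;                       # (undef, undef) means stop
        lock(%repository);                              # released at end of this iteration
        $repository{$key} = $value;
    }
});

# Workers: each placeholder step enqueues one key/value pair.
my @workers = map {
    my $step = $_;
    threads->create(sub {
        $repo_queue->enqueue("step$step" => $step * $step);
    });
} 1 .. 3;

$_->join for @workers;
$repo_queue->enqueue(undef, undef);    # tell the consumer to finish
$consumer->join;

print "$_ => $repository{$_}\n" for sort keys %repository;
```

Each enqueue call adds its key and value together, so with a single consumer the pairs never get interleaved.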
You can certainly do that in Perl, I suggest you look at perldoc threads and perldoc threads::shared, as these manual pages best describe the methods and pitfalls encountered when using threads in Perl.
What I would really suggest you use instead, provided you can, is a queue management system such as Gearman, which has various interfaces, including a Perl module. This allows you to create as many "workers" as you want (the subs actually doing the work) and one simple "client" which schedules the appropriate tasks and then collates the results, without needing tricks such as task-specific hashref keys or things like that.
This approach would also scale better, and you'd be able to have clients and workers (even managers) on different machines, should you choose so.
Other queue systems, such as TheSchwartz, would not be indicated as they lack the feedback/result that Gearman provides. To all effects, using Gearman this way is pretty much as the threaded system you described, just without the hassles and headaches that any system based on threads may eventually suffer from: having to lock variables, using semaphores, joining threads.
