Why are static variables in an XSUB not thread safe?

According to perldoc threads:
Since Perl 5.8, thread programming has been available using a model
called interpreter threads which provides a new Perl interpreter for
each thread, and, by default, results in no data or state information
being shared between threads.
What type of data or state information is referred to in the above quote? According to perldoc perlxs:
Starting with Perl 5.8, a macro framework has been defined to allow
static data to be safely stored in XS modules that will be accessed
from a multi-threaded Perl.
So it seems to me static variables are shared between threads? But Perl variables are not shared?
(I am trying to figure out exactly what kind of data is thread safe, and how to create a thread safe module)

Each thread has its own interpreter. This struct[1] stores everything that makes up perl, including parser state, regex engine state, the symbol table, and all SVs (which includes scalars, arrays, hashes, code, etc.). Creating a new thread from within Perl makes a copy of the current interpreter.
XS code may safely use the Perl API because each function has a parameter that specifies the interpreter to use. This is often invisible to the code thanks to macros, but you might have noticed references to "THX" or "Perl context". Just don't pass an SV that belongs to one interpreter to another. (You might have heard of the "Free to wrong pool" error message that can result from this.)
But Perl can't offer any protection to things outside of its knowledge or control, such as the static storage of the external libraries it loads. No copies of those are made. Two threads could call the same C function at the same time, so precautions need to be taken as if you were writing a multi-threaded C program.
That macro framework to which your quote alludes gives access to per-interpreter storage. It also allows the library to specify a function to call on creation of new Perl threads, to clone the variables into the new interpreter.
If Perl is built without -Dusemultiplicity, the Perl interpreter consists of a bajillion global (static) variables instead. MULTIPLICITY moves them into a struct and adds a context parameter to Perl API calls. This has a performance penalty, but it allows a process to have multiple Perl interpreters. Because threaded builds of Perl require this, building a threaded perl (-Dusethreads) implies -Dusemultiplicity.
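A short Perl-level illustration of that default (a minimal sketch; the variable names are mine): an ordinary lexical is copied into each new thread, so changes made inside a thread stay private, while a variable marked with threads::shared is genuinely shared.

use strict;
use warnings;
use threads;
use threads::shared;

my $private = 0;            # each thread works on its own copy of this
my $shared :shared = 0;     # explicitly shared between threads

for (1 .. 3) {
    threads->create(sub {
        $private++;                        # modifies only this thread's copy
        { lock($shared); $shared++; }      # visible to every thread
    })->join;
}

print "private = $private, shared = $shared\n";   # prints: private = 0, shared = 3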

Related

Groovy script: How to safeguard against possible memory leak or infinite loops

We intend to allow our clients to execute their own groovy scripts within our platform. They will be allowed to access only controlled methods but we have one concern.
Although we will take all possible care, somewhere we might run the risk of long-running loops, resulting in a memory leak or an infinite loop that could affect our platform.
Is there any inherent way within groovy script to protect against such probability?
You can prevent loops and all dangerous syntax constructs with a SecureASTCustomizer.
Nice article here.
If you can restrict your DSL (the code your client can run) so that it can only invoke your methods on your object(s), it's up to you to guarantee memory cannot leak or be abused in any way.
Build your GroovyShell by passing it a custom CompilerConfiguration
Build your custom CompilerConfiguration with a custom SecureASTCustomizer using method addCompilationCustomizers(...)
You probably want to get all function calls in your DSL to go through a custom class of yours. For that, have your shell parse the script, cast the script as a DelegatingScript and pass it your object:
DemoDSLHelper delegate = new DemoDSLHelper(); // your custom class exposing the methods you want available in the DSL
GroovyShell shell = new GroovyShell(createCompilerConfiguration()); // the configuration should set DelegatingScript as the script base class and register the SecureASTCustomizer
Script script = shell.parse(scriptText);
((DelegatingScript) script).setDelegate(delegate); // unresolved method calls in the script are routed to the delegate
Object result = script.run();
Run that in a separate JVM, so that you can enforce limits on the process with OS-dependent primitives (containerization, etc.) or simply kill it if it fails to respect your rules (duration, resource consumption, etc.). If it must run within a JVM that does other things, at the very least you'll need to be extremely restrictive in what you accept in the DSL.
And here a gist with complete runnable example.

How to make a mod_perl interpreter sticky by some conventions?

As it seems that mod_perl only manages Perl interpreters per VHOST, is there any way I can influence which cloned interpreter mod_perl selects to process a request? I've read through the configurable scopes and had a look into "modperl_interp_select" in the source, and I could see that if a request already has an interpreter associated with it, that one is selected by mod_perl.
else if (r) {
    if (is_subrequest && (scope == MP_INTERP_SCOPE_REQUEST)) {
        [...]
    }
    else {
        p = r->pool;
        get_interp(p);
    }
I would like to add some kind of handler before mod_perl selects an interpreter to process a request, and then select an interpreter to assign to the request myself, based on different criteria included in the request. But I'm having trouble understanding whether such a handler can exist at all, or if everything regarding a request is already processed by an interpreter that mod_perl has selected.
Additionally, I can see the APR::Pool API, but it doesn't seem to provide the capability to set some user data on the current pool object, which is what mod_perl reads via "get_interp". Could anyone help me with that? Thanks!
A bit on the background: I have a dir structure in cgi-bin like the following:
cgi-bin
    software1
        customer1
            *.cgi
            *.pm
        customer2
            *.cgi
            *.pm
    software2
        customer1
            *.cgi
            *.pm
        customer2
            *.cgi
            *.pm
Each customer uses a private copy of the software, and the software packages use each other, e.g. software1 of customer1 may talk to software2 of customer1 by loading some special client libs of software2 into its own Perl interpreter. To make things more complicated, software2 may even bring general/common parts of software1 with its own private installation by using svn:externals. So I have a lot of the same software with the same Perl packages in one VHOST, and I can't guarantee that all of those private installations always have the same version level.
It's quite a mixup, but one that is known to work under the rules we have, as long as everything stays within the same Perl interpreter.
But now mod_perl comes along, clones interpreters as needed, and reuses them for requests into whichever subdirectory of cgi-bin it likes, and in this case things will break: suddenly an interpreter that has already processed software1 of customer1 should now process software2 of customer2, which uses common packages of software1. Those packages were already loaded by the Perl interpreter before and, because of %INC, are reused instead of the private packages of software2, and so on...
Yes, there are different ways to deal with that, like VHOSTs and subdomains per software or customer or whatever, but I would like to check different ways of keeping one VHOST and the current directory structure, just by using what mod_perl or Apache httpd provides. One way would be if I could tell mod_perl to always use the same Perl interpreter for requests to the same directory. This way mod_perl would create its pool of interpreters and I would be responsible for selecting one of them per directory.
What I've learned so far is that it's not easily possible to influence mod_perl's decision about the selected interpreter. If one wants to, it seems one would need to patch mod_perl at the C level or provide one's own C handler for httpd as a hook that runs before mod_perl. In the end, mod_perl is only a combination of handlers for httpd itself, so placing one in front of it that does some special things is possible. Whether it's wise to do so is another question, because one would have to deal with some mod_perl internals, like the fact that no interpreter is available at that point, and in my case I would need a map somewhere of interpreters and their associated directories to process.
In the end it's not that easy, and I don't want to patch mod_perl or start on low-level C for an httpd handler/hook.
For documentation purposes I want to mention two possible workarounds which came into my mind:
Pool of Perl threads
The problem with the current mod_perl approach in my case is that it clones Perl interpreters at a low level in C, and those are run by threads provided by a pool of httpd; every thread can run any interpreter at any given time, unless it's already in use by another thread. With this approach it seems impossible to access the interpreters from within Perl itself without using low-level XS as well, and in particular it's not possible to manage the interpreters and threads with Perl's threads API, simply because they aren't Perl threads, they are Perl interpreters executed by httpd threads. In the end both behave the same, though, because during creation of Perl threads the current interpreter is cloned as well and associated OS threads are created and so on. But while using Perl's threads you have more influence over the shared data and such.
So a workaround for my current problem could be to not let mod_perl and its interpreters process a request, but instead create my own pool of Perl threads directly at startup in a VHOST, using PerlModule or such. Those threads can be managed entirely within Perl; one could create some queues to dispatch work in the form of absolute paths to the requested CGI applications and such. Besides the thread pool itself, a handler would be needed that gets called instead of e.g. ModPerl::Registry and functions as a dispatcher: it would need to decide, based on some criteria, which thread to use and put the requested path into its queue, and the thread itself could ultimately e.g. just create new instances of ModPerl::Registry to process the given file. Of course there would be some glue needed here and there (see the sketch after the next paragraph)...
There are some downsides to this approach, of course: it sounds like a fair amount of work, it duplicates some of the functionality already implemented by mod_perl, especially regarding pool maintenance, and it doubles the number of threads and the memory used, because the mod_perl interpreters and threads would only be used to execute the dispatcher handler, while additional threads and interpreters would process the requests within the Perl threads. The number of threads shouldn't be a huge problem, though; the mod_perl ones would just sleep and wait for the Perl thread to finish its work.
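Roughly, such a pool and dispatcher could be sketched as follows. This is only an outline under my own assumptions: the names are made up, the worker body is empty, and the actual execution of the CGI, the response handling, and the integration with mod_perl's handler phases are all omitted.

use strict;
use warnings;
use threads;
use Thread::Queue;

my %queue_for;    # one queue, and one detached worker thread, per installation directory

sub queue_for_dir {
    my ($dir) = @_;
    return $queue_for{$dir} ||= do {
        my $q = Thread::Queue->new;
        threads->create(sub {
            while (defined(my $cgi_path = $q->dequeue)) {
                # compile and run the requested CGI here, e.g. with
                # ModPerl::Registry-style logic (omitted in this sketch)
            }
        })->detach;
        $q;
    };
}

# Dispatcher handler: decides by directory which worker thread gets the request.
sub dispatch {
    my ($cgi_path) = @_;
    my ($dir) = $cgi_path =~ m{^(.*)/[^/]+$} or return;
    queue_for_dir($dir)->enqueue($cgi_path);
}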
@INC hook with source code changes
Another, and I guess easier, approach would be to make use of @INC hooks for Perl's require, again in combination with my own mod_perl handler extending ModPerl::Registry. The key point is that the handler is the first place where the requested file is read, and its source can be changed before compiling it. Every
use XY::Z;
XY::Z->new(...);
could be changed to
use SomePrefix::XY::Z;
SomePrefix::XY::Z->new(...);
where SomePrefix would simply be the full path of the parent directory of the requested file, changed to be a valid Perl package name. ModPerl::Registry already does something similar while transforming a requested CGI script automatically into a mod_perl handler, so this works in general, and ModPerl::Registry already provides some logic to generate the package name and such. The change means that Perl won't find the packages automatically anymore, simply because they don't exist under the new name in any place known to Perl; that's where the @INC hook applies.
The hook is responsible for recognizing such changed packages, simply because of the name of SomePrefix or a marker prefix in front of SomePrefix or whatever, and for mapping those to a file in the file system, providing a handle to the requested file which Perl can load during require. Additionally, the hook will provide a callback which gets called by Perl for each line of the file read and functions as a source code filter, again changing each "package", "use" or "require" statement to have SomePrefix in front of it. This again results in the hook being responsible for providing file handles to those packages, etc.
The key point here is changing the source code once at runtime: instead of "XY/Z.pm", which Perl would normally require, which exists n times in my directory structure, and which would be recorded as "XY/Z.pm" in %INC, one lets Perl require "SomePrefix/XY/Z.pm". That entry is stored in %INC and is unique for every Perl interpreter used by mod_perl, because SomePrefix reflects the unique installation directory of the requested file. There's no room anymore for Perl to think it has already loaded XY::Z just because it processed a request from another directory before.
Of course this only works for simple "use ...;" statements; things like "eval("require $package");" will make it a bit more complicated.
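For illustration only, a minimal @INC hook of this kind could look roughly like the following. The prefix, the path mapping, and the rewrite pattern are placeholders of mine; the real version would derive them from the requested file's directory and rewrite every package reference, not just one namespace.

# Serve "SomePrefix::..." modules from a per-installation directory,
# rewriting package references on the fly so each installation stays unique in %INC.
unshift @INC, sub {
    my ($self, $filename) = @_;                 # e.g. "SomePrefix/XY/Z.pm"
    return unless $filename =~ m{^SomePrefix/(.+)$};

    my $real = "/path/to/installation/$1";      # map back to the real file (simplified)
    open my $in, '<', $real or return;
    my $source = do { local $/; <$in> };

    # Prefix package/use/require statements (crude placeholder pattern).
    $source =~ s/\b(package|use|require)\s+XY::/$1 SomePrefix::XY::/g;

    # Hand Perl an in-memory filehandle on the rewritten source (see perldoc -f require).
    open my $fh, '<', \$source or return;
    return $fh;
};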
Comments welcome... :-)
"PerlOptions" is a DIR scoped variable, not limited to VirtualHosts, so new interpreter pools can be created for any location. These directive can even be placed in .htaccess files for ease of configuration, but something like this in your httpd.conf should give you the desired effect:
<Location /cgi-bin/software1/customer1>
PerlOptions +Parent
</Location>
<Location /cgi-bin/software1/customer2>
PerlOptions +Parent
</Location>
<Location /cgi-bin/software2/customer1>
PerlOptions +Parent
</Location>
<Location /cgi-bin/software2/customer2>
PerlOptions +Parent
</Location>

One variable shared across all forked instances?

I have a Perl script that forks itself repeatedly. I wish to gather statistics about each forked instance: whether it passed or failed and how many instances there were in total. For this task, is there a way to create a variable that is shared across all instances?
My perl version is v5.8.8.
You should use IPC in some shape or form, most typically a shared memory segment with a semaphore guarding access to it. Alternatively, you could use some kind of hybrid memory/disk database where the access API would handle concurrent access for you, but this might be overkill. Finally, you could use a file with record locking.
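As a concrete illustration of the last option, here is a minimal sketch using flock (the file name and record format are mine, not from the answer): every child appends one record under an exclusive lock, and the parent tallies the records once the children have exited.

use strict;
use warnings;
use Fcntl qw(:flock);

# In each forked instance: append one status record, guarded by an exclusive lock.
sub record_status {
    my ($status) = @_;                                   # e.g. "pass" or "fail"
    open my $fh, '>>', '/tmp/worker-stats.log' or die "open: $!";
    flock($fh, LOCK_EX) or die "flock: $!";
    print {$fh} "$$ $status\n";                          # PID and result, one line per child
    close $fh;                                           # closing releases the lock
}

# In the parent, after reaping all children: tally the records.
sub collect_status {
    open my $fh, '<', '/tmp/worker-stats.log' or die "open: $!";
    my ($total, $passed) = (0, 0);
    while (<$fh>) { $total++; $passed++ if / pass$/ }
    print "$passed of $total instances passed\n";
}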
IPC::Shareable does what you literally ask for. Each process will have to take care to lock and unlock a shared hash (for example), but the data will appear to be shared across processes.
However, ordinary UNIX facilities provide easier ways (IMHO) of collecting worker status and count. Have every process write ($| = 1) "ok\n" or "not ok\n" when it END{}s, for example, and make sure that they are writing to a FIFO as comparatively short writes will not be interleaved. Then capture that output (e.g., ./my-script.pl | tee /tmp/my.log) and you're done. Another approach would have them record their status in simple files — open(my $status, '>', "./status.$$") — in a directory specially prepared for this.
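A sketch of the status-file variant described above (the directory name and the pass/fail flag are illustrative): each child writes one file named after its PID, so no locking is needed, and the parent simply globs and counts the files once the children have been reaped.

use strict;
use warnings;

# In each forked instance: write a single per-PID status file.
sub record_status {
    my ($ok) = @_;
    open my $status, '>', "./status/status.$$" or die "open: $!";   # directory prepared beforehand
    print {$status} $ok ? "ok\n" : "not ok\n";
    close $status;
}

# In the parent, after all children have exited: tally the results.
sub collect_status {
    my @files  = glob './status/status.*';
    my $passed = grep { open(my $fh, '<', $_) and scalar(<$fh>) =~ /^ok/ } @files;
    printf "%d of %d instances passed\n", $passed, scalar @files;
}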

Is calling a Lua function (as a callback) from another thread safe enough?

Actually I am using Visual C++ to try to bind Lua functions as callbacks for socket events (in another thread). I initialize the Lua stuff in one thread and the socket is in another thread, so every time the socket sends/receives a message, it will call the Lua function, and the Lua function determines what it should do according to the 'tag' within the message.
So my questions are:
Since I pass the same Lua state to the Lua functions, is that safe? Doesn't it need some kind of protection? The Lua functions are called from another thread, so I guess they might be called simultaneously.
If it is not safe, what's the solution for this case?
It is not safe to call back asynchronously into a Lua state.
There are many approaches to dealing with this. The most popular involve some kind of polling.
A recent generic synchronization library is DarkSideSync
A popular Lua binding to libev is lua-ev
This SO answer recommends Lua Lanes with LuaSocket.
It is not safe to call functions within one Lua state simultaneously from multiple threads.
I was dealing with the same problem, since in my application all the basics such as communication are handled by C++ and all the business logic is implemented in Lua. What I do is create a pool of Lua states that are created and initialised on an incremental basis (when there aren't enough states, create one and initialise it with common functions/objects). It works like this:
Once a connection thread needs to call a Lua function, it checks out an instance of a Lua state and initialises specific globals (I call it a thread/connection context) in a separate (proxy) global table that prevents polluting the original global table but is indexed by it
Call a Lua function
Check the Lua state back in to the pool, where it is restored to the "ready" state (dispose of the proxy global table)
I think this approach would be well suited for your case as well. The pool checks each state (on an interval basis) when it was last checked out. When the time difference is big enough, it destroys the state to preserve resources and adjust the number of active states to current server load. The state that is checked out is the most recently used among the available states.
There are some things you need to consider when implementing such a pool:
Each state needs to be populated with the same variables and global functions, which increases memory consumption.
Implementing an upper limit for state count in the pool
Ensuring all the globals in each state are in a consistent state, if they happen to change (here I would recommend prepopulating only static globals, while populating dynamic ones when checking out a state)
Dynamic loading of functions. In my case there are many thousands of functions / procedures that can be called in Lua. Having them constantly loaded in all states would be a huge waste. So instead I keep them byte code compiled on the C++ side and have them loaded when needed. It turns out not to impact performance that much in my case, but your mileage may vary. One thing to keep in mind is to load them only once. Say you invoke a script that needs to call another dynamically loaded function in a loop. Then you should load the function as a local once before the loop. Doing it otherwise would be a huge performance hit.
Of course this is just one idea, but one that turned out to be best suited for me.
It's not safe, as the others mentioned
Depends on your use case
The simplest solution is a global lock using the lua_lock and lua_unlock macros. That would use a single Lua state, locked by a single mutex. For a low number of callbacks it might suffice, but for higher traffic it probably won't, due to the overhead incurred.
Once you need better performance, the Lua state pool mentioned by W.B. is a nice way to handle this. The trickiest part here, I find, is synchronizing the global data across the multiple states.
DarkSideSync, mentioned by Doug, is useful in cases where the main application loop resides on the Lua side. I specifically wrote it for that purpose. In your case this doesn't seem a fit. Having said that; depending on your needs, you might consider changing your application so the main loop does reside on the Lua side. If you only handle sockets, then you can use LuaSocket and no synchronization is required at all. But obviously that depends on what else the application does.

Designing a perl script with multithreading and data sharing between threads

I'm writing a perl script to run some kind of a pipeline. I start by reading a JSON file with a bunch of parameters in it. I then do some work - mainly building some data structures needed later and calling external programs that generate some output files I keep references to.
I usually use a subroutine for each of these steps. Each such subroutine will usually write some data to a unique place that no other subroutine writes to (i.e. a specific key in a hash) and reads data that other subroutines may have generated.
These steps can take a good couple of minutes if done sequentially, but most of them can be run in parallel with some simple dependency logic that I know how to handle (using threads and a queue). So I wonder how I should implement this to allow sharing data between the threads. What framework would you suggest? Perhaps use an object (of which I will have only one instance) and keep all the shared data in $self? Perhaps a simple script (no objects) with some "global" shared variables? ...
I would obviously prefer a simple, neat solution.
Read threads::shared. By default, as perhaps you know, Perl variables are not shared. But if you place the :shared attribute on them, they are.
my %repository: shared;
Then if you want to synchronize access to them, the easiest way is to
{
    lock(%repository);
    $repository{JSON_dump} = $json_dump;
}
# %repository will be unlocked at the end of the scope.
However, you could use Thread::Queue, which is designed to be fuss-free (it handles the locking for you), and do this as well:
$repo_queue->enqueue( JSON_dump => $json_dump );
Then your consumer thread could just:
my ( $key, $value ) = $repo_queue->dequeue( 2 );
$repository{ $key } = $value;
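Putting those pieces together, a minimal worker/collector sketch could look like this (all names and the fake work are illustrative, not from the answer):

use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my %repository :shared;
my $repo_queue = Thread::Queue->new;

# Collector thread: drains key/value pairs from the queue into the shared hash.
my $collector = threads->create(sub {
    while (defined(my $key = $repo_queue->dequeue)) {
        my $value = $repo_queue->dequeue;
        lock(%repository);
        $repository{$key} = $value;
    }
});

# Worker threads: run one pipeline step each and hand the result to the collector.
my @workers = map {
    my $step = $_;
    threads->create(sub {
        my $json_dump = "result of step $step";          # stand-in for real work
        $repo_queue->enqueue("step_$step" => $json_dump);
    });
} 1 .. 3;

$_->join for @workers;
$repo_queue->enqueue(undef);       # tell the collector there is no more work
$collector->join;

print "$_ => $repository{$_}\n" for sort keys %repository;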
You can certainly do that in Perl; I suggest you look at perldoc threads and perldoc threads::shared, as these manual pages best describe the methods and pitfalls encountered when using threads in Perl.
What I would really suggest you use, provided you can, is instead a queue management system such as Gearman, which has various interfaces including a Perl module. This allows you to create as many "workers" as you want (the subs actually doing the work) and one simple "client" that schedules the appropriate tasks and then collates the results, without needing tricks like hash keys specific to each task.
This approach would also scale better, and you'd be able to have clients and workers (even managers) on different machines, should you so choose.
Other queue systems, such as TheSchwartz, would not be a good fit, as they lack the feedback/result mechanism that Gearman provides. For all practical purposes, using Gearman this way is pretty much like the threaded system you described, just without the hassles and headaches that any system based on threads may eventually suffer from: having to lock variables, using semaphores, joining threads.
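For a rough idea of what that looks like with the Gearman Perl modules, here is a hedged sketch; the server address, function name, and payload handling are placeholders of mine, not taken from this answer.

# Worker: registers a named function and processes jobs handed out by gearmand.
use Gearman::Worker;

my $worker = Gearman::Worker->new;
$worker->job_servers('127.0.0.1:4730');
$worker->register_function(run_step => sub {
    my $job  = shift;
    my $args = $job->arg;                 # serialized parameters for this pipeline step
    # ... do the actual work for this step here ...
    return "result for $args";
});
$worker->work while 1;

# Client: schedules tasks and collates the results.
use Gearman::Client;

my $client = Gearman::Client->new;
$client->job_servers('127.0.0.1:4730');

my $result_ref = $client->do_task(run_step => 'step-1 parameters');
print "got back: $$result_ref\n";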

Resources