One variable shared across all forked instances? - linux

I have a Perl script that forks itself repeatedly. I wish to gather statistics about each forked instance: whether it passed or failed and how many instances there were in total. For this task, is there a way to create a variable that is shared across all instances?
My perl version is v5.8.8.

You should use IPC in some shape or form, most typically a shared memory segment with a semaphore guarding access to it. Alternatively, you could use some kind of hybrid memory/disk database where the access API would handle concurrent access for you, but this might be overkill. Finally, you could use a file with record locking.

IPC::Shareable does what you literally ask for. Each process will have to take care to lock and unlock a shared hash (for example), but the data will appear to be shared across processes.
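A minimal sketch of what that can look like (this is not from the original answer; the glue string 'stat' and the fake pass/fail logic are placeholders, and the details may need adjusting for your IPC::Shareable version):

use strict;
use warnings;
use IPC::Shareable;

# tie a hash to a shared memory segment before forking
tie my %stats, 'IPC::Shareable', 'stat', { create => 1, destroy => 1 }
    or die "cannot tie shared hash: $!";

my @pids;
for my $n (1 .. 5) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                                   # child
        (tied %stats)->shlock;
        $stats{$$} = ($n % 2) ? 'passed' : 'failed';   # placeholder result
        (tied %stats)->shunlock;
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;

printf "%d instances total\n", scalar keys %stats;
printf "%d passed\n", scalar grep { $stats{$_} eq 'passed' } keys %stats;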
However, ordinary UNIX facilities provide easier ways (IMHO) of collecting worker status and count. Have every process write ($| = 1) "ok\n" or "not ok\n" when it END{}s, for example, and make sure that they are writing to a FIFO, since short writes (under PIPE_BUF bytes) to a pipe are atomic and will not be interleaved. Then capture that output (e.g., ./my-script.pl | tee /tmp/my.log) and you're done. Another approach would have them record their status in simple files, e.g. open(my $status, '>', "./status.$$"), in a directory specially prepared for this.
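And a sketch of the status-file variant (again only illustrative; the directory name, worker count, and do_work() are made up):

use strict;
use warnings;

my $dir = './worker-status';                  # scratch directory prepared for this run
mkdir $dir unless -d $dir;

my @pids;
for my $n (1 .. 5) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                          # child: do the work, record the outcome
        my $ok = do_work($n);
        open my $status, '>', "$dir/status.$$" or die "cannot write status: $!";
        print {$status} $ok ? "ok\n" : "not ok\n";
        close $status;
        exit($ok ? 0 : 1);
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;

# parent: tally the results
my ($total, $passed) = (0, 0);
for my $file (glob "$dir/status.*") {
    open my $fh, '<', $file or next;
    chomp(my $line = <$fh>);
    $total++;
    $passed++ if $line eq 'ok';
}
print "$passed of $total workers passed\n";

sub do_work { return $_[0] % 2 }              # placeholder for the real work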

Related

Perl threads to execute a sybase stored proc parallel

I have written a Sybase stored procedure to move data from certain tables [~50] on the primary db for a given id to an archive db. Since archiving is taking a very long time, I am thinking of executing the same stored procedure in parallel, with a unique input id for each call.
I manually ran the stored proc twice at the same time with different inputs and it seems to work. Now I want to use Perl threads [maximum 4 threads] and have each thread execute the same procedure with a different input.
Please advise whether this is a recommended approach, or whether there is a more efficient way to achieve it. If the experts' choice is threads, any pointers or examples would be helpful.
What you do in Perl does not really matter here: what matters is what happens on the side of the Sybase server. As long as each client task creates its own connection to the database, it's all fine, and how the client achieves this makes no difference to the Sybase server. But do not use a model where the different client tasks share the same client-server connection, as that will never run in parallel.
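For illustration only (this sketch is not from the original answer), the one-connection-per-thread model might look roughly like this with Perl ithreads and DBI/DBD::Sybase; the server name, credentials, procedure name, and id list are all placeholders:

use strict;
use warnings;
use threads;
use DBI;

my @ids = (101, 102, 103, 104);    # the input ids to archive (illustrative)

my @workers = map {
    my $id = $_;
    threads->create(sub {
        # each thread opens its own connection -- never share a $dbh between threads
        my $dbh = DBI->connect('dbi:Sybase:server=MYSERVER;database=mydb',
                               'user', 'password', { RaiseError => 1 });
        $dbh->do("exec archive_proc $id");   # hypothetical archival proc
        $dbh->disconnect;
        return "id $id done";
    });
} @ids;

print $_->join, "\n" for @workers;

To cap concurrency at four, you would either launch the threads in batches of four or feed the ids through a Thread::Queue consumed by four worker threads.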
No 'answer' per se, but some questions/comments:
Can you quantify taking a very long time to archive? Assuming your archive process consists of a mix of insert/select and delete operations, do query plans and MDA data show fast, efficient operations? If you're seeing table scans, sort merges, deferred inserts/deletes, etc ... then it may be worth the effort to address said performance issues.
Can you expand on the comment that running two stored proc invocations at the same time seems to work? Again, any sign of performance issues for the individual proc calls? Any sign of contention (eg, blocking) between the two proc calls? If the archival proc isn't designed properly for parallel/concurrent operations (eg, eliminate blocking), then you may not be gaining much by running multiple procs in parallel.
How many engines does your dataserver have, and are you planning on running your archive process during a period of moderate-to-heavy user activity? If the current archive process runs at/near 100% cpu utilization on a single dataserver engine, then spawning 4 copies of the same process could see your archive process tying up 4 dataserver engines with heavy cpu utilization ... and if your dataserver doesn't have many engines ... combined with moderate-to-heavy user activity at the same time ... you could end up invoking the wrath of your DBA(s) and users. Net result is that you may need to make sure your archive process doesn't hog the dataserver.
One other item to consider, and this may require input from the DBAs ... if you're replicating out of either database (source or archive), increasing the volume of transactions per a given time period could have a negative effect on replication throughput (ie, an increase in replication latency); if replication latency needs to be kept at a minimum, then you may want to rethink your entire archive process from the point of view of spreading out transactional activity enough so as to not have an effect on replication latency (eg, single-threaded archive process that does a few insert/select/delete operations, sleeps a bit, then does another batch, then sleeps, ...).
It's been my experience that archive processes are not considered high-priority operations (assuming they're run on a regular basis, and before the source db fills up); this in turn means the archive process is usually designed so that it's efficient while at the same time putting a (relatively) light load on the dataserver (think: running as a trickle in the background) ... ymmv ...

MPI I/O, matching processes to files

I have a number of files, say 100, and a number of processors, say 1000. Each proc needs to read parts of some subset of files. For instance, proc 3 needs file04.dat, file05.dat, and file09.dat, while proc 4 needs file04.dat, file07.dat, and file08.dat, etc. Which files are needed by which procs is not known at compile time and cannot be determined from any algorithm, but is easily determined during runtime from an existing metadata file.
I am trying to determine the best way to do this, using MPI I/O. It occurs to me that I could just have all the procs cycle through the files they need, calling MPI_File_open with MPI_COMM_SELF as the communicator argument. However, I'm a beginner with MPI I/O, and I suspect this would create some problems with large numbers of procs or files. Is this the case?
I have also thought that perhaps the thing to do would be to establish a separate communicator for each file, and each processor that needs a particular file would be a member of the file's associated communicator. But here, first, would that be a good idea? And second, I'm not an expert on communicators either and I can't figure out how to set up the communicators in this manner. Any ideas?
And if anyone has a completely different idea that would work better, I would be glad to hear it.

Oracle-like sequence in Linux?

I need to tag parallel calls to my program with a unique number in a single common log file (thousands of calls in a day).
For this an Oracle sequence would be perfect (the returned number is guaranteed to be unique). I could implement this with a small C program (C for speed; that is the issue here) using the system's file locking facilities, but does Linux already provide such a facility (/dev/increment_forever would be nice :)), or has somebody out there already written such a utility?
Edit: forgot to mention that my program is not a persistent process (it's not a server), so 100 calls == 100 instances of my program. Using a filesystem file to store a counter would be too slow with the needed locking mechanism... that's why something like /dev/increment_forever (i.e., a system facility) would be perfect.
First: You're seriously overestimating the costs of advisory locking on Linux. Compared to the price you're already paying for a unique instance of your program to start up, using flock to get an exclusive lock before updating a file with a unique identifier is cheap. (Doing atomic rename-based updates -- of a file other than the one the lock is held on, of course -- has some extra cost around filesystem metadata churn and journaling, but for thousands of calls per day this is nothing; one would worry if you needed to generate thousands of identifiers per second).
Second: Your question implies that what you actually need is uniqueness, as opposed to ordering. This puts you in a space where you don't necessarily need coordination or locking at all. Consider the approach taken by type-1 UUIDs (using a very high-precision timestamp, potentially in combination with other information -- consider CPU identifier, as only one process can be on a single CPU at a given time; or PID, as only one process can have a PID at a given time), or that taken by type-4 UUIDs (using a purely random value). Combine your process's PID and the timestamp at which it started (the latter is column 22 of /proc/self/stat), and you should be set.
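A small Perl illustration of the PID + start-time idea (not part of the original answer; it assumes a Linux-style /proc and translates directly to C):

use strict;
use warnings;

# Build a per-invocation tag from the PID plus the process start time
# (field 22 of /proc/self/stat, in clock ticks since boot); no locking needed.
sub unique_tag {
    open my $fh, '<', '/proc/self/stat' or die "cannot read /proc/self/stat: $!";
    my $stat = <$fh>;
    close $fh;
    # field 2 (the command name) is parenthesised and may contain spaces,
    # so take everything after the last closing ')'
    my ($rest) = $stat =~ /^.*\)\s+(.*)/;
    my @fields = split ' ', $rest;   # $fields[0] is field 3, so field 22 is $fields[19]
    return join '-', $$, $fields[19];
}

print unique_tag(), "\n";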
The following shell implementation is much slower than a native C implementation using the flock call directly, but it should give you an idea of a correct implementation:
retrieve_and_increment() {
    local lock_fd curr_value next_value
    # using a separate lockfile to allow atomic replacement of content file
    exec {lock_fd}<>counter.lock
    flock -x "$lock_fd" || {
        exec {lock_fd}<&-
        return 1
    }
    next_value=$(( $(<counter) + 1 ))
    printf '%s\n' "$next_value" >counter.next && mv counter.next counter
    exec {lock_fd}<&-  # close our handle on the lock
    # then, when not holding the lock, write result to stdout
    # ...that way we decrease the time spent holding the lock if stdout blocks
    printf '%s\n' "$next_value"
}
Note that we're spinning up an external command for mv, so flock isn't the only time we're paying fork/exec costs here -- a reason why this would be better implemented within your C program.
For other people reading this who genuinely need thousands of unique sequence values generated per second, I would strongly suggest using a Redis database for this purpose. The INCR command will atomically increment the value associated with a key in O(1) time and return that value. If setting up a TCP connection to a local service is considered too slow/expensive, Redis also supports connections via Unix sockets.
On my not-particularly-beefy laptop:
$ redis-benchmark -t INCR -n 100000 -q
INCR: 95510.98 requests per second
95,000 requests per second is probably quite sufficient. :)
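For completeness, a minimal sketch (not from the original answer) of how a short-lived Perl caller could use that via the CPAN Redis client; the key name and socket path are placeholders:

use strict;
use warnings;
use Redis;

# connect over TCP; for a purely local setup, Redis->new(sock => '/var/run/redis.sock')
# avoids the TCP round-trip
my $redis = Redis->new(server => '127.0.0.1:6379');
my $id    = $redis->incr('myprog:seq');   # atomic increment-and-return
print "$id\n";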

Is it ok to create shared variables inside a thread?

I think this might be a fairly easy question.
I found a lot of examples using threads and shared variables, but in none of them was a shared variable created inside a thread. I want to make sure I don't do something that seems to work now but will break some time in the future.
The reason I need this is I have a shared hash that maps keys to array refs. Those refs are created/filled by one thread and read/modified by another (proper synchronization is assumed). In order to store those array refs I have to make them shared too. Otherwise I get the error Invalid value for shared scalar.
Following is an example:
use threads;
use threads::shared;
use Data::Dumper;

my %hash :shared;
my $t1 = threads->create(
    sub { my @ar :shared = (1,2,3); $hash{foo} = \@ar });
$t1->join;
my $t2 = threads->create(
    sub { print Dumper(\%hash) });
$t2->join;
This works as expected: The second thread sees the changes the first made. But does this really hold under all circumstances?
Some clarifications (regarding Ian's answer):
I have one thread A reading from a pipe and waiting for input. If there is any, thread A writes this input into a shared hash (it maps scalars to hashes... those are the hashes that need to be declared shared as well) and continues to listen on the pipe. Another thread B gets notified (via cond_wait/cond_signal) when there is something to do, works on the stuff in the shared hash, and deletes the appropriate entries upon completion. Meanwhile A can add new stuff to the hash.
So regarding Ian's question
[...] Hence most people create all their shared variables before starting any sub-threads.
Therefore even if shared variables can be created in a thread, how useful would it be?
The shared hash is a dynamically growing and shrinking data structure that represents scheduled work that hasn't yet been worked on. Therefore it makes no sense to create the complete data structure at the start of the program.
Also the program has to be in (at least) two threads because reading from the pipe blocks of course. Furthermore I don't see any way to make this happen without sharing variables.
The reason for a shared variable is to share. Therefore it is likely that you will wish to have more than one thread access the variable.
If you create your shared variable in a sub-thread, how will you stop other threads accessing it before it has been created? Hence most people create all their shared variables before starting any sub-threads.
Therefore even if shared variables can be created in a thread, how useful would it be?
(PS, I don’t know if there is anything in perl that prevents shared variables being created in a thread.)
PS A good design will lead to very few (if any) shared variables
This task seems like a good fit for the core module Thread::Queue. You would create the queue before starting your threads, push items on with the reader, and pop them off with the processing thread. You can use the blocking dequeue method to have the processing thread wait for input, avoiding the need for signals.
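A rough sketch of that layout (the pipe is stood in for by STDIN and process() is a placeholder):

use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue = Thread::Queue->new();   # created before any threads start

my $worker = threads->create(sub {
    while (defined(my $item = $queue->dequeue())) {   # blocks until an item arrives
        process($item);
    }
});

my $reader = threads->create(sub {
    while (my $line = <STDIN>) {    # stands in for the real pipe
        chomp $line;
        $queue->enqueue($line);     # enqueue() takes care of sharing the item
    }
    $queue->enqueue(undef);         # sentinel: tells the worker there is no more work
});

$reader->join;
$worker->join;

sub process { print "working on: $_[0]\n" }   # placeholder for the real work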
I don't feel good answering my own question but I think the answers so far don't really answer it. If something better comes along, I'd be happy to accept that. Eric's answer helped though.
I now think there is no problem with sharing variables inside threads. The reasoning is: Thread::Queue's enqueue() method shares anything it enqueues. It does so with shared_clone. Since enqueuing should be fine from any thread, sharing should be too.
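To make that concrete, here is a small sketch of the same idea expressed directly with shared_clone (it needs a threads::shared recent enough to provide shared_clone):

use strict;
use warnings;
use threads;
use threads::shared;
use Data::Dumper;

my %hash :shared;

threads->create(sub {
    # clone the nested structure into shared space from inside a thread
    $hash{foo} = shared_clone([1, 2, 3]);
})->join;

threads->create(sub {
    print Dumper(\%hash);    # the second thread sees foo => [1, 2, 3]
})->join;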

Designing a perl script with multithreading and data sharing between threads

I'm writing a perl script to run some kind of a pipeline. I start by reading a JSON file with a bunch of parameters in it. I then do some work - mainly building some data structures needed later and calling external programs that generate some output files I keep references to.
I usually use a subroutine for each of these steps. Each such subroutine will usually write some data to a unique place that no other subroutine writes to (i.e. a specific key in a hash) and reads data that other subroutines may have generated.
These steps can take a good couple of minutes if done sequentially, but most of them can be run in parallel with some simple dependency logic that I know how to handle (using threads and a queue). So I wonder how I should implement this to allow sharing data between the threads. What framework would you suggest? Perhaps use an object (of which I will have only one instance) and keep all the shared data in $self? Perhaps a simple script (no objects) with some "global" shared variables? ...
I would obviously prefer a simple, neat solution.
Read threads::shared. By default, as you perhaps know, Perl variables are not shared. But if you place the :shared attribute on them, they are.
my %repository :shared;
Then if you want to synchronize access to them, the easiest way is to
{
    lock( %repository );
    $repository{JSON_dump} = $json_dump;
}
# %repository will be unlocked at the end of the scope.
However, you could use Thread::Queue, which is supposed to be muss-free, and do this as well:
$repo_queue->enqueue( JSON_dump => $json_dump );
Then your consumer thread could just:
my ( $key, $value ) = $repo_queue->dequeue( 2 );
$repository{ $key } = $value;
You can certainly do that in Perl; I suggest you look at perldoc threads and perldoc threads::shared, as these manual pages best describe the methods and pitfalls encountered when using threads in Perl.
What I would really suggest, provided you can, is to use a queue management system such as Gearman instead, which has various interfaces to it, including a Perl module. This allows you to create as many "workers" as you want (the subs actually doing the work) and one simple "client" which schedules the appropriate tasks and then collates the results, without needing tricks such as hashref keys specific to each task. A sketch of that shape follows below.
This approach would also scale better, and you'd be able to have clients and workers (even managers) on different machines, should you choose to.
Other queue systems, such as TheSchwartz, would not be indicated, as they lack the feedback/result mechanism that Gearman provides. For all practical purposes, using Gearman this way is pretty much like the threaded system you described, just without the hassles and headaches that any thread-based system may eventually suffer from: having to lock variables, using semaphores, joining threads.
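A rough, untested sketch of what that Gearman shape could look like with the CPAN Gearman::Worker and Gearman::Client modules; the gearmand address, the function name run_step, and the step names are all placeholders:

# worker.pl: register the function; start as many of these as you want workers
use strict;
use warnings;
use Gearman::Worker;
use Storable qw(freeze thaw);

my $worker = Gearman::Worker->new;
$worker->job_servers('127.0.0.1:4730');          # assumed gearmand location (adjust host/port)
$worker->register_function(run_step => sub {
    my $job  = shift;
    my $args = thaw($job->arg);                  # arguments travel as one frozen scalar
    # ... run the actual pipeline step here ...
    return freeze({ step => $args->{step}, status => 'ok' });
});
$worker->work while 1;

# client.pl: schedule the steps and collate the results
use strict;
use warnings;
use Gearman::Client;
use Storable qw(freeze thaw);

my $client = Gearman::Client->new;
$client->job_servers('127.0.0.1:4730');

my %results;
my $taskset = $client->new_task_set;
for my $step (qw(align filter report)) {         # illustrative step names
    $taskset->add_task(run_step => freeze({ step => $step }), {
        on_complete => sub { $results{$step} = thaw(${ $_[0] }) },
    });
}
$taskset->wait;                                  # block until every step has finished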
