Perl: close open2 handle from a background thread - multithreading

I am simply trying to find out how to properly use the open2 function.
See an example below. It works for a small $max, but naturally, if I write long enough to the $hIn, so eventually it will get blocked because nothing reads the data on the output continuously.
use 5.26.0;
use IPC::Open2;
my $max = 100000;
my $pid = open2(my $hOut, my $hIn, "cat") || die "failed 'cat': $!";
{
my $cnt = 0;
#When $max is big (e.g. 100000) so the code below will get blocked
#when writing to $hIn
while ($cnt<$max) {say $hIn $cnt++;}
close($hIn) || say "can't close hIn";
}
while(<$hOut>) { print; }
close($hOut) || say "can't close hOut";
waitpid( $pid, 0 );
The only solution, that I can think about, is launching an other thread that will do the writing on the background.
With the code below I can write into the $hIn as much data as I want and read them in the main thread but the $hIn seems not to get closed. Because of that the while(<$hOut>) will never finish while waiting for more output.
use 5.26.0;
use threads;
use IPC::Open2;
my $max = 100000;
my $pid = open2(my $hOut, my $hIn, "cat") || die "failed 'cat': $!";
my $thr = threads->create(sub {
my $cnt = 0;
while ($cnt<$max) {say $hIn $cnt++;}
#The close does not have any effect here (although no error is produced)
close($hIn) || say "can't close hIn";
});
#This outputs all the data written to $hIn but never leaves the loop...
while(<$hOut> ) { print; }
close($hOut) || say "can't close hOut";
$thr->join;
waitpid( $pid, 0 );
My questions are:
Provided that my approach with threads is ok, how can I close the file handle from the background thread?
If it is not ok (actually use threads is discouraged in Perl), so can someone provide a working example of open2 that can write and read a lot of data without a risk of getting blocked waiting for the reader or writer?
EDIT: Following your suggestions here is an implementation of the code above using IPC::Run:
use 5.26.0;
use IPC::Run qw/ run /;
my $max = 1000000;
run sub {
my $cnt = 0;
while ($cnt<$max) {say $cnt++;}
},
"|", "cat",
"|", sub {
while(<> ) {
print;
}
}
or die "run sub | cat | sub failed: $?";
It runs without flaws, the code is very readable... I am very happy to have learned about this module. Thanks to everyone!
Yet, I consider the question to be unanswered. If it is not possible to write this functionality using open2 directly, why does that even exist and confuse people? Also not being able to close the file handle from a different thread looks like a bug to me (certainly it is - the close should at least report an error).

Your program stopped because the pipe to which it was writing became full.
The pipe to cat became full because cat stopped reading from it.
cat stopped because the pipe to which it was writing became full.
The pipe from cat became full because you program isn't reading from it.
So you have two programs waiting for each other to do something. This is a deadlock.
The low-level solution is to use select to monitor both ends of the pipe.
The high-level solution is to let IPC::Run or IPC::Run3 do that hard work for you.
use IPC::Run qw( run );
my $cnt_max = 100000;
my $cnt = 0;
run [ "cat" ],
'<', sub { $cnt < $cnt_max ? $cnt++ . "\n" : undef };

Related

Perl run the same script for different directories at the same time

I have a directory that contains other directories (the number of directories is arbitrary), like this:
Main_directory_samples/
subdirectory_sample_1/
subdirectory_sample_2/
subdirectory_sample_3/
subdirectory_sample_4/
I have a script that receives as input one directory each time and it takes 1h to run (for each directory). To run the script I have the following code:
opendir DIR, $maindirectory or die "Can't open directory!!";
while(my $dir = readdir DIR){
if($dir ne '.' && $dir ne '..'){
system("/bin/bash", "my_script.sh", $maindirectory.'/'.$dir);
}
}
closedir DIR;
However, I want to run the script for different directories at the same time. For instance, the subdirectory_sample_1/ and subdirectory_sample_2/ would run in the same thread; subdirectory_sample_3/ and subdirectory_sample_4/ in another. But I just can't find a way to do this.
As you're just starting external processes and waiting for them, a non-threading option:
use strict;
use warnings;
use Path::Tiny;
use IO::Async::Loop;
use Future::Utils 'fmap_concat';
my $loop = IO::Async::Loop->new;
my $maindirectory = '/foo/bar';
my #subdirs = grep { -d } path($maindirectory)->children; # excludes . and ..
# runs this code to maintain up to 'concurrent' pending futures at once
my $main_future = fmap_concat {
my $dir = shift;
my $future = $loop->new_future;
my $process = $loop->open_process(
command => ['/bin/bash', 'my_script.sh', $dir],
on_finish => sub { $future->done(#_) },
on_exception => sub { $future->fail(#_) },
);
return $future;
} foreach => \#subdirs, concurrent => 2;
# run event loop until all futures are done or one fails, throw exception on failure
my #exit_codes = $main_future->get;
See the docs for IO::Async::Loop and Future::Utils.
One way is to fork and in each child process a group of directories.
A basic example
use warnings;
use strict;
use feature 'say';
use List::MoreUtils qw(natatime);
use POSIX qw(:sys_wait_h); # for WNOHANG
use Time::HiRes qw(sleep); # for fractional seconds
my #all_dirs = qw(d1 d2 d3 d4);
my $path = 'maindir';
my #procs;
# Get iterator over groups (of 2)
my $it = natatime 2, #all_dirs;
while (my #dirs = $it->()) {
my $pid = fork // do { #/
warn "Can't fork for #dirs: $!";
next;
};
if ($pid == 0) {
foreach my $dir (#dirs) {
my #cmd = ('/bin/bash/', 'my_script.sh', "$path/$dir");
say "in $$, \#cmd: (#cmd)";
# system(#cmd) == 0 or do { inspect $? }
};
exit;
};
push #procs, $pid;
}
# Poll with non-blocking wait for processes (reap them)
my $gone;
while (($gone = waitpid -1, WNOHANG) > -1) {
my $status = $?;
say "Process $gone exited with $status" if $gone > 0;
sleep 0.1;
}
See system and/or exec for details, in particular on error checking, as well as $? variable. It can be unpacked to retrieve more details about the error; or, at least print a warning and skip to the next item (which happens above anyway).
The code above prints out the command and pid's with their exit status, but replace #cmd with a test command of no consequence and un-comment the system line to try this out.
Watch for how many jobs there are. A basic rule of thumb is to not have more than 2 per core at which point the performance starts suffering, but this depends on many details. Experiment to find the sweet spot for your case. I like to have a job per core and then at least one core free. In order to throttle this see modules linked at the end.
To break all jobs (directories) into groups I used natatime from List::MoreUtils (n-at-a-time). If there are more specific criteria about how to group directories adjust that.
See Forks::Super and Parallel::ForkManager for higher-level ways to work with forked processes.

perl multithreading: capturing stdio of subthread childs with "mixed" results

I wrote a massively multithreaded application in perl which basically scans a file- or directory-structure for changes (either using inotify or polling). When it detects changes, it launches subthreads that execute programs with the changed files as an argument, according to a configuration.
This works quite nice so far, with the exception that my application also tries to capture stdout and stderr of the externally executed programs and write them to log files in a structured manner.
I am, however, experiencing an occasional but serious mixup of output here, in the way that every when and then (usually under heavy workload, of course, so that the normal tests always run fine) stdout from a program on thread A gets into the stdout pipe FH of another program running on thread B at the very same time.
My in-thread code to run the externally executed programs and capture the output from them looks like this:
my $out;
$pid = open($out, "( stdbuf -oL ".$cmd." | stdbuf -oL sed -e 's/^/:::LOG:::/' ) 2>&1 |") or xlog('async execution failed for: '.$cmd, LOG_LEVEL_NORMAL, LOG_ERROR);
# catch all worker output here
while(<$out>)
{
if( $_ =~ /^:::LOG:::/ )
{
push(#log, $wname.':::'.$acnt.':::'.time().$_);
} else {
push(#log, $wname.':::'.$acnt.':::'.time().':::ERR:::'.$_);
}
if (time() - $last > 1)
{
mlog($acnt, #log);
$last = time();
#log = ();
}
}
close($out);
waitpid($pid, 0);
push(#log, $wname.':::'.$acnt.':::'.time().':::LOG:::--- thread finished ---');
stdbuf is being used here to suppress buffering delays whereever possible and the sed pipe is being used to avoid the need of handling multiple fds in the reader while still being able to separate normal output from errors.
Captured log lines are being stuffed into a local array by the while loop and every other second contents of that array are handed over to a thread-safe global logging method using semaphores that makes sure nothing gets mixed up.
To avoid unneccesary feedback loops from you: I certainly have made sure (using debug output) that the output really is mixed up on the thread level already and is not a result of locking mistakes later in the output chain!
My Question is: how can it be, that the thread-locally defined $out FH from thread A does receive output that definitely comes from a totally different program running in thread B and therefor should end up in the separately defined thread-local $out FH of thread B? Did I make a grave mistake at some point here or is it just that perl threading is a mess? And, finally, what would be a recommended way to separate the data properly (and preferably in some elegant way)?
Update: due to popular demand I have added the full thread method here:
sub async_command {
my $wname = shift;
my $cmd = shift;
my $acnt = shift;
my $delay = shift;
my $errlog = shift;
my $last = time();
my $pid = 0;
my #log;
my $out;
push(#log, $wname.':::'.$acnt.':::'.$last.':::LOG:::--- thread started ---'.($delay ? ' (sleeping for '.$delay.' seconds)':''));
push(#log, $wname.':::'.$acnt.':::'.$last.':::ERR:::--- thread started ---') if ($errlog);
if ($delay) { sleep($delay); }
# Start worker with output pipe. stdbuf prevents unwanted buffering
# sed tags stdout vs stderr
$pid = open($out, "( stdbuf -oL ".$cmd." | stdbuf -oL sed -e 's/^/:::LOG:::/' ) 2>&1 |") or xlog('async execution failed for: '.$cmd, LOG_LEVEL_NORMAL, LOG_ERROR);
# catch all worker output here
while(<$out>)
{
if( $_ =~ /^:::LOG:::/ )
{
push(#log, $wname.':::'.$acnt.':::'.time().$_);
} else {
push(#log, $wname.':::'.$acnt.':::'.time().':::ERR:::'.$_);
}
if (time() - $last > 1)
{
mlog($acnt, #log);
$last = time();
#log = ();
}
}
close($out);
waitpid($pid, 0);
push(#log, $wname.':::'.$acnt.':::'.time().':::LOG:::--- thread finished ---');
push(#log, $wname.':::'.$acnt.':::'.time().':::ERR:::--- thread finished ---') if ($errlog);
mlog($acnt, #log);
byebye();
}
So... here you can see that #log as well as $out are thread-local variables. The xlog (global log) and mlog-methods (worker logs) actually use Thread::Queue for further processing. I just dont want to use it more than once a second per thread to avoid too much locking overhead.
I have duplicated the push(#log... statements into xlog() calls for debugging. Since the worker name $wname is somewhat tied to the $cmd executed and $acnt is a number unique for each thread, I came to see clearly that there is log output being read from the $out FH that definitely comes from a different $cmd than the one executed in this thread, while $acnt and $wname stay the ones that actually belong to the thread. Also I can see that these log lines then do NOT appear on the $out FH in the other thread where they should be.

What is causing memory to continuously rise perl?

Problem
I have created a simple perl script to read log files and process the data asynchronously.
The problem i am facing is that the script appears to continuously use more memory the longer it runs. This seems to be affected by the amount of data it processes. The problem I have is that i am unable to identify what exactly is using all this memory, and whether it is a leak or something is just holding onto it.
Question
How can i modify the below script so that it no longer continuously consumes memory ?
Code
#Multithreaded to read multiple log files at the same time.
use strict;
use warnings;
use threads;
use Thread::Queue;
use threads::shared;
my $logq = Thread::Queue->new();
my %Servers :shared;
my %servername :shared;
sub csvsplit {
my $line = shift;
my $sep = (shift or ',');
return () unless $line;
my #cells;
my $re = qr/(?:^|$sep)(?:"([^"]*)"|([^$sep]*))/;
while($line =~ /$re/g) {
my $value = defined $1 ? $1 : $2;
push #cells, (defined $value ? $value : '');
}
return #cells;
}
sub process_data
{
while(sleep(1)){
if ($logq->pending())
{
my %sites;
my %returns;
while($logq->pending() > 0){
my $data = $logq->dequeue();
my #fields = csvsplit($data);
$returns{$fields[$#fields - 1]}++;
$sites{$fields[$#fields]}++;
}
print "counter:$_, value=\"$sites{$_}\" />\n" for (keys%sites);
print "counter:$_, value=\"$returns{$_}\" />\n" for (keys%returns);
}
}
}
sub read_file
{
my $myFile=$_[0];
open(my $logfile,'<',$myFile) || die "error";
my $Inode=(stat($logfile))[1];
my $fileSize=(stat($logfile))[7];
seek $logfile, 0, 2;
for (;;) {
while (<$logfile>) {
chomp( $_ );
$logq->enqueue( $_ );
}
sleep 5;
if($Inode != (stat($myFile))[1] || (stat($myFile))[7] < $fileSize){
close($logfile);
while (! -e $myFile){
sleep 2;
}
open($logfile,'<',$myFile) || die "error";
$Inode=(stat($logfile))[1];
$fileSize=(stat($logfile))[7];
}
seek $logfile, 0, 1;
}
}
my $thr1 = threads->create(\&read_file,"log");
my $thr4 = threads->create(\&process_data);
$thr1->join();
$thr4->join();
Obeservations and relevant info
The memory only seems to increase when the program has data to process, if i just leave it, it maintains the current memory usage.
Memory only appears to increase for larger throughput and increase about half a Mb per 5 seconds for around 2000 lines in the same time.
I have not included the csv as i do not think it is relevant. If you do and want me to add it please give a valid reason.
Specs
GNU bash, version 3.2.57(1)-release (s390x-ibm-linux-gnu)
perl, v5.10.0
I have looked through other questions but cannot find much of relevance. If this is a duplicate or the relevant info is in another question, feel free to mark as a dupe and ill check it out.
Any more info needed just ask.
The reason is probably that the size of your Thread::Queue is unlimited. If the producer thread is faster than the consumer thread, your queue will continue to grow. So you should simply limit the size of your queue. For example, to set a limit of 1,000 queue items:
$logq->limit = 1000;
(The way you use the pending method is wrong by the way. You should only terminate if the return value is undefined.)

Perl: write value in thread

I am trying to get text of two large files. To speed it up i tried threads.
Before i used threads the script worked, now it does not.
The problem is: I save everything I read in the file into a hash.
When i print out the size (or keys/values) after the read-in in the sub (which the thread executed) it shows a correct number > 0, when i print out the size of the hash anywhere else (after the threads have run) it shows me 0.
print ": ".keys(%c);
is used 2 times, and has different output each time.
(In the final programm 2 Threads are running and a method to compare the stuff is called after the threads finished)
Example code:
my %c;
my #threads = initThreads();
#threads[0] = threads->create(\&ce);
foreach(#threads){
$_->join();
}
print ": ".keys(%c);
sub initThreads{
my #initThreads;
for(my $i = 0; $i<2;$i++){
push(#initThreads, $i);
}
return #initThreads;
}
sub ce(){
my $id = threads->tid();
open my $file, "<", #arg1[1] or die $!;
my #cXY;
my #cDa;
while(my $line = <$file>){
# some regex and push to arrays, works
#c{#cXY} = #cDa;
}
print "Thread $id is done\n";
close $file;
print ": ".keys(%c);
threads->exit();
}
Do i have to run the things after the first 2 threads finished in another thread which waits until the first two are finished?
Or what am i doing wrong with threads?
Thanks.
%c isn't shared across your threads.
use threads;
use threads::shared
my %c :shared;
See threads::shared.
In Perl, threads don't share memory. Each thread operates on a different copy of %c, so the changes aren't reflected to the parent thread. While sharing a variable across threads is possible, this is not generally advisable.
Make use of the possibility to return data from a thread. E.g
my %c = map %{ $_->join }, #threads; # flatten all returned hashes
sub ce {
my %hash;
...;
return \%hash;
}
Some other suggestions:
use strict; use warnings; if you aren't already.
use better variable names.
you only seem to be spawning one thread (in $threads[0]).
my #array; for (my $i = 0; $i < 2; $i++){ push(#array, $i) } is equivalent to my #array = 0 .. 1.
#arg1 is not declared in the current scope.
manually exiting a thread is not neccessary in your case.

Perl: Signals and Threads. How to kill thread with qx() inside

i have a script, that parse log and find errors and warnings.
And i want to use user-friendly interpretation of this log.
For this reason, i use notepad.
Here is code:
use v5.16;
use strict;
use warnings;
use Win32::Clipboard;
use threads;
use utf8;
my $kp = Win32::Clipboard->new();
my $output = shift || "out_log.txt";
#my ($input, $output)=#ARGV;
#open my $ih, "<", $input or die "can't open input file with log\n";
open my $oh, ">", $output or die "can't open output file with log\n";
my #alls=split /\n/,$kp->Get();
for my $k(0..$#alls){
$_ = $alls[$k];
if(/^ERR.+|^WARN.+/){
print {$oh} qq(at position $k --> ).$_."\n";
}
}
my $thread =
threads->create(sub{
$SIG{INT}=sub{die"All good\n";};
qx(notepad $output);
}
);
print qq(type 'y' for quit);
do{
print "want to quit?\n>" ;
chomp;
do{
say "I will kill this thread";
$thread->kill('INT') if defined($thread);
say "and delete output";
unlink $output;
exit(0);
}if (m/y/);
}while(<>);
It falls down, when i trying to kill thread which run notepad.
How to do this, using signals and threads? Is it possible?
And your ideas about solution, please.
Thanks!
This isn't working because your SIGINT never gets passed to notepad. So it never gets closed. (And that handler - probably never gets processed).
You need to approach this differently. Look at Win32::Process for some examples of how to spawn/kill a notepad process.
my $ProcessObj;
Win32::Process::Create( $ProcessObj,
"C:\\Windows\\system32\\notepad.exe",
"notepad", 0, NORMAL_PRIORITY_CLASS, "." )
or die $!;
And then you can use
$ProcessObj -> Kill(1);
I'd suggest using Thread::Semaphore or some sort of shared variable to decide if you want to kill your notepad.

Resources