Perl Queues and Threading

I'm trying to accomplish the following:
Have a thread that reads data from a very large file (say, about 10GB) and pushes the lines onto a queue. (I do not wish for the queue to get very large, either.)
While the buildQueue thread is pushing data onto the queue, have about 5 worker threads dequeue and process the data at the same time.
I've made an attempt, but my other threads are unreachable because of a continuous loop in my buildQueue thread.
My approach may be totally wrong. Thanks for any help; it's much appreciated.
Here's the code for buildQueue:
sub buildQueue {
print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);
open DICT_FILE, $dict_path or die("Sorry, could not open file!");
while (1) {
if (<DICT_FILE>) {
if ($queue->pending() < 100) {
my $query = <DICT_FILE>;
chomp($query);
$queue->enqueue($query);
my $count = $queue->pending();
print "Queue Size: $count Query: $query\n";
}
}
}
}
And, as I expected, when this thread gets executed nothing after it will run, because this thread will never finish.
my $builder = new Thread(&buildQueue);
Since the builder thread will be running for a long time I never get to create worker threads.
Here's the entire code:
#!/usr/bin/perl -w
use strict;
use Thread;
use Thread::Queue;
my $queue = new Thread::Queue();
my @threads;
sub buildQueue {
print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);
open dict_file, $dict_path or die("Sorry, could not open file!");
while (1) {
if (<dict_file>) {
if ($queue->pending() < 100) {
my $query = <dict_file>;
chomp($query);
$queue->enqueue($query);
my $count = $queue->pending();
print "Queue Size: $count Query: $query\n";
}
}
}
}
sub processor {
my $query;
while (1) {
if ($query = $queue->dequeue) {
print "$query\n";
}
}
}
my $builder = new Thread(&buildQueue);
push @threads, new Thread(&processor) for 1..5;

You'll need to mark when you want your threads to exit (via either join or detach). The fact that you have infinite loops with no last statements to break out of them is also a problem. Note too that threads->create (and new Thread) must be given a code reference such as \&buildQueue; writing &buildQueue calls the sub immediately in the main thread, which is why nothing after that line ever runs.
Edit: I also forgot a very important part! Each worker thread will block, waiting for another item to process off of the queue, until it sees an undef in the queue. Hence we specifically enqueue undef once for each worker thread after the queue builder is done.
Try:
#!/usr/bin/perl -w
use strict;
use threads;
use Thread::Queue;
my $queue = new Thread::Queue();
our @threads; # Do you really need our instead of my?
sub buildQueue
{
print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);
#Three-argument open, please!
open my $dict_file, "<", $dict_path or die("Sorry, could not open file!");
while(my $query=<$dict_file>)
{
chomp($query);
while(1)
{ #Wait to see if our queue has < 100 items...
if ($queue->pending() < 100)
{
$queue->enqueue($query);
print "Queue Size: " . $queue->pending . "\n";
last; #This breaks out of the infinite loop
}
}
}
close($dict_file);
foreach(1..5)
{
$queue->enqueue(undef);
}
}
sub processor
{
my $query;
while ($query = $queue->dequeue)
{
print "Thread " . threads->tid . " got $query\n";
}
}
my $builder = threads->create(\&buildQueue);
push @threads, threads->create(\&processor) for 1..5;
#Waiting for our threads to finish.
$builder->join;
foreach (@threads)
{
$_->join;
}

The MCE module for Perl loves big files. With MCE, one can chunk many lines at once, slurp a big chunk as a scalar string, or read 1 line at a time. Chunking many lines at once reduces the overhead for IPC.
MCE 1.504 is out now. It provides MCE::Queue with support for child processes, including threads. In addition, the 1.5 release comes with 5 models (MCE::Flow, MCE::Grep, MCE::Loop, MCE::Map, and MCE::Stream) which take care of instantiating the MCE instance as well as auto-tuning max_workers and chunk_size. These options can be overridden, by the way.
Below, MCE::Loop is used for the demonstration.
use MCE::Loop;
print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);
mce_loop_f {
my ($mce, $chunk_ref, $chunk_id) = @_;
foreach my $line ( @$chunk_ref ) {
chomp $line;
## add your code here to process $line
}
} $dict_path;
If you want to specify the number of workers and/or chunk_size, then there are 2 ways to do it.
use MCE::Loop max_workers => 5, chunk_size => 300000;
Or...
use MCE::Loop;
MCE::Loop::init {
max_workers => 5,
chunk_size => 300000
};
Although chunking is preferred for large files, one can compare the time with chunking one line at a time. One may omit the first line inside the block (commented out). Notice how there's no need for an inner for loop. $chunk_ref is still an array ref containing 1 line. The input scalar $_ contains the line when chunk_size equals 1, otherwise points to $chunk_ref.
use MCE::Loop;
MCE::Loop::init {
max_workers => 5,
chunk_size => 1
};
print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);
mce_loop_f {
# my ($mce, $chunk_ref, $chunk_id) = @_;
my $line = $_;
## add your code here to process $line or $_
} $dict_path;
I hope that this demonstration was helpful for folks wanting to process a file in parallel.
:) mario

It sounds like this case could do with the Parallel::ForkManager module.
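For reference, here's a minimal Parallel::ForkManager sketch; the worker body and the limit of 5 concurrent processes are illustrative assumptions, not from the question:

use strict;
use warnings;
use Parallel::ForkManager;

# Cap the number of concurrent child processes (5 here is an assumption).
my $pm = Parallel::ForkManager->new(5);

for my $item (1 .. 20) {
    $pm->start and next;   # parent: fork a child, then move on
    # child process: do the per-item work here (placeholder)
    print "child $$ processing item $item\n";
    $pm->finish;           # child exits
}
$pm->wait_all_children;    # parent blocks until every child is done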

A different approach: you can also use user_tasks in MCE 1.2+ and create two multi-worker, multithreading tasks: one task for reading (since it's a big file, you could also benefit from parallel reading while preserving the file's read position) and one task for processing, etc.
The code below still uses Thread::Queue to manage your buffer queue.
The buildQueue sub has your queue-size control and pushes the data directly to the manager process's $R_QUEUE; since we've used threads, it has access to the parent's memory space. If you want to use forks instead, you can still access the queue through a callback function. But here I chose to simply push to the queue.
The processQueue sub will simply de-queue whatever is in the queue until there's nothing more pending.
The task_end sub in each task is run only once by the manager process at the end of each task, so we use it to signal a stop to our worker processes.
Obviously, there's a lot of freedom in how you want to chunk your data to the workers, so you can decide upon the size of the chunk or even how to slurp your data in.
#!/usr/bin/env perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use MCE;
my $R_QUEUE = Thread::Queue->new;
my $queue_workers = 8;
my $process_workers = 8;
my $chunk_size = 1;
print "Enter a file name: ";
my $input_file = <STDIN>;
chomp($input_file);
sub buildQueue {
my ($self, $chunk_ref, $chunk_id) = @_;
if ($R_QUEUE->pending() < 100) {
$R_QUEUE->enqueue($chunk_ref);
$self->sendto('stdout', "Queue Size: " . $R_QUEUE->pending ."\n");
}
}
sub processQueue {
my $self = shift;
my $wid = $self->wid;
while (my $buff = $R_QUEUE->dequeue) {
$self->sendto('stdout', "Thread " . $wid . " got $$buff");
}
}
my $mce = MCE->new(
input_data => $input_file, # this could be a filepath or a file handle or even a scalar to treat like a file, check the documentation for more details.
chunk_size => $chunk_size,
use_slurpio => 1,
user_tasks => [
{ # queueing task
max_workers => $queue_workers,
user_func => \&buildQueue,
use_threads => 1, # we'll use threads to have access to the parent's variables in shared memory.
task_end => sub { $R_QUEUE->enqueue( (undef) x $process_workers ) } # signal a stop to our process workers when they hit the end of the queue. Thanks, Jack Maney!
},
{ # process task
max_workers => $process_workers,
user_func => \&processQueue,
use_threads => 1, # we'll use threads to have access to the parent's variables in shared memory
task_end => sub { print "Finished processing!\n"; }
}
]
);
$mce->run();
exit;


Use of global arrays in different threads
I'm using Dancer2 and File::Tail to tail a file on the web. When the websocket is opened, it stores the $conn in an array, and when File::Tail detects new data, it tries to send that data to the sockets stored in the array. But it doesn't work as expected.
The array that is saved when a websocket connection occurs is probably not acting as a global variable.
# it doesn't work.
foreach (@webs) {
$_->send_utf8("test2!!!!!!!!");
}
I tried to use threads::shared and Cache::Memcached etc., but I failed.
I don't know Perl very well. I tried to solve it myself, but I couldn't solve it after trying for too long, so I'm leaving a question here.
This is the whole code.
use File::Tail ();
use threads;
use threads::shared;
use Net::WebSocket::Server;
use strict;
use Dancer2;
my @webs = ();
# my %clients :shared = ();
my $conns :shared = 4;
threads->create(sub {
print "start-end:", "$conns", "\n";
my @files = glob( $ARGV[0] . '/*' );
my @fs = ();
foreach my $fileName (@files) {
my $file = File::Tail->new(name=>"$fileName",
tail => 1000,
maxinterval=>1,
interval=>1,
adjustafter=>5,resetafter=>1,
ignore_nonexistant=>1,
maxbuf=>32768);
push(@fs, $file);
}
do {
my $timeout = 1;
(my $nfound,my $timeleft,my @pending)=
File::Tail::select(undef,undef,undef,$timeout,@fs);
unless ($nfound) {
} else {
foreach (@pending) {
my $str = $_->read;
print $_->{"input"} . " ||||||||| ".localtime(time)." ||||||||| ".$str;
# it doesn't work.
foreach (@webs) {
$_->send_utf8("test!!!!!!!!");
}
}
}
} until(0);
})->detach();
threads->create(sub {
Net::WebSocket::Server->new(
listen => 8080,
on_connect => sub {
my ($serv, $conn) = @_;
push(@webs, $conn);
$conn->on(
utf8 => sub {
my ($conn, $msg) = @_;
$conn->send_utf8($msg);
# it works.
foreach (@webs) {
$_->send_utf8("test!!!!!!!!");
}
},
);
},
)->start;
})->detach();
get '/' => sub {
my $ws_url = "ws://127.0.0.1:8080/";
return <<"END";
<html>
<head><script>
var urlMySocket = "$ws_url";
var mySocket = new WebSocket(urlMySocket);
mySocket.onmessage = function (evt) {
console.log( "Got message " + evt.data );
};
mySocket.onopen = function(evt) {
console.log("opening");
setTimeout( function() {
mySocket.send('hello'); }, 2000 );
};
</script></head>
<body><h1>WebSocket client</h1></body>
</html>
END
};
dance;
Threads in Perl are not lightweight. They're separate instances of the program.
The only things that threads have in common are things that exist prior to the threads instantiating.
You can - by declaring shared variables - allow data structures to be shared between threads. However, I'd warn you to be cautious here - without some manner of locking, you potentially create yourself a race condition.
In your case, you could declare @webs as :shared. This will mean values inserted into it will be visible to all your threads. But you still need a degree of caution there, because 'when stuff is added' is still nondeterministic.
But anyway, this basically works:
#!/usr/bin/env perl
use strict;
use warnings;
use threads;
use threads::shared;
use Data::Dumper;
my @shared_struct : shared;
sub reader {
print "Starting reader\n";
for ( 1..10 ) {
print threads -> self() -> tid(), ":", join (",", @shared_struct ), "\n";
sleep 1;
}
}
sub writer {
print "starting writer\n";
for ( 1..10 ) {
push @shared_struct, rand(10);
print Dumper \@shared_struct;
sleep 1;
}
}
## start the threads;
my $reader = threads -> create ( \&reader );
my $writer = threads -> create ( \&writer );
while ( 1 ) {
print @shared_struct;
sleep 1;
}
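To address the locking caveat above, writes can be wrapped in a lock() block. Here's a minimal variant of the writer sub, assuming the declarations from the example above (writer_locked is my name for it; this addition is not part of the original example):

sub writer_locked {
    print "starting writer\n";
    for ( 1..10 ) {
        {
            lock(@shared_struct);          # lock released when this block exits
            push @shared_struct, rand(10);
        }
        sleep 1;
    }
}

Readers that need a consistent view of the array should take the same lock.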
More generally, I'd suggest you almost never actually want to detach a thread in perl - in doing so, what you're saying is 'I don't care about your execution'. And clearly that's not the case in your code - you're trying to talk to the threads.
Just creating the thread accomplishes what you want - parallel execution and you can have:
for my $thread ( threads -> list ) {
$thread -> join;
}
As and when you're ready for the thread to terminate.

Is there a thread-safe way to print in Perl?

I currently have a script that kicks off threads to perform various actions on several directories. A snippet of my script is:
#main
sub BuildInit {
my $actionStr = "";
my $compStr = "";
my @component_dirs;
my @compToBeBuilt;
foreach my $comp (@compList) {
@component_dirs = GetDirs($comp); # populates @component_dirs
}
print "Printing Action List: @actionList\n";
#---------------------------------------
#---- Setup Worker Threads ----------
for ( 1 .. NUM_WORKERS ) {
async {
while ( defined( my $job = $q->dequeue() ) ) {
worker($job);
}
};
}
#-----------------------------------
#---- Enqueue The Work ----------
for my $action (@actionList) {
my $sem = Thread::Semaphore->new(0);
$q->enqueue( [ $_, $action, $sem ] ) for @component_dirs;
$sem->down( scalar @component_dirs );
print "\n------>> Waiting for prior actions to finish up... <<------\n";
}
# Nothing more to do - notify the Queue that we're not adding anything else
$q->end();
$_->join() for threads->list();
return 0;
}
#worker
sub worker {
my ($job) = @_;
my ( $component, $action, $sem ) = @$job;
Build( $component, $action );
$sem->up();
}
#builder method
sub Build {
my ( $comp, $action ) = @_;
my $cmd = "$MAKE $MAKE_INVOCATION_PATH/$comp ";
my $retCode = -1;
given ($action) {
when ("depend") { $cmd .= "$action >nul 2>&1" } #suppress output
when ("clean") { $cmd .= $action }
when ("build") { $cmd .= 'l1' }
when ("link") { $cmd .= '' } #add nothing; default is to link
default { die "Action: $action is unknown to me." }
}
print "\n\t\t*** Performing Action: \'$cmd\' on $comp ***" if $verbose;
if ( $action eq "link" ) {
# hack around potential race conditions -- will only be an issue during linking
my $tries = 1;
until ( $retCode == 0 or $tries == 0 ) {
last if ( $retCode = system($cmd) ) == 2; #compile error; stop trying
$tries--;
}
}
else {
$retCode = system($cmd);
}
push( @retCodes, ( $retCode >> 8 ) );
#testing
if ( $retCode != 0 ) {
print "\n\t\t*** ERROR IN $comp: $# !! ***\n";
print "\t\t*** Action: $cmd -->> Error Level: " . ( $retCode >> 8 ) . "\n";
#exit(-1);
}
return $retCode;
}
The print statement I'd like to be thread-safe is: print "\n\t\t*** Performing Action: \'$cmd\' on $comp ***" if $verbose; Ideally, I would like to have this output first, and then each component that is having the $action performed on it would output in related chunks. However, this obviously doesn't work right now - the output is interleaved for the most part, with each thread spitting out its own information.
E.g.,:
ComponentAFile1.cpp
ComponentAFile2.cpp
ComponentAFile3.cpp
ComponentBFile1.cpp
ComponentCFile1.cpp
ComponentBFile2.cpp
ComponentCFile2.cpp
ComponentCFile3.cpp
... etc.
I considered executing the system commands using backticks, and capturing all of the output in a big string or something, then output it all at once, when the thread terminates. But the issue with this is (a) it seems super inefficient, and (b) I need to capture stderr.
Can anyone see a way to keep my output for each thread separate?
clarification:
My desired output would be:
ComponentAFile1.cpp
ComponentAFile2.cpp
ComponentAFile3.cpp
------------------- #some separator
ComponentBFile1.cpp
ComponentBFile2.cpp
------------------- #some separator
ComponentCFile1.cpp
ComponentCFile2.cpp
ComponentCFile3.cpp
... etc.
To ensure your output isn't interrupted, access to STDOUT and STDERR must be mutually exclusive. That means that between the time a thread starts printing and finishes printing, no other thread can be allowed to print. This can be done using Thread::Semaphore.
Capturing the output and printing it all at once allows you to reduce the amount of time a thread holds the lock. If you don't do that, you'll effectively make your program single-threaded, as each thread attempts to lock STDOUT and STDERR while one thread runs.
Other options include:
Using a different output file for each thread.
Prepending a job id to each line of output so the output can be sorted later (a sketch of this appears after the code below).
In both of those cases, you only need to lock it for a very short time span.
# Once
my $mutex = Thread::Semaphore->new(); # Shared by all threads.
# When you want to print.
$mutex->down();
print ...;
STDOUT->flush();
STDERR->flush();
$mutex->up();
or
# Once
my $mutex = Thread::Semaphore->new(); # Shared by all threads.
STDOUT->autoflush();
STDERR->autoflush();
# When you want to print.
$mutex->down();
print ...;
$mutex->up();
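For the "prepend a job id" option listed earlier, here's a minimal sketch; the tag format and the helper name tagged_print are my assumptions:

use threads;
use Thread::Semaphore;

my $mutex = Thread::Semaphore->new();   # create before spawning threads

# Print each line tagged with the calling thread's id; the short lock
# keeps individual lines from interleaving mid-line.
sub tagged_print {
    my ($msg) = @_;
    my $tid = threads->tid();
    $mutex->down();
    print "[$tid] $_\n" for split /\n/, $msg;
    $mutex->up();
}

The combined output can then be grouped per thread afterwards by sorting on the tag.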
You can utilize the blocking behavior of $sem->down if it attempts to decrease the semaphore counter below zero, as mentioned in perldoc perlthrtut:
If down() attempts to decrement the counter below zero, it blocks
until the counter is large enough.
So here's what one could do:
Initialize a semaphore with counter 1 that is shared across all threads
my $sem = Thread::Semaphore->new( 1 );
Pass a thread counter to worker and Build
for my $thr_counter ( 1 .. NUM_WORKERS ) {
async {
while ( defined( my $job = $q->dequeue() ) ) {
worker( $job, $thr_counter );
}
};
}
sub worker {
my ( $job, $counter ) = @_;
my ( $component, $action, $sem ) = @$job; # unpack the job as enqueued above
Build( $component, $action, $counter );
}
Call ->down and ->up inside Build (and nowhere else)
sub Build {
my ( $comp, $action, $counter ) = @_;
... # Execute all concurrently-executed code here
$sem->down( 1 << ( $counter -1 ) );
print "\n\t\t*** Performing Action: \'$cmd\' on $comp ***" if $verbose;
# Execute all sequential 'chunks' here
$sem->up( 1 << ( $counter - 1) );
}
By using the thread counter to left-shift the semaphore counter, it guarantees that the threads won't trample on one another:
+-----------+---+---+---+---+
| Thread | 1 | 2 | 3 | 4 |
+-----------+---+---+---+---+
| Semaphore | 1 | 2 | 4 | 8 |
+-----------+---+---+---+---+
I've approached this problem differently in the past, by creating an IO thread, and using that to serialise the file access.
E.g.
my $output_q = Thread::Queue -> new();
sub writer {
open ( my $output_fh, ">", $output_filename );
while ( my $line = $output_q -> dequeue() ) {
print {$output_fh} $line;
}
close ( $output_fh );
}
And within threads, 'print' by:
$output_q -> enqueue ( "text_to_print\n" );
Either with or without a wrapper - e.g. for timestamping statements if they're going to a log. (You probably want to timestamp when queued, rather than when actually printed.)
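A sketch of wiring up and shutting down the writer thread; the end()/join() shutdown is my assumption about intended usage, and end() requires a reasonably recent Thread::Queue:

my $writer_thread = threads->create(\&writer);

# ... worker threads call $output_q->enqueue("...\n") as shown above ...

$output_q->end();        # dequeue() then returns undef once the queue drains
$writer_thread->join();  # wait for the writer to flush and exit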

How do you reuse a queue from Thread::Queue?

I was provided some guidance on here at one time, with the following snippet:
my $q = Thread::Queue->new();
sub worker {
my ($job, $action) = @_;
Build($job, $action);
}
for (1..NUM_WORKERS) {
async {
while (defined(my $job = $q->dequeue())) {
worker($job, 'clean');
}
};
}
$q->enqueue($_) for @compsCopy;
# When you're done adding to the queue.
$q->end();
$_->join() for threads->list();
What is the best option for reusing $q? Currently, I'm just making new queue objects, $q2, $q3, and doing all of this over again for each $action that I want to perform. Is there a better way, though? I could potentially pass in an array of "actions" that I would like to perform, and would like to avoid duplicating this code 7 times if possible.
Maybe I don't fully understand what a Thread::Queue is..
You should use one queue per direction. If you just want to run some operation in parallel, one queue is enough. If you also want to report errors back and process them in the main (or another) thread, then use two queues; a sketch of that pattern follows the example below.
For simple use, here is an example for your reference:
use strict;
use warnings;
use threads;
use Thread::Queue;
my $q = Thread::Queue->new(); # A new empty queue
my %seen :shared;
# Worker threads
my @thrs;
push @thrs, threads->create(\&doOperation) for 1..5; # 5 threads
add_file_to_q('/tmp/');
$q->enqueue('//_DONE_//') for @thrs;
$_->join() for @thrs;
sub add_file_to_q {
my $dir = shift;
my @files = `ls -1 $dir/`; chomp(@files);
#add files to queue
foreach my $f (@files){
# Send work to the thread
$q->enqueue($f);
print "Pending items: "$q->pending()."\n";
}
}
sub doOperation {
my $ithread = threads->tid() ;
while (my $filename = $q->dequeue()) {
# Do work on $item
sleep(1) if ! defined $filename;
return 1 if $filename eq '//_DONE_//';
next if $seen{$filename};
print "[id=$ithread]\t$filename\n";
$seen{$filename} = 1;
### add files if it is a directory (check with symlinks, no file with //_DONE_// name!)
add_file_to_q($filename) if -d $filename;
}
return 1;
}
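For the two-queue (work plus errors) pattern described above, here's a minimal sketch; the queue names, the eval-based error capture, and the placeholder work are illustrative assumptions:

use strict;
use warnings;
use threads;
use Thread::Queue;

my $work_q = Thread::Queue->new();   # main -> workers
my $err_q  = Thread::Queue->new();   # workers -> main

my @workers = map {
    threads->create(sub {
        while (defined(my $item = $work_q->dequeue())) {
            eval {
                die "cannot handle $item\n" if $item % 7 == 0;  # placeholder work
            };
            $err_q->enqueue("$item: $@") if $@;
        }
    });
} 1 .. 5;

$work_q->enqueue($_) for 1 .. 20;
$work_q->end();                      # workers see undef once the queue drains
$_->join() for @workers;

# Back in the main thread, drain the error queue without blocking.
while (defined(my $err = $err_q->dequeue_nb())) {
    warn "worker error: $err";
}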

Perl Error: thread failed to start: Invalid value for shared scalar

I get the following error when trying to run my test code:
thread failed to start: Invalid value for shared scalar at ./threaded_test.pl line 47.
Line 47 is:
%hoh = hoh(@new_array);
My observations:
If I remove line 47 and other lines referencing %hoh, then the script runs without errors
I can create a new hash %new_hash = (itchy => "Scratchy"); without errors, but when I try to "return" a hash from another sub (line 47), it results in the error above.
Unfortunately, I cannot use an in/out Queue because the version of Thread::Queue that I use is too old (and installed on a system I have no control over) and doesn't support hash and hash-ref types being passed via a Queue (according to this). Apparently, my version only supports strings being passed via queues.
Is there a way to successfully do this: $hash{$string}{"jc"} = \%hoh;
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use constant NUM_WORKERS => 10;
my @out_array : shared = ();
main();
sub main
{
my #results = test1();
foreach my $item (@results) {
print "item: $item\n";
}
}
sub test1
{
my $my_queue = Thread::Queue->new();
foreach (1..NUM_WORKERS) {
async {
while (my $job = $my_queue->dequeue()) {
test2($job);
}
};
}
my @sentiments = ("Axe Murderer", "Mauler", "Babyface", "Dragon");
$my_queue->enqueue(@sentiments);
$my_queue->enqueue(undef) for 1..NUM_WORKERS;
$_->join() for threads->list();
my @return_array = @out_array;
return @return_array;
}
sub test2
{
my $string = $_[0];
my %hash : shared;
my @new_array : shared;
my %new_hash : shared;
my %hoh : shared;
@new_array = ("tom", "jerry");
%new_hash = (itchy => "Scratchy");
%hoh = hoh(@new_array);
my %anon : shared;
$hash{$string} = \%anon;
$hash{$string}{"Grenade"} = \#new_array;
$hash{$string}{"Pipe bomb"} = \%new_hash;
$hash{$string}{"jc"} = \%hoh;
push #out_array, \%hash;
return;
}
sub hoh
{
my %hoh;
foreach my $item (@_) {
$hoh{"jeepers"}{"creepers"} = $item;
}
return %hoh;
}
The problem is that you're trying to store a reference to something that isn't shared in a shared variable. You need to use share as previously mentioned, or you need to serialise the data structure; a Storable-based sketch follows the example below.
#!/perl/bin/perl
use strict;
use threads;
use threads::shared;
my %hm_n2g:shared = ();
my $row = &share([]);
$hm_n2g{"aa"}=$row;
$row->[0]=1;
$row->[1]=2;
my @arr = @{$hm_n2g{"aa"}};
print $arr[0]." ".$arr[1]."\n";
#If you want to lock the hash in a thread-subroutine
{
lock(%hm_n2g)
}
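For the serialisation route mentioned above, here's a minimal sketch using Storable (my assumption; the original answer doesn't show it). It also sidesteps old Thread::Queue versions that only pass strings:

use strict;
use warnings;
use threads;
use Thread::Queue;
use Storable qw(freeze thaw);

my $q = Thread::Queue->new();

# Producer side: flatten the structure into a plain byte string,
# which even very old Thread::Queue versions can pass along.
my %hoh = ( jeepers => { creepers => 'tom' } );
$q->enqueue( freeze(\%hoh) );

# Consumer side: rebuild a normal (unshared) copy of the structure.
my $copy = thaw( $q->dequeue() );
print $copy->{jeepers}{creepers}, "\n";   # prints "tom"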

Multiple file download and processing with Perl

I have a Perl program on serverA; the program needs to process data for around 500 DOM IPs. The DOM files are on serverB. For each DOM I need to download 6 files, apply some formulas, and insert the results into a MySQL DB. Downloading the files for each DOM takes approximately 2 minutes. I need to do this in as little time as possible, because I have to run the process approximately every two hours.
Right now I am using multithreading:
my @threads;
for my $key (keys %dom) ### Have all DOM ip
{
print "El key es $key\n";
my %data = %{$dom{$key}};
my $t = threads->new(\&sub1, $postD, $preD, $key, $counter, %data);
push(@threads,$t);
if($counter == 40)
{
foreach (@threads) {
my $num = $_->join;
print "done with $num\n";
}
$counter = 1;
@threads=();
}
$counter++;
}
foreach (@threads) {
my $num = $_->join;
print "done with $num\n";
}
sub sub1
{
my ($postD, $preD, $key, $num, %data) = @_;
my $status = GetRelevantFiles(substr($postD,0,8),substr($preD,0,8),%data) if (!defined($opt_f));
if(ref($status) eq 'ERROR')
{
warnNotify($status->{'message'});
}
return $num;
}
Sometimes it does not bring back all the files.
Am I doing this right, or is there a better way to do it?
Thanks a lot for your help!
You might consider replacing your threads with Parallel::ForkManager
As for not downloading all of the files:
Is there a consistency to the number not downloaded?
Are there any error messages?
