Multiple file download and processing with Perl - multithreading

I have a perl program on serverA, the program needs to process data for around 500 DOM ip's. The DOM files are on serverB. For each DOM i need to download 6 files to some formulas and inserted them on MySQL DB. For each DOM takes aproximately 2 minutes to download the files. I need to do that in the lowest time possible, because i have to that process aproximately every two hours.
Right now I am using multithreading:
my #threads;
for my $key (keys %dom) ### Have all DOM ip
{
print "El key es $key\n";
my %data = %{$dom{$key}};
my $t = threads->new(\&sub1, $postD, $preD, $key, $counter, %data);
push(#threads,$t);
if($counter == 40)
{
foreach (#threads) {
my $num = $_->join;
print "done with $num\n";
}
$counter = 1;
#threads=();
}
$counter++;
}
foreach (#threads) {
my $num = $_->join;
print "done with $num\n";
sub sub1
{
my ($postD, $preD, $key, $num, %data) = #_;
my $status = GetRelevantFiles(substr($postD,0,8),substr($preD,0,8),%data) if (!defined($opt_f));
if(ref($status) eq 'ERROR')
{
warnNotify($status->{'message'});
}
return $num;
}
Sometimes do no bring all the files.
I am doing good or there is another way to do it best??
Thanks a lot for your help!

You might consider replacing your threads with Parallel::ForkManager
As for not downloading all of the files:
Is there a consistency to the number not downloaded?
Are there any error messages?

Related

Use of global arrays in different threads in perl

Use of global arrays in different threads
I'm going to use Dancer2 and File::Tail to use Tail on the web. So when the Websocket is opened, it stores the $conn in an array, and when File::Tail is detected, it tries to send data to the socket stored in the array. But it doesn't work as expected.
The array that is saved when a websocket connection occurs is probably not a global variable.
# it doesn't works.
foreach (#webs) {
$_->send_utf8("test2!!!!!!!!");
}
I tried to use threads::shared and Cache:::Memcached etc, but I failed.
I don't know perl very well. I tried to solve it myself, but I couldn't solve it for too long, so I leave a question.
This is the whole code.
use File::Tail ();
use threads;
use threads::shared;
use Net::WebSocket::Server;
use strict;
use Dancer2;
my #webs = ();
# my %clients :shared = ();
my $conns :shared = 4;
threads->create(sub {
print "start-end:", "$conns", "\n";
my #files = glob( $ARGV[0] . '/*' );
my #fs = ();
foreach my $fileName(#files) {
my $file = File::Tail->new(name=>"$fileName",
tail => 1000,
maxinterval=>1,
interval=>1,
adjustafter=>5,resetafter=>1,
ignore_nonexistant=>1,
maxbuf=>32768);
push(#fs, $file);
}
do {
my $timeout = 1;
(my $nfound,my $timeleft,my #pending)=
File::Tail::select(undef,undef,undef,$timeout,#fs);
unless ($nfound) {
} else {
foreach (#pending) {
my $str = $_->read;
print $_->{"input"} . " ||||||||| ".localtime(time)." ||||||||| ".$str;
# it doesn't works.
foreach (#webs) {
$_->send_utf8("test!!!!!!!!");
}
}
}
} until(0);
})->detach();
threads->create(sub {
Net::WebSocket::Server->new(
listen => 8080,
on_connect => sub {
my ($serv, $conn) = #_;
push(#webs, $conn);
$conn->on(
utf8 => sub {
my ($conn, $msg) = #_;
$conn->send_utf8($msg);
# it works.
foreach (#webs) {
$_->send_utf8("test!!!!!!!!");
}
},
);
},
)->start;
})->detach();
get '/' => sub {
my $ws_url = "ws://127.0.0.1:8080/";
return <<"END";
<html>
<head><script>
var urlMySocket = "$ws_url";
var mySocket = new WebSocket(urlMySocket);
mySocket.onmessage = function (evt) {
console.log( "Got message " + evt.data );
};
mySocket.onopen = function(evt) {
console.log("opening");
setTimeout( function() {
mySocket.send('hello'); }, 2000 );
};
</script></head>
<body><h1>WebSocket client</h1></body>
</html>
END
};
dance;
Threads in perl are not lightweight. They're separate instances of the program.
The only thing that threads have in common, are things that exist prior to the threads instantating.
You can - with declaring shared variables - allow data structures to share between threads, however I'd warn you to be cautious here - without some manner of locking, you potentially create yourself a race condition.
In your case, you could declare #webs as : shared. This will mean values inserted into it will be visible to all your threads. But you still need a degree of caution there, because 'when stuff is added' is still nondeterministic.
But anyway, this basically works:
#!/usr/bin/env perl
use strict;
use warnings;
use threads;
use threads::shared;
use Data::Dumper;
my #shared_struct : shared;
sub reader {
print "Starting reader\n";
for ( 1..10 ) {
print threads -> self() -> tid(), ":", join (",", #shared_struct ), "\n";
sleep 1;
}
}
sub writer {
print "starting writer\n";
for ( 1..10 ) {
push #shared_struct, rand(10);
print Dumper \#shared_struct;
sleep 1;
}
}
## start the threads;
my $reader = threads -> create ( \&reader );
my $writer = threads -> create ( \&writer );
while ( 1 ) {
print #shared_struct;
sleep 1;
}
More generally, I'd suggest you almost never actually want to detach a thread in perl - in doing so, what you're saying is 'I don't care about your execution'. And clearly that's not the case in your code - you're trying to talk to the threads.
Just creating the thread accomplishes what you want - parallel execution and you can have:
for my $thread ( threads -> list ) {
$thread -> join;
}
As and when you're ready for the thread to terminate.

How to Flatten / Recompile Excel Spreadsheet Using sheetjs or exceljs on Write

We use excel as a configuration file for clients. However, our processes only run on linux servers. We need to take a master file, update all the client workbooks with the new information, and commit to GitLab. The users then check it out, add their own changes, commit back to GitLab and a process promotes the workbook to Server A.
This process works great using nodeJS (exceljs)
Another process on a different server is using perl to pick up the workbook and then saves each sheet as a csv file.
The problem is, what gets written out is the data from the ORIGINAL worksheet and not the updated changes. This is true of both perl and nodejs. Code for perl and nodejs xlsx to csv is at the end of the post.
Modules Tried:
perl : Spreadsheet::ParseExcel; Spreadsheet::XLSX;
nodejs: node-xlsx, exceljs
I assume it has to do with Microsoft using XML inside the excel wrapper, it keeps the old version as history and since it was the original sheet name, it gets pulled instead of the updated latest version.
When I manually open in Excel, everything is correct with the new info as expected.
When I use "Save as..." instead of "Save" then the perl process is able to correctly write out the updated worksheet as csv. So our workaround is having the users always "Save as.." before committing their extra changes to GitLab. We'd like to rely on training, but the sheer number of users and clients makes trusting that the user will "Save AS..." is not practical.
Is there a way to replicate a "Save As..." during my promotion to Server A or at least be able to tell if the file had been saved correctly? I'd like to stick with excelJS, but I'll use whatever is necessary to replicate the "Save as..." which seems to recompile the workbook.
In addition to nodejs, I can use perl, python, ruby - whatever it takes - to make sure the csv creation process picks up the new changes.
Thanks for your time and help.
#!/usr/bin/env perl
use strict;
use warnings;
use Carp;
use Getopt::Long;
use Pod::Usage;
use File::Basename qw/fileparse/;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::XLSX;
use Getopt::Std;
my %args = ();
my $help = undef;
GetOptions(
\%args,
'excel=s',
'sheet=s',
'man|help'=>\$help,
) or die pod2usage(1);
pod2usage(1) if $help;
pod2usage(-verbose=>2, exitstatus=>0, output=>\*STDOUT) unless $args{excel} || $args{sheet};
pod2usage(3) if $help;
pod2usage(-verbose=>2, exitstatus=>3, output=>\*STDOUT) unless $args{excel};
if (_getSuffix($args{excel}) eq ".xls") {
my $file = File::Spec->rel2abs($args{excel});
if (-e $file) {
print _XLS(file=>$file, sheet=>$args{sheet});
} else {
exit 1;
die "Error: Can not find excel file. Please check for exact excel file name and location. \nError: This Program is CASE SENSITIVE. \n";
}
}
elsif (_getSuffix($args{excel}) eq ".xlsx") {
my $file = File::Spec->rel2abs($args{excel});
if (-e $file) {
print _XLSX(file=>$file, sheet=>$args{sheet});
}
else {
exit 1;
die "\nError: Can not find excel file. Please check for exact excel file name and location. \nError: This Program is CASE SENSITIVE.\n";
}
}
else {
exit 5;
}
sub _XLS {
my %opts = (
file => undef,
sheet => undef,
#_,
);
my $aggregated = ();
my $parser = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse($opts{file});
if (!defined $workbook) {
exit 3;
croak "Error: Workbook not found";
}
foreach my $worksheet ($workbook->worksheet($opts{sheet})) {
if (!defined $worksheet) {
exit 2;
croak "\nError: Worksheet name doesn't exist in the Excel File. Please check the WorkSheet Name. \nError: This program is CASE SENSITIVE.\n\n";
}
my ($row_min, $row_max) = $worksheet->row_range();
my ($col_min, $col_max) = $worksheet->col_range();
foreach my $row ($row_min .. $row_max){
foreach my $col ($col_min .. $col_max){
my $cell = $worksheet->get_cell($row, $col);
if ($cell) {
$aggregated .= $cell->value().',';
}
else {
$aggregated .= ',';
}
}
$aggregated .= "\n";
}
}
return $aggregated;
}
sub _XLSX {
eval {
my %opts = (
file => undef,
sheet => undef,
#_,
);
my $aggregated_x = ();
my $excel = Spreadsheet::XLSX->new($opts{file});
foreach my $sheet ($excel->worksheet($opts{sheet})) {
if (!defined $sheet) {
exit 2;
croak "Error: WorkSheet not found";
}
if ( $sheet->{Name} eq $opts{sheet}) {
$sheet->{MaxRow} ||= $sheet->{MinRow};
foreach my $row ($sheet->{MinRow} .. $sheet->{MaxRow}) {
$sheet->{MaxCol} ||= $sheet->{MinCol};
foreach my $col ($sheet->{MinCol} .. $sheet->{MaxCol}) {
my $cell = $sheet->{Cells}->[$row]->[$col];
if ($cell) {
$aggregated_x .= $cell->{Val}.',';
}
else {
$aggregated_x .= ',';
}
}
$aggregated_x .= "\n";
}
}
}
return $aggregated_x;
}
};
if ($#) {
exit 3;
}
sub _getSuffix {
my $f = shift;
my ($basename, $dirname, $ext) = fileparse($f, qr/\.[^\.]*$/);
return $ext;
}
sub _convertlwr{
my $f = shift;
my ($basename, $dirname, $ext) = fileparse($f, qr/\.[^\.]*$/);
return $ext;
}
var xlsx = require('node-xlsx')
var fs = require('fs')
var obj = xlsx.parse(__dirname + '/test2.xlsx') // parses a file
var rows = []
var writeStr = ""
//looping through all sheets
for(var i = 0; i < obj.length; i++)
{
var sheet = obj[i]
//loop through all rows in the sheet
for(var j = 0; j < sheet['data'].length; j++)
{
//add the row to the rows array
rows.push(sheet['data'][j])
}
}
//creates the csv string to write it to a file
for(var i = 0; i < rows.length; i++)
{
writeStr += rows[i].join(",") + "\n"
}
//writes to a file, but you will presumably send the csv as a
//response instead
fs.writeFile(__dirname + "/test2.csv", writeStr, function(err) {
if(err) {
return console.log(err)
}
console.log("test.csv was saved in the current directory!")
The answer is its impossible. In order to update data inside a workbook that has excel functions, you must open it in Excel for the formulas to trigger. It's that simple.
You could pull the workbook apart, create your own javascript functions, run the data through it and then write it out, but there are so many possible issues that it is not recommended.
Perhaps one day Microsoft will release a linux Excel engine API for linux. But its still unlikely that such a thing would work via command line without invoking the GUI.

Is there a thread-safe way to print in Perl?

I currently have a script that kicks off threads to perform various actions on several directories. A snippet of my script is:
#main
sub BuildInit {
my $actionStr = "";
my $compStr = "";
my #component_dirs;
my #compToBeBuilt;
foreach my $comp (#compList) {
#component_dirs = GetDirs($comp); #populates #component_dirs
}
print "Printing Action List: #actionList\n";
#---------------------------------------
#---- Setup Worker Threads ----------
for ( 1 .. NUM_WORKERS ) {
async {
while ( defined( my $job = $q->dequeue() ) ) {
worker($job);
}
};
}
#-----------------------------------
#---- Enqueue The Work ----------
for my $action (#actionList) {
my $sem = Thread::Semaphore->new(0);
$q->enqueue( [ $_, $action, $sem ] ) for #component_dirs;
$sem->down( scalar #component_dirs );
print "\n------>> Waiting for prior actions to finish up... <<------\n";
}
# Nothing more to do - notify the Queue that we're not adding anything else
$q->end();
$_->join() for threads->list();
return 0;
}
#worker
sub worker {
my ($job) = #_;
my ( $component, $action, $sem ) = #$job;
Build( $component, $action );
$sem->up();
}
#builder method
sub Build {
my ( $comp, $action ) = #_;
my $cmd = "$MAKE $MAKE_INVOCATION_PATH/$comp ";
my $retCode = -1;
given ($action) {
when ("depend") { $cmd .= "$action >nul 2>&1" } #suppress output
when ("clean") { $cmd .= $action }
when ("build") { $cmd .= 'l1' }
when ("link") { $cmd .= '' } #add nothing; default is to link
default { die "Action: $action is unknown to me." }
}
print "\n\t\t*** Performing Action: \'$cmd\' on $comp ***" if $verbose;
if ( $action eq "link" ) {
# hack around potential race conditions -- will only be an issue during linking
my $tries = 1;
until ( $retCode == 0 or $tries == 0 ) {
last if ( $retCode = system($cmd) ) == 2; #compile error; stop trying
$tries--;
}
}
else {
$retCode = system($cmd);
}
push( #retCodes, ( $retCode >> 8 ) );
#testing
if ( $retCode != 0 ) {
print "\n\t\t*** ERROR IN $comp: $# !! ***\n";
print "\t\t*** Action: $cmd -->> Error Level: " . ( $retCode >> 8 ) . "\n";
#exit(-1);
}
return $retCode;
}
The print statement I'd like to be thread-safe is: print "\n\t\t*** Performing Action: \'$cmd\' on $comp ***" if $verbose; Ideally, I would like to have this output, and then each component that is having the $action performed on it, would output in related chunks. However, this obviously doesn't work right now - the output is interleaved for the most part, with each thread spitting out it's own information.
E.g.,:
ComponentAFile1.cpp
ComponentAFile2.cpp
ComponentAFile3.cpp
ComponentBFile1.cpp
ComponentCFile1.cpp
ComponentBFile2.cpp
ComponentCFile2.cpp
ComponentCFile3.cpp
... etc.
I considered executing the system commands using backticks, and capturing all of the output in a big string or something, then output it all at once, when the thread terminates. But the issue with this is (a) it seems super inefficient, and (b) I need to capture stderr.
Can anyone see a way to keep my output for each thread separate?
clarification:
My desired output would be:
ComponentAFile1.cpp
ComponentAFile2.cpp
ComponentAFile3.cpp
------------------- #some separator
ComponentBFile1.cpp
ComponentBFile2.cpp
------------------- #some separator
ComponentCFile1.cpp
ComponentCFile2.cpp
ComponentCFile3.cpp
... etc.
To ensure your output isn't interrupted, access to STDOUT and STDERR must be mutually exclusive. That means that between the time a thread starts printing and finishes printing, no other thread can be allowed to print. This can be done using Thread::Semaphore[1].
Capturing the output and printing it all at once allows you to reduce the amount of time a thread holds a lock. If you don't do that, you'll effectively make your system single-threaded system as each thread attempts lock STDOUT and STDERR while one thread runs.
Other options include:
Using a different output file for each thread.
Prepending a job id to each line of output so the output can be sorted later.
In both of those cases, you only need to lock it for a very short time span.
# Once
my $mutex = Thread::Semaphore->new(); # Shared by all threads.
# When you want to print.
$mutex->down();
print ...;
STDOUT->flush();
STDERR->flush();
$mutex->up();
or
# Once
my $mutex = Thread::Semaphore->new(); # Shared by all threads.
STDOUT->autoflush();
STDERR->autoflush();
# When you want to print.
$mutex->down();
print ...;
$mutex->up();
You can utilize the blocking behavior of $sem->down if it attempts to decrease the semaphore counter below zero, as mentioned in perldoc perlthrtut:
If down() attempts to decrement the counter below zero, it blocks
until the counter is large enough.
So here's what one could do:
Initialize a semaphore with counter 1 that is shared across all threads
my $sem = Thread::Semaphore->new( 1 );
Pass a thread counter to worker and Build
for my $thr_counter ( 1 .. NUM_WORKERS ) {
async {
while ( defined( my $job = $q->dequeue() ) ) {
worker( $job, $thr_counter );
}
};
}
sub worker {
my ( $job, $counter ) = #_;
Build( $component, $action, $counter );
}
Go ->down and ->up inside Build (and nowhere else)
sub Build {
my ( $comp, $action, $counter ) = #_;
... # Execute all concurrently-executed code here
$sem->down( 1 << ( $counter -1 ) );
print "\n\t\t*** Performing Action: \'$cmd\' on $comp ***" if $verbose;
# Execute all sequential 'chunks' here
$sem->up( 1 << ( $counter - 1) );
}
By using the thread counter to left-shift the semaphore counter, it guarantees that the threads won't trample on one another:
+-----------+---+---+---+---+
| Thread | 1 | 2 | 3 | 4 |
+-----------+---+---+---+---+
| Semaphore | 1 | 2 | 4 | 8 |
+-----------+---+---+---+---+
I've approached this problem differently in the past, by creating an IO thread, and using that to serialise the file access.
E.g.
my $output_q = Thread::Queue -> new();
sub writer {
open ( my $output_fh, ">", $output_filename );
while ( my $line = $output_q -> dequeue() ) {
print {$output_fh} $line;
}
close ( $output_fh );
}
And within threads, 'print' by:
$output_q -> enqueue ( "text_to_print\n"; );
Either with or without a wrapper - e.g. for timestamping statements if they're going to a log. (You probably want to timestamp when queued, rather than when actually printer).

How do you reuse a queue from Thread::Queue?

I was provided some guidance on here at one time, with the following snippet:
my $q = Thread::Queue->new();
sub worker {
my ($job, $action) = #_;
Build($job, $action);
}
for (1..NUM_WORKERS) {
async {
while (defined(my $job = $q->dequeue())) {
worker($job, 'clean');
}
};
}
$q->enqueue($_) for #compsCopy;
# When you're done adding to the queue.
$q->end();
$_->join() for threads->list();
What is the best option for reusing q? Currently, I'm just making new q objects, q2, q3 and doing all of this over again for each $action that I want to perform. Is there a better way though? I could potentially pass in an array of "actions" that I would like to perform, and would like to avoid duplicating this code 7 times if possible.
Maybe I don't fully understand what a Thread::Queue is..
You should use one queue for one direction. If you just would like to some operation paralel, use one queue. If you would like to report back errors and process those error in main or another thread then you use two queue.
for simple use here is for your reference:
use strict;
use warnings;
use threads;
use threads;
use Thread::Queue;
my $q = Thread::Queue->new(); # A new empty queue
my %seen: shared;
# Worker thread
my #thrs = threads->create(\&doOperation ) for 1..5;#for 5 threads
add_file_to_q('/tmp/');
$q->enqueue('//_DONE_//') for #thrs;
$_->join() for #thrs;
sub add_file_to_q {
my $dir = shift;
my #files = `ls -1 $dir/`;chomp(#files);
#add files to queue
foreach my $f (#files){
# Send work to the thread
$q->enqueue($f);
print "Pending items: "$q->pending()."\n";
}
}
sub doOperation () {
my $ithread = threads->tid() ;
while (my $filename = $q->dequeue()) {
# Do work on $item
sleep(1) if ! defined $filename;
return 1 if $filename eq '//_DONE_//';
next if $seen{$filename};
print "[id=$ithread]\t$filename\n";
$seen{$filename} = 1;
### add files if it is a directory (check with symlinks, no file with //_DONE_// name!)
add_file_to_q($filename) if -d $filename;
}
return 1;
}

Perl Queues and Threading

I'm trying to accomplish the following:
Have a thread that reads data from a very large file say about
10GB and push them into the queue. (I do not wish for the queue to
get very large either)
While the buildQueue thread is pushing data to the queue at the same time have
about 5 worker threads de-queue and process data.
I've made an attempt but my other threads are unreachable because of a continuous loop in my buildQueue thread.
My approach may be totally wrong. Thanks for any help, it's much appreciated.
Here's the code for buildQueue:
sub buildQueue {
print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);
open DICT_FILE, $dict_path or die("Sorry, could not open file!");
while (1) {
if (<DICT_FILE>) {
if ($queue->pending() < 100) {
my $query = <DICT_FILE>;
chomp($query);
$queue->enqueue($query);
my $count = $queue->pending();
print "Queue Size: $count Query: $query\n";
}
}
}
}
And as I've expected when this thread gets executed nothing else after will be executed because this thread will not finish.
my $builder = new Thread(&buildQueue);
Since the builder thread will be running for a long time I never get to create worker threads.
Here's the entire code:
#!/usr/bin/perl -w
use strict;
use Thread;
use Thread::Queue;
my $queue = new Thread::Queue();
my #threads;
sub buildQueue {
print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);
open dict_file, $dict_path or die("Sorry, could not open file!");
while (1) {
if (<dict_file>) {
if ($queue->pending() < 100) {
my $query = <dict_file>;
chomp($query);
$queue->enqueue($query);
my $count = $queue->pending();
print "Queue Size: $count Query: $query\n";
}
}
}
}
sub processor {
my $query;
while (1) {
if ($query = $queue->dequeue) {
print "$query\n";
}
}
}
my $builder = new Thread(&buildQueue);
push #threads, new Thread(&processor) for 1..5;
You'll need to mark when you want your threads to exit (via either joinor detach ). The fact that you have infinite loops with no last statements to break out of them is also a problem.
Edit: I also forgot a very important part! Each worker thread will block, waiting for another item to process off of the queue until they get an undef in the queue. Hence why we specifically enqueue undef once for each thread after the queue builder is done.
Try:
#!/usr/bin/perl -w
use strict;
use threads;
use Thread::Queue;
my $queue = new Thread::Queue();
our #threads; #Do you really need our instead of my?
sub buildQueue
{
print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);
#Three-argument open, please!
open my $dict_file, "<",$dict_path or die("Sorry, could not open file!");
while(my $query=<$dict_file>)
{
chomp($query);
while(1)
{ #Wait to see if our queue has < 100 items...
if ($queue->pending() < 100)
{
$queue->enqueue($query);
print "Queue Size: " . $queue->pending . "\n";
last; #This breaks out of the infinite loop
}
}
}
close($dict_file);
foreach(1..5)
{
$queue->enqueue(undef);
}
}
sub processor
{
my $query;
while ($query = $queue->dequeue)
{
print "Thread " . threads->tid . " got $query\n";
}
}
my $builder=threads->create(\&buildQueue);
push #threads,threads->create(\&process) for 1..5;
#Waiting for our threads to finish.
$builder->join;
foreach(#threads)
{
$_->join;
}
The MCE module for Perl loves big files. With MCE, one can chunk many lines at once, slurp a big chunk as a scalar string, or read 1 line at a time. Chunking many lines at once reduces the overhead for IPC.
MCE 1.504 is out now. It provides MCE::Queue with support for child processes including threads. In addition, the 1.5 release comes with 5 models (MCE::Flow, MCE::Grep, MCE::Loop, MCE::Map, and MCE::Stream) which take care of instantiating the MCE instance as well as auto-tuning max_workers and chunk_size. One may override these options btw.
Below, MCE::Loop is used for the demonstration.
use MCE::Loop;
print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);
mce_loop_f {
my ($mce, $chunk_ref, $chunk_id) = #_;
foreach my $line ( #$chunk_ref ) {
chomp $line;
## add your code here to process $line
}
} $dict_path;
If you want to specify the number of workers and/or chunk_size, then there are 2 ways to do it.
use MCE::Loop max_workers => 5, chunk_size => 300000;
Or...
use MCE::Loop;
MCE::Loop::init {
max_workers => 5,
chunk_size => 300000
};
Although chunking is preferred for large files, one can compare the time with chunking one line at a time. One may omit the first line inside the block (commented out). Notice how there's no need for an inner for loop. $chunk_ref is still an array ref containing 1 line. The input scalar $_ contains the line when chunk_size equals 1, otherwise points to $chunk_ref.
use MCE::Loop;
MCE::Loop::init {
max_workers => 5,
chunk_size => 1
};
print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);
mce_loop_f {
# my ($mce, $chunk_ref, $chunk_id) = #_;
my $line = $_;
## add your code here to process $line or $_
} $dict_path;
I hope that this demonstration was helpful for folks wanting to process a file in parallel.
:) mario
It sounds like this case could do with the Parallel::ForkManager module.
A different approach: You can also use user_tasks in MCE 1.2+ and create two multi-worker multithreading tasks, one task for reading (since it's a big file, you could also benefit from parallel reading while preserving file read seek) and one task for processing, etc.
The code below still uses Thread::Queue to manage your buffer queue.
The buildQueue sub has your queue size control and it pushes the data directly to the manager process' $R_QUEUE since we've used threads, so it has access to the parent's memory space. If you want to use forks instead, you can still access the queue through a call back function. But here I chose to simply just push to the queue.
The processQueue sub will simply de-queue whatever is in the queue until there's nothing more pending.
The task_end sub in each task is run only once by the manager process at the end of each task, so we use it to signal a stop to our worker processes.
Obviously, there's a lot of freedom in how you want to chunk your data to the workers, so you can decide upon the size of the chunk or even how to slurp your data in.
#!/usr/bin/env perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use MCE;
my $R_QUEUE = Thread::Queue->new;
my $queue_workers = 8;
my $process_workers = 8;
my $chunk_size = 1;
print "Enter a file name: ";
my $input_file = <STDIN>;
chomp($input_file);
sub buildQueue {
my ($self, $chunk_ref, $chunk_id) = #_;
if ($R_QUEUE->pending() < 100) {
$R_QUEUE->enqueue($chunk_ref);
$self->sendto('stdout', "Queue Size: " . $R_QUEUE->pending ."\n");
}
}
sub processQueue {
my $self = shift;
my $wid = $self->wid;
while (my $buff = $R_QUEUE->dequeue) {
$self->sendto('stdout', "Thread " . $wid . " got $$buff");
}
}
my $mce = MCE->new(
input_data => $input_file, # this could be a filepath or a file handle or even a scalar to treat like a file, check the documentation for more details.
chunk_size => $chunk_size,
use_slurpio => 1,
user_tasks => [
{ # queueing task
max_workers => $queue_workers,
user_func => \&buildQueue,
use_threads => 1, # we'll use threads to have access to the parent's variables in shared memory.
task_end => sub { $R_QUEUE->enqueue( (undef) x $process_workers ) } # signal stop to our process workers when they hit the end of the queue. Thanks > Jack Maney!
},
{ # process task
max_workers => $process_workers,
user_func => \&processQueue,
use_threads => 1, # we'll use threads to have access to the parent's variables in shared memory
task_end => sub { print "Finished processing!\n"; }
}
]
);
$mce->run();
exit;

Resources