I have created a Perl script with the two threads:
1) Create .cfg files
2) Read those .cfg files
I want to accomplish that while one thread is busy making files, side by side my second thread would read those files.
Bu I am facing a problem at that moment when my second thread is created it will only read that files which are created at that moment. Can anyone give idea about accomplishing this task.
The Create threads and Reading files functions are mentioned below:
sub CreateThreads() {
$threads[0] = threads->create( \&CreateCFG );
sleep 3;
$threads[1] = threads->create( \&ReadCFG );
}
sub ReadCFG() {
my $id = threads->tid();
opendir( DIR, $filedir ) or die $!;
while ( my $files = readdir(DIR) ) {
if ( $files eq "." || $files eq ".." ) {
next;
}
else {
my $filepath = $filedir . "/" . $files;
if ( $files =~ m/$probename/ ) {
my #file = split( "&", $files );
if ( $file[3] !~~ #reportrobots ) {
push( #reportrobots, $file[3] );
my $raddr =
"/" . $file[1] . "/" . $file[2] . "/" . $file[3];
&ProbeReport( \$userdata_file, \$filepath, \$raddr );
}
}
}
}
}
I think your problem here will be that you're not rereading the directory.
opendir( DIR, $filedir ) or die $!;
while ( my $files = readdir(DIR) ) {
Because it's not obvious, what's actually happening is the first time you call readdir it reads the directory, and subsequent calls just iterate that list.
That's to prevent strange things happening if - for example - I'm creating a file in that directory within my loop.
So in effect - your second thread only ever looks at that directory once. It's functionally similar to:
my #files = readdir(DIR);
foreach my $file ( #files ) {
I would suggest that what you need is the rewinddir function. Or perhaps use a Thread::Queue or other mechanism to pass through files in need of processing.
Related
Use of global arrays in different threads
I'm going to use Dancer2 and File::Tail to use Tail on the web. So when the Websocket is opened, it stores the $conn in an array, and when File::Tail is detected, it tries to send data to the socket stored in the array. But it doesn't work as expected.
The array that is saved when a websocket connection occurs is probably not a global variable.
# it doesn't works.
foreach (#webs) {
$_->send_utf8("test2!!!!!!!!");
}
I tried to use threads::shared and Cache:::Memcached etc, but I failed.
I don't know perl very well. I tried to solve it myself, but I couldn't solve it for too long, so I leave a question.
This is the whole code.
use File::Tail ();
use threads;
use threads::shared;
use Net::WebSocket::Server;
use strict;
use Dancer2;
my #webs = ();
# my %clients :shared = ();
my $conns :shared = 4;
threads->create(sub {
print "start-end:", "$conns", "\n";
my #files = glob( $ARGV[0] . '/*' );
my #fs = ();
foreach my $fileName(#files) {
my $file = File::Tail->new(name=>"$fileName",
tail => 1000,
maxinterval=>1,
interval=>1,
adjustafter=>5,resetafter=>1,
ignore_nonexistant=>1,
maxbuf=>32768);
push(#fs, $file);
}
do {
my $timeout = 1;
(my $nfound,my $timeleft,my #pending)=
File::Tail::select(undef,undef,undef,$timeout,#fs);
unless ($nfound) {
} else {
foreach (#pending) {
my $str = $_->read;
print $_->{"input"} . " ||||||||| ".localtime(time)." ||||||||| ".$str;
# it doesn't works.
foreach (#webs) {
$_->send_utf8("test!!!!!!!!");
}
}
}
} until(0);
})->detach();
threads->create(sub {
Net::WebSocket::Server->new(
listen => 8080,
on_connect => sub {
my ($serv, $conn) = #_;
push(#webs, $conn);
$conn->on(
utf8 => sub {
my ($conn, $msg) = #_;
$conn->send_utf8($msg);
# it works.
foreach (#webs) {
$_->send_utf8("test!!!!!!!!");
}
},
);
},
)->start;
})->detach();
get '/' => sub {
my $ws_url = "ws://127.0.0.1:8080/";
return <<"END";
<html>
<head><script>
var urlMySocket = "$ws_url";
var mySocket = new WebSocket(urlMySocket);
mySocket.onmessage = function (evt) {
console.log( "Got message " + evt.data );
};
mySocket.onopen = function(evt) {
console.log("opening");
setTimeout( function() {
mySocket.send('hello'); }, 2000 );
};
</script></head>
<body><h1>WebSocket client</h1></body>
</html>
END
};
dance;
Threads in perl are not lightweight. They're separate instances of the program.
The only thing that threads have in common, are things that exist prior to the threads instantating.
You can - with declaring shared variables - allow data structures to share between threads, however I'd warn you to be cautious here - without some manner of locking, you potentially create yourself a race condition.
In your case, you could declare #webs as : shared. This will mean values inserted into it will be visible to all your threads. But you still need a degree of caution there, because 'when stuff is added' is still nondeterministic.
But anyway, this basically works:
#!/usr/bin/env perl
use strict;
use warnings;
use threads;
use threads::shared;
use Data::Dumper;
my #shared_struct : shared;
sub reader {
print "Starting reader\n";
for ( 1..10 ) {
print threads -> self() -> tid(), ":", join (",", #shared_struct ), "\n";
sleep 1;
}
}
sub writer {
print "starting writer\n";
for ( 1..10 ) {
push #shared_struct, rand(10);
print Dumper \#shared_struct;
sleep 1;
}
}
## start the threads;
my $reader = threads -> create ( \&reader );
my $writer = threads -> create ( \&writer );
while ( 1 ) {
print #shared_struct;
sleep 1;
}
More generally, I'd suggest you almost never actually want to detach a thread in perl - in doing so, what you're saying is 'I don't care about your execution'. And clearly that's not the case in your code - you're trying to talk to the threads.
Just creating the thread accomplishes what you want - parallel execution and you can have:
for my $thread ( threads -> list ) {
$thread -> join;
}
As and when you're ready for the thread to terminate.
I'm a BI developer working with perl scripts as my ETL - I receive data over email, take the file, parse it and push it into the DB.
Most of the files are CSV, but occasionally I have an XLSX file.
I've been using Spreadsheet::XLSX to convert, but I've noticed that the CSV output comes out with the wrong encoding (needs to be UTF8, because accents and foreign languages).
That's the sub I'm using ($input_file is an Excel file), but I keep getting the data with the wrong characters.
WHAT am I missing?
Thanks a lot all!
sub convert_to_csv {
my $input_file = $_[0];
my ( $filename, $extension ) = split( '\.', $input_file );
open( format_file, ">:**encoding(utf-8)**", "$filename.csv" ) or die "could not open out file $!\n";
my $excel = Spreadsheet::XLSX->new($input_file);
my $line;
foreach my $sheet ( #{ $excel->{Worksheet} } ) {
#printf( "Sheet: %s\n", $sheet->{Name} );
$sheet->{MaxRow} ||= $sheet->{MinRow};
foreach my $row ( $sheet->{MinRow} .. $sheet->{MaxRow} ) {
$sheet->{MaxCol} ||= $sheet->{MinCol};
foreach my $col ( $sheet->{MinCol} .. $sheet->{MaxCol} ) {
my $cell = $sheet->{Cells}[$row][$col];
if ($cell) {
my $trimcell;
$trimcell = $cell->value();
print STDERR "cell: $trimcell\n"; ## Just for the tests so I don't have to open the file to see if it's ok
$trimcell =~ s/^\s+|\s+$//g; ## Just to make sure I don't have extra spaces
$line .= "\"" . $trimcell . "\",";
}
}
chomp($line);
if ($line =~ /Grand Total/){} ##customized for the files
else {
print format_file "$line\n";
$line = '';
}
}
}
close format_file;
}
My knowledge is from using ETL::Pipeline and it uses Spreadsheet::XLSX for reading .xlsx-files.
But I know which fields are UTF-8
I wrote a Local ETL::Pipeline module to handle output for Excel files
use Encode qw(decode encode);
$ra_rec->{name} = decode( 'UTF-8', $ra_rec->{name}, Encode::FB_CROAK );
I currently have a script that kicks off threads to perform various actions on several directories. A snippet of my script is:
#main
sub BuildInit {
my $actionStr = "";
my $compStr = "";
my #component_dirs;
my #compToBeBuilt;
foreach my $comp (#compList) {
#component_dirs = GetDirs($comp); #populates #component_dirs
}
print "Printing Action List: #actionList\n";
#---------------------------------------
#---- Setup Worker Threads ----------
for ( 1 .. NUM_WORKERS ) {
async {
while ( defined( my $job = $q->dequeue() ) ) {
worker($job);
}
};
}
#-----------------------------------
#---- Enqueue The Work ----------
for my $action (#actionList) {
my $sem = Thread::Semaphore->new(0);
$q->enqueue( [ $_, $action, $sem ] ) for #component_dirs;
$sem->down( scalar #component_dirs );
print "\n------>> Waiting for prior actions to finish up... <<------\n";
}
# Nothing more to do - notify the Queue that we're not adding anything else
$q->end();
$_->join() for threads->list();
return 0;
}
#worker
sub worker {
my ($job) = #_;
my ( $component, $action, $sem ) = #$job;
Build( $component, $action );
$sem->up();
}
#builder method
sub Build {
my ( $comp, $action ) = #_;
my $cmd = "$MAKE $MAKE_INVOCATION_PATH/$comp ";
my $retCode = -1;
given ($action) {
when ("depend") { $cmd .= "$action >nul 2>&1" } #suppress output
when ("clean") { $cmd .= $action }
when ("build") { $cmd .= 'l1' }
when ("link") { $cmd .= '' } #add nothing; default is to link
default { die "Action: $action is unknown to me." }
}
print "\n\t\t*** Performing Action: \'$cmd\' on $comp ***" if $verbose;
if ( $action eq "link" ) {
# hack around potential race conditions -- will only be an issue during linking
my $tries = 1;
until ( $retCode == 0 or $tries == 0 ) {
last if ( $retCode = system($cmd) ) == 2; #compile error; stop trying
$tries--;
}
}
else {
$retCode = system($cmd);
}
push( #retCodes, ( $retCode >> 8 ) );
#testing
if ( $retCode != 0 ) {
print "\n\t\t*** ERROR IN $comp: $# !! ***\n";
print "\t\t*** Action: $cmd -->> Error Level: " . ( $retCode >> 8 ) . "\n";
#exit(-1);
}
return $retCode;
}
The print statement I'd like to be thread-safe is: print "\n\t\t*** Performing Action: \'$cmd\' on $comp ***" if $verbose; Ideally, I would like to have this output, and then each component that is having the $action performed on it, would output in related chunks. However, this obviously doesn't work right now - the output is interleaved for the most part, with each thread spitting out it's own information.
E.g.,:
ComponentAFile1.cpp
ComponentAFile2.cpp
ComponentAFile3.cpp
ComponentBFile1.cpp
ComponentCFile1.cpp
ComponentBFile2.cpp
ComponentCFile2.cpp
ComponentCFile3.cpp
... etc.
I considered executing the system commands using backticks, and capturing all of the output in a big string or something, then output it all at once, when the thread terminates. But the issue with this is (a) it seems super inefficient, and (b) I need to capture stderr.
Can anyone see a way to keep my output for each thread separate?
clarification:
My desired output would be:
ComponentAFile1.cpp
ComponentAFile2.cpp
ComponentAFile3.cpp
------------------- #some separator
ComponentBFile1.cpp
ComponentBFile2.cpp
------------------- #some separator
ComponentCFile1.cpp
ComponentCFile2.cpp
ComponentCFile3.cpp
... etc.
To ensure your output isn't interrupted, access to STDOUT and STDERR must be mutually exclusive. That means that between the time a thread starts printing and finishes printing, no other thread can be allowed to print. This can be done using Thread::Semaphore[1].
Capturing the output and printing it all at once allows you to reduce the amount of time a thread holds a lock. If you don't do that, you'll effectively make your system single-threaded system as each thread attempts lock STDOUT and STDERR while one thread runs.
Other options include:
Using a different output file for each thread.
Prepending a job id to each line of output so the output can be sorted later.
In both of those cases, you only need to lock it for a very short time span.
# Once
my $mutex = Thread::Semaphore->new(); # Shared by all threads.
# When you want to print.
$mutex->down();
print ...;
STDOUT->flush();
STDERR->flush();
$mutex->up();
or
# Once
my $mutex = Thread::Semaphore->new(); # Shared by all threads.
STDOUT->autoflush();
STDERR->autoflush();
# When you want to print.
$mutex->down();
print ...;
$mutex->up();
You can utilize the blocking behavior of $sem->down if it attempts to decrease the semaphore counter below zero, as mentioned in perldoc perlthrtut:
If down() attempts to decrement the counter below zero, it blocks
until the counter is large enough.
So here's what one could do:
Initialize a semaphore with counter 1 that is shared across all threads
my $sem = Thread::Semaphore->new( 1 );
Pass a thread counter to worker and Build
for my $thr_counter ( 1 .. NUM_WORKERS ) {
async {
while ( defined( my $job = $q->dequeue() ) ) {
worker( $job, $thr_counter );
}
};
}
sub worker {
my ( $job, $counter ) = #_;
Build( $component, $action, $counter );
}
Go ->down and ->up inside Build (and nowhere else)
sub Build {
my ( $comp, $action, $counter ) = #_;
... # Execute all concurrently-executed code here
$sem->down( 1 << ( $counter -1 ) );
print "\n\t\t*** Performing Action: \'$cmd\' on $comp ***" if $verbose;
# Execute all sequential 'chunks' here
$sem->up( 1 << ( $counter - 1) );
}
By using the thread counter to left-shift the semaphore counter, it guarantees that the threads won't trample on one another:
+-----------+---+---+---+---+
| Thread | 1 | 2 | 3 | 4 |
+-----------+---+---+---+---+
| Semaphore | 1 | 2 | 4 | 8 |
+-----------+---+---+---+---+
I've approached this problem differently in the past, by creating an IO thread, and using that to serialise the file access.
E.g.
my $output_q = Thread::Queue -> new();
sub writer {
open ( my $output_fh, ">", $output_filename );
while ( my $line = $output_q -> dequeue() ) {
print {$output_fh} $line;
}
close ( $output_fh );
}
And within threads, 'print' by:
$output_q -> enqueue ( "text_to_print\n"; );
Either with or without a wrapper - e.g. for timestamping statements if they're going to a log. (You probably want to timestamp when queued, rather than when actually printer).
I was provided some guidance on here at one time, with the following snippet:
my $q = Thread::Queue->new();
sub worker {
my ($job, $action) = #_;
Build($job, $action);
}
for (1..NUM_WORKERS) {
async {
while (defined(my $job = $q->dequeue())) {
worker($job, 'clean');
}
};
}
$q->enqueue($_) for #compsCopy;
# When you're done adding to the queue.
$q->end();
$_->join() for threads->list();
What is the best option for reusing q? Currently, I'm just making new q objects, q2, q3 and doing all of this over again for each $action that I want to perform. Is there a better way though? I could potentially pass in an array of "actions" that I would like to perform, and would like to avoid duplicating this code 7 times if possible.
Maybe I don't fully understand what a Thread::Queue is..
You should use one queue for one direction. If you just would like to some operation paralel, use one queue. If you would like to report back errors and process those error in main or another thread then you use two queue.
for simple use here is for your reference:
use strict;
use warnings;
use threads;
use threads;
use Thread::Queue;
my $q = Thread::Queue->new(); # A new empty queue
my %seen: shared;
# Worker thread
my #thrs = threads->create(\&doOperation ) for 1..5;#for 5 threads
add_file_to_q('/tmp/');
$q->enqueue('//_DONE_//') for #thrs;
$_->join() for #thrs;
sub add_file_to_q {
my $dir = shift;
my #files = `ls -1 $dir/`;chomp(#files);
#add files to queue
foreach my $f (#files){
# Send work to the thread
$q->enqueue($f);
print "Pending items: "$q->pending()."\n";
}
}
sub doOperation () {
my $ithread = threads->tid() ;
while (my $filename = $q->dequeue()) {
# Do work on $item
sleep(1) if ! defined $filename;
return 1 if $filename eq '//_DONE_//';
next if $seen{$filename};
print "[id=$ithread]\t$filename\n";
$seen{$filename} = 1;
### add files if it is a directory (check with symlinks, no file with //_DONE_// name!)
add_file_to_q($filename) if -d $filename;
}
return 1;
}
I have a perl program on serverA, the program needs to process data for around 500 DOM ip's. The DOM files are on serverB. For each DOM i need to download 6 files to some formulas and inserted them on MySQL DB. For each DOM takes aproximately 2 minutes to download the files. I need to do that in the lowest time possible, because i have to that process aproximately every two hours.
Right now I am using multithreading:
my #threads;
for my $key (keys %dom) ### Have all DOM ip
{
print "El key es $key\n";
my %data = %{$dom{$key}};
my $t = threads->new(\&sub1, $postD, $preD, $key, $counter, %data);
push(#threads,$t);
if($counter == 40)
{
foreach (#threads) {
my $num = $_->join;
print "done with $num\n";
}
$counter = 1;
#threads=();
}
$counter++;
}
foreach (#threads) {
my $num = $_->join;
print "done with $num\n";
sub sub1
{
my ($postD, $preD, $key, $num, %data) = #_;
my $status = GetRelevantFiles(substr($postD,0,8),substr($preD,0,8),%data) if (!defined($opt_f));
if(ref($status) eq 'ERROR')
{
warnNotify($status->{'message'});
}
return $num;
}
Sometimes do no bring all the files.
I am doing good or there is another way to do it best??
Thanks a lot for your help!
You might consider replacing your threads with Parallel::ForkManager
As for not downloading all of the files:
Is there a consistency to the number not downloaded?
Are there any error messages?