Perl thread and shared variable - multithreading

How can I share a list of arrays in Perl?
My code in sub thread:
@token_list = (@token_list, [$token, time() + 1200]);
My code in main sub:
my @array1 = [];
my @token_list : shared = ();
.
.
.
$thr = threads->new(\&connect, $c, $r);
.
.
.
foreach my $token_tab (@token_list)
{
    @array1 = @$token_tab;
    print "List content: $array1[0] $array1[1]\n";
}
The problem is that the program never enters the foreach loop.

First off - threads are officially discouraged in perl:
The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that make them easy to misuse. Few people know how to use them correctly or will be able to provide help.
The use of interpreter-based threads in perl is officially discouraged.
Anyway, that aside - you talk about sharing a nested data structure. The way you're doing that doesn't work.
The problem is that Perl handles multidimensional data structures via references.
So with your @token_list:
my @token_list = (@token_list, [$token, time() + 1200]);
You're actually doing:
my @token_list : shared;
my $anon_array = [ $token, time() + 1200 ];
@token_list = ( @token_list, $anon_array );
and because $anon_array isn't shared, it doesn't work 'properly'. In order to do this, you would need to share the 'sub' array separately. share doesn't work very well, but shared_clone does.
use strict;
use warnings;
use threads;
use threads::shared;
use Data::Dumper;
my @array_of_arrays : shared;

for ( 0 .. 5 ) {
    my $array_ref = [ 1, 2, 3, 4, 5 ];
    push( @array_of_arrays, shared_clone($array_ref) );
}
print Dumper \@array_of_arrays;

sub inc_all_elements {
    foreach my $outer (@array_of_arrays) {
        foreach my $inner (@$outer) {
            $inner++;
        }
        push( @$outer, 8 );
    }
}

for ( 1 .. 5 ) {
    threads->create( \&inc_all_elements );
}
foreach my $thr ( threads->list ) {
    $thr->join;
}
print Dumper \@array_of_arrays;
This makes the inner array reference 'shared' as well as the outer, which is what's missing from your code.
I think that means getting your code working would be as simple as:
@token_list = ( @token_list, shared_clone( [$token, time() + 1200] ) );
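Putting that together with the question's variables, a minimal self-contained sketch (the token value and the worker body are illustrative placeholders, not the question's actual connect handler):

use strict;
use warnings;
use threads;
use threads::shared;

my @token_list : shared;

sub connect_handler {
    my $token = 'abc123';    # hypothetical token produced by the worker
    # shared_clone makes the inner arrayref visible across threads
    push @token_list, shared_clone( [ $token, time() + 1200 ] );
}

threads->create( \&connect_handler )->join;

foreach my $token_tab (@token_list) {
    print "List content: $token_tab->[0] $token_tab->[1]\n";
}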

Related

How to apply multithreading to Bio::SeqIO translate code (Bioperl)?

I am translating a fasta nucleotide file into protein sequences with this code:
use Bio::SeqIO;
use Getopt::Long;
my ($format,$outfile) = 'fasta';
GetOptions(
    'f|format:s'        => \$format,
    'o|out|outfile:s'   => \$outfile,
);
my $oformat = 'fasta';
$file = $ARGV[0];
chomp $file;
# this implicitly uses the <> file stream
my $seqin = Bio::SeqIO->new( -format => $format, -fh => \*ARGV );
my $seqout;
if ( $outfile ) {
    $seqout = Bio::SeqIO->new( -format => $oformat, -file => ">$outfile" );
}
else {
    # defaults to writing to STDOUT
    $seqout = Bio::SeqIO->new( -format => $oformat );
}
while ( my $seq = $seqin->next_seq() ) {
    my $pseq = $seq->translate();
    $seqout->write_seq($pseq);
}
I have used the threads and threads::shared modules to achieve this in other cases, and I want to apply the following approach to the task above:
use threads;
use threads::shared;
use List::Util qw( sum );
use YAML;
use constant NUM_THREADS => 100;

my @output :shared;
# NB: @data, $seqin and $seqout are assumed to be set up elsewhere
my $chunk_size = @data / NUM_THREADS;
my @threads;

for my $chunk ( 1 .. NUM_THREADS ) {
    my $start = ($chunk - 1) * $chunk_size;
    push @threads, threads->create(
        \&doOperation,
        \@data,
        $start,
        ($start + $chunk_size - 1),
        \@output,
    );
}
$_->join for @threads;

sub doOperation {
    my ($data, $start, $end, $output) = @_;
    my $id = threads->tid;
    print "$id ";
    for my $i ($start .. $end) {
        print "Thread [$id] processing row $i\n";
        # THIS WHILE SHOULD BE MULTITHREADED
        while ( my $seq = $seqin->next_seq() ) {
            my $pseq = $seq->translate();
            $seqout->write_seq($pseq);
        }
        # THIS WHILE SHOULD BE MULTITHREADED
        sleep 1 if 0.2 > rand;
    }
    print "Thread done.\n";
    return;
}

my $time = localtime;
print "\n$time\n";
Threads are being created, but somehow the fasta file is not processed. The first code works fine without multithreading.
I'm afraid I'm not going to rewrite your code for you, but I can give you some pointers on how to accomplish threading.
The thing you need to understand about Perl threading is that its threads are not lightweight. You should spawn a number of threads equal to the desired parallelism, run them off a Thread::Queue, and go from there.
You also need to avoid any non-thread-safe modules - you can use them if you're careful but that usually means instantiating them within the thread with require and import instead of use at the start of the program.
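As a minimal sketch of that pattern, with POSIX standing in purely for whatever module you don't want loaded before the threads spawn:

use strict;
use warnings;
use threads;

sub worker {
    # Loaded at runtime inside the thread rather than with 'use' at
    # compile time, so each thread initialises its own copy of any
    # module-level state.
    require POSIX;
    POSIX->import('ceil');
    return POSIX::ceil( 2.5 * threads->tid );
}

my @threads = map { threads->create( \&worker ) } 1 .. 4;
print $_->join, "\n" for @threads;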
I would also suggest avoiding trying to do your output IO in parallel - return the thread results and coalesce them (sorting if necessary) in the 'main' thread (or spin off a single writer).
So I'd go with something like:
#!/usr/bin/env perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Storable qw ( freeze thaw );
my $NUM_THREADS = 16; #approx number of cores.
my $translate_q = Thread::Queue->new;
my $translate_results_q = Thread::Queue->new;
sub translate_thread {
    while ( my $item = $translate_q->dequeue ) {
        my $seq  = thaw $item;
        my $pseq = $seq->translate();
        $translate_results_q->enqueue( freeze $pseq );
    }
}
threads->create( \&translate_thread ) for 1 .. $NUM_THREADS;

while ( my $seq = $seqin->next_seq ) {
    $translate_q->enqueue( freeze($seq) );
}
$translate_q->end;
$_->join for threads->list;
$translate_results_q->end;

while ( my $result = $translate_results_q->dequeue ) {
    my $pseq = thaw($result);
    # write $pseq out here, in the main thread
}
Note - this won't work as is, because it's missing merging with the rest of your code. But hopefully it illustrates how the queue and threading can work together to get parallelism.
You pass around your objects using freeze and thaw from Storable, and use the parallelism to unpack them.
Don't go too mad on the number of threads - for primarily compute workloads (i.e. no IO), a number of threads equal to the number of cores is about right. If they'll be blocking on IO, you can increase this number, but going past about double isn't going to do very much.
You can't really parallelise disk IO efficiently - it just doesn't work like that. So do that in the 'main' thread.

Perl - Synchronizing Data Access

I'm starting to learn parallel programming, and I want to compare a single-threaded program with a multi-threaded one.
I need to write a very simple algorithm that finds as many prime numbers as possible within one minute, then shows the last prime calculated and its position in the sequence of primes.
For example, for the prime 23 it would show the number 23 and its position 9, because 23 is the 9th prime number.
Without using threads, the number of primes found was 233,596 and the last prime 3,246,107. But with threads, 229,972 primes were found and the last prime was 3,192,463.
I think something is wrong, because multithreading was supposed to produce a better result than a single thread. I believe it is a pretty basic error, but I cannot solve it as I still do not understand much about Perl's parallelism.
This is the code. It calculates prime numbers for one minute single-threaded, and then does the same calculation with four threads using shared variables.
use threads;
use threads::shared;

my $seconds = 60;

# WITHOUT THREAD #
print "\n\n Calc without Threads:\n";
my $endTime = time() + $seconds;
calcWithoutThread();

print "\n\n ----------------===========================---------------- \n";

# WITH THREAD #
print "\n\n Calc with Threads:\n";
my $prime :shared = 5;      # Starts from the 5th prime
my $totalPrime :shared = 2; # Starts with prime 2 and prime 3
my $lastPrime :shared = 0;
my $endTime1 = time() + $seconds;

my $thread1 = threads->create(\&calcWithThread);
my $thread2 = threads->create(\&calcWithThread);
my $thread3 = threads->create(\&calcWithThread);
my $thread4 = threads->create(\&calcWithThread);

$thread1->join();
$thread2->join();
$thread3->join();
$thread4->join();

print " Was found $totalPrime prime numbers. Last prime: $lastPrime.";

# SUB's #
sub calcWithoutThread {
    $prime = 5;      # Starts from the 5th prime
    $totalPrime = 2; # Starts with prime 2 and prime 3
    $lastPrime = 0;
    while (time() < $endTime) {
        if (calcPrime($prime)) {
            $totalPrime++;
            $lastPrime = $prime;
        }
        $prime++;
    }
    print " Was found $totalPrime prime numbers. Last prime: $lastPrime.";
}

sub calcWithThread {
    while (time() < $endTime1) {
        lock($prime);
        if (calcPrime($prime)) {
            $totalPrime++;
            $lastPrime = $prime;
        }
        $prime++;
    }
}

sub calcPrime {
    for ($i = 2; $i < sqrt($prime); $i++) {
        if ($prime % $i == 0) {
            return 0;
        }
    }
    return 1;
}
The intent is that the threads do this calculation synchronously, each checking whether a number is prime or not, without overlapping values while calculating.
The problem is that your threads are synchronised with one another by the locked variable $prime. That means they don't have a chance to run simultaneously, and they also suffer the overhead of switching threads and synchronising access to the variable.
Primes aren't a great test for parallel processing, as the calculation of each prime depends on the previous result. However, you could do it by keeping a separate $prime variable per thread, starting at 5, 6, 7, and 8 for the four threads and adding 4 between tests. That way they don't duplicate each other's work but together cover every integer.
There is immediately a problem with this in that none of the even numbers will ever be prime, so two of the four threads will never produce a result. It is still a valid test of parallelism, but clearly very inefficient. You could fix that by starting the threads with 5, 7, 9, and 11, and incrementing by 8 before each test. Then every thread will be profitable, as the sketch below shows.
Don't forget that you will have to code the same algorithm for the single-threaded version, otherwise the parallel section gets an unfair advantage.
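A minimal sketch of that odd-numbers-only split (the one-minute limit and the primality test are simplified for illustration):

use strict;
use warnings;
use threads;

my $end_time = time() + 60;

# Worker $n starts at 5 + 2*$n and steps by 8, so together the four
# workers cover every odd number >= 5 with no duplicated work
my @threads = map {
    my $candidate = 5 + 2 * $_;
    threads->create( sub {
        my @found;
        while ( time() < $end_time ) {
            push @found, $candidate if is_prime($candidate);
            $candidate += 8;
        }
        return \@found;
    } );
} 0 .. 3;

my @primes = ( 2, 3, sort { $a <=> $b } map { @{ $_->join } } @threads );
print scalar(@primes), " primes found. Last prime: $primes[-1]\n";

sub is_prime {
    my $n = shift;
    for ( my $i = 3; $i * $i <= $n; $i += 2 ) {
        return 0 if $n % $i == 0;
    }
    return 1;
}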
Update
Perhaps a better way would be to lock $prime only to fetch the next number to be tested and increment it. That way all the threads can do their calculations in parallel and only queue up to get another job.
You would then have to lock $total_prime to prevent two threads from incrementing it simultaneously, and also update $last_prime while that lock is in effect. Because parallel threads may well generate prime numbers out of sequence, you will also have to check whether the prime just found is greater than the latest one discovered by any thread.
Your subroutine would look like this. I apologise for changing the identifiers, but Perl traditionally uses snake_case and I find camelCase unpleasant to read. Snake case is also much easier for many people whose first language isn't English and who cannot pick out the capital letters so easily.
sub calc_with_thread {
    while ( time() < $end_time_1 ) {
        my $test = do {
            lock $prime;
            $prime++;
        };
        if ( calc_prime($test) ) {
            lock $total_prime;
            ++$total_prime;
            $last_prime = $test if $test > $last_prime;
        }
    }
}
Multi-threading is not an automatic speed boost.
Shared variables are tied, and tied variables are slow.
Here's a simpler, improved version that runs up to a maximum number rather than for a fixed time. It avoids locking. Threads are still significantly slower, because each of them has to do a get and a set on that sluggish shared variable for every number they check.
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(time);
use v5.10;
use threads;
use threads::shared;

my $Max_Number = 50_000;

{
    print "\n\nCalc without Threads:\n";
    my $start_time = time;
    calcWithoutThread();
    my $end_time = time;
    say "$Max_Number took @{[ $end_time - $start_time ]} seconds.";
}

my $Shared_Prime :shared = 5;

{
    print "\n\nCalc with Threads:\n";
    my $start_time = time;
    my $thread1 = threads->create(\&calcWithThread);
    my $thread2 = threads->create(\&calcWithThread);
    my $thread3 = threads->create(\&calcWithThread);
    my $thread4 = threads->create(\&calcWithThread);
    $thread1->join();
    $thread2->join();
    $thread3->join();
    $thread4->join();
    my $end_time = time;
    say "$Max_Number took @{[ $end_time - $start_time ]} seconds.";
}

sub calcWithoutThread {
    my $prime = 5; # Starts from the 5th prime
    while ( $prime <= $Max_Number ) {
        calcPrime($prime++);
    }
}

sub calcWithThread {
    while ( $Shared_Prime <= $Max_Number ) {
        calcPrime($Shared_Prime++);
    }
}

sub calcPrime {
    my $prime = shift;
    for ( my $i = 2; $i < sqrt($prime); $i++ ) {
        if ( $prime % $i == 0 ) {
            return 0;
        }
    }
    return 1;
}
Instead, avoid shared state or coordinating workers as much as possible. For example, you can slice up the work into pieces. If you have 4 workers, give them each a starting number and they each increment by 4. Then they can cover the whole space without shared state.
Worker 1: 5, 9, 13, ...
Worker 2: 6, 10, 14, ...
Worker 3: 7, 11, 15, ...
Worker 4: 8, 12, 16, ...
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(time);
use v5.10;
use threads;

my $Max_Number = 7_000;

{
    print "\n\nCalc without Threads:\n";
    my $start_time = time;
    calcWithoutThread(5, 1);
    my $end_time = time;
    say "$Max_Number took @{[ $end_time - $start_time ]} seconds.";
}

{
    print "\n\nCalc with Threads:\n";
    my $start_time = time;
    my $thread1 = threads->create( \&calcWithThread, 5, 4 );
    my $thread2 = threads->create( \&calcWithThread, 6, 4 );
    my $thread3 = threads->create( \&calcWithThread, 7, 4 );
    my $thread4 = threads->create( \&calcWithThread, 8, 4 );
    $thread1->join();
    $thread2->join();
    $thread3->join();
    $thread4->join();
    my $end_time = time;
    say "$Max_Number took @{[ $end_time - $start_time ]} seconds.";
}

sub calcWithoutThread {
    my ($start, $inc) = @_;
    my @primes;
    for ( my $prime = $start; $prime <= $Max_Number; $prime += $inc ) {
        push @primes, $prime if calcPrime($prime);
    }
    return \@primes;
}

sub calcWithThread {
    my ($start, $inc) = @_;
    my @primes;
    for ( my $prime = $start; $prime <= $Max_Number; $prime += $inc ) {
        push @primes, $prime if calcPrime($prime);
    }
    return \@primes;
}

sub calcPrime {
    my $prime = shift;
    for ( my $i = 2; $i < sqrt($prime); $i++ ) {
        if ( $prime % $i == 0 ) {
            return 0;
        }
    }
    return 1;
}
For me on my i7 MacBook (2 real cores, 4 virtual), threads break even at about 6000. To get the results, $thread->join will return the array reference of primes from calcWithThread.
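For example, collecting and merging the per-thread results might look like this sketch (it would replace the four bare join() calls in the threaded block; the sort restores numeric order across the four interleaved sequences):

my @all_primes;
for my $thr ( $thread1, $thread2, $thread3, $thread4 ) {
    push @all_primes, @{ $thr->join };
}
@all_primes = sort { $a <=> $b } @all_primes;
say scalar(@all_primes), " primes found. Last prime: $all_primes[-1]";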

Fastest way to search string with two mismatch in big files in perl?

I modified the code to work with two files. to_search.txt holds the strings to be searched for. big_file.fastq holds the lines to be searched: if a string is found (two mismatches allowed, exact length ranging from 8-10, no insertions or deletions), the record is written to the corresponding output file. Each string is searched for in the second line of every four-line record in big_file.fastq.
# to_search.txt: (length can be from 8-20 characters)
1 TCCCTTGT
2 ACGAGACT
3 GCTGTACG
4 ATCACCAG
5 TGGTCAAC
6 ATCGCACA
7 GTCGTGTA
8 AGCGGAGG
9 ATCCTTTG
10 TACAGCGC
# 2000 searches needed
# big_file.fastq: 2 billion lines (every 4 lines form one record; the search string is matched against the second line of each record).
# The second line can have 100-200 characters
@M04398:19:000000000-APDK3:1:1101:21860:1000 1:N:0:1
TCttTTGTGATCGATCGATCGATCGATCGGTCGTGTAGCCTCCAACCAAGCACCCCATCTGTTCCAAATCTTCTCCCACTGCTACTTGAAGACGCTGAAGTTGAAGGGCCACCTTCATCATTCTGG
+
#8ACCDGGGGGGGGGGGGGEGGGGGGGGDFFEGGG#FFGGGGGGGGGGGGGGGGGCF#<FFGGGGGFE9FGGGFEACE8EFFFGGGGGF9F?CECEFEG,CFGF,CF#CCC?BFFG<,9<9AEFG,,
@M04398:19:000000000-APDK3:1:1101:13382:1000 1:N:0:1
NTCGATCGATCGATCGATCGATCGATCGTTCTGAGAGGTACCAACCAAGCACACCACGGGCGACACAGACAGCTCCGTGTTGAACGGGTTGTTCTTCTTCTTGCCTTCATCATCCCCATCCTCAGTGGACGCAGCTTGCTCATCCTTCCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@M04398:19:000000000-APDK3:1:1101:18888:1000 1:N:0:1
NCAGAATGAGGAAGGATGAGCCCCGTCGTGTCGAAGCTATTGACACAGCGCTATTCCGTCTTTATGTTCACTTTAAGCGGTACAAGGAGCTGCTTGTTCTGATTCAGGAACCGAACCCTGGTGGTGTGCTTGGTTGGCAAGTTTACGGCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGE
Here is the code for two mismatches. I tried with exact matching and the speed is not bad: it takes around a day (estimated with the Time::Progress module). With two mismatches allowed it shows 115 days to finish. How can the speed be improved here?
#!/usr/bin/perl
use strict;
use warnings;
$| = 1;

open( IN_P1, "big_file.fastq" ) or die "File not found";
my ( @sample_file_names, @barcode1 );

open( BC_FILE, "to_search.txt" ) or die "No barcode file";
my @barcode_file_content = <BC_FILE>;
foreach (@barcode_file_content) {
    chomp $_;
    $_ =~ s/\r//;
    $_ =~ s/\n//;
    #print $_;
    my @elements = split( "(\t|,| )", $_ );
    push @sample_file_names, $elements[0];
    push @barcode1, $elements[2];
}

# open FH
my @fh_array_R1;
foreach (@sample_file_names) {
    chomp $_;
    local *OUT_R1;
    open( OUT_R1, ">", "$_\.fq" ) or die "cannot write file";
    push @fh_array_R1, *OUT_R1;
}

# unknown barcode file
open( UNKNOWN_R1, ">unknown-barcode_SE.fq" ) or die "cannot create unknown-r1 file";

while ( defined( my $firstp1 = <IN_P1> ) ) {
    my $p1_first_line  = $firstp1;
    my $p1_second_line = <IN_P1>;
    my $p1_third_line  = <IN_P1>;
    my $p1_fourth_line = <IN_P1>;
    chomp( $p1_first_line, $p1_second_line, $p1_third_line, $p1_fourth_line, );
    my $matched_R1 = "$p1_first_line\n$p1_second_line\n$p1_third_line\n$p1_fourth_line\n";
    for ( my $j = 0 ; $j < scalar @barcode1 ; $j++ ) {
        chomp $barcode1[$j];
        my $barcode1_regex = make_barcode_fragments( $barcode1[$j] );
        if ( $p1_second_line =~ /$barcode1_regex/i ) {
            # keep if matched
            print { $fh_array_R1[$j] } $matched_R1;
            last;
        }
        else {
            # print to unknown
            print UNKNOWN_R1 $matched_R1;
        }
    }
}

# make two-mismatch pattern of barcode
sub make_barcode_fragments {
    my ($in1) = @_;
    my @subpats;
    for my $i ( 0 .. length($in1) - 1 ) {
        for my $j ( $i + 1 .. length($in1) - 1 ) {
            my $subpat = join( '',
                substr( $in1, 0, $i ),
                '\\w', substr( $in1, $i + 1, $j - $i - 1 ),
                '\\w', substr( $in1, $j + 1 ),
            );
            push @subpats, $subpat;
        }
    }
    my $pat = join( '|', @subpats );
    #print $pat;
    return "$pat";
}
exit;
If your algorithm cannot be changed or improved in Perl itself, you can still get a speedup by writing the time-consuming parts in C. Here is an example using the Inline module with C:
use strict;
use warnings;
use Benchmark qw(timethese);
use Inline C => './check_line_c.c';

my $find   = "MATCH1";
my $search = "saasdadadadadasd";
my %sub_info = (
    c    => sub { check_line_c( $find, $search ) },
    perl => sub { check_line_perl( $find, $search ) },
);
timethese( 4_000_000, \%sub_info );

sub check_line_perl {
    my ( $find, $search ) = @_;
    my $max_distance = 2;
    for my $offset ( 0 .. length($search) - length($find) ) {
        my $substr = substr( $search, $offset, length($find) );
        my $hd = hd( $find, $substr );
        if ( $hd <= $max_distance ) {
            return ( $hd, $substr );
        }
    }
    return ( undef, undef );
}

# Hamming distance via bitwise XOR: count the positions where the strings differ
sub hd {
    return ( $_[0] ^ $_[1] ) =~ tr/\001-\377//;
}
where check_line_c.c is:
void check_line_c( char* find, char* search ) {
    int max_distance = 2;
    int flen = strlen(find);
    int last_ind = strlen(search) - flen;
    SV *dis   = &PL_sv_undef;
    SV *match = &PL_sv_undef;
    for ( int ind = 0; ind <= last_ind; ind++ )
    {
        int count = 0;
        for ( int j = 0; j < flen; j++ )
        {
            if ( find[j] ^ search[ind+j] ) count++;
        }
        if ( count <= max_distance )  /* match within two mismatches, as in the Perl version */
        {
            match = newSV(flen);
            sv_catpvn( match, search+ind, flen );
            dis = newSViv(count);
            break;
        }
    }
    Inline_Stack_Vars;
    Inline_Stack_Reset;
    Inline_Stack_Push(sv_2mortal(dis));
    Inline_Stack_Push(sv_2mortal(match));
    Inline_Stack_Done;
}
The output is (Ubuntu laptop with an Intel Core i7-4702MQ CPU @ 2.20GHz):
Benchmark: timing 4000000 iterations of c, perl...
         c:  2 wallclock secs ( 0.76 usr +  0.00 sys =  0.76 CPU) @ 5263157.89/s (n=4000000)
      perl: 19 wallclock secs (18.30 usr +  0.00 sys = 18.30 CPU) @ 218579.23/s (n=4000000)
So this gives a 24-fold speedup for this case.
I'd suggest creating a really bad hashing algorithm. Something nice and reversible and inefficient, like the sum of the characters. Or maybe the sum of unique values (1-4) represented by the characters.
Compute the target sums, and also compute the maximum allowed variance. That is, if the objective is a match with two substitutions, then what is the maximum possible difference? (4-1 + 4-1 = 6).
Then, for each "window" of text of the appropriate length in the target data file, compute a running score. (Add a character to the end, drop a character from the start, update the hash score.) If the score for a window is within the allowable range, you can do further investigation.
You might want to implement this as different passes. Possibly even as different stages in a shell pipeline or script. The idea being that you might be able to parallelize parts of the search. (For instance, all the match strings with the same length could be searched by one process, since the hash windows are the same.)
Also, of course, it is beneficial that you can keep your early work if your program crashes in the later stages. And you can even have the early parts of the process running while you are still developing the end stages.
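A minimal sketch of that running-score prefilter, for one barcode against one line (the alphabet values and the variance bound follow the reasoning above; the input strings are illustrative):

use strict;
use warnings;

my %value   = ( A => 1, C => 2, G => 3, T => 4 );
my $barcode = 'TCCCTTGT';
my $line    = 'AAATCCCTAGTAAA';   # contains the barcode with one substitution

my $k      = length $barcode;
my $target = 0;
$target += $value{$_} for split //, $barcode;

# Two substitutions can each shift the sum by at most 4 - 1 = 3
my $max_variance = 6;

my @chars = split //, $line;
my $score = 0;
$score += $value{ $chars[$_] } for 0 .. $k - 1;

for my $start ( 0 .. @chars - $k ) {
    if ( abs( $score - $target ) <= $max_variance ) {
        # Candidate window: verify with a real Hamming-distance check
        my $window = substr $line, $start, $k;
        my $hd = ( $barcode ^ $window ) =~ tr/\001-\377//;
        print "hit at $start: $window (hd=$hd)\n" if $hd <= 2;
    }
    last if $start + $k > $#chars;
    # Slide the window: add the next character's value, drop the first one's
    $score += $value{ $chars[ $start + $k ] } - $value{ $chars[$start] };
}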

Perl MCE return hash data to main process

I'm trying to run Perl's Many-Core Engine (MCE), which works fine. But when a worker adds data to a global hash in a subroutine, the data is lost once the MCE process completes (see the comment in the code below). I didn't think this would be a problem, as %data is global. Any advice is much appreciated.
#!/usr/bin/perl
use strict;
use warnings;
use MCE;

my @symbol = ('abc','def');
my $CPUs = 1;
my %data = ();

process();

sub process {
    my $mce = MCE->new(
        input_data  => \@symbol,
        max_workers => $CPUs,
        user_func   => sub {
            my ($self, $sym, $chunk_id) = @_;
            $sym = $sym->[0];
            doTask($sym);
        }
    );
    $mce->run();
    print %data; # if I check the contents of %data at this line, it's empty
}

sub doTask {
    my ($sym) = @_;
    $data{$sym} = 1;
    print %data;
    print "\n\n";
    return;
}
It works ok using the MCE::Map model, which has an inbuilt data gathering function.
#!/usr/bin/perl
use strict;
use warnings;
use MCE::Map;

my @symbol = ('abc','def');
my $CPUs = 1;
my %data = ();

process();

sub process {
    MCE::Map::init { chunk_size => 1, max_workers => $CPUs };
    %data = mce_map { doTask($_) } @symbol;
    print %data; # data exchange between the manager and worker processes works
}

sub doTask {
    my ($sym) = @_;
    $data{$sym} = 1;
    return %data;
}
Full data sharing capability supporting deep sharing will be included with the upcoming MCE 1.7 release.
use MCE;
use MCE::Shared;
mce_share my %data;
The hash is shared between workers supporting processes and threads.
Does doTask actually run? It looks like user_func is supposed to be part of user_tasks:
my $mce = MCE->new(
    input_data => \@symbol,
    user_tasks => [{
        max_workers => $CPUs,
        user_func   => sub {
            my ($self, $sym, $chunk_id) = @_;
            $sym = $sym->[0];
            doTask($sym);
        },
    }],
);
$mce->run();
To return or share a complex structure among all the threads, such as a hash of hashes of arrays, MCE's foreach and MCE::Loop with gather did not work in my tests, and neither did MCE::Map. The only way I have found that works is threads::shared: recursively add shared sublayers, as shown at http://www.perlmonks.org/?node_id=798735, and use shared_clone for the final assignment.
# The following lines cannot be omitted, otherwise you get the error "Invalid value for shared scalar"
unless ( exists $HoH{$F[0]} )
{
    my %p : shared;
    $HoH{$F[0]} = \%p;
}
# The following lines, if skipped, will randomly lose a few records
unless ( exists $HoH{$F[0]}{$mf} )
{
    my @c : shared;
    $HoH{$F[0]}{$mf} = \@c;
}
$HoH{$F[0]}{$mf} = shared_clone( [ $F[3], $F[4] ] );
# We cannot use $HoH{$mf} = shared_clone(\%Loss) directly, where %Loss is a hash of arrays
By the way, this method only works when run as a standalone script. When run from within the Perl debugger, which does not permit "use threads" before "use threads::shared", it will return empty values.

How to use Log4perl to rotate log files in multithread perl application

Below is the sample code I have tried for rotating a log file in a multithreaded Perl application using Log4perl. It works fine unless the application is multithreaded: with threads, the logs are not rotated and the log file grows in size. Can anyone guide me on where I am going wrong?
use strict;
use warnings;
use Log::Log4perl;
use POSIX;
use threads;
use threads::shared;

my @InputFiles;
my $InputDirectory = $ARGV[0];
my $LogName = $ARGV[1];
opendir(DIR, $InputDirectory) or die "could not open the input directory";
@InputFiles = readdir(DIR);
closedir(DIR);
my $file;

# logger configuration
my $log_conf = "
log4perl.rootLogger = DEBUG, LOG1
log4perl.appender.LOG1 = Log::Dispatch::FileRotate
log4perl.appender.LOG1.filename = $LogName
log4perl.appender.LOG1.mode = append
log4perl.appender.LOG1.autoflush = 1
log4perl.appender.LOG1.size = 10000
log4perl.appender.LOG1.max = 20
log4perl.appender.LOG1.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.LOG1.layout.ConversionPattern = \%d{yyyy-MM-dd HH:mm:ss}|\%P|\%m|\%n
";

# loading the configuration
Log::Log4perl::init(\$log_conf);
# creating the logger instance
my $logger = Log::Log4perl->get_logger();

my $thread_count = 5;
my $file_total = scalar @InputFiles;
#print STDERR "$file_total\n";

# dividing the total files among the given number of threads
my $div = $file_total / $thread_count;
$div = ceil($div);
my $start = 0;
my $end = $div;
my @threads;
for (my $count = 1; $count <= $thread_count; $count++)
{
    my $thread = threads->new(\&process, $start, $end);
    push(@threads, $thread);
    $start = $end;
    $end = $end + $div;
    if ($end > $file_total)
    {
        $end = $file_total;
    }
}
foreach (@threads)
{
    $_->join;
}

sub process
{
    my $lstart = shift;
    my $lend = shift;
    my $id = threads->tid();
    for (my $index = $lstart; $index < $lend; ++$index)
    {
        $logger->info($InputFiles[$index]);
    }
}
OK, pretty fundamentally your problem is this - your 'logger' is created before your threads start. This means that all your threads will have the same file handles.
This will inherently cause you problems, unless you've got some sort of arbitration mechanism for file IO. Think of your threads as separate programs, all trying to open and write to the same file - and you can see how messy it might get.
I would suggest instead that you create another thread for the logger and send the IO through something like Thread::Queue:
use Thread::Queue;
my $log_q = Thread::Queue->new();

sub logger_thread {
    # init the logger here, inside the thread, so only this thread owns the file handle
    while ( my $log_item = $log_q->dequeue() ) {
        $logger->info($log_item);
    }
}
my $log_thread = threads->create( \&logger_thread );
And then replace the $logger -> info (....) with:
$log_q -> enqueue($message_to_log);
Then, once you've joined all your 'process' threads (e.g. as you are right now) close off the logger thread:
$log_q -> end();
$log_thread -> join();
This will cause each of the threads to queue log messages, and once they're finished (and joined) you close the queue so the logger knows it's 'done' - and so will exit once the queue is empty and can be joined.
Multithreading file IO is messy, so it's better to avoid as much as possible.
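Put together, a minimal self-contained sketch of the whole pattern might look like this (the Log4perl config is trimmed from the question's; the filename and messages are illustrative):

#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Log::Log4perl;

my $log_q = Thread::Queue->new();

sub logger_thread {
    # Initialise Log4perl inside this thread, so only this thread
    # ever touches the rotating log file
    my $log_conf = "
log4perl.rootLogger = DEBUG, LOG1
log4perl.appender.LOG1 = Log::Dispatch::FileRotate
log4perl.appender.LOG1.filename = app.log
log4perl.appender.LOG1.mode = append
log4perl.appender.LOG1.size = 10000
log4perl.appender.LOG1.max = 20
log4perl.appender.LOG1.layout = Log::Log4perl::Layout::SimpleLayout
";
    Log::Log4perl::init(\$log_conf);
    my $logger = Log::Log4perl->get_logger();
    while ( defined( my $log_item = $log_q->dequeue() ) ) {
        $logger->info($log_item);
    }
}

my $log_thread = threads->create( \&logger_thread );

my @workers;
for my $id ( 1 .. 5 ) {
    push @workers, threads->create( sub {
        $log_q->enqueue("worker $id: item $_") for 1 .. 10;
    } );
}

$_->join for @workers;
$log_q->end();      # signals the logger that no more items are coming
$log_thread->join;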
