Problem
I have created a simple perl script to read log files and process the data asynchronously.
The problem i am facing is that the script appears to continuously use more memory the longer it runs. This seems to be affected by the amount of data it processes. The problem I have is that i am unable to identify what exactly is using all this memory, and whether it is a leak or something is just holding onto it.
Question
How can i modify the below script so that it no longer continuously consumes memory ?
Code
#Multithreaded to read multiple log files at the same time.
use strict;
use warnings;
use threads;
use Thread::Queue;
use threads::shared;
my $logq = Thread::Queue->new();
my %Servers :shared;
my %servername :shared;
sub csvsplit {
my $line = shift;
my $sep = (shift or ',');
return () unless $line;
my #cells;
my $re = qr/(?:^|$sep)(?:"([^"]*)"|([^$sep]*))/;
while($line =~ /$re/g) {
my $value = defined $1 ? $1 : $2;
push #cells, (defined $value ? $value : '');
}
return #cells;
}
sub process_data
{
while(sleep(1)){
if ($logq->pending())
{
my %sites;
my %returns;
while($logq->pending() > 0){
my $data = $logq->dequeue();
my #fields = csvsplit($data);
$returns{$fields[$#fields - 1]}++;
$sites{$fields[$#fields]}++;
}
print "counter:$_, value=\"$sites{$_}\" />\n" for (keys%sites);
print "counter:$_, value=\"$returns{$_}\" />\n" for (keys%returns);
}
}
}
sub read_file
{
my $myFile=$_[0];
open(my $logfile,'<',$myFile) || die "error";
my $Inode=(stat($logfile))[1];
my $fileSize=(stat($logfile))[7];
seek $logfile, 0, 2;
for (;;) {
while (<$logfile>) {
chomp( $_ );
$logq->enqueue( $_ );
}
sleep 5;
if($Inode != (stat($myFile))[1] || (stat($myFile))[7] < $fileSize){
close($logfile);
while (! -e $myFile){
sleep 2;
}
open($logfile,'<',$myFile) || die "error";
$Inode=(stat($logfile))[1];
$fileSize=(stat($logfile))[7];
}
seek $logfile, 0, 1;
}
}
my $thr1 = threads->create(\&read_file,"log");
my $thr4 = threads->create(\&process_data);
$thr1->join();
$thr4->join();
Obeservations and relevant info
The memory only seems to increase when the program has data to process, if i just leave it, it maintains the current memory usage.
Memory only appears to increase for larger throughput and increase about half a Mb per 5 seconds for around 2000 lines in the same time.
I have not included the csv as i do not think it is relevant. If you do and want me to add it please give a valid reason.
Specs
GNU bash, version 3.2.57(1)-release (s390x-ibm-linux-gnu)
perl, v5.10.0
I have looked through other questions but cannot find much of relevance. If this is a duplicate or the relevant info is in another question, feel free to mark as a dupe and ill check it out.
Any more info needed just ask.
The reason is probably that the size of your Thread::Queue is unlimited. If the producer thread is faster than the consumer thread, your queue will continue to grow. So you should simply limit the size of your queue. For example, to set a limit of 1,000 queue items:
$logq->limit = 1000;
(The way you use the pending method is wrong by the way. You should only terminate if the return value is undefined.)
Related
I am simply trying to find out how to properly use the open2 function.
See an example below. It works for a small $max, but naturally, if I write long enough to the $hIn, so eventually it will get blocked because nothing reads the data on the output continuously.
use 5.26.0;
use IPC::Open2;
my $max = 100000;
my $pid = open2(my $hOut, my $hIn, "cat") || die "failed 'cat': $!";
{
my $cnt = 0;
#When $max is big (e.g. 100000) so the code below will get blocked
#when writing to $hIn
while ($cnt<$max) {say $hIn $cnt++;}
close($hIn) || say "can't close hIn";
}
while(<$hOut>) { print; }
close($hOut) || say "can't close hOut";
waitpid( $pid, 0 );
The only solution, that I can think about, is launching an other thread that will do the writing on the background.
With the code below I can write into the $hIn as much data as I want and read them in the main thread but the $hIn seems not to get closed. Because of that the while(<$hOut>) will never finish while waiting for more output.
use 5.26.0;
use threads;
use IPC::Open2;
my $max = 100000;
my $pid = open2(my $hOut, my $hIn, "cat") || die "failed 'cat': $!";
my $thr = threads->create(sub {
my $cnt = 0;
while ($cnt<$max) {say $hIn $cnt++;}
#The close does not have any effect here (although no error is produced)
close($hIn) || say "can't close hIn";
});
#This outputs all the data written to $hIn but never leaves the loop...
while(<$hOut> ) { print; }
close($hOut) || say "can't close hOut";
$thr->join;
waitpid( $pid, 0 );
My questions are:
Provided that my approach with threads is ok, how can I close the file handle from the background thread?
If it is not ok (actually use threads is discouraged in Perl), so can someone provide a working example of open2 that can write and read a lot of data without a risk of getting blocked waiting for the reader or writer?
EDIT: Following your suggestions here is an implementation of the code above using IPC::Run:
use 5.26.0;
use IPC::Run qw/ run /;
my $max = 1000000;
run sub {
my $cnt = 0;
while ($cnt<$max) {say $cnt++;}
},
"|", "cat",
"|", sub {
while(<> ) {
print;
}
}
or die "run sub | cat | sub failed: $?";
It runs without flaws, the code is very readable... I am very happy to have learned about this module. Thanks to everyone!
Yet, I consider the question to be unanswered. If it is not possible to write this functionality using open2 directly, why does that even exist and confuse people? Also not being able to close the file handle from a different thread looks like a bug to me (certainly it is - the close should at least report an error).
Your program stopped because the pipe to which it was writing became full.
The pipe to cat became full because cat stopped reading from it.
cat stopped because the pipe to which it was writing became full.
The pipe from cat became full because you program isn't reading from it.
So you have two programs waiting for each other to do something. This is a deadlock.
The low-level solution is to use select to monitor both ends of the pipe.
The high-level solution is to let IPC::Run or IPC::Run3 do that hard work for you.
use IPC::Run qw( run );
my $cnt_max = 100000;
my $cnt = 0;
run [ "cat" ],
'<', sub { $cnt < $cnt_max ? $cnt++ . "\n" : undef };
I have a directory that contains other directories (the number of directories is arbitrary), like this:
Main_directory_samples/
subdirectory_sample_1/
subdirectory_sample_2/
subdirectory_sample_3/
subdirectory_sample_4/
I have a script that receives as input one directory each time and it takes 1h to run (for each directory). To run the script I have the following code:
opendir DIR, $maindirectory or die "Can't open directory!!";
while(my $dir = readdir DIR){
if($dir ne '.' && $dir ne '..'){
system("/bin/bash", "my_script.sh", $maindirectory.'/'.$dir);
}
}
closedir DIR;
However, I want to run the script for different directories at the same time. For instance, the subdirectory_sample_1/ and subdirectory_sample_2/ would run in the same thread; subdirectory_sample_3/ and subdirectory_sample_4/ in another. But I just can't find a way to do this.
As you're just starting external processes and waiting for them, a non-threading option:
use strict;
use warnings;
use Path::Tiny;
use IO::Async::Loop;
use Future::Utils 'fmap_concat';
my $loop = IO::Async::Loop->new;
my $maindirectory = '/foo/bar';
my #subdirs = grep { -d } path($maindirectory)->children; # excludes . and ..
# runs this code to maintain up to 'concurrent' pending futures at once
my $main_future = fmap_concat {
my $dir = shift;
my $future = $loop->new_future;
my $process = $loop->open_process(
command => ['/bin/bash', 'my_script.sh', $dir],
on_finish => sub { $future->done(#_) },
on_exception => sub { $future->fail(#_) },
);
return $future;
} foreach => \#subdirs, concurrent => 2;
# run event loop until all futures are done or one fails, throw exception on failure
my #exit_codes = $main_future->get;
See the docs for IO::Async::Loop and Future::Utils.
One way is to fork and in each child process a group of directories.
A basic example
use warnings;
use strict;
use feature 'say';
use List::MoreUtils qw(natatime);
use POSIX qw(:sys_wait_h); # for WNOHANG
use Time::HiRes qw(sleep); # for fractional seconds
my #all_dirs = qw(d1 d2 d3 d4);
my $path = 'maindir';
my #procs;
# Get iterator over groups (of 2)
my $it = natatime 2, #all_dirs;
while (my #dirs = $it->()) {
my $pid = fork // do { #/
warn "Can't fork for #dirs: $!";
next;
};
if ($pid == 0) {
foreach my $dir (#dirs) {
my #cmd = ('/bin/bash/', 'my_script.sh', "$path/$dir");
say "in $$, \#cmd: (#cmd)";
# system(#cmd) == 0 or do { inspect $? }
};
exit;
};
push #procs, $pid;
}
# Poll with non-blocking wait for processes (reap them)
my $gone;
while (($gone = waitpid -1, WNOHANG) > -1) {
my $status = $?;
say "Process $gone exited with $status" if $gone > 0;
sleep 0.1;
}
See system and/or exec for details, in particular on error checking, as well as $? variable. It can be unpacked to retrieve more details about the error; or, at least print a warning and skip to the next item (which happens above anyway).
The code above prints out the command and pid's with their exit status, but replace #cmd with a test command of no consequence and un-comment the system line to try this out.
Watch for how many jobs there are. A basic rule of thumb is to not have more than 2 per core at which point the performance starts suffering, but this depends on many details. Experiment to find the sweet spot for your case. I like to have a job per core and then at least one core free. In order to throttle this see modules linked at the end.
To break all jobs (directories) into groups I used natatime from List::MoreUtils (n-at-a-time). If there are more specific criteria about how to group directories adjust that.
See Forks::Super and Parallel::ForkManager for higher-level ways to work with forked processes.
Let's say we have a script which open a file, then read it line by line and print the line to the terminal. We have a sigle thread and a multithread version.
The problem is than the resulting output of both scripts is almost the same, but not exactly. In the multithread versions there are about ten lines which missed the first 2 chars. I mean, if the real line is something line "Stackoverflow rocks", I obtain "ackoverflow rocks".
I think that this is related to some race condition since if I adjust the parameters to create a lot of little workers, I get more faults than If I use less and bigger workers.
The single thread is like this:
$file = "some/file.txt";
open (INPUT, $file) or die "Error: $!\n";
while ($line = <STDIN>) {
print $line;
}
The multithread version make's use of the thread queue and this implementation is based on the #ikegami approach:
use threads qw( async );
use Thread::Queue 3.01 qw( );
use constant NUM_WORKERS => 4;
use constant WORK_UNIT_SIZE => 100000;
sub worker {
my ($job) = #_;
for (#$job) {
print $_;
}
}
my $q = Thread::Queue->new();
async { while (defined( my $job = $q->dequeue() )) { worker($job); } }
for 1..NUM_WORKERS;
my $done = 0;
while (!$done) {
my #lines;
while (#lines < WORK_UNIT_SIZE) {
my $line = <>;
if (!defined($line)) {
$done = 1;
last;
}
push #lines, $line;
}
$q->enqueue(\#lines) if #lines;
}
$q->end();
$_->join for threads->list;
I tried your program and got similar (wrong) results. Instead of Thread::Semaphore I used lock from threads::shared around the print as it's simpler to use than T::S, i.e.:
use threads;
use threads::shared;
...
my $mtx : shared;
sub worker
{
my ($job) = #_;
for (#$job) {
lock($mtx); # (b)locks
print $_;
# autom. unlocked here
}
}
...
The global variable $mtx serves as a mutex. Its value doesn't matter, even undef (like here) is ok.
The call to lock blocks and returns only if no other threads currently holds the lock on that variable.
It automatically unlocks (and thus makes lock return) when it goes out of scope. In this sample that happens
after every single iteration of the for loop; there's no need for an extra {…} block.
Now we have syncronized the print calls…
But this didn't work either, because print does buffered I/O (well, only O). So I forced unbuffered output:
use threads;
use threads::shared;
...
my $mtx : shared;
$| = 1; # force unbuffered output
sub worker
{
# as above
}
...
and then it worked. To my surprise I could then remove the lock and it still worked. Perhaps by accident. Note that your script will run significantly slower without buffering.
My conclusion is: you're suffering from buffering.
I am trying to get text of two large files. To speed it up i tried threads.
Before i used threads the script worked, now it does not.
The problem is: I save everything I read in the file into a hash.
When i print out the size (or keys/values) after the read-in in the sub (which the thread executed) it shows a correct number > 0, when i print out the size of the hash anywhere else (after the threads have run) it shows me 0.
print ": ".keys(%c);
is used 2 times, and has different output each time.
(In the final programm 2 Threads are running and a method to compare the stuff is called after the threads finished)
Example code:
my %c;
my #threads = initThreads();
#threads[0] = threads->create(\&ce);
foreach(#threads){
$_->join();
}
print ": ".keys(%c);
sub initThreads{
my #initThreads;
for(my $i = 0; $i<2;$i++){
push(#initThreads, $i);
}
return #initThreads;
}
sub ce(){
my $id = threads->tid();
open my $file, "<", #arg1[1] or die $!;
my #cXY;
my #cDa;
while(my $line = <$file>){
# some regex and push to arrays, works
#c{#cXY} = #cDa;
}
print "Thread $id is done\n";
close $file;
print ": ".keys(%c);
threads->exit();
}
Do i have to run the things after the first 2 threads finished in another thread which waits until the first two are finished?
Or what am i doing wrong with threads?
Thanks.
%c isn't shared across your threads.
use threads;
use threads::shared
my %c :shared;
See threads::shared.
In Perl, threads don't share memory. Each thread operates on a different copy of %c, so the changes aren't reflected to the parent thread. While sharing a variable across threads is possible, this is not generally advisable.
Make use of the possibility to return data from a thread. E.g
my %c = map %{ $_->join }, #threads; # flatten all returned hashes
sub ce {
my %hash;
...;
return \%hash;
}
Some other suggestions:
use strict; use warnings; if you aren't already.
use better variable names.
you only seem to be spawning one thread (in $threads[0]).
my #array; for (my $i = 0; $i < 2; $i++){ push(#array, $i) } is equivalent to my #array = 0 .. 1.
#arg1 is not declared in the current scope.
manually exiting a thread is not neccessary in your case.
I have a text file mapping of two integers, separated by commas:
123,456
789,555
...
It's 120Megs... so it's a very long file.
I keep to search for the first column and return the second, e.g., look up 789 --returns--> 555 and I need to do it FAST, using regular Linux built-ins.
I'm doing this right now and it takes several seconds per look-up.
If I had a database I could index it. I guess I need an indexed text file!
Here is what I'm doing now:
my $lineFound=`awk -F, '/$COLUMN1/ { print $2 }' ../MyBigMappingFile.csv`;
Is there any easy way to pull this off with a performance improvement?
The hash suggestions are the natural way an experienced Perler would do this, but it may be suboptimal in this case. It scans the entire file and builds a large, flat datastructure in linear time. Cruder methods can short circuit with a worst case linear time, usually less in practice.
I first made a big mapping file:
my $LEN = shift;
for (1 .. $LEN) {
my $rnd = int rand( 999 );
print "$_,$rnd\n";
}
With $LEN passed on the command line as 10000000, the file came out to 113MB. Then I benchmarked three implemntations. The first is the hash lookup method. The second slurps the file and scans it with a regex. The third reads line-by-line and stops when it matches. Complete implementation:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw{timethese};
my $FILE = shift;
my $COUNT = 100;
my $ENTRY = 40;
slurp(); # Initial file slurp, to get it into the hard drive cache
timethese( $COUNT, {
'hash' => sub { hash_lookup( $ENTRY ) },
'scalar' => sub { scalar_lookup( $ENTRY ) },
'linebyline' => sub { line_lookup( $ENTRY ) },
});
sub slurp
{
open( my $fh, '<', $FILE ) or die "Can't open $FILE: $!\n";
undef $/;
my $s = <$fh>;
close $fh;
return $s;
}
sub hash_lookup
{
my ($entry) = #_;
my %data;
open( my $fh, '<', $FILE ) or die "Can't open $FILE: $!\n";
while( <$fh> ) {
my ($name, $val) = split /,/;
$data{$name} = $val;
}
close $fh;
return $data{$entry};
}
sub scalar_lookup
{
my ($entry) = #_;
my $data = slurp();
my ($val) = $data =~ /\A $entry , (\d+) \z/x;
return $val;
}
sub line_lookup
{
my ($entry) = #_;
my $found;
open( my $fh, '<', $FILE ) or die "Can't open $FILE: $!\n";
while( <$fh> ) {
my ($name, $val) = split /,/;
if( $name == $entry ) {
$found = $val;
last;
}
}
close $fh;
return $found;
}
Results on my system:
Benchmark: timing 100 iterations of hash, linebyline, scalar...
hash: 47 wallclock secs (18.86 usr + 27.88 sys = 46.74 CPU) # 2.14/s (n=100)
linebyline: 47 wallclock secs (18.86 usr + 27.80 sys = 46.66 CPU) # 2.14/s (n=100)
scalar: 42 wallclock secs (16.80 usr + 24.37 sys = 41.17 CPU) # 2.43/s (n=100)
(Note I'm running this off an SSD, so I/O is very fast, and perhaps makes that initial slurp() unnecessary. YMMV.)
Interestingly, the hash implementation is just as fast as linebyline, which isn't what I expected. By using slurping, scalar may end up being faster on a traditional hard drive.
However, by far the fastest is a simple call to grep:
$ time grep '^40,' int_map.txt
40,795
real 0m0.508s
user 0m0.374s
sys 0m0.046
Perl could easily read that output and split apart the comma in hardly any time at all.
Edit: Never mind about grep. I misread the numbers.
120 meg isn't that big. Assuming you've got at least 512MB of ram, you could easily read the whole file into a hash and then do all of your lookups against that.
use:
sed -n "/^$COLUMN1/{s/.*,//p;q}" file
This optimizes your code in three ways:
1) No needless splitting each line in two on ",".
2) You stop processing the file after the first hit.
3) sed is faster than awk.
This should more than half your search time.
HTH Chris
It all depends on how often the data change and how often in the course of a single script invocation you need to look up.
If there are many lookups during each script invocation, I would recommend parsing the file into a hash (or array if the range of keys is narrow enough).
If the file changes every day, creating a new SQLite database might or might not be worth your time.
If each script invocation needs to look up just one key, and if the data file changes often, you might get an improvement by slurping the entire file into a scalar (minimizing memory overhead, and do a pattern match on that (instead of parsing each line).
#!/usr/bin/env perl
use warnings; use strict;
die "Need key\n" unless #ARGV;
my $lookup_file = 'lookup.txt';
my ($key) = #ARGV;
my $re = qr/^$key,([0-9]+)$/m;
open my $input, '<', $lookup_file
or die "Cannot open '$lookup_file': $!";
my $buffer = do { local $/; <$input> };
close $input;
if (my ($val) = ($buffer =~ $re)) {
print "$key => $val\n";
}
else {
print "$key not found\n";
}
On my old slow laptop, with a key towards the end of the file:
C:\Temp> dir lookup.txt
...
2011/10/14 10:05 AM 135,436,073 lookup.txt
C:\Temp> tail lookup.txt
4522701,5840
5439981,16075
7367284,649
8417130,14090
438297,20820
3567548,23410
2014461,10795
9640262,21171
5345399,31041
C:\Temp> timethis lookup.pl 5345399
5345399 => 31041
TimeThis : Elapsed Time : 00:00:03.343
This example loads the file into a hash (which takes about 20s for 120M on my system). Subsequent lookups are then nearly instantaneous. This assumes that each number in the left column is unique. If that's not the case then you would need to push numbers on the right with the same number on the left onto an array or something.
use strict;
use warnings;
my ($csv) = #ARGV;
my $start=time;
open(my $fh, $csv) or die("$csv: $!");
$|=1;
print("loading $csv... ");
my %numHash;
my $p=0;
while(<$fh>) { $p+=length; my($k,$v)=split(/,/); $numHash{$k}=$v }
print("\nprocessed $p bytes in ",time()-$start, " seconds\n");
while(1) { print("\nEnter number: "); chomp(my $i=<STDIN>); print($numHash{$i}) }
Example usage and output:
$ ./lookup.pl MyBigMappingFile.csv
loading MyBigMappingFile.csv...
processed 125829128 bytes in 19 seconds
Enter number: 123
322
Enter number: 456
93
Enter number:
does it help if you cp the file to your /dev/shm, and using /awk/sed/perl/grep/ack/whatever query a mapping?
don't tell me you are working on a 128MB ram machine. :)