I am trying to get text of two large files. To speed it up i tried threads.
Before i used threads the script worked, now it does not.
The problem is: I save everything I read in the file into a hash.
When i print out the size (or keys/values) after the read-in in the sub (which the thread executed) it shows a correct number > 0, when i print out the size of the hash anywhere else (after the threads have run) it shows me 0.
print ": ".keys(%c);
is used 2 times, and has different output each time.
(In the final programm 2 Threads are running and a method to compare the stuff is called after the threads finished)
Example code:
my %c;
my #threads = initThreads();
#threads[0] = threads->create(\&ce);
foreach(#threads){
$_->join();
}
print ": ".keys(%c);
sub initThreads{
my #initThreads;
for(my $i = 0; $i<2;$i++){
push(#initThreads, $i);
}
return #initThreads;
}
sub ce(){
my $id = threads->tid();
open my $file, "<", #arg1[1] or die $!;
my #cXY;
my #cDa;
while(my $line = <$file>){
# some regex and push to arrays, works
#c{#cXY} = #cDa;
}
print "Thread $id is done\n";
close $file;
print ": ".keys(%c);
threads->exit();
}
Do i have to run the things after the first 2 threads finished in another thread which waits until the first two are finished?
Or what am i doing wrong with threads?
Thanks.
%c isn't shared across your threads.
use threads;
use threads::shared
my %c :shared;
See threads::shared.
In Perl, threads don't share memory. Each thread operates on a different copy of %c, so the changes aren't reflected to the parent thread. While sharing a variable across threads is possible, this is not generally advisable.
Make use of the possibility to return data from a thread. E.g
my %c = map %{ $_->join }, #threads; # flatten all returned hashes
sub ce {
my %hash;
...;
return \%hash;
}
Some other suggestions:
use strict; use warnings; if you aren't already.
use better variable names.
you only seem to be spawning one thread (in $threads[0]).
my #array; for (my $i = 0; $i < 2; $i++){ push(#array, $i) } is equivalent to my #array = 0 .. 1.
#arg1 is not declared in the current scope.
manually exiting a thread is not neccessary in your case.
Related
I am simply trying to find out how to properly use the open2 function.
See an example below. It works for a small $max, but naturally, if I write long enough to the $hIn, so eventually it will get blocked because nothing reads the data on the output continuously.
use 5.26.0;
use IPC::Open2;
my $max = 100000;
my $pid = open2(my $hOut, my $hIn, "cat") || die "failed 'cat': $!";
{
my $cnt = 0;
#When $max is big (e.g. 100000) so the code below will get blocked
#when writing to $hIn
while ($cnt<$max) {say $hIn $cnt++;}
close($hIn) || say "can't close hIn";
}
while(<$hOut>) { print; }
close($hOut) || say "can't close hOut";
waitpid( $pid, 0 );
The only solution, that I can think about, is launching an other thread that will do the writing on the background.
With the code below I can write into the $hIn as much data as I want and read them in the main thread but the $hIn seems not to get closed. Because of that the while(<$hOut>) will never finish while waiting for more output.
use 5.26.0;
use threads;
use IPC::Open2;
my $max = 100000;
my $pid = open2(my $hOut, my $hIn, "cat") || die "failed 'cat': $!";
my $thr = threads->create(sub {
my $cnt = 0;
while ($cnt<$max) {say $hIn $cnt++;}
#The close does not have any effect here (although no error is produced)
close($hIn) || say "can't close hIn";
});
#This outputs all the data written to $hIn but never leaves the loop...
while(<$hOut> ) { print; }
close($hOut) || say "can't close hOut";
$thr->join;
waitpid( $pid, 0 );
My questions are:
Provided that my approach with threads is ok, how can I close the file handle from the background thread?
If it is not ok (actually use threads is discouraged in Perl), so can someone provide a working example of open2 that can write and read a lot of data without a risk of getting blocked waiting for the reader or writer?
EDIT: Following your suggestions here is an implementation of the code above using IPC::Run:
use 5.26.0;
use IPC::Run qw/ run /;
my $max = 1000000;
run sub {
my $cnt = 0;
while ($cnt<$max) {say $cnt++;}
},
"|", "cat",
"|", sub {
while(<> ) {
print;
}
}
or die "run sub | cat | sub failed: $?";
It runs without flaws, the code is very readable... I am very happy to have learned about this module. Thanks to everyone!
Yet, I consider the question to be unanswered. If it is not possible to write this functionality using open2 directly, why does that even exist and confuse people? Also not being able to close the file handle from a different thread looks like a bug to me (certainly it is - the close should at least report an error).
Your program stopped because the pipe to which it was writing became full.
The pipe to cat became full because cat stopped reading from it.
cat stopped because the pipe to which it was writing became full.
The pipe from cat became full because you program isn't reading from it.
So you have two programs waiting for each other to do something. This is a deadlock.
The low-level solution is to use select to monitor both ends of the pipe.
The high-level solution is to let IPC::Run or IPC::Run3 do that hard work for you.
use IPC::Run qw( run );
my $cnt_max = 100000;
my $cnt = 0;
run [ "cat" ],
'<', sub { $cnt < $cnt_max ? $cnt++ . "\n" : undef };
I am trying to take a portion of an existing script and have it run multiple nmap scans simultaneously to increase the speed of the script.
I initially tried using fork, but it was suggested to me that I should be using threads instead as I am doing this on a Windows box. I modified a code snippet I found online and it partially works.
I am using a list of 23 IP addresses. I have been able to open 10 threads and scan the first 10 addresses, but then the code exits. Ideally, the code would open a new thread each time one exits so that there are always 10 threads running, until it reaches the remainder, in this case there would be three. Then only 3 threads would be open.
This entire code needs to be run inside a subroutine that I have in my original sequential code. I am using ping instead of the nmap command to test.
#!/usr/bin/Perl
use strict;
use threads;
my $i = 0;
my #lines;
# Define the number of threads
my $num_of_threads = 10;
# use the initThreads subroutine to create an array of threads.
my #threads = initThreads();
my #files = glob( "./ping.txt" ) or die "Can't open CMS HostInventory$!"; # Open the CMS Host Inventory csv files for parsing
foreach my $file ( #files ) {
open (CONFIG, '<', $file) or die "Can't ip360_DNS File$!";
#lines = <CONFIG>;
chomp (#lines);
}
# Loop through the array:
foreach ( #threads ) {
# Tell each thread to perform our 'doOperation()' subroutine.
$_ = threads->create(\&doOperation);
}
# This tells the main program to keep running until all threads have finished.
foreach ( #threads ) {
$_->join();
}
print "\nProgram Done!\nPress Enter to exit";
$a = <>;
####################### SUBROUTINES ############################
sub initThreads{
my #initThreads;
for ( my $i = 1; $i <= $num_of_threads; $i++ ) {
push(#initThreads, $i);
}
return #initThreads;
}
sub doOperation{
# Get the thread id. Allows each thread to be identified.
my $id = threads->tid();
my $ip = ($id - 1);
system("ping $lines[$ip] >> ./input/$lines[$ip].txt");
print "Thread $id done!\n";
# Exit the thread
threads->exit();
}
Let's say we have a script which open a file, then read it line by line and print the line to the terminal. We have a sigle thread and a multithread version.
The problem is than the resulting output of both scripts is almost the same, but not exactly. In the multithread versions there are about ten lines which missed the first 2 chars. I mean, if the real line is something line "Stackoverflow rocks", I obtain "ackoverflow rocks".
I think that this is related to some race condition since if I adjust the parameters to create a lot of little workers, I get more faults than If I use less and bigger workers.
The single thread is like this:
$file = "some/file.txt";
open (INPUT, $file) or die "Error: $!\n";
while ($line = <STDIN>) {
print $line;
}
The multithread version make's use of the thread queue and this implementation is based on the #ikegami approach:
use threads qw( async );
use Thread::Queue 3.01 qw( );
use constant NUM_WORKERS => 4;
use constant WORK_UNIT_SIZE => 100000;
sub worker {
my ($job) = #_;
for (#$job) {
print $_;
}
}
my $q = Thread::Queue->new();
async { while (defined( my $job = $q->dequeue() )) { worker($job); } }
for 1..NUM_WORKERS;
my $done = 0;
while (!$done) {
my #lines;
while (#lines < WORK_UNIT_SIZE) {
my $line = <>;
if (!defined($line)) {
$done = 1;
last;
}
push #lines, $line;
}
$q->enqueue(\#lines) if #lines;
}
$q->end();
$_->join for threads->list;
I tried your program and got similar (wrong) results. Instead of Thread::Semaphore I used lock from threads::shared around the print as it's simpler to use than T::S, i.e.:
use threads;
use threads::shared;
...
my $mtx : shared;
sub worker
{
my ($job) = #_;
for (#$job) {
lock($mtx); # (b)locks
print $_;
# autom. unlocked here
}
}
...
The global variable $mtx serves as a mutex. Its value doesn't matter, even undef (like here) is ok.
The call to lock blocks and returns only if no other threads currently holds the lock on that variable.
It automatically unlocks (and thus makes lock return) when it goes out of scope. In this sample that happens
after every single iteration of the for loop; there's no need for an extra {…} block.
Now we have syncronized the print calls…
But this didn't work either, because print does buffered I/O (well, only O). So I forced unbuffered output:
use threads;
use threads::shared;
...
my $mtx : shared;
$| = 1; # force unbuffered output
sub worker
{
# as above
}
...
and then it worked. To my surprise I could then remove the lock and it still worked. Perhaps by accident. Note that your script will run significantly slower without buffering.
My conclusion is: you're suffering from buffering.
Problem
I have created a simple perl script to read log files and process the data asynchronously.
The problem i am facing is that the script appears to continuously use more memory the longer it runs. This seems to be affected by the amount of data it processes. The problem I have is that i am unable to identify what exactly is using all this memory, and whether it is a leak or something is just holding onto it.
Question
How can i modify the below script so that it no longer continuously consumes memory ?
Code
#Multithreaded to read multiple log files at the same time.
use strict;
use warnings;
use threads;
use Thread::Queue;
use threads::shared;
my $logq = Thread::Queue->new();
my %Servers :shared;
my %servername :shared;
sub csvsplit {
my $line = shift;
my $sep = (shift or ',');
return () unless $line;
my #cells;
my $re = qr/(?:^|$sep)(?:"([^"]*)"|([^$sep]*))/;
while($line =~ /$re/g) {
my $value = defined $1 ? $1 : $2;
push #cells, (defined $value ? $value : '');
}
return #cells;
}
sub process_data
{
while(sleep(1)){
if ($logq->pending())
{
my %sites;
my %returns;
while($logq->pending() > 0){
my $data = $logq->dequeue();
my #fields = csvsplit($data);
$returns{$fields[$#fields - 1]}++;
$sites{$fields[$#fields]}++;
}
print "counter:$_, value=\"$sites{$_}\" />\n" for (keys%sites);
print "counter:$_, value=\"$returns{$_}\" />\n" for (keys%returns);
}
}
}
sub read_file
{
my $myFile=$_[0];
open(my $logfile,'<',$myFile) || die "error";
my $Inode=(stat($logfile))[1];
my $fileSize=(stat($logfile))[7];
seek $logfile, 0, 2;
for (;;) {
while (<$logfile>) {
chomp( $_ );
$logq->enqueue( $_ );
}
sleep 5;
if($Inode != (stat($myFile))[1] || (stat($myFile))[7] < $fileSize){
close($logfile);
while (! -e $myFile){
sleep 2;
}
open($logfile,'<',$myFile) || die "error";
$Inode=(stat($logfile))[1];
$fileSize=(stat($logfile))[7];
}
seek $logfile, 0, 1;
}
}
my $thr1 = threads->create(\&read_file,"log");
my $thr4 = threads->create(\&process_data);
$thr1->join();
$thr4->join();
Obeservations and relevant info
The memory only seems to increase when the program has data to process, if i just leave it, it maintains the current memory usage.
Memory only appears to increase for larger throughput and increase about half a Mb per 5 seconds for around 2000 lines in the same time.
I have not included the csv as i do not think it is relevant. If you do and want me to add it please give a valid reason.
Specs
GNU bash, version 3.2.57(1)-release (s390x-ibm-linux-gnu)
perl, v5.10.0
I have looked through other questions but cannot find much of relevance. If this is a duplicate or the relevant info is in another question, feel free to mark as a dupe and ill check it out.
Any more info needed just ask.
The reason is probably that the size of your Thread::Queue is unlimited. If the producer thread is faster than the consumer thread, your queue will continue to grow. So you should simply limit the size of your queue. For example, to set a limit of 1,000 queue items:
$logq->limit = 1000;
(The way you use the pending method is wrong by the way. You should only terminate if the return value is undefined.)
I am using Thread::Queue to push an array onto a queue and process each element of it using threads. Below is a simplified version of my program to demonstrate what is happening.
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
# Define queue
my $QUEUE :shared = new Thread::Queue();
# Define values
my #values = qw(string1 string2 string3);
# Enqueue values
$QUEUE->enqueue(#values);
# Get thread limit
my $QUEUE_SIZE = $QUEUE->pending();
my $thread_limit = $QUEUE_SIZE;
# Create threads
for my $i (1 .. $thread_limit) {
my $thread = threads->create(\&work);
}
# Join threads
my $i = 0;
for my $thread (threads->list()) {
$thread->join();
}
print "COMPLETE\n";
# Thread work function
sub work {
while (my $value = $QUEUE->dequeue()) {
print "VALUE: $value\n";
sleep(5);
print "Finished sleeping\n";
}
print "Got out of loop\n";
}
When I run this code I get the following output and then it just hangs forever:
VALUE: string1
VALUE: string2
VALUE: string3
Finished sleeping
Finished sleeping
Finished sleeping
Once the queue reaches its end, the while loop should break and the script should continue but it doesn't appear to ever get out of the loop.
Why is this getting stuck?
Since you never call $QUEUE->end(), your threads are blocking on dequeue() waiting for more entries to appear.
So, ensure you do call $QUEUE->end() after the last call to enqueue, or before joining the threads.