Missing characters while reading input with threads - multithreading

Let's say we have a script which open a file, then read it line by line and print the line to the terminal. We have a sigle thread and a multithread version.
The problem is than the resulting output of both scripts is almost the same, but not exactly. In the multithread versions there are about ten lines which missed the first 2 chars. I mean, if the real line is something line "Stackoverflow rocks", I obtain "ackoverflow rocks".
I think that this is related to some race condition since if I adjust the parameters to create a lot of little workers, I get more faults than If I use less and bigger workers.
The single thread is like this:
$file = "some/file.txt";
open (INPUT, $file) or die "Error: $!\n";
while ($line = <STDIN>) {
print $line;
}
The multithread version make's use of the thread queue and this implementation is based on the #ikegami approach:
use threads qw( async );
use Thread::Queue 3.01 qw( );
use constant NUM_WORKERS => 4;
use constant WORK_UNIT_SIZE => 100000;
sub worker {
my ($job) = #_;
for (#$job) {
print $_;
}
}
my $q = Thread::Queue->new();
async { while (defined( my $job = $q->dequeue() )) { worker($job); } }
for 1..NUM_WORKERS;
my $done = 0;
while (!$done) {
my #lines;
while (#lines < WORK_UNIT_SIZE) {
my $line = <>;
if (!defined($line)) {
$done = 1;
last;
}
push #lines, $line;
}
$q->enqueue(\#lines) if #lines;
}
$q->end();
$_->join for threads->list;

I tried your program and got similar (wrong) results. Instead of Thread::Semaphore I used lock from threads::shared around the print as it's simpler to use than T::S, i.e.:
use threads;
use threads::shared;
...
my $mtx : shared;
sub worker
{
my ($job) = #_;
for (#$job) {
lock($mtx); # (b)locks
print $_;
# autom. unlocked here
}
}
...
The global variable $mtx serves as a mutex. Its value doesn't matter, even undef (like here) is ok.
The call to lock blocks and returns only if no other threads currently holds the lock on that variable.
It automatically unlocks (and thus makes lock return) when it goes out of scope. In this sample that happens
after every single iteration of the for loop; there's no need for an extra {…} block.
Now we have syncronized the print calls…
But this didn't work either, because print does buffered I/O (well, only O). So I forced unbuffered output:
use threads;
use threads::shared;
...
my $mtx : shared;
$| = 1; # force unbuffered output
sub worker
{
# as above
}
...
and then it worked. To my surprise I could then remove the lock and it still worked. Perhaps by accident. Note that your script will run significantly slower without buffering.
My conclusion is: you're suffering from buffering.

Related

Threaded code exits before all tasks are complete

I am trying to take a portion of an existing script and have it run multiple nmap scans simultaneously to increase the speed of the script.
I initially tried using fork, but it was suggested to me that I should be using threads instead as I am doing this on a Windows box. I modified a code snippet I found online and it partially works.
I am using a list of 23 IP addresses. I have been able to open 10 threads and scan the first 10 addresses, but then the code exits. Ideally, the code would open a new thread each time one exits so that there are always 10 threads running, until it reaches the remainder, in this case there would be three. Then only 3 threads would be open.
This entire code needs to be run inside a subroutine that I have in my original sequential code. I am using ping instead of the nmap command to test.
#!/usr/bin/Perl
use strict;
use threads;
my $i = 0;
my #lines;
# Define the number of threads
my $num_of_threads = 10;
# use the initThreads subroutine to create an array of threads.
my #threads = initThreads();
my #files = glob( "./ping.txt" ) or die "Can't open CMS HostInventory$!"; # Open the CMS Host Inventory csv files for parsing
foreach my $file ( #files ) {
open (CONFIG, '<', $file) or die "Can't ip360_DNS File$!";
#lines = <CONFIG>;
chomp (#lines);
}
# Loop through the array:
foreach ( #threads ) {
# Tell each thread to perform our 'doOperation()' subroutine.
$_ = threads->create(\&doOperation);
}
# This tells the main program to keep running until all threads have finished.
foreach ( #threads ) {
$_->join();
}
print "\nProgram Done!\nPress Enter to exit";
$a = <>;
####################### SUBROUTINES ############################
sub initThreads{
my #initThreads;
for ( my $i = 1; $i <= $num_of_threads; $i++ ) {
push(#initThreads, $i);
}
return #initThreads;
}
sub doOperation{
# Get the thread id. Allows each thread to be identified.
my $id = threads->tid();
my $ip = ($id - 1);
system("ping $lines[$ip] >> ./input/$lines[$ip].txt");
print "Thread $id done!\n";
# Exit the thread
threads->exit();
}

What is causing memory to continuously rise perl?

Problem
I have created a simple perl script to read log files and process the data asynchronously.
The problem i am facing is that the script appears to continuously use more memory the longer it runs. This seems to be affected by the amount of data it processes. The problem I have is that i am unable to identify what exactly is using all this memory, and whether it is a leak or something is just holding onto it.
Question
How can i modify the below script so that it no longer continuously consumes memory ?
Code
#Multithreaded to read multiple log files at the same time.
use strict;
use warnings;
use threads;
use Thread::Queue;
use threads::shared;
my $logq = Thread::Queue->new();
my %Servers :shared;
my %servername :shared;
sub csvsplit {
my $line = shift;
my $sep = (shift or ',');
return () unless $line;
my #cells;
my $re = qr/(?:^|$sep)(?:"([^"]*)"|([^$sep]*))/;
while($line =~ /$re/g) {
my $value = defined $1 ? $1 : $2;
push #cells, (defined $value ? $value : '');
}
return #cells;
}
sub process_data
{
while(sleep(1)){
if ($logq->pending())
{
my %sites;
my %returns;
while($logq->pending() > 0){
my $data = $logq->dequeue();
my #fields = csvsplit($data);
$returns{$fields[$#fields - 1]}++;
$sites{$fields[$#fields]}++;
}
print "counter:$_, value=\"$sites{$_}\" />\n" for (keys%sites);
print "counter:$_, value=\"$returns{$_}\" />\n" for (keys%returns);
}
}
}
sub read_file
{
my $myFile=$_[0];
open(my $logfile,'<',$myFile) || die "error";
my $Inode=(stat($logfile))[1];
my $fileSize=(stat($logfile))[7];
seek $logfile, 0, 2;
for (;;) {
while (<$logfile>) {
chomp( $_ );
$logq->enqueue( $_ );
}
sleep 5;
if($Inode != (stat($myFile))[1] || (stat($myFile))[7] < $fileSize){
close($logfile);
while (! -e $myFile){
sleep 2;
}
open($logfile,'<',$myFile) || die "error";
$Inode=(stat($logfile))[1];
$fileSize=(stat($logfile))[7];
}
seek $logfile, 0, 1;
}
}
my $thr1 = threads->create(\&read_file,"log");
my $thr4 = threads->create(\&process_data);
$thr1->join();
$thr4->join();
Obeservations and relevant info
The memory only seems to increase when the program has data to process, if i just leave it, it maintains the current memory usage.
Memory only appears to increase for larger throughput and increase about half a Mb per 5 seconds for around 2000 lines in the same time.
I have not included the csv as i do not think it is relevant. If you do and want me to add it please give a valid reason.
Specs
GNU bash, version 3.2.57(1)-release (s390x-ibm-linux-gnu)
perl, v5.10.0
I have looked through other questions but cannot find much of relevance. If this is a duplicate or the relevant info is in another question, feel free to mark as a dupe and ill check it out.
Any more info needed just ask.
The reason is probably that the size of your Thread::Queue is unlimited. If the producer thread is faster than the consumer thread, your queue will continue to grow. So you should simply limit the size of your queue. For example, to set a limit of 1,000 queue items:
$logq->limit = 1000;
(The way you use the pending method is wrong by the way. You should only terminate if the return value is undefined.)

Why are my Perl threads stuck in an infinite loop? (Thread::Queue)

I am using Thread::Queue to push an array onto a queue and process each element of it using threads. Below is a simplified version of my program to demonstrate what is happening.
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
# Define queue
my $QUEUE :shared = new Thread::Queue();
# Define values
my #values = qw(string1 string2 string3);
# Enqueue values
$QUEUE->enqueue(#values);
# Get thread limit
my $QUEUE_SIZE = $QUEUE->pending();
my $thread_limit = $QUEUE_SIZE;
# Create threads
for my $i (1 .. $thread_limit) {
my $thread = threads->create(\&work);
}
# Join threads
my $i = 0;
for my $thread (threads->list()) {
$thread->join();
}
print "COMPLETE\n";
# Thread work function
sub work {
while (my $value = $QUEUE->dequeue()) {
print "VALUE: $value\n";
sleep(5);
print "Finished sleeping\n";
}
print "Got out of loop\n";
}
When I run this code I get the following output and then it just hangs forever:
VALUE: string1
VALUE: string2
VALUE: string3
Finished sleeping
Finished sleeping
Finished sleeping
Once the queue reaches its end, the while loop should break and the script should continue but it doesn't appear to ever get out of the loop.
Why is this getting stuck?
Since you never call $QUEUE->end(), your threads are blocking on dequeue() waiting for more entries to appear.
So, ensure you do call $QUEUE->end() after the last call to enqueue, or before joining the threads.

csv format issue using multithreading in perl

I'm running a perl script consisting of 30 threads to run a subroutine. For each thread, I'm supplying 100 data. In the subroutine, after the code does what its supposed to, I'm storing the output in a csv file. However, I find that on execution, the csv file has some data overlapped. For example, in the csv file, I'm storing name, age, gender, country this way-
print OUTPUT $name.",".$age.",".$gender.",".$country.",4\n";
The csv file should have outputs as such-
Randy,35,M,USA,4
Tina,76,F,UK,4
etc.
However, in the csv file, I see that some columns has overlapped or has been entered haphazardly in this way-
Randy,35,M,USA,4
TinaMike,76,UK
23,F,4
Is it because some threads are executing at the same time? What could I do to avoid this? I'm using the print statement only after I'm getting the data. Any suggestions?
4 is the group id which will remain constant.
Below is the code snippet:
#!/usr/bin/perl
use DBI;
use strict;
use warnings;
use threads;
use threads::shared;
my $host = "1.1.1.1";
my $database = "somedb";
my $user = "someuser";
my $pw = "somepwd";
my #threads;
open(PUT,">/tmp/file1.csv") || die "can not open file";
open(OUTPUT,">/tmp/file2.csv") || die "can not open file";
my $dbh = DBI->connect("DBI:mysql:$database;host=$host", $user, $pw ,) || die "Could not connect to database: $DBI::errstr";
$dbh->{'mysql_auto_reconnect'} = 1;
my $sql = qq{
//some sql to get a primary keys
};
my $sth = $dbh->prepare($sql);
$sth->execute();
while(my #request = $sth->fetchrow_array())
{
#get other columns and print to file1.csv
print PUT $net.",".$sub.",4\n";
$i++; #this has been declared before
}
for ( my $count = 1; $count <= 30; $count++) {
my $t = threads->new(\&sub1, $count);
push(#threads,$t);
}
foreach (#threads) {
my $num = $_->join;
print "done with $num\n";
}
sub sub1 {
my $num = shift;
//calculated start_num and end_num based on an internal logic
for(my $x=$start_num; $x<=$end_num; $x++){
print OUTPUT $name.",".$age.",".$gender.",".$country.",4\n";
$j++; #this has been declared before
}
sleep(1);
return $num;
}
I have problem in the file2 which has the OUTPUT handler
You are multithreading and printing to a file from multiple threads. This will always end badly - print is not an 'atomic' operation, so different prints can interrupt each other.
What you need to do is serialize your output such that this cannot happen. The simplest way is to use a lock or a semaphore:
my $print_lock : shared;
{
lock $print_lock;
print OUTPUT $stuff,"\n";
}
when the 'lock' drifts out of scope, it'll be released.
Alternatively, have a separate thread that 'does' file IO, and use Thread::Queue to feed lines to it. Depends somewhat on whether you need any ordering/processing of the contents of 'OUTPUT'.
Something like:
use Thread::Queue;
my $output_q = Thread::Queue -> new();
sub output_thread {
open ( my $output_fh, ">", "output_filename.csv" ) or die $!;
while ( my $output_line = $output_q -> dequeue() ) {
print {$output_fh} $output_line,"\n";
}
close ( $output_fh );
sub doing_stuff_thread {
$output_q -> enqueue ( "something to output" ); #\n added by sub!
}
my $output_thread = threads -> create ( \&output_thread );
my $doing_stuff_thread = threads -> create ( \&doing_stuff_thread );
#wait for doing_stuff to finish - closing the queue will cause output_thread to flush/exit.
$doing_stuff_thread -> join();
$output_q -> end;
$output_thread -> join();
Open the File handle globally, then try using flock on the file handle as demonstrated:
sub log_write {
my $line = shift;
flock(OUTPUT, LOCK_EX) or die "can't lock: $!";
seek(OUTPUT, 0, SEEK_END) or die "can't fast forward: $!";
print OUTPUT $line;
flock(OUTPUT, LOCK_UN) or die "can't unlock: $!";
}
Other example:
perlfaq5 - I still don't get locking. I just want to increment the number in the file. How can I do this?

Perl: write value in thread

I am trying to get text of two large files. To speed it up i tried threads.
Before i used threads the script worked, now it does not.
The problem is: I save everything I read in the file into a hash.
When i print out the size (or keys/values) after the read-in in the sub (which the thread executed) it shows a correct number > 0, when i print out the size of the hash anywhere else (after the threads have run) it shows me 0.
print ": ".keys(%c);
is used 2 times, and has different output each time.
(In the final programm 2 Threads are running and a method to compare the stuff is called after the threads finished)
Example code:
my %c;
my #threads = initThreads();
#threads[0] = threads->create(\&ce);
foreach(#threads){
$_->join();
}
print ": ".keys(%c);
sub initThreads{
my #initThreads;
for(my $i = 0; $i<2;$i++){
push(#initThreads, $i);
}
return #initThreads;
}
sub ce(){
my $id = threads->tid();
open my $file, "<", #arg1[1] or die $!;
my #cXY;
my #cDa;
while(my $line = <$file>){
# some regex and push to arrays, works
#c{#cXY} = #cDa;
}
print "Thread $id is done\n";
close $file;
print ": ".keys(%c);
threads->exit();
}
Do i have to run the things after the first 2 threads finished in another thread which waits until the first two are finished?
Or what am i doing wrong with threads?
Thanks.
%c isn't shared across your threads.
use threads;
use threads::shared
my %c :shared;
See threads::shared.
In Perl, threads don't share memory. Each thread operates on a different copy of %c, so the changes aren't reflected to the parent thread. While sharing a variable across threads is possible, this is not generally advisable.
Make use of the possibility to return data from a thread. E.g
my %c = map %{ $_->join }, #threads; # flatten all returned hashes
sub ce {
my %hash;
...;
return \%hash;
}
Some other suggestions:
use strict; use warnings; if you aren't already.
use better variable names.
you only seem to be spawning one thread (in $threads[0]).
my #array; for (my $i = 0; $i < 2; $i++){ push(#array, $i) } is equivalent to my #array = 0 .. 1.
#arg1 is not declared in the current scope.
manually exiting a thread is not neccessary in your case.

Resources