I have several very large tables in MySQL (millions of rows) that I need to load into my Perl script.
We then do some custom processing of the data and aggregate it into a hash. Unfortunately, that custom processing can't be implemented in MySQL.
Here's some quick pseudocode:
my @data;
for my $table_num (@table_numbers) {
    my $sth = $dbh->prepare(...);
    $sth->execute();
    $sth->bind_columns(\my ($a, $b, $c, ...));
    while ($sth->fetch()) {
        $data[$table_num]{black_box($a)}{secret_func($b)} += $c;
    }
}
my $x = $#data + 1;
for my $num (@table_numbers) {
    for my $a (keys %{ $data[$num] }) {
        for my $b (keys %{ $data[$num]{$a} }) {
            $data[$x]{$a}{$b} += $data[$num]{$a}{$b};
        }
    }
}
Now, the first loop can take several minutes per iteration to run, so I am thinking of ways to run the iterations in parallel. I have looked at using Perl threads before, but they seem to just run several Perl interpreters at once; my script is already using a lot of memory, and merging the data would seem to be problematic. Also, at this stage, the script is not using a lot of CPU.
I have been looking at possibly using Coro, but it seems like there would be a learning curve, plus a fairly complex integration with my current code. What I would like to know is whether I am likely to see any gains by going this route. Are there better ways of multithreading code like this? I cannot afford to use any more memory than my code already uses. Is there something else I can do here?
Unfortunately, doing the aggregation in MySQL is not an option, and rewriting the code in a different language would be too time consuming. I am aware that using arrays instead of hashes is likely to make my code faster and use less memory, but again that would require a major rewrite of a large script.
Edit: The above is pseudocode; the actual logic is a lot more complex. The bucketing is based on several DB tables and many more inputs than just $a and $b. Precomputing them is not practical, as there are trillions of possible combinations. The main goal is how to make the Perl script run faster, not how to fix the SQL part of things. That would require changes to how the data is stored and indexed on the actual server, which would affect a lot of other code. There are other people working on those optimizations. My current goal is to make the code faster without changing any SQL.
You could do it in MySQL simply by making black_box and secret_func tables (temporary tables, if necessary) prepopulated with the results for every existing value of the relevant columns.
Short of that, measure how much time is spent in the calls to black_box and secret_func vs. execute/fetch. If a lot of it is in the former, you could memoize the results:
my %black_box;
my %secret_func;
for my $table_num...
...
$data[$table_num]{ $black_box{$a} //= black_box($a) }{ $secret_func{$b} //= secret_func($b) } += $c;
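If you prefer not to touch the loop body at all, the core Memoize module can do the same caching transparently. A minimal sketch, assuming black_box() and secret_func() are pure functions (the same input always yields the same output):

use Memoize;

# Wrap the functions so repeated calls with the same argument hit a cache
memoize('black_box');
memoize('secret_func');

# The original loop then stays exactly as written:
# $data[$table_num]{black_box($a)}{secret_func($b)} += $c;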
If you have memory concerns, using forks instead of threads may help. They use much less memory than standard Perl threads. There is going to be some memory penalty for multi-threading, and YMMV as far as performance goes, but you might want to try something like:
use forks;
use Thread::Queue;

my $inQueue  = Thread::Queue->new;
my $outQueue = Thread::Queue->new;

$inQueue->enqueue(@table_numbers);

# create the worker threads
my $numThreads = 4;
for (1 .. $numThreads) {
    threads->create(\&doMagic);
}

# wait for the threads to finish
$_->join for threads->list;

# collect the data
my @data;
while (my $result = $outQueue->dequeue_nb) {
    # merge $result into @data
}

sub doMagic {
    while (my $table_num = $inQueue->dequeue_nb) {
        my @data;
        # your first loop goes here
        $outQueue->enqueue(\@data);
    }
    return;
}
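The "merge $result into @data" step could reuse the summing loop from the question. A minimal sketch, assuming each worker returns the same array-of-hashes layout as @data in the original code (and since each table number is handled by exactly one worker, the += here is effectively a copy):

while (my $result = $outQueue->dequeue_nb) {
    # $result is the \@data one worker built for its table numbers
    for my $table_num (0 .. $#$result) {
        next unless $result->[$table_num];
        for my $a (keys %{ $result->[$table_num] }) {
            for my $b (keys %{ $result->[$table_num]{$a} }) {
                $data[$table_num]{$a}{$b} += $result->[$table_num]{$a}{$b};
            }
        }
    }
}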
Following this question and other questions I asked, I got some suggestions.
tl;dr:
I'm trying to run a "foreach" loop asynchronously. Each iteration updates a few hashes independently. The problem is that the memory is kept within each thread and I don't know how to unite it all together.
I got a few suggestions, but I had problems with almost each of them:
When I tried threads/fork, there was a problem with shared memory: I needed to mark everything as shared, and you're only allowed to assign shared values to those hashes, which made a big mess... (If there's a way to share everything, even variables that are only defined later, that might be a solution.)
When I tried writing all the hashes to files (as JSON), all the blessing was gone and I needed to re-bless everything from the top, which is a big mess too...
Any ideas how I can do this more easily/faster?
In some problems a common data structure must indeed be shared between different threads.
When this isn't necessary things are greatly simplified by simply returning from each thread, when join-ing, a reference to a data structure built in the thread during the run. The main thread can then process those results (or merge them first if needed). Here is a simple demo
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);  # or use core Data::Dumper
use threads;

# Start threads. Like threads->create(...)
my @thr = map { async { proc_thr($_) } } 1..3;

# Wait for threads to complete. If they return, that happens here
my @res = map { $_->join } @thr;

# Process results (just print in this case)
dd $_ for @res;

sub proc_thr {
    my ($num) = @_;

    # A convoluted example, to return a complex data structure
    my %ds = map { 'k'.$_ => [ $_*10 .. $_*10 + 2 ] } 10*$num .. 10*$num + 2;

    return \%ds;
}
This prints
{ k10 => [100, 101, 102], k11 => [110, 111, 112], k12 => [120, 121, 122] }
{ k20 => [200, 201, 202], k21 => [210, 211, 212], k22 => [220, 221, 222] }
{ k30 => [300, 301, 302], k31 => [310, 311, 312], k32 => [320, 321, 322] }
Now manipulate these returned data structures as suitable; work with them as they stand or merge them. I can't discuss that because we aren't told what kind of data needs to be passed around. This roughly provides what was asked for, as far as I can tell.
Important notes
Lots of threads? Large data structures to merge? Then this may not be a good way.
The word "bless" was mentioned, tantalizingly. If what you'd pass around are objects then they need to be serialized for that, in such a way that the main thread can reconstruct the object.
Or, pass the object's data, either as a reference (as above) or by serializing it and passing the string; then the main thread can populate its own object from that data.
Returning (join-ing) an object itself (so a reference like above, an object being a reference) doesn't seem to fully "protect your rights": I find that at least some operator overloading is lost (even as all methods seem to work and the data is accessible and workable).
This is a whole other question, of passing objects around.†
If the work to be done is I/O-bound (lots of work with the filesystem) then this whole approach (with threads) needs to be carefully reconsidered. It may even slow things down.
Altogether -- we need more of a description, and way more detail.
† This has been addressed on Stackoverflow. A couple of directly related pages that readily come to mind for me are here and here.
In short, objects can be serialized and restored using for example Storable. Pure JSON cannot do objects, while extensions can. On the other hand, pure JSON is excellent for serializing data in an object, which can then be used on the other end to populate an identical object of that class.
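As a rough illustration of the Storable route (a sketch only; the Counter class and its fields are made up for the example, not from the question):

use strict;
use warnings;
use Storable qw(freeze thaw);

# A made-up class, purely for illustration
package Counter;
sub new   { my ($class, %args) = @_; return bless { %args }, $class }
sub total { return $_[0]{total} }

package main;

my $obj = Counter->new(total => 42);

# In a worker: serialize the blessed object to a byte string
my $frozen = freeze($obj);

# In the main thread: rebuild it; the result is again a blessed Counter
my $copy = thaw($frozen);
print $copy->total, "\n";   # prints 42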
I am using an Amazon EC2 server to perform data processing of 63 files.
The server I am using has 16 cores, but when using Perl's Parallel::ForkManager with the number of processes equal to the number of cores, it seems like half the cores are sleeping and the working cores are not at 100%, fluctuating around 25%~50%.
I also checked IO and it is mostly idle.
use Sys::Info;
use Sys::Info::Constants qw( :device_cpu );
my $info = Sys::Info->new;
my $cpu  = $info->device( CPU => %options );

use Parallel::ForkManager;
my $manager = new Parallel::ForkManager($cpu->count);
for ($i = 0; $i <= $#files_l; $i++)
{
    $manager->start and next;
    do_stuff($files_l[$i]);
    $manager->finish;
}
$manager->wait_all_children;
The short answer is - we can't tell you, because it depends entirely on what 'do_stuff' is doing.
The major reasons why parallel code doesn't create linear speed increases are:
Process creation overhead - some 'work' is done to spawn a process, so if the children are trivially small, that 'wastes' effort.
Contended resources - the most common is disk IO, but things like file locks, database handles, sockets, or interprocess communication can also play a part.
Something else causing a 'back off' that stalls a process.
And without knowing what 'do_stuff' does, we can't second guess what it might be.
However, I'll suggest a couple of steps:
Double the number of processes to twice the CPU count. That's often a 'sweet spot', because it means that any non-CPU delay in a process just means one of the others gets to go full speed (there's a sketch of this at the end of this answer).
Try strace -fTt <yourprogram> (if you're on Linux; the commands are slightly different on other Unix variants). Then do it again with strace -fTtc, because the c will summarise syscall run times. Look at which ones take the most time.
Profile your code to see where the hot spots are. Devel::NYTProf is one library you can use for this.
And on a couple of minor points:
my $manager=new Parallel::ForkManager($cpu->count);
Would be better off written:
my $manager = Parallel::ForkManager->new( $cpu->count );
Rather than using indirect object notation.
If you are just iterating over @files then it might be better not to use a loop counter variable and instead write:
foreach my $file (@files) {
    $manager->start and next;
    do_stuff($file);
    $manager->finish;
}
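Putting the "twice the CPU count" suggestion together with the cleaner loop, a rough sketch (treat the 2x factor as a starting point to experiment with; $cpu, @files, and do_stuff are taken from the code above):

use Parallel::ForkManager;

# Twice the core count is often a reasonable starting point when the
# children spend part of their time waiting on IO rather than on the CPU.
my $manager = Parallel::ForkManager->new( 2 * $cpu->count );

foreach my $file (@files) {
    $manager->start and next;   # parent: spawn a child and move on
    do_stuff($file);            # child: process one file
    $manager->finish;           # child: exit
}
$manager->wait_all_children;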
In my Perl script I'm collecting a large amount of data that I later need to post to a server. I'm fine up to that point, but the post to the server takes quite a long time, so I need a threading/forking approach, so that one part posts while, in parallel, I can dig out my second data set while the post to the server is taking place.
Code snippet:
if (system("curl -sS $post_url --data-binary \@$filename -H 'Content-type:text/xml;charset=utf-8' 1>/dev/null") != 0)
{
    exit_script(" xml: Error ", "Unable to update $filename xml on $post_url");
}
Can anyone tell me whether this is achievable with threading or forking?
It's difficult to give an answer to your question, because it depends.
Yes, Perl supports both forking and threading.
In general, I would suggest looking at threading for data-oriented tasks, and forking for almost anything else.
And so what you want to do is eminently achievable.
First you need to:
Encapsulate your tasks into subroutines. Get that working first. (This is very important - parallel stuff causes worlds of pain and is difficult to troubleshoot if you're not careful - get it working single threaded first).
Run your subroutines as threads, and capture their results.
Something like this:
use threads;

sub curl_update {
    my $result = system("your_curl_command");
    return $result;
}

# start the async curl
my $thr = threads->create(\&curl_update);

# do your other stuff....
sleep(60);

my $result = $thr->join();
if ($result) {
    # do whatever you would if the curl update failed
}
In this, the join is a blocking call: your main code will stop and wait for your thread to complete. If you want to do something more complicated, you can use is_running or is_joinable, which are non-blocking.
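For example, a non-blocking check could look roughly like this (a sketch; do_more_work() is a hypothetical placeholder for whatever data gathering you do between polls):

# Poll instead of blocking on join()
until ($thr->is_joinable) {
    do_more_work();   # hypothetical: keep digging the next data set
    sleep 1;
}
my $result = $thr->join();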
I'd suggest neither.
You're just talking lots of HTTP. You can do concurrent HTTP a lot more nicely, because it's just network IO, by using any of the asynchronous IO systems. Perl has many of them.
Principally I'd suggest IO::Async, but then I wrote it. You can use Net::Async::HTTP to make an HTTP hit. This will fully support doing many of them at once - many hundreds or thousands if need be.
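A rough sketch of what that could look like with Net::Async::HTTP (treat this as an outline rather than tested code, and check the module's docs for the exact request options; $post_url is from the question, and $xml_body stands in for the file contents that curl was reading from $filename):

use IO::Async::Loop;
use Net::Async::HTTP;
use HTTP::Request;

my $loop = IO::Async::Loop->new;
my $http = Net::Async::HTTP->new;
$loop->add($http);

# Build the same POST the curl command was doing
my $req = HTTP::Request->new(POST => $post_url);
$req->header('Content-Type' => 'text/xml; charset=utf-8');
$req->content($xml_body);

my $future = $http->do_request(request => $req);

# ... carry on collecting the next data set here ...

# Only block when you actually need the outcome
my $response = $future->get;
die "POST failed: " . $response->status_line
    unless $response->is_success;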
Otherwise, you can also try either POE or AnyEvent, which will both support the same thing in their own way.
I am using Perl on a Linux box and my memory usage is going up and up; I believe it's because previously run threads have not been killed/joined.
I think I need to somehow signal the threads that have finished to terminate, and then detach them so that they get cleaned up automatically, giving me back memory...
I have tried return(); with $thr_List->join(); and $thr_List->detach(); but my GUI doesn't show for ages with join, and the memory problem still seems to be there with detach... Any help?
$mw->repeat(100, sub {   # shared var handler/pivoting-point !?
    while (defined(my $command = $q->dequeue_nb())) {   #???
        # to update a statusbar's text
        $text->delete('0.0', "end");
        $text->insert('0.0', $command);
        $indicatorbar->value($val);   # to update a ProgressBar's value to $val
        $mw->update();
        for (@threadids) {   # @threadids is a shared array containing thread ids
            # that is to say, I have got the info I wanted from a thread and pushed its id into @threadids
            print "I want to now kill or join the thread with id: $_\n";
            #$thrWithId->detach();
            #$thrWithId->join();
            # then delete that id from the array
            # delete $threadids[elWithThatIdInIt];
            # as this seems to be in a repeat(100, sub... too, there are problems??!
            # locks maybe?!?
            # for (@threadids) if it's not empty?!?
        }
    }   # end of while
});   # end of sub

# Some worker... that works with the above handler/pivot, me thinks #???
async {
    for (;;) {
        sleep(0.1);
        $q->enqueue($StatusLabel);
    }
}->detach();
I have uploaded my full code here (http://cid-99cdb89630050fff.office.live.com/browse.aspx/.Public) if needed; it's in the Boxy.zip...
First, sorry for replying here, but I've lost my cookie that allows editing etc...
Thanks very much gangabass, that looks like great info. I will have to spend some time on it, but at least it looks like others are asking the same questions... I was worried I was making a total mess of things.
Thanks guys...
So it sounds like you got the join working, but it was very slow?
Threading in Perl is not lightweight. Creating and joining threads takes significant memory and time.
If your task allows, it is much better to keep threads running and give them additional work rather than ending them and starting new threads later. Thread::Queue can help with this (see the sketch below). That said, unless you are on Windows, there is not a lot of point in doing that instead of forking and using Parallel::ForkManager.
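A minimal sketch of the "keep the workers alive and feed them work" pattern with Thread::Queue (the @jobs list and process_item() are placeholders, not from your code):

use threads;
use Thread::Queue;

my $work = Thread::Queue->new;

# Start a small pool of long-lived workers
my @workers = map {
    threads->create(sub {
        # dequeue blocks until an item arrives; a dequeued undef means "shut down"
        while (defined(my $item = $work->dequeue)) {
            process_item($item);   # placeholder for your real work
        }
    });
} 1 .. 4;

# Hand work to the pool whenever it comes up
$work->enqueue($_) for @jobs;

# When finished, tell each worker to exit, then join them
$work->enqueue(undef) for @workers;
$_->join for @workers;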
You need to use only one worker thread and update the GUI from it. The other threads just process data from the queue, and there is no need to terminate them.
See this example for more info
I'm very curious about what multithreading is. I have heard the name batted around here and there in answers I have received on Stack Overflow, but I have no idea what it is, so my two main questions are: what is it, and how can I benefit from it?
EDIT:
Ok, since the first question didn't really get the response I was looking for, I'll go with this...
I have never heard of 'threading', even in other languages. This is an example I found on the internet:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

print "Starting main program\n";

my @threads;
for (my $count = 1; $count <= 10; $count++) {
    my $t = threads->new(\&sub1, $count);
    push(@threads, $t);
}
foreach (@threads) {
    my $num = $_->join;
    print "done with $num\n";
}
print "End of main program\n";

sub sub1 {
    my $num = shift;
    print "started thread $num\n";
    sleep $num;
    print "done with thread $num\n";
    return $num;
}
I can't seem to understand what it is doing. Could anyone shed some light?
Regards,
Phil
Threading is a way of having more than one thing happen at the same time, at least conceptually speaking (on a single-core single-CPU computer, perhaps with an ARM or Atom, there's only one thread of execution at a time).
The Perl example launches ten different sections of code simultaneously. Each chunk of code only has to take care of what it is doing, and doesn't have to worry about anything else. This means you can get the output of the Perl program with one simple subroutine called ten times in a reasonably simple fashion.
One use is in programs that interact with the user. It's typically necessary to have a responsive interface along with things happening behind the scenes. It's hard to do this in one lump of code, so often the interface will be in one thread and the behind-the-scenes stuff in other threads, so that the interface can be snappy and the background tasks running.
If you're familiar with running multiple processes, threads are very similar, although they're more closely connected. This means it's easier to set up and communicate between threads, but it's also easy to mess up by not allowing for all possible orders of execution.
It is like forking, but lighter weight, and it is easier to share data between threads than between processes. The downside is that, because of the sharing of data, it is easier to write buggy code (thread A modifies something thread B thought should stay the same; thread A gets a lock on resource C and tries to get a lock on resource D, but thread B has a lock on resource D and is trying to get a lock on resource C; and so on).
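To make that last deadlock scenario concrete, here is a small illustrative sketch using threads::shared (the resource names are invented for the example; if you run it, it will typically hang at the join):

use threads;
use threads::shared;

# Two shared "resources", as in the description above
my $resource_c :shared = 0;
my $resource_d :shared = 0;

my $thread_a = threads->create(sub {
    lock($resource_c);    # A holds C...
    sleep 1;
    lock($resource_d);    # ...and now waits for D
});

my $thread_b = threads->create(sub {
    lock($resource_d);    # B holds D...
    sleep 1;
    lock($resource_c);    # ...and now waits for C: deadlock
});

$_->join for ($thread_a, $thread_b);   # never returns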