Perl threads to execute a Sybase stored procedure in parallel - multithreading

I have written a Sybase stored procedure to move data for a given ID from certain tables (~50) on the primary DB to the archive DB. Since archiving is taking a very long time, I am considering executing the same stored procedure in parallel, with a unique input ID for each call.
I manually ran the stored proc twice at the same time with different inputs and it seems to work. Now I want to use Perl threads (maximum 4 threads) and have each thread execute the same procedure with a different input.
Please advise whether this is a recommended way, or whether there is a more efficient way to achieve this. If the experts' choice is threads, any pointers or examples would be helpful.

What you do in Perl does not really matter here: what matters is what happens on the side of the Sybase server. Assuming each client task creates its own connection to the database, it's all fine, and how the client achieves this makes no difference to the Sybase server. But do not use a model where the different client tasks try to share the same client-server connection, as that will never run in parallel.
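A minimal sketch of that model in Perl, using fork so every worker gets its own connection. The ID list and archive_one() body are placeholders; with DBI/DBD::Sybase installed, the child would open its own $dbh and run the proc there, as shown in the comment:

```perl
# Sketch: one forked worker per ID so each gets its OWN server connection.
# archive_one() is a placeholder; with DBI + DBD::Sybase it would be e.g.
#   my $dbh = DBI->connect('dbi:Sybase:server=PRIMARY', $user, $pass);
#   $dbh->do('exec archive_proc ?', undef, $id);
use strict;
use warnings;

my @ids = (101, 102, 103, 104);      # one archive ID per worker (max 4)

sub archive_one {
    my ($id) = @_;
    print "archiving id $id\n";      # placeholder for the real proc call
}

my @pids;
for my $id (@ids) {
    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) {                 # child: open its own connection here
        archive_one($id);
        exit 0;
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;            # block until all four workers are done
```

Because each child is a separate process, each DBI connect happens independently, which is exactly the "own connection per client task" model described above.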

No 'answer' per se, but some questions/comments:
Can you quantify "taking a very long time to archive"? Assuming your archive process consists of a mix of insert/select and delete operations, do query plans and MDA data show fast, efficient operations? If you're seeing table scans, sort merges, deferred inserts/deletes, etc., then it may be worth the effort to address those performance issues first.
Can you expand on the comment that running two stored proc invocations at the same time seems to work? Again, any sign of performance issues for the individual proc calls? Any sign of contention (eg, blocking) between the two proc calls? If the archival proc isn't designed properly for parallel/concurrent operations (eg, eliminate blocking), then you may not be gaining much by running multiple procs in parallel.
How many engines does your dataserver have, and are you planning on running your archive process during a period of moderate-to-heavy user activity? If the current archive process runs at/near 100% cpu utilization on a single dataserver engine, then spawning 4 copies of the same process could tie up 4 dataserver engines with heavy cpu utilization ... and if your dataserver doesn't have many engines ... combined with moderate-to-heavy user activity at the same time ... you could end up invoking the wrath of your DBA(s) and users. Net result: you may need to make sure your archive process does not hog the dataserver.
One other item to consider, and this may require input from the DBAs: if you're replicating out of either database (source or archive), increasing the volume of transactions in a given time period could have a negative effect on replication throughput (ie, an increase in replication latency). If replication latency needs to be kept at a minimum, you may want to rethink your entire archive process from the point of view of spreading out transactional activity enough that it does not affect replication latency (eg, a single-threaded archive process that does a few insert/select/delete operations, sleeps a bit, then does another batch, then sleeps, ...).
It's been my experience that archive processes are not considered high-priority operations (assuming they're run on a regular basis, and before the source db fills up); this in turn means the archive process is usually designed so that it's efficient while at the same time putting a (relatively) light load on the dataserver (think: running as a trickle in the background) ... ymmv ...

Related

Threads: worth it for this situation?

I have never used threads before, but think I may have encountered an opportunity:
I have written a script that chews through an array of ~500 Excel files, and uses Spreadsheet::ParseExcel to pull values from specific sheets in the workbook (on average, two sheets per workbook; one cell extracted per sheet).
Running it now, where I just go through the array of files one by one and extract the relevant info from the file, it takes about 45 minutes to complete.
My question is: is this an opportunity to use threads, and have more than one file get hit at a time*, or should I maybe just accept the 45 minute run time?
(* - if this is a gross misunderstanding of what I can do with threads, please say so!)
Thanks in advance for any guidance you can offer!
Edit - adding example code. The code below is a sub that is called in a foreach loop for each file location stored in an array:
# Init the parser
my $parser = Spreadsheet::ParseExcel->new;
my $workbook = $parser->parse($inputFile)
    or die "Unable to load $inputFile: " . $parser->error;

# Get a list of any sheets that have 'QA' in the sheet name
my @sheetsToScan;
foreach my $sheet ($workbook->worksheets) {
    if ($sheet->get_name =~ m/QA/) {
        push @sheetsToScan, $sheet->get_name;
    }
}
shift @sheetsToScan;    # skip the first QA sheet

# Extract the value from the appropriate cell
foreach (@sheetsToScan) {
    my $worksheet = $workbook->worksheet($_);
    # Production sheets keep the value in a different cell
    my $cell = ($_ =~ m/Production/ or $_ =~ m/Prod/)
        ? $worksheet->get_cell(1, 1)
        : $worksheet->get_cell(6, 1);
    my $value = $cell ? $cell->value : undef;
    $value = "Not found." if not defined $value;
    push @outputBuffer, $line;    # $line is assembled elsewhere in the sub
}
Threads (or multiple processes using fork) allow your script to utilize more than one CPU at a time. For many tasks this can save a lot of "user time" but will not save "system time" (and may even increase system time due to the overhead of starting and managing threads and processes). Here are the situations where threading/multiprocessing will not be helpful:
the task of your script does not lend itself to parallelization -- when each step of your algorithm depends on the previous steps
the task your script performs is fast and lightweight compared to the overhead of creating and managing a new thread or new process
your system only has one CPU or your script is only enabled to use one CPU
your task is constrained by a different resource than CPU, such as disk access, network bandwidth, or memory -- if your task involves processing large files that you download through a slow network connection, then your network is the bottleneck, and processing the file on multiple CPUs will not help. Likewise, if your task consumes 70% of your system's memory, then using a second and third thread will require paging to your swap space and will not save any time. Parallelization will also be less effective if your threads compete for some synchronized resource -- file locks, database access, etc.
you need to be considerate of other users on your system -- if you are using all the cores on a machine, then other users will have a poor experience
[added, threads only] your code uses any package that is not thread-safe. Most pure Perl code will be thread-safe, but packages that use XS may not be
[added] when you are still actively developing your core task. Debugging is a lot harder in parallel code
Even if none of these apply, it is sometimes hard to tell how much a task will benefit from parallelization, and the only way to be sure is to actually implement the parallel task and benchmark it. But the task you have described looks like it could be a good candidate for parallelization.
It seems to me that your task should benefit from multiple threads of execution (processes or threads), as it seems to have a very roughly even blend of I/O and CPU. I would expect a speedup of a factor of a few but it is hard to tell without knowing details.
One way is to break the list of files into groups, as many as there are cores that you can spare. Then process each group in a fork, which assembles its results and passes them back to the parent once done, via a pipe or files. There are modules that do this and much more, for example Forks::Super or Parallel::ForkManager. They also offer a queue, another approach you can use.
I do this regularly when a lot of data in files is involved and get near linear speedup with up to 4 or 5 cores (on NFS), or even with more cores depending on the job details and on hardware.
I would cautiously assert that this may be simpler than threads, so it may be worth trying first.
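A bare-bones sketch of that fork-per-group approach with only core Perl (filenames are placeholders, and a temporary file per child stands in for a pipe; the modules mentioned above wrap all of this bookkeeping for you):

```perl
# Split the file list into one group per worker, process each group in a
# forked child, and pass results back through a temp file per child.
use strict;
use warnings;

my @files   = map { "file$_.xls" } 1 .. 8;   # placeholder file list
my $workers = 4;

# Round-robin the files into $workers groups
my @groups;
push @{ $groups[ $_ % $workers ] }, $files[$_] for 0 .. $#files;

my %pid_to_out;
for my $g (@groups) {
    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) {                          # child
        open my $out, '>', "results.$$" or die $!;
        for my $file (@$g) {
            # placeholder for the real ParseExcel work on $file
            print {$out} "$file\tvalue\n";
        }
        close $out;
        exit 0;
    }
    $pid_to_out{$pid} = "results.$pid";       # parent: remember child's file
}

# Collect results as each child finishes
my @outputBuffer;
for my $pid (keys %pid_to_out) {
    waitpid($pid, 0);
    open my $in, '<', $pid_to_out{$pid} or die $!;
    push @outputBuffer, <$in>;
    close $in;
    unlink $pid_to_out{$pid};
}
print scalar(@outputBuffer), " results collected\n";
```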
Another way would be to create a thread queue (Thread::Queue) and feed it the filename groups. Note that Perl's threads are not the lightweight "threads" one might expect; quite the opposite: they are heavy, they copy everything into each thread (so start them up front, before there is much data in the program), and they come with other subtleties. Have a small number of workers, each with a sizable job (a nice list of files), instead of many threads working rapidly against the queue.
In this approach, too, be careful about how to pass results back since frequent communication poses a significant overhead for (Perl's) threads.
In either case it is important that the groups are formed so as to provide a balanced workload per thread/process. If this is not possible (you may not know which files will take much longer than others), then have threads take smaller batches, while for forks you can use a queue from one of the modules above.
Handing only a file or a few to a thread or a process is most likely way too light a workload, in which case the management overhead may erase (or reverse) any possible speed gains. The I/O overlap across threads/processes would also increase, and that overlap is the main limit to speedup here.
The optimal number of files to pass to a thread/process is hard to estimate, even with all details at hand; you just have to try. I assume the reported runtime (over 5 seconds per file) is due to some inefficiency that can be removed, so first check your code for undue inefficiencies. If a file really does take that long to process, then start by passing a single file at a time to the queue.
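A sketch of that queue approach (placeholder filenames; shared_clone is needed because Thread::Queue only accepts shared references, and the workers are started before the data is built, per the advice above):

```perl
# A few workers, each repeatedly pulling a BATCH of filenames from a
# Thread::Queue; one undef per worker marks end-of-work.
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $q = Thread::Queue->new;

# Start workers early, before the program holds much data
my @workers = map {
    threads->create(sub {
        my @results;
        while (defined(my $batch = $q->dequeue)) {
            push @results, map { "$_:done" } @$batch;   # placeholder work
        }
        return @results;               # handed back to the parent on join
    });
} 1 .. 4;

# Enqueue sizable batches, not single files, to keep overhead low
my @files = map { "file$_.xls" } 1 .. 100;
while (my @batch = splice @files, 0, 25) {
    $q->enqueue(shared_clone(\@batch));
}
$q->enqueue(undef) for @workers;       # one end-marker per worker

my @all = map { $_->join } @workers;
print scalar(@all), " files processed\n";
```

Returning results via join (rather than through the queue) keeps the cross-thread communication to a single hand-off per worker, which matters given how expensive communication is for Perl threads.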
Also, please consider mob's answer carefully. And note that these are advanced techniques.
What you do is just change "for ...." into "mce_loop ...." and you'll see the boost, although I suggest you take a look at MCE::Loop first.

How to make FIO replay a trace with multiple thread

I'm trying to use fio to replay some block traces.
The job file I wrote looks like:
[global]
name=replay
filename=/dev/md0
direct=1
ioengine=psync
[replay]
read_iolog=iolog.fio
replay_no_stall=0
write_lat_log=replay_metrics
numjobs=1
The key here is I want to use "psync" as the ioengine, and replay the iolog.
However, with psync, fio seems to ignore the "replay_no_stall" option, which makes it ignore the timestamps in the iolog.
And by setting numjobs to be 4, fio seems to make 4 copies of the same workload, instead of using 4 threads to split the workload.
So, how could I make fio with psync respect the timestamp, and use multiple threads to replay the trace?
Without seeing a small problem snippet of the iolog itself I can't say why the replay is always going as fast as possible. Be aware that waits are in milliseconds and successive waits in the iolog MUST increase if the later ones are to have an effect (as they are relative to the start of the job itself and not to each other or the previous I/O). See the "Trace file format v2" section of the HOWTO for more details. This problem sounds like a good question for the fio mailing list (but as it's a question please don't put it in the bug tracker).
numjobs is documented as only creating clones in the HOWTO so your experience matches the documented behaviour.
Sadly fio replay currently (end of 2016) doesn't work in a way that a single replay file can be arbitrarily split among multiple jobs and you need multiple jobs to have fio use multiple threads/processes. If you don't mind the fact that you will lose I/O ordering between jobs you could split the iolog into 4 pieces and create a job that uses each of the new iolog files.
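For example, assuming you have split iolog.fio into four pieces named iolog.part1.fio through iolog.part4.fio (these names are placeholders), the job file might look like this; remember that ordering between the four jobs is lost:

```ini
[global]
name=replay
filename=/dev/md0
direct=1
ioengine=psync

[replay1]
read_iolog=iolog.part1.fio

[replay2]
read_iolog=iolog.part2.fio

[replay3]
read_iolog=iolog.part3.fio

[replay4]
read_iolog=iolog.part4.fio
```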

Waiting on many parallel shell commands with Perl

Concise-ish problem explanation:
I'd like to be able to run multiple (we'll say a few hundred) shell commands, each of which starts a long running process and blocks for hours or days with at most a line or two of output (this command is simply a job submission to a cluster). This blocking is helpful so I can know exactly when each finishes, because I'd like to investigate each result and possibly re-run each multiple times in case they fail. My program will act as a sort of controller for these programs.
for all commands in parallel {
    submit_job_and_wait()
    tries = 1
    while ! job_was_successful and tries < 3 {
        resubmit_with_extra_memory_and_wait()
        tries++
    }
}
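The pseudocode above can be sketched with plain core fork(): one child per submission, each retrying up to 3 times. The job names are placeholders, and "qsub -sync y" is assumed to block until the job finishes and to exit non-zero on failure:

```perl
# One forked worker per submission; each worker blocks on its own
# "qsub -sync y" and retries on failure.
use strict;
use warnings;

my @commands = map { "qsub -sync y job$_.sh" } 1 .. 3;   # placeholder jobs

my @pids;
for my $cmd (@commands) {
    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) {                  # child: submit, retry on failure
        my $tries = 0;
        my $status;
        do {
            $status = system($cmd);   # blocks until the job completes
            $tries++;
            # on a retry you would bump the memory request here
        } while ($status != 0 && $tries < 3);
        exit($status == 0 ? 0 : 1);
    }
    push @pids, $pid;
}

# Reap children as they finish, in completion order
while ((my $pid = wait()) != -1) {
    my $ok = ($? == 0) ? "succeeded" : "failed";
    print "worker $pid $ok\n";
}
```

Since each worker spends almost all its time blocked in wait, a few hundred of these processes mostly cost memory, not CPU.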
What I've tried/investigated:
I was so far thinking it would be best to create a thread for each submission which just blocks waiting for input. There is enough memory for quite a few waiting threads. But from what I've read, perl threads are closer to duplicate processes than in other languages, so creating hundreds of them is not feasible (nor does it feel right).
There also seem to be a variety of event-loop-ish cooperative systems like AnyEvent and Coro, but these seem to require you to rely on asynchronous libraries, otherwise you can't really do anything concurrently, and I can't figure out how to run multiple shell commands with them. I've tried using AnyEvent::Util::run_cmd, but after I submit multiple commands, I have to choose the order in which I wait for them. I don't know in advance how long each submission will take, so I can't recv without sometimes getting very unlucky. This isn't really parallel.
my $cv1 = run_cmd("qsub -sync y 'sleep $RANDOM'");
my $cv2 = run_cmd("qsub -sync y 'sleep $RANDOM'");
# Now should I $cv1->recv first or $cv2->recv? Who knows!
# Out of 100 submissions, I may have to wait on the longest one before processing any.
My understanding of AnyEvent and friends may be wrong, so please correct me if so. :)
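For what it's worth, one AnyEvent pattern that avoids picking a recv order is a single counting condvar (begin/end) plus a callback on each command's condvar, so completions are handled in whatever order they arrive. A sketch, assuming AnyEvent is installed:

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::Util qw(run_cmd);

my @commands = map { "sleep " . int(rand 3) } 1 .. 5;   # placeholder commands

my $all_done = AnyEvent->condvar;
$all_done->begin;                       # guard so recv can't fire early

for my $cmd (@commands) {
    $all_done->begin;
    my $cv = run_cmd $cmd;
    $cv->cb(sub {
        my $exit = shift->recv;         # exit status, as each one finishes
        print "'$cmd' exited with $exit\n";
        $all_done->end;                 # count this command as done
    });
}

$all_done->end;                         # release the guard
$all_done->recv;                        # returns once every command is done
```

The retry logic from the pseudocode could live inside the callback: on a non-zero exit, call run_cmd again and re-arm the callback instead of calling $all_done->end.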
The other option is to run the job submission in its non-blocking form and have it communicate its completion back to my process, but the inter-process communication required to accomplish and coordinate this across different machines daunts me a little. I'm hoping to find a local solution before resorting to that.
Is there a solution I've overlooked?
You could instead use scientific workflow software such as FireWorks or Pegasus, which are designed to help scientists submit large numbers of computing jobs to shared or dedicated resources. They can also do much more, so they might be overkill for your problem, but they are still worth a look.
If your goal is to find the tightest memory requirements for your job, you could also simply submit the job with a large amount of requested memory, and then extract actual memory usage from accounting (qacct) or, cluster policy permitting, log on to the compute node(s) where your job is running and view the memory usage with top or ps.

Node.js async parallel - what consequences are?

There is this code:
async.series(tasks, function (err) {
    return callback({ message: 'tasks execution error', error: err });
});
where tasks is an array of functions, each of which performs an HTTP request (using the request module) and calls the MongoDB API to store the data (to a MongoHQ instance).
With my current input, (~200 task to execute), it takes
[normal mode] collection cycle: 1356.843 sec. (22.61405 mins.)
But simply changing series to parallel gives a magnificent benefit. Almost the same set of tasks runs in ~30 secs instead of ~23 mins.
But, knowing that nothing is free, I'm trying to understand the consequences of that change. Can I expect the number of open sockets to be much higher, memory consumption greater, the DB servers hit harder?
The machine I run the code on is an Ubuntu box with only 1 GB of RAM. The app has hung there once; could that be caused by a lack of resources?
Your intuition is correct that the parallelism doesn't come for free, but you certainly may be able to pay for it.
Using a load testing module (or collection of modules) like nodeload, you can quantify how this parallel operation is affecting your server to determine if it is acceptable.
Async.parallelLimit can be a good way of limiting server load if you need to, but first it is important to discover if limiting is necessary. Testing explicitly is the best way to discover the limits of your system (eachLimit has a different signature, but could be used as well).
Beyond this, common pitfalls using async.parallel include wanting more complicated control flow than that function offers (which, from your description doesn't seem to apply) and using parallel on too large of a collection naively (which, say, may cause you to bump into your system's file descriptor limit if you are writing many files). With your ~200 request and save operations on 1GB RAM, I would imagine you would be fine as long as you aren't doing much massaging in the event handlers, but if you are experiencing server hangs, parallelLimit could be a good way out.
Again, testing is the best way to figure these things out.
I would point out that async.parallel executes multiple functions concurrently, not (completely) in parallel; it is more like virtual parallelism.
Executing concurrently is like running different programs on a single CPU core via multitasking/scheduling. True parallel execution would be running different programs on each core of a multi-core CPU. This matters because node.js has a single-threaded architecture.
The best thing about node is that you don't have to worry about I/O. It handles I/O very efficiently.
In your case you are storing data to MongoDB, which is mostly I/O. So running the tasks in parallel will use up your network bandwidth, and if you're reading/writing from disk, disk bandwidth too. Your server will not hang because of CPU overload.
The consequence of this would be that if you overburden your server, requests may start to fail. You may get an EMFILE error (too many open files); each socket counts as a file. Usually connections are pooled: to establish a connection, a socket is picked from the pool and returned to it when finished. You can raise the file descriptor limit with ulimit -n xxxx.
You may also get socket errors when overburdened, like ECONNRESET (Error: socket hang up), ECONNREFUSED, or ETIMEDOUT, so handle them properly. Also check the maximum number of simultaneous connections your MongoDB server allows.
Finally, the server can hang because of garbage collection. Garbage collection kicks in after your memory grows to a certain point, then runs periodically. The max heap V8 can have is around 1.5 GB, so expect GC to run frequently if memory usage is high. Node will crash with "process out of memory" if it asks for more than that limit, so fix any memory leaks in your program; there are profiling tools that can help you find them.
The main downside you'll see here is a spike in database server load. That may or may not be okay depending on your setup.
If your database server is a shared resource then you will probably want to limit the parallel requests by using async.eachLimit instead.
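If you'd rather see what the limiting does before pulling in a library call, the idea behind async.eachLimit is small enough to hand-roll. A dependency-free sketch in plain Node (task payloads are placeholders):

```javascript
// At most `limit` tasks in flight at once: start `limit` "lanes", and
// each lane starts the next task as soon as its current one finishes.
function eachLimit(items, limit, worker) {
  let i = 0;
  function next() {
    if (i >= items.length) return Promise.resolve();
    const item = items[i++];
    return worker(item).then(next);   // chain the next task onto this lane
  }
  const lanes = [];
  for (let n = 0; n < Math.min(limit, items.length); n++) lanes.push(next());
  return Promise.all(lanes);          // resolves when every lane drains
}

// Usage: pretend each task is an HTTP request + DB save
const tasks = Array.from({ length: 10 }, (_, k) => `task${k}`);
eachLimit(tasks, 3, (t) =>
  new Promise((res) => setTimeout(() => { console.log(`${t} done`); res(); }, 10))
).then(() => console.log("all done"));
```

With limit set to the collection size this degenerates into async.parallel behavior; with limit 1 it behaves like async.series, which is exactly the knob you want for a shared database server.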
You'll realize the difference once multiple users connect: the processor can then interleave multiple operations, and async tries to run the operations of the various users roughly equally.
T = task, U = user (T1.U1 = task 1 of user 1):
T1.U1 => T1.U2 => T2.U1 => T8.U3 => T2.U2 => etc.
This is the opposite of atomicity (so maybe watch for atomicity on sensitive DB operations -- but that's another topic). It can therefore happen that T2.U1 runs before T1.U1. That is no problem unless T2.U1 depends on T1.U1, and that is exactly what callbacks are for: they prevent it by enforcing order where it matters.
...hope this is what you wanted to know... it's a bit late here

Returning LOTS of items from a MongoDB via Node.js

I'm returning A LOT (500k+) of documents from a MongoDB collection in Node.js. It's not for display on a website, but rather for some number crunching. If I grab ALL of those documents, the system freezes. Is there a better way to grab them all?
I'm thinking pagination might work?
Edit: This is already outside the main node.js server event loop, so "the system freezes" does not mean "incoming requests are not being processed"
After learning more about your situation, I have some ideas:
Do as much as you can in a Map/Reduce function in Mongo - perhaps if you throw less data at Node that might be the solution.
Perhaps this much data is eating all the memory on your system. Your "freeze" could be V8 stopping everything to do a garbage collection (see this SO question). You could use the V8 flag --trace-gc to log GCs and confirm this hypothesis (thanks to another SO answer about V8 and garbage collection).
Pagination, like you suggested, may help. Perhaps even splitting your data up further into worker queues (create one worker task with references to records 1-10, another with references to records 11-20, etc.), depending on how your calculation can be partitioned.
Perhaps pre-processing your data - ie: somehow returning much smaller data for each record. Or not using an ORM for this particular calculation, if you're using one now. Making sure each record has only the data you need in it means less data to transfer and less memory your app needs.
I would put your big fetch+process task on a worker queue, background process, or forking mechanism (there are a lot of different options here).
That way you do your calculations outside of your main event loop and keep that free to process other requests. While you should be doing your Mongo lookup in a callback, the calculations themselves may take up time, thus "freezing" node - you're not giving it a break to process other requests.
Since you don't need them all at the same time (that's what I've deduced from you asking about pagination), perhaps it's better to separate those 500k documents into smaller chunks and process each chunk on a following tick of the event loop?
You could also use something like Kue to queue the chunks and process them later (thus not everything in the same time).
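A sketch of that chunking idea, using setImmediate to yield between slices (rather than process.nextTick, which would starve pending I/O callbacks); the document count and chunk size are placeholders:

```javascript
// Crunch a big array in slices, yielding back to the event loop between
// slices so other requests keep being served.
function processInChunks(docs, chunkSize, workFn, done) {
  let i = 0;
  let total = 0;
  function step() {
    const end = Math.min(i + chunkSize, docs.length);
    for (; i < end; i++) total += workFn(docs[i]);   // CPU work on one slice
    if (i < docs.length) {
      setImmediate(step);          // let pending I/O callbacks run first
    } else {
      done(total);
    }
  }
  step();
}

// Usage: 500k fake "documents", summed 10k at a time
const docs = new Array(500000).fill(1);
processInChunks(docs, 10000, (d) => d, (sum) => {
  console.log("processed", sum, "documents");
});
```

The total runtime is about the same as a single tight loop, but node gets a chance to service other requests between every 10k-document slice, which is what stops the "freeze".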