Node.js and calculation-heavy operations: utilizing the CPU to the maximum with worker threads while still getting some responsiveness - node.js

I'm trying to solve the following scenario in Node.js in a performant manner.
I have about 100 MB worth of JSON files which I need to process, and the time complexity of processing each entry is about O(sweet_jesus(n)). In real time it takes about 4-5 seconds per entry.
The only silver lining is that I can run the processing of each entry individually (about 900 entries in total); they are unrelated.
My first choice was to go for worker_threads with node-worker-threads-pool:
import fs from 'fs';
import path from 'path';
import _ from 'lodash';
import moment from 'moment';
import workerPool from 'node-worker-threads-pool';
function generateShortEvaluationsByWorkers() {
    const pool = new workerPool.StaticPool({
        size: 10,
        task: path.resolve('src/simulator/evaluationGenerator.js')
    });
    let simulationEvaluations = [];
    const promises = [];
    fs.readdirSync(path.resolve(`results/companies`)).forEach(file => {
        const rawData = fs.readFileSync(path.resolve(`results/companies/${file}`));
        const company = JSON.parse(rawData);
        console.log(new Date(), ": company parsed, sending it for processing:", file);
        promises.push(pool.exec(company).then(result => {
            simulationEvaluations.push(result);
        }));
    });
    Promise.all(promises).then(() => {
        fs.writeFileSync(
            path.resolve(`results/bundles/simulationEvaluations.json`),
            JSON.stringify(simulationEvaluations, null, 2)
        );
        pool.destroy();
    });
}
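(For context, the task file referenced above - src/simulator/evaluationGenerator.js - would look roughly like the sketch below, assuming node-worker-threads-pool's convention for file tasks of receiving the payload on parentPort and posting the result back; generateShortEvaluation is just a stand-in for the real processing:)
import { parentPort } from 'worker_threads';

parentPort.on('message', (company) => {
    // stand-in for the actual O(sweet_jesus(n)) processing of one company
    const evaluation = generateShortEvaluation(company);
    parentPort.postMessage(evaluation);
});

function generateShortEvaluation(company) {
    return { company, done: true }; // placeholder only
}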
The above code runs beautifully; it shows that the I/O - reading all the files and feeding them to the pool - takes about 5-6 seconds...
But after that there is absolutely no difference whatsoever compared to running the whole thing in a single thread. The logs do show that the individual files are no longer processed in order as before, so I guess there is some threading happening in the background, but the total time does not change one bit. It takes about an hour either way.
Also my hyper-threaded Intel 8750 with 6 cores (12 logical) shows 86% utilization going to the node process. So my alleged 10 separate threads don't even manage to utilize one full core. - EDIT: I was wrong, it does make a huge difference, I just wrote down the times wrong...
After this I cranked the thread pool size up to 100 and sliced the number of files down to 100. And that's where the freaky stuff starts to happen. First, all my CPU cores go brrrr and my laptop properly melts through the table, as one would expect. The OS gives zero responsiveness, everything is a slideshow.
The first 20 or so files get processed within the same second, after which the processing of individual files goes to ~3 seconds each (neatly one after the other, one message 3-5 seconds after the previous). The last 10 or so files get processed within the same second again.
Why don't 10 threads make any difference compared to 1 thread?
Shouldn't I see files being processed in clusters, where the cluster size is comparable to the number of logical cores, instead of timestamps one after the other?
Is there a way to "leave" a core free to process something else, while the calculations still go to Neptune on all the other cores?
EDIT: I won't delete this, maybe somebody will learn from it :)
So to answer my own questions:
It does; I just could not measure it properly, wrote the times down wrong, and could not read my CPU meter either at that point... totally my fault.
This one I still don't fully get, but after a few runs I suspect that when you start a whole buttload of threads, you make the whole system hang so much just from the strain of starting them all that by the time it's able to spew out the first log, it's already done with a bunch of the calculation.
Yeah, this one is also kinda obvious: do not use so many threads that the thread management itself will make the OS throw a shitfit.
In the end I got the best results with 11 threads btw.
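A minimal sketch of that takeaway, deriving the pool size from the machine instead of hard-coding it (same node-worker-threads-pool setup as above; assumes os.cpus() reports the logical cores):
import os from 'os';
import path from 'path';
import workerPool from 'node-worker-threads-pool';

// Leave one logical core free so the OS and UI stay responsive
// while the remaining cores crunch the numbers.
const pool = new workerPool.StaticPool({
    size: Math.max(1, os.cpus().length - 1), // 11 on a 6-core / 12-thread CPU
    task: path.resolve('src/simulator/evaluationGenerator.js')
});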

Related

Handling large number of outbound HTTP requests

I am building a feed reader application where I expect to have a large number of sources. I would request new data from each source at a given time interval (e.g., hourly) and then cache the response on my server. I am assuming that requesting data from all sources at the same time is not the most optimal solution, as I will probably experience network congestion (I am curious to know if there would be any other bottlenecks too).
What would be an efficient way to perform such a large number of requests?
Thanks
Since there's no urgency to any given request and you just want to make sure you hit each one periodically, you can just space all the requests out in time.
For example, if you have N sources and you want to hit each one once an hour, you can just create a list of all the sources and keep track of an index for which source is next. Then, calculate how far apart you can make each request and still get through them all in an hour.
So, if you had N requests to process once an hour:
let listOfSources = [...];
let nextSourceIndex = 0;
const cycleTime = 1000 * 60 * 60;   // an hour in ms
const delta = Math.round(cycleTime / listOfSources.length);

// create interval timer that cycles through the sources
setInterval(() => {
    let index = nextSourceIndex++;
    if (index >= listOfSources.length) {
        // wrap back to start
        index = 0;
        nextSourceIndex = 1;
    }
    processNextSource(listOfSources[index]);
}, delta);

function processNextSource(item) {
    // process this source
}
Note, if you have a lot of sources and it takes a little while to process each one, you may still have more than one source "in flight" at the same time, but that should be OK.
If the processing was really CPU or network heavy, you would have to keep an eye on whether you're getting bogged down and can't get through all the sources in an hour. If that was the case, depending upon the bottleneck issue, you may need either more bandwidth, faster storage or more CPUs applied to the project (perhaps using worker threads or child processes).
If the number of sources is dynamic, or the time to process each one is dynamic, and you're anywhere near your processing limits, you could make this system adaptive so that if it gets overly busy it automatically spaces things out to more than an hour per cycle, and vice versa; if things are not so busy, it visits them more frequently. This would require keeping track of some stats, calculating a new cycleTime value, and adjusting the timer each time through the cycle, as sketched below.
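A rough sketch of that adaptive idea (reusing listOfSources and processNextSource from the snippet above): chain setTimeout calls instead of a fixed setInterval and recompute the spacing before each tick, so any stats-based adjustment just feeds into the delta calculation:
let nextSourceIndex = 0;

function scheduleNextSource() {
    // recompute every cycle so added/removed sources (or a measured slowdown)
    // automatically stretch or shrink the schedule
    const cycleTime = 1000 * 60 * 60; // target: one full pass per hour
    const delta = Math.round(cycleTime / listOfSources.length);
    setTimeout(() => {
        const index = nextSourceIndex++ % listOfSources.length;
        processNextSource(listOfSources[index]);
        scheduleNextSource(); // chained setTimeout instead of setInterval
    }, delta);
}

scheduleNextSource();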
There are different types of approaches too. A common procedure when you have a large number of asynchronous operations to get through is to process them in a way that N of them are in flight at any given time (where N is a relatively small number such as 3 to 10). This generally avoids overloading any local resources (such as memory usage, sockets in flight, bandwidth, etc...) while still allowing you some parallelism in the network aspect of things. This is the type of approach you might use if you want to get through all of them as fast as possible without overwhelming local resources, whereas the previous discussion is more about spacing them out in time.
Here's an implementation of a function called mapConcurrent() that iterates an array asynchronously with no more than N requests in flight at the same time. And, here's a function called rateMap() that is even more advanced in what type of concurrency controls it supports.
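The linked implementations are the ones to use; purely as an illustration of the pattern (at most N operations in flight at once), a simplified version might look like this:
// Simplified illustration only - not the linked implementation.
// Runs fn over items with at most `limit` promises in flight at a time.
async function mapConcurrent(items, limit, fn) {
    const results = new Array(items.length);
    let nextIndex = 0;

    async function runner() {
        while (nextIndex < items.length) {
            const i = nextIndex++; // claim the next item (no race: JS is single-threaded)
            results[i] = await fn(items[i], i);
        }
    }

    const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
    await Promise.all(runners);
    return results;
}
Calling something like mapConcurrent(listOfSources, 5, fetchAndCacheSource) (fetchAndCacheSource being a hypothetical per-source fetch function) would keep five requests in flight until all sources are done.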

Why is Python consistently struggling to keep up with constant generation of asyncio tasks?

I have a Python project with a server that distributes work to one or more clients. Each client is given a number of assignments which contain parameters for querying a target API. This includes a maximum number of requests per second they can make with a given API key. The clients process the response and send the results back to the server to store into a database.
Both the server and clients use Tornado for asynchronous networking. My initial implementation for the clients relied on the PeriodicCallback to ensure that n-number of calls to the API would occur. I thought that this was working properly as my tests would last 1-2 minutes.
I added some telemetry to collect statistics on performance and noticed that the clients were actually having issues after almost exactly 2 minutes of runtime. I had set the API requests to 20 per second (the maximum allowed by the API itself) which the clients could reliably hit. However, after 2 minutes performance would fluctuate between 12 and 18 requests per second. The number of active tasks steadily increased until it hit the maximum amount of active assignments (100) given from the server and the HTTP request time to the API was reported by Tornado to go from 0.2-0.5 seconds to 6-10 seconds. Performance is steady if I only do 14 requests per second. Anything higher than 15 requests will experience issues 2-3 minutes after starting. Logs can be seen here. Notice how the column of "Active Queries" is steady until 01:19:26. I've truncated the log to demonstrate
I believed the issue was the use of a single process on the client to handle both communication to the server and the API. I proceeded to split the primary process into several different processes. One handles all communication to the server, one (or more) handles queries to the API, another processes API responses into a flattened class, and finally a multiprocessing Manager for Queues. The performance issues were still present.
I thought that, perhaps, Tornado was the bottleneck and decided to refactor. I chose aiohttp and uvloop. I split the primary process in a similar manner to that in the previous attempt. Unfortunately, performance issues are unchanged.
I took both refactors and enabled them to split work into several querying processes. However, no matter how much you split the work, you still encounter problems after 2-3 minutes.
I am using both Python 3.7 and 3.8 on MacOS and Linux.
At this point, it does not appear to be a limitation of a single package. I've thought about the following:
Python's asyncio library cannot handle more than 15 coroutines/tasks being generated per second
I doubt that this is true given that different libraries claim to be able to handle several thousand messages per second simultaneously. Also, we can hit 20 requests per second just fine at the start with very consistent results.
The API is unable to handle more than 15 requests from a single client IP
This is unlikely as I am not the only user of the API and I can request 20 times per second fairly consistently over an extended period of time if I over-subscribe processes to query from the API.
There is a system configuration causing the limitation
I've tried both macOS and Debian, which yield the same results. It's possible that it's a *nix problem.
Variations in responses cause a backlog which grows linearly until it cannot be tackled fast enough
Sometimes responses from the API grow and shrink between 0.2 and 1.2 seconds. The number of active tasks returned by asyncio.all_tasks remains consistent in the telemetry data. If this were true, we wouldn't be consistently encountering the issue at the same time every time.
We're overtaxing the hardware with the number of tasks generated per second and causing thermal throttling
Although CPU temperatures spike, neither MacOS nor Linux report any thermal throttling in the logs. We are not hitting more than 80% CPU utilization on a single core.
At this point, I'm not sure what's causing it and have considered refactoring the clients into a different language (perhaps C++ with Boost libraries). Before I dive into something so foolish, I wanted to ask if I'm missing something simple.
Conclusion
Performance appears to vary wildly depending on time of day. It's likely to be the API.
How this conclusion was made
I created a new project to demonstrate the capabilities of asyncio and determine if it's the bottleneck. This project takes two websites, one to act as the baseline and the other is the target API, and runs through different methods of testing:
Spawn one process per core, pass a semaphore, and query up to n-times per second
Create a single event loop and create n-number of tasks per second
Create multiple processes with an event loop each to distribute the work, with each loop performing (n-number / processes) tasks per second
(Note that spawning processes is incredibly slow and often commented out unless using high-end desktop processors with 12 or more cores)
The baseline website would be queried up to 50 times per second. asyncio could complete 30 tasks per second reliably for an extended period, with each task completing their run in 0.01 to 0.02 seconds. Responses were very consistent.
The target website would be queried up to 20 times per second. Sometimes asyncio would struggle despite circumstances being identical (JSON handling, dumping response data to queue, returning immediately, no CPU-bound processing). However, results varied between tests and could not always be reproduced. Responses would be under 0.4 seconds initially but quickly increase to 4-10 seconds per request. 10-20 requests would return as complete per second.
As an alternative method, I chose a parent URI for the target website. This URI wouldn't have a large query to their database but instead be served back with a static JSON response. Responses bounced between 0.06 seconds to 2.5-4.5 seconds. However, 30-40 responses would be completed per second.
Splitting requests across processes with their own event loop would decrease response time in the upper-bound range by almost half, but still took more than one second each to complete.
The inability to reproduce consistent results every time from the target website would indicate that it's a performance issue on their end.

Performance issue while using Parallel.ForEach() with MaxDegreeOfParallelism set as ProcessorCount

I wanted to process records from a database concurrently and within minimum time, so I thought of using a Parallel.ForEach() loop to process the records with the value of MaxDegreeOfParallelism set to ProcessorCount.
ParallelOptions po = new ParallelOptions();
po.MaxDegreeOfParallelism = Environment.ProcessorCount;

Parallel.ForEach(listUsers, po, (user) =>
{
    // Parallel processing
    ProcessEachUser(user);
});
But to my surprise, the CPU utilization was not even close to 20%. When I dug into the issue and read the MSDN article on this (http://msdn.microsoft.com/en-us/library/system.threading.tasks.paralleloptions.maxdegreeofparallelism(v=vs.110).aspx), I tried using the specific value of -1 for MaxDegreeOfParallelism. As the article says, this value removes the limit on the number of concurrently running operations, and the performance of my program improved to a great extent.
But that also didn't meet my requirement for the maximum time taken to process all the records in the database. So I analyzed it further and found that there are two settings, MinThreads and MaxThreads, in the thread pool. By default the values of MinThreads and MaxThreads are 10 and 1000 respectively; at the start only 10 threads are created, and this number keeps increasing up to a maximum of 1000 with every new user, unless a previous thread has finished its execution.
So I set the initial value of MinThreads to 900 in place of 10 using
System.Threading.ThreadPool.SetMinThreads(900, 900);
so that right from the start a minimum of 900 threads are created, and I thought that it would improve the performance significantly. This did create 900 threads, but it also greatly increased the number of failures when processing each user. So I did not achieve much using this logic. So I changed the value of MinThreads to 100 and found that the performance was much better.
But I wanted to improve it more, as my time requirement was still not met; it was still exceeding the time limit to process all the records. You might think I was already using all the best possible things to get the maximum performance out of parallel processing, and I was thinking the same.
But to meet the time limit I decided to take a shot in the dark. I created two different executable files (Slaves) in place of only one and assigned each of them half of the users from the DB. Both executables were doing the same thing and were executing concurrently. I created another Master program to start these two Slaves at the same time.
To my surprise, it reduced the time taken to process all the records to nearly half.
Now my question is simply that I do not understand the logic behind the Master/Slave approach giving better performance compared to a single EXE, with all the logic the same in both the Slaves and the previous EXE. So I would highly appreciate it if someone could explain this in detail.
But to my surprise, the CPU utilization was not even close to 20%.
…
It uses HTTP requests to some Web APIs hosted on other networks.
This means that CPU utilization is entirely the wrong thing to look at. When using the network, it's your network connection that's going to be the limiting factor, or possibly some network-related limit, certainly not CPU.
Now I created two different executable files … To my surprise, it reduced the time taken to process all the records nearly to the half.
This points to an artificial, per process limit, most likely ServicePointManager.DefaultConnectionLimit. Try setting it to a larger value than the default at the start of your program and see if it helps.

Speed Up with multithreading

I have a parse method in my program which first reads a file from disk and then parses the lines, creating an object for every line. For every file, a collection with the objects from the lines is saved afterwards. The files are about 300MB.
This takes about 2.5-3 minutes to complete.
My question: Can I expect a significant speed up if I split the tasks up so that one thread just reads the files from disk, another parses the lines, and a third saves the collections? Or would this maybe slow down the process?
How long does it usually take a modern notebook hard disk to read 300MB? I think the bottleneck is the CPU in my task, because when I execute the method, one core of the CPU is always at 100% while the disk is idle more than half the time.
greetings, rain
EDIT:
private CANMessage parseLine(String line)
{
    try
    {
        CANMessage canMsg = new CANMessage();
        int offset = 0;
        int offset_add = 0;
        char[] delimiterChars = { ' ', '\t' };
        string[] elements = line.Split(delimiterChars);
        if (!isMessageLine(ref elements))
        {
            return canMsg = null;
        }
        offset = getPositionOfFirstWord(ref elements);
        canMsg.TimeStamp = Double.Parse(elements[offset]);
        offset += 3;
        offset_add = getOffsetForShortId(ref elements, ref offset);
        canMsg.ID = UInt16.Parse(elements[offset], System.Globalization.NumberStyles.HexNumber);
        offset += 17; // for signs between identifier and data length number
        canMsg.DataLength = Convert.ToInt16(elements[offset + offset_add]);
        offset += 1;
        parseDataBytes(ref elements, ref offset, ref offset_add, ref canMsg);
        return canMsg;
    }
    catch (Exception exp)
    {
        MessageBox.Show(line);
        MessageBox.Show(exp.Message + "\n\n" + exp.StackTrace);
        return null;
    }
}
So this is the parse method. It works this way, but maybe you are right and it is inefficient. I have .NET Framework 4.0 and I am on Windows 7. I have a Core i7 where every core has Hyper-Threading, so I am only using about 1/8 of the CPU.
EDIT2: I am using Visual Studio 2010 Professional. It looks like the tools for performance profiling are not available in this version (according to the MSDN Beginners Guide to Performance Profiling).
EDIT3: I changed the code to use threads. It now looks like this:
foreach (string str in checkedListBoxImport.CheckedItems)
{
    toImport.Add(str);
}

for (int i = 0; i < toImport.Count; i++)
{
    String newString = new String(toImport.ElementAt(i).ToArray());
    Thread t = new Thread(() => importOperation(newString));
    t.Start();
}
The parsing you saw above is called inside importOperation(...).
With this code it was possible to reduce the time from about 2.5 minutes to "only" 40 seconds. I have some concurrency problems I still have to track down, but at least this is much faster than before.
Thank you for your advice.
It's unlikely that you are going to get consistent metrics for laptop hard disk performance, as we have no idea how old your laptop is, nor do we know whether it is solid state or spinning.
Considering you have already done some basic profiling, I'd wager the CPU really is your bottleneck, as it is impossible for a single-threaded application to use more than 100% of a single CPU. This is of course ignoring your operating system splitting the process over multiple cores and other oddities. If you were getting 5% CPU usage instead, you'd most likely be bottlenecked on IO.
That said, your best bet would be to create a new thread task for each file you are processing and send that to a pooled thread manager. Your thread manager should limit the number of threads you are running to either the number of cores you have available or, if memory is an issue (you did say you were working with 300MB files, after all), the maximum amount of RAM you can afford to use for the process.
Finally, to answer the reason why you don't want to use a separate thread for each operation, consider what you already know about your performance bottlenecks. You are bottlenecked on CPU processing and not IO. This means that if you split your application into separate threads, your read and write threads would be starved most of the time waiting for your processing thread to finish. Additionally, even if you make them process asynchronously, you have the very real risk of running out of memory as your read thread continues to consume data that your processing thread can't keep up with.
Thus, be careful not to start each thread immediately; instead let them be managed by some form of blocking queue. Otherwise you run the risk of slowing your system to a crawl as you spend more time in context switches than processing. This is of course assuming you don't crash first.
It's unclear how many of these 300MB files you've got. A single 300MB file takes about 5 or 6 seconds to read on my netbook, with a quick test. It does indeed sound like you're CPU-bound.
It's possible that threading will help, although it's likely to complicate things significantly of course. You should also profile your current code - it may well be that you're just parsing inefficiently. (For example, if you're using C# or Java and you're concatenating strings in a loop, that's frequently a performance "gotcha" which can be easily remedied.)
If you do opt for a multi-threaded approach, then to avoid thrashing the disk, you may want to have one thread read each file into memory (one at a time) and then pass that data to a pool of parsing threads. Of course, that assumes you've also got enough memory to do so.
If you could specify the platform and provide your parsing code, we may be able to help you optimize it. At the moment all we can really say is that yes, it sounds like you're CPU bound.
That long for only 300 MB is bad.
There are different things that could be impacting performance as well, depending upon the situation, but typically reading from the hard disk is still likely the biggest bottleneck unless you have something intense going on during the parsing, which seems to be the case here, because it only takes several seconds to read 300MB from a hard disk (unless it's badly fragmented, maybe).
If you have some inefficient algorithm in the parsing, then picking or coming up with a better algorithm would probably be more beneficial. If you absolutely need that algorithm and there's no algorithmic improvement available, it sounds like you might be stuck.
Also, don't try to read and write at the same time with the multithreading; you'll likely slow things way down due to increased seeking.
Given that you think this is a CPU bound task, you should see some overall increase in throughput with separate IO threads (since otherwise your only processing thread would block waiting for IO during disk read/write operations).
Interestingly I had a similar issue recently and did see a significant net improvement by running separate IO threads (and enough calculation threads to load all CPU cores).
You don't state your platform, but I used the Task Parallel Library and a BlockingCollection for my .NET solution and the implementation was almost trivial. MSDN provides a good example.
UPDATE:
As Jon notes, the time spent on IO is probably small compared to the time spent calculating, so while you can expect an improvement, the best use of your time may be profiling and improving the calculation itself. Using multiple threads for the calculation will speed things up significantly.
Hmm.. 300MB of lines that have to be split up into a lot of CAN message objects - nasty! I suspect the trick might be to thread off the message assembly while avoiding excessive disk-thrashing between the read and write operations.
If I were doing this as a 'fresh' requirement (and of course, with my 20/20 hindsight, knowing that CPU was going to be the problem), I would probably use just one thread for reading, one for writing to the disk and, initially at least, one thread for the message object assembly. Using more than one thread for message assembly means the complication of re-sequencing the objects after processing to prevent the output file being written out of order.
I would define a nice disk-friendly sized chunk class of lines and message-object array instances, say 1024 of them, create a pool of chunks at startup, 16 say, and shove them onto a storage queue. This controls and caps memory use, greatly reduces new/dispose/malloc/free (it looks like you have a lot of this at the moment!), improves the efficiency of the disk r/w operations as only large reads/writes are performed (except for the last chunk, which will in general be only partly filled), provides inherent flow control (the read thread cannot 'run away' because the pool will run out of chunks and the read thread will block on the pool until the write thread returns some chunks), and inhibits excess context switching because only large chunks are processed.
The read thread opens the file, gets a chunk from the queue, reads the disk, parses it into lines and shoves the lines into the chunk. It then queues the whole chunk to the processing thread and loops around to get another chunk from the pool. Possibly, the read thread could, on start or when idle, wait on its own input queue for a message class instance that contains the read/write filespecs. The write filespec could be propagated through a field of the chunks, supplying the write thread with everything it needs via the chunks. This makes a nice subsystem to which filespecs can be queued, and it will process them all without any further intervention.
The processing thread gets chunks from its input queue, splits the lines up into the message objects in the chunk, and then queues the completed, whole chunks to the write thread.
The write thread writes the message objects to the output file and then requeues the chunk to the storage pool queue for re-use by the read thread.
All the queues should be blocking producer-consumer queues.
One issue with threaded subsystems is completion notification. When the write thread has written the last chunk of a file, it probably needs to do something. I would probably fire an event with the last chunk as a parameter so that the event handler knows which file has been completely written. I would probably do something similar with error notifications.
If this is not fast enough, you could try:
1) Ensure that the read and write threads cannot be preempted in favour of each other during chunk disk I/O by using a mutex. If your chunks are big enough, this probably won't make much difference.
2) Use more than one processing thread. If you do this, chunks may arrive at the write thread 'out of order'. You would maybe need a local list and perhaps some sort of sequence number in the chunks to ensure that the disk writes are correctly ordered.
Good luck, whatever design you come up with..
Rgds,
Martin

Please tell me what is wrong with my threading!

I have a function where I compress a bunch of files into a single compressed file. It is taking a long time (to compress), so I tried implementing threading in my application. Say I have 20 files for compression; I separated that as 5*4=20. In order to do that I have separate variables (which are used for compression) for all 4 threads, to avoid locks, and I wait until the 4 threads finish. Now... the threads are working, but I see no improvement in their performance. Normally it takes 1 min for 20 files (for example); after implementing threading there is only a 5 or 3 second difference, sometimes none at all.
Here I will show the code for 1 thread (it is the same for the other 3 threads):
// main thread
myClassObject->thread1 = AfxBeginThread((AFX_THREADPROC)MyThreadFunction1, myClassObject);
....
HANDLE threadHandles[4];
threadHandles[0] = myClassObject->thread1->m_hThread;
....
WaitForSingleObject(myClassObject->thread1->m_hThread, INFINITE);

UINT MyThreadFunction(LPARAM lparam)
{
    CMerger* myClassObject = (CMerger*)lparam;
    CString outputPath = myClassObject->compressedFilePath.GetAt(0); // contains the o/p path
    wchar_t* compressInputData[] = { myClassObject->thread1outPath,
        COMPRESS, (wchar_t*)(LPCTSTR)(outputPath) };
    HINSTANCE loadmyDll;
    loadmyDll = LoadLibrary(myClassObject->thread1outPath);
    fp_Decompress callCompressAction = NULL;
    int getCompressResult = 0;
    myClassObject->MyCompressFunction(compressInputData, loadClient7zdll, callCompressAction,
        myClassObject->thread1outPath, getCompressResult, minIndex,
        myClassObject->firstThread, myClassObject);
    return 0;
}
Firstly, you only wait on one of the threads. I think you want WaitForMultipleObjects.
As for the lack of speed up, have you considered that your actual bottleneck is NOT the compression but the file loading? File loading is slow, and 4 threads contending for time slices of the hard disk "could" even result in lower performance.
This is why premature optimisation is evil. You need to profile, profile and profile again to work out where your REAL bottlenecks are.
Edit: I can't really comment on your WaitForMultipleObjects unless I see the code. I have never had any problems with it myself ...
As for a bottleneck: it's a metaphor. If you try to pour a large amount of liquid out of a cylinder by tipping it upside down, the water leaves at a constant rate. If you try to do this with a bottle, you will notice that it can't do it as fast. This is because there is only so much liquid that can flow through the thin part of the bottle (not to mention the air entering it). Thus the rate at which your liquid empties from the container is limited by the neck of the bottle (the thin part).
In programming, when you talk about a bottleneck you are talking about the slowest part of the code. In this case, if your threads spend most of their time waiting for the disk load to complete, then you are going to get very little speed up from multi-threading, as you can only load so much at once. In fact, when you try to load 4 times as much at once, you will start to find that you have to wait around just as long for the loads to complete. In your single-threaded case you wait around and, once it's loaded, you compress. In the 4-threaded case you are waiting around 4 times as long for all the loads to complete, and then you compress all 4 files simultaneously. This is why you get only a small speed up. Unfortunately, because you are spending most of your time waiting for the loads to complete, you won't see anything approaching a 4x speed up. Hence the limiting factor of your method is not the compression but loading the file from disk, and hence it gets called a bottleneck.
Edit2: In a case such as you are describing, you will find the best speed up is had by reducing the amount of time you spend waiting for data to load from disk.
1) If you load a file in multiples of the disk page size (usually 2048 bytes, but you can query Windows to get the size), you get the best possible load performance. If you load sizes that aren't a multiple of this, you will take quite a serious performance hit.
2) Look at asynchronous loading. For example, you could be loading all of file 2 (or more) into memory while you are processing file 1. This means that you aren't waiting around for the load to complete. It's unlikely, though, that you'll get a vast speed up here, as you'll probably still end up waiting for the load. The other thing to try is to load "chunks" of the file asynchronously, i.e.:
Load chunk 1.
Start chunk 2 loading.
Process chunk 1.
Wait for chunk 2 to load.
Start chunk 3 loading.
Process chunk 2.
(And so on)
3) You could just buy a faster disk drive.
