Multiple thread completion time measurement - multithreading

Use case is: I have a huge log file, which I'm reading on main thread chunk by chunk (equal size, IO read). Every chunk read approximately takes 1s in my test machine. After reading each chunk I'm using a threadpool to create a thread for each chunk to put it in 2 DB instances. Now I have 2 challenges:
I have to alternatively insert chunks into 2 DBS. i.e. odd chunks go to 1st DB and even chunks go to 2nd DB. I don't have anything in the chunk model to denote me the number of chunk on which I can depend. I tried to create a wrapper on that chunk model to have a "chunkCount" but where do I increment the chunkCount?
How do I measure the time for each insert which would be running on different threads from the threadpool?
Following code I tried on experiment basis, but it's not yielding any result:
logEventsChunk = logFetcher.GetNextLogEventsChunk();
chunkModel = new LogEventChunkModel();
stw = new Stopwatch();
chunkModel.ChunkCount = chunkCount;
chunkModel.LogeventChunk = logEventsChunk;
//chunkCount++;
ThreadPool.QueueUserWorkItem(new WaitCallback(delegate(object state)
{ InsertChunk(chunkModel, collection, secondCollection, stw); }), null);
The InsertChunk method is here:
private void InsertChunk(LogEventChunkModel logEventsChunk, MongoCollection<LogEvent> collection, MongoCollection<LogEvent> secondCollection,Stopwatch stw)
{
chunkCount++;
stw.Start();
MongoInsertOptions options = new MongoInsertOptions();
options.WriteConcern = WriteConcern.Unacknowledged;
options.CheckElementNames = true;
string db = string.Empty;
{
//DateTime dtWrite = DateTime.Now;
if (logEventsChunk.ChunkCount % 2 == 0)
{
DateTime dtWrite1 = DateTime.Now;
collection.InsertBatch(logEventsChunk.LogeventChunk.LogEvents, options);
db = "FirstDB";
//Console.WriteLine("Time taken to write the chunk: " + DateTime.Now.Subtract(dtWrite1).TotalSeconds.ToString() + " s. " + db);
}
else
{
DateTime dtWrite2 = DateTime.Now;
secondCollection.InsertBatch(logEventsChunk.LogeventChunk.LogEvents, options);
db = "SecondDB";
//Console.WriteLine("Time taken to write the chunk: " + DateTime.Now.Subtract(dtWrite2).TotalSeconds.ToString() + " s. " + db);
}
Console.WriteLine("Thread Completed: {0} **********", Thread.CurrentThread.GetHashCode() );
stw.Stop();
Console.WriteLine("Time taken to write the chunk: " + stw.ElapsedMilliseconds + " ms. " + db + " Chunk Count: " + logEventsChunk.ChunkCount);
stw.Reset();
//+ "Chunk Count: " + chunkCount.ToString()
//Console.WriteLine("Time taken to write the chunk: " + DateTime.Now.Subtract(dtWrite).TotalSeconds.ToString() + " s. "+db);
//mongoDBInsertionTotalTime += DateTime.Now.Subtract(dtWrite).TotalSeconds;
}
}
Please ignore those commented lines since they are part of some experiments only.

Rather than starting a new thread for each insertion, and trying to make the thread figure out which database to write to, start two persistent threads, each of which writes to a single database. Those threads get their data from queues. This is a pretty standard producer/consumer setup using BlockingCollection<T>.
So, you have:
// Maximum number of items in queue (to avoid out of memory errors)
const int MaxQueueSize = 10000;
BlockingCollection<LogEventChunkModel> Db1Queue = new BlockingCollection<LogEventChunkModel>(MaxQueueSize);
BlockingCollection<LogEventChunkModel> Db2Queue = new BlockingCollection<LogEventChunkModel>(MaxQueueSize);
In your main thread, start the database update threads:
var t1 = new Thread(DbWriteThreadProc);
t1.Start(new Tuple<string, BlockingCollection<LogEventChunkModel>>("FirstDB", Db1Queue));
var t2 = new Thread(DbWriteThreadProc);
t2.Start(new Tuple<string, BlockingCollection<LogEventChunkModel>>("SecondDb", Db2Queue));
Then, begin reading the log file and placing alternate chunks into the queues:
int chunk = 0;
while (!EndOfLogFile)
{
var chunk = GetNextChunk();
if ((chunk % 0) == 0)
Db1Queue.Add(chunk);
else
Db2Queue.Add(chunk);
++chunk;
}
// end of data, so mark the queues as complete
Db1Queue.CompleteAdding();
Db2Queue.CompleteAdding();
// and wait for threads to complete processing the queues
t1.Join();
t2.Join();
Your write thread proc is pretty simple. All it does is service the queue and write to the database:
void DbWriteThreadProc(object state)
{
// passed object is a Tuple<string, BlockingCollection>
// Get the items from it
var threadData = (Tuple<string, BlockingCollection>)state;
string dbName = threadData.Item1;
BlockingCollection<LogEventChunk> queue = threadData.Item2;
// now read the queue and write to the database
foreach (var chunk in queue.GetConsumingEnumerable())
{
var sw = Stopwatch.StartNew();
// write chunk to the database.
sw.Stop();
Console.WriteLine("Time to write = {0:N0} ms", sw.ElapsedMilliseconds);
}
}
GetConsumingEnumerable does a non-busy wait on the queue, so it's not continually polling. The loop will complete when the queue is empty and the queue is marked as complete for adding (which is why the main thread calls CompleteAdding).
This approach has several advantages over what you had. In particular, it simplifies determining which database chunks get written to. In addition, it uses at most three threads and guarantees that chunks are added to the database in the same order in which they were read from the log file. Your approach using QueueUserWorkItem does not guarantee insertion order. It also creates a new thread for each insertion, and could end up with a huge number of concurrent threads.

Related

Writing large amounts of streamed data frequently with writestream

I'm trying to write a live websocket feed line-by-line to a file - I think for this I should be using a writeable stream.
My problem here is that the data received is in the region of 10 lines per second, which quickly fills the buffer.
I understand when using streams from sources you control, you would normally add some sort of backpressure logic here, but what should I do if I do not control the source? Should I be batching up the writes and writing, say 500 lines at a time, instead of per line, or should I be using some other way to save this data?
I'm wondering how big are the lines? 10 lines per second sounds trivial to stream to a disk unless the lines are gigantic or the disk really slow. Ultimately, if you have no ability to apply backpressure logic, the source can overwhelm you if they go fast or your storage goes slow and you'd have to decide how much you can reasonably buffer and eventually just drop some of the data if you get behind.
But, you should be able to write a lot of data. On a my regular hard disk (using the generic stream code below with no additional buffering) I can do sequential writes of 100,000,000 bytes at a speed of 55 MBytes/sec:
So, if you have 10 lines per second coming in, as long as the lines were below 10,000,000 bytes each, my hard drive could keep up.
Here's the code I used to test it:
const fs = require('fs');
const { Bench } = require('../../Github/measure');
const { addCommas } = require("../../Github/str-utils");
const lineData = Buffer.from("012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678\n", 'utf-8');
let stream = fs.createWriteStream("D:\\Temp\\temp.txt");
stream.on('open', function() {
let linesRemaining = 1_000_000;
let b = new Bench();
let bytes = 0;
function write() {
do {
linesRemaining--;
let readyMore;
bytes += lineData.length;
if (linesRemaining === 0) {
readyForMore = stream.write(lineData, done);
} else {
readyForMore = stream.write(lineData);
}
} while (linesRemaining > 0 && readyForMore);
if (linesRemaining > 0) {
stream.once('drain', write);
}
}
function done() {
b.markEnd();
console.log(`Time to write ${addCommas(bytes)} bytes: ${b.formatSec(3)}`);
console.log(`bytes/sec = ${addCommas((bytes/b.sec).toFixed(0))}`);
console.log(`MB/sec = ${addCommas(((bytes/(1024 * 1024))/b.sec).toFixed(1))}`);
stream.end();
}
b.markBegin();
write();
});
Theoretically, it is more efficient for your disk to do fewer writes that are larger, than tons of small writes. In practice, because of the way the writeStream works, as soon as an inefficient write gets slow, the next write will get buffered and it kind of self corrects. If you were really trying to minimize the load on the disk, you would buffer writes until you had at least something like 4k to write. The issue is that each write has potentially allocate some bytes to the file (which involves writing to a table on the disk), then seek to where the bytes should be written on the disk, then write the bytes. Fewer and larger writes that are larger (up to some limit that depends upon internal implementation) will reduce the number of times it has to do the file allocation overhead.
So, I ran a test. I modified the above code (shown below) to buffer into 4k chunks and write them out in 4k chunks. The write through increased from 55 MBytes/sec to 284.2 MBytes/sec.
So, the theory holds true that you will write faster if you buffer into larger chunks.
But, even the simpler, non-buffered version may be plenty fast.
Here's the test code for the buffered version:
const fs = require('fs');
const { Bench } = require('../../Github/measure');
const { addCommas } = require("../../Github/str-utils");
const lineData = Buffer.from("012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678\n", 'utf-8');
let stream = fs.createWriteStream("D:\\Temp\\temp.txt");
stream.on('open', function() {
let linesRemaining = 1_000_000;
let b = new Bench();
let bytes = 0;
let cache = [];
let cacheTotal = 0;
const maxBuffered = 4 * 1024;
stream.myWrite = function(data, callback) {
if (callback) {
cache.push(data);
return stream.write(Buffer.concat(cache), callback);
} else {
cache.push(data);
cacheTotal += data.length;
if (cacheTotal >= maxBuffered) {
let ready = stream.write(Buffer.concat(cache));
cache.length = 0;
cacheTotal = 0;
return ready;
} else {
return true;
}
}
}
function write() {
do {
linesRemaining--;
let readyMore;
bytes += lineData.length;
if (linesRemaining === 0) {
readyForMore = stream.myWrite(lineData, done);
} else {
readyForMore = stream.myWrite(lineData);
}
} while (linesRemaining > 0 && readyForMore);
if (linesRemaining > 0) {
stream.once('drain', write);
}
}
function done() {
b.markEnd();
console.log(`Time to write ${addCommas(bytes)} bytes: ${b.formatSec(3)}`);
console.log(`bytes/sec = ${addCommas((bytes/b.sec).toFixed(0))}`);
console.log(`MB/sec = ${addCommas(((bytes/(1024 * 1024))/b.sec).toFixed(1))}`);
stream.end();
}
b.markBegin();
write();
});
This code uses a couple of my local libraries for measuring the time and formatting the output. If you want to run this yourself, you can substitute your own logic for those.

Node.js memory leak when reading and writing large files

I am currently trying to implement SPIMI index construction method in Node and I have ran into an issue.
The code is the following:
let fs = require("fs");
let path = require("path");
module.exports = {
fileStream: function (dirPath, fileStream) {
return buildFileStream(dirPath, fileStream);
},
buildSpimi: function (fileStream, outDir) {
let invIndex = {};
let sortedInvIndex = {};
let fileNameCount = 1;
let outputTXT = "";
let entryCounter = 0;
let resString = "";
fileStream.forEach((filePath, fileIndex) => {
let data = fs.readFileSync(filePath).toString('utf-8');
data = data.toUpperCase().split(/[^a-zA-Z]/).filter(function (ch) { return ch.length != 0; });
data.forEach(token => {
//CHANGE THE SIZE IF NECESSARY (4e+?)
if (entryCounter > 100000) {
Object.keys(invIndex).sort().forEach((key) => {
sortedInvIndex[key] = invIndex[key];
});
outputTXT = outDir + "block" + fileNameCount;
for (let SItoken in sortedInvIndex) {
resString += SItoken + "," + sortedInvIndex[SItoken].toString();
};
fs.writeFile(outputTXT, resString, (err) => { if (err) console.log(error); });
resString = "";
entryCounter = 0;
sortedInvIndex = {};
invIndex = {};
console.log(outputTXT + " - written;");
fileNameCount++;
};
if (invIndex[token] == undefined) {
invIndex[token] = [];
entryCounter++;
};
if (!invIndex[token].includes(fileIndex)) {
invIndex[token].push(fileIndex);
entryCounter++;
};
});
});
Object.keys(invIndex).sort().forEach((key) => {
sortedInvIndex[key] = invIndex[key];
});
outputTXT = outDir + "block" + fileNameCount;
for (let SItoken in sortedInvIndex) {
resString += SItoken + "," + sortedInvIndex[SItoken].toString();
};
fs.writeFile(outputTXT, resString, (err) => { if (err) console.log(error); });
console.log(outputTXT + " - written;");
}
}
function buildFileStream(dirPath, fileStream) {
fileStream = fileStream || 0;
fs.readdirSync(dirPath).forEach(function (file) {
let filepath = path.join(dirPath, file);
let stat = fs.statSync(filepath);
if (stat.isDirectory()) {
fileStream = buildFileStream(filepath, fileStream);
} else {
fileStream.push(filepath);
}
});
return fileStream;
}
I am using the exported functions in a separate file:
let spimi = require("./spimi");
let outputDir = "/Users/me/Desktop/SPIMI_OUT/"
let inputDir = "/Users/me/Desktop/gutenberg/2/2";
fileStream = [];
let result = spimi.fileStream(inputDir, fileStream);
console.table(result)
console.log("Finished building the filestream");
let t0 = new Date();
spimi.buildSpimi(result, outputDir);
let t1 = new Date();
console.log(t1 - t0);
While this code kind of works when trying on relatively small volumes of data (I tested up to 1.5 GB), there is obviously a memory leak somewhere, as when monitoring the RAM usage I can see it going up as far as to 4-5 GB).
I spent quite a lot of time trying to figure out what might be the cause, but I still couldn't find the issue.
I would appreciate any hints on this!
Thanks!
Something to understand about the language and garbage collection in general is that this:
data = data.toUpperCase().split(/[^a-zA-Z]/).filter(...)
creates three additional copies of your data. First, an uppercase copy. Then, a split array copy. Then, a filtered copy of the split array.
So, at this point, you have four copies of your data all in memory. All, but the filtered array are now eligible for garbage collection when the GC gets a chance to run, but if this data was initially large, you're going to be using at least 3x-4x as much memory as the filesize (depending upon how many array items are removed in your .filter() operation).
None of this is a leak, but it's a very big peak memory usage which can be a problem.
A more memory efficient way to process large files is to process them as a stream (not read them all into memory at once). You read a small size chunk (say 1024 bytes), process it, read a chunk, process it while being careful about chunk boundaries. If your file naturally has line boundaries, there are already pre-built solutions for processing line by line. If not, you can create your own chunk processing mechanism. We would have to see a sample of your data to make more specific chunk processing suggestions.
As another point, if you end up with a lot of keys in invIndex, then this line of code starts to become inefficient and you're doing it in your loop:
Object.keys(invIndex).sort()
This takes your object and gets all the keys in a temporary array which you use only for the purposes of updating the sortedInvIndex which is yet another copy of your data. So, right there alone, this set of code makes three copies of all your keys and two copies of all the values. And, it does it every time through your loop. Again, lots of peak memory usage that the GC won't normally clean up until your function is done.
A redesign to the way you process this data could probably reduce the peak memory usage by a factor of 100x. For memory efficiency, you want only the initial data, the final data representation and then just a little more used for temporary transformations to over be in use at the same time. You don't want to EVER be processing all the data multiple times because each time you do that, it creates yet another entire copy of all the data that contributes to peak memory usage.
If you show what the data input looks like and what data structure you're trying to end up with, I could probably take a crack at a much more efficient implementation.
Mykhailo, adding on to what jfriend said, it's actually not a memory leak. It's working as intended.
Something to consider is that readFile buffers the entire file! This will cause the huge memory bloat. Better alternative is to implement fs.createReadStream() which will only buffer the part of the file you're currently reading. Unfortunately, implementing that solution may require a full rewrite of your code as it returns fs.ReadStream which won't behave the way you're currently handling files Checkout this link and read the bottom of the section to see what I'm referencing

Random last modified timestamps on an NFS file

While creating an application I encountered a strange behaviour. When reading file last modified timestamp I sometimes received random values. I came up with a theory about it, so I decided to perform an experiment.
I ran two threads on ubuntu # AWS EC2 server.
The first thread periodically (with random delays) creates and
deletes a file.
The second thread periodically (with random delays)
checks the file last modified timestamp.
Kotlin code:
fun main() {
Thread(::createAndDelete).start()
Thread(::measure).start()
}
fun createAndDelete() {
println("starting createAndDelete thread")
var i = 0
while(true) {
try {
Files.createFile(Paths.get("$workingDirectory/test"))
} catch (e: FileAlreadyExistsException) {
}
Thread.sleep(nextLong(10, 30)) // A random delay
try {
Files.delete(Paths.get("$workingDirectory/test"))
} catch (e: NoSuchFileException) {
}
if(++i % 100 == 0)
println("created and deleted $i times")
}
}
fun measure() {
println("starting measure thread")
var i = 0
while(true) {
try {
val time = Files.getLastModifiedTime(Paths.get("$workingDirectory/test")).toMillis()
if(!Files.exists(Paths.get("$workingDirectory/test")))
continue
val time2 = Files.getLastModifiedTime(Paths.get("$workingDirectory/test")).toMillis()
if(time != time2)
continue
val difference = abs(Date().time - time)
if (difference > 10000) { // This happens a lot when using AWS EFS
println("time = $time, difference = $difference")
}
} catch (e: NoSuchFileException) {
}
Thread.sleep(nextLong(10, 20)) // A random delay
if(++i % 100 == 0)
println("measured $i times")
}
}
The results are normal as long as I use local storage.
However, when I use a mounted external NFS (AWS EFS) volume, I sometimes get file last modified timestamps of random values. Even with the precaution of measuring twice the timestamp and checking for the file existence in the middle.
starting createAndDelete thread
starting measure thread
time = 747720714000, difference = 817721554205
time = 1167151114000, difference = 398291154309
time = 1485918218000, difference = 79524050386
time = 1636913162000, difference = 71470893579
time = 1771130890000, difference = 205688621545
time = 177360906000, difference = 1388081362585
time = 294801418000, difference = 1270640850615
time = 445796362000, difference = 1119645906648
time = 1217548298000, difference = 347893971133
measured 100 times
What is the exact reason for this behaviour?
Is there any way to make the last modified timestamp reading reliable?

Best way to execute parallel processing in Node.js

I'm trying to write a small node application that will search through and parse a large number of files on the file system.
In order to speed up the search, we are attempting to use some sort of map reduce. The plan would be the following simplified scenario:
Web request comes in with a search query
3 processes are started that each get assigned 1000 (different) files
once a process completes, it would 'return' it's results back to the main thread
once all processes complete, the main thread would continue by returning the combined result as a JSON result
The questions I have with this are:
Is this doable in Node?
What is the recommended way of doing it?
I've been fiddling, but come no further then following example using Process:
initiator:
function Worker() {
return child_process.fork("myProcess.js");
}
for(var i = 0; i < require('os').cpus().length; i++){
var process = new Worker();
process.send(workItems.slice(i * itemsPerProcess, (i+1) * itemsPerProcess));
}
myProcess.js
process.on('message', function(msg) {
var valuesToReturn = [];
// Do file reading here
//How would I return valuesToReturn?
process.exit(0);
}
Few sidenotes:
I'm aware the number of processes should be dependent of the number of CPU's on the server
I'm also aware of speed restrictions in a file system. Consider it a proof of concept before we move this to a database or Lucene instance :-)
Should be doable. As a simple example:
// parent.js
var child_process = require('child_process');
var numchild = require('os').cpus().length;
var done = 0;
for (var i = 0; i < numchild; i++){
var child = child_process.fork('./child');
child.send((i + 1) * 1000);
child.on('message', function(message) {
console.log('[parent] received message from child:', message);
done++;
if (done === numchild) {
console.log('[parent] received all results');
...
}
});
}
// child.js
process.on('message', function(message) {
console.log('[child] received message from server:', message);
setTimeout(function() {
process.send({
child : process.pid,
result : message + 1
});
process.disconnect();
}, (0.5 + Math.random()) * 5000);
});
So the parent process spawns an X number of child processes and passes them a message. It also installs an event handler to listen for any messages sent back from the child (with the result, for instance).
The child process waits for messages from the parent, and starts processing (in this case, it just starts a timer with a random timeout to simulate some work being done). Once it's done, it sends the result back to the parent process and uses process.disconnect() to disconnect itself from the parent (basically stopping the child process).
The parent process keeps track of the number of child processes started, and the number of them that have sent back a result. When those numbers are equal, the parent received all results from the child processes so it can combine all results and return the JSON result.
For a distributed problem like this, I've used zmq and it has worked really well. I'll give you a similar problem that I ran into, and attempted to solve via processes (but failed.) and then turned towards zmq.
Using bcrypt, or an expensive hashing algorith, is wise, but it blocks the node process for around 0.5 seconds. We had to offload this to a different server, and as a quick fix, I used essentially exactly what you did. Run a child process and send messages to it and get it to
respond. The only issue we found is for whatever reason our child process would pin an entire core when it was doing absolutely no work.(I still haven't figured out why this happened, we ran a trace and it appeared that epoll was failing on stdout/stdin streams. It would also only happen on our Linux boxes and would work fine on OSX.)
edit:
The pinning of the core was fixed in https://github.com/joyent/libuv/commit/12210fe and was related to https://github.com/joyent/node/issues/5504, so if you run into the issue and you're using centos + kernel v2.6.32: update node, or update your kernel!
Regardless of the issues I had with child_process.fork(), here's a nifty pattern I always use
client:
var child_process = require('child_process');
function FileParser() {
this.__callbackById = [];
this.__callbackIdIncrement = 0;
this.__process = child_process.fork('./child');
this.__process.on('message', this.handleMessage.bind(this));
}
FileParser.prototype.handleMessage = function handleMessage(message) {
var error = message.error;
var result = message.result;
var callbackId = message.callbackId;
var callback = this.__callbackById[callbackId];
if (! callback) {
return;
}
callback(error, result);
delete this.__callbackById[callbackId];
};
FileParser.prototype.parse = function parse(data, callback) {
this.__callbackIdIncrement = (this.__callbackIdIncrement + 1) % 10000000;
this.__callbackById[this.__callbackIdIncrement] = callback;
this.__process.send({
data: data, // optionally you could pass in the path of the file, and open it in the child process.
callbackId: this.__callbackIdIncrement
});
};
module.exports = FileParser;
child process:
process.on('message', function(message) {
var callbackId = message.callbackId;
var data = message.data;
function respond(error, response) {
process.send({
callbackId: callbackId,
error: error,
result: response
});
}
// parse data..
respond(undefined, "computed data");
});
We also need a pattern to synchronize the different processes, when each process finishes its task, it will respond to us, and we'll increment a count for each process that finishes, and then call the callback of the Semaphore when we've hit the count we want.
function Semaphore(wait, callback) {
this.callback = callback;
this.wait = wait;
this.counted = 0;
}
Semaphore.prototype.signal = function signal() {
this.counted++;
if (this.counted >= this.wait) {
this.callback();
}
}
module.exports = Semaphore;
here's a use case that ties all the above patterns together:
var FileParser = require('./FileParser');
var Semaphore = require('./Semaphore');
var arrFileParsers = [];
for(var i = 0; i < require('os').cpus().length; i++){
var fileParser = new FileParser();
arrFileParsers.push(fileParser);
}
function getFiles() {
return ["file", "file"];
}
var arrResults = [];
function onAllFilesParsed() {
console.log('all results completed', JSON.stringify(arrResults));
}
var lock = new Semaphore(arrFileParsers.length, onAllFilesParsed);
arrFileParsers.forEach(function(fileParser) {
var arrFiles = getFiles(); // you need to decide how to split the files into 1k chunks
fileParser.parse(arrFiles, function (error, result) {
arrResults.push(result);
lock.signal();
});
});
Eventually I used http://zguide.zeromq.org/page:all#The-Load-Balancing-Pattern, where the client was using the nodejs zmq client, and the workers/broker were written in C. This allowed us to scale this across multiple machines, instead of just a local machine with sub processes.

How to improve throughput with FileStream in a single-threaded application

I am trying to get top I/O performance in a data streaming application with eight SSDs in RAID-5 (each SSD advertises and delivers 500 MB/sec reads).
I create FileStream with 64KB buffer and read many blocks in a blocking fashion (pun not intended). Here's what I have now with 80GB in 20K files, no fragments:
Legacy blocking reads are at 1270 MB/sec with single thread, 1556 MB/sec with 6 threads.
What I noticed with single-thread is that a single core's worth of CPU time is spent in kernel (8.3% red in Process Explorer with 12 cores). With 6 threads, approximately 5x CPU time is spent in kernel (41% red in in Process Explorer with 12 cores).
I would really like to avoid complexity of a multi-threaded application in the I/O bound scenario.
Is it possible to achieve these transfer rates in a single-threaded application? That is, what would be a good way to reduce the amount of time in kernel mode?
How, if at all, would the new Async feature in C# help?
For comparison, ATTO disk benchmark shows 2500 MB/sec at these block sizes on this hardware and low CPU utilization. However, ATTO dataset size is mere 2GB.
Using LSI 9265-8i RAID controller, with 64k stripe size, 64k cluster size.
Here's a sketch of the code in use. I don't write production code this way, it's just a proof of concept.
volatile bool _somethingLeftToRead = false;
long _totalReadInSize = 0;
void ProcessReadThread(object obj)
{
TestThreadJob job = obj as TestThreadJob;
var dirInfo = new DirectoryInfo(job.InFilePath);
int chunk = job.DataBatchSize * 1024;
//var tile = new List<byte[]>();
var sw = new Stopwatch();
var allFiles = dirInfo.GetFiles();
var fileStreams = new List<FileStream>();
long totalSize = 0;
_totalReadInSize = 0;
foreach (var fileInfo in allFiles)
{
totalSize += fileInfo.Length;
var fileStream = new FileStream(fileInfo.FullName,
FileMode.Open, FileAccess.Read, FileShare.None, job.FileBufferSize * 1024);
fileStreams.Add(fileStream);
}
var partial = new byte[chunk];
var taskParam = new TaskParam(null, partial);
var tasks = new List<Task>();
int numTasks = (int)Math.Ceiling(fileStreams.Count * 1.0 / job.NumThreads);
sw.Start();
do
{
_somethingLeftToRead = false;
for (int taskIndex = 0; taskIndex < numTasks; taskIndex++)
{
if (_threadCanceled)
break;
tasks.Clear();
for (int thread = 0; thread < job.NumThreads; thread++)
{
if (_threadCanceled)
break;
int fileIndex = taskIndex * job.NumThreads + thread;
if (fileIndex >= fileStreams.Count)
break;
var fileStream = fileStreams[fileIndex];
taskParam.File = fileStream;
if (job.NumThreads == 1)
ProcessFileRead(taskParam);
else
tasks.Add(Task.Factory.StartNew(ProcessFileRead, taskParam));
//tile.Add(partial);
}
if (_threadCanceled)
break;
if (job.NumThreads > 1)
Task.WaitAll(tasks.ToArray());
}
//tile = new List<byte[]>();
}
while (_somethingLeftToRead);
sw.Stop();
foreach (var fileStream in fileStreams)
fileStream.Close();
totalSize = (long)Math.Round(totalSize / 1024.0 / 1024.0);
UpdateUIRead(false, totalSize, sw.Elapsed.TotalSeconds);
}
void ProcessFileRead(object taskParam)
{
TaskParam param = taskParam as TaskParam;
int readInSize;
if ((readInSize = param.File.Read(param.Bytes, 0, param.Bytes.Length)) != 0)
{
_somethingLeftToRead = true;
_totalReadInSize += readInSize;
}
}
There's a number of issues here.
First, I see that you are not trying to use non-cached I/O. This means that the system will try to cache your data in RAM and service reads out of it. SO you get an extra data transfer out of things. Do non-cached I/O.
Next, you appear to be creating/destroying threads inside a loop. This is inefficient.
Lastly, you need to investigate the alignment of the data. Crossing read-block boundaries can add to your costs.
I would advocate using non-cached, async I/O. I'm not sure how to accomplish this in C# (but it should be easy).
EDITED: Also, why are you using RAID 5? Unless the data is write-once, this is likely to have hideous performance on SSDs. Notably, the erase block size is typically 512K, meaning when you write something smaller, the SSD will need to read the 512K in its firmware, change the data, and then write it somewhere else. You might want to make the stripe size = size of erase block. Also, you should check to see what the alignment of the writes are as well.

Resources