How to improve throughput with FileStream in a single-threaded application - windows-server-2008-r2

I am trying to get top I/O performance in a data streaming application with eight SSDs in RAID-5 (each SSD advertises and delivers 500 MB/sec reads).
I create a FileStream with a 64 KB buffer and read many blocks in a blocking fashion (pun not intended). Here's what I have now with 80 GB in 20K files, no fragmentation:
Legacy blocking reads run at 1270 MB/sec with a single thread and 1556 MB/sec with 6 threads.
What I noticed with a single thread is that a single core's worth of CPU time is spent in the kernel (8.3% red in Process Explorer on 12 cores). With 6 threads, approximately 5x as much CPU time is spent in the kernel (41% red in Process Explorer on 12 cores).
I would really like to avoid the complexity of a multi-threaded application in this I/O-bound scenario.
Is it possible to achieve these transfer rates in a single-threaded application? That is, what would be a good way to reduce the amount of time in kernel mode?
How, if at all, would the new Async feature in C# help?
For comparison, the ATTO disk benchmark shows 2500 MB/sec at these block sizes on this hardware, with low CPU utilization. However, the ATTO dataset size is a mere 2 GB.
I'm using an LSI 9265-8i RAID controller with a 64 KB stripe size and a 64 KB cluster size.
Here's a sketch of the code in use. I don't write production code this way; it's just a proof of concept.
volatile bool _somethingLeftToRead = false;
long _totalReadInSize = 0;

void ProcessReadThread(object obj)
{
    TestThreadJob job = obj as TestThreadJob;
    var dirInfo = new DirectoryInfo(job.InFilePath);
    int chunk = job.DataBatchSize * 1024;
    //var tile = new List<byte[]>();
    var sw = new Stopwatch();
    var allFiles = dirInfo.GetFiles();
    var fileStreams = new List<FileStream>();
    long totalSize = 0;
    _totalReadInSize = 0;

    foreach (var fileInfo in allFiles)
    {
        totalSize += fileInfo.Length;
        var fileStream = new FileStream(fileInfo.FullName,
            FileMode.Open, FileAccess.Read, FileShare.None, job.FileBufferSize * 1024);
        fileStreams.Add(fileStream);
    }

    // NOTE: the single buffer and single TaskParam are shared by every task below;
    // that is only safe in the single-threaded (NumThreads == 1) path.
    var partial = new byte[chunk];
    var taskParam = new TaskParam(null, partial);
    var tasks = new List<Task>();
    int numTasks = (int)Math.Ceiling(fileStreams.Count * 1.0 / job.NumThreads);
    sw.Start();
    do
    {
        _somethingLeftToRead = false;
        for (int taskIndex = 0; taskIndex < numTasks; taskIndex++)
        {
            if (_threadCanceled)
                break;
            tasks.Clear();
            for (int thread = 0; thread < job.NumThreads; thread++)
            {
                if (_threadCanceled)
                    break;
                int fileIndex = taskIndex * job.NumThreads + thread;
                if (fileIndex >= fileStreams.Count)
                    break;
                var fileStream = fileStreams[fileIndex];
                taskParam.File = fileStream;
                if (job.NumThreads == 1)
                    ProcessFileRead(taskParam);
                else
                    tasks.Add(Task.Factory.StartNew(ProcessFileRead, taskParam));
                //tile.Add(partial);
            }
            if (_threadCanceled)
                break;
            if (job.NumThreads > 1)
                Task.WaitAll(tasks.ToArray());
        }
        //tile = new List<byte[]>();
    }
    while (_somethingLeftToRead);
    sw.Stop();

    foreach (var fileStream in fileStreams)
        fileStream.Close();

    totalSize = (long)Math.Round(totalSize / 1024.0 / 1024.0);
    UpdateUIRead(false, totalSize, sw.Elapsed.TotalSeconds);
}

void ProcessFileRead(object taskParam)
{
    TaskParam param = taskParam as TaskParam;
    int readInSize;
    if ((readInSize = param.File.Read(param.Bytes, 0, param.Bytes.Length)) != 0)
    {
        _somethingLeftToRead = true;
        Interlocked.Add(ref _totalReadInSize, readInSize); // a plain += would race across tasks
    }
}

There are a number of issues here.
First, I see that you are not using non-cached I/O. That means the system will try to cache your data in RAM and service reads out of that cache, so you pay for an extra data transfer. Do non-cached I/O.
Next, you appear to be creating/destroying threads inside a loop. This is inefficient.
Lastly, you need to investigate the alignment of the data. Crossing read-block boundaries can add to your costs.
I would advocate using non-cached, async I/O. I'm not sure how to accomplish this in C# (but it should be easy).
EDITED: Also, why are you using RAID 5? Unless the data is write-once, this is likely to have hideous performance on SSDs. Notably, the erase block size is typically 512K, meaning that when you write something smaller, the SSD has to read the 512K in its firmware, change the data, and then write it somewhere else. You might want to make the stripe size equal to the erase block size. You should also check what the alignment of the writes is.

Related

Writing large amounts of streamed data frequently with writestream

I'm trying to write a live websocket feed line-by-line to a file - I think for this I should be using a writable stream.
My problem here is that the data received is in the region of 10 lines per second, which quickly fills the buffer.
I understand that when using streams from sources you control, you would normally add some sort of backpressure logic here, but what should I do if I do not control the source? Should I be batching up the writes and writing, say, 500 lines at a time instead of line by line, or should I use some other way to save this data?
I'm wondering how big the lines are. 10 lines per second sounds trivial to stream to a disk unless the lines are gigantic or the disk is really slow. Ultimately, if you have no ability to apply backpressure logic, the source can overwhelm you if it goes fast or your storage goes slow, and you'd have to decide how much you can reasonably buffer and eventually just drop some of the data if you get behind.
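As a minimal sketch of that last point (my own illustration, not code from the original answer; MAX_BUFFERED_LINES and the drop-oldest policy are arbitrary choices), you could cap the in-memory backlog and shed the oldest lines whenever you fall behind:
const fs = require('fs');

const stream = fs.createWriteStream('feed.log', { flags: 'a' });   // hypothetical output file
const MAX_BUFFERED_LINES = 10_000;   // arbitrary cap on the backlog
let backlog = [];
let waitingForDrain = false;

function onLine(line) {              // call this for every incoming websocket line
    backlog.push(line);
    if (backlog.length > MAX_BUFFERED_LINES) {
        backlog.shift();             // we're behind; drop the oldest line
    }
    flush();
}

function flush() {
    if (waitingForDrain) return;
    while (backlog.length > 0) {
        const ok = stream.write(backlog.shift());
        if (!ok) {                   // stream's internal buffer is full; wait for 'drain'
            waitingForDrain = true;
            stream.once('drain', () => {
                waitingForDrain = false;
                flush();
            });
            return;
        }
    }
}
At 10 smallish lines per second you should never come close to that cap; it only matters if the disk stalls or the feed bursts.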
But, you should be able to write a lot of data. On my regular hard disk (using the generic stream code below with no additional buffering), I can do sequential writes of 100,000,000 bytes at a speed of 55 MBytes/sec.
So, if you have 10 lines per second coming in, as long as the lines averaged below roughly 5,000,000 bytes each, my hard drive could keep up.
Here's the code I used to test it:
const fs = require('fs');
const { Bench } = require('../../Github/measure');
const { addCommas } = require("../../Github/str-utils");

const lineData = Buffer.from("012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678\n", 'utf-8');

let stream = fs.createWriteStream("D:\\Temp\\temp.txt");
stream.on('open', function() {
    let linesRemaining = 1_000_000;
    let b = new Bench();
    let bytes = 0;

    function write() {
        let readyForMore;
        do {
            linesRemaining--;
            bytes += lineData.length;
            if (linesRemaining === 0) {
                readyForMore = stream.write(lineData, done);
            } else {
                readyForMore = stream.write(lineData);
            }
        } while (linesRemaining > 0 && readyForMore);
        if (linesRemaining > 0) {
            stream.once('drain', write);
        }
    }

    function done() {
        b.markEnd();
        console.log(`Time to write ${addCommas(bytes)} bytes: ${b.formatSec(3)}`);
        console.log(`bytes/sec = ${addCommas((bytes/b.sec).toFixed(0))}`);
        console.log(`MB/sec = ${addCommas(((bytes/(1024 * 1024))/b.sec).toFixed(1))}`);
        stream.end();
    }

    b.markBegin();
    write();
});
Theoretically, it is more efficient for your disk to do fewer, larger writes than tons of small writes. In practice, because of the way the writeStream works, as soon as an inefficient write gets slow, the next write gets buffered and it kind of self-corrects. If you were really trying to minimize the load on the disk, you would buffer writes until you had at least something like 4k to write. The issue is that each write potentially has to allocate some bytes to the file (which involves writing to a table on the disk), then seek to where the bytes should be written on the disk, then write the bytes. Fewer, larger writes (up to some limit that depends upon the internal implementation) reduce the number of times the file-allocation overhead has to be paid.
So, I ran a test. I modified the above code (shown below) to buffer into 4k chunks and write them out in 4k chunks. The write throughput increased from 55 MBytes/sec to 284.2 MBytes/sec.
So, the theory holds true that you will write faster if you buffer into larger chunks.
But, even the simpler, non-buffered version may be plenty fast.
Here's the test code for the buffered version:
const fs = require('fs');
const { Bench } = require('../../Github/measure');
const { addCommas } = require("../../Github/str-utils");

const lineData = Buffer.from("012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678\n", 'utf-8');

let stream = fs.createWriteStream("D:\\Temp\\temp.txt");
stream.on('open', function() {
    let linesRemaining = 1_000_000;
    let b = new Bench();
    let bytes = 0;

    let cache = [];
    let cacheTotal = 0;
    const maxBuffered = 4 * 1024;

    // Collect small writes and flush them to the stream in ~4k chunks.
    stream.myWrite = function(data, callback) {
        if (callback) {
            // Final write: flush whatever is cached along with this last piece.
            cache.push(data);
            return stream.write(Buffer.concat(cache), callback);
        } else {
            cache.push(data);
            cacheTotal += data.length;
            if (cacheTotal >= maxBuffered) {
                let ready = stream.write(Buffer.concat(cache));
                cache.length = 0;
                cacheTotal = 0;
                return ready;
            } else {
                return true;
            }
        }
    };

    function write() {
        let readyForMore;
        do {
            linesRemaining--;
            bytes += lineData.length;
            if (linesRemaining === 0) {
                readyForMore = stream.myWrite(lineData, done);
            } else {
                readyForMore = stream.myWrite(lineData);
            }
        } while (linesRemaining > 0 && readyForMore);
        if (linesRemaining > 0) {
            stream.once('drain', write);
        }
    }

    function done() {
        b.markEnd();
        console.log(`Time to write ${addCommas(bytes)} bytes: ${b.formatSec(3)}`);
        console.log(`bytes/sec = ${addCommas((bytes/b.sec).toFixed(0))}`);
        console.log(`MB/sec = ${addCommas(((bytes/(1024 * 1024))/b.sec).toFixed(1))}`);
        stream.end();
    }

    b.markBegin();
    write();
});
This code uses a couple of my local libraries for measuring the time and formatting the output. If you want to run this yourself, you can substitute your own logic for those.

How to mmap() a large file without risking the OOM killer?

I've got an embedded ARM Linux box with a limited amount of RAM (512MB) and no swap space, on which I need to create and then manipulate a fairly large file (~200MB). Loading the entire file into RAM, modifying the contents in-RAM, and then writing it back out again would sometimes invoke the OOM-killer, which I want to avoid.
My idea to get around this was to use mmap() to map this file into my process's virtual address space; that way, reads and writes to the mapped memory-area would go out to the local flash-filesystem instead, and the OOM-killer would be avoided since if memory got low, Linux could just flush some of the mmap()'d memory pages back to disk to free up some RAM. (That might make my program slow, but slow is okay for this use-case)
However, even with the mmap() call, I'm still occasionally seeing processes get killed by the OOM-killer while performing the above operation.
My question is, was I too optimistic about how Linux would behave in the presence of both a large mmap() and limited RAM? (i.e. does mmap()-ing a 200MB file and then reading/writing to the mmap()'d memory still require 200MB of available RAM to accomplish reliably?) Or is mmap() clever enough to page out mmap'd pages when memory is low, and am I just doing something wrong in how I use it?
FWIW my code to do the mapping is here:
void FixedSizeDataBuffer :: TryMapToFile(const std::string & filePath, bool createIfNotPresent, bool autoDelete)
{
   const int fd = open(filePath.c_str(), (createIfNotPresent?(O_CREAT|O_EXCL|O_RDWR):O_RDONLY)|O_CLOEXEC, S_IRUSR|(createIfNotPresent?S_IWUSR:0));
   if (fd >= 0)
   {
      if ((autoDelete == false)||(unlink(filePath.c_str()) == 0)) // so the file will automatically go away when we're done with it, even if we crash
      {
         const int fallocRet = createIfNotPresent ? posix_fallocate(fd, 0, _numBytes) : 0;
         if (fallocRet == 0)
         {
            void * mappedArea = mmap(NULL, _numBytes, PROT_READ|(createIfNotPresent?PROT_WRITE:0), MAP_SHARED, fd, 0);
            if (mappedArea != MAP_FAILED) // note: mmap() signals failure with MAP_FAILED, not NULL
            {
               printf("FixedSizeDataBuffer %p: Using backing-store file [%s] for %zu bytes of data\n", this, filePath.c_str(), _numBytes);
               _buffer = (uint8_t *) mappedArea;
               _isMappedToFile = true;
            }
            else printf("FixedSizeDataBuffer %p: Unable to mmap backing-store file [%s] to %zu bytes (%s)\n", this, filePath.c_str(), _numBytes, strerror(errno));
         }
         else printf("FixedSizeDataBuffer %p: Unable to pad backing-store file [%s] out to %zu bytes (%s)\n", this, filePath.c_str(), _numBytes, strerror(fallocRet));
      }
      else printf("FixedSizeDataBuffer %p: Unable to unlink backing-store file [%s] (%s)\n", this, filePath.c_str(), strerror(errno));
      close(fd); // no need to hold this anymore AFAIK, the memory-mapping itself will keep the backing store around
   }
   else printf("FixedSizeDataBuffer %p: Unable to create backing-store file [%s] (%s)\n", this, filePath.c_str(), strerror(errno));
}
I can rewrite this code to just use plain-old-file-I/O if I have to, but it would be nice if mmap() could do the job (or if not, I'd at least like to understand why not).
After much further experimentation, I determined that the OOM-killer was visiting me not because the system had run out of RAM, but because RAM would occasionally become sufficiently fragmented that the kernel couldn't find a set of physically-contiguous RAM pages large enough to meet its immediate needs. When this happened, the kernel would invoke the OOM-killer to free up some RAM to avoid a kernel panic, which is all well and good for the kernel but not so great when it kills a process that the user was relying on to get his work done. :/
After trying and failing to find a way to convince Linux not to do that (I think enabling a swap partition would avoid the OOM-killer, but doing that is not an option for me on these particular machines), I came up with a hack work-around; I added some code to my program that periodically checks the amount of memory fragmentation reported by the Linux kernel, and if the memory fragmentation starts looking too severe, preemptively orders a memory-defragmentation to occur, so that the OOM-killer will (hopefully) not become necessary. If the memory-defragmentation pass doesn't appear to be improving matters any, then after 20 consecutive attempts, we also drop the VM Page cache as a way to free up contiguous physical RAM. This is all very ugly, but not as ugly as getting a phone call at 3AM from a user who wants to know why their server program just crashed. :/
The gist of the work-around implementation is below; note that DefragTick(Milliseconds) is expected to be called periodically (preferably once per second).
// Returns how safe we are from the fragmentation-based-OOM-killer visits.
// Returns -1 if we can't read the data for some reason.
static int GetFragmentationSafetyLevel()
{
   int ret = -1;
   FILE * fpIn = fopen("/sys/kernel/debug/extfrag/extfrag_index", "r");
   if (fpIn)
   {
      char buf[512];
      while(fgets(buf, sizeof(buf), fpIn))
      {
         const char * dma = (strncmp(buf, "Node 0, zone", 12) == 0) ? strstr(buf+12, "DMA") : NULL;
         if (dma)
         {
            // dma= e.g.: "DMA -1.000 -1.000 -1.000 -1.000 0.852 0.926 0.963 0.982 0.991 0.996 0.998 0.999 1.000 1.000"
            const char * s = dma+4; // skip past "DMA "
            ret = 0; // ret now becomes a count of "safe values in a row"; a safe value is any number less than 0.500, per me
            while((s)&&((*s == '-')||(*s == '.')||(isdigit(*s))))
            {
               const float fVal = atof(s);
               if (fVal < 0.500f)
               {
                  ret++;
                  // Advance (s) to the next number in the list
                  const char * space = strchr(s, ' '); // to the next space
                  s = space ? (space+1) : NULL;
               }
               else break; // oops, a dangerous value! Run away!
            }
         }
      }
      fclose(fpIn);
   }
   return ret;
}

// should be called periodically (e.g. once per second)
void DefragTick(Milliseconds current_time_in_milliseconds)
{
   if ((current_time_in_milliseconds-m_last_fragmentation_check_time) >= Milliseconds(1000))
   {
      m_last_fragmentation_check_time = current_time_in_milliseconds;

      const int fragmentationSafetyLevel = GetFragmentationSafetyLevel();
      if (fragmentationSafetyLevel < 9)
      {
         m_defrag_pending = true; // trouble seems to start at level 8
         m_fragged_count++;       // note that we still seem fragmented
      }
      else m_fragged_count = 0;   // we're in the clear!

      if ((m_defrag_pending)&&((current_time_in_milliseconds-m_last_defrag_time) >= Milliseconds(5000)))
      {
         if (m_fragged_count >= 20)
         {
            // FogBugz #17882
            FILE * fpOut = fopen("/proc/sys/vm/drop_caches", "w");
            if (fpOut)
            {
               const char * warningText = "Persistent Memory fragmentation detected -- dropping filesystem PageCache to improve defragmentation.";
               printf("%s (fragged count is %i)\n", warningText, m_fragged_count);
               fprintf(fpOut, "3");
               fclose(fpOut);
               m_fragged_count = 0;
            }
            else
            {
               const char * errorText = "Couldn't open /proc/sys/vm/drop_caches to drop filesystem PageCache!";
               printf("%s\n", errorText);
            }
         }

         FILE * fpOut = fopen("/proc/sys/vm/compact_memory", "w");
         if (fpOut)
         {
            const char * warningText = "Memory fragmentation detected -- ordering a defragmentation to avoid the OOM-killer.";
            printf("%s (fragged count is %i)\n", warningText, m_fragged_count);
            fprintf(fpOut, "1");
            fclose(fpOut);
            m_defrag_pending = false;
            m_last_defrag_time = current_time_in_milliseconds;
         }
         else
         {
            const char * errorText = "Couldn't open /proc/sys/vm/compact_memory to trigger a memory-defragmentation!";
            printf("%s\n", errorText);
         }
      }
   }
}

Nodejs for loop - stream runs out of memory

I'm generating a CSV file that I'd like to save.
It's a bit large, but the code is very simple.
I use streams so as to prevent out-of-memory errors, but it's happening regardless.
Any tips?
const fs = require('fs');
var noOfRows = 2000000000;
var stream = fs.createWriteStream('myFile.csv', {flags: 'a'});
for (var i = 0; i <= noOfRows; i++) {
    var col = '';
    col += i;
    stream.write(col);
}
Add a 'drain' event listener.
const fs = require("fs");
var noOfRows = 2000000000;
var stream = fs.createWriteStream("myFile.csv", { flags: "a" });
var i = 0;

function write() {
    var ok = true;
    do {
        var data = i + "";
        if (i === noOfRows) {
            // last time!
            stream.write(data);
        } else {
            // see if we should continue, or wait
            // don't pass the callback, because we're not done yet.
            ok = stream.write(data);
        }
        i++;
    } while (i <= noOfRows && ok);
    if (i < noOfRows) {
        // had to stop early!
        // write some more once it drains
        stream.once("drain", write);
    }
}
write();
Also, noOfRows is so big that the resulting .csv file may well exceed your available disk space.
Your .csv file has too much data to be kept in the stream. Streams basically use your computer's physical memory, so they can store only up to the amount of free physical memory. E.g. if your computer has 8 GB of RAM of which, let's say, 6 GB is free, then the stream can't store more than 6 GB. You can break the data up into chunks and then merge it back at the destination later.
There is no hard size limit on .csv files. The limit in any scenario is the file system / HDD size.
The maximum file size of any file on a filesystem is determined by the filesystem itself - not by the file type or filename suffix.
To prevent out-of-memory errors, check your file size limit as per your filesystem partition.

There's no sleep()/wait for mutex in node.js, so how to deal with large IO tasks?

I have a large array of filenames I need to check, but I also need to respond to network clients. The easiest way is to perform:
for (var i = 0; i < array.length; i++) {
    fs.readFile(array[i], function(err, data) {...});
}
but the array can be of any length, say 100,000, so it's not a good idea to perform 100,000 reads at once; on the other hand, doing fs.readFileSync() can take too long. Also, launching the next fs.readFile() in the callback, like this:
var Idx = 0;
function checkFile() {
    fs.readFile(array[Idx], function (err, data) {
        Idx++;
        if (Idx < array.length) {
            checkFile();
        } else {
            Idx = 0;
            setTimeout(checkFile, 10000); // start checking files again in ten seconds
        }
    });
}
is also not the best option, because array[] gets constantly updated by network clients - some items are deleted, new ones are added, and so on.
What is the best way to accomplish such a task in node.js?
You should stick to your first solution (fs.readFile). For file I/O, node.js uses a thread pool. The reason is that most unix kernels don't provide efficient asynchronous APIs for the file system. Even if you start 10,000 reads concurrently, only a few reads will actually run at a time and the rest will wait in a queue.
In order to make this answer more interesting, I browsed through node's code again to make sure that things hadn't changed.
Long story short, file I/O uses blocking system calls and is handled by a thread pool with at most 4 concurrent threads.
The important code is in libeio, which is abstracted by libuv. All I/O code is wrapped by macros which queue requests. For example:
eio_req *eio_read (int fd, void *buf, size_t length, off_t offset, int pri, eio_cb cb, void *data, eio_channel *channel)
{
REQ (EIO_READ); req->int1 = fd; req->offs = offset; req->size = length; req->ptr2 = buf; SEND;
}
REQ prepares the request and SEND queues it. We eventually end up in etp_maybe_start_thread:
static unsigned int started, idle, wanted = 4;
(...)
static void
etp_maybe_start_thread (void)
{
  if (ecb_expect_true (etp_nthreads () >= wanted))
    return;
  (...)
The queue keeps 4 threads running to process the requests. When our read request is finally executed, eio simply uses the blocking read from unistd.h:
case EIO_READ:
    ALLOC (req->size);
    req->result = req->offs >= 0
        ? pread (req->int1, req->ptr2, req->size, req->offs)
        : read (req->int1, req->ptr2, req->size);
    break;
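That said, if you'd rather not have 100,000 requests queued at once (each completed fs.readFile holds an entire file's contents in memory until its callback runs), a small concurrency limiter is easy to write yourself. This is just a sketch of the idea, not part of node or the code above; checkFiles and maxInFlight are names I made up:
const fs = require('fs');

// Process every filename in `array`, keeping at most `maxInFlight`
// fs.readFile calls outstanding at any moment.
function checkFiles(array, maxInFlight, onFile, onDone) {
    let next = 0;     // index of the next file to start
    let inFlight = 0; // number of reads currently outstanding

    if (array.length === 0) {
        onDone();
        return;
    }

    function startMore() {
        while (inFlight < maxInFlight && next < array.length) {
            const name = array[next++];
            inFlight++;
            fs.readFile(name, (err, data) => {
                inFlight--;
                onFile(name, err, data);
                if (next >= array.length && inFlight === 0) {
                    onDone();       // this pass is finished
                } else {
                    startMore();    // keep the pipeline full
                }
            });
        }
    }
    startMore();
}
Since array[] keeps changing, you would take a snapshot per pass, e.g. checkFiles(array.slice(), 8, handleFile, () => setTimeout(rescan, 10000)).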

Why do arrays take less memory than buffers in node.js?

I'm deciding on the best way to store a lot of timeseries data in memory and I made a simple benchmark to compare buffers vs simple arrays:
var buffers = {};
var started = Date.now();
var before = process.memoryUsage().heapUsed;
for (var i = 0; i < 100000; i++) {
    buffers[i] = new Buffer(4);
    buffers[i].writeFloatLE(i + 1.2, 0);
    // buffers[i] = [i+1.2];
}
console.log(Date.now() - started, 'ms');
console.log((process.memoryUsage().heapUsed - before) / 1024 / 1024);
And the results are as follows:
Arrays:
22 'ms'
8.391242980957031
Buffers:
123 'ms'
9.9490966796875
So according to this benchmark arrays are 5+ times faster and take 18% less memory. Is this correct? I certainly expected buffers to take less memory.
There's an overhead (in time and space) for each Buffer you create.
I expect you'll get better space (and maybe time) performance if you compare
buffers[i] = new Buffer(4 * 1000);
for (var j = 0; j < 1000; ++j)
{
    buffers[i].writeFloatLE(i + j + 1.2, 4 * j);
}
with
buffers[i] = [];
for (var j = 0; j < 1000; ++j)
{
    buffers[i].push(i + j + 1.2);
}
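Here's a minimal, self-contained version of that comparison (my own sketch, reusing the measuring approach from the question; the 100 slots x 1000 floats sizing is arbitrary, and Buffer.alloc is used in place of the deprecated new Buffer()):
// Compare one 4000-byte Buffer per slot vs. one 1000-element array per slot.
function bench(label, fill) {
    const store = {};
    const before = process.memoryUsage().heapUsed;
    const started = Date.now();
    for (let i = 0; i < 100; i++) {
        fill(store, i);
    }
    // Note: Buffer contents live outside the V8 heap, so heapUsed mostly
    // reflects per-object overhead rather than the raw float data.
    console.log(label, Date.now() - started, 'ms,',
        ((process.memoryUsage().heapUsed - before) / 1024 / 1024).toFixed(3), 'MB heap');
    return store; // keep a reference so nothing is collected before we measure
}

const asBuffers = bench('buffers:', (store, i) => {
    store[i] = Buffer.alloc(4 * 1000);
    for (let j = 0; j < 1000; ++j) {
        store[i].writeFloatLE(i + j + 1.2, 4 * j);
    }
});

const asArrays = bench('arrays:', (store, i) => {
    store[i] = [];
    for (let j = 0; j < 1000; ++j) {
        store[i].push(i + j + 1.2);
    }
});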
