Is it possible to verify data of multiple sequential writes just by checking the last n bytes written? - io

Just to be clear, my questions are language/OS agnostic (independent).
I am working on a program (support many OSs, currently written in Golang) that receive many chunks of data (like a stream of data chunks) and sequentially write it all down to a pre-specified position (pos >= 0) in a file. Only 1 process with 1 thread accessing the file. I use regular write function that use write system call (which one depends on the OS it runs) internally, not buffered IO.
Supposed that while my program was writing, suddenly system crashed (the most severe kind of crash: power failure).
When the system is turned back on, I need to verify how many chunks is completely written to HDD. (*)
The HDD that my program writes to is just today regular desktop or laptop HDD (not some fancy one (with battery-backed) found in some high-end servers).
Supposed that bit corruption during transfer to and reading from HDD is very highly unlikely to happen and is negligible.
My questions are:
Do I need to do checksum on all of the written chunks to verify (*)?
Or do I just need to check and confirm that the nth chunk is correct and assume all the chunks before it (0 -> n-1) is correct too?
2.1 If 2. is enough, does that means sequential writes order is guaranteed to be preserved by FS of any OS (random writes can still be reordered though)?
Is my way of doing recovery rely on the same principle as append-only log file as seen in many crash-proof databases?

I suspect your best bet is to study up on Cyclic Redundancy Checks (CRC).
As I understand it a CRC would allow you to verify that what was intended to be written actually was.
I also suggest that worrying about the cause of any errors is not very worthwhile (transmission errors vs. errors for any other reason such as a power failure).
Hope this helps.

Related

external multithreading sort

I need to implement external multithreading sort. I dont't have experience in multithreading programming and now I'm not sure if my algorithm is good anoth also I don't know how to complete it. My idea is:
Thread reads next block of data from input file
Sort it using standart algorith(std::sort)
Writes it to another file
After this I have to merge such files. How should I do this?
If I wait untill input file will be entirely processed until merge
I recieve a lot of temporary files
If I try to merge file straight after sort, I can not come up with
an algorithm to avoid merging files with quite different sizes, which
will lead to O(N^2) difficulty.
Also I suppose this is a very common task, however I cannot find good prepared algoritm in the enternet. I would be very grateful for such a link especially for it's c++ implementation.
Well, the answer isn't that simple, and it actually depends on many factors, amongst them the number of items you wish to process, and the relative speed of your storage system and CPUs.
But the question is why to use multithreading at all here. Data too big to be held in memory? So many items that even a qsort algorithm can't sort fast enough? Take advantage of multiple processors or cores? Don't know.
I would suggest that you first write some test routines to measure the time needed to read and write the input file and the output files, as well as the CPU time needed for sorting. Please note that I/O is generally A LOT slower than CPU execution (actually they aren't even comparable), and I/O may not be efficient if you read data in parallel (there is one disk head which has to move in and out, so reads are in effect serialized - even if it's a digital drive it's still a device, with input and output channels). That is, the additional overhead of reading/writing temporary files may more than eliminate any benefit from multithreading. So I would say, first try making an algorithm that reads the whole file in memory, sorts it and writes it, and put in some time counters to check their relative speed. If I/O is some 30% of the total time (yes, that little!), it's definitely not worth, because with all that reading/merging/writing of temporary files, this will rise a lot more, so a solution processing the whole data at once would rather be preferable.
Concluding, don't see why use multithreading here, the only reason imo would be if data are actually delivered in blocks, but then again take into account my considerations above, about relative I/O-CPU speeds and the additional overhead of reading/writing the temporary files. And a hint, your file accessing must be very efficient, eg reading/writing in larger blocks using application buffers, not one by one (saves on system calls), otherwise this may have a detrimental effect if the file(s) are stored on a machine other than yours (eg a server).
Hope you find my suggestions useful.

linux 2.6.43, ext3, 10K RPM SAS disk, 2 sequential write(direct io) on different file acting like random write

I recently stall on this one problem:
"2 sequential write(direct io 4KB alignemnt block) on different file acting like random write, which yield poor write performance in 10K RPM SAS disk".
The thing confuse me most: I got batch of server, all equip with same kind of disk (raid 1 with 2 300GB 10K RPM disk), but response different.
several servers seams ok with this kind of write pattern, disk happy accepted up to 50+MB/s;
(same kernel version, same filesystem, with different lib (libc 2.4))
others not so much, 100 op/s seams reach the limit of underlying disk, which confirm the random write performance of disk;
((same kernel version, same filesystem, with different lib (libc 2.12)))
[NOTE: I check the "pwrite" code of different libc, which tell nothing but simple "syscall"]
I have managed to rule out the possibly:
1. software bug in my own program;
by a simple deamon(compile with no dynamic link), do sequcetial direct io write;
2. disk problem;
switch 2 different version of linux system on one test machine, which perform well on my direct io write pattern, and a couple of day after switch to old lib version, the bad random write;
I try to compare:
/sys/block/sda/queue/*, which may different in both way;
filefrag show nothing but two different file interleaved sequenctial grow physical block id;
there must be some kind of write strategy lead to this problem, but i don't know where to start:
different kernel setting ?, may be related to how ext3 allocate disk block ?
raid cache(write back) or disk cache write strategy?
or underlying disk strategy to mapping logical block into real physical block ?
really appreciate
THE ANS IS:
it's because of /sys/block/sda/queue/schedule setting:
MACHINE A: display schedule: cfq, but undlying, it's deadline;
MACHINE B: the schedule is consistent with cfq;
//=>
SINCE my server is db svr, deadline is my best option;

Writing to a remote file: When does write() really return?

I have a client node writing a file to a hard disk that is on another node (I am writing to a parallel fs actually).
What I want to understand is:
When I write() (or pwrite()), when exactly does the write call return?
I see three possibilities:
write returns immediately after queueing the I/O operation on the client side:
In this case, write can return before data has actually left the client node (If you are writing to a local hard drive, then the write call employs delayed writes, where data is simply queued up for writing. But does this also happen when you are writing to a remote hard disk?). I wrote a testcase in which I write a large matrix (1GByte) to file. Without fsync, it showed very high bandwidth values, whereas with fsync, results looked more realistic. So looks like it could be using delayed writes.
write returns after the data has been transferred to the server buffer:
Now data is on the server, but resides in a buffer in its main memory, but not yet permanently stored away on the hard drive. In this case, I/O time should be dominated by the time to transfer the data over the network.
write returns after data has been actually stored on the hard drive:
Which I am sure does not happen by default (unless you write really large files which causes your RAM to get filled and ultimately get flushed out and so on...).
Additionally, what I would like to be sure about is:
Can a situation occur where the program terminates without any data actually having left the client node, such that network parameters like latency, bandwidth, and the hard drive bandwidth do not feature in the program's execution time at all? Consider we do not do an fsync or something similar.
EDIT: I am using the pvfs2 parallel file system
Option 3. is of course simple, and safe. However, a production quality POSIX compatible parallel file system with performance good enough that anyone actually cares to use it, will typically use option 1 combined with some more or less involved mechanism to avoid conflicts when e.g. several clients cache the same file.
As the saying goes, "There are only two hard things in Computer Science: cache invalidation and naming things and off-by-one errors".
If the filesystem is supposed to be POSIX compatible, you need to go and learn POSIX fs semantics, and look up how the fs supports these while getting good performance (alternatively, which parts of POSIX semantics it skips, a la NFS). What makes this, err, interesting is that the POSIX fs semantics harks back to the 1970's with little to no though of how to support network filesystems.
I don't know about pvfs2 specifically, but typically in order to conform to POSIX and provide decent performance, option 1 can be used together with some kind of cache coherency protocol (which e.g. Lustre does). For fsync(), the data must then actually be transferred to the server and committed to stable storage on the server (disks or battery-backed write cache) before fsync() returns. And of course, the client has some limit on the amount of dirty pages, after which it will block further write()'s to the file until some have been transferred to the server.
You can get any of your three options. It depends on the flags you provide to the open call. It depends on how the filesystem was mounted locally. It also depends on how the remote server is configured.
The following are all taken from Linux. Solaris and others may differ.
Some important open flags are O_SYNC, O_DIRECT, O_DSYNC, O_RSYNC.
Some important mount flags for NFS are ac, noac, cto, nocto, lookupcache, sync, async.
Some important flags for exporting NFS are sync, async, no_wdelay. And of course the mount flags of the filesystem that NFS is exporting are important as well. For example, if you were exporting XFS or EXT4 from Linux and for some reason you used the nobarrier flag, a power loss on the server side would almost certainly result in lost data.

Buffering on top of VFS

the problem I try to deal with it is the saving of big number (millions) of small files (up to 50KB), which are sent via network. The saving is done sequential: server receives a file or a dir (via network), it saves it on disk; the next one arrives, it's saved etc.
Apparently, the performance is not acceptable, if multiple server processes coexist (let's say I have 5 processes which all read from network and write at the same time), because the I/O scheduler doesn't manage to merge efficiently the I/O writes.
A suggested solution is to implement some sort of buffering: each server process should have a 50MB cache, in which it should write the current file, do a chdir etc; when the buffer is full, it should be synced to disk, therefore obtaining an I/O burst.
My questions to you:
1) I know that already exists a buffer mechanism (disk buffer); do you think that the above scenario is going to add some improvement? (the design is much more complicated and it's not easy to implement a simple test case)
2) do you have any suggestions, where to look if I would implement this?
Many thanks.
You're going to need to do better than
"apparently the performance is not acceptable".
Specifically
How are you measuring it? Do you have an exact, reproducible figure
What is your target?
In order to do optimisation, you need two things- a method of measuring it (a metric) and a target (so you know when to stop, or how useful or useless a particular technique is).
Without either, you're sunk, I'm afraid.
How important are those writes? I have three suggestions (which can be combined), but one of them is a lot of work, and one of them is less safe...
Journaling
I'm guessing you're seeing some poor performance due in part to the journaling common to most modern Linux filesystems. The journaling causes barriers to be inserted into the IO queue when file metadata is written. You can try turning down the safety (and maybe turning up the speed) with mount(8) options barrier=0 and data=writeback.
But if there is a crash, the journal might not be able to prevent a lengthy fsck(8). And there's a chance the fsck(8) will wind up throwing away your data when fixing the problem. On the one hand, it's not a step to take lightly, on the other hand, back in the old days, we ran our ext2 filesystems in async mode without a journal both ways in the snow and we liked it.
IO Scheduler elevator
Another possibility is to swap the IO elevator; see Documentation/block/switching-sched.txt in the Linux kernel source tree. The short version is that deadline, noop, as, and cfq are available. cfq is the kernel default, and probably what your system is using. You can check:
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
The most important parts from the file:
As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the deadline or noop schedulers - which
can improve that device's throughput).
To set a specific scheduler, simply do this:
echo SCHEDNAME > /sys/block/DEV/queue/scheduler
where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sga, or whatever you happen to have).
The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:
# cat /sys/block/hda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [deadline] cfq
Changing the scheduler might be worthwhile, but depending upon the barriers inserted into the queue by the journaling requirements, there might not be much reordering possible. Still, it is less likely to lose your data, so it might be the first step.
Application changes
Another possibility is to drastically change your application to bundle files itself, and write fewer, larger, files to disk. I know it sounds strange, but (a) the iD development team packaged their maps, textures, objects, etc., into giant zip files that they would read into the program with a few system calls, unpack, and run with, because they found the performance much better than reading a few hundred or few thousand smaller files. Load times between levels was drastically shorter. (b) The Gnome desktop team and KDE desktop teams took different approaches to loading their icons and resource files: the KDE team packages their many small files into larger packages of some sort, and the Gnome team did not. The Gnome team had longer startup delays and were hoping the kernel could make some efforts to improve their startup time. The kernel team kept suggesting the fewer, larger, files approach.
Creating/renaming a file, syncing it, having lots of files in a directory and having lots of files (with tail waste) are some of the slow operations in your scenario. However to avoid them it would only help to write lesser files (for example writing out archives, concatenated file or similiar). I would actually try a (limited) parallel async or sync approach. The IO scheduler and caches are typically quite good.

How to parallelize file reading and writing

I have a program which reads data from 2 text files and then save the result to another file. Since there are many data to be read and written which cause a performance hit, I want to parallize the reading and writing operations.
My initial thought is, use 2 threads as an example, one thread read/write from the beginning, and another thread read/write from the middle of the file. Since my files are formatted as lines, not bytes(each line may have different bytes of data), seek by byte does not work for me. And the solution I could think of is use getline() to skip over the previous lines first, which might be not efficient.
Is there any good way to seek to a specified line in a file? or do you have any other ideas to parallize file reading and writing?
Environment: Win32, C++, NTFS, Single Hard Disk
Thanks.
-Dbger
Generally speaking, you do NOT want to parallelize disk I/O. Hard disks do not like random I/O because they have to continuously seek around to get to the data. Assuming you're not using RAID, and you're using hard drives as opposed to some solid state memory, you will see a severe performance degradation if you parallelize I/O(even when using technologies like those, you can still see some performance degradation when doing lots of random I/O).
To answer your second question, there really isn't a good way to seek to a certain line in a file; you can only explicitly seek to a byte offset using the read function(see this page for more details on how to use it.
Queuing multiple reads and writes won't help when you're running against one disk. If your app also performed a lot of work in CPU then you could do your reads and writes asynchronously and let the CPU work while the disk I/O occurs in the background. Alternatively, get a second physical hard drive: read from one, write to the other. For modestly sized data sets that's often effective and quite a bit cheaper than writing code.
This isn't really an answer to your question but rather a re-design (which we all hate but can't help doing). As already mentioned, trying to speed up I/O on a hard disk with multiple threads probably won't help.
However, it might be possible to use another approach depending on data sensitivity, throughput needs, data size, etc. It would not be difficult to create a structure in memory that maintains a picture of the data and allows easy/fast updates of the lines of text anywhere in the data. You could then use a dedicated thread that simply monitors that structure and whose job it is to write the data to disk. Writing data sequentially to disk can be extremely fast; it can be much faster than seeking randomly to different sections and writing it in pieces.

Resources