For a peer-to-peer app that can resume file transfers, is it sufficient to check filesize/modified date for changes before resuming a file? - p2p

I'm working on a networked application that has a peer-to-peer file transfer component (think instant messenger), and I'd like to make it able to resume file transfers gracefully.
If there is an ongoing file transfer, and one user drops out, the recipient still knows how much of the file he's successfully received and therefore where to resume the transfer from. However, if the file has changed in the meantime, how can this be detected? To be clear, I'm not focused here on corruption by the network so much as corruption caused by the source file being altered.
The way I was starting out on this was by having the sender hash the file before sending it, so the recipient has a hash to check the finished file against. However, this only detects corruption at the very end, unless each resume also hashes. This problem could be alleviated by viewing the file in chunks, and hashing each of those. However, the bigger problem with hashing is that it can take a really, really long time, which is just a bad user experience when a user just wants to immediately send something (Ex: Linux ISO on a slow network share is the file to be sent).
I was thinking about changing to simply checking the file size and modified date each time a transfer begins or is resumed. While this is clearly not foolproof, unless I'm missing something (and please correct me if I am), almost every means an end-user would be using to alter files will be well-behaved and at the very least mark the modified date, and even if not, the change in size should catch 99% of cases. Does this seem like an acceptable compromise? Bad idea?
How do the established protocols handle this?

The quick answer to your question is that it will work in most cases, unless files are modified often.
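As a rough illustration of the size/modified-date check described in the question, here is a minimal Node.js sketch. The helper names `fingerprint` and `looksUnchanged` are made up for this example; the idea is that the sender records the fingerprint when the transfer starts and compares it again on every resume.

```javascript
const fs = require('fs');

// Hypothetical helper: capture the cheap metadata the question proposes to compare.
function fingerprint(filePath) {
  const st = fs.statSync(filePath);
  return { size: st.size, mtimeMs: st.mtimeMs };
}

// Returns true when the file looks unchanged since `saved` was taken.
// This only inspects metadata, so an edit that keeps the size and resets
// the modified time would go unnoticed.
function looksUnchanged(filePath, saved) {
  const now = fingerprint(filePath);
  return now.size === saved.size && now.mtimeMs === saved.mtimeMs;
}
```

If `looksUnchanged` returns false on resume, the simplest reaction is to restart the transfer from the beginning.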
Instead of hashes, use checksums (CRC32, for example). They are much faster to compute when checking whether a file has been modified.
If a connection breaks, the recipient only needs to send the checksums of the chunks it has received back to the source, which can then determine whether any of those chunks have been modified in the meantime. The source can then decide which chunks to resend, and send the missing chunks as well.
Chunks plus checksums offer a better trade-off for user experience than hashing whole files.
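The chunk-and-checksum idea could look roughly like the sketch below. It uses a plain table-driven CRC-32; the function names (`crc32`, `chunkChecksums`, `chunksToSend`) and the in-memory Buffer input are assumptions made for illustration, and a real implementation would read the file chunk by chunk rather than hold it in memory.

```javascript
// Build the standard CRC-32 lookup table (reflected polynomial 0xEDB88320).
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
  }
  CRC_TABLE[n] = c >>> 0;
}

// CRC-32 of a Buffer, returned as an unsigned 32-bit number.
function crc32(buf) {
  let crc = 0xFFFFFFFF;
  for (let i = 0; i < buf.length; i++) {
    crc = CRC_TABLE[(crc ^ buf[i]) & 0xFF] ^ (crc >>> 8);
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Checksums of each fixed-size chunk of `data` (a Buffer).
function chunkChecksums(data, chunkSize) {
  const sums = [];
  for (let off = 0; off < data.length; off += chunkSize) {
    sums.push(crc32(data.subarray(off, off + chunkSize)));
  }
  return sums;
}

// Sender side: compare the checksums reported by the receiver with the
// current source and return the indices of chunks to (re)send. Chunks the
// receiver never got have no reported checksum, so they are included too.
function chunksToSend(currentData, chunkSize, receivedSums) {
  const current = chunkChecksums(currentData, chunkSize);
  const resend = [];
  for (let i = 0; i < current.length; i++) {
    if (receivedSums[i] !== current[i]) resend.push(i);
  }
  return resend;
}
```

On resume, the receiver would send its list of chunk checksums, and the sender would transmit only the chunks returned by `chunksToSend`. Note that a CRC guards against accidental changes only; it is not meant to resist deliberate tampering.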

Related

node fs.fsync (when to use?)

I want to safely write a file and I want to understand the proper use/place for fsync.
https://linux.die.net/man/2/fsync
After reading ^ that, I am puzzled as to where to effectively use it.
Question, do I:
fs.write('temp/file.txt','utf-8',function(error){
    if(error){fs.unlink('temp/file.txt',function(){cb(error,undefined);});}
    else{
        fs.rename('temp/file.txt','real/file.txt',function(){
            fs.fsync('real/file.txt',function(){
                cb(undefined,true);
            });
        });
    }
});
I'm writing something that performs many file changes. I have looked at modules that write atomic, but I would like to understand the process.
fsync is one of those functions where it's extremely rare that you'll need to use it.
All operating systems mask the fact that storage devices are slow by caching reads and writes. When you write to a file, it doesn't immediately write to the actual storage medium; it'll capture it in a cache, tell your program that the write has completed, and go and write the contents to the storage device in the background instead. The operating system will keep everything consistent though; if another application reads from that file, it'll see the new contents, as the OS will serve the contents from cache.
Note for a moment that this isn't universal; I believe Windows disables caching for removable storage devices to prevent data loss when people pull the drive out. There's also some set of flags you can pass to open() to disable the cache.
For almost all use cases, you don't need to care that this happens. The only upshot for you is that your program can continue faster. There are some cases where this is problematic though:
If power is lost, the contents of the cache are lost, so the disk won't have all the new contents of the file.
If the drive is removed, writes will equally be lost. This is pretty typical for removable storage devices, and I'm pretty sure 90% of people ignore the "safely remove" prompt ;).
I think reading directly from a device (i.e. /dev/sdX in Linux) will bypass this cache, but I'm not 100% sure.
Examples of where it is needed are, say, databases. When you run an update query, the database will normally update its in-memory state, and write the mutation to a transaction log. Reliability is a good thing for a database though, so it will write to the transaction log and do an fsync on that file before responding to the user (or will have opened the transaction log as unbuffered) so there's some level of guarantee that the transaction has been persisted.
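In your example, the intent of the fsync is to ensure that the renamed file has actually been flushed to disk. Note, though, that fs.fsync takes a file descriptor rather than a path, and that the man page you linked points out that getting the directory entry onto disk requires an explicit fsync on the directory itself.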
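For reference, a commonly used ordering is: write to a temporary file, fsync it, rename it over the final name, then fsync the containing directory. The sketch below uses Node's synchronous fs calls for brevity. The function name `writeFileDurably` and the temporary-file naming are assumptions for this example, and it places the temporary file in the same directory as the target rather than in a separate temp/ directory as in the question.

```javascript
const fs = require('fs');
const path = require('path');

// Sketch: write `data` to `finalPath` via a temporary file in the same directory.
function writeFileDurably(finalPath, data) {
  const dir = path.dirname(finalPath);
  // Hypothetical naming scheme for the temporary file.
  const tmpPath = path.join(dir, '.' + path.basename(finalPath) + '.tmp');

  const fd = fs.openSync(tmpPath, 'w');
  try {
    fs.writeFileSync(fd, data);
    fs.fsyncSync(fd); // flush the file contents before the rename
  } finally {
    fs.closeSync(fd);
  }

  // Replace the target in one step (atomic on POSIX filesystems).
  fs.renameSync(tmpPath, finalPath);

  // Best effort: flush the directory entry as well. Opening a directory
  // can fail on some platforms (e.g. Windows), so errors are ignored here.
  try {
    const dirFd = fs.openSync(dir, 'r');
    try {
      fs.fsyncSync(dirFd);
    } finally {
      fs.closeSync(dirFd);
    }
  } catch (err) {
    // Assumption: skipping the directory sync is acceptable for this sketch.
  }
}
```

Readers then see either the old file or the complete new one, never a half-written file. How much durability this buys after a power loss still depends on the filesystem and the drive's own write cache.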

Prevent data corruption

I'm working on an embedded Linux system running on ARM9.
The filesystem is ext4 type (rw, sync, noatime, data=writeback)
I implemented a process that reads from and writes to a SQLite3 database in Write-Ahead Logging (WAL) mode, with unsync enabled. When a power loss happens, I have around two seconds to save all data by syncing and checkpointing the DB. Still, I see that the DB sometimes gets corrupted, which is really not good in my case.
I would like to write a new DB engine for my purpose, similar to SQLite, where the DB is held in one file. In this case, though, I'm thinking of writing the header data to one sector and the rest of the data at least two sectors after it. The DB will be larger, but writing the data will not ruin the header of the file, which holds the indexes, etc. That way, only the most recent data would be corrupted rather than the whole file, which is how SQLite behaves.
My question is whether my approach is right.
You can use the ping-pong technique.
With the ping-pong technique you use two separate files and write alternately to one and then the other. If a power loss occurs, in the worst case you have at most one corrupted file and you can safely use the other one. In the best case neither of them is corrupted and you can continue using the latest one.
A corrupted file is easily detected if you use hashing functions or CRC schemes.
Obviously this scheme doesn't save you from a write cache or other disk caching mechanisms that could be working under the hood.
Alternatively, you can use a journaled file system, which features data integrity protection of its own.
Be aware that ping-pong and journaling schemes ensure only data integrity. Data loss could still occur. Data integrity and data loss are two completely different things.
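A minimal sketch of the ping-pong scheme, written in Node.js purely for illustration (on the embedded target this would more likely be C). The record layout (a JSON object with a sequence number, the payload and a SHA-256 digest) and the function names are assumptions for this example, not part of any existing library.

```javascript
const fs = require('fs');
const crypto = require('crypto');

// Integrity digest over the sequence number and the payload.
function digest(seq, payload) {
  return crypto.createHash('sha256')
    .update(String(seq)).update('\n').update(payload)
    .digest('hex');
}

// Read one slot; return null if it is missing or fails its integrity check.
function readSlot(file) {
  try {
    const rec = JSON.parse(fs.readFileSync(file, 'utf8'));
    if (typeof rec.seq !== 'number' || typeof rec.payload !== 'string') return null;
    return rec.hash === digest(rec.seq, rec.payload) ? rec : null;
  } catch (err) {
    return null; // missing, truncated, or unparsable slot
  }
}

// Return the newest valid record from the two slots, or null if neither is valid.
function load(fileA, fileB) {
  const a = readSlot(fileA);
  const b = readSlot(fileB);
  if (a && b) return a.seq > b.seq ? a : b;
  return a || b;
}

// Write the payload (a string) to the slot that does NOT hold the newest
// valid record, so the last good copy is never overwritten in place.
function save(fileA, fileB, payload) {
  const a = readSlot(fileA);
  const b = readSlot(fileB);
  const newestIsA = a && (!b || a.seq > b.seq);
  const target = newestIsA ? fileB : fileA;
  const seq = Math.max(a ? a.seq : 0, b ? b.seq : 0) + 1;
  const rec = { seq, payload, hash: digest(seq, payload) };
  const fd = fs.openSync(target, 'w');
  try {
    fs.writeFileSync(fd, JSON.stringify(rec));
    fs.fsyncSync(fd); // ask the OS to flush this slot before returning
  } finally {
    fs.closeSync(fd);
  }
}
```

If power is lost during `save`, only the slot being written can be damaged; `load` then falls back to the other slot, at the cost of losing the most recent update.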

Can file size be used to detect a partial append?

I'm thinking about ways for my application to detect a partially-written record after a program or OS crash. Since records are only ever appended to a file (never overwritten), is a crash while writing guaranteed to yield a file size that is shorter than it should be? Is this guaranteed even if the file was opened in read-write mode instead of append mode, so long as writes are always at the end of the file? This would greatly simplify crash recovery, since comparing the last record's expected size and position with the actual file size would be enough to detect a partial write.
I understand that random-access writes can be reordered by the filesystem, but I'm having trouble finding information on whether this can happen when appending. I imagine an out-of-order append would require the filesystem to create a "hole" at the tail of the (sparse) file, write blocks beyond the hole, and then fill in the blocks in between, but I'm hoping that such an approach would be so inefficient that nobody would ever implement their filesystem that way.
I suppose another problem might be a filesystem updating the directory entry's file size field before appending the new blocks to the file, and the OS crashing in between. Does this ever happen in practice? (ext4, perhaps?) Is there a quick way to detect it? (And what happens when trying to read the unwritten blocks that should exist according to the file's size?)
Is there anything else, such as write reordering performed by a disk/flash drive, that would get in the way of using file size as a way to detect a partial append? I don't expect to be able to compensate for this sort of drive trickery in my application, but it would be good to know about.
If you want to be SURE that you're never going to lose records, you need a consistent journaling or transactional system for your files.
There is absolutely no guarantee that a write will have been fulfilled unless you either set O_DIRECT [which you probably do not want to do], or you use markers to indicate that "this has been fully committed", which are only written when the file is closed. You can either do that in the main file or, for example, have a separate file that records the "last written record" externally. If you open and close that file, it should be safe as long as the app is what is crashing; if the OS crashes [or is otherwise abruptly stopped, e.g. power cut, disk unplugged, etc.], all bets are off.
Write reordering and write caching is/can be done at all levels - the C library, the OS, the filesystem module and the hard disk/controller itself are all ABLE to reorder writes.
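One way to realise the "fully committed" marker inside the main file is to frame each record with its length and a trailing digest, and to treat the first record that fails the check as the end of valid data. The sketch below is only an illustration of that idea under assumed names (`frame`, `scan`) and an assumed layout. It helps detect a torn tail, but it does not by itself guarantee that earlier records reached the disk.

```javascript
const crypto = require('crypto');

const HASH_LEN = 32; // SHA-256 digest size in bytes

// Frame a record as [4-byte big-endian length][payload][SHA-256 of payload].
// `payload` is expected to be a Buffer.
function frame(payload) {
  const header = Buffer.alloc(4);
  header.writeUInt32BE(payload.length, 0);
  const hash = crypto.createHash('sha256').update(payload).digest();
  return Buffer.concat([header, payload, hash]);
}

// Walk the log (a Buffer holding the file contents) and return the complete
// records plus the byte offset at which valid data ends.
function scan(log) {
  const records = [];
  let offset = 0;
  while (offset + 4 <= log.length) {
    const len = log.readUInt32BE(offset);
    const end = offset + 4 + len + HASH_LEN;
    if (end > log.length) break; // record runs past the end: truncated tail
    const payload = log.subarray(offset + 4, offset + 4 + len);
    const stored = log.subarray(offset + 4 + len, end);
    const actual = crypto.createHash('sha256').update(payload).digest();
    if (!actual.equals(stored)) break; // torn or garbage record
    records.push(payload);
    offset = end;
  }
  return { records, validLength: offset };
}
```

On startup, the application could run `scan` over the file and truncate it to `validLength` before appending again. Because the check covers the content rather than the file size, it also catches a tail whose size was updated but whose blocks were never written.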

using files as IPC on linux

I have one writer which creates and sometimes updates a file with some status information. The readers are implemented in lua (so I got only io.open) and possibly bash (cat, grep, whatever). I am worried about what would happen if the status information is updated (which means a complete file rewrite) while a reader has an open handle to the file: what can happen? I have also read that if the write/read operation is below 4KB, it is atomic: that would be perfectly fine for me, as the status info can fit well in such dimension. Can I make this assumption?
A read or write is atomic under 4Kbytes only for pipes, not for disk files (for which the atomic granularity may be the file system block size, usually 512 bytes).
In practice you could avoid bothering about such issues (assuming your status file is, e.g., less than 512 bytes). I believe that if the writer opens and writes that file quickly, you don't need to bother. In particular, avoid open(2)-ing the file, keeping the file handle open for a long time (many seconds), and only then write(2)-ing a small string into it.
If you are paranoid, but do assume that readers (like grep) open the file and read it quickly, you could write to a temporary file and rename(2) it once it has been written (and close(2)-ed) in its entirety.
As Duck suggested, locking the file in both readers and writers is also a solution.
I may be mistaken, in which case someone will correct me, but I don't think the external readers are going to pay any attention to whether the file is being simultaneously updated. They are going to print (or possibly eof or error out) whatever is there.
In any case, why not avoid the whole mess and just use file locks? Have the writer flock (or similar) and the readers check the lock. If they get the lock they know they are ok to read.

Should AspBufferLimit ever need to be increased from the default of 4 MB?

A fellow developer recently requested that the AspBufferLimit in IIS 6 be increased from the default value of 4 MB to around 200 MB for streaming larger ZIP files.
Having left the Classic ASP world some time ago, I was scratching my head as to why you'd want to buffer a BinaryWrite and simply suggested setting Response.Buffer = false. But is there any case where you'd really need to make it 50x the default size?
Obviously, memory consumption would be the biggest worry. Are there other concerns with changing this default setting?
Increasing the buffer like that is a supremely bad idea. You would allow every visitor to your site to use up to that amount of RAM. If your BinaryWrite/Response.Buffer=false solution doesn't appease him, you could also suggest that he call Response.Flush() now and then. Either would be preferable to increasing the buffer size.
In fact, unless you have a very good reason you shouldn't even pass this through the asp processor. Write it to a special place on disk set aside for such things and redirect there instead.
One of the downsides of turning off the buffer (you could use Flush, but I really don't get why you'd do that in this scenario) is that the client doesn't learn the content length at the start of the download. Hence the browser's dialog at the other end is less meaningful: it can't tell how much progress has been made.
A better (IMO) alternative is to write the desired content to a temporary file (perhaps using GUID for the file name) then sending a Redirect to the client pointing at this temporary file.
There are a number of reasons why this approach is better:-
The client gets good progress info in the save dialog or application receiving the data
Some applications can make good use of byte range fetches which only work well when the server is delivering "static" content.
The temporary file can be re-used to satisfy requests from other clients
There are a number of downsides though:-
If it takes some time to create the file content, writing it to a temporary file first adds latency before the client receives any data, increasing the overall download time.
If strong security is needed on the content, having a static file lying around may be a concern, although the use of a random GUID filename mitigates that somewhat
There is a need for some housekeeping on old temporary files.
