Writing to a file at once or appending a line by line - disk writes - io

I'm concerned about disc drive life and I'd like to know which way takes minimum amount of writes, so the HDD/SSD won't wear out too soon:
with open('file.txt', 'w') as f:
for line in lines:
f.write(line + '\n')
or
with open('file.txt', 'w') as f:
f.write('\n'.join(lines))
f.write('\n') # just to keep the same content as in the previous example
I know HDD drivers have a flushing mechanism, so in theory, both should equally wear the drive out, but I'd like to verify my understanding by asking.

Whether you want to enforce some precautionary actions or not, as long as there's other programs installed in your computer, there's always going to be read/write operation going on and this operation can either leave behind memory leak or not, so there's no way to say you can prevent the hard-drive or SSD from wearing out.
Although if you're more concern about your applications not doing any damage to the HDD/SSD, you can maintain a healthy read/write operation. Usually this kind of operation include making sure any read/write resource is freed or close after any operation is carried out. Avoid writing large data at once into memory, either break it down to the maximum limit allowed by your memory or HDD/SSD read/write cycle and also avoid redundant code that perform repetitive actions on a file such as IO operations running in a looping construct like for, while and others and if you must have this kind of read/write operation on a loop, have some delay, and ensure the read operation is done and closed or freed after all the reading or writing is complete.

Related

Hey could someone help me understand sync syscall usage?

like said in the title, I don't really understand the usage of this syscall. I was writing some program that write some data in a file, and the tutorial I've seen told me to use sys_sync syscall. But my problem is why and when should we use this? The data isn't already written on the file?
The manual says:
sync - Synchronize cached writes to persistent storage
So it is written to the file cache in memory, not on disk.
You rarely have to use sync unless you are writing really important data and need to make sure that data is on disk before you go on. One example of systems that use sync a lot are databases (such as MySQL or PostgreSQL).
So in other words, it is theoretically in your file, just not on disk and therefore if you lose electricity, you could lose the data, especially if you have a lot of RAM and many writes in a raw, it may privilege the writes to cache for a long while, increasing the risk of data loss.
But how can a file be not on the disk? I understand the concept of cache but if I wrote in the disk why would it be in a different place?
First, when you write to a file, you send the data to the Kernel. You don't directly send it to the disk. Some kernel driver is then responsible to write the data to disk. In my days on Apple 2 and Amiga computers, I would actually directly read/write to disk. And at least the Amiga had a DMA so you could setup a buffer, then tell the disk I/O to do a read or a write and it would send you an interrupt when done. On the Apple 2, you had to write loops in assembly language with precise timings to read/write data on floppy disks... A different era!
Although you could, of course, directly access the disk (but with a Kernel like Linux, you'd have to make sure the kernel gives you hands free to do that...).
Cache is primarily used for speed. It is very slow to write to disk (as far as a human is concerned, it looks extremely fast, but compared to how much data the CPU can push to the drive, it's still slow).
So what happens is that the kernel has a task to write data to disk. That task wakes up as soon as data appears in the cache and ends once all the caches are transferred to disk. This task works in parallel. You can have one such task per drive (which is especially useful when you have a system such as RAID 1).
If your application fills up the cache, then a further write will block until some of the cache can be replaced.
and the tutorial I've seen told me to use sys_sync syscall
Well that sounds silly, unless you're doing filesystem write benchmarking or something.
If you have one really critical file that you want to make sure is "durable" wrt. power outages before you do something else (like sent a network packet to acknowledge a complete transfer), use fsync(fd) to sync just that one file's data and metadata.
(In asm, call number SYS_fsync from sys/syscall.h, with the file descriptor as the first register arg.)
But my problem is why and when should we use this?
Generally never use the sync system call in programs you're writing.
There are interactive use-cases where you'd normally use the wrapper command of the same name, sync(1). e.g. with removable media, to get the kernel started doing write-back now, so unmount will take less time once you finish typing it. Or for some benchmarking use-cases.
The system shutdown scripts may run sync after unmounting filesystems (and remounting / read-only), before making a reboot(2) system call.
Re: why sync(2) exists
No, your data isn't already on disk right after echo foo > bar.txt.
Most OSes, including Linux, do write-back caching, not write-through, for file writes.
You don't want write() system calls to wait for an actual magnetic disk when there's free RAM, because the traditional way to do I/O is synchronous so simple single-threaded programs wouldn't be able to do anything else (like reading more data or computing anything) while waiting for write() to return. Blocking for ~10 ms on every write system call would be disastrous; that's as long as a whole scheduler timeslice. (It would still be bad even with SSDs, but of course OSes were designed before SSDs were a thing.) Even just queueing up the DMA would be slow, especially for small file writes that aren't a whole number of aligned sectors, so even letting the disk's own write-back write caching work wouldn't be good enough.
Therefore, file writes do create "dirty" pages of kernel buffers that haven't yet been sent to the disk. Sometimes we can even avoid the IO entirely, e.g. for tmp files that get deleted before anything triggers write-back. On Linux, dirty_writeback_centisecs defaults to 1500 (15 seconds) before the kernel starts write-back, unless it's running low on free pages. (Heuristics for what "low" means use other tunable values).
If you really want writes to flush to disk immediately and wait for data to be on disk, mount with -o sync. Or for one program, have it use open(O_SYNC) or O_DSYNC (for just the data, not metadata like timestamps).
See Are file reads served from dirtied pages in the page cache?
There are other advantages to write-back, including delayed allocation even at the filesystem level. The FS can wait until it knows how big the file will be before even deciding where to put it, allowing better decisions that reduce fragmentation. e.g. a small file can go into a gap that would have been a bad place to start a potentially-large file. (It just have to reserve space to make sure it can put it somewhere.) XFS was one of the first filesystems to do "lazy" delayed allocation, and ext4 has also had the feature for a while.
https://en.wikipedia.org/wiki/XFS#Delayed_allocation
https://en.wikipedia.org/wiki/Allocate-on-flush
https://lwn.net/Articles/323169/

Fastest way to copy a large file locally

I was asked this in an interview.
I said lets just use cp. Then I was asked to mimic implementation cp itself.
So I thought okay, lets open the file, read one by one and write it to another file.
Then I was asked to optimize it further. I thought lets do chunks of read and write those chunks. I didn't have a good answer about what would be good chunk size. Please help me out with that.
Then I was asked to optimize even further. I thought may be we could read from different threads in parallel and write it in parallel.
But I quickly realized reading in parallel is OK but writing will not work in parallel(without locking I mean) since data from one thread might overwrite others.
So I thought okay, lets read in parallel, put it in a queue and then a single thread will take it off the queue and write it to the file one by one.
Does that even improve performance? (I mean not for small files. it would be more overhead but for large files)
Also, is there like an OS trick where I could just point two files to the same data in disk? I mean I know there are symlinks but apart from that?
"The fastest way to copy a file" is going to depend on the system - all the way from the storage media to the CPUs. The most likely bottleneck will be the storage media - but it doesn't have to be. Imagine high-end storage that can move data faster than your system can create physical page mappings to read the data into...
In general, the fastest way to move a lot of data is to make as few copies of it as possible, and to avoid any extra operations, especially S-L-O-W ones such as physical disk head seeks.
So for a local copy on a common single-rotating-disk workstation/desktop/laptop system, the biggest thing to do is minimize physical disk seeks. That means read and write single-threaded, in large chunks (1 MB, for example) so the system can do whatever optimization it can, such as read-ahead or write coalescing.
That will likely get you to 95% or even better of the system's maximum copy performance. Even standard C buffered fopen()/fread()/fwrite() probably gets at least 80-90% of the best possible performance.
You can get the last few percentage points in a few ways. First, by matching your IO block size to a multiple of the file system's block size so that you're always reading full blocks from the filesystem. Second, you can use direct IO to bypass copying your data through the page cache. It will be faster to go disk->userspace or userspace->disk than it is to go disk->page cache->userspace and userspace->page cache->disk, but for single-spinning-disk copy that's not going to matter much, if it's even measurable.
You can use various dd options to test copying a file like this. Try using direct, or notrunc.
You can also try using sendfile() to avoid copying data into userspace entirely. Depending on the implementation, that might be faster than using direct IO.
Pre-allocating the destination file may or may not improve copy performance - that will depend on the filesystem. If the filesystem doesn't support sparse files, though, preallocating the file to a specific length might very well be very, very slow.
There just isn't all that much you can do to dramatically improve performance of a copy from and to the same single spinning physical disk - those disk heads will dance, and that will take time.
SSDs are much easier - to get maximal IO rates, just use parallel IO via multiple threads. But again, the "normal" IO will probably be at 80-90% of maximal.
Things get a lot more interesting and complex optimizing IO performance for other types of storage systems such as large RAID arrays and/or complex filesystems that can stripe single files across multiple underlying storage devices. Maximizing IO on such systems involves matching the software's IO patterns to the characteristics of the storage, and that can be quite complex.
Finally, one important part of maximizing IO rates is not doing things that dramatically slow things down. It's really easy to drag a physical disk down to a few KB/sec IO rates - read/write small chunks from/to random locations all over the disk. If your write process drops 16-byte chunks to random locations, the disk will spend almost all its time seeking and it won't move much data at all while doing that.
In fact, not "killing yourself" with bad IO patterns is a lot more important than spending a lot of effort attempting to get a four or five percentage points faster in optimal cases.
Because if IO is a bottleneck on a simple system, just go buy a faster disk.
But I quickly realized reading in parallel is OK but writing will not work in parallel(without locking I mean) since data from one thread might overwrite others.
Multithreading is not normally going to speed up a process like this. Any performance benefit you may gain could be wiped out by the synchronization overhead.
So I thought okay, lets read in parallel, put it in a queue and then a single thread will take it off the queue and write it to the file one by one.
That's only going to give an advantage on a system that supports asychronous I/O.
To get the maximum speed you'd want to write in buffer sizes that are increments of the cluster factor of the disk (assuming a hard file system). This could be sped up on systems that permit queuing asynchronous I/O (as does, say, Windoze).
You'd also want to create the output file with its initial size being the same as the input file. That ways your write operations never have to extend the file.
Probably the fastest file copy possible would be to memory map the input and output files and did a memory copy. This is especially efficient in systems that treat mapped files as page files.

EXT4 slower performance successive writes

I'm working with a C program that does a bunch of things but in some part it writes a whole HDD with multiple calls to fwrite on a single file with a fixed size.
The calls are something like this:
fwrite(some_memory,size_element,total_elements,file);
When I measure the wall time of this calls each call takes a bit longer than the previous one. So for example, if want to write in chunks of 900MB of data, the first call (with empty disk) ends within 7 seconds but the last ones takes somewhere between 10~11 secs (with the disk almost at full capacity).
Is this an expected behavior? Is there any way of getting consistent write times independently of disk current capacity?
I'm using an EXT4 wd green 2TB volume.
I'd say this is to be expected as your early calls are most likely satisfied by the kernel's writeback cache and thus returning slightly quicker because at the time fwrite returns not all the data has reached the disk yet. However I don't know how much memory your system has compared to the 900MBytes of data you're trying to write so this is a guess...
If the kernel's cache becomes full (e.g. because the disk can't keep up) then your userspace program is made to block until it is sufficiently empty and able to accept a bit more data leaky bucket style. Only once all data has gone in to the cache can the fwrite complete. However at that point you are likely doing another fwrite call that tops the cache up again and your subsequent call is forced to wait a bit longer because the cache has not been fully emptied. I'd imagine you would reach a fixed point though...
To see if caches really were behind the behaviour you could issue an fsync after each fwrite (destroying your performance) and take the time from the fwrite submission to fsync completion and see if the variance was so large.
Another thing you could do that might help is to preallocate the full size of the file up front so that the filesystem isn't forced to keep regrowing it as new data is appended to the end (this should cut down on metadata operations and fight fragmentation too).
The dirty_* knobs in https://www.kernel.org/doc/Documentation/sysctl/vm.txt are also likely comming in to play given the large amounts of data you are writing.

How to prioritize write() over mmap updates (or delay mmap page cache flush)

I'm running a specialized DB daemon on a debian-64 with 64G of RAM and lots of disk space. It uses an on-disk hashtable (mmaped) and writes the actual data into a file with regular write() calls. When doing really a lot of updates, a big part of the mmap gets dirty and the page cache tries to flush it to disk, producing lots of random writes which in turn slows down the performance of the regular (sequential) writes to the data file.
If it were possible to delay the page cache flush of the mmaped area performance would improve (I assume), since several (or all) changes to the dirty page would be written at once instead of once for every update (worst case, in reality of course it aggregates a lot of changes anyway).
So my question: Is it possible to delay page cache flush for a memory-mapped area? Or is it possible to prioritze the regular write? Or does anyone have any other ideas? madvise and posix_fadvise don't seem to make any difference...
You could play with the tuneables in /proc/sys/vm. For example, increase the value in dirty_writeback_centisecs to make pdflush wake up somewhat less often, increase dirty_expire_centiseconds so data is allowed to stay dirty for longer until it must be written out, and increase dirty_background_ratio to allow more dirty pages to stay in RAM before something must be done.
See here for a somewhat comprehensive description of what all the values do.
Note that this will affect every process on your machine, but seeing how you're running a huge database server, chances are that this is no problem since you don't want anything else to run on the same machine anyway.
Now of course this delays writes, but it still doesn't fully solve the problem of dirty page writebacks competing with write (though it will likely collapse a few writes if there are many updates).
But: You can use the sync_file_range syscall to force beginning write-out of pages in a given range on your "write" file descriptor (SYNC_FILE_RANGE_WRITE). So while the dirty pages will be written back at some unknown time later (and with greater grace periods), you manually kick off writeback on the ones you're interested.
This doesn't give any guarantees, but it should just work.
Be sure to absolutely positively read the documentation, better read it twice. sync_file_range can very easily corrupt or lose data if you use it wrong. In particular, you must be sure metadata is up-to-date and flushed if you appended to a file, or data that has been "successfully written" will just be "gone" in case of a crash.
I would try mlock. If you mlock the relevant memory range, it may prevent the flush from occurring. You could munlock when you're done.

unbuffered I/O in Linux

I'm writing lots and lots of data that will not be read again for weeks - as my program runs the amount of free memory on the machine (displayed with 'free' or 'top') drops very quickly, the amount of memory my app uses does not increase - neither does the amount of memory used by other processes.
This leads me to believe the memory is being consumed by the filesystems cache - since I do not intend to read this data for a long time I'm hoping to bypass the systems buffers, such that my data is written directly to disk. I dont have dreams of improving perf or being a super ninja, my hope is to give a hint to the filesystem that I'm not going to be coming back for this memory any time soon, so dont spend time optimizing for those cases.
On Windows I've faced similar problems and fixed the problem using FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH - the machines memory was not consumed by my app and the machine was more usable in general. I'm hoping to duplicate the improvements I've seen but on Linux. On Windows there is the restriction of writing in sector sized pieces, I'm happy with this restriction for the amount of gain I've measured.
is there a similar way to do this in Linux?
The closest equivalent to the Windows flags you mention I can think of is to open your file with the open(2) flags O_DIRECT | O_SYNC:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The O_DIRECT flag on its
own makes at an effort to transfer data synchronously, but does not
give the guarantees of the O_SYNC that data and necessary metadata are
transferred. To guarantee synchronous I/O the O_SYNC must be used in
addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is
described in raw(8).
Granted, trying to do research on this flag to confirm it's what you want I found this interesting piece telling you that unbuffered I/O is a bad idea, Linus describing it as "brain damaged". According to that you should be using madvise() instead to tell the kernel how to cache pages. YMMV.
You can use O_DIRECT, but in that case you need to do the block IO yourself; you must write in multiples of the FS block size and on block boundaries (it is possible that it is not mandatory but if you do not its performance will suck x1000 because every unaligned write will need a read first).
Another much less impacting way of stopping your blocks using up the OS cache without using O_DIRECT, is to use posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED). Under Linux 2.6 kernels which support it, this immediately discards (clean) blocks from the cache. Of course you need to use fdatasync() or such like first, otherwise the blocks may still be dirty and hence won't be cleared from the cache.
It is probably a bad idea of fdatasync() and posix_fadvise( ... POSIX_FADV_DONTNEED) after every write, but instead wait until you've done a reasonable amount (50M, 100M maybe).
So in short
after every (significant chunk) of writes,
Call fdatasync followed by posix_fadvise( ... POSIX_FADV_DONTNEED)
This will flush the data to disc and immediately remove them from the OS cache, leaving space for more important things.
Some users have found that things like fast-growing log files can easily blow "more useful" stuff out of the disc cache, which reduces cache hits a lot on a box which needs to have a lot of read cache, but also writes logs quickly. This is the main motivation for this feature.
However, like any optimisation
a) You're not going to need it so
b) Do not do it (yet)
as my program runs the amount of free memory on the machine drops very quickly
Why is this a problem? Free memory is memory that isn't serving any useful purpose. When it's used to cache data, at least there is a chance it will be useful.
If one of your programs requests more memory, file caches will be the first thing to go. Linux knows that it can re-read that data from disk whenever it wants, so it will just reap the memory and give it a new use.
It's true that Linux by default waits around 30 seconds (this is what the value used to be anyhow) before flushing writes to disk. You can speed this up with a call to fsync(). But once the data has been written to disk, there's practically zero cost to keeping a cache of the data in memory.
Seeing as you write to the file and don't read from it, Linux will probably guess that this data is the best to throw out, in preference to other cached data. So don't waste effort trying to optimise unless you've confirmed that it's a performance problem.

Resources