How to ensure that data written from application is synced to disk with ext4 in OpenEBS? - openebs

At times, I am observing that if the node where the Replica is running restarts, the data is not visible after the restart. I am guessing that data is passed to the replica to write, which in turn puts it into ext4, but before ext4 can sync the data to disk, the node reboots and the data never gets pushed down to the EBS disks.
Is there a way out of this? I am using OpenEBS with Jiva. My stack is: MySQL -> ext4 (iSCSI volume) -> Replica -> ext4 (block disks, e.g. Amazon EBS).

Here is an article on lwn.net that discusses at length the potential data loss that occurs when a program fails to do adequate syncing (otherwise known as crash consistency) on ext4 (the comments discussion is enlightening as well).
ext3 apparently achieves better crash consistency when using data=ordered because it forces data to disk before metadata changes are committed to the journal, and it uses a default commit period of 5 seconds. ext4, in contrast, trades this away for performance: it uses a delayed physical block allocation model, which causes uncommitted data to continue living in the cache for some time. A quote from the article:
The kernel doesn't like to let file data sit unwritten for too long, but it can still take a minute or so (with the default settings) for that data to be flushed - far longer than the five seconds normally seen with ext3
So unwritten data can exist only in a volatile cache until it is forced to disk by a system-wide sync OR by an application's explicit fsync of its own data (as Jeffery has pointed out). If the application/client doesn't do this, we are more prone to data loss.
One way of mitigating this issue is to mount the required filesystem with the sync option (see this "ext4 and data loss" discussion thread), and to do so we have to mandate it in two places:
The mount into the pod
The OpenEBS storage pool OR the backend store
(In case of 1, we could have the target convert all writes to sync, as explained by Jeffery)
While the mount(8) documentation states that -o sync is only supported up to ext3 (among the ext family of filesystems), a manual filesystem mount with this option is still accepted. To check whether this is something the mount protocol allows but ext4 ignores, I ran a small fio-based random-write performance test over a 256M data sample, once on a disk mounted with the sync option and once on the same disk without it. To ensure that the writes themselves were not SYNC writes, the libaio ioengine was selected with direct=1 and iodepth=4 (asynchronous, unbuffered I/O). The results showed a difference of around 300+ IOPS (with the non-sync mount performing better, of course). This suggests that the sync mount flag does play a role, but I'm still looking for more conclusive proof.

Related

Hey could someone help me understand sync syscall usage?

Like the title says, I don't really understand the usage of this syscall. I was writing a program that writes some data to a file, and the tutorial I followed told me to use the sys_sync syscall. But my problem is: why and when should we use this? Isn't the data already written to the file?
The manual says:
sync - Synchronize cached writes to persistent storage
So it is written to the file cache in memory, not on disk.
You rarely have to use sync unless you are writing really important data and need to make sure that data is on disk before you go on. One example of systems that use sync a lot are databases (such as MySQL or PostgreSQL).
So in other words, the data is theoretically in your file, just not on disk; if you lose power, you could lose it. Especially if you have a lot of RAM and many writes in a row, the kernel may keep the writes in the cache for a long while, increasing the risk of data loss.
But how can a file not be on the disk? I understand the concept of a cache, but if I wrote to the disk, why would the data be somewhere else?
First, when you write to a file, you send the data to the kernel; you don't send it directly to the disk. A kernel driver is then responsible for writing the data to disk. In my days on Apple II and Amiga computers, I would actually read/write to disk directly. At least the Amiga had DMA, so you could set up a buffer, tell the disk I/O to do a read or a write, and it would send you an interrupt when done. On the Apple II, you had to write loops in assembly language with precise timings to read/write data on floppy disks... a different era!
You could, of course, still access the disk directly (though with a kernel like Linux, you'd have to convince the kernel to hand the disk over to you...).
Cache is primarily used for speed. It is very slow to write to disk (as far as a human is concerned, it looks extremely fast, but compared to how much data the CPU can push to the drive, it's still slow).
So what happens is that the kernel has a task that writes data to disk. That task wakes up as soon as data appears in the cache and ends once all the caches are flushed to disk. It runs in parallel with your application, and you can have one such task per drive (which is especially useful with a setup such as RAID 1).
If your application fills up the cache, then a further write will block until some of the cache can be replaced.
and the tutorial I've seen told me to use sys_sync syscall
Well that sounds silly, unless you're doing filesystem write benchmarking or something.
If you have one really critical file that you want to make sure is "durable" wrt. power outages before you do something else (like sending a network packet to acknowledge a complete transfer), use fsync(fd) to sync just that one file's data and metadata.
(In asm, call number SYS_fsync from sys/syscall.h, with the file descriptor as the first register arg.)
But my problem is why and when should we use this?
Generally never use the sync system call in programs you're writing.
There are interactive use-cases where you'd normally use the wrapper command of the same name, sync(1). e.g. with removable media, to get the kernel started doing write-back now, so unmount will take less time once you finish typing it. Or for some benchmarking use-cases.
The system shutdown scripts may run sync after unmounting filesystems (and remounting / read-only), before making a reboot(2) system call.
Re: why sync(2) exists
No, your data isn't already on disk right after echo foo > bar.txt.
Most OSes, including Linux, do write-back caching, not write-through, for file writes.
You don't want write() system calls to wait for an actual magnetic disk when there's free RAM, because the traditional way to do I/O is synchronous: simple single-threaded programs wouldn't be able to do anything else (like reading more data or computing anything) while waiting for write() to return. Blocking for ~10 ms on every write system call would be disastrous; that's as long as a whole scheduler timeslice. (It would still be bad even with SSDs, but of course OSes were designed before SSDs were a thing.) Even just queueing up the DMA would be slow, especially for small file writes that aren't a whole number of aligned sectors, so even letting the disk's own write-back caching work wouldn't be good enough.
Therefore, file writes do create "dirty" pages of kernel buffers that haven't yet been sent to the disk. Sometimes we can even avoid the IO entirely, e.g. for tmp files that get deleted before anything triggers write-back. On Linux, dirty_writeback_centisecs defaults to 1500 (15 seconds) before the kernel starts write-back, unless it's running low on free pages. (Heuristics for what "low" means use other tunable values).
If you really want writes to flush to disk immediately and wait for data to be on disk, mount with -o sync. Or for one program, have it use open(O_SYNC) or O_DSYNC (for just the data, not metadata like timestamps).
See Are file reads served from dirtied pages in the page cache?
There are other advantages to write-back, including delayed allocation even at the filesystem level. The FS can wait until it knows how big the file will be before even deciding where to put it, allowing better decisions that reduce fragmentation. e.g. a small file can go into a gap that would have been a bad place to start a potentially-large file. (It just has to reserve space to make sure it can put it somewhere.) XFS was one of the first filesystems to do "lazy" delayed allocation, and ext4 has also had the feature for a while.
https://en.wikipedia.org/wiki/XFS#Delayed_allocation
https://en.wikipedia.org/wiki/Allocate-on-flush
https://lwn.net/Articles/323169/

What does O_DIRECT really mean?

If I open a file with O_DIRECT flag, does it mean that whenever a write(blocking mode) to that file returns, the data is on disk?
(This answer pertains to Linux - other OSes may have different caveats/semantics)
Let's start with the sub-question:
If I open a file with O_DIRECT flag, does it mean that whenever a write(blocking mode) to that file returns, the data is on disk?
No (as Michael Foukarakis commented) - if you need a guarantee that your data made it to non-volatile storage, you must use/add something else.
What does O_DIRECT really mean?
It's a hint that you want your I/O to bypass the Linux kernel's caches. What will actually happen depends on things like:
Disk configuration
Whether you are opening a block device or a file in a filesystem
If using a file within a filesystem: the exact filesystem used and the options in use on the filesystem and the file
Whether you've correctly aligned your I/O
Whether a filesystem has to do a new block allocation to satisfy your I/O
If the underlying disk is local, what layers you have in your kernel storage stack before you reach the disk block device
Linux kernel version
...
The list above is not exhaustive.
In the "best" case, setting O_DIRECT will avoid making extra copies of data while transferring it, and the call will return after the transfer is complete. You are more likely to be in this case when directly opening block devices of "real" local disks. As previously stated, even this property doesn't guarantee that the data of a successful write() call will survive sudden power loss. If the data is DMA'd out of RAM to non-volatile storage (e.g. a battery-backed RAID controller), or the RAM itself is persistent storage, then you may have a guarantee that the data reached stable storage that can survive power loss. To know whether this is the case you have to qualify your hardware stack, so you can't assume it in general.
In the "worst" case, O_DIRECT can mean nothing at all even though setting it wasn't rejected and subsequent calls "succeed". Sometimes things in the Linux storage stack (like certain filesystem setups) can choose to ignore it because of what they have to do or because you didn't satisfy the requirements (which is legal) and just silently do buffered I/O instead (i.e. write to a buffer/satisfy read from already buffered data). It is unclear whether extra effort will be made to ensure that the data of an acknowledged write was at least "with the device" (but in the O_DIRECT and barriers thread Christoph Hellwig posts that the O_DIRECT fallback will ensure data has at least been sent to the device). A further complication is that using O_DIRECT implies nothing about file metadata so even if write data is "with the device" by call completion, key file metadata (like the size of the file because you were doing an append) may not be. Thus you may not actually be able to get at the data you thought had been transferred after a crash (it may appear truncated, or all zeros etc).
While brief testing can make it look like O_DIRECT alone always puts data on disk by the time a write returns, changing things (e.g. using an ext4 filesystem instead of XFS) can weaken what is actually achieved in very drastic ways.
As you mention wanting a "guarantee that the data" (rather than the metadata) is written, perhaps you're looking for O_DSYNC/fdatasync()? If you want to guarantee the metadata was written too, you will have to look at O_SYNC/fsync().
References
Ext4 Wiki: Clarifying Direct IO's Semantics. Also contains notes about what O_DIRECT does on a few non-Linux OSes.
The "[PATCH 1/1 linux-next] ext4: add compatibility flag check to the patch" LKML thread has a reply from Ext4 lead dev Ted Ts'o talking about how filesystems can fallback to buffered I/O for O_DIRECT rather than failing the open() call.
In the "ubifs: Allow O_DIRECT" LKML thread Btrfs lead developer Chris Mason states Btrfs resorts to buffered I/O when O_DIRECT is requested on compressed files.
ZFS on Linux commit message discussing the semantics of O_DIRECT in different scenarios. Also see the (at the time of writing mid-2020) proposed new O_DIRECT semantics for ZFS on Linux (the interactions are complex and defy a brief explanation).
Linux open(2) man page (search for O_DIRECT in the Description section and the Notes section)
Ensuring data reaches disk LWN article
Infamous Linus Torvalds O_DIRECT LKML thread summary (for even more context you can see the full LKML thread)

Using a hard disk without filesystem for big data

I'm working on a web crawler and have to handle big data (about 160 TB of raw data in trillions of data files).
The data should be stored sequentially as one big bz2 file on the magnetic hard disk. An SSD is used to hold the metadata. The most important operation on the hard disk is a sequential read over all 4 TB of the disk, which should happen at the disk's full maximum speed of 150 MB/s.
I want to avoid the overhead of a file system and instead use the "/dev/" device files directly. Does this access use the OS block buffer? Are the access operations queued, or synchronous in a FIFO style?
Is it better to use the raw device files or to write your own user-level file system?
Does anyone have experience with this?
If you don't use any file system but read your disk device (e.g. /dev/sdb) directly, you lose all the benefits of the file system cache. I am not at all sure it is worthwhile.
Remember that you could use syscalls like readahead(2) or posix_fadvise(2) or madvise(2) to give hints to the kernel to improve performance.
Also, when making your file system, you might use a larger-than-usual block size. And don't forget to use big blocks (e.g. 64 to 256 KiB) when read(2)-ing data. You could also use mmap(2) to get the data from disk.
I would recommend against coding your own file system. Existing file systems are quite well tuned (and some are used on petabytes of storage). You may want to choose big blocks when making them (e.g. -b with mke2fs(8)...).
BTW, choosing between a filesystem and raw disk access is mostly a configuration issue (you pass a /dev/sdb path if you want the raw disk, and /home/somebigfile if you want a file). You could code the webcrawler to support both, then benchmark both approaches. Very likely, performance will depend upon the actual system and hardware.
As a case in point, relational database engines often used raw disk partitions in the previous century (e.g. the 1990s) but seem to mostly use big files today.
Remember that the real bottleneck is the hardware (i.e. the disk): the CPU time used by filesystems is often insignificant and can be hard to even measure.
PS. I have not much real recent experience with these issues.

Linux file system automatically backed by disk but hosted entirely in memory?

I have to do lots of small random accesses to a whole bunch of files. I have more than enough main memory to hold all of the data.
When I copy the data over to a temporary ramfs filesystem and process it there, this takes only a small fraction of the time that waiting for disk access would take.
Is there a Linux file system which holds all of its data in main memory, writes any changes to a backing disk, but never touches the disk for any reads?
If not, can, say, ext3 caches be tuned so that they are guaranteed to hold 100% of data and metadata?
If you are only reading data, then you can indeed tune caching such that all data will be cached in RAM - see Documentation/sysctl/vm.txt in the kernel source for details of what you can tweak here. The problem arises when you write data, particularly if you use fsync() or similar to ensure the data has been committed to the actual disk.
As the OS has to update the disk in the case of an fsync(), there's not much you can do if you still want to ensure your data is consistent and wouldn't be lost in a power cut.
One problem you might be running into is atime, the access time - by default, every time a file is accessed, the access time is updated in the inode. This will cause disk writes even when you think you are only performing reads, and it can be a particular problem in your scenario where you are accessing many small files. If you don't care about tracking access times, you can mount your filesystem with the noatime option to disable this 'feature'.
Why don't you try to create a RAID mirror between a ramdisk and a physical disk ?
Not sure if it's efficient though. If the mirror must always be synchronized, it will have to wait for the disk anyway when you write, but for reading you should gain something.
But yeah, to me it looks very much like a complicated reinvention of the I/O-caching wheel (a square one, at that) :)
Would be a nice experiment, though.
take a look at this :
http://freecode.com/articles/virtual-filesystem-building-a-linux-filesystem-from-an-ordinary-file
You can mount a file as a filesystem inside a RAM disk, then back it up as a single file.
I'm not sure you want to back it up frequently, but it is a good way to save the whole virtual disk in one go.

SD card write performance

I am writing a little application, which is writing jpeg images at a constant rate on a SD card.
I chose an ext3 filesystem, but the same behaviour was observed with an ext2 filesystem.
My writing loop looks like this :
get_image()
fwrite()
fsync()
Or like this :
get_image()
fopen()
fwrite()
fsync()
fclose()
I also display some timing statistics, and I can see that my program is sometimes blocked for several seconds.
The average rate is still good, because if I keep the incoming images in a FIFO, I can write many images in a short period of time after such a stall. Do you know whether this is a problem with the OS or related to the SD card itself?
How could I move closer to realtime ? I don't need strong realtime, but being stalled for several seconds is not acceptable.
Some precision :
Yes, it is necessary to fsync after every file, because I want the image to be on disk, not in some user or kernel buffer. Without fsyncing I get much better throughput, but still unacceptable stalls. I don't think it is a buffer problem, since the first stall happens after 50 MB have been written. And according to the man page, fsync is there precisely to ensure there is no data left buffered.
Precision regarding the average write rate :
I am writing at a rate that is sustainable by the card I am using. If incoming images pile up while waiting for an fsync to complete, then after the stall the write transfer rate increases and I quickly get back to the average rate.
The average transfer rate is around 1.4 MBytes /s.
The system is a modern laptop running Ubuntu 8.04 with a stock kernel (2.6.24.19).
Try opening the file with O_DIRECT and do the caching at the application level.
We hit a similar issue when implementing a PVR (Personal Video Recorder) feature on a set-top box, and the O_DIRECT trick finally satisfied our need.(*)
Without O_DIRECT, the data from write() is first cached in the kernel buffer and then flushed to the media when you call fsync or when the kernel cache buffer is full.(**)
With O_DIRECT, the kernel will DMA directly to the physical memory pointed to by the userspace buffer passed to the write syscall. So no CPU or memory bandwidth is spent on copies between userspace memory and the kernel cache, and no CPU time is spent in the kernel managing the cache (cache lookups, per-page locks, etc.). (copied from here)
Not sure it will also solve your problem, but you might want to give it a try.
(*) Despite Linus's criticism of O_DIRECT, it did solve our problems.
(**) Assuming you did not open the file with O_DSYNC or O_SYNC.
Is it necessary to fsync() after every file? You may have better luck letting the OS decide when a good time is to write out all enqueued images to the SD card (amortizing the startup cost of manipulating the SD card filesystem over many images, rather than incurring it for every image).
Can you provide some more details about your platform? Slow I/O times may have to do with other processes on the system, a slow I/O controller, etc..
You might also consider using a filesystem more suited to how flash memory works. FAT32 is more common than extN, but a filesystem specifically built for SD may be in order as well. JFFS is a good example of this. You will probably get better performance with a filesystem designed for flash (as opposed to spinning magnetic media), and you get better wear-leveling (and thus device lifetime/reliability) properties as well.
AFAIK some flash disks have really bad write performance (especially cheap brands). So if you measure your application's write speed (including the time required for fsync), what do you get? It might easily be on the order of a very few megabytes per second, simply because the hardware doesn't do better.
Also, writing can apparently be much slower if you write many small blocks instead of one big block (a flash disk might manage only about 10 writes per second in bad cases). This is something the kernel buffer would normally mitigate, so fsyncing frequently might be slowing the writing down...
Btw. did you measure write performance on FAT32? I would guess it is about the same, but if not, maybe there's some optimization still available?
I'm not very knowledgeable in this area, but the symptoms you describe sound an awful lot like filling up a buffer. You may be filling a buffer in the file writer or in the I/O device communicating with the SD card itself. You then have to wait until it actually writes the data to the card (thus emptying the buffer) before you can write more. SD cards are not particularly fast writers. If you can find a way to check if data is actually being written to the card during these pauses, that would verify my theory. Some card readers have an LED that blinks when data is being accessed -- that would probably be a good indicator.
Just a hunch... take it with some salt :)
May be this will help - Benchmarking Filesystems:
...I was quite surprised how slow ext3 was overall, as many distributions use this file system as their default file system...
And "ext3 fsync batching":
...This patch measures the time it takes to commit a transaction to the disk, and sleeps based on the speed of the underlying disk.
For anyone reading this and using a kernel above 2.6.28, the recommendation is to use ext4 instead of ext3, which is a filesystem that you can tune for better performance. The best performance is obtained in data=writeback mode, where data is not journaled. Read the Data Mode section from https://www.kernel.org/doc/Documentation/filesystems/ext4.txt.
If you have a partition already created, say /dev/sdb1, then these are some steps that can be used to format it with ext4 without journaling:
mkfs.ext4 /dev/sdb1 -L jp # Creates the ext4 filesystem
tune2fs -o journal_data_writeback /dev/sdb1 # Set to writeback mode
tune2fs -O ^has_journal /dev/sdb1 # Disable journaling
sudo e2fsck -f /dev/sdb1 # Filesystem check is required
Then, you can mount this partition (or add an entry to /etc/fstab if you know what you're doing) with the corresponding flags:
mount -t ext4 -o noatime,nodiratime,data=writeback /dev/sdb1 /mnt/sd
Moving from ext3 to an optimized ext4 filesystem should make a drastic difference. And, of course, a quicker SD card (e.g. Class 10) should help as well.
See also https://developer.ridgerun.com/wiki/index.php/High_performance_SD_card_tuning_using_the_EXT4_file_system
You might also consider the SD card itself: is it NOR or NAND? This page shows an order-of-magnitude difference between SD cards (2 MB/s vs 20 MB/s).
http://www.robgalbraith.com/bins/camera_multi_page.asp?cid=6007-9597
I think ZFS is optimized for flash memory.
