Is it possible to keep SQLite in normal locking mode and WAL journal mode without mmap? - multithreading

I am using SQLite in iOS development.
I need both the better performance of concurrent reading and writing and the robustness of WAL journal mode without mmap.
As far as I know, WAL journal mode without mmap requires EXCLUSIVE locking mode, which prevents concurrent reading and writing.
So I wonder: is it possible to keep SQLite in normal locking mode and WAL journal mode without mmap? If so, can you give me a general idea of how to implement it, no matter whether it needs changes to the source code or how difficult it is?
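For reference, here is a minimal C sketch (using the public sqlite3 API; the file name is illustrative) of the stock way to get WAL without the memory-mapped -shm wal-index, and why it conflicts with the concurrency goal: when locking_mode=EXCLUSIVE is set before WAL is first used, SQLite keeps the wal-index in heap memory instead of a shared mmap'd file, but then only that one connection may use the database.

    /* Hedged sketch: stock SQLite can run WAL without the mmap'd -shm
     * wal-index only when locking_mode=EXCLUSIVE is set before WAL is
     * entered; the wal-index then lives in heap memory. The trade-off is
     * that only this one connection can use the database, which is the
     * multithreading restriction asked about. File name is illustrative. */
    #include <sqlite3.h>

    int open_wal_without_mmap(sqlite3 **db) {
        if (sqlite3_open("app.db", db) != SQLITE_OK) return -1;
        /* Must come before journal_mode=WAL to get a heap-memory wal-index */
        sqlite3_exec(*db, "PRAGMA locking_mode=EXCLUSIVE;", 0, 0, 0);
        sqlite3_exec(*db, "PRAGMA journal_mode=WAL;", 0, 0, 0);
        return 0;
    }

Keeping normal locking while avoiding mmap would presumably mean replacing the way the wal-index is shared between connections, for example a custom VFS whose shared-memory methods (xShmMap and friends) are backed by something other than a memory-mapped file; that is roughly where a source-level change would have to go.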

Related

Does O_DIRECT bypass filesystem journaling?

The man page for open(2) only suggests that O_DIRECT bypasses the page cache, but many descriptions around the net describe it as causing the user buffer to be DMA'd straight to the drive. If this is the case I imagine it would also bypass journaling done by the filesystem (e.g. xfs, ext4, etc.). Is this the case?
I can't find anyone claiming one way or the other. It seems to me this would be consistent with O_DIRECT being used by databases -- the common example use for O_DIRECT is when an application like a database is doing its own caching in userspace, and similarly I can imagine databases doing their own transaction logs.
Does O_DIRECT bypass filesystem journaling?
Usually it does. However, file data usually doesn't go into a filesystem's journal anyway. More details below (but note this answer doesn't try to account for CoW filesystems):
Most Linux journaling filesystems (ext4 when data is set to writeback or ordered (the default), XFS, JFS, etc.) do not journal the data within files - they journal the consistency of the filesystem's data structures (metadata).
Filesystem journals only metadata (the typical case): data within files doesn't go into the journal anyway, so using O_DIRECT doesn't change this and the data continues not to go into the journal. However, O_DIRECT operations can still trigger metadata updates just like regular writes, but the initiating operation can return before the metadata has been updated. See the Ext4 wiki's Clarifying Direct IO's Semantics page for details.
Ext4 in data=journal mode: This is trickier - there's a warning that the outcome with O_DIRECT in data=journal mode might not be what is expected. From the "data=journal" section of ext4.txt:
Enabling this mode [data=journal] will disable delayed allocation and O_DIRECT support.
In this scenario O_DIRECT is treated as only a hint and the filesystem silently falls back to putting the data into the page cache (making it no longer direct!). So in this case, yes, the data will end up going into the journal and the journal won't be bypassed. See the "Re: [PATCH 1/1 linux-next] ext4: add compatibility flag check" thread for where Ted Ts'o articulates this. There are patches floating around ("ext4: refuse O_DIRECT opens for mode where DIO doesn't work") to make the filesystem return an error at open instead, but from what I can see these were rejected from the mainline kernel.
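For illustration, here is a hedged sketch of what a "direct" write typically looks like on Linux (file name and sizes are assumptions): the file data bypasses the page cache, but any metadata the write triggers still goes through the filesystem's normal journaling path, and an explicit fsync is still needed for durability.

    /* Hedged sketch: O_DIRECT requires the buffer, offset and length to be
     * aligned (commonly to the logical block size); misalignment returns
     * EINVAL. File name and sizes are illustrative. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) return 1;

        void *buf;
        /* 4096-byte alignment is a common safe choice for modern disks */
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
        memset(buf, 'x', 4096);

        /* Bypasses the page cache for the file data, but metadata updates
           (e.g. block allocation) still follow the normal journaling path. */
        if (pwrite(fd, buf, 4096, 0) != 4096) return 1;

        /* fsync is still needed to push metadata (and the device cache) out;
           O_DIRECT alone does not guarantee durability. */
        fsync(fd);
        close(fd);
        free(buf);
        return 0;
    }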

Writing consistent data to file in Linux

I want to write a library which logs data to a file. Unfortunately, my system suffers from unexpected reboots and power loss.
Does a Linux write operation to a file guarantee that my file will always contain consistent data? Does it guarantee "all or nothing"?
If so, is there a limitation on the size of the data being written?
Thanks.
When you mount the file system you can specify one of the options below. It seems like the third one suits my requirements.
This is what I found at
http://lxr.free-electrons.com/source/Documentation/filesystems/ext3.txt
Data Mode
There are 3 different data modes:
writeback mode
In data=writeback mode, ext3 does not journal data at all. This mode provides
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
mode - metadata journaling. A crash+recovery can cause incorrect data to
appear in files which were written shortly before the crash. This mode will
typically provide the best ext3 performance.
ordered mode
In data=ordered mode, ext3 only officially journals metadata, but it logically
groups metadata and data blocks into a single unit called a transaction. When
it's time to write the new metadata out to disk, the associated data blocks
are written first. In general, this mode performs slightly slower than
writeback but significantly faster than journal mode.
journal mode
data=journal mode provides full data and metadata journaling. All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all other modes.
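For reference, the data mode is chosen at mount time; here is a hedged example /etc/fstab entry selecting the full data journaling mode described above (the device and mount point are placeholders):

    # Illustrative only: mount an ext3 data partition with full
    # data+metadata journaling ("journal mode" above).
    /dev/sdb1  /var/data  ext3  defaults,data=journal  0  2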
You can never predict where a physical write operation will stop on a power outage, even if you use the journaling features of some filesystems. Note that the journal needs to get written too.

How to make the linux filesystem power failure safe?

I need a robust filesystem on Debian or Ubuntu Linux. The problem is that the system can be "shut down" by simply cutting the power (without a real shutdown).
After such a scenario I don't want a filesystem check or corrupted data on the root filesystem. I need to make sure that the system starts without problems the next time. What could be the solution?
Along with journaling, Soft Updates can also be a method. Regardless of the method you use, please keep in mind that these methods can only guarantee the consistency of the file system; neither of them can guarantee the safety of your data. The disk can be mounted more quickly in the case of soft updates.

is write() with O_DIRECT ACID compliant?

My database engine writes records of 64 bytes by issuing a write() syscall for the entire disk block. The device is opened in O_DIRECT mode. For example, the third record within a block starts at byte 128 and ends at byte 192; when I do an UPDATE, the entire disk block (which is 512 bytes by default) is written.
My question is, can I claim ACID compliance if I am writing the record over itself every time an UPDATE occurs? Usually database engines do this in 2 steps: writing the modified disk block to another (free) place and then updating an index to the new block with one (atomic) write immediately after the first write returns success. But I am not doing this; I am overwriting the current data with the new data, expecting the write to be successful. Does my method have any potential problems? Is it ACID compliant? What if the hardware writes only half of the block and my record is exactly in the middle? Or does the hardware already do the 2-step write process I described, but at block level, so I don't need to repeat the same in software?
(Note: no record is larger than a physical disk block (512 bytes by default) and fsync goes after each write(); this is for Linux only.)
ACID anticipates failures, and suggests ways to deal with them. Two-phase commits and three-phase commits are two fairly common and well-understood approaches.
Although I'm a database guy, the dbms frees me from having to think about this kind of thing very much. But I'd say overwriting a record without taking any other precautions is liable to fail the "C" and "D" properties ("consistent" and "durable").
To build really good code, imagine that your dbms server has no battery-backed cache, only one power supply, and that during a transaction there's a catastrophic failure in that one power supply. If your dbms can cope with that kind of failure fairly cleanly, I think you can call it ACID compliant.
Later . . .
I read Tweedie's transcript. He's not talking about database direct disk access; he's talking about a journaling filesystem. A journaling filesystem also does a two-phase commit.
It sounds like you're trying to reach ACID compliance (in the database sense) with a single-phase commit. I don't think you can get away with that.
Opening with O_DIRECT means "Try to minimize cache effects of the I/O to and from this file" (emphasis added). I think you'll also need O_SYNC. (But the linked kernel docs caution that most Linux filesystems don't implement POSIX semantics of O_SYNC. And both filesystems and disks have been known to lie about whether a write has hit a platter.)
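For what it's worth, a hedged one-function illustration of the flag combination being discussed (whether O_SYNC gives full POSIX synchronized-I/O semantics depends on the filesystem, per the kernel docs cited; the file name is a placeholder):

    #define _GNU_SOURCE          /* for O_DIRECT on Linux */
    #include <fcntl.h>

    /* Illustrative only: each write bypasses the page cache (O_DIRECT)
       and waits for data and metadata to reach the device (O_SYNC),
       subject to the filesystem caveats above. */
    int open_for_durable_direct_io(void) {
        return open("records.dat", O_WRONLY | O_DIRECT | O_SYNC);
    }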
There are two more cautions in the kernel docs. First, "It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default." You're not doing that. You're trying to use it to achieve ACID compliance.
Second,
"The thing that has always disturbed me about O_DIRECT is that the
whole interface is just stupid, and was probably designed by a
deranged monkey on some serious mind-controlling substances." -- Linus
SQLite has a readable paper on how it handles atomic commits: Atomic Commit in SQLite.
No.
You cannot assume the disk write will be successful. And you cannot assume that the disk will leave the existing data in place. Here is some QNX documentation also stating this.
If you get really, really unlucky, the disk power will fail while it is writing, leaving the block with corrupt checksums and half-written data.
This is why ACID systems use at least two copies of the data.
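A hedged sketch of what "at least two copies" can look like in practice (block size, file layout, and names are illustrative assumptions rather than the question's actual engine, and O_DIRECT alignment concerns are ignored): keep two on-disk slots for the block, write the new version into the slot that is not currently live, make it durable, and only then flip a small index record that says which slot is live.

    /* Hedged sketch of a two-step update: write the new block to a spare
     * slot, fsync, then repoint a small index block and fsync again.
     * Assumes block 0 holds the index and blocks 1 and 2 are the two
     * data slots; all names and sizes are illustrative. */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 512

    int update_block(int fd, const void *block, uint32_t *active_slot) {
        uint32_t spare = 1 - *active_slot;                 /* 0 <-> 1 */
        off_t data_off = (off_t)BLOCK_SIZE * (1 + spare);  /* skip index block */

        if (pwrite(fd, block, BLOCK_SIZE, data_off) != BLOCK_SIZE) return -1;
        if (fsync(fd) != 0) return -1;                     /* step 1 durable */

        char idx[BLOCK_SIZE] = {0};
        memcpy(idx, &spare, sizeof spare);
        if (pwrite(fd, idx, BLOCK_SIZE, 0) != BLOCK_SIZE) return -1;
        if (fsync(fd) != 0) return -1;                     /* step 2: flip pointer */

        *active_slot = spare;
        return 0;
    }

If the pointer flip itself can be torn, a checksum over the index block (or two copies of the index as well) is the usual extra safeguard.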
is write() with O_DIRECT ACID compliant?
No, this is not guaranteed in the general case. Here are some counterexamples for Durability:
O_DIRECT makes no guarantees that acknowledged data made it out of a volatile cache that is part of the device
O_DIRECT makes no guarantees about persistence of filesystem metadata that might be required to actually read back the (acknowledged) write data (e.g. in the case of appending writes)
My question is, can I claim ACID compliance if I am writing the record over itself every time UPDATE occurs?
In the general case, no. For example, a spec-compliant SCSI disk doesn't have to guarantee that you get only the old or only the new data if a crash happens mid-write (it's legal for it to return an error when reading that region until it is unconditionally overwritten). If you're doing a write to a file in a filesystem, things are even more complicated. A successful fsync() after the write(), before you issue new I/O, will help you know the write was stable, but it is not enough to ensure Atomicity (only old or new data) in the general case of an awkwardly timed power loss.
Does my method [assuming overwrites are perfectly atomic] have any potential problems?
Yes, see above. What you are doing may work as you wish in certain setups, but there's no guarantee it will work in all of them (even when they are "non-faulty" per their spec).
See this answer on "What does O_DIRECT really mean?" for further discussion.

Possible to implement journaling with a single fsync per commit?

Let's say you're building a journaling/write-ahead-logging storage system. Can you simply implement this by (for each transaction) appending the data (with write(2)), appending a commit marker, and then fsync-ing?
The scenario to consider is if you do a large set of writes to this log then fsync it, and there's a failure during the fsync. Are the inode direct/indirect block pointers flushed only after all data blocks are flushed, or are there no guarantees that blocks are being flushed in order? If the latter, then during recovery, if you see a commit marker at the end of the file, you can't trust that the data between it and the previous commit marker is meaningful. Thus you have to rely on another mechanism (involving at least another fsync) to determine what extent of the log file is consistent (e.g., writing/fsyncing the data, then writing/fsyncing the commit marker).
If it makes a difference, mainly wondering about ext3/ext4 as the context.
Note that Linux's and Mac OS's fsync and fdatasync are, by default, not guaranteed to flush the drive's volatile write cache, so they are incorrect for durability purposes. Windows is correct by default, but can emulate the Linux behavior for benchmarking purposes.
Also, fdatasync issues multiple disk writes if you append to the end of a file, since it needs to update the file inode with the new length. If you want to have one write per commit, your best bet is to pre-allocate log space, store a CRC of the log entries in the commit marker, and issue a single fdatasync() at commit. That way, no matter how much the OS / hardware reorder behind your back, you can find a prefix of the log that actually hit disk.
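A hedged sketch of that approach (the CRC function, record layout, and offsets are illustrative assumptions): the log file is preallocated so appends never change the inode's length, each commit record carries a CRC over its payload, and recovery scans forward, accepting records only while their CRCs check out.

    /* Hedged sketch: commit to a preallocated log with one fdatasync.
     * The CRC in the header lets recovery detect a torn or reordered
     * record, so write ordering inside the commit doesn't matter.
     * crc32() stands in for any checksum implementation. */
    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    extern uint32_t crc32(const void *buf, size_t len);   /* assumed helper */

    struct log_record_header {
        uint32_t length;   /* payload length */
        uint32_t crc;      /* CRC over the payload */
    };

    /* Appends one commit record at 'offset' in a log file that was
       preallocated (e.g. with fallocate), so the file length does not
       change and a single fdatasync suffices. */
    int log_commit(int fd, off_t offset, const void *payload, uint32_t len) {
        struct log_record_header hdr = { len, crc32(payload, len) };

        if (pwrite(fd, &hdr, sizeof hdr, offset) != (ssize_t)sizeof hdr)
            return -1;
        if (pwrite(fd, payload, len, offset + (off_t)sizeof hdr) != (ssize_t)len)
            return -1;
        return fdatasync(fd);   /* the single sync per commit */
    }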
If you want to use the log for durable commits or write ahead, things get harder, since you need to make sure that fsync actually works. Under Linux, you'll want to disable the disk write cache with hdparm, or mount the partition with barrier set to true. [Edit: I stand corrected, barrier doesn't seem to give the correct semantics. SATA and SCSI introduce a number of primitives, such as write barriers and native command queuing, that make it possible for operating systems to export primitives that enable write-ahead logging. From what I can tell from manpages and online, Linux only exposes these to filesystem developers, not to userspace.]
Paradoxically, disabling the disk write cache sometimes leads to better performance, since you get more control over write scheduling in user space; if the disk queues up a bunch of synchronous write requests, you end up exposing strange latency spikes to the application. Disabling write cache prevents this from happening.
Finally, real systems use group commit, and do < 1 sync write per commit with concurrent workloads.
There's no guarantee on the order in which blocks are flushed to disk. These days even the drive itself can re-order blocks on their way to the platters.
If you want to enforce ordering, you need to at least fdatasync() between the writes that you want ordered. All a sync promises is that when it returns, everything written before the sync has hit storage.
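A minimal hedged illustration of that rule (the descriptor, buffers, and offsets are placeholders): the fdatasync() between the two writes is what guarantees the data is on stable storage before the commit marker, in contrast to the single-sync, CRC-protected scheme sketched earlier.

    /* Illustrative only: enforce "data before commit marker" ordering.
       The fdatasync() between the two writes is the ordering guarantee. */
    #include <unistd.h>

    int ordered_commit(int fd, const void *data, size_t dlen, off_t doff,
                       const void *marker, size_t mlen, off_t moff) {
        if (pwrite(fd, data, dlen, doff) != (ssize_t)dlen) return -1;
        if (fdatasync(fd) != 0) return -1;     /* data is durable first ... */
        if (pwrite(fd, marker, mlen, moff) != (ssize_t)mlen) return -1;
        return fdatasync(fd);                  /* ... then the commit marker */
    }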
