Scenario A:
To share a read/write block of memory between two processes running on the same host, Joe mmaps the same local file from both processes.
Scenario B:
To share a read/write block of memory between two processes running on two different hosts, Joe shares a file via nfs between the hosts, and then mmaps the shared file from both processes.
Has anyone tried Scenario B? What extra problems arise in Scenario B that do not apply to Scenario A?
mmap will not share data without some additional steps.
If you change data in the mmapped part of the file, the changes are initially stored only in memory. They will not be flushed to the filesystem (local or remote) until msync or munmap or close, or until the OS kernel and its FS decide to write them back.
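For illustration, a minimal sketch of that flush boundary (the file name and the 4096-byte length are made-up placeholders, and error checking is omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Sketch only: "shared.bin" and the 4096-byte length are placeholder values. */
    int fd = open("shared.bin", O_RDWR);
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    p[0] = 42;                 /* the change lives in the page cache first */
    msync(p, 4096, MS_SYNC);   /* force the dirty page back to the file now */

    munmap(p, 4096);           /* unmapping/closing also eventually writes it back */
    close(fd);
    return 0;
}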
When using NFS, locking and storing of data will be slower than with a local FS. Flush timing and the duration of file operations will vary too.
On the sister site, people say that NFS may have a poor caching policy, so there will be many more I/O requests to the NFS server than there would be with a local FS.
You will need byte-range locks for correct behavior. They are available in NFS >= v4.0.
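By byte-range locks I mean fcntl()-style POSIX locks, which NFSv4 can carry to the server. A rough sketch (the offsets and length are placeholders, not a prescription):

#include <fcntl.h>
#include <unistd.h>

/* Sketch: take an exclusive fcntl (POSIX) lock on the first 4096 bytes of fd
   before touching the corresponding mapped region. */
int lock_region(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;             /* exclusive write lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 4096;
    return fcntl(fd, F_SETLKW, &fl); /* blocks until the byte range is free */
}

int unlock_region(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 4096;
    return fcntl(fd, F_SETLK, &fl);
}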
I'd say scenario B has all kinds of problems (assuming it works as suggested in the comments). The most obvious is the standard concurrency issue: two processes sharing one resource with no form of locking, etc. That could lead to problems... I'm not sure whether NFS has its own peculiar quirks in this regard or not.
Assuming you can get around the concurrency issues somehow, you are now reliant on maintaining a stable (and speedy) network connection. Obviously if the network drops out, you might miss some changes. Whether this matters depends on your architecture.
My thought is that it sounds like an easy way to share a block of memory between different machines, but I can't say I've heard of it being done, which makes me think it isn't that good. When I think of sharing data between processes, I think of DBs, messaging, or a dedicated server. In this case, if you made one process the master (to handle concurrency and own the concept, i.e. whatever this process says is the best copy of the data), it might work...
Related
I am using flock within an HPC application on a file system shared among many machines via NFS. Locking works fine as long as all machines behave as expected (Quote from http://en.wikipedia.org/wiki/File_locking: "Kernel 2.6.12 and above implement flock calls on NFS files using POSIX byte-range locks. These locks will be visible to other NFS clients that implement fcntl-style POSIX locks").
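(For context, the usage is essentially the following sketch; the lock file path is a placeholder and error checking is omitted.)

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    /* Sketch of the usage in question; "/nfs/shared/job.lock" is a placeholder path. */
    int fd = open("/nfs/shared/job.lock", O_RDWR | O_CREAT, 0644);

    flock(fd, LOCK_EX);   /* on kernels >= 2.6.12 this becomes a POSIX byte-range lock over NFS */
    /* ... critical section on the shared file system ... */
    flock(fd, LOCK_UN);

    close(fd);
    return 0;
}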
I would like to know what is expected to happen if one of the machines that has acquired a certain lock unexpectedly shuts down, e.g. due to a power outage. I am not sure where to look this up. My guess is that this is entirely up to NFS and its way of dealing with NFS handles of non-responsive machines. I could imagine that the other clients will still see the lock until a timeout occurs and the NFS server declares all NFS handles of the machine that timed out invalid. Is that correct? What would that timeout be? What happens if the machine comes up again within the timeout? Can you recommend a definitive reference to look all of this up?
Thanks!
When you use NFS v4 (!) the file will be unlocked when the server hasn't heard from the client for a certain amount of time. This lease period defaults to 90s.
There is a good explanation in the O'Reilly book about NFS and NIS, chapter 11.2. To sum up quickly: As NFS is stateless, the server has no way of knowing the client has crashed. The client is responsible for clearing the lock after it reboots.
Say I have a 1 TB data file mmapped read/write from the locally mounted HDD filesystem of a "master" Linux system into the virtual address space of a process running on that same "master" system.
I have 20 dedicated "slave" linux servers connected across a gigabit switch to the "master" system. I want to give random read access to this 1 TB on these "slave" servers by mmaping it read-only into their process address spaces.
My question is: what is the most efficient way of synchronizing (perhaps lazily) the dataset from the master system to the slave systems? (For example, is it possible to mount the file over NFS and then mmap it from there? If yes, is this the best solution? If no, what is?)
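(To make the NFS idea concrete, the slave side would look roughly like the sketch below; the mount path is a placeholder and error checking is omitted.)

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Sketch: read-only mapping of the NFS-mounted dataset on a slave.
       "/mnt/master/data.bin" is a placeholder path. */
    int fd = open("/mnt/master/data.bin", O_RDONLY);

    struct stat st;
    fstat(fd, &st);

    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    /* Random reads: pages are faulted in over NFS on first access and then
       cached locally in the slave's page cache. */
    char byte = data[12345];
    (void)byte;

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}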
I have been playing around with an idea like this at work recently (Granted this was with significantly smaller file sizes). I believe NFS would be fine for reads but you might hit problems with concurrent writes. Providing you have only one "writer" then your idea should work reasonably well. If the data file is structured, I'd recommend going for a distributed cache of some description and allowing multiple copies of the data spread across the cluster (for redundancy).
In the end we went for a SAN and clustered file system solution (in our case Symantec VCS, but any generic clustered filesystem would do). The reason we did this is because we couldn't get the performance we required from using pure NFS. The clustered file system you choose would need to support mmap properly and a distributed cache.
Take the following code snippet:
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* read the remote file in 1000-byte chunks */
char buffer[1000];
int f = open("/mnt/remoteserver/bar/foo.bin", O_RDONLY);
while (true)
{
    ssize_t bytesread = read(f, buffer, sizeof(buffer));
    if (bytesread > 0)
        ProcessBytes(buffer, bytesread);
    else
        break;
}
close(f);
In the example above, let's say the remote file, foo.bin, is 1 MB and has never been accessed by the client before. So that's approximately 1000 calls to read() to get the entire file.
Further, let's say the server with the directory mounted on the client is over the internet and not local. Fast bandwidth to the client, but with long latency.
Does every "read" call invoke a round trip back to the server to ask for more data? Or does the client/server protocol recognize that subsequent reads on a remote file are often sequential, and as such, subsequent blocks are pushed down before the application has actually made a read() call for it. Hence, subsequent read calls return faster because the data was pre-fetched and cached.
Do modern network file system protocols (NFS, SMB/Samba, any others?) make any optimizations like this. Are there network file system protocols tuned for the internet that have optimizations like this?
I'm investigating a personal project that may involve implementation of a network file system over the internet. It struck me that performance may be faster if the number of round trips could be reduced for file i/o.
This is going to be very dependent on the protocol implementation. In general, I don't think most client implementations prefetch, but most savvy storage admins use large block sizes (32+ KB; see the rsize/wsize mount options), which effectively results in the same thing. Network file systems are typically cached via the system's buffer cache as well, so you'll definitely not be translating read() calls directly into network I/O.
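If you want to nudge things from the application side, about all you can do portably is hint sequential access and read in large chunks; whether the NFS client acts on the hint is implementation dependent. A rough sketch (the path and buffer size are placeholders):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* Sketch: hint sequential access and read in large chunks so the kernel's
       readahead (and the rsize-sized NFS READs underneath) can stay ahead of us. */
    int fd = open("/mnt/remoteserver/bar/foo.bin", O_RDONLY);

    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);   /* advisory only; the client may ignore it */

    static char buffer[256 * 1024];
    ssize_t n;
    while ((n = read(fd, buffer, sizeof(buffer))) > 0)
        ;   /* ProcessBytes(buffer, n); */

    close(fd);
    return 0;
}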
My advice would be to write your program naively (or a simple test case), get comfortable reading the network stats via nfsstat, etc., and then optimize from there. There are far too many variables to get the answer any other way.
I'm no expert, but from what I can tell NFSv4 has more WAN optimizations than the older protocols (NFSv2, NFSv3, CIFS), so I'd definitely factor it into your mix. That said, most remote filesystem protocols aren't really designed for high-latency access, which is why we end up with systems like S3, which are.
We have two datacenters, each with a number of Linux servers that share a large EMC-based NFS filesystem.
The challenge is to keep the two NFS filesystems in sync. For the moment, assume that writes will only occur to nfs1, which then has to propagate the changes to nfs2.
Periodic generic rsyncs have proved too slow - each rsync takes several hours to complete, even with -az. We need to do specific syncs when a file or directory actually changes.
So then the problem is: how do we know when a file or directory has changed? inotify is the obvious answer, but it famously does not work with NFS. (There is some chatter about inotify possibly working if it is installed on the NFS server, but that isn't an option for us - we only have control of the clients, not the server.)
Does the Linux NFS client allow you to capture all the changes it sends to the server, in a logfile or otherwise? Or could we hack the client to do this? We could then collect the changes from each client and periodically kick off targeted rsyncs.
Any other ideas welcome. Thanks!
If you need to keep the two EMC servers in sync, it might be better to look into EMC-specific mirroring capabilities to achieve this. Typically these are block-based updates for high performance and low bandwidth utilization. For example, SnapMirror on NetApp can achieve this. I'm not as familiar with EMC, but a quick Google search revealed EMC MirrorView and EMC SRDF as possible options.
We have a number of embedded systems requiring r/w access to a filesystem which resides on flash storage with block-device emulation. Our oldest platform runs on compact flash, and these systems have been in use for over 3 years without a single fsck being run during bootup; so far we have had no failures attributed to the filesystem or the CF.
On our newest platform we used USB flash for the initial production and are now migrating to Disk-on-Module for r/w storage. A while back we had some issues with the filesystem on a lot of the devices running on USB storage, so I enabled e2fsck in order to see if that would help. As it turned out, we had received a shipment of bad flash memory, so once those were replaced the problem went away. I have since disabled e2fsck, since we had no indication that it made the system any more reliable, and historically we have been fine without it.
Now that we have started putting in Disk-on-Module units I've started seeing filesystem errors again. Suddenly the system is unable to read/write certain files and if I try to access the file from the emergency console I just get "Input/output error". I enabled e2fsck again and all the files were corrected.
O'Reilly's "Building Embedded Linux Systems" recommends running e2fsck on ext2 filesystems but does not mention it in relation to ext3 so I'm a bit confused to whether I should enable it or not.
What are your takes on running fsck on an embedded system? We are considering putting binaries on a r/o partition and only the files which has to be modified on a r/w partition on the same flash device so that fsck can never accidentally delete important system binaries, does anyone have any experience with that kind of setup (good/bad)?
I think the answer to your question relates more to what types of coherency requirements your application has for its data. That is, what has to be guaranteed if power is lost without a formal shutdown of the system? In general, none of the desktop-operating-system-type file systems handle this all that well unless the application specifically closes/syncs files and flushes the disk caches, etc., at key transaction points to ensure that what you need to maintain is in fact committed to the media.
Running fsck fixes the file-system, but without the above care there is no guarantee about which of your changes will actually be kept. I.e., it's not exactly deterministic what you'll lose as a result of the power failure.
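By "closing/syncing at key transaction points" I mean something like the usual write-to-temp, fsync, rename pattern. A rough sketch only (the paths and the commit_state name are placeholders, and error handling is trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of committing data at a "transaction point": write to a temp file,
   flush it, rename it into place, then flush the directory so the rename is
   durable too. */
int commit_state(const char *buf, size_t len)
{
    int fd = open("/data/state.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    write(fd, buf, len);
    fsync(fd);          /* force data and metadata out to the media */
    close(fd);

    rename("/data/state.tmp", "/data/state");   /* atomic replace of the old copy */

    int dirfd = open("/data", O_RDONLY | O_DIRECTORY);
    fsync(dirfd);       /* make the rename itself survive a power loss */
    close(dirfd);
    return 0;
}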
I agree that putting your binaries or other important read-only data on a separate read-only partition does help ensure that they can't erroneously get tossed due to an fsck correction to file-system structures. As a minimum, putting them in a different sub-directory off the root than where the R/W data is held will help. But in both cases, if you support software updates, you still need to have a scheme to deal with writing the "read-only" areas anyway.
In our application, we actually maintain a pair of directories for things like binaries and the system is setup to boot from either one of the two areas. During software updates, we update the first directory, sync everything to the media and verify the MD5 checksums on disk before moving onto the second copy's update. During boot, they are only used if the MD5 checksum is good. This ensures that you are booting a coherent image always.
Dave,
I always recommend running the fsck after a number of reboots, but not every time.
The reason is that ext3 is journaled. So unless you enable writeback mode (where the data itself is not journaled), most of the time your metadata/file-system structures should be in sync with your data (files).
But like Jeff mentioned, it doesn't guarantee anything about the layer above the file-system. That means you can still get "corrupted" files, because some of the records probably didn't get written to the file system.
I'm not sure what embedded device you're running on, but how often does it get rebooted?
If it's a controlled reboot, you can always do "sync;sync;sync" before restarting.
I've been using CF myself for years, and only on very rare occasions have I gotten file-system errors.
fsck does help in those cases.
And about separating your partitions, I doubt the advantage of it. For every data file on the file-system, there is metadata associated with it. Most of the time, if you don't change the files, e.g. binary/system files, then this metadata shouldn't change. Unless you have faulty hardware, like cross-talking writes & reads, those read-only files should be safe.
Most problems arise when you have something writable, and regardless of where you put it, it can cause problems if the application doesn't handle it well.
Hope that helps.