I want to set up a DRBD active/active configuration with two nodes. My application will be doing I/O directly on the DRBD device. I haven't seen any option to enable caching within DRBD.
Is there any Linux module that would allow me to set up a cache between DRBD and the disk module? Any caching above the DRBD module can lead to stale data being read by the nodes.
DRBD itself has three protocol types (A, B, and C) with varying guarantees. You could try using B or even A to reduce how long a write blocks on replication to the peer. However, all three types block until the local write has completed.
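For reference, the protocol is chosen per resource in drbd.conf. A minimal sketch, with hypothetical node names, addresses, and devices (the exact section layout varies slightly between DRBD versions):

    resource r0 {
        protocol B;             # A = async, B = semi-sync, C = fully sync
        device /dev/drbd0;
        disk /dev/sdb1;         # example backing disk
        meta-disk internal;
        on node1 { address 10.0.0.1:7788; }
        on node2 { address 10.0.0.2:7788; }
    }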
As for caching writes to the disk explicitly, this SO question might provide further pointers on what is possible. dannysauer's answer in particular looks interesting.
If I browse the Distributed File System (DFS) shared folder, I can create a file and watch it replicate almost immediately across to the other office's DFS share. Accessing the shares is nearly instant, even across the broadband links.
I would like to improve the read/write speed. Any tips would be much appreciated.
Improving the hardware always helps, but keep in mind that in any distributed file system the performance of the parent host has an influence. Besides that, in many cases you can't touch the hardware, so you need to optimize the network or tune your systems to best fit your current provider's architecture.
An example of this, mainly in virtualized environments, is disabling TCP segmentation offload on the network cards (on FreeBSD, for instance, ifconfig_DEFAULT="SYNCDHCP -tso" in /etc/rc.conf): it will considerably improve throughput, but at the cost of more CPU usage.
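If you want to test this before committing it to rc.conf, TSO can usually be toggled at runtime as well (the interface name em0 here is just an example):

    ifconfig em0 -tso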
Depending on how far you want to go you can start all these optimizations from the very bottom:
creating your custom lean kernel/image
test/benchmark network settings with iperf (see the sketch after this list)
fine-tune your FS; if using ZFS, here are some guides:
http://open-zfs.org/wiki/Performance_tuning
https://wiki.freebsd.org/ZFSTuningGuide
performance impact when using Solaris ZFS lz4 compression
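As a sketch of the network benchmarking step (hostnames are placeholders), run iperf in server mode on one node and point a client at it from another:

    iperf -s                                  # on the server node
    iperf -c server.your.domain.com -t 30     # on the client, 30-second test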
Regarding MooseFS, there are some threads about how the block size affects I/O performance and how, in many cases, disabling the cache allows blocks larger than 4k.
Mainly for FreeBSD we added a special cache option for the MooseFS client
called DIRECT.
This option has been available in the MooseFS client since version 3.0.49.
To disable the local cache and enable DIRECT communication, use this
option during mount:
mfsmount -H mfsmaster.your.domain.com -o mfscachemode=DIRECT /mount/point
In most filesystems the speed factors are the type of access (sequential or random) and the block size. Hardware performance is also a factor on MooseFS. You can improve speed by improving hard drive performance (for example, switching to SSDs), network topology (network latency), and network capacity.
I understand that a general cache-coherence protocol is meant to maintain consistency between multiple local copies (caches) of shared data.
What I don't understand is what it means for a cache-coherence protocol to be directory-based.
Thanks.
In simplified terms, a directory-based cache-coherence system means that cache-coherence management is centralized: it is managed by a single unit, the directory.
The directory holds the state of every memory block and manages requests for these blocks from the nodes (processors). For instance, if a node would like to read a block into its cache, it must ask permission from the directory. The directory then checks whether any other nodes hold the block and, if necessary, forces them to write it back or invalidate their copies first.
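As a rough illustration (a deliberately simplified sketch, not any particular real protocol), the directory might keep per-block state like this:

    /* Simplified MSI-style directory entry; real protocols track more
       state and must handle races between concurrent requests. */
    #include <stdint.h>

    enum block_state { INVALID, SHARED, MODIFIED };

    struct dir_entry {
        enum block_state state; /* coherence state of this memory block */
        uint64_t sharers;       /* bit i set => node i has a cached copy */
    };

    void on_read_request(struct dir_entry *e, int node)
    {
        if (e->state == MODIFIED) {
            /* ask the current owner to write the block back (not shown) */
        }
        e->state = SHARED;                 /* block is now shared */
        e->sharers |= UINT64_C(1) << node; /* record the new reader */
    }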
I heard that there's a huge improvement when Cassandra can write its log files to one disk and the SSTables to another. I have two disks, and if I were running Linux I would mount each at a different path and configure Cassandra to write on those.
What I would like to know is how to do that in ZFS and SmartOS.
I'm a complete newbie with SmartOS, and from what I understood, I add the disks to the storage pool; are they then managed as one?
psanford explained how to use two disks, but that's probably not what you want here. That's usually recommended to work around deficiencies in the operating system's I/O scheduling. ZFS has a write throttle to avoid saturating disks[0], and SmartOS can be configured to throttle I/Os to ensure that readers see good performance when some users (possibly the same user) are doing heavy writes[1]. I'd be surprised if the out-of-the-box configuration wasn't sufficient, but if you're seeing bad performance, it would be good to quantify that.
[0] http://dtrace.org/blogs/ahl/2014/02/10/the-openzfs-write-throttle/
[1] http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/
By default SmartOS aggregates all your disks together into a single ZFS pool (SmartOS names this pool "zones"). From this pool you create ZFS datasets which can either look like block devices (used for KVM virtual machines) or as filesystems (used for SmartOS zones).
You can set up more than one pool in SmartOS, but you will have to do it manually. The Solaris documentation is still quite good and applicable to modern Illumos distributions (including SmartOS); Chapter 4 has all the relevant information for creating a new ZFS pool, but it can be as simple as:
zpool create some_new_pool_name c1t0d0 c1t1d0
This assumes that you have access to the global zone.
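From there, one way to split Cassandra's two write paths across pools could look like this (pool, dataset, and mountpoint names are only examples; point commitlog_directory and data_file_directories in cassandra.yaml at the resulting mountpoints):

    zfs create -o mountpoint=/var/cassandra/commitlog some_new_pool_name/commitlog
    zfs create -o mountpoint=/var/cassandra/data zones/cassandra-data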
If I were running a Cassandra cluster on bare metal and I wanted to benefit from things like ZFS and DTrace I would probably use OmniOS instead of SmartOS. I don't want any contention for resources with my database machines, so I wouldn't run any other zones or VMs on that hardware (which is what SmartOS is really good at).
Scenario A:
To share a read/write block of memory between two processes running on the same host, Joe mmaps the same local file from both processes.
Scenario B:
To share a read/write block of memory between two processes running on two different hosts, Joe shares a file via nfs between the hosts, and then mmaps the shared file from both processes.
Has anyone tried Scenario B? What are the extra problems that arise in Scenario B that do not apply to Scenario A?
mmap will not share data without some additional action.
If you change data in an mmapped part of the file, the changes are stored only in memory at first. They will not be flushed to the filesystem (local or remote) until msync or munmap is called, or until the OS kernel and its filesystem decide to write them back on their own.
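A minimal sketch of that write-then-flush step (the path is just an example, and error handling is abridged):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/nfs/shared.dat", O_RDWR); /* example NFS path */
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (fd < 0 || p == MAP_FAILED)
            return 1;

        memcpy(p, "hello", 5);   /* so far only in this host's page cache */
        msync(p, len, MS_SYNC);  /* force the dirty pages out to the file */

        munmap(p, len);
        close(fd);
        return 0;
    }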
When using NFS, locking and storing data will be slower than with a local FS. Flush timeouts and the duration of file operations will vary too.
On the sister site, people say that NFS may have a poor caching policy, so there will be many more I/O requests to the NFS server compared to a local FS.
You will need byte-range locks for correct behavior. They are available in NFS >= v4.0.
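On the client side those are the ordinary POSIX fcntl locks; a sketch of taking an exclusive lock over the first 4 KiB before touching the mapped region:

    #include <fcntl.h>

    int lock_first_page(int fd)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,  /* exclusive write lock */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 4096,
        };
        return fcntl(fd, F_SETLKW, &fl); /* waits until the lock is granted */
    }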
I'd say Scenario B has all kinds of problems (assuming it works as suggested in the comments). The most obvious are the standard concurrency issues: two processes sharing one resource with no form of locking. That could lead to problems. I'm not sure whether NFS has its own peculiar quirks in this regard or not.
Assuming you can get around the concurrency issues somehow, you are now reliant on maintaining a stable (and speedy) network connection. Obviously if the network drops out, you might miss some changes. Whether this matters depends on your architecture.
My thought is that it sounds like an easy way to share a block of memory between different machines, but I can't say I've heard of it being done, which makes me think it isn't so good. When I think of sharing data between processes, I think of databases, messaging, or a dedicated server. In this case, if you made one process the master (to handle concurrency and own the concept, i.e. whatever it says is the best copy of the data), it might work...
Say I have a 1 TB data file mmapped read/write from the locally mounted HDD filesystem of a "master" Linux system into the virtual address space of a process running on this same "master" system.
I have 20 dedicated "slave" Linux servers connected across a gigabit switch to the "master" system. I want to give random read access to this 1 TB on these "slave" servers by mmapping it read-only into their process address spaces.
My question is: what is the most efficient way of synchronizing (perhaps lazily) the dataset from the master system to the slave systems? (For example, is it possible to mount the file over NFS and then mmap it from there? If yes, is this the best solution? If not, what is?)
I have been playing around with an idea like this at work recently (granted, this was with significantly smaller file sizes). I believe NFS would be fine for reads, but you might hit problems with concurrent writes. Provided you have only one "writer", your idea should work reasonably well. If the data file is structured, I'd recommend going for a distributed cache of some description and allowing multiple copies of the data spread across the cluster (for redundancy).
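For the NFS route, a minimal sketch (paths and hostnames are placeholders): export the data directory read-only from the master, mount it on each slave, and then mmap the file under the mountpoint with PROT_READ:

    # on the master, in /etc/exports:
    /data *(ro,sync)

    # on each slave:
    mount -t nfs master:/data /mnt/data -o ro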
In the end we went for a SAN and clustered file system solution (in our case Symantec VCS, but any generic clustered filesystem would do). The reason we did this is because we couldn't get the performance we required from using pure NFS. The clustered file system you choose would need to support mmap properly and a distributed cache.