I understand that a cache-coherence protocol, in general, maintains consistency between multiple local copies (caches) of shared data.
What I don't understand is what it means for a cache coherence protocol to be directory-based.
Thanks.
In simplified terms, a directory-based cache coherence system means that cache coherence management is centralized: it is managed by a single unit, the directory.
The directory holds the state for all memory blocks and manages requests for these blocks from the nodes (processors). For instance, if a node would like to read a block into its cache, it must ask permission from the directory. The directory then checks whether any other nodes hold the block and forces them to update it if necessary.
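To make the idea concrete, here is a toy sketch in Go (illustrative only; a real directory is a hardware state machine, and the states below are a simplified MSI-style set) of what a directory entry might track and how it reacts to read and write requests:

```go
package main

import "fmt"

// State of a memory block as tracked by the directory.
type State int

const (
	Uncached  State = iota // no cache holds the block
	Shared                 // one or more caches hold a read-only copy
	Exclusive              // exactly one cache holds a writable copy
)

// DirEntry is the directory's bookkeeping for one memory block.
type DirEntry struct {
	state   State
	sharers map[int]bool // node IDs currently holding a copy
	owner   int          // valid only in Exclusive state
}

// ReadRequest handles a node asking to read the block into its cache.
func (e *DirEntry) ReadRequest(node int) {
	if e.state == Exclusive {
		// Force the owner to give up exclusive access first
		// (in real hardware: a downgrade message to that cache).
		fmt.Printf("downgrade node %d to Shared\n", e.owner)
		e.sharers[e.owner] = true
	}
	e.sharers[node] = true
	e.state = Shared
}

// WriteRequest handles a node asking for exclusive write access.
func (e *DirEntry) WriteRequest(node int) {
	for n := range e.sharers {
		if n != node {
			// All other copies must be invalidated before the write.
			fmt.Printf("invalidate copy at node %d\n", n)
			delete(e.sharers, n)
		}
	}
	e.sharers[node] = true
	e.owner = node
	e.state = Exclusive
}

func main() {
	e := &DirEntry{state: Uncached, sharers: map[int]bool{}}
	e.ReadRequest(1)  // node 1 gets a Shared copy
	e.ReadRequest(2)  // node 2 too
	e.WriteRequest(1) // node 2's copy is invalidated; node 1 becomes owner
	fmt.Println(e.state == Exclusive, len(e.sharers)) // true 1
}
```

The point of the sketch is that every request goes through one central bookkeeper, in contrast to snooping protocols where every cache observes a shared bus.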
A prerequisite to my question is the assumption that one uses a Kubernetes cluster with multiple pods accessing a file system storage, for example Azure Files (ReadWriteMany). According to Microsoft this relies on SMB3 and should be safe for concurrent use.
Within one pod, Go is used (with the Gin framework) as the programming language for a server that accesses the file system. Since requests are handled in parallel using goroutines, the file system access is concurrent.
Does SMB3 also ensure correctness for concurrent ReadWrites within one pod, or do I need to manually manage file system operations with something like a Mutex or a Worker Pool?
More specifically: I want to use git2go, which is a Go wrapper around libgit2, written in C. According to libgit2's threading.md it is not thread safe:
Unless otherwise specified, libgit2 objects cannot be safely accessed by multiple threads simultaneously.
Since git2go does not specify otherwise, I assume that different references to the same repository on the file system cannot be used concurrently from different goroutines.
To summarise my question:
Do network communication protocols (like SMB3) also "make" file system operations thread safe locally?
If not: how would I ensure thread-safe file system access in Go? lockedfile seems to be a good option, but it is an internal package.
The documentation says that the CP Subsystem is used for implementing distributed coordination use cases, such as leader election (Raft consensus algorithm), distributed locking, synchronization, and metadata management. By default, it operates in the "unsafe mode" and it even prints a warning to the console saying that strong consistency cannot be guaranteed.
On the other hand, it also says that when it comes to dealing with distributed data structures like IMap, the data will always be written and read from the primary replica by default.
So, if I have the CP Subsystem disabled and I use hazelcastInstance.getMap("accounts").lock("123"), would it be safe to assume no other cluster member will be able to do the same unless this one is released? Or do I have to actually configure the CP part just for this? I also use only a single replica without any backups if that makes any difference.
I think that it should be fine since all members will have to go to the same place for the lock. And also, it seems to me that the "distributed locking" part in the CP Subsystem actually means its own FencedLock that is accessible via hazelcastInstance.getCPSubsystem().getLock("myLock") and so the lock on the map is different.
Here is an answer that addresses your question:
Hazelcast 3.12 IMap.lock() on Map better than ILock.lock() which is deprecated?
IMap.lock(key) creates a lock object for that key on the same partition, while other keys remain available. So it does not use the CP Subsystem.
According to https://slurm.schedmd.com/quickstart_admin.html#HA high availability of SLURM is achieved by deploying a second BackupController which takes over when the primary fails and retrieves the current state from a shared file system (probably NFS).
In my opinion this has a number of drawbacks. E.g., this limits the total number of servers to two, and the second server is probably barely used.
Is this the only way to get a highly available head node with SLURM?
What I would like to do is a classic 3-tiered setup: a load balancer in the first tier which spreads all requests evenly across the nodes in the second tier. This requires the head node(s) to be stateless. The third tier is the database tier, where all information is stored and read. I don't know anything about the internals of SLURM and I'm not sure if this is even remotely possible.
In the current design, the controller internal state is in-memory, and Slurm saves it to a set of files in the directory pointed to by the StateSaveLocation configuration parameter regularly. Only one instance of slurmctld can write to that directory at a time.
One problem with storing the state in the database would be terrible latency in resource allocation, with a lot of synchronisation needed, because optimal resource allocation can only be done with full information. The infrastructure needed to support the same level of throughput as Slurm can handle now with in-memory state would be very costly compared with the current solution, which involves only bitwise operations on in-memory arrays.
Is this the only way to get a highly available head node with SLURM?
You can also have a single MasterController managed with Corosync. But indeed Slurm only has active/passive options available for HA.
In my opinion this has a number of drawbacks. E.g., this limits the total number of servers to two, and the second server is probably barely used.
The load on the controller is often very reasonable relative to current processing power, and the resource allocation problem cannot be trivially parallelised (or made stateless). Often, the backup controller is co-located on a machine running another service. For instance, on small deployments, one machine runs the Slurm primary controller alongside other services (NFS, LDAP, etc.), while another is the user login node, which also acts as the secondary Slurm controller.
I want to setup a DRBD active/active configuration with two nodes. My application will be doing I/Os directly on the DRBD device. I haven't seen any option to enable caching within DRBD.
Is there any linux module that would allow me to setup a cache in between DRBD and the disk module? Any caching above DRBD module can lead to stale data being read by the nodes.
DRBD itself has three replication protocols (A, B, and C) with varying guarantees. You could try using B or even A. However, all of them block until the local write has succeeded.
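For reference, the protocol is selected in the resource configuration. A sketch in DRBD 8.4 syntax (hostnames, devices, and addresses are placeholders):

```
resource r0 {
  protocol B;   # B = memory-synchronous: the write is confirmed once it
                # has reached the peer's buffer, not the peer's disk
  on alpha {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7789;
    meta-disk internal;
  }
  on bravo {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
```

Protocol C (fully synchronous) is the usual choice for active/active setups; A and B trade durability on the peer for lower write latency.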
As for caching writes to the disk explicitly, this SO question might provide further pointers on what is possible. In particular, dannysauer's answer looks interesting.
Scenario A:
To share a read/write block of memory between two processes running on the same host, Joe mmaps the same local file from both processes.
Scenario B:
To share a read/write block of memory between two processes running on two different hosts, Joe shares a file via nfs between the hosts, and then mmaps the shared file from both processes.
Has anyone tried Scenario B? What are the extra problems that arise in Scenario B that do not apply to Scenario A?
Mmap will not share data across hosts without some additional actions.
If you change data in the mmapped part of the file, the changes are initially stored only in memory. They will not be flushed to the filesystem (local or remote) until msync or munmap or close, or even at the discretion of the OS kernel and its FS.
When using NFS, locking and storing data will be slower than with a local FS. Flush timeouts and the duration of file operations will vary too.
On the sister site people say that NFS may have a poor caching policy, so there will be many more I/O requests to the NFS server compared to a local FS.
You will need byte-range locks for correct behaviour. They are available in NFS >= v4.0.
I'd say scenario B has all kinds of problems (assuming it works as suggested in the comments). The most obvious is the standard concurrency issue: two processes sharing one resource with no form of locking, etc. That could lead to problems... I'm not sure whether NFS has its own peculiar quirks in this regard or not.
Assuming you can get around the concurrency issues somehow, you are now reliant on maintaining a stable (and speedy) network connection. Obviously if the network drops out, you might miss some changes. Whether this matters depends on your architecture.
My thought is that it sounds like an easy way to share a block of memory between different machines, but I can't say I've heard of it being done, which makes me think it isn't so good. When I think of sharing data between procs, I think of DBs, messaging, or a dedicated server. In this case, if you made one proc the master (to handle concurrency and own the concept, i.e. whatever this guy says is the best copy of the data), it might work...