Does IMap.lock() require CP subsystem? - hazelcast

The documentation says that the CP Subsystem is used for implementing distributed coordination use cases, such as leader election (Raft consensus algorithm), distributed locking, synchronization, and metadata management. By default, it operates in the "unsafe mode" and it even prints a warning to the console saying that strong consistency cannot be guaranteed.
On the other hand, it also says that when it comes to distributed data structures like IMap, the data is always written to and read from the primary replica by default.
So, if I have the CP Subsystem disabled and I use hazelcastInstance.getMap("accounts").lock("123"), would it be safe to assume that no other cluster member will be able to do the same until this lock is released? Or do I have to actually configure the CP part just for this? I also use only a single replica without any backups, if that makes any difference.
I think it should be fine, since all members will have to go to the same place for the lock. It also seems to me that the "distributed locking" part of the CP Subsystem actually refers to its own FencedLock, accessible via hazelcastInstance.getCPSubsystem().getLock("myLock"), so the lock on the map is a different thing.

Here is an existing answer that addresses your question:
Hazelcast 3.12 IMap.lock() on Map better than ILock.lock() which is deprecated?
IMap.lock(key) creates a lock object for that key on the key's own partition, while other keys remain available. So it does not use the CP Subsystem.
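
A minimal sketch of the two different locking APIs, assuming the Hazelcast 4.x/5.x member API; the map name "accounts" and key "123" come from the question, everything else here is illustrative:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.lock.FencedLock;
import com.hazelcast.map.IMap;

public class LockExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Partition-based lock: it lives on the partition that owns key "123".
        // Other members calling lock("123") on the same map will block here,
        // and none of this goes through the CP Subsystem.
        IMap<String, String> accounts = hz.getMap("accounts");
        accounts.lock("123");
        try {
            accounts.put("123", "updated");
        } finally {
            accounts.unlock("123");
        }

        // CP Subsystem lock: a FencedLock backed by Raft. Without the CP
        // Subsystem enabled it runs in "unsafe mode", where strong
        // consistency is not guaranteed (as the warning in the logs says).
        FencedLock cpLock = hz.getCPSubsystem().getLock("myLock");
        cpLock.lock();
        try {
            // critical section
        } finally {
            cpLock.unlock();
        }

        hz.shutdown();
    }
}

The partition lock in the first block is what the question is about; the FencedLock in the second block is the one that actually lives in the CP Subsystem.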

Related

Azure backup - File system consistent, Application consistent and Crash consistency

I am trying to understand the difference between the File system consistent and Crash consistent backups provided by Azure. The majority of the information I find is from this link. I see that an Application consistent backup ensures that all in-memory data and pending I/O are accounted for, perhaps by using a quiescing process, so that a proper snapshot can be taken. However, I am a bit confused about the other two. I see that Crash consistent is the one which doesn't consider in-memory data and pending I/Os and only backs up what has already been written. But then what exactly is meant by a File system consistent backup? I can't find any definition. As a result, when the docs mention that by default Linux VM backups are File system consistent if not using pre/post scripts, I don't understand the implications. Any help much appreciated.
A simple example to mark the difference: when a recovery point is file-system consistent, no file system check will need to be performed after restore to make sure the file system is not corrupted. In the case of crash consistency, after the VM boots up, a file system check may be performed, and based on that there can potentially be data loss because of file system corruption. So it is always better to strive for file system consistency.

hazelcast - read-backup-data vs near cache

In the IMap configuration there is an attribute, read-backup-data, that can be set to true, which enables a member to read the value from the backup copy, if available, in case the owner of the key is some other member.
http://docs.hazelcast.org/docs/latest-development/manual/html/Distributed_Data_Structures/Map/Backing_Up_Maps.html#page_Enabling+Backup+Reads
Then there is Near Cache, which will start caching results for a few data structures locally.
http://docs.hazelcast.org/docs/latest-development/manual/html/Performance/Near_Cache/Hazelcast_Data_Structures_with_Near_Cache_Support.html
If we have 2 kinds of cluster setup:
2 members, and async-backup-count for a map is 1, and read-backup-data is true
2 members, nearcache enabled for this map
Would there be differences in these 2 approaches?
The 1st setup will probably use less memory and will not be configurable. But how do the two compare in terms of read performance?
For a two-member cluster setup, enabling backup reads lets you access all the data locally, since both members hold all the entries as either primary or backup. This setup is not much different from using a Replicated Map (see here for details: http://docs.hazelcast.org/docs/latest-development/manual/html/Distributed_Data_Structures/Replicated_Map.html). So, when your cluster has only two members (and no clients), enabling backup reads can be more advantageous in terms of performance.
However, Near Cache has a bunch of configuration options, and you can decide how much data you need to access locally in any type of setup (including a client-server topology). You can also decide the in-memory data format of the Near Cache. These options can give you more performance than enabling backup reads.
The two options are not much different in single-entry read performance (assuming the Near Cache contains a valid entry), since neither performs a remote operation.
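
For reference, a rough sketch of how the two setups from the question might be expressed with Hazelcast's programmatic Java configuration; the map name "myMap" and the backup counts/in-memory format chosen here are illustrative assumptions, and the same settings are also available declaratively:

import com.hazelcast.config.Config;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.NearCacheConfig;

public class MapReadConfigSketch {
    public static void main(String[] args) {
        // Setup 1: one async backup, backup reads enabled.
        Config config1 = new Config();
        MapConfig backupReadMap = config1.getMapConfig("myMap");
        backupReadMap.setBackupCount(0)         // no sync backups (illustrative)
                     .setAsyncBackupCount(1)    // async-backup-count = 1
                     .setReadBackupData(true);  // read-backup-data = true

        // Setup 2: Near Cache enabled for the same map.
        Config config2 = new Config();
        NearCacheConfig nearCache = new NearCacheConfig("myMap")
                .setInMemoryFormat(InMemoryFormat.OBJECT); // local format is configurable
        config2.getMapConfig("myMap").setNearCacheConfig(nearCache);

        // Either config would then be passed to Hazelcast.newHazelcastInstance(...).
    }
}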

How to make Cassandra use two disks on ZFS in SmartOS?

I heard that there's a huge improvement when Cassandra can write its log files to one disk and the SSTables to another. I have two disks, and if I were running Linux I would mount each at a different path and configure Cassandra to write to those.
What I would like to know is how to do that in ZFS and SmartOS.
I'm a complete newbie to SmartOS, and from what I understand I just add the disks to the storage pool; are they then managed as being one?
psanford explained how to use two disks, but that's probably not what you want here. That's usually recommended to work around deficiencies in the operating system's I/O scheduling. ZFS has a write throttle to avoid saturating disks[0], and SmartOS can be configured to throttle I/Os to ensure that readers see good performance when some users (possibly the same user) are doing heavy writes[1]. I'd be surprised if the out-of-the-box configuration wasn't sufficient, but if you're seeing bad performance, it would be good to quantify that.
[0] http://dtrace.org/blogs/ahl/2014/02/10/the-openzfs-write-throttle/
[1] http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/
By default SmartOS aggregates all your disks together into a single ZFS pool (SmartOS names this pool "zones"). From this pool you create ZFS datasets, which can look like either block devices (used for KVM virtual machines) or filesystems (used for SmartOS zones).
You can set up more than one pool in SmartOS, but you will have to do it manually. The Solaris documentation is still quite good and applicable to modern Illumos distributions (including SmartOS). Chapter 4 has all the relevant information for creating a new ZFS pool, but it can be as simple as:
zpool create some_new_pool_name c1t0d0 c1t1d0
This assumes that you have access to the global zone.
If I were running a Cassandra cluster on bare metal and I wanted to benefit from things like ZFS and DTrace I would probably use OmniOS instead of SmartOS. I don't want any contention for resources with my database machines, so I wouldn't run any other zones or VMs on that hardware (which is what SmartOS is really good at).

mmap file shared via nfs?

Scenario A:
To share a read/write block of memory between two processes running on the same host, Joe mmaps the same local file from both processes.
Scenario B:
To share a read/write block of memory between two processes running on two different hosts, Joe shares a file via nfs between the hosts, and then mmaps the shared file from both processes.
Has anyone tried Scenario B? What are the extra problems that arise in Scenario B that do not apply to Scenario A?
mmap will not share data without some additional steps.
If you change data in an mmapped part of a file, the changes are stored only in memory at first. They will not be flushed to the filesystem (local or remote) until msync or munmap or close is called, or until the OS kernel and its filesystem decide to write them back.
When using NFS, locking and storing data will be slower than on a local filesystem. Flush timeouts and the duration of file operations will vary too.
On the sister site, people say that NFS may have a poor caching policy, so there will be many more I/O requests to the NFS server compared to a local filesystem.
You will need byte-range locks for correct behavior. They are available in NFS >= v4.0.
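
To make the flushing and locking points concrete, here is a small illustrative sketch (not from the original answer) in Java, which maps a file, takes a byte-range lock on the region it modifies, and explicitly forces the changes out to the file; the path on the NFS mount is hypothetical:

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SharedMmapSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical file on an NFS mount shared by both hosts.
        Path shared = Path.of("/mnt/nfs/shared-region.dat");

        try (FileChannel channel = FileChannel.open(shared,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {

            // Byte-range lock on the region we intend to modify
            // (NFS >= v4.0 is needed for this to behave correctly across hosts).
            try (FileLock lock = channel.lock(0, 4096, false)) {
                MappedByteBuffer region = channel.map(
                        FileChannel.MapMode.READ_WRITE, 0, 4096);

                region.put("hello from host A".getBytes(StandardCharsets.UTF_8));

                // Without an explicit flush the change may stay in memory;
                // force() is the msync() analogue for mapped buffers.
                region.force();
            }
        }
    }
}

The same point applies to mmap/msync and fcntl range locks in C.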
I'd say scenario B has all kinds of problems (assuming it works as suggested in the comments). The most obvious is the standard concurrency issue: two processes sharing one resource with no form of locking, etc. That could lead to problems... I'm not sure whether NFS has its own peculiar quirks in this regard or not.
Assuming you can get around the concurrency issues somehow, you are now reliant on maintaining a stable (and speedy) network connection. Obviously if the network drops out, you might miss some changes. Whether this matters depends on your architecture.
My thought is that it sounds like an easy way to share a block of memory between different machines, but I can't say I've heard of it being done, which makes me think it isn't so good. When I think of sharing data between processes, I think of DBs, messaging or a dedicated server. In this case, if you made one process the master (to handle concurrency and own the concept, i.e. whatever it says is the best copy of the data) it might work...

Should I fsck ext3 on embedded system?

We have a number of embedded systems requiring r/w access to the filesystem which resides on flash storage with block device emulation. Our oldest platform runs on compact flash and these systems have been in use for over 3 years without a single fsck being run during bootup and so far we have no failures attributed to the filesystem or CF.
On our newest platform we used USB-flash for the initial production and are now migrating to Disk-on-Module for r/w storage. A while back we had some issues with the filesystem on a lot of the devices running on USB-storage so I enabled e2fsck in order to see if that would help. As it turned out we had received a shipment of bad flash memories so once those were replaced the problem went away. I have since disabled e2fsck since we had no indication that it made the system any more reliable and historically we have been fine without it.
Now that we have started putting in Disk-on-Module units I've started seeing filesystem errors again. Suddenly the system is unable to read/write certain files and if I try to access the file from the emergency console I just get "Input/output error". I enabled e2fsck again and all the files were corrected.
O'Reilly's "Building Embedded Linux Systems" recommends running e2fsck on ext2 filesystems but does not mention it in relation to ext3, so I'm a bit confused as to whether I should enable it or not.
What are your takes on running fsck on an embedded system? We are considering putting binaries on a r/o partition and only the files which have to be modified on a r/w partition on the same flash device, so that fsck can never accidentally delete important system binaries. Does anyone have experience with that kind of setup (good/bad)?
I think the answer to your question relates more to what types of coherency requirements your application has relative to its data. That is, what has to be guaranteed if power is lost without a formal shutdown of the system? In general, none of the desktop-operating-system-style file systems handle this all that well without the application specifically closing/syncing files and flushing the disk caches, etc., at key transaction points, to ensure that what you need to maintain is in fact committed to the media.
Running fsck fixes the file system, but without the above care there are no guarantees about which of the changes you made will actually be kept. I.e., it's not exactly deterministic what you'll lose as a result of the power failure.
I agree that putting your binaries or other important read-only data on a separate read-only partition does help ensure that they can't erroneously get tossed due to an fsck correction to the file-system structures. At a minimum, putting them in a different sub-directory off the root than where the R/W data is held will help. But in both cases, if you support software updates, you still need to have a scheme to deal with writing the "read-only" areas anyway.
In our application, we actually maintain a pair of directories for things like binaries, and the system is set up to boot from either one of the two areas. During software updates, we update the first directory, sync everything to the media and verify the MD5 checksums on disk before moving on to the second copy's update. During boot, a copy is only used if its MD5 checksum is good. This ensures that you always boot a coherent image.
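
As a rough sketch of that verification step (not the poster's actual implementation; the paths and the choice of language are illustrative assumptions), the idea is simply to recompute each image's checksum and compare it with the stored value before trusting that copy:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class ImageVerifierSketch {
    // Compute the MD5 digest of a file, reading it in chunks.
    static String md5Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical paths: the image in copy A and its stored checksum.
        Path image = Path.of("/boot/copyA/rootfs.img");
        String expected = Files.readString(Path.of("/boot/copyA/rootfs.img.md5")).trim();

        String actual = md5Of(image);
        System.out.println(actual.equalsIgnoreCase(expected)
                ? "copy A is good, boot from it"
                : "copy A is corrupt, fall back to copy B");
    }
}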
Dave,
I always recommend running fsck after a number of reboots, but not every time.
The reason is that ext3 is journaled. So unless you enable writeback (journal-less) mode, most of the time your metadata/file-system tables should be in sync with your data (files).
But as Jeff mentioned, that doesn't guarantee anything about the layer above the file system. It means you can still get "corrupted" files, because some of the records probably didn't get written to the file system.
I'm not sure what embedded device you're running on, but how often does it get rebooted?
If it's a controlled reboot, you can always do "sync;sync;sync" before the restart.
I've been using CF myself for years, and on very rare occasions I got file-system errors.
fsck does help in that case.
As for separating your partitions, I doubt the advantage of it. For every data file on the file system, there is metadata associated with it. Most of the time, if you don't change the files, e.g. binary/system files, then this metadata shouldn't change. Unless you have faulty hardware, like cross-talking writes & reads, those read-only files should be safe.
Most problems arise when you have something writable, and regardless of where you put it, it can cause problems if the application doesn't handle it well.
Hope that helps.
