Cassandra with SSD - two disks or one?

It's generally recommended that Cassandra use two separate disks: one for the commit log and the other for everything else.
However, in what appears to be a recent update to the configuration guidelines, the following phrase appears:
For SSDs it is recommended that both commit logs and SSTables are on
the same mount point.
Can anyone explain why it's recommended to only use one disk if it's SSD?
Thanks.

The reason you use a separate disk for the commit log on regular hard drives is so that the commit log drive only ever does sequential writes.
Other Cassandra activity, such as reads and compaction, causes random access on the other disk rather than on the commit log disk, which keeps writes to the commit log very fast.
On an SSD, random access is as performant as sequential access, so there is no need to isolate the commit log to keep it happy.
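If you do keep the two-disk layout on spinning disks, the split is just two settings in cassandra.yaml; a minimal sketch, with mount points that are purely illustrative:
# cassandra.yaml -- paths are illustrative, point them at your own mount points
# commit log on its own spindle so it only ever sees sequential writes
commitlog_directory: /mnt/commitlog/cassandra
# SSTables and everything else on the other disk (or on the same SSD, per the newer guideline)
data_file_directories:
    - /mnt/data/cassandra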

I think you would still want separate disks in many cases because of SSD write amplification. The idea is that a dedicated commit log SSD would suffer no write amplification, which is good since it should be the fastest write location you have.

Related

Azure backup - File system consistent, Application consistent and Crash consistency

I am trying to understand the difference between the file-system-consistent and crash-consistent backups provided by Azure. The majority of the information I can find is from this link. I see that an application-consistent backup ensures all in-memory data and pending I/O are accounted for, perhaps by using a quiescing process, so that a proper snapshot can be taken. However, I'm a bit confused about the other two. I see that crash-consistent is the one that doesn't consider in-memory data or pending I/O and only backs up what has already been written to disk. But then what exactly is meant by a file-system-consistent backup? I can't find any definition. As a result, when the docs mention that Linux VM backups are file-system consistent by default when not using pre/post scripts, I don't understand the implications. Any help much appreciated.
A simple example to illustrate the difference: when a recovery point is file-system consistent, no file-system check is needed on restore to make sure the file system is not corrupted. With a crash-consistent recovery point, a file-system check may be performed after the restored VM boots up, and depending on the outcome there can potentially be data loss because of file-system corruption. So it is always better to strive for file-system consistency.
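To get a concrete feel for what "quiescing" means at the file-system level, the standard Linux building block is fsfreeze from util-linux; this is only an illustration of the idea (the mount point is made up), not a claim about what the Azure backup extension runs internally:
# flush dirty data and block new writes so a snapshot taken now is file-system consistent
fsfreeze --freeze /data
# ...take the snapshot here...
# thaw the file system so writes can continue
fsfreeze --unfreeze /data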

Is it safe to compact a CouchDB database that has continuous replication?

We have a couple of production couchdb databases that have blown out to 30GB and need to be compacted. These are used by a 24/7 operations website and are replicated with another server using continuous replication.
From tests I've done it'll take about 3 mins to compact these databases.
Is it safe to compact one side of the replication while the production site and replication are still running?
Yes, this is perfectly safe.
Compaction works by constructing the new compacted state in memory, then writing that new state to a new database file and updating pointers. This works because CouchDB has a very firm rule that the internals of the database file never get updated, only appended to (with an fsync). This is why you can rudely kill CouchDB's processes and it doesn't have to recover or rebuild the database the way other solutions would.
This means that you need extra disk space available to re-write the file. So, trying to compact a CouchDB database to prevent full disk warnings is usually a non-starter.
Also, replication uses the internal representation of sequence trees (b+trees). The replicator is not streaming the entire database file from disk onto the network pipe.
Lastly, there will of course be an increase in system resource utilization. However, your tests should have shown you roughly how much this costs on your system vs an idle CouchDB, which you can use to determine how closely you're pushing your system to the breaking point.
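If you want to watch this happen on a test copy first, compaction can be triggered and monitored over the plain HTTP API; a minimal sketch, with $HOST standing in for your server URL:
# start compacting the "db" database (requires admin credentials)
curl -H "Content-Type: application/json" -X POST $HOST/db/_compact
# poll the database info: "compact_running" is true while it works,
# and the reported disk size drops once the compacted file has been swapped in
curl -X GET $HOST/db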
I have been working with CouchDB for a while, replicating databases and writing views to fetch data.
I have seen its replication behaviour and observed the following, which should answer your question:
In the replication process, previous revisions of documents are not replicated to the destination; only the current revision is.
Compacting a database only removes those previous revisions, so it will not cause any problem.
Compaction is done on the database you are currently connected to, so it should not affect its replica, which is continuously listening for changes. The replica listens for changes to current revisions, not to previous revisions. To verify this, you can try the following:
Firing this query shows the changes for all sequences of the database. It works purely on the basis of the latest revision changes, not the previous ones (so compaction should not do any harm):
curl -X GET $HOST/db/_changes
The result is simple:
{"results":[
],
"last_seq":0}
More info can be found here: CouchDB Replication Basics
This might help you understand it. In short, the answer to your question is YES, it is safe to compact a database that is under continuous replication.

What is the difference in the "Host Cache Preference" settings when adding a disk to an Azure VM?

When adding a VHD data disk to a VM I am asked for a "Host Cache Preference" (None, Read Only, Read/write).
Can someone tell me the effect of choosing one over the other?
Specifically, I am using a VM as a Build Server so the disks are used for compiling .Net source code. Which setting would be best in this scenario?
Just as the name suggests, this setting controls host-side caching of the disk's I/O. The effect of changing it is that reads, writes, or both can be cached for performance. For example, if you have a read-only database, a Lucene index, or other read-only files, it would be optimal to turn on the read cache for that drive.
I have not seen dramatic performance changes from this setting (until I used SQL Server/Lucene on the drives). High I/O is better improved by striping disks; in your case, if you have millions of lines of code across tens of thousands of files, then you could see improvements in reading/writing. The default cap for a single drive is 500 IOPS (roughly what two 15k SAS drives or a high-end SSD deliver). If you need more than that, add more disks and stripe them.
For example, on an extra large VM you can attach 16 drives * 500 IOPs (~8,000 IOPs):
http://msdn.microsoft.com/en-us/library/windowsazure/dn197896.aspx
(There are some good write-ups/whitepapers from people who did this and got optimal performance by adding the maximum number of smaller drives rather than one massive one.)
Short summary: leave the caching defaults and test with an I/O tool for your specific workload. Single-drive performance will likely not matter; if I/O is your bottleneck, striping drives will help MUCH more than the caching setting on the VHD.
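For completeness, if you script your VMs rather than using the portal, the cache mode can be chosen when the disk is attached; a sketch with the current Azure CLI, where the resource names are placeholders and the exact flags may differ between CLI versions:
# attach an existing data disk with read-only host caching (None and ReadWrite are the other values)
az vm disk attach --resource-group build-rg --vm-name build-vm --name build-data-disk --caching ReadOnly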

Track changes in nfs / sync nfs over multiple datacenters

We have two datacenters, each with a number of Linux servers that share a large EMC-based NFS mount.
The challenge is to keep the two NFS volumes in sync. For the moment, assume that writes only occur on nfs1, which then has to propagate the changes to nfs2.
Periodic generic rsyncs have proved too slow - each rsync takes several hours to complete, even with -az. We need to do specific syncs when a file or directory actually changes.
So then the problem is, how do we know when a file or directory has changed? inotify is the obvious answer, but it famously does not work with nfs. (There is some chatter about inotify possibly working if it is installed on the nfs server, but that isn't an option for us - we only have control of the clients, not the server.)
Does the linux nfs client allow you to capture all the changes it sends to the server, in a logfile or otherwise? Or could we hack the client to do this? We could then collect the changes from each client and periodically kick off targeted rsyncs.
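For concreteness, once a list of changed paths exists, the targeted sync we have in mind would be something along these lines (host names and paths are made up):
# changed-paths.txt holds one path per line, relative to the source mount
rsync -az --files-from=/var/tmp/changed-paths.txt /mnt/nfs1/ nfs2-gateway:/mnt/nfs2/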
Any other ideas welcome. Thanks!
If you need to keep the two EMC servers in sync, it might be better to look into EMC-specific mirroring capabilities to achieve this. These are typically block-based updates, for high performance and low bandwidth utilization. For example, SnapMirror could achieve this on NetApp; I'm not as familiar with EMC, but a quick Google search suggests EMC MirrorView or EMC SRDF as possible options.

Should I fsck ext3 on embedded system?

We have a number of embedded systems requiring r/w access to the filesystem which resides on flash storage with block device emulation. Our oldest platform runs on compact flash and these systems have been in use for over 3 years without a single fsck being run during bootup and so far we have no failures attributed to the filesystem or CF.
On our newest platform we used USB-flash for the initial production and are now migrating to Disk-on-Module for r/w storage. A while back we had some issues with the filesystem on a lot of the devices running on USB-storage so I enabled e2fsck in order to see if that would help. As it turned out we had received a shipment of bad flash memories so once those were replaced the problem went away. I have since disabled e2fsck since we had no indication that it made the system any more reliable and historically we have been fine without it.
Now that we have started putting in Disk-on-Module units I've started seeing filesystem errors again. Suddenly the system is unable to read/write certain files and if I try to access the file from the emergency console I just get "Input/output error". I enabled e2fsck again and all the files were corrected.
O'Reilly's "Building Embedded Linux Systems" recommends running e2fsck on ext2 filesystems but does not mention it in relation to ext3 so I'm a bit confused to whether I should enable it or not.
What are your takes on running fsck on an embedded system? We are considering putting the binaries on a r/o partition, with only the files that have to be modified on a r/w partition on the same flash device, so that fsck can never accidentally delete important system binaries. Does anyone have experience with that kind of setup (good/bad)?
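Roughly, the split we have in mind would look like this in /etc/fstab (device names and mount points are made up):
# binaries on a partition mounted read-only and never fsck'd automatically (last field, pass 0)
/dev/sda1  /       ext3  ro,noatime   0  0
# the few writable files live on their own small partition, fsck'd at boot (pass 2)
/dev/sda2  /data   ext3  rw,noatime   0  2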
I think the answer to your question relates more to what kind of coherency requirements your application has for its data. That is, what has to be guaranteed if power is lost without a formal shutdown of the system? In general, none of the desktop-style file systems handle this well unless the application explicitly closes/syncs files and flushes the disk caches at key transaction points, to ensure that what you need to keep is actually committed to the media.
Running fsck fixes the file system, but without the above care there are no guarantees about which of your changes will actually be kept. That is, it's not exactly deterministic what you'll lose as a result of a power failure.
I agree that putting your binaries or other important read-only data on a separate read-only partition helps ensure they can't erroneously get tossed by an fsck correction to the file-system structures. At a minimum, putting them in a different sub-directory off the root than where the R/W data is held will help. But in both cases, if you support software updates, you still need a scheme for writing to the "read-only" areas anyway.
In our application, we actually maintain a pair of directories for things like binaries and the system is setup to boot from either one of the two areas. During software updates, we update the first directory, sync everything to the media and verify the MD5 checksums on disk before moving onto the second copy's update. During boot, they are only used if the MD5 checksum is good. This ensures that you are booting a coherent image always.
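A rough sketch of that kind of boot-time check as a shell step, with made-up paths rather than our real layout:
# md5sums.txt is written at software-update time, after the copy has been synced to the media
if md5sum --check --quiet /boot-a/md5sums.txt; then
    IMAGE_DIR=/boot-a    # primary copy verified, boot from it
else
    IMAGE_DIR=/boot-b    # checksum mismatch, fall back to the second copy
fi
echo "booting from $IMAGE_DIR"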
Dave,
I always recommend running fsck after a number of reboots, but not every time.
The reason is that ext3 is journaled, so unless you switch it to writeback mode (data=writeback, which journals metadata but does not order data writes with it), most of the time your metadata/file-system structures should be in sync with your data (files).
But like Jeff mentioned, that doesn't guarantee anything about the layer above the file system: you can still end up with "corrupted" files, because some of the records may never have been written to the file system.
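One way to get exactly that "every N reboots" behaviour is to let the mount count drive it; a sketch with a made-up device name:
# force a full e2fsck roughly every 25 mounts (boot-time fsck must be enabled in fstab for this to take effect)
tune2fs -c 25 /dev/sda2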
I'm not sure what embedded device you're running on, but how often does it get rebooted?
If it's a controlled reboot, you can always run "sync; sync; sync" before the restart.
I've been using CF myself for years, and only on very rare occasions have I seen file-system errors.
fsck does help in those cases.
As for separating your partitions, I doubt the advantage of it. Every file on the file system has metadata associated with it, and most of the time, if you don't change the files (e.g. binaries/system files), that metadata shouldn't change either. Unless you have faulty hardware, like cross-talk between writes and reads, those read-only files should be safe.
Most problems arise when you have something writable, and regardless of where you put it, it can cause trouble if the application doesn't handle it well.
Hope that helps.
