How to make Cassandra use two disks on ZFS in SmartOS?

I've heard that there's a huge improvement when Cassandra can write its commit log to one disk and the SSTables to another. I have two disks, and if I were running Linux I would mount each at a different path and configure Cassandra to write to those.
What I would like to know is how to do that in ZFS and SmartOS.
I'm a complete newbie to SmartOS, and from what I understood I add the disks to the storage pool. Are they then managed as one?

psanford explained how to use two disks, but that's probably not what you want here. Splitting the commit log and data onto separate disks is usually recommended to work around deficiencies in the operating system's I/O scheduling. ZFS has a write throttle to avoid saturating disks[0], and SmartOS can be configured to throttle I/Os to ensure that readers see good performance even when some users (possibly the same user) are doing heavy writes[1]. I'd be surprised if the out-of-the-box configuration weren't sufficient, but if you're seeing bad performance, it would be good to quantify it.
[0] http://dtrace.org/blogs/ahl/2014/02/10/the-openzfs-write-throttle/
[1] http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/

By default SmartOS aggregates all your disks into a single ZFS pool (SmartOS names this pool "zones"). From this pool you create ZFS datasets, which can look like either block devices (used for KVM virtual machines) or filesystems (used for SmartOS zones).
You can set up more than one pool in SmartOS, but you will have to do it manually. The Solaris documentation is still quite good and applicable to modern Illumos distributions (including SmartOS). Chapter 4 has all the relevant information for creating a new ZFS pool, but it can be as simple as:
zpool create some_new_pool_name c1t0d0 c1t1d0
This assumes that you have access to the global zone.
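If you do want Cassandra itself to use two separate disks, a minimal sketch might look like the following. The device names (c1t2d0, c1t3d0), pool names, and mountpoints are placeholders; it assumes you run this from the global zone and then point cassandra.yaml at the resulting paths:
zpool create cassandra_data c1t2d0
zpool create cassandra_commitlog c1t3d0
# mount each pool's dataset where Cassandra expects its directories
zfs create -o mountpoint=/var/lib/cassandra/data cassandra_data/data
zfs create -o mountpoint=/var/lib/cassandra/commitlog cassandra_commitlog/commitlog
Then set data_file_directories and commitlog_directory in cassandra.yaml to those two paths.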
If I were running a Cassandra cluster on bare metal and I wanted to benefit from things like ZFS and DTrace I would probably use OmniOS instead of SmartOS. I don't want any contention for resources with my database machines, so I wouldn't run any other zones or VMs on that hardware (which is what SmartOS is really good at).

Related

How to improve read/write speed when using distributed file system?

If I browse the Distributed File System (DFS) shared folder I can create a file and watch it replicate almost immediately across to the other office DFS share. Accessing the shares is pretty instant even across the broadband links.
I would like to improve the read/write speed. Any tips much appreciated.
Improving hardware always helps, but keep in mind that in any distributed file system the performance of the underlying host will have an influence. Besides that, in many cases you can't touch the hardware, and you need to optimize the network or tune your systems to best fit your current provider's architecture.
An example of this, mainly in virtualized environments, is disabling TCP segmentation offload on the network cards (e.g. ifconfig_DEFAULT="SYNCDHCP -tso" on FreeBSD); it will considerably improve throughput, but at the cost of more CPU usage.
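For instance, on FreeBSD you could test the effect on a live interface before making it permanent; the interface name em0 below is only a placeholder:
# disable TCP segmentation offload on the fly
ifconfig em0 -tso
# make the change persistent across reboots
sysrc ifconfig_DEFAULT="SYNCDHCP -tso"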
Depending on how far you want to go you can start all these optimizations from the very bottom:
creating your custom lean kernel/image
test/benchmark network settings (iperf; see the sketch after this list)
fine-tune your filesystem; if using ZFS, here are some guides:
http://open-zfs.org/wiki/Performance_tuning
https://wiki.freebsd.org/ZFSTuningGuide
performance impact when using Solaris ZFS lz4 compression
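A minimal iperf3 round trip might look like this (the hostname is a placeholder):
# on the server/storage side
iperf3 -s
# on a client node, run a 30-second throughput test against it
iperf3 -c storage01.example.com -t 30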
Regarding MooseFS, there are some threads about how the block size affects I/O performance and how, in many cases, disabling the cache allows blocks > 4k.
Mainly for FreeBSD we added a special cache option for the MooseFS client called DIRECT. This option is available in the MooseFS client since version 3.0.49. To disable the local cache and enable DIRECT communication, use this option during mount:
mfsmount -H mfsmaster.your.domain.com -o mfscachemode=DIRECT /mount/point
In most filesystems the speed factors are the type of access (sequential or random) and the block size. Hardware performance is also a factor for MooseFS. You can improve speed by improving hard drive performance (for example, switching to SSDs), network topology (network latency), and network capacity.

SLURM Highly Availability Head Node

According to https://slurm.schedmd.com/quickstart_admin.html#HA high availability of SLURM is achieved by deploying a second BackupController which takes over when the primary fails and retrieves the current state from a shared file system (probably NFS).
In my opinion this has a number of drawbacks. E.g. it limits the total number of servers to two, and the second server is probably barely used.
Is this the only way to get a highly available head node with SLURM?
What I would like to have is a classic 3-tiered setup: a load balancer in the first tier which spreads all requests evenly across the nodes in the second tier. This requires the head node(s) to be stateless. The third tier is the database tier where all information is stored or read. I don't know anything about the internals of SLURM and I'm not sure if this is even remotely possible.
In the current design, the controller internal state is in-memory, and Slurm saves it to a set of files in the directory pointed to by the StateSaveLocation configuration parameter regularly. Only one instance of slurmctld can write to that directory at a time.
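As an illustration, the relevant slurm.conf lines for this active/passive setup might look like the following (the hostnames and shared path are placeholders; releases older than 18.08 use ControlMachine/BackupController instead of two SlurmctldHost lines):
# the first SlurmctldHost entry is the primary controller, the second the backup
SlurmctldHost=ctl1
SlurmctldHost=ctl2
# a directory both controllers can reach, e.g. on NFS
StateSaveLocation=/shared/slurm/state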
One problem with storing the state in the database would be terrible latency in resource allocation, with a lot of synchronisation needed, because optimal resource allocation can only be done with full information. The infrastructure needed to support the same throughput that Slurm can handle now with in-memory state would be very costly compared with the current solution, which involves only bitwise operations on arrays in memory.
Is this the only way to get a highly available head node with SLURM?
You can also have a single MasterController managed with Corosync. But indeed Slurm only has active/passive options available for HA.
In my opinion this has a number of drawbacks. E.g. it limits the total number of servers to two and the second server is probably barely used.
The load on the controller is often very reasonable with respect to current processing power, and the resource allocation problem cannot be trivially parallelised (or made stateless). Often, the backup controller is co-located on a machine running another service. For instance, on small deployments, one machine runs the Slurm primary controller and other services (NFS, LDAP, etc.), while another is the user login node, which also acts as the secondary Slurm controller.

Is it a good idea to run Cassandra inside an LXC or Docker, in production?

I know it runs just fine, so it's okay for development, which is great, but won't it have considerably worse disk and/or network I/O performance because of AUFS?
If you put Cassandra data on a volume, disk I/O performance will be exactly the same as outside of containers, since AUFS will be bypassed entirely.
And even if you don't use a volume, performance will be fine as long as you don't commit Cassandra data into a new image to run that image later. And even if you do that, performance will be affected only during the first writes on each file; after that, it will be native.
You will not see any difference in network I/O performance unless your containers are dealing with hundreds of Mb/s of network traffic and/or thousands of connections per second. In that case, you can use tools like Pipework to assign MACVLAN interfaces or even native physical interfaces to your containers.
We are actually running Cassandra in Docker in production and have had to work through a lot of performance issues.
Networking: you should run the container with --net=host to use host networking. Otherwise you will take a substantial hit to your network speed. See this article for more information on recommended best practices (and the sketch after this list).
Data volume: you should expose your data volume to the physical host. If you're operating in the cloud, note that where you place your data volume may limit your IOPS.
JVM: just because you run Cassandra in a container doesn't mean you can get away from tuning your JVM. You still need to adjust it to account for the system resources of the host machine.
Cluster name/seeds: these need to be configured; we changed them from hard-coded values to a find-and-replace with environment variables using sed.
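Putting those points together, a minimal sketch might look like this (the host path, heap sizes, image tag, and seed variable are placeholders; MAX_HEAP_SIZE/HEAP_NEWSIZE are honoured by Cassandra's own cassandra-env.sh):
docker run -d --name cassandra \
  --net=host \
  -v /data/cassandra:/var/lib/cassandra \
  -e MAX_HEAP_SIZE=8G -e HEAP_NEWSIZE=800M \
  cassandra:3.11
# the kind of sed rewrite mentioned above, run from a custom entrypoint,
# to replace the hard-coded seed list with an environment variable
sed -i "s/- seeds: .*/- seeds: \"${CASSANDRA_SEEDS}\"/" /etc/cassandra/cassandra.yaml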
The big take away is that like any software you need to do some configuration. It's not 100% plug and play.
Looking into the same thing, I just found this on SlideShare:
"Docker uses Linux Ethernet Bridges for basic software routing. This will hose your network throughput. (50% hit)
Use the host network stack instead (10% hit)"

Simulating Disk I/O

We are currently evaluating storage for a virtualization environment (Xen). The storage is an active-active cluster, and I need to test some things there, like split-brain scenarios.
I'm looking for a tool that simulates a lot of small disk I/O, like a virtual machine would read/write to its image file.
I don't need performance testing tools but more something like data integrity. Is there anything around?
btest[1] is able to do such things; it can read/write randomly or sequentially and can also verify the data afterwards.
[1] http://sourceforge.net/projects/btest/
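If btest isn't available, fio (a different tool, mentioned here only as an alternative) can do something similar with small random writes plus verification; the path and size below are placeholders:
fio --name=integrity --filename=/mnt/storage/testfile \
    --rw=randwrite --bs=4k --size=2G \
    --verify=crc32c --do_verify=1 --direct=1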

Sharing large mmaped data file across machines in Linux Cluster

Say I have a 1 TB data file mmapped read/write from the locally mounted HDD filesystem of a "master" Linux system into the virtual address space of a process running on that same "master" system.
I have 20 dedicated "slave" Linux servers connected across a gigabit switch to the "master" system. I want to give random read access to this 1 TB file on these "slave" servers by mmapping it read-only into their process address spaces.
My question is: what is the most efficient way of synchronizing (perhaps lazily) the dataset from the master system to the slave systems? (For example, is it possible to mount the file over NFS and then mmap it from there? If yes, is this the best solution? If not, what is?)
I have been playing around with an idea like this at work recently (Granted this was with significantly smaller file sizes). I believe NFS would be fine for reads but you might hit problems with concurrent writes. Providing you have only one "writer" then your idea should work reasonably well. If the data file is structured, I'd recommend going for a distributed cache of some description and allowing multiple copies of the data spread across the cluster (for redundancy).
In the end we went for a SAN and clustered file system solution (in our case Symantec VCS, but any generic clustered filesystem would do). The reason we did this is because we couldn't get the performance we required from using pure NFS. The clustered file system you choose would need to support mmap properly and a distributed cache.
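For the NFS variant asked about above, the basic plumbing might look like this (the export path, network, and hostname are placeholders); mmapping a file from an NFS mount works, and for a read-only workload the usual NFS attribute-caching rules determine how quickly the slaves see updates:
# on the master, export the directory read-only to the slaves (/etc/exports)
/data 192.168.1.0/24(ro,no_subtree_check)
# on each slave, mount it and mmap the file from the mounted path
mount -t nfs -o ro,noatime master:/data /mnt/data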
