How can GlusterFS move thin disks between bricks?

In a thin provisioning environment, if a file grows bigger and bigger, how can GlusterFS move it between bricks?
The original bricks are running out of space, but the newly added bricks are nearly empty. How does the DHT translator deal with that situation?

After adding new brick(s), you can carry out a rebalance operation to move files from one brick to another.
But you won't be able to do this on a per-file basis.
Check the doc below:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/administration_guide/#sect-Rebalancing_Volumes
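A rough sketch of that workflow, assuming a placeholder volume name myvol and a new brick at server4:/bricks/brick1 (substitute your own names):

# add the new brick to the existing volume
gluster volume add-brick myvol server4:/bricks/brick1
# start a rebalance so DHT migrates existing files onto the new brick
gluster volume rebalance myvol start
# watch progress until it reports completed
gluster volume rebalance myvol status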

Related

How to put files bigger than the brick in GlusterFS?

I need a GlusterFS architecture that lets me put large files (bigger than a brick) into volumes. I'm not going to use the striped volume type because it has performance issues and makes my volumes slower.
You can check out the sharding volume feature in GlusterFS.
Because sharding distributes the pieces of a file across the bricks in a volume, it lets you store files larger than any individual brick in the volume.
Check the documentation and try it out:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/administration_guide/#sect-Managing_Sharding
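As a hedged sketch of enabling it (the volume name and shard size are placeholders; note that sharding only applies to files created after the option is turned on):

# turn on sharding for the volume
gluster volume set myvol features.shard on
# optionally adjust the shard size (64MB is the common default)
gluster volume set myvol features.shard-block-size 64MB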

FreeNAS/ZFS with both RAID-Z and Mirror?

I'm considering switching to FreeNAS at the same time I'm acquiring some new disks for my home server. The end configuration will have a 1.5TB drive (currently the largest disk in the set) and two 3TB drives.
The "obvious" way to structure this (to me) would be to create partitions on the 3TB drives equal in size to the full 1.5TB drive, then RAID-Z those partitions together for 3TB of redundant storage. The remainder of the 3TB drives could be mirrored together for another 1.5TB of redundant storage. This seems like it gives me no wasted space, and a full 4.5TB of redundant storage to work with.
The problem is that I can't find anything that would let me treat these two segments as a single pool. I don't really care if any given data is written to parity vs. mirrored space, so long as it's all resilient to a single disk failure.
Am I stuck with two virtual spaces and allocating data between them, or is there a ZFS option I'm not finding that would let me pool the whole thing?
Technically you should be able to build a pool with two vdevs -- one a RAID-Z of 3 partitions and the other a mirror of 2 partitions.
Something like this should work:
zpool create tank raidz da0p1 da1p1 da2p1 mirror da1p2 da2p2
(here da0 is the 1.5TB drive, da1 and da2 are the 3TB drives, and the p2 partitions are the leftover 1.5TB halves of the 3TB drives)
That said, you don't want to do that, for performance reasons. Reads and writes will be distributed across all vdevs and, as a result, across all your partitions for every chunk of data ZFS needs to write out. In the end your 3TB hard drives will have to do two seeks to access data on the different partitions each time ZFS writes out a transaction group. Once the data is written, similar seeks will be needed to read back any data that isn't in the ARC yet. At 10-20ms per seek, performance will be rather terrible.

moving Cassandra snapshots to a different disk/server/datacenter

I have a Cassandra 1.2.6 cluster running in datacenter A; each node has a solid-state drive with somewhat limited space (approx. 50% of the disk is free).
Now I need to implement some way of taking automatic backups of each node. Ideally I want a way of moving all of the cluster's data files to a different disk (standard, cheaper disks), or even to a different server in the same datacenter A, and possibly moving all the data to a datacenter B in a different location once in a while.
From what I've read, I can use snapshots on each node to get the files, copy them with whatever tool I want, and in that case I have the option to move the data to a different disk/server/datacenter.
My question is: since each of my nodes is about 50% full, will taking a snapshot consume all of that space, or will the hard links consume far less space than I anticipate? If so, is there a better way of doing this, maybe with an existing tool, or does everything have to be custom-made when it comes to this type of backup in Cassandra?
Thanks in advance!
A hard link just creates a new directory entry for the same file (http://en.wikipedia.org/wiki/Hard_link). So a snapshot takes up effectively zero space, but you'll want to clean it up after you're done copying it off to whatever your archive is, because when the "original" sstable is deleted (typically post-compaction), space won't be reclaimed as long as the snapshot reference is still there.
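A rough sketch of that cycle (the keyspace name and snapshot tag below are placeholders, and the exact nodetool flags can differ slightly between Cassandra versions):

# hard-link the current sstables into each table's snapshots/<tag>/ directory
nodetool snapshot -t nightly mykeyspace
# ...copy the snapshot directories off to the cheaper disk/server/datacenter...
# then remove the hard links so space can actually be reclaimed after compaction
nodetool clearsnapshot -t nightly mykeyspace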
My impression is that tablesnap is the most popular tool for automating backups to s3. It also supports Cassandra incremental backups. If you want more control over where you're backing up to, DataStax OpsCenter supports running a custom script when it takes snapshots.

How does cassandra split keyspace data when multiple directories are configured?

I have configured multiple separate data directories in the cassandra.yaml file, as given below:
data_file_directories:
- E:/Cassandra/data/var/lib/cassandra/data
- K:/Cassandra/data/var/lib/cassandra/data
When I create a keyspace and insert data, the keyspace directory gets created in both locations and the data gets scattered between them. What I want to know is: how does Cassandra split the data between multiple directories, and what is the rule behind this?
You are using the JBOD feature of Cassandra when you add multiple entries under data_file_directories. Data is spread across the configured directories in proportion to their available space.
This also lets you take advantage of the disk_failure_policy setting. You can read about the details here:
http://www.datastax.com/dev/blog/handling-disk-failures-in-cassandra-1-2
In short, you can configure Cassandra to keep going, doing what it can, if a disk becomes full or fails completely. This has advantages over RAID 0 (where you would effectively have the same capacity as JBOD) in that you do not have to restore the whole data set from backup (or run a full repair) but can just run a repair for the missing data. On the other hand, RAID 0 provides higher throughput (depending on how well you tune the RAID array to match the filesystem and drive geometry).
If you have the resources for a fault-tolerant or more performant RAID setup (RAID 10, for example), you may want to just use a single data directory for simplicity. Most deployments are leaning towards the density route, though, using JBOD rather than system-level tolerance.
You can read about the thought process behind the development of this feature in the JIRA issue here:
https://issues.apache.org/jira/browse/CASSANDRA-4292
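As an illustrative cassandra.yaml sketch tying the directories from the question to the failure policy discussed above (best_effort is just one of the available values, alongside stop and ignore):

data_file_directories:
- E:/Cassandra/data/var/lib/cassandra/data
- K:/Cassandra/data/var/lib/cassandra/data
# keep serving whatever is still readable if one of the disks fills up or fails
disk_failure_policy: best_effort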
From what I can guess, the keyspace is split between the data directories based on the available space and the load on each directory; SSTables of the same column family can end up written to different data directories.

Multiple applications using copies of a directory on a SAN

I have an application (Endeca) that is a file-based search engine. A customer has 100 Linux servers, all attached to the same SAN (very fast, Fibre Channel). Each of those 100 servers uses the same set of files, but currently each server has its own copy of the index (approx. 4 GB each, thus 400 GB in total).
What I would like to do is have one directory and 100 virtual copies of that directory. Only if the application needs to make changes to any of the files in that directory would it start creating a distinct copy of the original folder.
So my idea is this: All 100 start using the same directory (but they each think they have their own copy, and don't know any better). As changes come in, Linux/SAN would then potentially have up to 100 copies (now slightly different) of that original.
Is something like this possible?
The reason I'm investigating this approach is to reduce file transfer times and disk space usage. We would only have to copy the 4 GB index files once to the SAN and create virtual copies. If no changes came in, we'd use only 4 GB instead of 400.
Thanks in advance!
The best solution here is to utilise the de-duplication ("de-dupe") functionality at the SAN level. Different vendors may call it by different names, but this is what I am talking about:
https://communities.netapp.com/community/netapp-blogs/drdedupe/blog/2010/04/07/how-netapp-deduplication-works--a-primer
All 100 "virtual" copies will utilise the same physical disk blocks on the SAN. SAN will only need to allocate new blocks if there are changes made to a specific copy of a file. Then a new block will be allocated for this copy but the remaining 99 copies will keep using the old block - thus dramatically reducing the disk space requirements.
What version of Endeca are you using? The MDEX 7 engine has a clustering ability where the leader and follower nodes all read from the same set of files, so as long as the files are shared (say, over NAS) you can have multiple engines running on different machines backed by the same set of index files. Only the leader node has the ability to change the files, which keeps the changes consistent; the follower nodes are then notified by the cluster coordinator when the changes are ready to be "picked up".
In the MDEX 6 series you could probably achieve something similar, provided that the index files are read-only. The indexing in v6 would usually happen on another machine, and the destination set of index files would usually be replaced once the new index is ready. This won't help you, though, if you need partial updates.
NetApp deduplication sounds interesting, but Endeca has never tested that functionality, so I am not sure what kinds of problems you might run into.
