I have an application (Endeca) that is a file-based search engine. A customer has 100 Linux servers, all attached to the same SAN (very fast, fiber channel). Each of those 100 servers uses the same set of files, but currently each server has its own copy of the index (approx. 4 GB, so 400 GB in total).
What I would like to do is have one directory and 100 virtual copies of that directory. Only if the application needs to change any of the files in that directory would it start creating a distinct copy of the original folder.
So my idea is this: all 100 servers start using the same directory (but they each think they have their own copy, and don't know any better). As changes come in, Linux/SAN would then potentially end up with up to 100 copies (now slightly different) of that original directory.
Is something like this possible?
The reason I'm investigating this approach is to reduce file transfer times and disk space. We would only have to copy the 4 GB index to the SAN once and create virtual copies. If no changes came in, we'd use only 4 GB instead of 400.
Thanks in advance!
The best solution here is to use the de-duplication ("de-dupe") functionality at the SAN level. Different vendors call it by different names, but this is what I am talking about:
https://communities.netapp.com/community/netapp-blogs/drdedupe/blog/2010/04/07/how-netapp-deduplication-works--a-primer
All 100 "virtual" copies will utilise the same physical disk blocks on the SAN. SAN will only need to allocate new blocks if there are changes made to a specific copy of a file. Then a new block will be allocated for this copy but the remaining 99 copies will keep using the old block - thus dramatically reducing the disk space requirements.
What version of Endeca are you using? The MDEX 7 engine has clustering capability in which the leader and follower nodes all read from the same set of files, so as long as the files are shared (say, over NAS) you can have multiple engines running on different machines backed by the same set of index files. Only the leader node has the ability to change the files, keeping the changes consistent; the follower nodes are then notified by the cluster coordinator when the changes are ready to be "picked up".
In the MDEX 6 series you could probably achieve something similar, provided that the index files are read-only. Indexing in v6 would usually happen on another machine, and the destination set of index files would usually be replaced once the new index is ready. This won't help you, though, if you need partial updates.
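If you go the read-only route, the usual trick for replacing the destination set of index files without disturbing readers is to build the new index alongside the old one and then swap a symlink. A generic sketch of that pattern (not Endeca-specific; the directory names are made up):

    import os

    def publish_index(new_index_dir, live_link="/san/endeca/index_current"):
        """Atomically repoint the live index location at a freshly built index."""
        tmp_link = live_link + ".tmp"
        if os.path.lexists(tmp_link):
            os.remove(tmp_link)
        # Create the new symlink under a temporary name, then rename() it over
        # the old one: readers always see either the old or the new index,
        # never a half-swapped state.
        os.symlink(new_index_dir, tmp_link)
        os.rename(tmp_link, live_link)

    # e.g. after the indexer finishes writing /san/endeca/index_v2:
    # publish_index("/san/endeca/index_v2")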
NetApp deduplication sounds interesting, but Endeca has never tested that functionality, so I am not sure what kinds of problems you will run into.
I often copy raw data from HDDs with FAT32 partitions at the file level. I would like to switch to bitwise cloning of this raw data, which consists of thousands of 10 MiB files written sequentially across a single FAT32 partition.
The idea is that the large archival HDD would have a small partition containing a shadow directory structure with symbolic links to separate raw-data image partitions. Each additional partition would hold the aforementioned raw data, but sized only to the space consumed on the source drive. The number of raw data files on each source drive can range from tens up through tens of thousands.
i.e.: [[sdx1][--sdx2--][-------------sdx3------------][--------sdx4--------][-sdx5-][...]]
Where 'sdx1' = directory of symlinks to sdx2, sdx3, sdx4, ... such that the user can browse to multiple partitions but it appears to them as if they're just in subfolders.
Ideally I'd like to find both a Linux and a Windows solution. If the process can be scripted, or existing software can step through a standard workflow, that would be best. The process is almost always: 1) insert 4 HDDs with raw data, 2) copy whatever is on them, 3) repeat. Always the same drive slots and process.
AFAIK, in order to clone a source partition without cloning all the free space, one conventionally must resize the source HDD partition first. Since I can't alter the source HDD in any way, how can I get around that?
One way would be to clone the entire source partition (including free space) and resize the target backup partition afterward, but that's not going to work because of all the additional time it would take.
The goal is to retain bitwise accuracy and to save time (dd runs at about 200 MiB/s whereas rsync runs at about 130 MiB/s, but having to copy a ton of blank space every time makes that advantage moot). I'd also like to run with some kind of --rescue behaviour so that when bad clusters are hit on the source drive it just behaves like Clonezilla and writes ???????? in place of the bad clusters. I know I said "retain bitwise accuracy", but a bad cluster's a bad cluster.
If you think one of the COTS or GOTS packages like EaseUS, AOMEI, Paragon and whatnot is able to clone partitions as I've described, please point me in the right direction. If you think there's some way I can dd it with a script that sizes up the source, makes a correctly sized target partition, then fixes the target FAT to its correct size, chime in. I'd love many options, and so would future people with a similar use case who stumble on this thread :)
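For reference, the core of the dd-based script I have in mind would look something like the sketch below. It assumes the data really is written sequentially from the start of the partition and that the used size in bytes has already been determined (for example from df on a read-only mount); the device name, image path and size are placeholders, and for drives with bad sectors GNU ddrescue is the more robust tool for the rescue part:

    import subprocess

    BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB blocks keep dd close to sequential throughput

    def clone_used_region(source_dev, target_img, used_bytes):
        """Copy only the used leading region of a partition with dd.

        conv=noerror,sync keeps going past unreadable sectors and pads them
        with zeros instead of aborting (a crude stand-in for a rescue mode).
        """
        blocks = (used_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE  # round up
        subprocess.run(
            ["dd", f"if={source_dev}", f"of={target_img}",
             f"bs={BLOCK_SIZE}", f"count={blocks}",
             "conv=noerror,sync", "status=progress"],
            check=True,
        )

    # e.g. clone_used_region("/dev/sdb1", "/archive/sdb1.img", 37_000_000_000)

Note that the copied filesystem will still think it is the original partition's size, so the "fix up the target FAT" step would still be needed on top of this.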
Not sure if this fits your case, but it is very simple.
Syncthing (https://syncthing.net/) will sync the contents of two or more folders, and works on Linux and Windows.
I am writing software in C, on Linux running on AWS, that has to handle 240 terabytes of data, in 72 million files.
The data will be spread across 24 or more nodes, so there will only be 10 terabytes on each node, and 3 million files per node.
Because I have to append data to each of these three million files every 60 seconds, the easiest and fastest thing to do would be to keep each of these files open all the time.
I can't store the data in a database, because the performance in reading/writing the data will be too slow. I need to be able to read the data back very quickly.
My questions:
1) Is it even possible to keep 3 million files open?
2) If it is possible, how much memory would that consume?
3) If it is possible, would performance be terrible?
4) If it is not possible, I will need to combine all of the individual files into a couple of dozen large files. Is there a maximum file size in Linux?
5) If it is not possible, what technique should I use to append data every 60 seconds and keep track of it?
The following is a very coarse description of an architecture that can work for your problem, assuming that the maximum number of file descriptors is irrelevant when you have enough instances.
First, take a look at this:
https://aws.amazon.com/blogs/aws/amazon-elastic-file-system-shared-file-storage-for-amazon-ec2/
https://aws.amazon.com/efs/
EFS provides a shared storage that you can mount as a filesystem.
You can store ALL your files in a single EFS storage unit. Then you will need a set of N worker machines, each running at full capacity of file handles. You can then use a Redis queue to distribute the updates: each worker dequeues a batch of updates from Redis, opens the necessary files, and performs the updates.
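A minimal sketch of one such worker, assuming the files live under an EFS mount at /mnt/efs and that producers push updates onto a Redis list called file-updates as JSON objects with path and data fields (all of those names are made up for the example):

    import json
    import os

    import redis  # the redis-py client

    EFS_ROOT = "/mnt/efs"        # hypothetical EFS mount point
    QUEUE = "file-updates"       # hypothetical Redis list name

    r = redis.Redis(host="redis.internal", port=6379)

    while True:
        # BLPOP blocks until an update is available, so idle workers cost nothing.
        _, raw = r.blpop(QUEUE)
        update = json.loads(raw)
        path = os.path.join(EFS_ROOT, update["path"])
        # Open, append, close: no single process needs to hold millions of
        # descriptors, and adding workers spreads the load.
        with open(path, "ab") as f:
            f.write(update["data"].encode())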
Again: the maximum number of open file handles will not be a problem, because if you hit a maximum you only need to increase the number of worker machines until you achieve the performance you need.
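For reference, the per-process descriptor ceiling is easy to inspect and raise up to the hard limit the OS allows; pushing it into the millions additionally needs kernel tuning (fs.nr_open and fs.file-max on Linux). A quick check:

    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open-file limit: soft={soft}, hard={hard}")

    # Raise the soft limit as far as the current hard limit allows.
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))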
This is scalable, though I'm not sure if this is the cheapest way to solve your problem.
I have a Cassandra 1.2.6 cluster running in datacenter A; each node has a solid-state drive with somewhat limited space (approx. 50% of disk space is free).
Now I need to implement some way of taking automatic backups of each node. Ideally I want a way of moving all of the cluster's data files to a different disk (standard, cheaper disks), or even to a different server in the same datacenter A, and possibly moving all the data once in a while to a datacenter B in a different location.
From what I've read, I can take snapshots on each node to get the files, copy them with whatever tool I want, and in that case I have the option to move the data to a different disk/server/datacenter.
My question is: since each of my nodes is about 50% full, will taking a snapshot consume all of that space, or will the hard links consume far less space than I anticipate? And is there a better way of doing this, maybe with an existing tool, or does everything have to be custom-made when it comes to this type of backup in Cassandra?
Thanks in advance!
A hard link just creates a new directory entry for the same file (http://en.wikipedia.org/wiki/Hard_link). So a snapshot takes up effectively zero space, but you'll want to clean it up after you're done copying it off to whatever your archive is, because when the "original" sstable is deleted (typically post-compaction), space won't be reclaimed as long as the snapshot reference is still there.
My impression is that tablesnap is the most popular tool for automating backups to s3. It also supports Cassandra incremental backups. If you want more control over where you're backing up to, DataStax OpsCenter supports running a custom script when it takes snapshots.
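If you roll your own instead, the per-node loop is essentially: snapshot, copy the snapshot directories somewhere cheaper, then clear the snapshot so the hard links stop pinning old SSTables. A rough sketch (the data directory and rsync destination are placeholders, and nodetool option spellings can vary a little between Cassandra versions):

    import glob
    import os
    import subprocess
    from datetime import datetime, timezone

    DATA_DIR = "/var/lib/cassandra/data"      # match your data_file_directories
    DEST = "backuphost:/backups/cassandra/"   # hypothetical rsync destination

    tag = datetime.now(timezone.utc).strftime("backup-%Y%m%d%H%M%S")

    # 1. Hard-linked snapshot: near-instant and takes effectively no extra space.
    subprocess.run(["nodetool", "snapshot", "-t", tag], check=True)

    # 2. Ship only the snapshot directories off the SSD, keeping the
    #    <keyspace>/<table>/snapshots/<tag> layout (-R preserves the path after /./).
    for snap_dir in glob.glob(f"{DATA_DIR}/*/*/snapshots/{tag}"):
        rel = os.path.relpath(snap_dir, DATA_DIR)
        subprocess.run(["rsync", "-aR", f"{DATA_DIR}/./{rel}/", DEST], check=True)

    # 3. Drop the snapshot so space held by compacted SSTables can be reclaimed.
    subprocess.run(["nodetool", "clearsnapshot", "-t", tag], check=True)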
I have configured two separate data directories in the cassandra.yaml file, as given below:
data_file_directories:
- E:/Cassandra/data/var/lib/cassandra/data
- K:/Cassandra/data/var/lib/cassandra/data
When I create a keyspace and insert data, the keyspace gets created in both directories and the data gets scattered between them. What I want to know is: how does Cassandra split the data between multiple directories, and what is the rule behind this?
You are using the JBOD feature of Cassandra when you add multiple entries under data_file_directories. Data is spread over the configured drives in proportion to their available space.
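To make "in proportion to their available space" concrete, the placement behaves roughly like a weighted pick across the configured directories. This is only an illustration of the weighting idea, not Cassandra's actual code, and the free-space figures are made up:

    import random

    # Hypothetical free space for the two configured directories, in GiB.
    free_space = {
        "E:/Cassandra/data/var/lib/cassandra/data": 400,
        "K:/Cassandra/data/var/lib/cassandra/data": 100,
    }

    # A directory with 4x the free space ends up with roughly 4x the new SSTables.
    directories = list(free_space)
    picks = random.choices(directories, weights=list(free_space.values()), k=1000)
    for d in directories:
        print(d, picks.count(d))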
Using multiple data directories also lets you take advantage of the disk_failure_policy setting. You can read about the details here:
http://www.datastax.com/dev/blog/handling-disk-failures-in-cassandra-1-2
In short, you can configure Cassandra to keep going, doing what it can, if a disk becomes full or fails completely. This has advantages over RAID 0 (where you would effectively have the same capacity as JBOD) in that you do not have to restore the whole data set from backup (or run a full repair); you just run a repair for the missing data. On the other hand, RAID 0 provides higher throughput (depending on how well you tune the RAID array to match the filesystem and drive geometry).
If you have the resources for a fault-tolerant or more performant RAID setup (RAID 10, for example), you may want to just use a single directory for simplicity. Most deployments are starting to lean towards the density route, though, using JBOD rather than systems-level fault tolerance.
You can read about the thought process behind the development of this issue here:
https://issues.apache.org/jira/browse/CASSANDRA-4292
I am able to guess somewhat at how the keyspace is split between multiple data directories: based on the maximum available space and the load on the directories, SSTables of the same column family get written to different data directories.
Rackspace Cloud Files uses a flat storage system with 'containers' to store files. According to Rackspace there is no limit to the number of files per container.
My question is whether there is a best/most efficient number of files per container to optimize write/fetch performance.
If I have tens of thousands of files to store, should they all go in a single giant container or partitioned into many smaller containers? And if so, what is the optimal container size?
FYI:
[Snippets taken from Rackspace support]
long story short, the containers are databases, and the more rows in a table, the more time it takes to write them on standard hardware. When a write hasn't been committed to disk, it sits in a queue and is subject to data loss. It's something we noticed with large containers, and the more objects, the more likely it was, so we instituted the limits to protect the data.
because of the rate limits, your data is safe, it just slows down the writes a bit
the limits start as low as 50,000 objects, and at that level you are limited to 100 writes per second
by 1,000,000 objects in a container, it's 25 per second
and at 5 million and above, you're down to 4 writes per second
We apologize for the limitations, and will be updating our documentation to more clearly express this.
This has recently hurt us quite badly. Thought I'd share until they get their API docs up to date, so others can plan around this issue.
We recommend no more than 1 million objects per container. The system will return a maximum of 10,000 object names per list request by default.
Update 9/20/2013 from Cloud Files development: The 1 million object per container recommendation is no longer accurate since Cloud Files switched to all SSD container servers. Also, the list is limited to 10,000 containers at a time.
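If you do decide to spread objects over several containers, a deterministic hash of the object name keeps the mapping stable without any lookup table. A small sketch (the prefix and container count are arbitrary choices, not anything Cloud Files requires):

    import hashlib

    NUM_CONTAINERS = 16          # choose so each container stays comfortably small
    CONTAINER_PREFIX = "assets"  # arbitrary naming scheme

    def container_for(object_name):
        """Map an object name to one of NUM_CONTAINERS containers, deterministically."""
        digest = hashlib.md5(object_name.encode("utf-8")).hexdigest()
        return f"{CONTAINER_PREFIX}-{int(digest, 16) % NUM_CONTAINERS:02d}"

    # The same name always maps to the same container:
    # container_for("photos/2013/09/cat.jpg")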