Good distributed general purpose filesystem in my case?

Good distributed general purpose filesystem in my case? - linux

I've been researching the idea of using distributed file system along with my dedicated servers instead of going with Amazon S3 and the results are nothing but massive headaches!
My project have the following characteristics/requirements:
User files are stored in dedicated servers. Each file is stored in 2 separate machines, located in different data centers (150-200 miles away from each other)
I'm using Amazon RDS to host the associated mysql database (*). It's fairly compact (only hold IDs/files metadata)
Files/data is around 50TB. Naturally, data does change and will definitely grow with time
My question is: is there a good general-purpose, distributed parallel fault-tolerant file system that have the following characteristics:
Stable & reasonably fast (upload/download)
Fairly easy to setup & maintain
Handle data storage so that I only have to care about removing/adding new servers if the need arise (ie. add new servers to the filesystem's server pool by editing a simple config, or something like that)
I've read about OpenStack, GlusterFS, MogileFS, XtreemFS, etc...but the more I read, the more I get confused!
(*) Yes, I realize the contradiction. Cost-wise it does make sense to host the database on RDS. But storing (up to) 50TB of users files on amazon is way too expensive compared to using dedicated servers (provided it's good enough).
PS. my app isn't live yet, so I'm open to suggestion if someone have a good idea that fits well in my case.
EDIT I'm not trying to make a S3 clone, I just need to use an existing hosting infrastructure to build small-scale cloud solution, my question is about finding the right distributed file system to handle/automate this.

We recently switched from an expensive storage solution to the opensource Lizardfs for our Distributed storage solution. It is quite simple to set up and scale once your understand the basic concept.
Check out https://docs.lizardfs.com/introduction.html#architecture for a quick overview. But forget about shadow master en meta loggers for now. What you need to know is that there are
a master: that regulates the traffic (make sure that has enough cpu)
chunkservers: which actually store the data. Use any kind of off the shelf hardware with a bunch of harddisks attached.
Clients: which are just simple mount points. So you can get a giant 50TB mount if you want. The master will tell the client where to find/store the files. The actual data is being transfered straight from the client->chunkserver and back.
You can add as many chunkservers as you want, the master will automatically try to balance your storage usage across them. Adding storage is a matter of adding harddrives, or adding servers. They don't have to be actual bare metal machines, but that is probably the cheapest.
There are 2 amazing features in lizardfs that allow georeplication.
Goals (see https://docs.lizardfs.com/adminguide/replication.html#standard-goals): How important are files to you. You can define, on a file level/folder level how many times a file needs to be replicated. Do you want 2 copies 3? 10? You could define a goal of 2 copies for old files that are simply there for archiving purposes. And define a goal of 4 copies on SSD drives for all new files.
Those same goals can also be used to do georeplication. You define that your data has to be stored it least two different locations by labeling your chunkservers accordingly. (e.g. DC1 and DC2)
Rack awareness (see https://docs.lizardfs.com/adminguide/advanced_configuration.html#configuring-rack-awareness-network-topology): you basically define IP ranges to teach the system how your network looks like. This way, clients will try to serve files from the closest server.
The ease of setting it up is what sold lizardfs for me. I've heard very good things about Ceph, but setting it up is another matter...
What worried me at first was how proven the technology is/was. So I spent quite a lot of research on figuring out who uses it.
Orange Poland (A large telecom provider) is one of the users.
And Cloudweavers/opennebula actualy built a business around it selling complete solutions.

Won't it take more than one person a few months a year to manage these servers? That will cost some $, then you have the cost of hosting the data yourself, then you have the added huge cost that the business / system you are building is not obviously scalable? In addition any likely investor will be turned away by a complex home grown data hosting system. How will you ensure integrity/security on par with Amazon? Your max savings per year look like $30,000 or so.
You could save money by doing a de-duplicated storage system where you just store all the unique chunks of data - also see rsync. Don't know how redundant your data is though.

I recommend LizardFS and GfarmFS.
IMHO Ceph is a major disappointment and so is XtreemFS.

Related

Creating a file server in Azure

Our company has an on-prem file server that I'd like to move to the cloud. I followed these directions and was successfully able to map a drive on my local work computer to connect to an Azure File Share. Our company has about 20 locations, ~5 TB of data (mostly "office" type of files) in total, and about 500 users accessing them.
There are two issues I would like to improve but I'm not sure how:
There's somewhat of a lag when opening files. Other than increasing our office's internet speed, is there anything to be done to make it faster? Would some kind of site-to-site VPN help? Would adding some type of server or VM in the "middle" (maybe one per location?) that would perhaps somehow cache the files reduce the lag?
Also, we have and use an Office 365 subscription. What's the easiest way to use our existing AD structure to transfer over the NTFS permissions that are currently in place?
I Googled around and found a bunch of companies advertising their services, notable among them was Talon Storage. But it seems like something that could be done without hiring a company. What I'm hoping for is a DIY direction to optimally solve these issues. Perhaps there's a standard or commonly recommended solution for such issues. Any guidance would be greatly appreciated.

L-A-T-E-N-C-Y. The number one enemy for any cloud-based file server attempt. It ranges from annoying to down right unusable, depending on how far you are from the Azure datacenter of choice.
Imagine a poor soul trying to "stream" a large 20-meg Excel file with 20 references to external files. What used to take maybe 8 seconds on-prem will now take 40 in the cloud (on a good day). It's game over for productivity. Your marketing department that sometimes used to cut video in iMovie over the network? Those days are over.
I understand this is not the answer you were after, but it's the crude reality.
Do not panic, there are solutions, here's a good one - https://azure.microsoft.com/en-us/services/storsimple/
I'm sure you wanted to get rid of boxes not buy more, but it is what it is.

Why backup azure storage account if it's locally or geo redundant?

So I've come across this AzCopy tool, and multiple tutorials that say it's good for backing up my storage blobs and whatnot.
Isn't Azure Storage automatically backed up? Isn't that what locally redundant means?
I just want to make sure I'm not missing something and putting my application in jeopardy by not running some external backup.

Redundancy is different from back-ups. Redundancy means that all your changes are replicated to another location. In case of a failover your slave can theoretically function as a master and serve the (hopefully) latest state of your file system. However, the fact that everything is replicated also means that your accidental delete actions, file corruptions, etc. are replicated. Back-ups are meant to prevent this. In case you accidentally mess something up and perform some delete requests, you still have the back-ups and you can usually go back to any point in time (if you made a backup at that time of course).
And of course it's not a bad idea to be not fully dependent on Azure.

The most important thing about any backup policy is that before you create it you decide what you are protecting against, and what sort of data are you backing up.
If the data you backing up is an offsite backup of working data. If access to that data is restricted to admin personnel and they all know what the data is. Then replication could well be all you need to protect from a hardware failure on Azure.
If however you are backing up customer data, or file data that fred in accounts randomly deletes when he falls asleep at the keyboard then you have a different threat model and you should consider your backups accordingly.
Where you back it up is very much a matter of personal requirements and philosophy. I have known customers who will keep backups on Azure and AWS (even though their only compute workload was Azure) If in your threat model you want to protect against MS going bust and selling all of their kit on ebay one morning, then it makes sense to back up elsewhere. Or you can decide that you trust Azure to go bust and just split data across multiple regions.
TL;DR
Understand what you are protecting your data from, and design your backup policy from that.

pictures on a sub-domain or separate domain

I was given a suggestion to put our vast picture gallery into separate subdomain, or even separate domain. I dont see how it would help, but I do know that meny other sites does it.
Anyone could point me to the right direction concerning this question ?

The two biggest reasons for this are
to use a CDN (Content Delivery Network) to maximize your user's experience through delivering static files faster
To better utilize specific hardware. For example, your web server may be super beefy, running on SSDs (relatively small capacity), 16GB RAM etc. that you don't want to have to bog down with serving millions of extra files. Your image server would only need a very weak computer but could very inexpensively be backed by 10 TB of storage.

Distributing a bundle of files across an extranet

I want to be able to distribute bundles of files, about 500 MB per bundle, to all machines on a corporate "extranet" (which is basically a few LANs connected using various private mechanisms, including leased lines and VPN).
The total number of hosts is roughly 100, and the goal is to get a copy of the bundle from one host onto all the other hosts reliably, quickly, and efficiently. One important issue is that some hosts are grouped together on single fast LANs in which case the network I/O should be done once from one group to the next and then within each group between all the peers. This is as opposed to a strict central server system where multiple hosts might each fetch the same bundle over a slow link, rather than once via the slow link and then between each other quickly.
A new bundle will be produced every few days, and occasionally old bundles will be deleted (but that problem can be solved separately).
The machines in question happen to run recent Linuxes, but bonus points will go to solutions which are at least somewhat cross-platform (in which case the bundle might differ per platform but maybe the same mechanism can be used).
That's pretty much it. I'm not opposed to writing some code to handle this, but it would be preferable if it were one of bash, Python, Ruby, Lua, C, or C++.

I think all these problems have been solved by modern research into p2p networking and well packaged into nice forms. A bit of script and bit torrent should solve these problems. torrent clients exist for all modern OSs, then a script on each machine to check a location for a new torrent file, start the DL, then delete the old bundle once the DL has finished.

What about rsync?

I'm going to suggest you use compie's idea of rysnc to copy the files in which case you can use a scripting language of your choice.
On the propagating system you will need a script containing some form of representation of the hosts and a matrix between them weighted with the speed. You then need to calculate a minimum spanning tree from that information. From that, you can then send messages to the systems to which you intend to propagate detailing the MST and the bundle to fetch, whereby that script/daemon begins transfer. That host then contacts the hosts over the fastest links...
You could implement it in bash - python might be better or a custom C daemon.
When you update the network you'll need to update the matrix based on latest information.
See: Prim's Algorithm.

network drive file sharing

For the better part of 10 years + we have relied on various network mapped drives to allow file sharing. One drive letter for sharing files between teams, a seperate file share for the entire organization, a third for personal use etc. I would like to move away from this and am trying to decide if an ECM/Sharepoint type solution, or home grown app, is worth the cost and the way to go? Or if we should simply remain relying on login scripts/mapped drives for file sharing due to its relative simplicity? Does anyone have any exeperience within their own organization or thoughts on this?
Thanks.

SharePoint is very good at document sharing.
Documents generally follow a process for approval, have permissions, live in clusters... and these things lend themselves well to SharePoints document libraries.
However there are somethings that don't lend themselves well to living inside SharePoint... do you have a virtual hard drive (.vhd) file that you want to share with a workmate? Not such a good idea to try and put a 20GB file into SharePoint.
SharePoint can handle large files, and so can SQL Server behind it... but do you want your SQL Server bandwidth being saturated by such large files? Do you want your backup of SQL Server to hold copies of such large files multiple times?
I believe that there are a few Microsoft partners who offer the ability to disassociate file blobs from the SharePoint database, so that SharePoint can hold the metadata and a file system holds the actual files, and SharePoint simply becomes the gateway to manage access, permissions, and offer a centralised interface to files throughout an organisation. This would offer you the best of both worlds.
Right now though, I consider SharePoint ideal for documents, and I keep large files (that are not document centric) on Windows file shares.

Definetely, use a tool.
The main benefit here is version control. Being able to jump easily to a previous version, diff'ing and seeing who modified what (see most VCS' blame/annotate tool- it prints out a text file showing when/who modified each line in the text file).
Second, you can probably benefit from issue tracking/task tracking.
Other benefits include web access from the internet, having a wiki (which can be great in some situations), etc.
I use Subversion + Redmine at work, and I find it highly useful- test a few solutions and you will surely find out further advantages for you.

One thing that can be overlooked in the change to an document management tool is the planning required around how much is going to be stored and information architecture issues like where different content is going to end up.
SharePoint particularly is easy to setup without a good plan going forward and is particularly vulnerable to difficulties later on when things get to busy.
I would not recommend a home grown app for something like this. The problem has been solved by off the shelf tools and growing one from scratch is going to cost a huge amount and not get you any way near the features for the money.
Did I mention how important planning your security groups and document areas (IA) was?

If you need just document storage then sharepoint can do very well. WSS is ewen free and it provides very good document storage capabilities.
But you have to plan carefully as updating existing applications is painfull. If you decide to go with Sharepoint then I can give you few advices from top of my head
Pay attention to security configuration (user groups, privilegies,..)
Plan your document libraries well as it is not easy to just move documents betveen them
Also consider limiting number of versions that one document can have, because sharepoint stores full backups betveen verions, not just changes
Don't use infopath:) we have very bad experience with it (just don't tell this to managers)
If you don't really need to change graphical look of Sharepoint than don't bother with it as it brings many problems (I'm talking about custom masterpages and custom site templates)
Try to use as much OOB stuff as possible, because developing your own webparts not only cost more, but it can be quite complicated.
Make sure to turn-on search indexing. This is quite tricky, because it is by default turned off and then you will be as surprised that search is not working as I was :)
If you try to just deploy it and load 10.000 documents into it then you will surely have problems with it later. If you give a little thought about structure then you will end up with really good document storage.

Migrating is very probably worth the cost in the long term. You will gain reliability, versioning, traceability, and extensibility.
Be sure to first identify the groups/rights, and to identify which links need to be fixed (maybe you have applications that use links to the shares).
An open source alternative to SharePoint is Alfresco, it is very good for CIFS (Windows shares) too.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string