Best distributed filesystem for commodity linux storage farm [closed]

I have a lot of spare Intel Linux servers lying around (hundreds) and want to use them for a distributed file system in a web hosting and file sharing environment. This isn't for an HPC application, so high performance isn't critical. The main requirement is high availability: if one server goes offline, the data stored on its hard drives must still be available from other nodes. It must run over TCP/IP and provide standard POSIX file permissions.
I've looked at the following:
Lustre: Comes really close, but it doesn't provide redundancy for data on a node. You must make the data HA using RAID or DRBD. Supported by Sun and open source, so it should be around for a while.
gfarm: Looks like it provides the redundancy, but at the cost of complexity and maintainability. Not as well supported as Lustre.
Does anyone have any experience with these or any other systems that might work?

Also check out GlusterFS.
Edit (Aug-2012): Ceph is finally getting ready. Recently the authors formed Inktank, an independent company to sell commercial support for it. According to some presentations, the mountable POSIX-compliant filesystem is the uppermost layer and not really tested yet, but the lower layers have been used in production for some time now.
The interesting part is the RADOS layer, which presents object-based storage with both 'native' access via the librados library (available for several languages) and an Amazon S3-compatible REST API. Either one makes it more than adequate for adding massive storage to a web service.
This video is a good description of the philosophy, architecture, capabilities and current status.

In my opinion, the best file system for Linux is MooseFS. It's quite new, but I had an opportunity to compare it with Ceph and Lustre, and I can say for sure that MooseFS is the best one.

Gluster is getting quite a lot of press at the moment:

Lustre has been working for us. It's not perfect, but it's the only thing we have tried that has not broken down under load. We still get LBUGs from time to time, and dealing with 100 TB+ file systems is never easy, but the Lustre system has worked and increased both performance and availability.

Unless someone forces you to use it, I would highly recommend using anything other than Lustre. From what I hear from others, and what also gave me nightmares for quite some time, Lustre quite easily breaks down in all kinds of situations. And if only a single client in the system breaks down, it puts itself into an endless do_nothing_loop mode, typically while holding some important global lock, so the next time another client tries to access the same information, it will also hang. Thus, you often end up rebooting the whole cluster, which I guess is something you would normally try to avoid ;)
Modern parallel file systems like FhGFS are way more robust here and also allow you to do nice things like running server and client components on the same machines (though built-in HA features are still under development, as someone from their team told me, but their implementation is going to be pretty awesome from what I've heard).

Ceph looks to be a promising new-ish entry into the arena. The site claims it's not ready for production use yet though.

I read a lot about distributed filesystems and I think FhGFS is the best.
It's worth a try. See more about it at:


Why dockerize a service or application when you could install it? [closed]

We have around 12 services and other applications such as presto.
We are thinking about building Docker containers for each service and application. Is it right to dockerize all of them?
When would a Docker container not be the ideal solution?
Quick local environment setup for your team - if all your services are containerized, setting up an environment for your development team is quick.
Helps avoid the "it works on mine, but doesn't work on yours" problem - a lot of our development issues usually stem from development environment setup. If you have your services containerized, a big chunk of this gets offloaded somewhere else.
Easier Deployments - while we all have different processes for deploying code, it goes without saying that having them containerized makes things a hell of a lot easier.
Better Version Control - as you already know, images can be tagged, which helps with version control.
Easier Rollbacks - since you have things version controlled, it goes without saying that it is easier to roll back your code, sometimes by simply pointing to your previously working version.
Easy Multi-environment Setup - as most development teams do, we set up local, integration, staging and production environments. This is easier when services are containerized and, most of the time, takes just a switch of environment variables.
Community Support - we have a strong community of software engineers who continuously contribute great images that can be reused for developing great software. You can leverage that support. Why re-invent the wheel, right?
Many more... but there are a lot of great blogs out there where you can read about them. =)
I don't really see many cons with it, but here are a couple I can think of.
Learning Curve - yes, it does have some learning curve. But from what I have seen from my junior engineers, it doesn't take too much time to learn how to set it up. It usually takes longer to figure out how to containerize your services.
Data Persistence - some engineers have concerns about data persistence. You can fix this by mounting a volume to your container. If you want to use your own database installation, you can simply switch your HOST, DB_NAME, USERNAME and PASSWORD to the ones you have on your localhost:5432 and all should be fine.
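To make the environment-variable point a bit more concrete, here is a minimal Python sketch of a service reading its database settings from the environment. The variable names HOST, DB_NAME, USERNAME and PASSWORD come from the answer above; the class name, function name, and default values are made up for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DbSettings:
    host: str
    name: str
    username: str
    password: str


def load_db_settings(env=None):
    """Build database settings from environment variables.

    The defaults are hypothetical local-development values; in a container
    they would be overridden per environment (integration, staging, ...).
    """
    env = os.environ if env is None else env
    return DbSettings(
        host=env.get("HOST", "localhost:5432"),
        name=env.get("DB_NAME", "app_dev"),
        username=env.get("USERNAME", "dev"),
        password=env.get("PASSWORD", ""),
    )


if __name__ == "__main__":
    # Same code, different environment: only the variables change.
    staging = load_db_settings({"HOST": "db.staging.internal:5432", "DB_NAME": "app"})
    print(staging.host, staging.name)
```

The same image can then be started in each environment with different variables, so nothing in the code itself has to change between local and production.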
I hope this helps!
You should containerize all Linux-based services that are stateless and require frequent upgrades/changes/patches. These include all types of front-end and application servers.
Databases/datastores, on the other hand, are a more complex case, since there are issues of performance and data persistence/integrity. Also, databases are not upgraded/patched as frequently as front-end applications.
*Windows containers will only run in Windows.
Docker is a recipe for consistency and reproducibility.
To make a nice cup of tea, you need boiling water, put a tea bag in it and let it brew for three minutes. How you achieve boiling water is absolutely irrelevant.
Now let's imagine that you need to serve up 12 cups of tea. Does your staff know how to make a proper brew? Does your staff know how to use a kettle or a pan? What guarantee do you have that each cup of tea will be the same?
You could spend a lot of time training people and make sure you have all the appliances you need. Or you can invest in a machine that will produce the same cup of tea over and over again.
The analogy may seem stupid but my point is that relatively common problems already have well-known solutions.
Unless it's a one-off scenario or you have additional constraints we don't know about, what reasons do you have to not consider Docker?
There is no issue with dockerizing multiple services. I think you need to consider the following things too.
You have to think about how to keep the data you use inside the container. By default, data written inside the container is lost when the container is removed. You may have to mount a volume in order to keep the data permanently.
You may not get bare-metal performance when running in Docker.
IMO it's not a good choice to run all your applications in Docker unless you need the advantages of containerization. But it is easy to run stateless applications and services with Docker.

Is it possible that the popular applications on my laptop are surveilling my files on the hard drive?

What if I develop a desktop application that a million people will use and, behind the scenes, the application surveils users' files on their hard drives, streaming the data out from time to time?
Can one be assured no such thing happens with any popular software application, be it MS Office or Google Chrome?
Or is this just a stupid question?
Is it technically possible? Yes, it is.
Could it be happening in an application used by a million users for a relatively long time without being noticed? Very unlikely. Somebody would notice the strange network traffic eventually.
Also, @Mjh mentioned open source in a comment. While open source can help by allowing people to audit the source code, how many times have you checked that the binary you are using is actually compiled from the source that you were looking at? Of course, there are signatures on binary packages and all, but the signature is made by the package maintainer. There is an inherent trust not only in the developer of the application, but also in the tool chain that creates a binary package from the source code. And then we haven't talked about strange "bugs", or the fact that even in open source, some security issues are very hard to find (otherwise all open source software would be free of security bugs, which it is not).
So back to your question, sure, you could use all kinds of techniques to monitor the behavior of an application, you could monitor memory access, network traffic, whatever else. You can also analyse the code itself, look for suspicious things. It will take a huge amount of effort and still there will be no 100% guarantee, only some level of assurance.
Automated version upgrades could make detection even harder by the way. Even if you put lots of resources into analysis of one version, what if only a short-lived version had malicious code? Sure, that too can be analysed, but would anyone bother, unless there was a good reason (like indications of something malicious)?
Yet I think you can be pretty sure that major vendors don't do this. It's just not worth it for them, why would they? Their risk would be huge, with a relatively low benefit.

Do you use 30 day trial servers to do development work? [closed]

I know this is an odd question but I need to ask it to get information to present to a client. Their lead network admin wants me to work on 30 day trial servers like Sharepoint & SQL Server to develop projects for their clients. While I will do as they ask, I'm not convinced this is the best way to go about developing software or troubleshooting previously developed software. To be honest, I've never worked on custom development for any server/software using a trial version.
What arguments are there for and against working on trial software/servers?
Pro: It enables you to mock up a concept and see if it seems like the development path will be easy before you shell out large amounts of money for the real deal.
Cons: It could trap you in a vicious cycle of wiping your virtual machine and re-installing the OS, the trial version, and your product (you do use source control, correct?) if they are hoping that this will alleviate the need for ever paying for the real product.
Suggestion: If you don't mind unsolicited advice, then I would determine why the lead admin wants to use the trial versions -- and then go from there. Until you know the reasons you cannot respond to them.
If they are doing it for the pro reason, then determine if you feel comfortable working with the possibility of switching technologies 30 days into your build. (Can you do it efficiently?)
If they are doing it to avoid spending money, present some of the alternate open source / free options that you are comfortable developing with. If they will not change their modus operandi at that point, then do what is necessary, knowing what you will be walking away from / getting in to.
(And if you don't mind one more bit of unsolicited advice -- if they are doing it for the con reason and will not change WALK AWAY)
Point them at BizSpark. Microsoft is begging people to use their stuff. A hunny will get you everything on the map for 3 years or until you start making money.
Oh, to answer your question: If I need to get funding for technology not present in the infrastructure or to do a proof of concept I would not think twice about using evals. That is what they are for. I would be evaluating the suitability of the product for use with my designs. Seems easy to me. Maybe I am just, hold on, I have to give my parrot a cracker... ;-)
Apart from the ethical arguments, there are practical ones:
What are you supposed to do if development overruns? Start reinstalling everything, wasting several days doing so?
Additionally, if the client is so strapped for cash that they want to do this, how can you be certain they will pay you (either due to cash flow problems, or simply because of their shady ethics)?
I'm pretty sure that that kind of use is a violation of license terms. Trial editions of servers are for evaluating a product. And if you are in fact creating a product, then you have gone way beyond evaluation.
I would never work under such terms. If you are developing a concrete product, get proper licenses for the development tools. I know that the developer edition of SQL server is not hugely expensive (compared to a version licensed for production use), so I would imagine that the same counts for Sharepoint.
And then there is of course, as already mentioned, the question of what you do when the trial period expires.
I wouldn't mind doing this so long as the job is shorter than 30 days. Make sure your work contract states that they're paying for the time worked and not specific deliverables, because your deliverables are time-bombed.
Also be prepared to walk away. If this company doesn't have resources to get the right software, you don't want to be there longer than 30 days anyhow.
Microsoft provides several pre-built virtual machines that contain full stacks:
(Server 2008/SQL 2008/SharePoint) (Server 2003/SQL/Project Server) etc.
They are time-bombed, but often (not always) Microsoft will provide a new image after the timeout.
The benefit of using these images is that they are already configured and good to go.
As an example, here is a beta of SharePoint 2010.
If the project has a quick timeline, it provides the developers access to the configured stack right away, with no ramp up time of building new virtual machines.
Especially when working on beta/early-release software, this is great.
The SQL Server evaluation's download page mentions that the evaluation license is good for 180 days, and specifically advertises it as a tool you can use for mission-critical applications. This tells me MS is fine with your using it for development work.
To answer a question with more questions:
How long does this project run?
What phase of the effort are you in now?
Is this an internal/proof-of-concept project, or something that your customer(s) will be using for a long time?
If you are going to need to use SQL Server for Operations & Maintenance support months past the initial evaluation period, you ought to get a license for the full version of it. And also consider what your customers are using so that you can reproduce any bugs that come back from them.
I don't think it's ethical to continually renew evaluation licenses to have a longer evaluation period. Companies call them "evaluations" as a try-before-you-buy, not a keep-trying-without-buying.
I'm not sure what others are seeing as unethical here. If the project is short enough to be completed within the 30 day trial, I don't see any issues. I think that's a great use of trials - if they can't handle a client's applications then they aren't a good option and you can use something else.
I think others here have given some good advice regarding the longer than 30 days projects and some good contract ideas.
How in-house do the servers have to be? Would a hosted solution work for them? (Dreamhost, Amazon Web Services, whatever)? Some hosting systems provide pretty complex machine images (lots of stuff pre-installed--definitely AWS, presumably most others), decreasing setup time/effort. I think those come with licenses, though I don't honestly know. Plus, in at least some cases, you (they) only pay for what you (they) use.
Obviously no good if the physical machine needs to be in-house, or if things are otherwise super-sensitive.

When to start performance tuning a website [closed]

I have an MVC website and the volume of traffic is increasing. The site points to a backend SQL Server 2008 database.
At what point do I need to figure out what the bottleneck of the system is, and review whether I need to load balance machines or change the way I am doing database connection management?
Are there specific tools and thresholds that indicate that the current model isn't scalable or is hitting a breaking point (besides just observing a slow site)?
When you start noticing performance issues.
There are some very easy things you can do to increase performance with so little work that it's easier to do them than to see whether you need to yet ;)
First and foremost is putting all static images and other media on a separate server. That eliminates a whole lot of queries on the boxen running the dynamic parts of the web server.
Next in line is make sure you are using as many hard drive spindles as possible. Of course you want your database on a separate machine, let alone a separate hard drive, but you also want your web server logs written to a separate hard drive. That prevents a lot of jumping around of the hard drive heads.
As far as "how do you know when you need to performance tune", I will give a different answer than George Stocker: When there is a cost associated with your performance that outstrips the cost of looking into it. I say it this way because your customers may be a little unhappy if your website is a little sluggish, but if it doesn't prevent anyone from using it, or recommending it to others, then it may not be worth looking into. People put up with sub-optimal performance all the time.
There are a plethora of tools available to address the plethora of possible bottlenecks. A decent performance tuning strategy starts with measurement and consistent instrumentation of the given system.
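As a rough illustration of "measurement and consistent instrumentation", here is a minimal Python sketch that records how long each call to a function takes. The names timed, timings, render_page and summary are hypothetical and not taken from any particular tool; a real site would more likely ship these numbers to a metrics system than keep them in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process store of timings, keyed by function name.
timings = defaultdict(list)


def timed(func):
    """Record the wall-clock duration of every call to func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the duration even if the call raised an exception.
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper


@timed
def render_page(n):
    # Stand-in for real work such as a database query plus templating.
    return sum(i * i for i in range(n))


def summary(name):
    """Return (call count, mean seconds, worst seconds) for one function."""
    samples = timings[name]
    return len(samples), sum(samples) / len(samples), max(samples)
```

Applying the same decorator to every request handler gives comparable numbers over time, which is what lets you see a bottleneck forming instead of guessing at it.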
But performance tuning requires precious time and resources, and should only be pursued when it gives you the most bang for the buck, i.e. it provides the greatest improvement to achieving your website's objectives given the work required. If your website supports (or is) a business or organization, you must continuously evaluate the business landscape and plan the next allocation of resources. This is entirely dependent on the particular industry.
An engineer might focus on continual refinement of an existing system, but the project commissioners (be they an external client, or your company's management) must weigh the costs and benefits of all types of development, from improving an existing featureset, to adding new features, to addressing technical limitations affecting product usability (including performance issues). That's not to say engineers have no say in resource allocation, but their perspective is just one of many contributing to success.
When you have doubts that the website would survive a doubling in max usage. One common line of thought where I am from is that you should have the performance capacity to support at least 2x the number of users you expect.
Determining whether or not you can support 2x is something better left to load testing, though, rather than speculation. One note on your other comment, though: chances are a website performance problem is going to affect everyone using the web site, including you on a local machine... unless it's a bandwidth problem and you're connected to a local network. Barring cable cuts, it's not going to be 'just the people in Asia'.
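The 2x rule of thumb is simple arithmetic once a load test has given you a capacity figure. A small Python sketch, assuming you measure both numbers in requests per second (the function names and the sample figures are made up for illustration):

```python
def headroom_ratio(measured_capacity_rps, expected_peak_rps):
    """How many times the expected peak the load-tested capacity covers."""
    if expected_peak_rps <= 0:
        raise ValueError("expected peak must be positive")
    return measured_capacity_rps / expected_peak_rps


def has_headroom(measured_capacity_rps, expected_peak_rps, factor=2.0):
    """True if capacity covers `factor` times the expected peak (2x by default)."""
    return headroom_ratio(measured_capacity_rps, expected_peak_rps) >= factor


if __name__ == "__main__":
    # Hypothetical numbers: the load test sustained 350 req/s, peak traffic is 200 req/s.
    print(headroom_ratio(350, 200), has_headroom(350, 200))
```

With those sample numbers the ratio is 1.75, so the site would fall short of the 2x target and tuning would be worth a look.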

(*nix) Cloud/Cluster solutions for building fast & scalable web-services [closed]

I'm going to build a high-performance web service. It should use a database (or any other storage system), some processing language (either scripting or not), and a web-server daemon. The system should be distributed across a large number of servers so the service runs fast and reliably.
It should replicate data to achieve reliability, and at the same time it must provide distributed computing features in order to process large amounts of data (primarily, queries on large databases that would not run on a single server with a suitable level of responsiveness). Caching techniques are out of scope.
Which cluster/cloud solutions should I consider?
There are plenty of Single-System-Image (SSI), clustering file systems (can be a part of the design), projects like Hadoop, BigTable clones, and many others. Each has its pros and cons, and the "about" page always says the solution is great :) If you've tried to deploy something that addresses the subject - share your experience!
UPD: It's not a file hosting and not a game, but something rather interactive. You can take StackOverflow as an example of a web-service: small pieces of data, semi-static content, intensive database operations.
Cross-Post on ServerFault
You really need a better definition of "big". Is "Big" an aspiration, or do you have hard numbers which your marketing department* reckon they'll have on board?
If you can do it using simple components, do so. The likes of Cassandra and Hadoop are easy neither to set up (especially the latter) nor to develop for; developers who are able to develop such an application effectively will be very expensive and difficult to hire.
So I'd say, start off using your favourite "Traditional" database, with an appropriate high-availability solution, then wait until you get close to the limit (You can always measure where the limit is on your real application, once it's built and you have a performance test system).
Remember that Stack Overflow uses pretty conventional components, simply well tuned, with a small amount of commodity hardware. This is fine for its scale but would never work for, e.g., Facebook; the developers knew that the audience of SO was never going to reach Facebook levels.
When "traditional" techniques start failing, e.g. you reach the limit of what can be done on a single database instance, then you can consider sharding or doing functional partitioning into more instances (again with your choice of HA system).
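For what sharding a conventional database can look like at the application level, here is a minimal Python sketch that routes each key to one of several database instances by hashing it. The host names in SHARDS and the function name shard_for are hypothetical, and this is only one simple scheme, not the one any particular site uses.

```python
import hashlib

# Hypothetical database instances; in practice these would be connection strings.
SHARDS = ["db0.internal", "db1.internal", "db2.internal", "db3.internal"]


def shard_for(key, shards=SHARDS):
    """Pick the instance responsible for `key` using a stable hash.

    hashlib is used instead of the built-in hash() because hash() of a
    string changes between Python processes, which would break routing.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for user_id in (1, 2, 3):
        print(user_id, "->", shard_for(user_id))
```

One caveat with plain modulo hashing is that changing the number of shards remaps most keys, so if you expect to add instances often, a directory table or consistent hashing is usually a better fit.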
The only time you're going to need one of these (e.g. Cassandra) "nosql" systems is if you have a homogeneous data store with very high write requirement and availability requirement; even then you could probably still solve it by sharding conventional systems - as others (even Facebook) have done at times.
It's hard to make specific recommendations since you've been a bit vague, but I would recommend Google App Engine for basically any web service. It's easy to use, and it is built on the Google architecture, so it is fast and reliable.
I'd like to recommend Stratoscale Symphony. It's a private cloud service that does it all; it provides everything you just mentioned. Their Symphony products deliver the public cloud experience in your enterprise data center. If that's what you're looking for, I suggest you give it a shot.
