I'm currently researching cloud storage solutions and I came across Ceph which looks quite interesting. I need it for a project where customers can store data that needs to be processed by a piece of software. Potentially that data contains sensitive information, which brings me to my actual question: if a customer or an automated system removes data from the Ceph cluster, do I have to take further steps to ensure a DoD compliant removal?
Assessing Department of Defense compliance without naming a standard or the security level of the information involves a lot of guesswork and assumptions on the answerer's part.
That said, the definitive answer is yes: you will have to take additional steps to adhere to whatever data erasure standard applies to you. Ceph does not provide any automated sanitization process for removing data from disks. The general practice for decommissioning disks that may have held sensitive information includes strict chain-of-custody, degaussing, and destruction procedures. Typical government standards also call for verification of the sanitization, and usually require that the verification be performed by a system other than the one that did the sanitizing.
Generally, overwrite procedures (such as the superseded DoD 5220.22-M standard) are no longer considered sufficient to mitigate possible recovery tactics, and only layered defences including the final destruction of the disk have been demonstrated to be effective.
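For illustration only, here is a minimal Python sketch of a single random-overwrite pass over a block device (the device path is hypothetical). As noted above, an overwrite pass by itself is generally not considered a compliant sanitization, and it cannot reach remapped or wear-leveled sectors; verification would also have to be done by a separate tool or system.

```python
import os

def overwrite_pass(device_path: str, block_size: int = 1024 * 1024) -> None:
    """One pass of random data over a raw block device (illustrative only).

    This is NOT a compliant sanitization on its own; it is only the overwrite
    step that standards such as NIST SP 800-88 build upon. Requires root and
    an unmounted device.
    """
    fd = os.open(device_path, os.O_WRONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)   # total device size in bytes
        os.lseek(fd, 0, os.SEEK_SET)
        written = 0
        while written < size:
            chunk = os.urandom(min(block_size, size - written))
            written += os.write(fd, chunk)
        os.fsync(fd)                           # force the writes to the device
    finally:
        os.close(fd)

# Hypothetical usage; double-check the device path before running anything like this:
# overwrite_pass("/dev/sdX")
```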
Additionally, Ceph is not really a "cloud storage solution" in itself: it is not typically run on top of a cloud platform, but rather used to provide distributed storage in an on-premises deployment. Running Ceph on top of something like AWS's Elastic Block Store or GCP's Persistent Disk is not advisable.
So I've come across this AzCopy tool, and multiple tutorials that say it's good for backing up my storage blobs and whatnot.
Isn't Azure Storage automatically backed up? Isn't that what locally redundant means?
I just want to make sure I'm not missing something and putting my application in jeopardy by not running some external backup.
Redundancy is different from backups. Redundancy means that all your changes are replicated to another location; in the event of a failover, the replica can (in theory) take over as the primary and serve the hopefully latest state of your file system. However, the fact that everything is replicated also means that your accidental deletes, file corruptions, etc. are replicated as well. Backups are meant to protect against exactly that: if you accidentally mess something up and issue some delete requests, you still have the backups and can usually go back to any point in time (provided you made a backup at that time, of course).
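To make the difference concrete, here is a rough sketch of a point-in-time backup step using the azure-storage-blob Python SDK (the connection string and container name are placeholders). Geo-redundancy alone never gives you this "state as of yesterday" property:

```python
from azure.storage.blob import BlobServiceClient

# Placeholders: substitute your own connection string and container name.
service = BlobServiceClient.from_connection_string("<your-connection-string>")
container = service.get_container_client("app-data")

# Take a point-in-time snapshot of every blob in the container.
# Replication would faithfully copy a later accidental delete;
# the snapshot preserves the blob as it is right now.
for blob in container.list_blobs():
    container.get_blob_client(blob.name).create_snapshot()
    print(f"Snapshot taken for {blob.name}")
```

Whether you use snapshots, AzCopy to a second storage account, or something else, the point is the same: a backup is a separate copy from a known point in time.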
And of course it's never a bad idea to avoid being fully dependent on Azure.
The most important thing about any backup policy is that, before you create it, you decide what you are protecting against and what sort of data you are backing up.
If the data you are backing up is an offsite copy of working data, and access to it is restricted to admin personnel who all know what the data is, then replication could well be all you need to protect against a hardware failure on Azure.
If, however, you are backing up customer data, or file data that Fred in accounts randomly deletes when he falls asleep at the keyboard, then you have a different threat model and should design your backups accordingly.
Where you back it up is very much a matter of personal requirements and philosophy. I have known customers who keep backups on both Azure and AWS (even though their only compute workload was on Azure). If your threat model includes Microsoft going bust and selling all of their kit on eBay one morning, then it makes sense to back up elsewhere. Or you can decide that you trust Azure not to go bust and simply split your data across multiple regions.
TL;DR
Understand what you are protecting your data from, and design your backup policy from that.
We presently have a social-networking kind of platform. We are now working on a file sharing feature, wherein users should be able to upload and share files (PDF, PPT, DOC, images, ZIP) with friends and groups.
Which specific technologies should we look at? We are not looking for storage providers like Dropbox or Amazon S3 as an answer; we want advice on efficient storage technologies. We also have to store attributes of each file, such as the author, who the file is shared with, edit rights, download rights, etc.
Any help would be appreciated.
The answer depends on your specific requirements. In general, you should look for a provider that offers high availability (e.g. no single point of failure), high durability (once something is written, it stays written), and high performance (low latency, high throughput). In addition, you may want certain security features, but the specifics are, again, a function of your needs. You noted the ability to specify sharing attributes, so you'll want a provider with a high degree of flexibility and control in specifying access permissions. To store related data, like authorship, you'll want the ability to store and retrieve arbitrary metadata associated with the storage object. Finally, while you stated you don't want a specific provider recommendation, I will nonetheless add that Google Cloud Storage is an excellent choice because it provides all of the above functionality and more (full disclosure: I work on Google's cloud products).
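For example, with the google-cloud-storage Python client (the bucket, object, and attribute names below are made up for illustration), arbitrary key/value metadata such as the author and sharing flags can be attached directly to each object:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-filesharing-bucket")      # hypothetical bucket name

# Upload a shared file and attach application-level attributes
# as custom metadata on the object itself.
blob = bucket.blob("uploads/alice/report.pdf")
blob.metadata = {
    "author": "alice",
    "shared_with": "group:project-x",
    "edit_rights": "false",
    "download_rights": "true",
}
blob.upload_from_filename("report.pdf")

# Later, read the attributes back.
blob.reload()
print(blob.metadata)
```

Note that object metadata is not queryable, so for questions like "show me everything shared with group X" you would still keep these attributes in your database and treat the object metadata as a convenience copy.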
I've been researching the idea of using a distributed file system along with my dedicated servers instead of going with Amazon S3, and the results have been nothing but massive headaches!
My project has the following characteristics/requirements:
User files are stored in dedicated servers. Each file is stored in 2 separate machines, located in different data centers (150-200 miles away from each other)
I'm using Amazon RDS to host the associated MySQL database (*). It's fairly compact (it only holds IDs/file metadata)
Files/data is around 50TB. Naturally, data does change and will definitely grow with time
My question is: is there a good general-purpose, distributed, parallel, fault-tolerant file system that has the following characteristics:
Stable & reasonably fast (upload/download)
Fairly easy to set up & maintain
Handles data storage so that I only have to care about adding/removing servers if the need arises (i.e. add new servers to the file system's pool by editing a simple config, or something like that)
I've read about OpenStack, GlusterFS, MogileFS, XtreemFS, etc...but the more I read, the more I get confused!
(*) Yes, I realize the contradiction. Cost-wise it does make sense to host the database on RDS, but storing (up to) 50TB of users' files on Amazon is way too expensive compared to using dedicated servers (provided they're good enough).
PS. My app isn't live yet, so I'm open to suggestions if someone has a good idea that fits my case well.
EDIT: I'm not trying to make an S3 clone; I just need to use an existing hosting infrastructure to build a small-scale cloud solution, and my question is about finding the right distributed file system to handle/automate this.
We recently switched from an expensive storage solution to the open-source LizardFS for our distributed storage. It is quite simple to set up and scale once you understand the basic concepts.
Check out https://docs.lizardfs.com/introduction.html#architecture for a quick overview, but forget about shadow masters and metaloggers for now. What you need to know is that there are:
a master, which regulates the traffic (make sure it has enough CPU);
chunkservers, which actually store the data (use any kind of off-the-shelf hardware with a bunch of hard disks attached);
clients, which are just simple mount points. So you can get a giant 50TB mount if you want. The master tells the client where to find/store the files; the actual data is transferred straight between the client and the chunkservers.
You can add as many chunkservers as you want, and the master will automatically try to balance your storage usage across them. Adding storage is a matter of adding hard drives or adding servers. They don't have to be actual bare-metal machines, but that is probably the cheapest.
There are two great features in LizardFS that enable geo-replication:
Goals (see https://docs.lizardfs.com/adminguide/replication.html#standard-goals): how important are your files? You can define, at the file or folder level, how many times a file needs to be replicated. Do you want 2 copies? 3? 10? You could define a goal of 2 copies for old files that are simply there for archiving purposes, and a goal of 4 copies on SSD drives for all new files (see the config sketch after this list).
Those same goals can also be used for geo-replication: you define that your data has to be stored in at least two different locations by labeling your chunkservers accordingly (e.g. DC1 and DC2).
Rack awareness (see https://docs.lizardfs.com/adminguide/advanced_configuration.html#configuring-rack-awareness-network-topology): you basically define IP ranges to teach the system what your network looks like. This way, clients will try to read from the closest chunkserver.
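As a sketch of how those two features fit together (syntax reproduced from memory, so check the linked docs; the labels and paths are made up), you label each chunkserver, define a goal that references the labels, and apply the goal to a directory:

```
# mfschunkserver.cfg on each chunkserver: label it with its location
LABEL = dc1          # use dc2 on the servers in the other datacenter

# mfsgoals.cfg on the master: "id name : label ..." ( _ means "any chunkserver")
1  1       : _
2  2       : _ _
3  two_dcs : dc1 dc2      # one copy in each datacenter

# Apply the goal to existing data from a client mount
lizardfs setgoal -r two_dcs /mnt/lizardfs/projects

# mfstopology.cfg on the master: map client networks to "racks"
# so reads prefer the nearest chunkserver
192.168.1.0/24   1
192.168.2.0/24   2
```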
The ease of setting it up is what sold LizardFS to me. I've heard very good things about Ceph, but setting it up is another matter...
What worried me at first was how proven the technology is, so I spent quite a lot of time researching who actually uses it.
Orange Poland (a large telecom provider) is one of its users.
And CloudWeavers/OpenNebula actually built a business around it, selling complete solutions.
Won't it take more than one person several months a year to manage these servers? That will cost money; then you have the cost of hosting the data yourself, plus the much bigger cost that the business/system you are building is not obviously scalable. In addition, any likely investor will be put off by a complex home-grown data hosting system. How will you ensure integrity and security on par with Amazon? Your maximum savings look like $30,000 or so per year.
You could save money with a de-duplicated storage system where you only store the unique chunks of data (see also how rsync works). I don't know how much duplication there is in your data, though.
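As a rough illustration of the idea (plain Python, fixed-size chunks; real tools like rsync use rolling/content-defined chunking, which works much better), a content-addressed store keeps only one copy of each unique chunk:

```python
import hashlib
import os

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB fixed-size chunks, purely illustrative

def store_file(path: str, chunk_dir: str) -> list[str]:
    """Split a file into chunks and store each unique chunk once,
    keyed by its SHA-256 hash. Returns the ordered list of chunk
    hashes needed to reassemble the file."""
    os.makedirs(chunk_dir, exist_ok=True)
    manifest = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            chunk_path = os.path.join(chunk_dir, digest)
            if not os.path.exists(chunk_path):    # deduplication happens here
                with open(chunk_path, "wb") as out:
                    out.write(chunk)
            manifest.append(digest)
    return manifest
```

How much this saves depends entirely on how much duplication your users' files actually contain.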
I recommend LizardFS and GfarmFS.
IMHO Ceph is a major disappointment and so is XtreemFS.
I would like to understand what is the best way to mitigate risk of vendor lock-in for cloud-based systems.
For example, I'd like to deploy a multitude of different systems to, say, Amazon EC2 or Windows Azure, but I'd like to minimize the cost of migrating those systems to an alternative cloud vendor if/when necessary.
At the very least, it seems like the more I rely on vendor-specific services (like Amazon Simple Queue Service), the more I'm inherently locked in (at least I think so), but I'd like to understand this risk better, and any risks beyond it.
Are there architectural strategies I can use to mitigate this (e.g. rely on MapReduce, since my scripts will be portable to another MapReduce cloud environment)? Are there OSes or stacks that are better than others (Linux, LAMP)? Is using jclouds helpful?
Ideally, I'd like to design virtual systems that can be deployed on EC2, for example, but then easily migrated to Azure or App Engine (or vice versa).
I generally write in Java, but am considering selective use of Scala and Python (or Jython), and am generally still trying to stay JVM-based. I tend to do a lot of parallel processing, and rely on both SQL and non-SQL (but not necessarily NoSQL) storage and data manipulation technologies.
Thanks in advance. Hope I'm not being too unrealistic here.
In my opinion, the only architectural pattern that addresses the problem you describe is abstraction.
Make sure to stick to resources that are offered across vendors, like storage, queues, etc., and create an abstraction layer for each of them.
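As a sketch of what such a layer might look like (the interface and class names are made up; the adapters use boto3 and azure-storage-blob, so treat the details as illustrative), the application only ever talks to the vendor-neutral interface:

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Vendor-neutral storage interface the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class S3Store(BlobStore):
    def __init__(self, bucket: str):
        import boto3
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

class AzureBlobStore(BlobStore):
    def __init__(self, connection_string: str, container: str):
        from azure.storage.blob import ContainerClient
        self._container = ContainerClient.from_connection_string(
            connection_string, container)

    def put(self, key: str, data: bytes) -> None:
        self._container.upload_blob(name=key, data=data, overwrite=True)

    def get(self, key: str) -> bytes:
        return self._container.download_blob(key).readall()

# Application code depends only on BlobStore; switching vendors means
# swapping the adapter you inject here, not rewriting business logic.
def save_report(store: BlobStore, report_id: str, payload: bytes) -> None:
    store.put(f"reports/{report_id}", payload)
```

The same pattern applies to queues, caches, and so on; the cost is that you are limited to the lowest common denominator of the services you abstract over.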
Hope this helps. I don't think it's a simple task, given the variability of the services across cloud providers.
I agree with IgoreK - if you're doing this in code, it'll take a lot of abstraction, that's about it.
Another option is to take an IaaS approach: design your application around virtual machine roles only. Most cloud providers offer some form of virtual machine role (Amazon, Azure, Rackspace, etc.). Migration then means far fewer code changes, but a bit more admin on your side.
Microsoft's Customer Advisory Team has an excellent sample on how to do that (I think I downloaded the project from here). There's a whole lot of code in it, and some really good abstractions to make things "free". Obviously, as with any abstraction, you also introduce a new layer of complexity, so make sure you really understand all of it before applying it.
In most cases, less is more. And even though lock-in is not something you want, it's probably not that hard to "fix" if the need arises. But ask yourself whether it's important to satisfy that need now, or whether you should finish the project and refactor later.
Honestly, your question is based on a bit of a false premise. You're looking to avoid lock-in rather than trying to take full advantage of the platform you've chosen to use.
The better way of approaching the issue is not to try to have your infrastructure be hot-swappable (e.g., avoid vendor lock-in), but to actually make a decision about the IaaS provider you want to use and leverage it as best as you possibly can.
Okay, so we have to store our clients' private medical records online, and the website will also get a lot of requests, so we have to use some scaling solution.
We could take our own share of a datacenter and run something like Zend Server Cluster Manager on it, but services like Amazon EC2 look a lot easier to manage, and they are considerably cheaper too. We just don't know if they are secure enough!
Are they?
Any better solutions?
More info: I know that there is a reference server which is highly secured, and without it even the decrypted data on the cloud server would be useless: just a bunch of meaningless numbers that aren't even linked to each other.
Making the question more clear: Are there any secure storage and process service providers that guarantee there won't be leaks from their side?
First off, you should contact AWS and explain what you're trying to build and the kind of data you deal with. As far as I remember, they have regulations in place to accommodate most if not all the privacy concerns.
E.g., in Germany such a thing is called an "Auftragsdatenvereinbarung". I have no idea how this relates or translates to other countries, but AWS offers it.
No matter whether you go with AWS or another cloud computing service, though, the issue stays the same. Therefore, what is actually possible is probably best answered by a lawyer, and based on that hopefully well-educated (and expensive) recommendation I'd go cloud shopping, or maybe not. If you're in the EU, there are a ton of regulations, especially with regard to medical records, and some countries add more on top.
From what I remember, end-to-end encryption is basically required when you deal with this kind of data.
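For what it's worth, here is a minimal sketch of client-side encryption using Python's cryptography package, so that nothing readable ever leaves your own infrastructure (the key handling is deliberately oversimplified; in practice, key management on hardware you control is the hard part):

```python
from cryptography.fernet import Fernet

# In reality the key would live in an HSM or key management system under
# your control, never next to the data and never at the cloud provider.
key = Fernet.generate_key()
fernet = Fernet(key)

def encrypt_record(plaintext: bytes) -> bytes:
    """Encrypt a record locally before it is uploaded anywhere."""
    return fernet.encrypt(plaintext)

def decrypt_record(ciphertext: bytes) -> bytes:
    """Decrypt after download, again only on infrastructure you control."""
    return fernet.decrypt(ciphertext)

ciphertext = encrypt_record(b"patient 4711: blood pressure 120/80")
assert decrypt_record(ciphertext) == b"patient 4711: blood pressure 120/80"
```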
Last but not least, security also depends on the setup, the application, etc.
For complete and full security, I'd recommend a system that is not connected to the Internet. All others can fail.
You should never outsource highly sensitive data. Your company, and only your company, should have access to it, in both software and hardware terms. Even if your hosting provider is generally trusted, someone there might simply steal hardware.
Depending on the size of your company, you should have your own servers, preferably inaccessible even to the technicians in your datacenter (assuming you don't own the datacenter ;).
So the more important the data is, the fewer outside people should have access to it in any way. In the best case, you can name every person who has access to it.
(Update: this might not apply to anonymized data, but as you're speaking of customers, I don't think that applies here?)
(On a third thought: there are probably laws to take into consideration regarding how you have to handle that kind of information ;)