Why back up an Azure storage account if it's locally or geo-redundant? - azure

So I've come across this AzCopy tool, and multiple tutorials that say it's good for backing up my storage blobs and whatnot.
Isn't Azure Storage automatically backed up? Isn't that what locally redundant means?
I just want to make sure I'm not missing something and putting my application in jeopardy by not running some external backup.

Redundancy is different from backups. Redundancy means that all your changes are replicated to another location. In case of a failover, the secondary can theoretically take over as the primary and serve the (hopefully) latest state of your file system. However, the fact that everything is replicated also means that your accidental deletes, file corruptions, etc. are replicated too. Backups are meant to protect against this. If you accidentally mess something up and issue some delete requests, you still have the backups, and you can usually go back to a point in time (provided you made a backup at that time, of course).
And of course it's not a bad idea to avoid being fully dependent on Azure.
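To make the distinction concrete, here is a minimal toy model in Python. It is not Azure code; the class and function names are made up for illustration. It only shows that a replica faithfully mirrors an accidental delete, while a point-in-time copy does not.

```python
import copy


class ReplicatedStore:
    """Toy model: every change to the primary is mirrored to a replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def put(self, name, data):
        self.primary[name] = data
        self.replica[name] = data      # replication copies writes...

    def delete(self, name):
        self.primary.pop(name, None)
        self.replica.pop(name, None)   # ...and it copies deletes too


def take_backup(store):
    """Point-in-time copy that later changes to the store cannot touch."""
    return copy.deepcopy(store.primary)


store = ReplicatedStore()
store.put("report.txt", "quarterly numbers")
backup = take_backup(store)

store.delete("report.txt")             # the accidental delete

print("report.txt" in store.replica)   # the replica lost the file as well
print(backup.get("report.txt"))        # the backup still has it
```

The replica protects against losing a disk or a data centre; only the separate copy protects against your own mistakes.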

The most important thing about any backup policy is that before you create it you decide what you are protecting against, and what sort of data you are backing up.
If the data you are backing up is an offsite backup of working data, and access to that data is restricted to admin personnel who all know what the data is, then replication could well be all you need to protect against a hardware failure on Azure.
If, however, you are backing up customer data, or file data that Fred in accounts randomly deletes when he falls asleep at the keyboard, then you have a different threat model and you should design your backups accordingly.
Where you back it up is very much a matter of personal requirements and philosophy. I have known customers who keep backups on both Azure and AWS (even though their only compute workload was on Azure). If in your threat model you want to protect against MS going bust and selling all of their kit on eBay one morning, then it makes sense to back up elsewhere. Or you can decide that you trust Azure not to go bust and just split data across multiple regions.
Understand what you are protecting your data from, and design your backup policy from that.


Architecture decisions for system comprising mobile app with database in cloud and varying user restriction levels

I am looking to develop an app that is to be used by a fairly small number of people and which has to store and recall data from a cloud database. Users should have various access levels in that some can create stuff, some just read, others modify, some can do anything etc. Just like you would do on a file system.
I am currently considering Azure (very new to it) and thinking what would be the components involved in the project. Obviously, a mobile app (Xamarin.Forms) would be front end. Some kind of Cosmos DB or another database in the cloud. Blob storage too for the media files created by users. But my main question is how to implement the control of what user can do what actions to which data.
A simple way would be to do it within the app itself, but that is counterintuitive and a security risk. Even though this is an internal app used by people in the same or sister organizations, it really sounds bad.
The best option would be for the database itself to handle this, but I am not aware that such a mechanism exists. Hopefully it does, and someone will point me in the right direction.
The only other way I see is having some kind of mid layer, still on the back end but just in front of the database. However, that also seems clunky, and I am unaware of how to even implement it "in cloud".
What would be my actual options?
To clarify, it's about having permissions assigned based on certain columns of a table, for example, and not about having different tables for different users that share parts of the data.
That's why it is "Architecture decisions" question, and not "how do i give read permissions to user X of my database Y".
An answer might be "Database X" has what you want. Or, least favourably, "There's no way to offload that to DB. You will have to keep all data separately, so that users can only operate on their set of data, and then collate stuff on the backend". Or something in between, perhaps.
I'm not knowledgeable with Azure or any of that other stuff, but every DBMS will have user accounts that enable different permissions, eg for Apache Derby, MySQL, etc.
I would never implement authentication on the client side.
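The "mid layer" idea from the question can be sketched roughly as follows. This is only an illustration of column-level checks done on the back end, not a recommendation of a specific Azure service; the role names, column names and the `PERMISSIONS` table are all hypothetical.

```python
# Hypothetical roles and columns, for illustration only.
PERMISSIONS = {
    "viewer": {"read": {"title", "status"}, "write": set()},
    "editor": {"read": {"title", "status", "notes"},
               "write": {"status", "notes"}},
    "admin":  {"read": {"title", "status", "notes", "cost"},
               "write": {"title", "status", "notes", "cost"}},
}


def readable_view(role, row):
    """Return only the columns this role is allowed to read."""
    allowed = PERMISSIONS[role]["read"]
    return {col: val for col, val in row.items() if col in allowed}


def apply_update(role, row, changes):
    """Apply changes if every touched column is writable for this role."""
    denied = set(changes) - PERMISSIONS[role]["write"]
    if denied:
        raise PermissionError(f"{role} may not modify: {sorted(denied)}")
    return {**row, **changes}


row = {"title": "Site survey", "status": "open", "notes": "", "cost": 1200}

print(readable_view("viewer", row))
print(apply_update("editor", row, {"status": "closed"}))
```

The mobile app would only ever talk to this layer, so the rules cannot be bypassed by a modified client.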

Azure Traffic Manager erratic load balancing causing issues

I have an Azure Traffic Manager configured to route traffic across two data centres based on performance (latency). The two DCs are replicas of each other, engineered this way so that our global customers get good performance no matter where they connect from.
The application tiers do not hold state, and the data tiers use SQL merge replication on a 1-minute timer to keep the DBs in sync, so as to provide service continuity in the event of a data centre failover.
The issue I have found is that Traffic Manager's routing is slightly erratic. I have observed a user registering under one data centre only to find the login has been routed to the other one. The SQL replication hasn't synced at this point, so the second DC isn't aware that the user exists, even though the user both registered and logged in from the same location! The DCs are in West US and Southeast Asia.
I'm looking at a few options to fix this. Solution A is to silo each user's data in a specific data centre, so that whatever DC the user registers with is used thereafter. I wouldn't have syncing issues, but I would lose the continuity advantage that the SQL replication provides.
Solution B is to use a different more predictable global load balancer. But first I want some opinions and to perhaps see if I am doing something wrong or perhaps my architecture is flawed.
Thanks for advice.
My solution also had challenges with Traffic Manager, although slightly different from yours. Traffic Manager is a great-value solution if it can work for you. As far as I am aware, no configuration in Traffic Manager makes it aware of sessions, so it simply follows its configured routing method, which is performance in your case. It looks erratic only because you expect it to keep a session pinned to one endpoint for as long as that endpoint is available.
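Since Traffic Manager works at the DNS level, any stickiness has to live in the application. As a rough sketch (not a Traffic Manager feature), the app could remember the region a user registered in, for example in a cookie or token, and prefer it while it is healthy. The function name, region names and the `healthy` map below are assumptions made for illustration.

```python
def choose_region(pinned_region, lowest_latency_region, healthy):
    """Prefer the region pinned at registration; fall back only if it is down.

    pinned_region: region stored for the user at registration, or None.
    lowest_latency_region: what latency-based routing would pick right now.
    healthy: dict mapping region name -> bool.
    """
    if pinned_region is not None and healthy.get(pinned_region, False):
        return pinned_region
    return lowest_latency_region


healthy = {"westus": True, "southeastasia": True}

# Registered in West US, but this request happened to resolve to Asia:
print(choose_region("westus", "southeastasia", healthy))

# West US is down, so the user fails over and continuity is kept:
healthy["westus"] = False
print(choose_region("westus", "southeastasia", healthy))
```

This keeps the replication for failover while avoiding reads from a DC that has not yet received the user's own writes.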
In terms of your solution, it is very much Enterprise. To move backwards with solution A probably doesn't fit the requirement given what you went to the effort of building. Solution B brings many more features that Traffic Manager lacks and one of them will resolve your issue. For other reasons I am looking at
It is designed for Azure and is available as a pre-installed VM. There are others available but this has been my choice and what I would use if I were in your position and wanted to keep the level of resilience you currently have.
Hope this helps.

Is CouchDB per-user database approach feasible for users with lots of shared data?

I want to implement a webapp - a feed that integrates data from various sources and displays them to users. A user should only be able to see the feed items that he has permissions to read (e.g. because they belong to a project that he is a member of). However, a feed item might (and will) be visible by many users.
I'd really like to use CouchDB (mainly because of the cool _changes feed and map/reduce views). I was thinking about implementing the app as a pure couchapp, but I'm having trouble with the permissions model. AFAIK, there are no per-document permissions in CouchDB and this is commonly implemented using per-user databases and replication.
But when there is a lot of overlap between what various users see, that would introduce a LOT of overhead...stuff would be replicated all over the place and duplicated in many databases. I like the elegance of this approach, but the massive overhead just feels like a dealbreaker... (Let's say I have 50 users and they all see the same data...).
Any ideas on that, please? An alternative solution?
You can enforce read permissions as described in CouchDB Authorization on a Per-Database Basis.
For write permissions you can use validation functions as described in CouchDB: The Definitive Guide - Security.
You can create a database for each project and enforce the permissions there, then all the data is shared efficiently between the users. If a user shares a feed himself and needs permissions on that as well you can make the user into a "project" so the same logic applies everywhere.
Using this design you can authorize a user or a group of users (roles) for each project.
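As a rough sketch of the per-project design, each project database would get its own security object, which CouchDB stores under `/{db}/_security`. The Python below only builds that object and models the read check in a simplified way (it ignores server admins); the helper names and role names are made up for illustration.

```python
import json


def security_doc(admin_roles, member_roles, member_names=()):
    """Build a CouchDB-style security object for one project database."""
    return {
        "admins": {"names": [], "roles": list(admin_roles)},
        "members": {"names": list(member_names), "roles": list(member_roles)},
    }


def can_read(user_name, user_roles, security):
    """Simplified model of the per-database read check."""
    members = security["members"]
    admins = security["admins"]
    if not members["names"] and not members["roles"]:
        return True  # no members listed: the database is readable by anyone
    names = set(members["names"]) | set(admins["names"])
    roles = set(members["roles"]) | set(admins["roles"])
    return user_name in names or bool(roles & set(user_roles))


alpha = security_doc(["project-alpha-admin"], ["project-alpha"])
print(json.dumps(alpha, indent=2))

print(can_read("bob", ["project-alpha"], alpha))
print(can_read("eve", ["project-beta"], alpha))
```

With 50 users who all see the same project, there is still only one database holding the data; only the role assignments differ.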
Other than (as victorsavu3 has suggested already) handling your read auth in a proxy between your app and couch, there are only two other alternatives that I can think of.
First is to just not care, disk is cheap and having multiple copies of the data may seem like a lot of unnecessary duplication, but it massively simplifies your architecture and you get some automatic benefits like easy scaling up to handle load (by just moving some of your users' DBs off to other servers).
Second is to have the shared data split into a different DB. This will occasionally limit things you can do in views (eg. no "Linked Documents") but this is not a big deal in many situations.

Good distributed general purpose filesystem in my case?

I've been researching the idea of using distributed file system along with my dedicated servers instead of going with Amazon S3 and the results are nothing but massive headaches!
My project has the following characteristics/requirements:
User files are stored on dedicated servers. Each file is stored on 2 separate machines, located in different data centers (150-200 miles away from each other)
I'm using Amazon RDS to host the associated MySQL database (*). It's fairly compact (it only holds IDs/file metadata)
Files/data total around 50TB. Naturally, the data does change and will definitely grow with time
My question is: is there a good general-purpose, distributed, parallel, fault-tolerant file system that has the following characteristics:
Stable & reasonably fast (upload/download)
Fairly easy to setup & maintain
Handle data storage so that I only have to care about removing/adding new servers if the need arise (ie. add new servers to the filesystem's server pool by editing a simple config, or something like that)
I've read about OpenStack, GlusterFS, MogileFS, XtreemFS, etc...but the more I read, the more I get confused!
(*) Yes, I realize the contradiction. Cost-wise it does make sense to host the database on RDS. But storing (up to) 50TB of users files on amazon is way too expensive compared to using dedicated servers (provided it's good enough).
PS. my app isn't live yet, so I'm open to suggestions if someone has a good idea that fits my case well.
EDIT I'm not trying to make an S3 clone, I just need to use an existing hosting infrastructure to build a small-scale cloud solution. My question is about finding the right distributed file system to handle/automate this.
We recently switched from an expensive storage solution to the open-source LizardFS for our distributed storage. It is quite simple to set up and scale once you understand the basic concept.
Check out https://docs.lizardfs.com/introduction.html#architecture for a quick overview. But forget about shadow masters and metaloggers for now. What you need to know is that there are
a master: which regulates the traffic (make sure it has enough CPU)
chunkservers: which actually store the data. Use any kind of off-the-shelf hardware with a bunch of hard disks attached.
clients: which are just simple mount points. So you can get a giant 50TB mount if you want. The master will tell the client where to find/store the files. The actual data is transferred straight from the client to the chunkserver and back.
You can add as many chunkservers as you want, the master will automatically try to balance your storage usage across them. Adding storage is a matter of adding harddrives, or adding servers. They don't have to be actual bare metal machines, but that is probably the cheapest.
There are 2 amazing features in lizardfs that allow georeplication.
Goals (see https://docs.lizardfs.com/adminguide/replication.html#standard-goals): how important are your files to you? You can define, at file level or folder level, how many times a file needs to be replicated. Do you want 2 copies? 3? 10? You could define a goal of 2 copies for old files that are simply there for archiving purposes, and a goal of 4 copies on SSD drives for all new files.
Those same goals can also be used to do georeplication. You define that your data has to be stored in at least two different locations by labeling your chunkservers accordingly (e.g. DC1 and DC2).
Rack awareness (see https://docs.lizardfs.com/adminguide/advanced_configuration.html#configuring-rack-awareness-network-topology): you basically define IP ranges to teach the system what your network looks like. This way, clients will try to read files from the closest server.
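To give a feel for how label-based goals lead to georeplication, here is a small conceptual model in Python. It is not LizardFS code or configuration; the function, the server names and the use of `_` to mean "any chunkserver" are assumptions of this sketch.

```python
def place_copies(goal_labels, chunkservers):
    """Pick one distinct chunkserver per label in the goal.

    goal_labels: e.g. ["DC1", "DC2"]; "_" means any chunkserver.
    chunkservers: dict mapping server name -> label.
    """
    free = dict(chunkservers)
    chosen = []
    # Satisfy specific labels first so wildcards cannot steal their servers.
    for wanted in sorted(goal_labels, key=lambda label: label == "_"):
        match = next((name for name, label in free.items()
                      if wanted == "_" or label == wanted), None)
        if match is None:
            raise ValueError(f"no free chunkserver for label {wanted!r}")
        chosen.append(match)
        del free[match]
    return chosen


servers = {"cs1": "DC1", "cs2": "DC1", "cs3": "DC2"}

# A goal of "one copy in each data centre":
print(place_copies(["DC1", "DC2"], servers))

# A goal of "two copies anywhere":
print(place_copies(["_", "_"], servers))
```

The point is that the same mechanism that sets the number of copies also decides where they live, once the chunkservers carry location labels.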
The ease of setting it up is what sold LizardFS for me. I've heard very good things about Ceph, but setting it up is another matter...
What worried me at first was how proven the technology is/was. So I spent quite a lot of time researching who uses it.
Orange Poland (a large telecom provider) is one of the users.
And Cloudweavers/OpenNebula actually built a business around it, selling complete solutions.
Won't it take more than one person a few months a year to manage these servers? That will cost some $, then you have the cost of hosting the data yourself, then you have the added huge cost that the business / system you are building is not obviously scalable? In addition any likely investor will be turned away by a complex home grown data hosting system. How will you ensure integrity/security on par with Amazon? Your max savings per year look like $30,000 or so.
You could save money by doing a de-duplicated storage system where you just store all the unique chunks of data - also see rsync. Don't know how redundant your data is though.
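A minimal sketch of the de-duplication idea, assuming fixed-size chunks identified by their SHA-256 hash. The class name and the tiny chunk size are for illustration only; real systems often use variable-size chunking.

```python
import hashlib


class DedupStore:
    """Store each unique chunk once; files are lists of chunk hashes."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # sha256 hex digest -> chunk bytes
        self.files = {}    # file name -> ordered list of digests

    def put(self, name, data):
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep first copy only
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])


store = DedupStore(chunk_size=4)
store.put("a.bin", b"AAAABBBBAAAA")
store.put("b.bin", b"BBBBCCCC")

print(len(store.chunks))      # unique chunks actually stored
print(store.get("a.bin"))
```

How much this saves depends entirely on how much the stored files overlap.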
I recommend LizardFS and GfarmFS.
IMHO Ceph is a major disappointment and so is XtreemFS.

Windows BackupRead / BackupWrite and ACLs

I have been trying to understand what should be the right way in using BackupRead and BackupWrite for backing up data on a computer and especially about restoring it reliably.
Now I understand how to use the API and have been successful. However there's one thing that bothers me.
Besides the file content itself, you can back up any alternate data streams and also the security information (ACLs).
Now if I would store the ACL data for backup and then later, once the data needs to be restored on a different machine OR a newly setup machine what should I do with the SIDs which are related to the ACL?
The SID is most likely no longer valid for the machine and how should the right user be selected?
Now I am looking at this on a bigger scale. Say this is a computer with multiple users and hundreds or thousands of objects with different settings: it would be a mess to get the data restored with the security settings applied again.
If the user of the software wishes to back up the security settings, is this something the user has to take care of himself, updating them accordingly, or what?
Additionally, BackupRead and BackupWrite give me the raw binary data of those items, which is not too hard to use, but obviously this API does not even attempt to address this issue.
Anyone has an idea how a backup application should handle this situation? What is your thought, or any pointers on guidelines for this specific topic?
Thanks a lot.
I think you understand the problems with backing up and restoring data correctly, and understanding a problem correctly is half of solving it. I suppose that you are, like most users of the Stack Overflow site, mostly a software developer and not an administrator of a large network, so you see the problem from the developer's side rather than the administrator's. An administrator knows the restrictions of backing up and restoring ACLs and already works with them.
In general you should understand that the main purpose of backups is to save the data and to restore it later on the same computer or server. Another standard case is restoring a backup from one server to another after a hardware change. In that case the old server will no longer exist. Mostly one makes backups of servers and organizes work on the clients so that no important data is saved on the client computers.
In most cases the backed-up data has domain group SIDs, domain user SIDs, well-known SIDs or SID aliases from the BUILTIN domain in its security descriptors. In that case no changes to the SIDs are needed at all. If the administrator does want to make some changes to the ACLs, he can use existing utilities such as SubInACL.exe.
If you write backup/restore software that you want to use for moving data together with its security information, you can include in the backup some additional meta-information about the local SIDs of the accounts/groups used in the saved security descriptors. In the restore software you can provide the ability to replace SIDs in the saved security descriptors. Many years ago I wrote some utilities for one large customer to clean up the SIDs in security descriptors in the file system, registry and services after a domain migration. It was not so complex. So I suggest that you could implement the same feature in your backup/restore software.
I do believe the Backup* APIs are primarily intended to backup and restore on the same machine, which would render the SID problem irrelevant. However, assuming a scenario where you need to restore a backup on a new install, here's my thoughts on solutions.
For well-known SIDs such as Everyone, Creator Owner and so on, there isn't really any problem.
For domain dependent SIDs you can store them as is, and upon restore you could fixup the domain part, if needed. Likely you should store the domain name as well for such SIDs.
For local users and groups, you should at least store the user/group name for each SID. Fixup on restore could be partially automatic based on these names, or manual (assuming a user interface for the application), where you ask the user whether he wishes to map this user to a new local user, convert these SIDs to a well-known SID, or keep them as is.
Most of the issues related to such SIDs can (and probably typically will) be handled automatically. I'd certainly appreciate a backup application that was smart enough to do the restore I asked it to and figure out that "Erik" on the old machine must be "Erik" on the new machine as well.
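The fixup described above could look roughly like this. It is a sketch of the mapping logic only, in Python rather than against the Windows API; the manifest layout, the `kind` values and the example SIDs for the local accounts are made up for illustration. Unresolved SIDs are collected for later review instead of blocking the restore.

```python
def remap_sid(sid, manifest, new_local_accounts):
    """Return (sid_to_write, needs_review) for one SID found in an ACL.

    manifest: old SID -> {"kind": "well-known" | "domain" | "local",
                          "name": account name}, saved at backup time.
    new_local_accounts: account name -> SID on the restore target.
    """
    entry = manifest.get(sid)
    if entry is None:
        return sid, True                 # unknown SID: keep it, flag it
    if entry["kind"] in ("well-known", "domain"):
        return sid, False                # valid on the new machine as is
    new_sid = new_local_accounts.get(entry["name"])
    if new_sid is None:
        return sid, True                 # no account with the same name
    return new_sid, False                # "Erik" maps to the new "Erik"


manifest = {
    "S-1-1-0": {"kind": "well-known", "name": "Everyone"},
    "S-1-5-21-111-222-333-1001": {"kind": "local", "name": "Erik"},
    "S-1-5-21-111-222-333-1002": {"kind": "local", "name": "Fred"},
}
new_accounts = {"Erik": "S-1-5-21-999-888-777-1005"}

deferred = []
for old in manifest:
    new, review = remap_sid(old, manifest, new_accounts)
    if review:
        deferred.append(old)             # ask about these after the restore
    print(old, "->", new)
```

Everything in `deferred` can be presented to the user in one batch once the data transfer itself has finished.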
And a side note, if you do decide to go with such a solution, remember how annoying it is to start an overnight data transfer just to get back to something 5% done blocking on a popup it could just as easily defer :)
