Our company has an on-prem file server that I'd like to move to the cloud. I followed these directions and was successfully able to map a drive on my local work computer to connect to an Azure File Share. Our company has about 20 locations, ~5 TB of data (mostly "office" type of files) in total, and about 500 users accessing them.
There are two issues I would like to improve but I'm not sure how:
There's somewhat of a lag when opening files. Other than increasing our office's internet speed, is there anything to be done to make it faster? Would some kind of site-to-site VPN help? Would adding some type of server or VM in the "middle" (maybe one per location?) that caches the files reduce the lag?
Also, we have and use an Office 365 subscription. What's the easiest way to use our existing AD structure to transfer over the NTFS permissions that are currently in place?
I Googled around and found a bunch of companies advertising their services, most notably Talon Storage. But it seems like something that could be done without hiring a company. What I'm hoping for is a DIY direction for solving these issues well. Perhaps there's a standard or commonly recommended solution for such issues. Any guidance would be greatly appreciated.
L-A-T-E-N-C-Y. The number one enemy of any cloud-based file server attempt. It ranges from annoying to downright unusable, depending on how far you are from the Azure datacenter of choice.
Imagine a poor soul trying to "stream" a large 20-meg Excel file with 20 references to external files. What used to take maybe 8 seconds on-prem will now take 40 in the cloud (on a good day). It's game over for productivity. Your marketing department that sometimes used to cut video in iMovie over the network? Those days are over.
I understand this is not the answer you were after, but it's the crude reality.
Do not panic; there are solutions. Here's a good one: https://azure.microsoft.com/en-us/services/storsimple/
I'm sure you wanted to get rid of boxes not buy more, but it is what it is.
I've found many questions and answers that are similar to mine, but they all seem to be very specific use cases - or, at least, different/old enough to not really apply to me (I think).
What I want to do is something I thought would be simple by now. The most inefficient thing with the web apps is that copying files between them is slow and time-consuming: you have to FTP (or similar) the files down, then send them back up.
There must be a way to do this same thing natively within Azure, so the files don't have to travel far and aren't subject to the same bandwidth restrictions.
Are there any solid code samples or open-source/commercial tools out there that help make this possible? So far, I haven't come across any code samples, products, or anything else that makes it possible (aside from many very old PowerShell blogs from 5+ years ago). (I'm not opposed to a PowerShell-based solution, either.)
In my case, these are all the same web apps with minor configuration-based customization differences between them. So I don't think Web Deploy is an option, because it's not about deployment of code. Sometimes it's simply creating a clone for a new launch, and other times it's creating a copy for staging/development.
Any help in the right direction is appreciated.
As you've noticed, copying files over to App Service is not the way to go. For managing files across different App Service instances, try using a storage account: you can attach an Azure Files share to the app service and mount it. There's a comprehensive answer below on how to get files into the storage account.
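For example, attaching an Azure Files share to an App Service can be done from the CLI along these lines (a sketch with placeholder resource names; check the current az syntax for your setup):

    # Mount an Azure Files share into the web app under a chosen path (placeholders throughout)
    az webapp config storage-account add \
        --resource-group my-rg \
        --name my-webapp \
        --custom-id shared-files \
        --storage-type AzureFiles \
        --account-name mystorageacct \
        --share-name shared \
        --access-key "$STORAGE_KEY" \
        --mount-path /mounts/shared

If you mount the same share into several apps, they all see the same files under that path, so "copying between apps" stops being a transfer at all.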
Alternatively, if you have control over the app code, you can use blob storage instead of files and read the content directly from the app.
I am very new to Microsoft Azure. I would like to transfer 5 GB of files (datasets) from my Microsoft OneDrive account to Azure storage (Blob storage, I guess), and then share those files with about 10 other Azure accounts (I have some idea of how to share the files). I am not really sure how to go about it, and I would prefer not to download the 5 GB of files from OneDrive and then upload them to Azure. Help would be greatly appreciated, thanks a lot.
David's comment is correct, but I still want to provide a couple of links to get you started. Like he mentioned, if you can break this into several more specific questions you can probably get a much better Stack Overflow response. I think the first part of the question could be phrased as 'How can I quickly transfer 5 GB of files to Azure storage?'. This is still opinion-based to some degree, but it has a couple of more concrete answers:
AzCopy/DMLib are, respectively, a command-line tool and an Azure library that specialize in bulk transfer. There are a couple of options, including async copy and sync copy. These tools are geared more toward upload/download from the file system, but they will get you started (there's a quick AzCopy sketch after this list).
There are storage client libraries for a variety of languages in which you can write custom code to connect up with OneDrive. Here is a getting-started guide for .NET.
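As a rough illustration of the AzCopy route (placeholder account, container, and SAS token; flags depend on the AzCopy v10 build you install):

    # Bulk-upload a local folder to a blob container using a SAS token
    azcopy copy "./datasets" "https://mystorageacct.blob.core.windows.net/datasets?<SAS-token>" --recursive

Note that AzCopy won't read straight from OneDrive, so this route still involves having the files on a local disk (or on a VM) first.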
I think this is a very genuine question, as downloading huge files and uploading them back is a very expensive and time-consuming task. You can refer to a template here which would allow you to do a server-side copy.
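The idea behind a server-side copy is that you hand the storage service a source URL and it pulls the data itself, so nothing has to flow through your machine. A minimal sketch with the CLI (placeholder names; the source must be a URL the Blob service can reach, e.g. a direct-download link):

    # Start a server-side copy: the blob service fetches the source URL directly
    az storage blob copy start \
        --source-uri "https://<direct-download-url-to-the-source-file>" \
        --destination-container datasets \
        --destination-blob dataset1.zip \
        --account-name mystorageacct \
        --account-key "$STORAGE_KEY"

The copy runs asynchronously; the copy status is reported on the destination blob's properties.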
Hopefully this benefits you, or someone else down the line.
I've been researching the idea of using a distributed file system along with my dedicated servers instead of going with Amazon S3, and the results are nothing but massive headaches!
My project has the following characteristics/requirements:
User files are stored on dedicated servers. Each file is stored on 2 separate machines, located in different data centers (150-200 miles away from each other)
I'm using Amazon RDS to host the associated MySQL database (*). It's fairly compact (it only holds IDs/file metadata)
Files/data add up to around 50 TB. Naturally, the data does change and will definitely grow with time
My question is: is there a good general-purpose, distributed, parallel, fault-tolerant file system that has the following characteristics:
Stable & reasonably fast (upload/download)
Fairly easy to set up & maintain
Handles data storage so that I only have to care about removing/adding servers if the need arises (i.e. add new servers to the file system's server pool by editing a simple config, or something like that)
I've read about OpenStack, GlusterFS, MogileFS, XtreemFS, etc...but the more I read, the more I get confused!
(*) Yes, I realize the contradiction. Cost-wise it does make sense to host the database on RDS. But storing (up to) 50 TB of users' files on Amazon is way too expensive compared to using dedicated servers (provided it's good enough).
PS. My app isn't live yet, so I'm open to suggestions if someone has a good idea that fits my case well.
EDIT: I'm not trying to make an S3 clone; I just need to use an existing hosting infrastructure to build a small-scale cloud solution. My question is about finding the right distributed file system to handle/automate this.
We recently switched from an expensive storage solution to the open-source LizardFS for our distributed storage needs. It is quite simple to set up and scale once you understand the basic concepts.
Check out https://docs.lizardfs.com/introduction.html#architecture for a quick overview, but forget about shadow masters and metaloggers for now. What you need to know is that there are:
A master, which regulates the traffic (make sure it has enough CPU)
Chunkservers, which actually store the data. Use any kind of off-the-shelf hardware with a bunch of hard disks attached.
Clients, which are just simple mount points. So you can get a giant 50 TB mount if you want. The master tells the client where to find/store the files; the actual data is transferred straight between the client and the chunkservers.
You can add as many chunkservers as you want; the master will automatically try to balance your storage usage across them. Adding storage is a matter of adding hard drives or adding servers. They don't have to be actual bare-metal machines, but that is probably the cheapest.
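To give a feel for the client side, a mount is roughly this (a sketch; the host name is a placeholder, mfsmount comes from the LizardFS client package, and exact options can differ between versions):

    # Mount the whole LizardFS namespace on a client machine
    mfsmount /mnt/lizardfs -H master.example.com

    # From here it behaves like one big POSIX file system
    df -h /mnt/lizardfs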
There are 2 amazing features in LizardFS that allow geo-replication.
Goals (see https://docs.lizardfs.com/adminguide/replication.html#standard-goals): how important are your files? You can define, at the file or folder level, how many times a file needs to be replicated. Do you want 2 copies? 3? 10? You could define a goal of 2 copies for old files that are simply there for archiving purposes, and a goal of 4 copies on SSD drives for all new files.
Those same goals can also be used for geo-replication. You define that your data has to be stored in at least two different locations by labeling your chunkservers accordingly (e.g. DC1 and DC2).
Rack awareness (see https://docs.lizardfs.com/adminguide/advanced_configuration.html#configuring-rack-awareness-network-topology): you basically define IP ranges to teach the system what your network looks like. This way, clients will be served files from the closest server.
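A hedged sketch of how goals and labels fit together (goal names, labels, and paths are made up, and the exact config syntax can vary between versions, so treat it as a pointer to the docs rather than copy-paste):

    # On each chunkserver, label it by data center in mfschunkserver.cfg (illustrative):
    #   LABEL = DC1
    #   LABEL = DC2
    #
    # Define a custom goal on the master in mfsgoals.cfg that wants one copy per DC, e.g.:
    #   15 two_dcs: DC1 DC2
    #
    # Then apply goals per folder from any client mount:
    lizardfs setgoal -r two_dcs /mnt/lizardfs/projects   # geo-replicated data
    lizardfs setgoal -r 2 /mnt/lizardfs/archive          # plain "2 copies anywhere"
    lizardfs getgoal /mnt/lizardfs/projects              # verify what is set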
The ease of setting it up is what sold me on LizardFS. I've heard very good things about Ceph, but setting it up is another matter...
What worried me at first was how proven the technology was. So I spent quite a lot of time researching who actually uses it.
Orange Poland (a large telecom provider) is one of the users.
And Cloudweavers/OpenNebula actually built a business around it, selling complete solutions.
Won't it take more than one person a few months a year to manage these servers? That will cost some money; then you have the cost of hosting the data yourself, and then the added huge cost that the business/system you are building is not obviously scalable. In addition, any likely investor will be turned away by a complex home-grown data hosting system. How will you ensure integrity/security on par with Amazon? Your maximum savings per year look like $30,000 or so.
You could save money with a de-duplicated storage system where you store only the unique chunks of data - also see rsync. I don't know how redundant your data is, though.
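For what it's worth, a cheap way to get file-level (not chunk-level) de-duplication across backup snapshots with plain rsync is to hard-link unchanged files against the previous run (paths and dates are placeholders):

    # Each new snapshot stores only changed files; unchanged ones become hard links to the previous snapshot
    rsync -a --delete --link-dest=/backups/2015-06-01/ /data/ /backups/2015-06-02/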
I recommend LizardFS and GfarmFS.
IMHO Ceph is a major disappointment and so is XtreemFS.
I'd like to start a free budget/personal finance site and will need plenty of horsepower and storage. I'm definitely a newbie, so how does one get started in terms of hardware infrastructure? Do I need to get a dedicated IP from my ISP and obtain my own servers? Do I go with Amazon or SQL Server Data Services/Azure or something like that? Are the latter services free, or is a discount offering available for non-profit/free services such as the budget/personal finance site I'm looking to start?
If you don't mind writing your web application in Python, then I'd suggest using Google App Engine. See: What Is Google App Engine?
What I like to do when I have new ideas for a site is to find an inexpensive hosting solution ($10 per month). This allows me to test the idea and see if the site is going to be successful. If it is a flop, I haven't wasted much money and if it is successful I can upgrade to better hosting (dedicated server).
There are many hosting options available and several of them have great tools such as an online SQL Server management studio. Your other option would be to host it yourself if you are prepared to deal with firewall issues, backups, storage, etc.
Whether it is feasible to DIY varies a lot by country...if you have a decent broadband connection with a fixed IP this can be the cheapest route to play around with first, especially if you need an awful lot of storage.
Note however that many fast broadband connections are only fast for downloads - when you're running a server, the speed your users will see is the upload speed, which is usually a lot less. Also, you'll need to do your own admin and backup etc.
Apart from this, most hosting options have a price tag on top, varying from virtual hosts (sharing a real machine), to colocation (your machine in somebody's data center), to cloud services like Amazon et al. (which have good scaling ability) - and you will need to shop around for the software stack and hardware features you really need.
There are really two ways to answer this question; what differentiates them is budget.
One is to properly design this solution, prototype it, benchmark the prototype, extrapolate anticipated user load, add overhead, and scale accordingly. This takes time and costs money, but it gives you a supportable solution that serves your customers well.
The other is to just give something, anything a go and fix the problems as they come along. This is quicker and cheaper but might be a headache for a while and might p*** off your customers.
Basically it comes down to budget.
Best of luck.
For the better part of 10+ years we have relied on various network mapped drives to allow file sharing: one drive letter for sharing files between teams, a separate file share for the entire organization, a third for personal use, etc. I would like to move away from this and am trying to decide whether an ECM/SharePoint-type solution, or a home-grown app, is worth the cost and the way to go, or if we should simply keep relying on login scripts/mapped drives for file sharing due to their relative simplicity. Does anyone have any experience within their own organization, or thoughts on this?
Thanks.
SharePoint is very good at document sharing.
Documents generally follow a process for approval, have permissions, live in clusters... and these things lend themselves well to SharePoint's document libraries.
However, there are some things that don't lend themselves well to living inside SharePoint... do you have a virtual hard drive (.vhd) file that you want to share with a workmate? Not such a good idea to try and put a 20 GB file into SharePoint.
SharePoint can handle large files, and so can the SQL Server behind it... but do you want your SQL Server bandwidth to be saturated by such large files? Do you want your SQL Server backups to hold copies of such large files multiple times?
I believe that there are a few Microsoft partners who offer the ability to disassociate file blobs from the SharePoint database, so that SharePoint can hold the metadata and a file system holds the actual files, and SharePoint simply becomes the gateway to manage access, permissions, and offer a centralised interface to files throughout an organisation. This would offer you the best of both worlds.
Right now though, I consider SharePoint ideal for documents, and I keep large files (that are not document centric) on Windows file shares.
Definitely, use a tool.
The main benefit here is version control: being able to jump easily to a previous version, diff changes, and see who modified what (see most VCSs' blame/annotate tool - it prints out the text file showing when and by whom each line was modified).
Second, you can probably benefit from issue tracking/task tracking.
Other benefits include web access from the internet, having a wiki (which can be great in some situations), etc.
I use Subversion + Redmine at work, and I find it highly useful- test a few solutions and you will surely find out further advantages for you.
One thing that can be overlooked in the change to a document management tool is the planning required around how much is going to be stored, and information architecture issues like where different content is going to end up.
SharePoint in particular is easy to set up without a good plan going forward, and it is particularly vulnerable to difficulties later on when things get too busy.
I would not recommend a home-grown app for something like this. The problem has been solved by off-the-shelf tools, and growing one from scratch is going to cost a huge amount and not get you anywhere near the features for the money.
Did I mention how important planning your security groups and document areas (IA) was?
If you just need document storage then SharePoint can do very well. WSS is even free, and it provides very good document storage capabilities.
But you have to plan carefully, as updating existing applications is painful. If you decide to go with SharePoint, then I can give you a few pieces of advice off the top of my head:
Pay attention to security configuration (user groups, privileges, ...)
Plan your document libraries well, as it is not easy to just move documents between them
Also consider limiting the number of versions that one document can have, because SharePoint stores a full copy of each version, not just the changes
Don't use InfoPath :) we have had very bad experiences with it (just don't tell this to the managers)
If you don't really need to change the graphical look of SharePoint, then don't bother with it, as it brings many problems (I'm talking about custom master pages and custom site templates)
Try to use as much OOB stuff as possible, because developing your own web parts not only costs more, it can also be quite complicated.
Make sure to turn on search indexing. This is quite tricky, because it is turned off by default, and then you will be as surprised as I was that search is not working :)
If you just deploy it and load 10,000 documents into it, then you will surely have problems later. If you give a little thought to the structure, then you will end up with really good document storage.
Migrating is very probably worth the cost in the long term. You will gain reliability, versioning, traceability, and extensibility.
Be sure to first identify the groups/rights, and to identify which links need to be fixed (maybe you have applications that use links to the shares).
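If it helps, a quick way to take stock of the existing groups/rights before migrating is to export the NTFS ACLs from the current shares for review (a sketch; the share path and output file are placeholders):

    # Dump the NTFS ACLs for the whole share tree into a text file (restorable later with /restore)
    icacls "D:\Shares\Departments" /save ntfs-acls.txt /t /c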
An open-source alternative to SharePoint is Alfresco; it is very good with CIFS (Windows shares) too.