how to do sharepoint database disk usage analysis and selective replication?

I have a SharePoint 2007 database that is 16GB in size and I want to know why, and how I can reduce the size. Ideally I would like a trimmed replica to use as a developer workstation that retains a good sample data set, and has the ability to be refreshed.
Can you please tell me if there are any third party tools or other methods to accomplish this? I have found the Microsoft tool (stsadm) to be very limited in this regard.
Many thanks.

You can start with the Storage Space Allocation page, available in every site collection: http://server/_layouts/storman.aspx
It can tell you which lists are big, and so on.
Recycle bins are also good candidates for trimming a database.
I regularly make backups of every site collection and just inspect the ones that get too big. It's always something; large PPTs or loads of images, etc.
Ultimately, SQL Server will not automatically shrink your database, so if you delete stuff the file size on disk will not decrease; shrinking the files is a separate SQL Server admin task.
16 GB is not really that big. You can just back it up, restore it in your dev environment, and then delete some unneeded sites or site collections from it to make it smaller.
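The "back up every site collection and inspect the big ones" habit described above can be partly automated. Here is a minimal Python sketch, assuming the backups are plain files in a single folder; the function name `oversized_backups` and the threshold are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def oversized_backups(backup_dir, limit_bytes):
    """Return (file name, size in bytes) for backups above limit_bytes, biggest first."""
    sizes = [
        (path.name, path.stat().st_size)
        for path in Path(backup_dir).iterdir()
        if path.is_file()
    ]
    big = [item for item in sizes if item[1] > limit_bytes]
    # Sort by size so the worst offenders come first.
    return sorted(big, key=lambda item: item[1], reverse=True)
```

Calling something like `oversized_backups("D:/backups", 2 * 1024**3)` (a hypothetical path) would list the backups over 2 GB, which are the site collections worth opening in the Storage Space Allocation page first.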

Related

Creating a file server in Azure

Our company has an on-prem file server that I'd like to move to the cloud. I followed these directions and was successfully able to map a drive on my local work computer to connect to an Azure File Share. Our company has about 20 locations, ~5 TB of data (mostly "office" type of files) in total, and about 500 users accessing them.
There are two issues I would like to improve but I'm not sure how:
There's somewhat of a lag when opening files. Other than increasing our office's internet speed, is there anything to be done to make it faster? Would some kind of site-to-site VPN help? Would adding some type of server or VM in the "middle" (maybe one per location?) that would perhaps somehow cache the files reduce the lag?
Also, we have and use an Office 365 subscription. What's the easiest way to use our existing AD structure to transfer over the NTFS permissions that are currently in place?
I Googled around and found a bunch of companies advertising their services, notable among them was Talon Storage. But it seems like something that could be done without hiring a company. What I'm hoping for is a DIY direction to optimally solve these issues. Perhaps there's a standard or commonly recommended solution for such issues. Any guidance would be greatly appreciated.

L-A-T-E-N-C-Y. The number one enemy of any cloud-based file server attempt. It ranges from annoying to downright unusable, depending on how far you are from the Azure datacenter of choice.
Imagine a poor soul trying to "stream" a large 20-meg Excel file with 20 references to external files. What used to take maybe 8 seconds on-prem will now take 40 in the cloud (on a good day). It's game over for productivity. Your marketing department that sometimes used to cut video in iMovie over the network? Those days are over.
I understand this is not the answer you were after, but it's the crude reality.
Do not panic, there are solutions. Here's a good one: https://azure.microsoft.com/en-us/services/storsimple/
I'm sure you wanted to get rid of boxes, not buy more, but it is what it is.
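To see why latency rather than bandwidth tends to dominate, here is a rough back-of-the-envelope sketch in Python. It assumes that opening a file costs a number of sequential network round trips plus the raw transfer time; the function name and every number in the usage note are illustrative assumptions, not measurements.

```python
def estimated_open_seconds(round_trips, rtt_ms, size_mb, bandwidth_mbps):
    """Very rough model: sequential round trips plus raw transfer time."""
    waiting = round_trips * rtt_ms / 1000.0    # time spent waiting on the network
    transfer = size_mb * 8.0 / bandwidth_mbps  # megabytes -> megabits, then seconds
    return waiting + transfer
```

With a hypothetical 2,000 round trips for a 20 MB file, a 1 ms LAN round trip on a 1,000 Mbps link gives roughly 2 seconds, while a 20 ms round trip to a datacenter on a 100 Mbps link gives roughly 40 seconds. In this model, most of the cloud figure comes from waiting on round trips, so a faster internet line alone would not remove it.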

Why backup azure storage account if it's locally or geo redundant?

So I've come across this AzCopy tool, and multiple tutorials that say it's good for backing up my storage blobs and whatnot.
Isn't Azure Storage automatically backed up? Isn't that what locally redundant means?
I just want to make sure I'm not missing something and putting my application in jeopardy by not running some external backup.

Redundancy is different from back-ups. Redundancy means that all your changes are replicated to another location. In case of a failover your slave can theoretically function as a master and serve the (hopefully) latest state of your file system. However, the fact that everything is replicated also means that your accidental delete actions, file corruptions, etc. are replicated. Back-ups are meant to prevent this. In case you accidentally mess something up and perform some delete requests, you still have the back-ups and you can usually go back to any point in time (if you made a backup at that time of course).
And of course it's not a bad idea not to be fully dependent on Azure.
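The difference can be shown with a toy Python model. This is only a sketch: plain dictionaries stand in for storage, and `replicate` and `take_backup` are hypothetical helpers, not Azure APIs.

```python
def replicate(primary):
    """A replica mirrors the primary's current state, deletions included."""
    return dict(primary)


def take_backup(primary, backups, label):
    """A backup is a point-in-time copy that later changes do not touch."""
    backups[label] = dict(primary)


primary = {"report.docx": "v1", "budget.xlsx": "v1"}
backups = {}

take_backup(primary, backups, "monday")  # point-in-time copy
del primary["budget.xlsx"]               # accidental delete on the primary
replica = replicate(primary)             # the delete is faithfully replicated

print("budget.xlsx" in replica)            # False
print("budget.xlsx" in backups["monday"])  # True
```

The replica protects against losing the primary's hardware, but only the Monday backup still contains the deleted file.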

The most important thing about any backup policy is that before you create it, you decide what you are protecting against and what sort of data you are backing up.
If the data you are backing up is an offsite backup of working data, and access to that data is restricted to admin personnel who all know what the data is, then replication could well be all you need to protect against a hardware failure on Azure.
If, however, you are backing up customer data, or file data that Fred in Accounts randomly deletes when he falls asleep at the keyboard, then you have a different threat model and you should plan your backups accordingly.
Where you back it up is very much a matter of personal requirements and philosophy. I have known customers who keep backups on both Azure and AWS (even though their only compute workload was on Azure). If your threat model includes protecting against MS going bust and selling all of their kit on eBay one morning, then it makes sense to back up elsewhere. Or you can decide that you trust Azure not to go bust and just split data across multiple regions.
TL;DR
Understand what you are protecting your data from, and design your backup policy from that.

Good distributed general purpose filesystem in my case?

I've been researching the idea of using a distributed file system along with my dedicated servers instead of going with Amazon S3, and the results are nothing but massive headaches!
My project has the following characteristics/requirements:
User files are stored in dedicated servers. Each file is stored in 2 separate machines, located in different data centers (150-200 miles away from each other)
I'm using Amazon RDS to host the associated MySQL database (*). It's fairly compact (it only holds IDs and file metadata)
Files/data is around 50TB. Naturally, data does change and will definitely grow with time
My question is: is there a good general-purpose, distributed, parallel, fault-tolerant file system that has the following characteristics:
Stable & reasonably fast (upload/download)
Fairly easy to setup & maintain
Handles data storage so that I only have to care about removing/adding servers if the need arises (i.e. add new servers to the filesystem's server pool by editing a simple config, or something like that)
I've read about OpenStack, GlusterFS, MogileFS, XtreemFS, etc...but the more I read, the more I get confused!
(*) Yes, I realize the contradiction. Cost-wise it does make sense to host the database on RDS. But storing (up to) 50TB of users files on amazon is way too expensive compared to using dedicated servers (provided it's good enough).
PS. My app isn't live yet, so I'm open to suggestions if someone has a good idea that fits my case well.
EDIT: I'm not trying to make an S3 clone. I just need to use an existing hosting infrastructure to build a small-scale cloud solution; my question is about finding the right distributed file system to handle/automate this.

We recently switched from an expensive storage solution to the open-source LizardFS for our distributed storage. It is quite simple to set up and scale once you understand the basic concept.
Check out https://docs.lizardfs.com/introduction.html#architecture for a quick overview, but forget about shadow masters and metaloggers for now. What you need to know is that there are:
A master, which regulates the traffic (make sure it has enough CPU).
Chunkservers, which actually store the data. Use any kind of off-the-shelf hardware with a bunch of hard disks attached.
Clients, which are just simple mount points. So you can get a giant 50TB mount if you want. The master tells the client where to find or store the files; the actual data is transferred directly between the client and the chunkservers.
You can add as many chunkservers as you want; the master will automatically try to balance your storage usage across them. Adding storage is a matter of adding hard drives or adding servers. They don't have to be actual bare-metal machines, but that is probably the cheapest option.
There are 2 amazing features in LizardFS that allow georeplication:
Goals (see https://docs.lizardfs.com/adminguide/replication.html#standard-goals): how important are files to you? You can define, at the file or folder level, how many times a file needs to be replicated. Do you want 2 copies? 3? 10? You could define a goal of 2 copies for old files that are simply there for archiving purposes, and a goal of 4 copies on SSD drives for all new files.
Those same goals can also be used to do georeplication. You define that your data has to be stored in at least two different locations by labeling your chunkservers accordingly (e.g. DC1 and DC2).
Rack awareness (see https://docs.lizardfs.com/adminguide/advanced_configuration.html#configuring-rack-awareness-network-topology): you basically define IP ranges to teach the system what your network looks like. This way, clients will try to fetch files from the closest server.
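The idea behind label-based goals can be illustrated with a toy Python model. This is not LizardFS configuration syntax or its actual placement algorithm, just a sketch of "one copy per required label"; `place_copies` and the server names are hypothetical.

```python
def place_copies(chunkservers, required_labels):
    """Pick one distinct chunkserver for each required label.

    chunkservers maps a server name to its label (e.g. a data center).
    """
    placement = []
    used = set()
    for label in required_labels:
        candidates = [
            name
            for name, server_label in sorted(chunkservers.items())
            if server_label == label and name not in used
        ]
        if not candidates:
            raise ValueError(f"no free chunkserver labeled {label!r}")
        placement.append(candidates[0])
        used.add(candidates[0])
    return placement


servers = {"cs1": "DC1", "cs2": "DC1", "cs3": "DC2"}
print(place_copies(servers, ["DC1", "DC2"]))  # ['cs1', 'cs3']
```

A goal of "one copy in DC1 and one in DC2" therefore always spans both data centers, whereas a plain "2 copies" goal could end up with both copies in the same location.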
The ease of setting it up is what sold LizardFS for me. I've heard very good things about Ceph, but setting it up is another matter...
What worried me at first was how proven the technology is. So I spent quite a lot of time researching who uses it.
Orange Poland (a large telecom provider) is one of the users.
And Cloudweavers/OpenNebula actually built a business around it, selling complete solutions.

Won't it take more than one person a few months a year to manage these servers? That will cost some $, then you have the cost of hosting the data yourself, then you have the added huge cost that the business / system you are building is not obviously scalable? In addition any likely investor will be turned away by a complex home grown data hosting system. How will you ensure integrity/security on par with Amazon? Your max savings per year look like $30,000 or so.
You could save money by doing a de-duplicated storage system where you just store all the unique chunks of data - also see rsync. Don't know how redundant your data is though.
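A minimal Python sketch of the de-duplication idea: each unique chunk is stored once under its hash, and a file becomes a list of chunk hashes. The `ChunkStore` class is hypothetical, and it uses fixed-size chunks for simplicity, which is not the rolling-checksum approach rsync uses.

```python
import hashlib


class ChunkStore:
    """Stores each unique chunk once; a file is a 'recipe' of chunk digests."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes

    def put(self, data):
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store only unseen chunks
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        return b"".join(self.chunks[digest] for digest in recipe)
```

How much this saves depends entirely on how much repeated content the 50TB of user files actually contains, which is worth measuring on a sample before committing to the design.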

I recommend LizardFS and GfarmFS.
IMHO Ceph is a major disappointment and so is XtreemFS.

Sharepoint as a high volume information system

I'm looking at designing some core information systems at a new company I'm working at (described one of my ideas here Workflow system)
I've thought a bit more, and am strongly considering using SharePoint for a lot of the heavy lifting, seeing as it comes with so much out of the box.
However, I'm not sure how it will handle the high volume of data we'll be throwing at it. I read the MS whitepaper (http://go.microsoft.com/fwlink/?LinkId=95450&clcid=0x409), and it says about 2000 items in a list is about the limit using traditional design methods.
But first, a bit of info on my plan and data structures:
We have multiple clients. Each client has multiple applications. Each application will have multiple, ongoing jobs (or process runs).
Each application will store significant correspondence and documentation. Each job represents the processing of a data file on a single run, and stores information about the job such as the postscript file, postal manifests, etc.
Job volume will be about 50 - 100 a day. Each job will have a workflow, triggered by external programs. Then, say on a "job scheduler" page, production staff can schedule the jobs and perform custom actions on the job (written as plugins).
I was thinking the jobs would sit outside and be accessed via the BDC, but I would still like them represented in SharePoint lists, to add SharePoint functionality and reporting, and they'd be accessible in multiple places
e.g.
Application portal - see jobs for application
Production scheduler - see lists of upcoming jobs, assign to resources, trigger other functionality (e.g. copy print file to printer, produce mailing machine file)
Invoicing view - view completed but uninvoiced jobs, export to accounting package
Client view - client portal displays jobs, invoices, stock levels (from external warehouse system), documentation, change register / helpdesk
So basic info about the job would sit in the BDC, but then SharePoint would capture additional metadata about each job. Also, down the line we might put in more advanced workflows using WF or something like K2 blackpoint / blackpearl.
Is this feasible? Any resources you'd recommend to read to get up to speed?

To use SharePoint, you should concentrate on what SharePoint is good at and what it is designed for.
SharePoint is a great collaboration portal; it is not so good as a simple high-volume database. So...
You can set up a small site for each client and subsites for each job. The goal of the "job site" is to display (using a web part, perhaps) the relevant upcoming jobs, a list of job errors/exceptions, and relevant team documentation on each job.
Separate sites can be created to give a particular "view" of the jobs. E.g. an "Invoicing" site can be created to give a view, again from BDC web parts, of what requires invoicing.
https://iwsolve.partners.extranet.microsoft.com/SDPS/ may provide some help.
Don't try to store huge amounts of information in a SharePoint list just because it may be possible to "tag" it with metadata. A database table is perfectly able to include columns supplying additional information if required.
Think about it this way. If you are creating 50-100 jobs a day, putting that data into a list presupposes that your site's users are going to want to enter metadata on those jobs manually. I thought not, so create the systems you need to get the metadata stored correctly at source, or store metadata about the "types" of jobs in a SharePoint list and let SharePoint match the job type with jobs in the BDC.
SharePoint will help you to integrate all your systems' information together, but unfortunately it looks like you have a lot of work to do just planning what information should go where and how each type of user will view it.

Please take a look at this blog post I wrote on managing large SharePoint lists for better performance. It might offer a bit of an explanation for the 2,000-item issue, which is not actually a hard limit on the number of items in a list: SharePoint will support up to 5 million items per list. One way around this would be to create and maintain different views that filter by an indexed field to show you different items, up to 2,000 at a time. Hope that helps.
Dina Ayoub
Program Manager
Windows SharePoint Services

SharePoint is probably quite a good fit for the UI side of things, though you'll need to think carefully about which parts are stored and modified in SharePoint lists and which parts are stored elsewhere. That's not so much a SharePoint issue as something you always have to deal with when you have multiple data sources.
I'd probably use a SharePoint list as the primary store for jobs, to avoid any sync issues and make editing easier. The volume of data shouldn't be an issue - just make sure you aren't trying to display 2000 items at once - it's the view, not the list itself that runs into performance issues on large numbers of items.
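The "filter first, then show a bounded page" idea can be sketched in plain Python. This is not the SharePoint object model; `view_page` and the field names are hypothetical, and a real view would do the filtering in the database on an indexed column rather than in memory.

```python
def view_page(items, field, value, page, page_size=100):
    """Return one page of the items whose `field` equals `value`."""
    matching = [item for item in items if item[field] == value]
    start = page * page_size
    return matching[start:start + page_size]


jobs = [{"id": i, "status": "open" if i % 2 == 0 else "done"} for i in range(10)]
print([job["id"] for job in view_page(jobs, "status", "open", page=0, page_size=3)])  # [0, 2, 4]
```

However large the list grows, each view only ever renders `page_size` items, which is the point the answers above make about views versus lists.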

Tough question Dane... I would like to know a little more about your design / vision before giving an opinion.
Based on what I read in your question I would not use SharePoint 2007 as a development platform for this application.
1) Development experience in SharePoint 2007 can be painful and unproductive at times.
Hard to debug
Steep learning curve
2) Easy to get in trouble with performance
Data Layer is complex and can require expert SQL / SharePoint Admin skills to make platform scale.
Content databases should not exceed 100 GB.
3) Deployment can be extremely difficult depending on what you are doing.
4) New version will be released in the next 12 months.
Just my .02.

network drive file sharing

For the better part of 10+ years we have relied on various mapped network drives for file sharing: one drive letter for sharing files between teams, a separate file share for the entire organization, a third for personal use, etc. I would like to move away from this and am trying to decide whether an ECM/SharePoint-type solution, or a home-grown app, is worth the cost and the way to go. Or should we simply keep relying on login scripts and mapped drives for file sharing because of their relative simplicity? Does anyone have any experience with this in their own organization, or any thoughts on it?
Thanks.

SharePoint is very good at document sharing.
Documents generally follow a process for approval, have permissions, live in clusters... and these things lend themselves well to SharePoint's document libraries.
However, there are some things that don't lend themselves well to living inside SharePoint... Do you have a virtual hard drive (.vhd) file that you want to share with a workmate? It's not such a good idea to try to put a 20GB file into SharePoint.
SharePoint can handle large files, and so can SQL Server behind it... but do you want your SQL Server bandwidth being saturated by such large files? Do you want your backup of SQL Server to hold copies of such large files multiple times?
I believe that there are a few Microsoft partners who offer the ability to disassociate file blobs from the SharePoint database, so that SharePoint can hold the metadata and a file system holds the actual files, and SharePoint simply becomes the gateway to manage access, permissions, and offer a centralised interface to files throughout an organisation. This would offer you the best of both worlds.
Right now though, I consider SharePoint ideal for documents, and I keep large files (that are not document centric) on Windows file shares.

Definitely use a tool.
The main benefit here is version control: being able to jump easily to a previous version, diffing, and seeing who modified what (see the blame/annotate tool in most VCSs, which prints out a text file showing who modified each line and when).
Second, you can probably benefit from issue tracking/task tracking.
Other benefits include web access from the internet, having a wiki (which can be great in some situations), etc.
I use Subversion + Redmine at work, and I find it highly useful. Test a few solutions and you will surely discover further advantages for yourself.

One thing that can be overlooked in the change to a document management tool is the planning required around how much is going to be stored, and information architecture issues such as where different content is going to end up.
SharePoint in particular is easy to set up without a good plan going forward, and is particularly vulnerable to difficulties later on when things get too busy.
I would not recommend a home-grown app for something like this. The problem has been solved by off-the-shelf tools, and growing one from scratch is going to cost a huge amount and not get you anywhere near the features for the money.
Did I mention how important planning your security groups and document areas (IA) was?

If you just need document storage, then SharePoint can do very well. WSS is even free, and it provides very good document storage capabilities.
But you have to plan carefully, as updating existing applications is painful. If you decide to go with SharePoint, I can give you a few pieces of advice off the top of my head:
Pay attention to security configuration (user groups, privileges, ...)
Plan your document libraries well, as it is not easy to just move documents between them
Also consider limiting the number of versions that one document can have, because SharePoint stores a full copy of each version, not just the changes
Don't use InfoPath :) We have had very bad experiences with it (just don't tell the managers)
If you don't really need to change the graphical look of SharePoint, then don't bother with it, as it brings many problems (I'm talking about custom master pages and custom site templates)
Try to use as much OOB (out-of-the-box) stuff as possible, because developing your own web parts not only costs more, but can also be quite complicated.
Make sure to turn on search indexing. This is quite tricky, because it is turned off by default, and then you will be as surprised as I was that search is not working :)
If you just deploy it and load 10,000 documents into it, you will surely have problems with it later. If you give a little thought to the structure, you will end up with really good document storage.
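The point about version limits is easy to quantify. Here is a small Python sketch, assuming every version is stored as a full copy as described in the list above; the function name and the sizes in the usage note are made up for illustration.

```python
def versioned_storage_mb(doc_sizes_mb, versions_per_doc, version_limit=None):
    """Total storage if every kept version is a full copy of the document."""
    kept = versions_per_doc if version_limit is None else min(versions_per_doc, version_limit)
    return sum(size * kept for size in doc_sizes_mb)
```

For two hypothetical documents of 2 MB and 5 MB with 10 versions each, `versioned_storage_mb([2, 5], 10)` gives 70 MB, while capping at 3 versions with `versioned_storage_mb([2, 5], 10, version_limit=3)` gives 21 MB.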

Migrating is very probably worth the cost in the long term. You will gain reliability, versioning, traceability, and extensibility.
Be sure to first identify the groups/rights, and to identify which links need to be fixed (maybe you have applications that use links to the shares).
An open-source alternative to SharePoint is Alfresco; it is very good for CIFS (Windows shares) too.
