I have a lot of documents I want to store in a document library in SharePoint 2010. We're talking about 50k+ documents. I've worked with document libraries many times, but never at this scale, and I find myself getting confused about some of the definitions around how the documents should be stored and how many items are allowed.
By looking here: http://technet.microsoft.com/en-us/library/cc262787%28v=office.14%29.aspx#ListLibrary it says that a document library can hold up to 30 million documents. Nice! 50k is nowhere near 30 million. However, can I just dump all of the documents into a library without grouping them into views or subfolders? Because a view can only display 5k items, so otherwise I would have to create views and spread the documents across many views in order not to exceed this limit.
Now, the documents, and the library, will most likely never be browsed by going to the library. Each document will be linked from another place, and even that will not happen very often. Therefore I am kind of hoping I can just dump all the documents in one big library. I have read that if the number of items in a list exceeds 5k, SharePoint will not allow the query to return everything, but will instead replace it with some default query. In my case this is fine, but are there other concerns about dumping this many files into one library in SharePoint 2010? And is there anything else I may not have thought about?
Also, a quick question at the end: I am planning on scripting the upload using PowerShell, but I have heard from others that uploading documents this way to SharePoint can take a lot of time because it uploads one document at a time. Is it possible to "bulk upload" documents through PowerShell or another approach?
The key here is to understand that SharePoint can STORE up to 30 million documents, but can only display 5,000 at a time. The easiest way to stay within that limit would be to dump the documents into separate folders with no more than 5,000 documents in each. It's easy to do, but I'm not a big fan of folders since they impose a single organizational structure on a set of documents. Applying metadata and then filtering views is more efficient in the long run, but much harder to do when dumping documents into a library. I would suggest looking at some of the third-party migration software that can do this kind of bulk upload and still maintain appropriate metadata. One I've used (there are others) is Metalogix Content Matrix.
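On the PowerShell question: there's no true server-side bulk-upload API in 2010, every file still ends up as a single Add call, but scripting it against the server object model on a farm server is far faster than the UI, and you can spread files across folders as you go. A minimal sketch, with the site URL, library name, and source path as placeholders:

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Placeholders: adjust the URL, library name, and source folder.
$web     = Get-SPWeb "http://sharepoint/sites/docs"
$library = $web.Lists["Documents"]
$files   = Get-ChildItem "C:\Dump" | Where-Object { -not $_.PSIsContainer }

$i = 0
foreach ($file in $files) {
    # Keep each folder comfortably under the 5,000-item view threshold.
    $batch      = [int][math]::Floor($i / 4000)
    $folderName = "Batch{0:D3}" -f $batch
    $folder     = $web.GetFolder($library.RootFolder.Url + "/" + $folderName)
    if (-not $folder.Exists) {
        $folder = $library.RootFolder.SubFolders.Add($folderName)
    }

    $stream = $file.OpenRead()
    try {
        # The $true flag overwrites an existing file of the same name.
        $folder.Files.Add($folder.Url + "/" + $file.Name, $stream, $true) | Out-Null
    }
    finally {
        $stream.Close()
    }
    $i++
}
$web.Dispose()
```

Disabling versioning and any workflows on the library for the duration of the load also helps; that batching of overhead is largely where the third-party tools get their speed.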
I need to export data from multiple document libraries in SharePoint 2007. The data is stored in many thousands of libraries across a few sites. All libraries have identical column definitions and usually contain only a small number of documents each.
I am using the SharePoint web services.
First I call GetListCollection() to get all the lists from the site, and then call GetListItems() for every single document library.
The process of querying each library separately is very slow and time-consuming.
Is there a way to get all data from multiple document lists in a single call?
An SPSiteDataQuery may be of use to you. Have a Google around and see if it would help. The lists would have to be of the same type, though.
CAML would certainly increase performance.
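If you can run code on a farm server rather than going through the web services, a rough PowerShell sketch of the SPSiteDataQuery idea looks like this (ServerTemplate 101 is the document-library template; the URL and field names are placeholders):

```powershell
# MOSS 2007 has no PowerShell snap-in, so load the assembly directly.
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint")

$site = New-Object Microsoft.SharePoint.SPSite("http://sharepoint/sites/docs")
$web  = $site.RootWeb

$query = New-Object Microsoft.SharePoint.SPSiteDataQuery
$query.Webs       = "<Webs Scope='SiteCollection' />"    # every web in the site collection
$query.Lists      = "<Lists ServerTemplate='101' />"     # 101 = document library
$query.ViewFields = "<FieldRef Name='Title' /><FieldRef Name='FileLeafRef' />"
$query.RowLimit   = 100000                               # cap the result set

# One round trip returns a single DataTable covering every matching library.
$table = $web.GetSiteData($query)
$table.Rows | Export-Csv "C:\export.csv" -NoTypeInformation

$web.Dispose(); $site.Dispose()
```

As noted above, the lists need compatible columns; a list missing one of the ViewFields may be skipped entirely unless the FieldRef is marked Nullable='TRUE'.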
I am currently architecting a large SharePoint deployment.
This deployment has the potential to grow to petabytes in size over the course of several years.
One of the current issues we are discussing is the option of storing our data in SharePoint using InfoPath forms. Some of these forms contain hundreds of fields and require a lot of mapping to content types for persistence and search. Our search requirement is primarily a singular identifier and NOT the contents of the forms, although I am told I should preempt the "want" to search in the future.
We require our information to be utilised for secondary purposes (such as reporting etc.). The information MUST be accessible instantly after being persisted to the system.
My core questions therefore are:
1. What are the benefits/risks of this approach compared to storing our data in a single relational store using web-service persistence?
2. If we decided on this approach, what would be the impact of changing the forms and content types over time?
3. What happens when our farm grows beyond a single web application / site collection? How accessible will the information be, will I know where it is, and how portable will it be over time?
1.) Benefits:

- Form templates can be created and deployed (relatively) easily
- You can easily configure field validation
- Probably no code involved

Risks:

- Hitting SharePoint 2010 limits (not as uncommon as you might think)
- Needs careful form design/planning (correct XML structure)
- Information only accessible via the SharePoint object model or web services (very slow)
2.) Well, this is a tough one. Changing the form template and re-deploying it is easy and only takes a few minutes. However, changing the structure (underlying XML) of the template can get you into trouble very easily, because older (filled-out) forms will become invalid. There is an out-of-the-box option to "upgrade" older forms, but in my experience it never worked as it was supposed to.
Content types behave very similarly. Say you want to delete a column from a content type because it's no longer needed: you'll have to remove all references to it, which means removing all items before you can delete the column.
3.) Portability is definitely an issue with InfoPath, because it relies heavily on the corresponding URL structure. You absolutely can add more site collections, but this means you have to deploy your form template to each site collection. Information (filled-out forms) can't easily be shared between site collections, because each form contains the SourceURL (where it came from) and the namespace of the template (which changes constantly once you deploy).
Considering your requirements, I would strongly recommend a relational store instead of InfoPath, simply because InfoPath is not designed to be a data store.
I would use a SQL database to store the data and a custom UI (web part or application page) to perform CRUD operations. This means the information is not actually stored in SharePoint, just displayed (which also means it can't be searched with the built-in SharePoint search). There is also the possibility of using the Business Connectivity Services, which basically does all of the above without you needing to create a custom UI; however, it is very slow with large amounts of data.
If you only need the information inside SharePoint, why not just make all this happen with lists alone?
This is going to be a long one and may not have an answer just because there's no silver bullet for what you're looking for. It's mostly insight and ultimately the choice is up to you.
the option of storing our data in SharePoint using InfoPath Forms
This statement throws me a little. SharePoint data is stored in SharePoint (well, SQL technically) but InfoPath is just a UI layer for accessing any part of that data.
Some of these forms contain 100s of fields and require a lot of mapping to content types for persistence and search
From this I assume there are multiple forms which would mean different types of data being accessed (and probably different purposes). Hundreds of fields is no problem and it really boils down to managing the form and view design.
From the form side you should check out the cxpartners form design crib sheet. This gives you a nice standard to follow to manage all those fields. Another thing would be to look at breaking the form up into tabs or views itself (in InfoPath) based on what the user needs to fill out. Basically, it comes down to not creating a form with hundreds of fields on one massively scrolling screen that the user will just freak out over.
The same goes for the views on the form or document library you're storing the form data in. InfoPath forms are just XML stored in a library (so regardless of how many fields you have, the footprint is pretty minimal). You don't want to map and surface every field in the form, nor do you want a view with 100 columns on it. Look at breaking the views down so they're fit for purpose, with only a few hundred items and a few columns in each. It's a balancing act, as you don't want to create hundreds of views either, so you need to find out what's right. A good BA or information architect, working with the SharePoint/InfoPath guru and the business users, will help with this.
We require our information to be utilised for secondary purposes (such as reporting etc). The information MUST be accessible instantly
This is another requirement that's going to be a little difficult to meet exactly. If the library has thousands of items (or tens of thousands) and a view has dozens of fields, expect the view to come to a crawl (especially if the user is insistent on "seeing everything" and wants the limit of each view set to 1,000 items, as if anyone could process that much information at once). Instant access is difficult if you're keeping everything online for a long time (like for reporting). There's the operational side, where users are filling out forms, finding forms, editing them, etc., and for that you only want a few hundred items to be live at any given moment (up to a few thousand, but you need to be careful with the views). If you have a list with 100,000 items in it and users are using it for daily activities while also trying to run reports for trending or long-term operations against it, you're going to lose the performance battle. Look at doing reporting offline, potentially shipping the reportable data to a second source like SQL and using SSRS against it. PerformancePoint is an option but adds a layer of complexity to the architecture. The question ultimately comes down to what reporting looks like and how important it is in relation to daily operations.
To try to answer your questions directly:
The benefits of using SharePoint over a database are that the data can easily be viewed and sliced and diced. Creating a view is child's play and can quickly show you useful information like the number of sales in a month or customer feedback grouped by call centre person. SharePoint makes it easy to view this information and even set up dashboards, hook in KPIs, etc., without having to get a developer to craft custom web pages. As far as risks go, you need to be careful about letting things grow organically and out of control. Don't let the users design the views of the data; they generally want something but aren't sure what, and will ask for all columns to be available, which they'll just export to Excel to slice and dice. Make sure there's a good design around the views and lists, that they're fit for purpose, and that they meet what the user is trying to get out of the data. Ask what they're looking for and why; that will help shape what to expose.
Any change needs to be thought out, planned, and tested. Adding a column to a SharePoint list is no different from adding a column to a SQL database. Form updates should be considered, and while you won't get it 100% right the first time, you should try to get as much as possible without going overboard and putting in crazy things like 100 "blank" fields that are players to be named later. Strike a balance by understanding the needs of the users and the company and where things are going. Hopefully someone will have a vision of what this thing might be when it grows up, and that'll go a long way towards understanding the impact of change.
Data is just XML, and as long as you're not doing stupid stuff in the form like hard-coding absolute paths to services (use data connection libraries), the impact of growth will be minimal. Growing beyond one web application into multiple ones is a pretty big change and not something to be taken lightly. Even splitting site collections out is big, and there needs to be a really good reason for it. Site collections can handle thousands of sites and millions of documents without issue. Web applications are really there for dividing up areas of interest or separation of purpose (like team sites on one web app and a publishing portal on another), not for splitting data due to growth concerns.
Like I said, there's no silver bullet here and what you're asking for is an architecture for a solution that nobody here has all the requirements for. Hope this helps.
I have a question. We have to handle a huge volume of data, around one billion records, that should be loaded into a SharePoint list. After loading, users need to be able to edit and update the records.
Please suggest the best solutions for editing/updating one billion records in a SharePoint list.
1,000,000,000 records? Good luck with that! You may want to revisit the decision to use SharePoint lists as the underlying data store and look at a database instead, with some very well-constructed indexes.
In case you're a masochist though:
Working with Large Lists in Office SharePoint Server 2007
If it's 2010, BCS could be used, and it's available in all versions of SharePoint. I completely agree with all the others, though: you need to rethink your approach; it sounds like you need to leverage a proper DB. If you do need to surface the data in SharePoint, BCS and external content types in 2010 are a godsend.
A SharePoint list is the wrong storage choice here. A quick quote from the Working with Large Lists in Office SharePoint Server 2007 (bottom of the first paragraph under "Test results and findings"):
The maximum number of items supported in a list with recursive folders is 5 million items.
Perhaps this list is more logically several lists dumped into one?
If you need to expose data through SharePoint, use a regular DB and maybe read about SharePoint Business Data Catalog?
I'm writing a solution around MOSS 2007 and storing fairly large quantities of data in a list.
My first question is: can lists handle large quantities of data, around 200,000 items? I've already read up about it, and it seems the limitation of lists is on the number of items the views can display (2,000). So the question is: is this a recommendation or a real limitation? No documentation actually confirms this.
My second question: if it's a physical limitation on how many items the view can display, does this mean it's impossible to check for duplicates in a SharePoint list that contains vast quantities of data?
In the sense that to perform a wsList.getListItems call you have to pass a view, and if the list contains 100,000 records and the view can only contain 2,000 records, how is it possible to check for duplicates?
Huge list performance
You may want to read "Scaling to Extremely Large Lists and Performant Access Methods" and "Best Practices for LARGE SharePoint Lists and Documents Libraries".
Another thing this article does not mention is adding list items with SPList.Items.Add: on a large list it's a huge performance penalty. What you do instead is build an efficient query that returns no items and then add the item to that collection (somewhere I read that the web services perform well for adding items, but I can't find that article any more).
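The trick looks roughly like this in PowerShell (the site URL and list name are placeholders):

```powershell
# MOSS 2007 has no PowerShell snap-in, so load the assembly directly
# and run this on a farm server.
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint")
$site = New-Object Microsoft.SharePoint.SPSite("http://sharepoint")
$web  = $site.OpenWeb()
$list = $web.Lists["Huge List"]

# A query that can never match (no item has ID 0), so SharePoint doesn't
# materialise the whole item collection just to add one item.
$query = New-Object Microsoft.SharePoint.SPQuery
$query.Query = "<Where><Eq><FieldRef Name='ID'/><Value Type='Counter'>0</Value></Eq></Where>"

$emptyItems = $list.GetItems($query)
$item = $emptyItems.Add()          # cheap compared with $list.Items.Add()
$item["Title"] = "New item"
$item.Update()

$web.Dispose(); $site.Dispose()
```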
You can also see some tests (or other tests) on how huge lists perform.
As for duplicates
You may want to create a scheduled job (SPJobDefinition) that runs at night and checks for duplicates.
A better idea than looping over all SPListItems and querying the list for each item to check for duplicates would probably be to get a DataTable (SPListItemCollection.GetDataTable()) of all the items and use some technique on it to determine duplicates.
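A rough sketch of that approach, paging through the list and grouping on one key column (Title here is just an example key; use the same $list as in the sketch above):

```powershell
$query = New-Object Microsoft.SharePoint.SPQuery
$query.ViewFields = "<FieldRef Name='ID'/><FieldRef Name='Title'/>"
$query.RowLimit   = 2000              # stay within the view guidance per page

$rows = @()
do {
    $items = $list.GetItems($query)
    if ($items.Count -gt 0) {
        $rows += $items.GetDataTable().Rows
    }
    # Carry the paging cookie forward; $null means the last page was read.
    $query.ListItemCollectionPosition = $items.ListItemCollectionPosition
} while ($query.ListItemCollectionPosition -ne $null)

# Any group with more than one row shares the key, i.e. is a duplicate.
$rows | Group-Object Title |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { "'{0}' occurs {1} times" -f $_.Name, $_.Count }
```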
As for views
Filter the items, order them so the most relevant appear first, and define your RowLimit. That's the key to views: you just need the most relevant items, don't you?
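Such a view can also be created programmatically; a sketch against the same $list (the view name, the Status filter field, and its value are assumptions for illustration):

```powershell
$viewFields = New-Object System.Collections.Specialized.StringCollection
[void]$viewFields.Add("Title")
[void]$viewFields.Add("Modified")

# For a view, Where and OrderBy both live in the query string.
$viewQuery = "<Where><Eq><FieldRef Name='Status'/><Value Type='Text'>Active</Value></Eq></Where>" +
             "<OrderBy><FieldRef Name='Modified' Ascending='FALSE'/></OrderBy>"

# 100 rows per page, paged, not set as the default view.
$view = $list.Views.Add("Active items", $viewFields, $viewQuery, 100, $true, $false)
```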
You can have very large lists, but the performance is going to SUCK.
We had lists with 50,000+ items in a project and we found the best way we could query and process the contents was using SPSiteDataQuery and CrossListQueryCache and formatting the queries in the obscure, annoying SharePoint CAML dialect.
If possible, breaking the items up into containers like folders will help with performance. If one of the list item fields is some type of classification lookup, it could be replaced by putting items into folders named for that classification.
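A rough sketch of the folder idea (the classification values and list name are placeholders): folders are just list items of a special type, and new items can be filed straight into them.

```powershell
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint")
$site = New-Object Microsoft.SharePoint.SPSite("http://sharepoint")
$web  = $site.OpenWeb()
$list = $web.Lists["Huge List"]

# Create one folder per classification value.
foreach ($name in "Invoices", "Contracts", "Reports") {
    $folder = $list.Items.Add($list.RootFolder.ServerRelativeUrl,
                              [Microsoft.SharePoint.SPFileSystemObjectType]::Folder,
                              $name)
    $folder.Update()   # throws if a folder of that name already exists
}

# New items go straight into the right folder by passing its URL to Add().
$item = $list.Items.Add($list.RootFolder.ServerRelativeUrl + "/Invoices",
                        [Microsoft.SharePoint.SPFileSystemObjectType]::File,
                        $null)
$item["Title"] = "New invoice record"
$item.Update()

$web.Dispose(); $site.Dispose()
```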
I have only been working with SharePoint for three months, but right from the start I was told that the SharePoint content DB was off limits, as MS could change the schema at any time. The recommended route is to use the object model, and for the most part I understand that.
Now I need to join some lists in order to present the content grouped by some specific fields. Rather than iterating through each and every list, I would prefer to link our own DB, which resides on the same database server, to the WSS content DB and just create a view over the tables. This view would live in our DB, to make sure we don't change ANYTHING in the WSS content DB.
Am I on the route to eternal damnation or not?
Yes, you are. Microsoft is very clear that any modification to the SharePoint tables renders your environment unsupported.
Direct modification of the SharePoint database or its data is not recommended because it puts the environment in an unsupported state.
Now, creating a link on your own DB which queries the SharePoint DB is shaky ground. Personally I'd do one of two things:
If this is a mission-critical application, run it past MSFT support.
If it is anything else, just make sure that your view is not locking the DB during querying.
A better strategy might be to iterate the lists and sync it to your own table so you can do whatever kind of data-mining you'd like - if you don't mind whatever lag time your sync routine would need.
SharePoint pretty much relies on total "ownership" of the underlying database.
Small things like another process reading from the SharePoint database could potentially slow down SharePoint's operations in unexpected ways.
As SharePoint does not usually update in a "real time" manner, it should be good enough to create a process that queries the SharePoint lists and adds the data to a table in your own application.
Schedule the crawl for a low-activity period and voilà: a solution that is not going to risk unexpected slowdowns to SharePoint.
Start your search on querying SharePoint at SPQuery.
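A crawl of that sort can stay quite small; here is a sketch using SPQuery plus SqlBulkCopy (the connection string, list, columns, and destination table are all placeholders, and the destination table is assumed to already exist with matching columns):

```powershell
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint")

$site = New-Object Microsoft.SharePoint.SPSite("http://sharepoint")
$web  = $site.OpenWeb()
$list = $web.Lists["Orders"]

# Pull only the columns the reporting table needs.
$query = New-Object Microsoft.SharePoint.SPQuery
$query.ViewFields = "<FieldRef Name='Title'/><FieldRef Name='Modified'/>"
$table = $list.GetItems($query).GetDataTable()   # null if the list is empty

# Bulk-copy the snapshot into your own reporting database.
$bulk = New-Object System.Data.SqlClient.SqlBulkCopy("Server=.;Database=Reporting;Integrated Security=SSPI")
$bulk.DestinationTableName = "dbo.OrdersSnapshot"
[void]$bulk.ColumnMappings.Add("Title", "Title")
[void]$bulk.ColumnMappings.Add("Modified", "Modified")
$bulk.WriteToServer($table)
$bulk.Close()

$web.Dispose(); $site.Dispose()
```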
Check out SLAM, the SharePoint List Association Manager. It allows you to easily push your SharePoint data to SQL, including complex joins (one-to-one, one-to-many, many-to-many), and it keeps the data synchronized in real time.
http://slam.codeplex.com
Well, if the joins you need to do are pretty simple, defining a linked data source in SharePoint Designer may work for you.