SharePoint InfoPath best practices for persisting large forms

I am currently architecting a large SharePoint deployment.
This deployment has the potential to grow to petabytes in size over the course of several years.
One of the current issues we are discussing is the option of storing our data in SharePoint using InfoPath Forms. Some of these forms contain hundreds of fields and require a lot of mapping to content types for persistence and search. Our search requirement is primarily a singular identifier and NOT the contents of the forms, although I am told I should preempt the "want" to search in the future.
We require our information to be utilised for secondary purposes (such as reporting etc). The information MUST be accessible instantly after persisting to the system.
My core questions therefore are:
What are the benefits/risks of this approach compared to storing our data in a singular relational store using web-service persistence?
If we decided on this approach, what would be the impact of changing the forms and content types over time?
What happens when our farm grows beyond a single web application / site collection? How accessible will the information be?
Will I know where it is, and how portable will the information be over time?

1.)
Benefits:
- Form templates can be created and deployed (relatively) easily
- Field validation is easy to configure
- Probably no code involved
Risks:
- Hitting SharePoint 2010 limits (not as uncommon as you might think)
- Needs careful form design/planning (a correct XML structure)
- Information is only accessible via the SharePoint object model or web services (very slow)
2.) Well, this is a tough one. Changing the form template and re-deploying it is easy and only takes a few minutes. However, changing the structure (the underlying XML) of the template can get you in trouble very easily, because older (filled out) forms will become invalid - there is an out-of-the-box option to "upgrade" older forms, but in my experience it never worked as it was supposed to.
Content types behave very similarly: say you want to delete a column from a content type because it's no longer needed - you'll have to remove all references to it, which means removing all items before you can delete the column.
3.) Portability is definitely an issue with InfoPath, because it relies heavily on the corresponding URL structure. You absolutely can add more site collections, but this means you have to deploy your form template to each site collection. Information (filled out forms) can't easily be shared between site collections, because each form contains the SourceURL (where it came from) and the namespace of the template (which changes constantly as you re-deploy).
Considering your requirements, I would strongly recommend a relational store instead of InfoPath - simply because InfoPath is not designed to be a data store.
I would use a SQL database to store the data and a custom UI (a WebPart or application page) to perform CRUD operations. This means that the information is not actually stored in SharePoint - just displayed (which also means it can't be searched with the built-in SharePoint Search). There is also the possibility of using Business Connectivity Services (which basically does all of the above without you needing to create a custom UI - however, it is very slow with large amounts of data).
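To make the relational option concrete, here is a minimal sketch (not a definitive implementation) of the kind of data-access code a custom WebPart or application page could call to persist a form straight to SQL. The connection-string name, table and column names (FormsDb, FormSubmissions, Identifier, PayloadXml) are illustrative assumptions only:

```csharp
// Minimal sketch: persisting one form record to SQL from a web part or
// application page code-behind. Table and column names are placeholders.
using System;
using System.Configuration;
using System.Data.SqlClient;

public static class FormRepository
{
    private static string ConnectionString =>
        ConfigurationManager.ConnectionStrings["FormsDb"].ConnectionString;

    public static void Save(string identifier, string payloadXml)
    {
        using (var conn = new SqlConnection(ConnectionString))
        using (var cmd = new SqlCommand(
            "INSERT INTO FormSubmissions (Identifier, PayloadXml, CreatedUtc) " +
            "VALUES (@id, @xml, @created)", conn))
        {
            cmd.Parameters.AddWithValue("@id", identifier);
            cmd.Parameters.AddWithValue("@xml", payloadXml);
            cmd.Parameters.AddWithValue("@created", DateTime.UtcNow);
            conn.Open();
            cmd.ExecuteNonQuery();   // the record is queryable as soon as this returns
        }
    }
}
```

Because the write commits directly to SQL, the record is available to reporting queries the moment the INSERT returns, which is what the "accessible instantly" requirement asks for.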
If you do need the information just in SharePoint, why not just make all this happen with Lists only?

This is going to be a long one and may not have an answer just because there's no silver bullet for what you're looking for. It's mostly insight and ultimately the choice is up to you.
the option of storing our data in SharePoint using InfoPath Forms
This statement throws me a little. SharePoint data is stored in SharePoint (well, SQL technically) but InfoPath is just a UI layer for accessing any part of that data.
Some of these forms contain 100s of fields and require a lot of mapping to content types for persistence and search
From this I assume there are multiple forms which would mean different types of data being accessed (and probably different purposes). Hundreds of fields is no problem and it really boils down to managing the form and view design.
From the form side you should check out the cxpartners form design crib sheet. This gives you a nice standard to follow to manage all those fields. Another thing would be to look at breaking the form itself up into tabs or views (in InfoPath) based on what the user needs to fill out. Basically, it comes down to not creating a form with hundreds of fields on one massively scrolling screen that the user will just freak out over.
The same goes for the views on the form or document library you're storing the form data in. InfoPath forms are just XML stored in a library (so regardless of how many fields you have, the footprint is pretty minimal). You don't want to map and surface every field in the form, nor do you want a view with 100 columns on it. Look at breaking the views down so they're fit for purpose, with only a few hundred items and a few columns in each. It's a balancing act, too, as you don't want to create hundreds of views either, so you need to find out what's right. A good BA or information architect will help with this, with the SharePoint/InfoPath guru and business user helping out.
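If you do end up scripting the views rather than clicking them together, a hedged sketch of a "fit for purpose" view created through the server object model might look like the following; the site URL, library name and field names (CaseIdentifier, Status) are assumptions for illustration:

```csharp
using System.Collections.Specialized;
using Microsoft.SharePoint;

class ViewSetup
{
    static void Main()
    {
        // Site URL, library name and field internal names are placeholders.
        using (SPSite site = new SPSite("http://sharepoint/sites/forms"))
        using (SPWeb web = site.OpenWeb())
        {
            SPList library = web.Lists["Submitted Forms"];

            // Surface only the handful of columns this audience actually needs.
            StringCollection viewFields = new StringCollection();
            viewFields.Add("LinkFilename");
            viewFields.Add("CaseIdentifier");
            viewFields.Add("Status");

            // Scope the view with a CAML filter and keep the row limit modest.
            string query =
                "<Where><Eq><FieldRef Name='Status'/><Value Type='Text'>Open</Value></Eq></Where>";

            library.Views.Add("Open cases", viewFields, query, 100, true, false);
        }
    }
}
```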
We require our information to be utilised for secondary purposes (such as reporting etc). The information MUST be accessible instantly
This is another requirement that's going to be a little difficult to meet exactly. If the library has thousands of items (or tens of thousands) and a view has dozens of fields, then expect the view to come to a crawl (especially if the user is insistent on "seeing everything" and wants the limit of each view set to 1000 items, as if anyone could process that much information at once). Instant access is difficult if you're keeping everything online for a long time (like for reporting). There's the operational side where users are filling out forms, finding forms, editing them, etc., and for that you only want a few hundred items to be live at any given moment (up to a few thousand, but you need to be careful with the views). If you have a list with 100,000 items in it and users are using it for daily activities while also trying to run reports for trending or long-term operations against it, you're going to lose the performance battle. Look at doing reporting offline, potentially shipping the reportable data to a second source like SQL and using SSRS against it. PerformancePoint is an option but adds a layer of complexity to the architecture. The question ultimately falls to what reporting looks like and how important it is in relation to daily operations.
To try to answer your questions directly:
The benefits of using SharePoint over a database are that the data can be easily viewed and sliced and diced. Creating a view is child's play and can quickly show you useful information like the number of sales in a month or customer feedback grouped by call centre person. SharePoint makes it easy to view this information and even set up dashboards, hook in KPIs, etc. without having to get a developer to craft custom web pages. As far as risks go, you need to be careful about letting things grow organically and out of control. Don't let the users design views of the data; they generally aren't sure what they want, so they ask for every column to be available and then just export it all to Excel to slice and dice. Make sure there's a good design around the views and lists, that they're fit for purpose, and that they meet the needs the user is trying to get out of the data. Ask what they're looking for and why; that will help shape what to expose.
Any change needs to be thought out, planned and tested. Adding a column to a SharePoint list is no different from adding a column to a SQL database in that respect. Form updates should be considered, and while you won't get it 100% right the first time, you should try to get as close as possible without going overboard and putting in crazy things like 100 "blank" fields that are players to be named later. Strike a balance by understanding the needs of the users and the company and where things are going. Hopefully someone will have a vision of what this thing might be when it grows up, and that will go a long way toward understanding the impact of change.
The data is just XML, and as long as you're not doing stupid stuff in the form like hard-coding absolute paths to services (use data connection libraries), the impact of growth will be minimal. Growing beyond one web application into multiple ones is a pretty big change and not something to be taken lightly. Even splitting site collections out is big, and there needs to be a really good reason for it. Site collections can handle thousands of sites and millions of documents without issue. Web applications are really there for dividing up areas of interest or separation of purpose (like team sites on one web app and a publishing portal on another) and are not really meant for splitting data due to growth concerns.
Like I said, there's no silver bullet here and what you're asking for is an architecture for a solution that nobody here has all the requirements for. Hope this helps.

Related

Complex Finds in Domain Driven Design

I'm looking into converting part of a large existing VB6 system into .NET. I'm trying to use domain-driven design, but I'm having a hard time getting my head around some things.
One thing that I'm completely stumped on is how I should handle complex find statements. For example, we currently have a screen that displays a list of saved documents that the user can select and print off, email, edit or delete. I have a SavedDocument object that does the trick for all the actions, but it only has the properties relevant to it, and I need to display the client name that the document is for and their email address if they have one. I also need to show the policy reference that this document may have come from. The Client and Policy are linked to the SavedDocument but are their own aggregate roots, so they are not loaded at the same time the SavedDocuments are.
The user is also allowed to specify several filters to reduce the list down. These too can be properties that are stored on the SavedDocument or on the Client and Policy.
I'm not sure how to handle this from a Domain driven design point of view.
Do I have a function on a repository that takes the filters and returns me a list of SavedDocuments, which I then have to turn into a different object or DTO and fill with the additional client and policy information? That seems a little slow, as I have to load all the details using multiple calls.
Do I have a function on a repository that takes the filters and returns me a list of SavedDocumentsForList objects that contain just the information I want? This seems the quickest but doesn't feel like I'm using DDD.
Do I load everything from their objects and do all the filtering and column selection in a service? This seems the slowest, but also appears to be very domain-oriented.
I'm just really confused about how to handle these situations, and I've not really seen any other people asking questions about it, which makes me feel that I'm missing something.
Queries can be handled in a few ways in DDD. Sometimes you can use the domain entities themselves to serve queries. This approach can become cumbersome in scenarios such as yours when queries require projections of multiple aggregates. In this case, it is easier to use objects explicitly designed for the respective queries - effectively DTOs. These DTOs will be read-only and won't have any behavior. This can be referred to as the read-model pattern.
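As a rough illustration of that read-model idea (names such as SavedDocumentSummary, DocumentFilter and ISavedDocumentFinder are mine, not an established API), the query side could look something like this:

```csharp
// Minimal sketch of the read-model pattern: a flat, read-only DTO shaped for
// the document list screen, plus a finder that accepts the user's filters.
using System;
using System.Collections.Generic;

public sealed class SavedDocumentSummary
{
    public Guid DocumentId { get; set; }
    public string Title { get; set; }
    public string ClientName { get; set; }
    public string ClientEmail { get; set; }     // may be null
    public string PolicyReference { get; set; } // may be null
    public DateTime SavedOn { get; set; }
}

public sealed class DocumentFilter
{
    public string ClientNameContains { get; set; }
    public string PolicyReference { get; set; }
    public DateTime? SavedAfter { get; set; }
}

// Lives alongside the repositories but bypasses the aggregates entirely; the
// implementation is free to run a single query that joins the underlying data.
public interface ISavedDocumentFinder
{
    IReadOnlyList<SavedDocumentSummary> Find(DocumentFilter filter);
}
```

The write side (print, email, edit, delete) still goes through the SavedDocument aggregate and its repository; only the list screen reads through the finder, whose implementation is free to issue one SQL query joining the SavedDocument, Client and Policy tables.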

Can I query SAP BO WEBI via Excel VBA? Can I do it fast enough?

Following up on my previous post, I need to be able to query a database of 6M+ rows in the fastest way possible, so that this DB can be effectively used as a "remote" data source for a dynamic Excel report.
Like I said, normally I would store the data I need on a separate (perhaps hidden) worksheet and manipulate it through a second "control" sheet. This time, the size (i.e. number of rows) of my database prevents me from doing so (as you all know, Excel cannot handle much more than 1M rows).
The solution my IT guy put in place consists of holding the data in a txt file inside a network folder. So far, I have managed to query this file through ADO (slow, but no maintenance needed) or to use it as a source to populate an indexed Access table, which I can then query (faster, but requires more maintenance and additional software).
I feel both solutions, although viable, are sub-optimal. Plus it seems to me as all of this is but an unnecessary overcomplication. The txt file is actually an export from SAP BO, which the IT guy has access to through WEBI. Now, can't I just query the BO database through WEBI myself in a "dynamic" kind of way?
What I'm trying to say is, why can't I extract only bits of information at a time, on a need-to-know basis and directly from the primary source, instead of having all of the data transferred in bulk to a secondary/duplicate database?
Are these sorts of "dynamic" queries even possible? Or will the processing times hinder the success of my approach? I need this whole thing to really feel instantaneous, as if the data were already there and I'm not actually retrieving it every time.
And most of all, can I do this through VBA? Unfortunately, that's the only thing I will have access to; I can't do this on the BO side.
I'd like to thank you guys in advance for whatever help you can grant me!
Webi (short for Web Intelligence) is a front-end analytical reporting application from Business Objects. Your IT contact apparently has created (or has access to) such a Webi document, which retrieves data through a universe (an abstraction layer) from a database.
One way that you could use the data retrieved by Web Intelligence as a source and dynamically request bits instead of retrieving all information in one go is to use a feature called BI Web Service. This will make data from Webi available as a web service, which you could then retrieve from within Excel. You can even make this dynamic by adding prompts, which put restrictions on the data retrieved.
Have a look at this page for a quick overview (or Google Web Intelligence BI Web Service for other tutorials).
Another approach could be to use the SDK, though as you're trying to manipulate Web Intelligence, your only language options are .NET or Java, as the Rebean SDK (used to talk to Webi) is not available for COM (i.e. VBA/VBScript/…).
Note: if you're using BusinessObjects BI 4.x, remember that the Rebean SDK is actually deprecated and replaced by a REST SDK. This could make it possible to approach Webi using VBA after all.
That being said, I'm not quite sure if this is the best approach, as you're actually introducing several intermediate layers:
1. Database (holding the data you want to retrieve)
2. Universe (semantic abstraction layer)
3. Web Intelligence
4. A way to get data out of Webi (manual export, web service, SDK, …)
5. Excel
Depending on your license and what you're trying to achieve, Xcelsius or Design Studio (BusinessObjects BI 4.x) could also be a viable alternative to the Excel front-end, thereby eliminating layers 3 to 4 (and replacing layer 5). The former's back-end is actually heavily based on Excel (although there's no VBA support). Design Studio allows scripting in JavaScript.

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
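A small sketch of that filter-then-join flow, with hypothetical source interfaces standing in for the fast SQL database and the slow legacy system (the names and methods here are illustrative, not a real API):

```csharp
// Sketch of the "restrictive search" idea: get candidate ids from the fastest
// source first, then enrich and filter against the slower systems.
using System;
using System.Collections.Generic;
using System.Linq;

public interface ISqlPeopleSource
{
    // Fast: returns ids of people matching the date-of-birth criteria.
    IEnumerable<int> FindIdsByBirthRange(DateTime from, DateTime to);
}

public interface ILegacyRegionSource
{
    // Slow: looks up the sales region for a single person.
    string GetSalesRegion(int personId);
}

public class PersonSearch
{
    private readonly ISqlPeopleSource _sql;
    private readonly ILegacyRegionSource _legacy;

    public PersonSearch(ISqlPeopleSource sql, ILegacyRegionSource legacy)
    {
        _sql = sql;
        _legacy = legacy;
    }

    public IReadOnlyList<int> Find(DateTime from, DateTime to, string region)
    {
        // AND semantics only: each step narrows the candidate set, so the
        // slow system is hit for as few records as possible.
        return _sql.FindIdsByBirthRange(from, to)
                   .Where(id => _legacy.GetSalesRegion(id) == region)
                   .ToList();
    }
}
```

Each step only narrows the candidate set, which is why OR criteria break the approach: you can no longer prune before hitting the slow systems.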
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and between the application and the channel. The integration piece does the work of making the actual query and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so customizing the integration pieces was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transform the results into normalized XML, and return them to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources were there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could also have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts it in a Lucene index. Your search could work against this index, and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
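A minimal sketch of the indexing side using Lucene.NET (assuming the 3.x API; the Java version is nearly identical). The field names and sample values are placeholders:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using LuceneVersion = Lucene.Net.Util.Version;

class PersonIndexer
{
    static void Main()
    {
        var dir = FSDirectory.Open(new DirectoryInfo("person-index"));
        var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_30);

        // Index one person pulled from the various source systems.
        using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("personId", "12345", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("sourceSystem", "ABC", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("salesRegion", "North West", Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
            writer.Commit();
        }

        // Search the local index instead of the slow source systems.
        using (var searcher = new IndexSearcher(dir, true))
        {
            var parser = new QueryParser(LuceneVersion.LUCENE_30, "salesRegion", analyzer);
            var hits = searcher.Search(parser.Parse("north"), 10);
            foreach (var scoreDoc in hits.ScoreDocs)
            {
                var found = searcher.Doc(scoreDoc.Doc);
                Console.WriteLine("{0} (from {1})", found.Get("personId"), found.Get("sourceSystem"));
            }
        }
    }
}
```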
Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd apply the previously mentioned caching and parallelization optimizations.
However, with all of that, you will need to weigh up the development time, deployment time and long-term benefits of the effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so that may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth), then you should cache it. If you employ caching, you could make your system configurable as to which tables/columns to include or exclude from the cache, and you could give each table/column its own cache timeout with an overall default.
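A rough sketch combining the parallel-query and caching suggestions from this answer; the three source lookups are hypothetical stubs standing in for the SQL, legacy and web-service systems, and the attribute/class names are mine:

```csharp
using System;
using System.Runtime.Caching;
using System.Threading.Tasks;

public class PersonAttributes
{
    public DateTime DateOfBirth { get; set; }
    public string SalesRegion { get; set; }
    public string Extras { get; set; }
}

public class PersonAttributeService
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public async Task<PersonAttributes> GetAsync(int personId)
    {
        // Fire all three lookups at once so total latency is roughly the
        // slowest single system, not the sum of all of them.
        var dobTask = Task.Run(() => QuerySqlForDateOfBirth(personId));
        var regionTask = Task.Run(() => QueryLegacyForSalesRegion(personId));
        var extrasTask = Task.Run(() => QueryWebServiceForExtras(personId));

        await Task.WhenAll(dobTask, regionTask, extrasTask);

        return new PersonAttributes
        {
            DateOfBirth = dobTask.Result,
            SalesRegion = regionTask.Result,
            Extras = extrasTask.Result
        };
    }

    // Cache slow-changing attributes (e.g. date of birth) with a per-attribute timeout.
    private DateTime QuerySqlForDateOfBirth(int personId)
    {
        string key = "dob:" + personId;
        if (Cache.Get(key) is DateTime cached) return cached;

        DateTime dob = DateTime.MinValue;        // placeholder: real SQL call goes here
        Cache.Set(key, dob, DateTimeOffset.Now.AddHours(24));
        return dob;
    }

    private string QueryLegacyForSalesRegion(int personId) { return "North"; } // placeholder: slow legacy call
    private string QueryWebServiceForExtras(int personId) { return ""; }       // placeholder: XML web service call
}
```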
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.

How to see before/after state of entities in SubSonic?

I have a few large forms where I need to provide visual cues about the before/after state, so the person approving the form can see what has been modified (not the previous answer, though that would be a plus). This is currently being done with an extra column for each column of data (Name, Name_IsModified, Phone, Phone_IsModified, etc...). I'm curious if there is a better way of getting around this, leveraging SubSonic?
The initial load is done by grabbing data from 6 source tables on 3 different servers. This data is saved in the form tables, where it resides until it is approved by various people, who will manually update it into the live systems that then update the 6 source tables. The visual cues are primarily used during the approval process, but are occasionally used to research when a change was made in the past.
Since I have to make a bunch of updates, I thought this might be a good time to break away from the legacy 2000+ lines of code, making my job a bit easier!!!
Thanks,
Zach
All of the properties on SubSonic objects are actually collections and you can pull this out and review changes - all without reflection.
We have a "DirtyColumns" collection (not sure if it's public or not) that we use to run updates - this would be the thing you'd want to have a look at.
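Purely as a hedged sketch of how that could be used for the approval screen: FormEntry stands in for one of your generated SubSonic ActiveRecord classes, and DirtyColumns is the collection mentioned above - its exact name and visibility vary by SubSonic version, so verify it against your build before relying on it:

```csharp
// Hedged sketch only: assumes a generated SubSonic ActiveRecord class
// ("FormEntry" is hypothetical) that exposes the dirty-column collection.
using System.Collections.Generic;

public static class ChangeTracker
{
    // Returns the names of the columns that changed, so the approval screen
    // can highlight them instead of persisting *_IsModified columns.
    public static List<string> GetModifiedColumnNames(FormEntry entry)
    {
        var modified = new List<string>();
        foreach (var column in entry.DirtyColumns)   // see caveat above re: name/visibility
        {
            modified.Add(column.ColumnName);
        }
        return modified;
    }
}
```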

Is creating a view on SharePoint tables bad style?

I have only been working with SharePoint for three months, but right from the start I was told that the SharePoint content DB was off limits, as MS could change the schema at any time. The recommended route is to use the object model, and in most cases I kind of understand that.
Now I need to join some lists in order to present the content grouped by some specific fields. Rather than iterating through each and every list, I would prefer to link our own DB, which resides on the same database server, to the WSS content DB and just create a view on the tables. This view would be on our DB, in order to make sure that we don't change ANYTHING on the WSS content DB.
Am I on the route to eternal damnation or not?
Yes, you are. Microsoft is very clear that any modification to the SharePoint tables renders you unsupported.
Direct modification of the SharePoint database or its data is not recommended because it puts the environment in an unsupported state.
Now, creating a link on your own DB which queries the SharePoint DB is shaky ground. Personally I'd do one of two things:
If this is a mission-critical application, run it past MSFT support.
If it is anything else, just make sure that your view is not locking the DB during querying.
A better strategy might be to iterate the lists and sync them to your own table so you can do whatever kind of data mining you'd like - if you don't mind whatever lag time your sync routine would need.
SharePoint pretty much relies on total "ownership" of the underlying database.
Small things like another process reading from the SharePoint database could potentially slow down SharePoint's operations in unexpected ways.
As SharePoint does not usually update in a "real time" manner, it should be good enough to create a process that queries the SharePoint lists and adds the data to a table in your own application.
Schedule the crawl for a low-activity period and voilà - a solution that is not going to risk unexpected slowdowns to SharePoint.
Start your search on querying SharePoint at SPQuery.
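A minimal sketch of that crawl-and-copy process using SPQuery; the site URL, list name, CAML filter date and target table are illustrative assumptions (in practice the watermark date would come from your own sync table):

```csharp
using System;
using System.Data.SqlClient;
using Microsoft.SharePoint;

class ListSync
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://sharepoint/sites/intranet"))
        using (SPWeb web = site.OpenWeb())
        using (var conn = new SqlConnection("Data Source=.;Initial Catalog=Reporting;Integrated Security=True"))
        {
            conn.Open();
            SPList list = web.Lists["Projects"];

            // Only pull items changed since the last crawl (hard-coded here for brevity).
            SPQuery query = new SPQuery
            {
                Query = "<Where><Geq><FieldRef Name='Modified'/>" +
                        "<Value Type='DateTime' IncludeTimeValue='TRUE'>2024-01-01T00:00:00Z</Value>" +
                        "</Geq></Where>",
                RowLimit = 500
            };

            foreach (SPListItem item in list.GetItems(query))
            {
                using (var cmd = new SqlCommand(
                    "INSERT INTO dbo.ProjectSnapshot (ItemId, Title, Modified) VALUES (@id, @title, @modified)", conn))
                {
                    cmd.Parameters.AddWithValue("@id", item.ID);
                    cmd.Parameters.AddWithValue("@title", item.Title);
                    cmd.Parameters.AddWithValue("@modified", (DateTime)item["Modified"]);
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}
```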
Check out SLAM, the SharePoint List Association Manager. It allows you to easily push your SharePoint data to SQL, including complex joins (one to one, one to many, many to many). And it keeps the data synchronized in real time.
http://slam.codeplex.com
Well, if the joins you need to do are pretty simple, defining a linked data source in SharePoint Designer may work for you.
