I have to display large amounts of data on Google Maps. Fusion Tables could be a solution, but it has a limit of 500 rows/polylines per tile. I think we can easily exceed this limit, so I need to find another solution.
I think that creating a KML file could be a good alternative. However, if I display the KML on Google Maps it will be easily accessible by third parties, and this is something our client won't allow.
Is there any way to display a KML on a map and prevent others from downloading it?
Thank you very much.
One option would be Fusion Tables for Business and a private table.
First of all, apologies for what might seem like an 'amateur' scenario & question...
Situation
I have many, many documents (100,000) that I need users to be able to search and browse via a web application we are building
This search functionality is just one of several functions
I currently have around a dozen Excel spreadsheets that contain the 'metadata' (title, date, author, source, country, etc.), i.e. the document information
Each of the 100,000 'records' (or Excel rows) has a unique identifier
The actual files (mostly PDF, but some Word & Excel) are stored in Dropbox using the corresponding unique identifier as the file name
Questions
Is Apache Solr the best tool to use in order to provide the search functionality?
What is the best design to facilitate this (e.g. Files in AWS S3 etc.)?
What is the best method to migrate from Excel/Dropbox to the proposed Apache Solr solution?
I very much appreciate any assistance as I have just been getting many different answers from paid consultants.
Regards
Mark
Your questions:
Q: Is Apache Solr the best tool to use in order to provide the search functionality?
A: In my opinion, Solr is an awesome option for things like this. However, as you've discovered, there's "some assembly required" (and that's putting it mildly).
Q: What is the best design to facilitate this (e.g. Files in AWS S3 etc.)?
A: If it were me, I'd use the filesystem. I think it's the easiest to debug.
Also, if it were me, I'd export the Excel sheets to CSV; I think that's a bit easier to work with. Solr does include the open source Tika filters, which do support Excel, but they won't treat your multiple records as multiple documents; Tika will turn each Excel sheet into just one document.
Q: What is the best method to migrate from excel/Dropbox to the proposed Apache solr solution?
A: I'm a fan of the local filesystem. Dropbox does let you mirror your Dropbox files in a local directory. And as I said before, if you can get the Excel sheets exported to CSV in some automated or "macro" way, I think it'd make your life easier too. For example, Python can read and write CSV files and is a great tool for massaging data into its final form.
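To give a feel for the indexing step, here is a rough sketch of pushing CSV rows into Solr with the SolrJ client. The core name, CSV layout, and field names (id, title, author, date, country) are assumptions for illustration; your actual schema and a proper CSV parser would replace them:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CsvToSolr {
    public static void main(String[] args) throws Exception {
        // Assumed core URL and CSV column order: id,title,author,date,country
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();
             BufferedReader in = Files.newBufferedReader(Paths.get("metadata.csv"))) {
            in.readLine(); // skip the header row
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",", -1); // naive split; use a real CSV parser for quoted fields
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", cols[0]);
                doc.addField("title", cols[1]);
                doc.addField("author", cols[2]);
                doc.addField("date", cols[3]);
                doc.addField("country", cols[4]);
                solr.add(doc);
            }
            solr.commit(); // make the new documents searchable
        }
    }
}
```

The PDF/Word/Excel files themselves could then be indexed in a separate pass (e.g. via Tika), keyed by the same unique identifier used as the file name.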
If you don't mind commercial solutions, you might consider Lucidworks Fusion; it does include a bunch of connectors, including a filesystem datasource connector. Disclaimer: I happen to work for Lucid, but I'm listing this suggestion last, after the "free" answers. And I'd mention this anyway, even if I didn't work there.
I have a lot of documents I want to store in a document library in SharePoint 2010. We're talking about 50k+ documents. I've worked with document libraries many times, but not of this size and I find myself getting confused about some definitions when it comes to how these should be stored and the number of elements allowed.
By looking here: http://technet.microsoft.com/en-us/library/cc262787%28v=office.14%29.aspx#ListLibrary it says that a document library can hold up to 30 million documents. Nice! 50k is nowhere close to 30 million. However, can I just dump all of the documents into a library without grouping them in views or sub folders? Because a view can only display 5k elements, I would otherwise have to create multiple views and spread the documents across them so as not to exceed this limit.
Now, the documents, and the library, will most likely never be browsed by going to the library. Each document will be linked from another place, and this will also not be that often. Therefore I am kind of hoping I can just dump all the documents in one big library. I have read that if the number of elements in a list exceeds 5k, SharePoint will not run the query to return everything, but will instead replace it with some default query. In my case this is fine, but are there other concerns about dumping this many files into one library in SharePoint 2010? And is there anything else I may not have thought about?
Also, a quick question at the end: I am planning on scripting the upload using PowerShell, but I have heard from others that uploading documents this way to SharePoint can take a lot of time because it uploads one document at a time. Is it possible to "bulk upload" documents through PowerShell or another approach?
The key here is to understand that SharePoint can STORE up to 30 million documents, but can only display 5,000 at a time. The easiest way to stay within that is to dump the documents into separate folders with no more than 5,000 documents in each folder. It's easy to do that, but I'm not a big fan of folders, since they impose a single organizational structure on a set of documents. Applying metadata and then filtering views is more efficient in the long run, but much harder to do when dumping documents into a library. I would suggest looking at some of the third-party migration software that can do this kind of bulk upload and still maintain appropriate metadata. One I've used (there are others) is Metalogix Content Matrix.
I am facing the following challenge in an XPage: there are three databases with exactly the same views in them. The goal is to unite these three views from the three databases in one XPage and one view component!
AFAIK, one can usually provide just one view per view component. Currently I have a Java back end where the documents are fetched. They are then processed into HTML markup and made more beautiful / functional using jQuery DataTables.
I see (at least) three disadvantages:
It is quite a lot of code, and if you want to display another view from the databases you quickly run into boilerplate code...
It is not very fast, as it takes up to 30 seconds to fetch and display all records.
I can hardly imagine that my way is best practice.
Has anyone ever faced this challenge? I would like to reduce Java code, make it faster and use some standard component if possible.
Tim has good questions in his comment. With your current approach, make sure you use the ViewNavigator cache, which is the fastest way to retrieve view entries:
Notes/Domino release 8.5.2 or greater
View.setAutoUpdate must be False
ViewNavigator cache must be enabled
ViewNavigator.getNext() (or getPrev) must be used
http://www-10.lotus.com/ldd/ddwiki.nsf/dx/Fast_Retrieval_of_View_Data_Using_the_ViewNavigator_Cache
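A minimal sketch of that pattern, applied to your scenario of merging the same view from three databases. The database paths, view name, column index, and the buffer size of 400 are illustrative assumptions rather than a drop-in implementation:

```java
import lotus.domino.*;
import java.util.ArrayList;
import java.util.List;

// Merges the first column of the same view from several databases, using the
// ViewNavigator cache described in the article above (requires 8.5.2+).
public class MergedViewReader {
    public static List<String> readAll(Session session, String server,
            String[] dbPaths, String viewName) throws NotesException {
        List<String> rows = new ArrayList<String>();
        for (String path : dbPaths) {
            Database db = session.getDatabase(server, path);
            View view = db.getView(viewName);
            view.setAutoUpdate(false);            // required for the cache to be effective
            ViewNavigator nav = view.createViewNav();
            nav.setBufferMaxEntries(400);         // enables the navigator read-ahead cache
            ViewEntry entry = nav.getFirst();
            while (entry != null) {
                rows.add(String.valueOf(entry.getColumnValues().get(0))); // first column, illustrative
                ViewEntry next = nav.getNext();   // getNext()/getPrev() keep the cache in play
                entry.recycle();
                entry = next;
            }
            nav.recycle();
            view.recycle();
            db.recycle();
        }
        return rows;
    }
}
```

The merged list could then back a repeat control or a standard data table instead of hand-built HTML, which should also cut down the boilerplate you mention.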
I have a general requirement in my current project to make an existing XPages application faster. One thing we looked at was how to speed up some slower type-ahead fields, and one solution which seems to be fast is implementing it using FTSearch rather than the DbColumn lookup we originally had. I want to get advice on whether this would be an OK approach, or if there are any suggestions to do what we need in a different way.
Background:
While there are a number of factors affecting the speed (like network latency, server OS, available server memory etc.), as we are using 8.5.3, we have optimized the application in general as far as we can, making use of the IBM Toolkit to find problem areas, and also using the features IBM added to help with this in 8.5.3 (e.g. partial execution, using the optimized JS and CSS option, etc.). Unfortunately we are stuck with the server running on a 32-bit Windows OS with 3.5 GB of RAM for another few months.
One of the slowest elements to respond are in certain type-aheads which reference a large number of documents. The worst one averages around 5 or 6 seconds before the suggested list appears for a type-ahead enabled field.
It uses SSJS to call a Java class to perform a dbcolumn call (using Ferry Kranenburg's XPages snippet) to get a unique list from a view; then, back in SSJS, it loops through the array to check if each entry contains the search key value, and if found it adds a highlight (bold) HTML tag around the search text in the word, then returns the formatted list back to the browser.
I added a print statement to output the elapsed time it takes to run the code, and on average today on our dev server it is around 3250 ms.
I tried a few things to see how we could make this process faster:
1. Added a Java class to do all processing (so not using SSJS). This only saved an average of 100ms.
2. Using a view-scoped managed bean, I loaded the unique lookup list into memory when the page is loaded. This produces a really fast type-ahead response (16ms), but I suspect this is a very bad way to do this with a large data set - and could really impact the server in general if multiple users were accessing the application. I tried to find information on what would be considered a large object, but couldn't find any guidance or recommendation on how much is too much to store in memory (I searched JSF and XPages sites). Does anyone have any suggestions on this?
3. Still in a Java class - instead of performing a dblookup to get the 'list' of all values to search through, I have the code run an FT search to get the doc collection, then loop each doc to extract the field value I want and add those to a 'SortedSet' (which automatically disallows duplicates), then loop the sorted set to insert the bold tags around the search term, and return that to the browser. This takes on average 100ms, which is great and barely noticeable. Are there any drawbacks to this approach, or reasons I should not do it this way?
Thanks for any feedback or advice on this.
Pam.
Update, Aug. 14, 2013: I tried another approach (inspired by the IBM/Tony McGuckin Insights application on OpenNTF), as the Company Search type-ahead in that application uses managed beans and is fast across a lot of data.
4. Although the Insights application deals with data split across multiple databases, the principle for the type-ahead is similar. I couldn't use a view with getAllEntriesByKey though, as I needed to search for a string within the text too, not just at the start of the entry. I tried creating a ViewEntryCollection based on a view FTSearch, but as we have a lot of duplicate names in the column, this didn't give the unique list I wanted. I then tried using a NotesViewNavigator on a categorized view, and looping through that. This produced the unique list I needed, but it turned out to be slower than any of the other methods above. (I did implement these ViewNavigator performance tips.)
From my standpoint, performance may be affected by any of the many layers that every Domino application (not only XPages) consists of.
From the top: the browser (DOM, JS, CSS, HTML...), the network (latencies, DNS, SSO...), down to the application layer (effective algorithms, caches), the database/API (amount of data, indexes, reader names...), and OS/hardware (disks, memory...).
Regarding the things you tested:
1. That is interesting, but could be expected: SSJS is cached and may use a lower-level API to get data (NAPI).
2. For your environment (32-bit / 3.5 GB RAM; I assume the 3.5M mentioned was a typo) I do NOT recommend caching big lists, especially if you apply it as a pattern to many fields/forms/applications. A cache in a WeakHashMap could be more stable, though.
3. Use of FT search is perfectly fine, unless you need data that updates frequently; the FT index needs some time and resources to update.
My suggestion is: go for FT if it solves your problem. But definitely troubleshoot FT performance with a heavy performance test on your server first.
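For reference, a minimal sketch of the FT-based approach described in point 3 of the question could look like this. The field name, the quoted query form, and the simple highlighting are illustrative assumptions, not a definitive implementation:

```java
import lotus.domino.*;
import java.util.TreeSet;

// Sketch of an FT-based type-ahead: search the database, collect unique field
// values in a SortedSet, and wrap the typed text in <b> tags for highlighting.
public class TypeAheadHelper {
    public static String suggest(Database db, String typed) throws NotesException {
        // Quote the typed text so it is treated as a literal phrase (assumption).
        DocumentCollection col = db.FTSearch("\"" + typed + "\"", 0);
        TreeSet<String> unique = new TreeSet<String>();
        Document doc = col.getFirstDocument();
        while (doc != null) {
            unique.add(doc.getItemValueString("CompanyName")); // illustrative field name
            Document next = col.getNextDocument(doc);
            doc.recycle();
            doc = next;
        }
        StringBuilder html = new StringBuilder("<ul>");
        for (String value : unique) {
            // Simple highlight; a case-insensitive replace would be more robust.
            html.append("<li>").append(value.replace(typed, "<b>" + typed + "</b>")).append("</li>");
        }
        return html.append("</ul>").toString();
    }
}
```

Keeping this in a managed bean method leaves the SSJS layer thin, which matches what you already found in your own tests.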
(I cannot comment because of my low reputation)
I have recently been tackling a similar problem. Here are some additional points to consider:
Are there many duplicate keywords in the view? Consider making a categorized view for @DbColumn.
FTSearching a view is often slower than FTSearching a database, I believe. See Andre Guirard's article. Consider using db.FTSearch() and refining your FT query to include the view's selection formula, if possible.
The FT index can be updated programmatically with db.updateFTIndex(), as sketched below. If keywords are added rarely but need to be instantly available, you can perform the index update in the keyword document's QuerySave event (or similar). We used this approach when the keywords were stored in a different (much smaller) database, and the update was very fast.
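A rough sketch of that programmatic index update; the database path is a placeholder, and in a real save event you would take the database from the saved document's context:

```java
import lotus.domino.*;

// Sketch: refresh the full-text index of the (small) keyword database right
// after a keyword document is saved, so new keywords are searchable at once.
public class KeywordIndexUpdater {
    public static void refreshIndex(Session session) throws NotesException {
        Database keywordDb = session.getDatabase("", "keywords.nsf"); // placeholder path
        keywordDb.updateFTIndex(false); // false = update the existing index, don't create a new one
        keywordDb.recycle();
    }
}
```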
The memory consumption can be checked this way:
Install XPages Toolbox from OpenNTF.
Open your application.
Create a JVM memory dump (Session dumps - Generate Heap Dump).
Install Eclipse Memory Analyzer Tool.
Install IBM Diagnostic Tool Framework into Memory Analyzer.
Load your memory dump into MAT. You will see every Java object and their sizes.
In the end, I believe that there is no single general answer to your question. You need to test different approaches to find the fastest solution in your environment.
One problem with FT search is this error:
The full text index for this database is in use
Based on my experience this will occur for a while (maybe a few seconds) when the indexer task starts to index the database. If your users are not very demanding they can just try again and it will probably work.
But in many cases you want to minimize the errors the users see and will have to handle this error nicely. I've built my own FTSearch method which waits a bit and tries again until the error is no longer received. This shows up as slowness to the user instead of an error.
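A simplified sketch of that kind of retry wrapper; the retry count, the wait time, and matching on the error text are arbitrary assumptions (checking the NotesException error code for your release would be more robust):

```java
import lotus.domino.*;

// Sketch: retry an FT search while the index is busy, so the user sees a short
// delay instead of the "index is in use" error.
public class RetryingSearch {
    public static DocumentCollection ftSearchWithRetry(Database db, String query)
            throws NotesException, InterruptedException {
        int attempts = 0;
        while (true) {
            try {
                return db.FTSearch(query, 0);
            } catch (NotesException e) {
                attempts++;
                // Matching on the message text is an assumption; comparing e.id against the
                // documented error code for your release would be more reliable.
                boolean indexBusy = e.text != null && e.text.contains("full text index");
                if (!indexBusy || attempts >= 5) {
                    throw e; // give up after a few tries, or on any other error
                }
                Thread.sleep(500); // wait half a second before retrying (arbitrary)
            }
        }
    }
}
```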
I am currently architecting a large SharePoint deployment.
This deployment has the potential to grow to petabytes in size over the course of several years.
One of the current issues we are discussing is the option of storing our data in SharePoint using InfoPath forms. Some of these forms contain hundreds of fields and require a lot of mapping to content types for persistence and search. Our search requirement is primarily a single identifier and NOT the contents of the forms, although I am told I should anticipate the "want" to search in the future.
We require our information to be utilised for secondary purposes (such as reporting etc.). The information MUST be accessible instantly after being persisted to the system.
My core questions therefore are:
What are the benefits/risks of this approach compared to storing our data in a single relational store using web-service persistence?
If we decided on this approach, what would be the impact of changing the forms and content types over time?
What happens when our farm grows beyond a single web application / site collection? How accessible will the information be?
Will I know where it is, and how portable will the information be over time?
1.)
Benefits:
Form templates can be created & deployed (relatively) easily
You can easily configure Field Validation
Probably no code involved
Risks:
Hitting SharePoint 2010 limits (not as uncommon as you might think)
Needs careful form design/planning (correct XML structure)
Information only accessible via the SharePoint object model or web services (very slow)
2.) Well, this is a tough one. Changing the form template and re-deploying is easy and only takes a few minutes. However, changing the structure (underlying XML) of the template can get you in trouble very easily, because older (filled out) forms will be invalid. There is an out-of-the-box option to "upgrade" older forms, but in my experience it never worked the way it was supposed to.
Content types behave very similarly: say you want to delete a column from a content type because it's no longer needed; you'll have to remove all references to it, which means removing all items before you can delete the column.
3.) Well, portability is definitely an issue with InfoPath, because it heavily relies on the corresponding URL structure. You absolutely can add more site collections, but this means you have to deploy your form template to each site collection. Information (filled out forms) can't easily be shared between site collections, because each form contains the SourceURL (where it came from) and the namespace of the template (which changes constantly once you deploy).
Considering your requirements, I would strongly recommend a relational store instead of InfoPath, simply because InfoPath is not designed to be a data store.
I would use a SQL database to store the data and a custom UI (web part or application page) to perform CRUD operations. This means that the information is not actually stored in SharePoint, just displayed (which also means that it can't be searched with the built-in SharePoint search). There is also the possibility of using the Business Connectivity Services (which basically does all of the above without you needing to create a custom UI; however, it is very slow with large amounts of data).
If you do need the information just in SharePoint, why not just make all this happen with Lists only?
This is going to be a long one and may not have an answer just because there's no silver bullet for what you're looking for. It's mostly insight and ultimately the choice is up to you.
the option of storing our data in SharePoint using InfoPath Forms
This statement throws me a little. SharePoint data is stored in SharePoint (well, SQL technically) but InfoPath is just a UI layer for accessing any part of that data.
Some of these forms contain 100s of fields and require a lot of mapping to content types for persistence and search
From this I assume there are multiple forms which would mean different types of data being accessed (and probably different purposes). Hundreds of fields is no problem and it really boils down to managing the form and view design.
From the form side you should check out the cxpartners form design crib sheet. This gives you a nice standard to follow to manage all those fields. Another thing would be to look at breaking the form up into tabs or views itself (in InfoPath) based on what the user needs to fill out. Basically it comes down to not creating a form with hundreds of fields on one massively scrolling screen that the user will just freak out over.
Same with the views on the form or document library you're storing the form data in. InfoPath forms are just XML stored in a library (so regardless of how many fields you have, the footprint is pretty minimal). You don't want to map and surface every field in the form, nor do you want to have a view with 100 columns on it. You should look at breaking down the views so they're fit for purpose, with only a few hundred items and a few columns in each view. It's a balancing act too, as you don't want to create hundreds of views either, so you need to find out what's right. A good BA or information architect will help with this, with the SharePoint/InfoPath guru and business users helping out.
We require our information to be utilised for secondary purposes (such as reporting etc). The information MUST be accessible instantly
This is another requirement that's going to be a little difficult to meet exactly. If the library has thousands of items (or tens of thousands) and a view has dozens of fields, then expect the view to come to a crawl (especially if the user is insistent on "seeing everything" and wants the limit of each view set to 1000 items, as if anyone could process that much information at once). Instant access is difficult if you're keeping everything online for a long time (like for reporting). There's the operational side where users are filling out forms, finding forms, editing them, etc., and for that you only want a few hundred items to be live at any given moment (up to a few thousand, but you need to be careful with the views). If you have a list with 100,000 items in it and users are using it for daily activities while also trying to run reports for trending or long-term operations against it, you're going to lose the performance battle. Look at doing reporting offline, potentially shipping the reportable data to a second source like SQL and using SSRS against it. PerformancePoint is an option but adds a layer of complexity to the architecture. The question will ultimately come down to what reporting looks like and how important it is in relation to daily operations.
To try to answer your questions directly:
The benefits of using SharePoint over a database are that the data can be easily viewed and sliced and diced up. Creating a view is child's play and can quickly show you useful information like # of sales in a month or customer feedback grouped by call centre person. SharePoint makes it easy to view this information and even set up dashboards, hook in KPIs, etc. without having to get some developer to craft custom web pages. As far as risks go, you need to be careful about letting things grow organically and out of control. Don't let the users design views of the data; they generally want something but aren't sure what, and will ask for all columns to be available, which they then just export to Excel to slice and dice. Make sure there's a good design around the views and lists, that they're fit for purpose, and that they meet what the user is trying to get out of the data. Ask the question of what they're looking for and why; that will help shape what to expose.
Any change needs to be thought out, planned and tested. Adding a column to a SharePoint list is no different from adding a column to a SQL database. Form updates should be considered, and while you won't get it 100% right the first time, you should try to get as much as possible without going overboard and putting in crazy things like 100 "blank" fields that are players to be named later. Strike a balance by understanding the needs of the users and the company and where things are going. Hopefully someone will have a vision of what this thing might be when it grows up, and that'll go a long way to understanding the impact of change.
Data is just XML, and as long as you're not doing stupid stuff in the form like hard-coding absolute paths to services (use data connection libraries), the impact of growth will be minimal. Growing beyond a web application into multiple ones is a pretty big change and not something to be taken lightly. Even splitting site collections out is big, and there needs to be a really good reason for it. Site collections can handle thousands of sites and millions of documents without issue. Web applications are really there for dividing up areas of interest or separation of purpose (like team sites on one web app and a publishing portal on another) and not really meant for splitting data due to growth concerns.
Like I said, there's no silver bullet here and what you're asking for is an architecture for a solution that nobody here has all the requirements for. Hope this helps.