SharePoint search of external RSS feeds

I want my SharePoint site to allow a user to search content in a known collection of RSS feeds. Conceptually, I can think of a few ways to do this:
Crawl the feeds at their source (yikes!)
Pull the full articles into my SharePoint site, then let my crawler crawl them
Make use of an existing index (like Google)
Search the full articles on demand, using something like a Google utility (my preference)
So can I somehow, from my SharePoint site, allow a user to search the full articles from a couple dozen named RSS feeds?
Thanks,
Cary

I don't see why there would be a problem with crawling the feeds at their source; that seems reasonable.
It is fairly easy to create a content source that points at the feeds and select an appropriate indexing schedule. If that does not work, you can try a more complicated approach.
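If you would rather script that than click through Central Admin, a minimal sketch with the search administration object model might look like this (the SSP name, content source name, feed URLs, and depth setting are all placeholders/assumptions, not tested against your farm):

    // Rough sketch: create a "Web" content source pointed at a few feed URLs
    // (Microsoft.Office.Server.Search.Administration).
    // "SSP1" and the feed addresses below are placeholders for your environment.
    using System;
    using Microsoft.Office.Server.Search.Administration;

    class CreateFeedContentSource
    {
        static void Main()
        {
            SearchContext context = SearchContext.GetContext("SSP1");
            Content content = new Content(context);
            ContentSourceCollection sources = content.ContentSources;

            // A web content source crawls plain HTTP(S) addresses, which covers RSS XML.
            WebContentSource feeds =
                (WebContentSource)sources.Create(typeof(WebContentSource), "External RSS feeds");

            feeds.StartAddresses.Add(new Uri("http://example.com/news/rss.xml"));
            feeds.StartAddresses.Add(new Uri("http://example.org/blog/feed"));

            // Keep the crawl shallow so you index the feed (and directly linked articles) only.
            feeds.MaxPageEnumerationDepth = 1;

            feeds.Update();   // persist the new content source
        }
    }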
Be aware that copying the content of another website to host on your own could have copyright implications (not to mention the risk that any inflammatory content would appear to be published on your own site).
--update--
Try reading the target site's robots.txt (if it even has one) to see whether it specifies a desired crawl frequency. Otherwise it depends on the depth of the site you would be crawling.
If you are crawling just the RSS feed XML, I suspect you could do that every hour without annoying anyone. If you reach into each article, though, you may want to limit the frequency. It really depends on any relationship you have with the target site and the type of site you are hitting.
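If you want to check a site's stated preference by hand before picking a schedule, here is a quick throwaway sketch (the host is a placeholder; Crawl-delay is a non-standard directive, so plenty of sites won't have it):

    // Quick check of a target site's robots.txt for a Crawl-delay hint.
    using System;
    using System.Net;

    class RobotsCheck
    {
        static void Main()
        {
            using (var client = new WebClient())
            {
                string robots = client.DownloadString("http://example.com/robots.txt");
                foreach (string line in robots.Split('\n'))
                {
                    string trimmed = line.Trim();
                    if (trimmed.StartsWith("Crawl-delay", StringComparison.OrdinalIgnoreCase))
                        Console.WriteLine("Requested delay: " + trimmed);
                }
            }
        }
    }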
Check out this article for a little more info on how SharePoint deals with robots.txt.
(P.S. The target site did not put the articles on the web so that no one would read them.)

The out-of-the-box crawler will respect robots.txt, and there are provisions for crawler impact rules that will lessen the chance of SharePoint performing a beat-down on the external site.


Any Plone product for counting file downloads and page views?

I'm building an intranet which will not be accessible from outside the company's network, and they want to display in Plone some nice statistics about file downloads and most-viewed pages.
With that network constraint I cannot use Google Analytics or any sort of external service, so is there any product that allows counting file downloads and page views?
I've seen an idea on UserVoice regarding file downloads, and maybe I could extend plone.piwik.now to get page view statistics, but I have a hard time believing that Plone doesn't have any product that (maybe partially) suits this use case.
Any tips?
Essentially, you have two options: you can use one of the existing HTTP log analysis tools and scrape the information you need from those reports, or you can write a custom analytics tool in Plone.
We're currently working on one that we plan to release as open source later this year. Essentially, the pattern we're using is a small JavaScript snippet that passes parameters to our lightweight logging app. We're then able to show results from the reporting app, like "top downloads", in portlets, even filtering by section and keyword.
I don't know about Plone add-ons (nor do I understand why you'd want to use a Plone add-on to do this), but Webalizer and AWStats (http://awstats.sourceforge.net/) are two of the most popular choices.

Adding Item to SharePoint Search index manually

I am looking for a way to add a document to the search index using the API, as and when a document gets added to a document library.
I can add an event handler and write code to call the API. I need to know whether the API supports such an interface. Any sample would be really helpful.
Thanks.
I think that SharePoint (2007 and 2010) has passive indexing, meaning it is out of your control beyond scheduling the indexing service to run at a certain frequency. That being the case, there are occasions when your search results will be out of sync, such as right after you delete an item. However, I believe you can programmatically prime the index service.
It is also possible to have SharePoint index non-SharePoint content, such as a UNC path, via Central Admin.
As others mentioned, it isn't quite possible to do exactly what you want. However, you can decrease the latency between when you add content and when it gets indexed. The process looks like this:
Create a new search content source that includes the data that needs to be rapidly searchable
Add only the sites you need rapidly searchable to this content source
Schedule this content source's incremental crawl to run very frequently. Consider programmatically watching the crawl status so that you can restart the crawl as soon as it has completed (see the sketch below).
Tune your search database's I/O and its indexes so that search crawling happens as fast as possible.
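For the "programmatically watching the crawl status" part, a rough, untested sketch along these lines might work (the SSP name and content source name are placeholders, and you would probably host this in a timer job rather than a bare loop):

    // Sketch: keep incremental crawls of the "fast" content source running back-to-back
    // (Microsoft.Office.Server.Search.Administration).
    using System;
    using System.Threading;
    using Microsoft.Office.Server.Search.Administration;

    class RapidCrawl
    {
        static void Main()
        {
            while (true)
            {
                SearchContext context = SearchContext.GetContext("SSP1");
                ContentSource source = new Content(context).ContentSources["Rapid content"];

                // Start the next incremental crawl as soon as the previous one has finished.
                if (source.CrawlStatus == CrawlStatus.Idle)
                    source.StartIncrementalCrawl();

                Thread.Sleep(TimeSpan.FromSeconds(30));
            }
        }
    }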

Software for building a sitemap

I have to create a content inventory for a website that doesn't have a sitemap. I do not have access to modify the website, and the site is very large. How can I build a sitemap of that website without having to browse it entirely?
I tried with Visio's sitemap builder, but it fails big time.
Let's say, for example, I want to create a sitemap of Stackoverflow.
Do you guys know of any software to build it?
You would have to browse it entirely, searching every page for unique links within the site, and then put them in an index.
Then, for each unique link you find within the site, you need to visit that page and search for more unique links.
You could use a tool such as HtmlAgilityPack to easily fetch the pages and extract the links from them.
I have written an article which touches on the link-extraction part of the problem:
http://runtingsproper.blogspot.com/2009/11/easily-extracting-links-from-snippet-of.html
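For what it's worth, a bare-bones crawler along those lines might look roughly like this (the seed URL is a placeholder, and a real run would need politeness delays, robots.txt handling, and proper error logging):

    // Minimal breadth-first crawler sketch using HtmlAgilityPack: collects unique
    // same-site links starting from a seed URL.
    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack;

    class SiteMapper
    {
        static void Main()
        {
            var seed = new Uri("http://example.com/");
            var seen = new HashSet<string>();
            var queue = new Queue<Uri>();
            queue.Enqueue(seed);
            seen.Add(seed.AbsoluteUri);

            var web = new HtmlWeb();
            while (queue.Count > 0)
            {
                Uri current = queue.Dequeue();
                HtmlDocument doc;
                try { doc = web.Load(current.AbsoluteUri); }
                catch { continue; }                        // skip pages that fail to load

                var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
                if (anchors == null) continue;

                foreach (HtmlNode a in anchors)
                {
                    Uri link;
                    if (!Uri.TryCreate(current, a.GetAttributeValue("href", ""), out link))
                        continue;
                    if (link.Host != seed.Host) continue;  // stay within the site
                    string url = link.GetLeftPart(UriPartial.Path);
                    if (seen.Add(url)) queue.Enqueue(link);
                }
            }

            foreach (string url in seen) Console.WriteLine(url);
        }
    }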
I would register all your pages in a database and then just output them all on a page (PHP + SQL). Maybe indexing software could help you too! First of all, just make sure all your pages are linked up, and still submit the site to Google!
I just googled and found this one.
http://www.xml-sitemaps.com/
Looks pretty interesting!
There is a pretty big collection of XML Sitemap generators (assuming that's what you want to generate, and not an HTML sitemap page or something else?) at http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators
In general, for any larger site, the best solution is really to grab the information directly from the source, for example from the database that powers the site. By doing that you can get the most accurate and up-to-date Sitemap file. If you have to crawl the site to get the URLs for a Sitemap file, it will take quite some time for a larger site and it will put load on the server during that time (it's like someone visiting every page on your site). Crawling the site from time to time to check for crawlability issues (such as endless calendars, content hidden behind forms, etc.) is a good idea, but if you can, it's generally better to get the URLs for the Sitemap file directly.
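If you do pull the URLs straight from the database, writing the Sitemap file itself is trivial; a rough sketch (the URLs are placeholders):

    // Sketch: write a Sitemap XML file from a list of URLs pulled from the site's own data.
    using System.Linq;
    using System.Xml.Linq;

    class SitemapWriter
    {
        static void Main()
        {
            XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
            string[] urls = { "http://example.com/", "http://example.com/about" };

            var sitemap = new XDocument(
                new XElement(ns + "urlset",
                    urls.Select(u => new XElement(ns + "url", new XElement(ns + "loc", u)))));

            sitemap.Save("sitemap.xml");
        }
    }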

SharePoint site space usage

I would like to find out how much space my group's SharePoint site uses (files + version history). However, I only have administrative access to my site, not the entire SharePoint instance, so I have to come up with my own solution. I'm interested in the total, but usage per individual file is also fine.
I've googled everything I could think of but couldn't find much that would help. SharePoint programming seems out of the question since I don't have access to the machine. SharePoint Web Services looked promising but none of the services provided seem to give me what I need. I also found a VBA library that lets me list the versions of a document: Office.DocumentLibraryVersion. However, this type does not include a "size" property - why not?
Anyway, I would be happy with either of the following solutions:
A library or API to be used from VBA, VB, or C# (or any other language, for that matter)
A SharePoint Web Service that provides file size/space usage information
A completely crazy script that uses HTTP to iterate through all the folders/files/versions in the library and does insane pattern matching to figure out the size of each file, then adds them together and returns the grand total (a SharePoint du; a rough sketch of what I mean is below)
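To illustrate that last option, here is a very rough, untested sketch against the Lists web service. The site URL, library name, and the ows_File_x0020_Size field are guesses on my part, and version history would still need a separate call (Versions.asmx?) per document:

    // Very rough sketch: call Lists.asmx over HTTP and sum the file sizes it reports.
    using System;
    using System.Xml;

    class SpaceUsage
    {
        static void Main()
        {
            // "ListsService" is a proxy generated with "Add Web Reference" against
            // http://server/sites/mygroup/_vti_bin/Lists.asmx (names are placeholders).
            var lists = new ListsService.Lists();
            lists.Url = "http://server/sites/mygroup/_vti_bin/Lists.asmx";
            lists.UseDefaultCredentials = true;

            var doc = new XmlDocument();
            XmlNode viewFields = doc.CreateElement("ViewFields");
            viewFields.InnerXml = "<FieldRef Name=\"File_x0020_Size\" />";
            XmlNode queryOptions = doc.CreateElement("QueryOptions");
            queryOptions.InnerXml = "<ViewAttributes Scope=\"RecursiveAll\" />"; // include subfolders

            XmlNode result = lists.GetListItems(
                "Shared Documents", null, null, viewFields, "5000", queryOptions, null);

            long total = 0;
            foreach (XmlNode row in result.SelectNodes("//*[local-name()='row']"))
            {
                // File size is usually exposed as ows_File_x0020_Size, sometimes as "123;#123".
                XmlAttribute size = row.Attributes["ows_File_x0020_Size"];
                if (size == null) continue;
                string value = size.Value;
                int sep = value.LastIndexOf('#');
                if (sep >= 0) value = value.Substring(sep + 1);
                long bytes;
                if (long.TryParse(value, out bytes)) total += bytes;
            }

            Console.WriteLine("Current versions only: " + total + " bytes");
        }
    }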
I figured SO is the best forum for this question, but a non-programmatic solution is just as welcome. Basically, anything you can come up with would help. At this point, even "this is not possible" would be useful.
Thanks in advance.
There is a hidden page that does this... I cannot find it right now.
Check the 1033 directory for something similar to /layouts/usage.aspx.
That page links to /layouts/storman.aspx. Unfortunately that page does not work if your site collection does not have a quota.
Go to Site Settings / Site Usage Report.
If what you are looking for is not there, I don't think you can do it with your level of access.
Go to Site Actions --> Site Administration --> Site Usage Reports;
there you can get the site usage report.
If you want it as an Excel chart, open your site in SharePoint Designer --> Site --> Reports --> Usage --> then you can get:
usage summary
monthly summary
daily summary
daily page hits
etc

Changing page location after google analytics setup

The current website structure has all the ASPX pages in the main folder. It's becoming increasingly difficult to maintain, so I would like to create new folders and move the relevant pages into them. This would change a URL from, say:
http://mydomain.com/DoStuff.aspx
to
http://mydomain.com/DoingFolder/DoStuff.aspx
I fear that this will skew the Google Analytics results. Is this change recommended? If so, is there a way to link the page locations from after and before the change?
Also, what would happen when I implement URL rewriting? Would I run into the same issue again? Anyone?
In general, I think it is a good idea to add the folder, both so your users can see from the URL which section they are in and to help the search engines figure out the site's areas; who knows, you may even get a (small) SEO benefit out of it.
What I would advise is to set up a second profile in Analytics and then add a filter which removes the folder name from the request; that will leave you with the same flat structure in your reports as you have currently. (NB: do this under a new profile with the same tracking code to avoid major mess-ups that you can't undo.)
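Roughly, that filter could look something like this (assuming a "Search and Replace" custom filter in classic Google Analytics; the folder name comes from the example URLs above):

    Filter Type:    Custom Filter > Search and Replace
    Filter Field:   Request URI
    Search String:  ^/DoingFolder/
    Replace String: /

The unfiltered profile keeps the new nested paths, while the filtered one keeps reporting the old flat structure.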
Cheers
Z
