Will Google index my logs?

I have some .txt log files where I print out some important activities on my site.
These files ARE NOT referenced from any link within my site, so only I know the URLs
(they contain the current date in the filename, so there is one file for each day).
Question: will Google index these kinds of files?
I think Google only indexes pages whose URLs appear somewhere on the site.
Can you confirm my assumption? I just don't want others to find the links through Google, etc. :)

In theory it shouldn't. If the files aren't linked from anywhere, Google shouldn't be able to find them. However, I'm not sure whether URLs can make their way into the index just by virtue of someone having the Google Toolbar installed. I've definitely had some unexpected things turn up in search engines. The only safe way would be to password-protect the folder.

Google cannot index pages that it doesn't know exist, so it won't index these, unless someone submits the URLs to Google or places them on some website.
If you want to be sure, just disallow crawling of the files in /robots.txt.
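For example, assuming the logs sit in a single folder (the /logs/ path below is just an illustration, not taken from the question), a minimal robots.txt at the site root could look like this:

    # Minimal sketch -- /logs/ is a hypothetical folder name
    User-agent: *
    Disallow: /logs/

Note that robots.txt only asks well-behaved crawlers not to fetch those URLs; it doesn't hide them from anyone who already has the link, and the file itself reveals the folder name to anyone who reads it.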

Best practice is to use robots.txt to keep the Google crawler away from files you don't want to show up.
This guide from Google Webmaster Tools is very helpful and walks you through the process of creating such a file:
https://support.google.com/webmasters/answer/6062608
Edit: As pointed out in the comments, there is no guarantee that robots.txt will be honored, so password-protecting the folders is also a good idea.

Related

Can sitemap.xml be misused to copy entire website?

I am planning to upload a sitemap.xml to my website, which has generated content pages. As of now, if I try to copy the entire website using tools like HTTrack, it cannot be copied.
If I want search bots to find and index the content pages on this website, I will have to include all the URLs in the sitemap.xml file.
So the question is: will such a sitemap.xml expose all the URLs, thereby "facilitating" a full copy of the website?
Inputs on this will be highly appreciated.
Technically, yes.
But I suppose the question you really need to ask is "Do I care?"
If the answer is yes, you should really consider whether you should be publishing the content to the web in the first place.
A well-constructed IA (information architecture) would contain links between the pages anyway (for navigational and SEO reasons), so tools like HTTrack would be able to copy the site regardless.
Anything you don't want to be seen by HTTrack also needs to be invisible to the ordinary web user, i.e. either password-protected or non-existent.
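For context, a sitemap.xml is just a plain list of URLs, so everything you put in it is readable by anyone who fetches the file. A minimal sketch following the sitemaps.org format (example.com and the page paths are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://example.com/pages/article-1.html</loc>
        <lastmod>2011-05-01</lastmod>
      </url>
      <url>
        <loc>http://example.com/pages/article-2.html</loc>
      </url>
    </urlset>

Any tool (HTTrack included) that requests /sitemap.xml gets that list, which is why the answer above is "technically, yes".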

Software for building a sitemap

Suppose I have to create a content inventory for a website that doesn't have a sitemap, I do not have access to modify the website, and the site is very large. How can I build a sitemap of that website without having to browse it entirely?
I tried Visio's sitemap builder, but it fails badly.
Let's say, for example, that I want to create a sitemap of Stack Overflow.
Do you know of any software to build it?
You would have to browse it entirely, searching every page for unique links within the site, and then put them in an index.
For each unique link you find within the site, you then need to visit that page and search for more unique links.
You could use a tool such as HtmlAgilityPack to easily fetch pages and extract their links.
I have written an article which touches on the link-extraction part of the problem:
http://runtingsproper.blogspot.com/2009/11/easily-extracting-links-from-snippet-of.html
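HtmlAgilityPack is a .NET library; as a language-neutral illustration of the same "find links, visit, repeat" loop, here is a rough Python sketch (it assumes the third-party requests and beautifulsoup4 packages, a hypothetical start URL, and an arbitrary page cap):

    # Rough sketch of the crawl loop described above -- not a production crawler.
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    START_URL = "http://example.com/"   # hypothetical site to inventory
    MAX_PAGES = 500                     # arbitrary cap so the crawl terminates
    host = urlparse(START_URL).netloc

    seen = {START_URL}                  # unique URLs found so far (the "index")
    queue = deque([START_URL])
    visited = 0

    while queue and visited < MAX_PAGES:
        url = queue.popleft()
        visited += 1
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                    # skip pages that fail to load
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            # Only queue links that stay on the same host and haven't been seen yet.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)

    for page in sorted(seen):
        print(page)

On a very large site this still means fetching page after page, so in practice you would add delays and respect robots.txt; the point is only to show the crawl-and-extract idea, not to replace a proper tool.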
I would register all your pages in a database and then just output them all on one page (PHP + SQL). Maybe indexing software could help you as well! First of all, just make sure all your pages are linked up, and still submit the result to Google!
Just googled and found this one.
http://www.xml-sitemaps.com/
Looks pretty interesting!
There is a pretty big collection of XML Sitemap generators (assuming that's what you want to generate, not an HTML sitemap page or something else) at http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators
In general, for any larger site, the best solution is really to grab the information directly from the source, for example from the database that powers the site. By doing that you can get the most accurate and up-to-date Sitemap file. If you have to crawl the site to get the URLs for a Sitemap file, it will take quite some time on a larger site, and it will put load on the server during that time (it's like someone visiting every page of your site). Crawling the site from time to time to check for crawlability issues (such as endless calendars, content hidden behind forms, etc.) is a good idea, but if you can, it's generally better to get the URLs for the Sitemap file directly.
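As a sketch of that "generate it straight from the database" idea (the SQLite file, table, and column names below are made up for illustration; substitute whatever actually powers the site):

    # Sketch: write sitemap.xml directly from the site's own database.
    import sqlite3
    from xml.sax.saxutils import escape

    connection = sqlite3.connect("site.db")                         # hypothetical database file
    rows = connection.execute("SELECT url, updated_at FROM pages")  # hypothetical schema

    with open("sitemap.xml", "w", encoding="utf-8") as sitemap:
        sitemap.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        sitemap.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url, updated_at in rows:
            sitemap.write("  <url>\n")
            sitemap.write("    <loc>%s</loc>\n" % escape(url))
            if updated_at:                                          # lastmod is optional
                sitemap.write("    <lastmod>%s</lastmod>\n" % updated_at)
            sitemap.write("  </url>\n")
        sitemap.write("</urlset>\n")
    connection.close()

Because it reads the same data the site serves from, this stays accurate without ever crawling a page.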

Deny access to directories for unauthorized users

I am not being paid for this, and I would like to know the quickest way to do the following. A former client has a page which only members can access. This page links to a number of galleries which he only wants members to access, but the galleries are not protected by any kind of authentication.
What I assume is the quickest way to do this is to create a .htaccess file which only allows people to view the galleries when they come from a certain referrer. Would this work?
My current thinking is that I could use a PHP script to deploy a .htaccess file into each of the gallery directories. (There are around 100 at the moment.)
I found this link, which might be what I am after, but to be honest I really don't get it. Is my thinking sound? Could someone either link me to a tutorial that covers this or show me how it is done?
http://perishablepress.com/press/2006/01/10/stupid-htaccess-tricks/#sec7
As always, a massive thank you for any help.
Jason
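For reference, the referrer-based trick the question describes looks roughly like this in a per-directory .htaccess (Apache 2.2-style directives; the example.com members URL is a placeholder):

    # Sketch only: allow requests whose Referer is the members page, deny everything else.
    # The Referer header is easily spoofed or stripped, so treat this as weak protection.
    SetEnvIf Referer "^https?://example\.com/members" from_members_page
    Order Deny,Allow
    Deny from all
    Allow from env=from_members_page

It would "work" in the casual sense, but anyone can fake or suppress the Referer header, so password-protecting the directories (HTTP Basic Auth, or putting the galleries behind the site's member login) is the more reliable route. A PHP script that copies one .htaccess into each of the ~100 gallery directories is fine, and if the galleries all share a common parent folder, a single .htaccess in that parent covers them all.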

Changing page location after Google Analytics setup

The current website structure is set up such that all the ASPX pages are in the main folder. It's becoming increasingly difficult to maintain, so I would like to create new folders and move the relevant pages. This would change the URL from, say:
http://mydomain.com/DoStuff.aspx
to
http://mydomain.com/DoingFolder/DoStuff.aspx
I fear that this will skew the Google Analytics results. Is this change recommended? If so, is there a way to link the page locations from before and after the change?
Also, what would happen when I implement URL rewriting? Would I run into the same issue again? Anyone?
In general I think it is a good idea to add the folder, both so your users can see which section they are in from the URL and to help the search engines figure out the site's areas; who knows, you may even get a (small) SEO benefit out of it.
What I would advise is to set up a second profile in Analytics and then add a filter which removes the folder name from the request URI, which will leave you with the same flat structure in your reports as you have currently. (NB: do this in a new profile on the same tracking code, to avoid major mess-ups that you can't undo.)
Cheers
Z
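A sketch of what that filter could look like, using a custom Search and Replace filter on the Request URI (the folder name is taken from the example above; the exact field labels may differ slightly in your Analytics version):

    Filter type:    Custom filter -> Search and Replace
    Filter field:   Request URI
    Search string:  ^/DoingFolder/
    Replace string: /

Applied to the second profile only, it reports /DoingFolder/DoStuff.aspx as /DoStuff.aspx, while the main profile keeps the new folder structure intact.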

SharePoint search of external RSS feeds

I want my SharePoint site to allow a user to search content in a known collection of RSS feeds. Conceptually, I can see a few ways to do this:
crawl the feeds at their source (yikes!)
pull the full articles into my SharePoint site, then let my crawler crawl them
make use of an existing index (like Google)
search the full articles on demand, using something like a Google utility (my preference)
So can I somehow, from my SharePoint site, allow a user to search the full articles from a couple of dozen named RSS feeds?
thanks
Cary
I don't see why there is a problem with crawling the feeds at their source; that would seem to be reasonable.
It is fairly easy to create a content source pointing at the feed and to select a suitable indexing schedule. If that does not work, then you can try a more complicated approach.
Be aware that copying the content of another website to host on your own could have copyright implications (not to mention the risk that any inflammatory content would appear to be published on your own site).
--update--
Try reading the target site's robots.txt (if it even has one) to see whether it specifies a desired crawl frequency. Otherwise it depends on the depth of the site you would be crawling.
If you are crawling just the RSS feed XML, I suspect you could do that every hour without annoying anyone. If you reach into each article, though, you may want to limit that. It really depends a lot on any relationship you have with the target site and the type of site you are hitting.
Check out this article for a little more info on how SharePoint deals with robots.txt.
(P.S. the target site didn't put the articles on the web so that no one would read them.)
The out-of-the-box crawler will respect robots.txt, and there are provisions for crawler impact rules that lessen the chance that SharePoint will hammer the external site.
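For illustration, a feed publisher that wants to throttle crawlers can hint at a frequency in robots.txt using the nonstandard Crawl-delay directive, which some crawlers honor (the values below are arbitrary examples, not from any real site):

    # Example robots.txt on the feed's site -- illustrative values only
    User-agent: *
    # ask crawlers to wait roughly 60 seconds between requests
    Crawl-delay: 60
    Disallow: /private/

If the target site publishes nothing like this, the hourly-feed / throttled-article guideline above is a reasonable default.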
