Can sitemap.xml be misused to copy an entire website?

I am planning to upload a sitemap.xml to my website, which has generated content pages. As of now, if I try to copy the entire website using tools like HTTrack, it cannot be copied.
Now, if I want search bots to find and index the content pages on this website, I will have to include all URLs in the sitemap.xml file.
So the question is: will such a sitemap.xml expose all URLs, thereby "facilitating" a full copy of the website?
Any input on this would be highly appreciated.

Technically, yes.
But I suppose the question you really need to ask is "Do I care?"
If the answer is yes, you should really consider whether you should be publishing it to the web in the first place.
A well-constructed IA (information architecture) would contain links between the pages anyway (for navigational and SEO reasons), so tools like HTTrack would be able to copy the site regardless.
Anything you don't want to be seen by HTTrack also needs to be invisible to the ordinary web user, i.e. either password protected or non-existent.
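For context, a sitemap.xml is really just a flat list of URLs, so it hands over exactly the crawl list a copier would want. As a minimal sketch (the URLs are hypothetical placeholders), generating one with Python's standard library looks like this:

    # Minimal sketch: write a sitemap.xml from a list of page URLs.
    # Anyone who fetches this file gets the same list of pages you gave Google.
    import xml.etree.ElementTree as ET

    urls = [
        "https://example.com/",
        "https://example.com/page-1.html",
        "https://example.com/page-2.html",
    ]

    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url

    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)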

Related

How to hide website directory from search engines without Robots.txt?

We know we can stop search engines from indexing directories on our site using robots.txt.
But this of course has the disadvantage of actually advertising, to potential attackers, the very directories we don't want found.
Password protecting the directory using .htaccess or other means is obviously the best way to keep the directory private.
But what if, for reasons of convenience, we didn't want to add another layer of security to the directory and just wanted to add another level of obfuscation? To hide, for example, an admin login page.
Is there another way to "hide" the directory without broadcasting its location in a robots.txt file?
Here is what to do. Please note that, as you haven't mentioned any particular technology, I haven't included the exact steps for doing it.
If you configure your web server to output the following meta tag in the directory listing HTML page, it will prevent your page from being indexed by compliant search engines.
<meta name="robots" content="noindex">
Adding this would probably require implementing a custom module within your web server that will override the default directory listing output page.
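As a rough, server-agnostic illustration of the same idea, the X-Robots-Tag response header is treated by compliant crawlers like the noindex meta tag; a minimal sketch with Python's built-in server (a stand-in for demonstration, not your real web server) would be:

    # Serve files and auto-generated directory listings with an
    # "X-Robots-Tag: noindex" header so compliant search engines skip them.
    from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

    class NoIndexHandler(SimpleHTTPRequestHandler):
        def end_headers(self):
            # Added to every response, including directory listing pages.
            self.send_header("X-Robots-Tag", "noindex")
            super().end_headers()

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8000), NoIndexHandler).serve_forever()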
Try using a random string, something like http://website.com/some-random-string-here/file.html.
Then remember not to use some-random-string-here in your robots.txt file or in any links.
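If you go the obfuscation route, a quick hedged sketch for generating an unguessable path segment (the domain and filename are placeholders):

    # Generate a hard-to-guess path segment for the "hidden" directory.
    import secrets

    slug = secrets.token_urlsafe(16)  # e.g. 'mJ9YglgqWkQd3X0xqgkq9w'
    print(f"https://website.com/{slug}/file.html")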

Can you copy a website?

Can you copy a Composite C1 website? I would like to create a copy of an existing website as a new website.
I start by creating Site A. Then I want to copy it and create Site B.
For example: copy the pages, functions, data, content, layouts, css from website A to website B. The only difference between the two would be the name.
It would infringe copyrights and may get you sued, but yes, it's possible with a scraper, which basically grabs all of the site and downloads it to you; such things are used by Google and other search engines to build a cache of sites.
Some examples:
http://www.grepsr.com/?adwords2&gclid=CIe4rrPF57cCFURcpQodASIAgg
http://info.kapowsoftware.com/WebScrapingDefinitiveGuide.html?pi_ad_id=11920224743&gclid=CPCfxbTF57cCFWNNpgodnCQAKQ
http://scrapy.org/
or just google "web scrapers".
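To illustrate what these tools automate, here is a deliberately tiny, hedged sketch of one scraping approach: reading a site's sitemap.xml and downloading every listed page. The sitemap URL and output folder are placeholders, and real scrapers also handle robots.txt, throttling, assets, and errors:

    # Toy scraper: read a sitemap.xml and download every page it lists.
    # Only run this against sites you own or are allowed to copy.
    import os
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP = "https://example.com/sitemap.xml"   # hypothetical sitemap URL
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    tree = ET.parse(urllib.request.urlopen(SITEMAP))
    os.makedirs("mirror", exist_ok=True)

    for loc in tree.findall(".//sm:loc", NS):
        url = loc.text.strip()
        page = urllib.request.urlopen(url).read()
        name = urllib.parse.quote(url, safe="") + ".html"
        with open(os.path.join("mirror", name), "wb") as f:
            f.write(page)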
If you own the site, however, and have access to the FTP, just copy the files to a folder called /b and it can become www.a.com/b, or you can set up an addon domain to point to /b and make the addon domain, say, www.b.com.
The answer to your question "can you copy a website?" is yes, you can.
Provided you have access to all the files/folders, it's no different from copying a bunch of folders on your computer into another folder.
So if you're using a shared host and everything is in your public_html folder, just put the whole website in one folder, then copy it over to another folder.
Then simply point your new domain to that folder through your hosting platform.
The process differs between hosts, but the actual answer to your question is: yes, you can copy a website from one folder to another.
If you have access to the files on the server, you can simply copy them to the other desired location.
But remember that you have to update links and other paths (if they are absolute).
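As a hedged illustration of those two steps (all paths and the domain are hypothetical), copying the folder and flagging files that still contain absolute links might look like:

    # Copy the site's folder, then flag HTML files that still reference the
    # old absolute address and therefore need their links updated.
    import pathlib
    import shutil

    src = pathlib.Path("/var/www/site-a")   # hypothetical source folder
    dst = pathlib.Path("/var/www/site-b")   # hypothetical destination folder
    shutil.copytree(src, dst)

    for page in dst.rglob("*.html"):
        if "http://www.a.com" in page.read_text(errors="replace"):
            print("absolute link still present in", page)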
If you don't have access, you could use developer tools like Firebug, or press F12 in Chrome or IE, and copy each file and its source code by hand. This approach is more time consuming than the previous one, but at least it can be done.
Cheers
As far as I know, the easiest way would be to use Internet Explorer's save-as-offline-webpage function (if it is still there); this will copy all the resources of the currently open webpage and recode the HTML to use them. As for an entire website, I don't think it will be easy, for legal reasons.
If it's your own site, sure, why not! Who is there to stop you?
But if it's someone else's site, of course you have to worry about copyright, and most of the time the website uses server-side scripts which are not downloadable.
You can duplicate a Composite C1 website by copying the entire file structure to a new folder and then updating the installation id in the file ~/App_Data/Composite/Configuration/InstallationInformation.xml (put in a new random GUID). Then point a new IIS site at this new folder.
If your site is using SQL Server as a backend, you also need to create a copy of your database, create a new user account with dbo access to this database, and update the connection string in ~/web.config.
If you wish to duplicate an entire page structure inside the existing instance of the CMS and share media files, templates etc. this could be done, but no tooling is available. This would be a coding task.
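A small hedged helper for the "new random GUID" step above (the exact element holding the installation id isn't shown here, so open InstallationInformation.xml in your copy and paste the value in by hand):

    # Generate a fresh GUID to use as the copied site's installation id.
    import uuid

    print(str(uuid.uuid4()).upper())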
Copy the directory (the website's physical path) that the website points to and paste it somewhere else, create a new website, and point it to that copied directory.

Will Google index my logs?

I have some .txt log files where I print out some important activities for my site.
These files are NOT referenced from any link within my site, so only I know the URLs
(they contain the current date in the filename, so I have one for each day).
Question: will Google index these kinds of files?
I think Google indexes only the pages whose URLs appear on the site.
Can you confirm my assumption? I just don't want others to find the links via Google, etc. :)
In theory they shouldn't. If the files aren't linked from anywhere, search engines shouldn't be able to find them. However, I'm not sure whether pages can make their way into the index by virtue of having the Google Toolbar installed; I've definitely had some unexpected things turn up in search engines. The only safe way would be to password protect the folder.
Google cannot index pages that it doesn't know exist, so it won't index these, unless someone posts the URLs to Google or places them on some website.
If you want to be sure, just disallow indexing for the files (in /robots.txt).
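For example, a Disallow rule covering the log folder keeps compliant crawlers out; here is a hedged sketch (folder and domain are placeholders) that sanity-checks such a rule with Python's standard library:

    # Check that a robots.txt rule actually covers the log URLs.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /logs/",
    ])
    print(rp.can_fetch("Googlebot", "https://example.com/logs/2024-01-01.txt"))  # False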
Best practice is to use the robots.txt to prevent the google crawler from indexing files you don't want to show up.
This description from Google Webmaster Tools is very helpful and leads you through the process of creating such a file:
https://support.google.com/webmasters/answer/6062608
edit: As was pointed out in the comments, there is no guarantee that robots.txt will be respected, so password-protecting the folders is also a good idea.

Software for building a sitemap

If I had to create a content inventory for a website that doesn't have a sitemap, and I do not have access to modify the website, but the site is very large, how can I build a sitemap of that website without having to browse it entirely?
I tried Visio's sitemap builder, but it fails big time.
Let's say, for example, I want to create a sitemap of Stack Overflow.
Do you guys know of software to build it?
You would have to browse it entirely, searching every page for unique links within the site, and then put them in an index.
Also, for each unique link you find within the site, you then need to visit that page and search for more unique links.
You could use a tool such as HtmlAgilityPack to easily fetch pages and extract the links from them.
I have written an article which touches on the extracting links part of the problem:
http://runtingsproper.blogspot.com/2009/11/easily-extracting-links-from-snippet-of.html
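For a rough idea of that crawl loop in Python terms (HtmlAgilityPack itself is a .NET library), here is a hedged sketch; the start URL is a placeholder, and a real crawler would also respect robots.txt and rate-limit itself:

    # Visit a page, extract its links, queue unseen same-site URLs, repeat.
    import urllib.parse
    import urllib.request
    from html.parser import HTMLParser

    START = "https://example.com/"

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    seen, queue = set(), [START]
    while queue:
        url = queue.pop()
        if url in seen or not url.startswith(START):
            continue
        seen.add(url)
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        extractor = LinkExtractor()
        extractor.feed(html)
        queue += [urllib.parse.urljoin(url, link) for link in extractor.links]

    print("\n".join(sorted(seen)))   # the collected URL index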
I would register all your pages in a database and then just output them all on a page (PHP + SQL). Maybe even indexing software could help you! First of all, just make sure all your pages are linked up, and still submit the sitemap to Google!
Just googled and found this one.
http://www.xml-sitemaps.com/
Looks pretty interesting!
There is a pretty big collection of XML Sitemap generators (assuming that's what you want to generate, not an HTML sitemap page or something else) at http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators
In general, for any larger site, the best solution is really to grab the information directly from the source, for example from the database that powers the site. By doing that you can get the most accurate and up-to-date Sitemap file. If you have to crawl the site to get the URLs for a Sitemap file, it will take quite some time for a larger site and it will load the server during that time (it's like someone visiting all pages in your site). Crawling the site from time to time to determine if there are crawlability issues (such as endless calendars, content hidden through forms, etc) is a good idea, but if you can, it's generally better to get the URLs for the Sitemap file directly.
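As a hedged sketch of the database route (the SQLite file, table, and column names are hypothetical; adapt them to whatever actually powers your site):

    # Pull page paths straight from the site's database instead of crawling.
    import sqlite3

    conn = sqlite3.connect("site.db")
    rows = conn.execute("SELECT slug FROM pages WHERE published = 1")
    for (slug,) in rows:
        print("https://example.com/" + slug)   # feed these into the Sitemap file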

Changing page location after google analytics setup

The current website structure is setup such that all the ASPX pages are in the main folder. It's becoming increasingly difficult to maintain, so I would like to create new folders and move the relevant pages. This would change the URL from say:
http://mydomain.com/DoStuff.aspx
to
http://mydomain.com/DoingFolder/DoStuff.aspx
I fear that this will skew the Google Analytics results. Is it recommended that I make this change? If so, is there a way to link the page locations from before and after the change?
Also, what would happen when I implement URL rewriting? Would I run into the same issue again? Anyone?
So, in general, I think it is a good idea to add the folder, both so your users can see which section they are in via the URL and to help the search engines figure out the site's areas; who knows, you may even get a (small) SEO benefit out of it.
What I would advise is to set up a second profile in Analytics and then add a filter which removes the folder name from the request; this will leave you with the same flat structure in your reports as you have currently. (NB: do this under a new profile with the same tracking code to avoid major mess-ups that you can't undo.)
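Purely as an illustration of the transformation that filter performs (this is not the Analytics API; the folder name is taken from the question):

    # The kind of search-and-replace the profile filter applies, so that
    # /DoingFolder/DoStuff.aspx keeps reporting as /DoStuff.aspx.
    import re

    def flatten(request_path: str) -> str:
        return re.sub(r"^/DoingFolder/", "/", request_path)

    print(flatten("/DoingFolder/DoStuff.aspx"))   # -> /DoStuff.aspx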
Cheers
Z
