How can I discern which files in my public html folder are used by which domain?

I inherited two websites on the same host server, along with several thousand files and directories that are not used by those websites. I want to remove the files that are not used. I have tried using the developer tools in Chrome and watching the Network tab to see what is served to the browser when navigating to those websites, but there are some files that are never passed to the client. Does anyone know a way to do this efficiently?

Keep a list of all the files that you have.
Now write a crawler that crawls both websites and, for each file it reaches, removes that file from the initial list.
The files left in the list at the end are the ones that are not used by either website.
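A minimal sketch of that approach, assuming Java 11+ (for java.net.http) and that both sites are plain static files served from one docroot; the docroot path, the two start URLs, the index.html default and the class name are all placeholders to adjust:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class UnusedFileFinder {
        // Very rough href/src extractor; a real crawler would use an HTML parser.
        private static final Pattern LINKS =
            Pattern.compile("(?:href|src)=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            Path docroot = Paths.get("/var/www/public_html");              // assumption
            List<String> startUrls = List.of("https://site-one.example/",  // assumptions
                                             "https://site-two.example/");

            // 1. Every file currently on disk.
            Set<Path> unused;
            try (var files = Files.walk(docroot)) {
                unused = files.filter(Files::isRegularFile).collect(Collectors.toSet());
            }

            // 2. Crawl both sites; every path that gets served is removed from the set.
            HttpClient client = HttpClient.newHttpClient();
            Deque<URI> queue = new ArrayDeque<>();
            Set<URI> visited = new HashSet<>();
            startUrls.forEach(u -> queue.add(URI.create(u)));

            while (!queue.isEmpty()) {
                URI url = queue.poll();
                if (!visited.add(url)) continue;

                HttpResponse<String> resp = client.send(
                        HttpRequest.newBuilder(url).build(),
                        HttpResponse.BodyHandlers.ofString());

                // Map the URL path back onto the docroot and mark the file as used.
                String path = url.getPath().isEmpty() ? "/" : url.getPath();
                if (path.endsWith("/")) path += "index.html";              // assumption
                unused.remove(docroot.resolve(path.substring(1)).normalize());

                // Queue same-host links found in HTML responses.
                if (resp.headers().firstValue("Content-Type").orElse("").contains("text/html")) {
                    Matcher m = LINKS.matcher(resp.body());
                    while (m.find()) {
                        URI next;
                        try {
                            next = url.resolve(m.group(1));
                        } catch (IllegalArgumentException e) {
                            continue;                                      // skip malformed links
                        }
                        if (next.getHost() != null && next.getHost().equals(url.getHost())) {
                            queue.add(next);
                        }
                    }
                }
            }

            // 3. Whatever is left was never reached by the crawl.
            unused.forEach(System.out::println);
        }
    }

Note that a crawl like this only finds files reachable through links, so assets referenced only from CSS, JavaScript or server-side code will still show up as "unused" and need a manual check before you delete anything.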

Related

Site wide HTTP Header for Tags

I have just inherited a massive old HTML site that I want to track in Google Analytics. It's nearly 1,000 pages of good old '90s HTML.
I've been running a web server for many years but am not a coder in any particular language, although I do edit my PHP config files and my HTML files, and install and configure modules in MediaWiki, phpBB and Drupal. I am currently on Server 2016, IIS 10. For this HTML site, how would I include the Google tag (or any other tracking tag) in the header of every page served, from my IIS console?
I need a pretty much cut-and-paste or point-and-click solution.
Assuming you're using IIS and have SSI (server-side includes) enabled:
I would create an include file (a server-side include; it could be a .shtml file) and paste the Google Analytics or other tracking tag into it.
Then I would find a file, such as a footer, that is used by all the other pages and include it there; a sample directive is shown below.
Or put it in other common files, such as the navigation, that are used site-wide.
See this question for sample usage and issues:
https://serverfault.com/questions/244352/why-wont-ssi-work-in-iis
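As a rough illustration (the file names are just examples, and the exact behaviour depends on how SSI is configured in IIS), the tracking snippet lives in one include file and the footer, or any other file served on every page, pulls it in with an SSI directive:

    <!-- in footer.shtml (or any other file that every page uses) -->
    <!--#include virtual="/includes/analytics.shtml" -->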

Website directory structure advice needed

I am only used to developing websites that have around 4 to 6 pages (index, about, gallery...).
Currently I am working on a random project, basically building a large website. It will be using multiple subdomains and maybe up to 2 different CMSs.
Before I start building, I have heard it is good practice to have only one HTML file (index) per subdirectory. Is it good practice?
My current directory structure:
/main directory
    /css
    /img
    /js
So if I were to create an about page, should I add a new pages folder to the main directory, and also do the same for all the other folders (css, img, js), keeping all the relevant files there?
Example:
/pages
    /about
Also, if I start using a subdomain, should I create those folders (as shown above) for that specific subdomain?
There are other related questions on here, but they do not fully answer my questions, so I am posting a new one.
There's no specific reason to keep each HTML file in its own directory. It just depends how you want the URLs to appear. You can just as easily link to
http://myapp.example.com/listing.html
as to
http://myapp.example.com/listing/
but the former will refer to a page explicitly, whereas the latter is an implicit call for index.html in the listing directory. Either should work, so it's up to you to determine what you want.
You aren't likely to see much difference in efficiency between the two approaches until you are up in the thousands of pages.
For subdomains it is simplest to keep each domain separate, as there is no specific reason that a subdomain even runs on the same server (it's an entirely different website). If both domains do run on the same server, you can play tricks with symbolic links to embed the same content into multiple subdomains, but that is already getting a bit too tricky to scale well for simple static content.
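As a rough illustration of keeping each subdomain separate (the paths and names are only examples), one server might look like this, with each subdomain's virtual host or IIS site binding pointing at its own document root:

    /var/www/
        www.example.com/
            index.html
            css/  img/  js/
        app.example.com/
            index.html
            css/  img/  js/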

Serving file:// files to users

Currently I'm building a local search engine for network drives that is going to be used in our company.
The search engine is built on top of Solr and Tika. I've built an indexer that indexes Samba shares over the network; it works great and indexes all the directories that are given in a configuration file. However, that is not really relevant.
The current problem we have is that the web interface that connects to Solr and delivers the search results tries to serve local file:// links that point to the files via an absolute or Samba path. But serving file:// links is of course disallowed by browsers like Google Chrome. The error that Chrome gives is:
Not allowed to load local resource: file:///name/to/file.pdf
Which is obvious and logical; however, I want to work around that issue and serve 'local' files to our users, or at least open an Explorer window at the given path.
I was wondering if this is even possible or if there is a workaround available? The server that is going to serve these files is running on Apache or Tomcat (doesn't matter).
Although opening file:// links seems pretty much impossible without the use of browser-specific plugins, I created a workaround by registering a custom URI handler combined with a Windows-specific application that opens explorer.exe at the given directory.
This is by far not the ideal answer to my question, but I think it is a decent workaround for an intranet search application.
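The moving parts look roughly like this; the openshare scheme name, the paths and the wrapper script are illustrative, not the exact values I used. Two files are involved: a .reg file that registers the scheme, and a small wrapper, because Windows hands the full URI (scheme included) to the handler, so it has to be stripped and the slashes flipped before calling explorer.exe. Browsers also typically ask the user to confirm before launching an external protocol handler.

    ; example .reg file registering a custom "openshare:" URI scheme
    Windows Registry Editor Version 5.00

    [HKEY_CLASSES_ROOT\openshare]
    @="URL:Open Network Share"
    "URL Protocol"=""

    [HKEY_CLASSES_ROOT\openshare\shell\open\command]
    @="\"C:\\Tools\\openshare.cmd\" \"%1\""

    :: C:\Tools\openshare.cmd - the handler receives the full URI, so strip the
    :: scheme, flip the slashes and hand the resulting UNC path to Explorer.
    @echo off
    set "target=%~1"
    set "target=%target:openshare://=\\%"
    set "target=%target:/=\%"
    start "" explorer.exe "%target%"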
Streaming the file from your application to the browser is a much better idea from a usability and security perspective.
By assigning a MIME type to the stream, the user's browser can decide how best to open and display the file to the user.
By streaming from your application, control of the data can be maintained. The location of the file on your server is not revealed, and proper authentication, authorization and auditing are easily achieved.
Assuming Java based upon your use of Solr and Tika:
http://www.java-forums.org/blogs/servlet/668-how-write-servlet-sends-file-user-download.html
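As a rough sketch of that streaming idea (not the code from the linked page): it assumes the pre-Jakarta javax.servlet API, that the indexed shares are mounted under /mnt/shares on the web server, and that the search results link to a hypothetical /fetch servlet with a relative path parameter; a real version would also check the user's authorization before touching the file.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import javax.servlet.ServletException;
    import javax.servlet.annotation.WebServlet;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    @WebServlet("/fetch")
    public class FileStreamServlet extends HttpServlet {
        // Root under which the indexed shares are mounted on the web server (assumption).
        private static final Path SHARE_ROOT = Paths.get("/mnt/shares");

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            String relative = req.getParameter("path");   // e.g. /fetch?path=name/to/file.pdf
            if (relative == null) {
                resp.sendError(HttpServletResponse.SC_BAD_REQUEST);
                return;
            }

            // Refuse path-traversal attempts and anything that is not a plain file.
            Path file = SHARE_ROOT.resolve(relative).normalize();
            if (!file.startsWith(SHARE_ROOT) || !Files.isRegularFile(file)) {
                resp.sendError(HttpServletResponse.SC_NOT_FOUND);
                return;
            }

            // Set a MIME type so the browser can decide how to display the file.
            String mime = Files.probeContentType(file);
            resp.setContentType(mime != null ? mime : "application/octet-stream");
            resp.setContentLengthLong(Files.size(file));
            resp.setHeader("Content-Disposition",
                    "inline; filename=\"" + file.getFileName() + "\"");

            // Stream the bytes from the server to the browser.
            try (OutputStream out = resp.getOutputStream()) {
                Files.copy(file, out);
            }
        }
    }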

How to use locations.kml with sitemap.xml

I would like to make sure my website ranks as high as possible whenever my Google Places location ranks high.
I have seen references to creating a locations.kml file, putting it in the root directory of my site, and then adding lines to the sitemap.xml file that point to this .kml file.
I get this from this statement on the geo locations page:
Google no longer supports the Geo extension to the Sitemap protocol. We recommend that you tell Google about geographically-based URLs by including them in a regular Web Sitemap.
There is a link to the Web Sitemap page
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=183668
I'm looking for examples of how to include Geo location information in the sitemap.xml file.
Would someone please point me to an example so that I can know how to code the reference?
I think the point is that you don't use any specific formatting in the sitemap. You make sure you include all your locally relevant pages in the sitemap as normal (i.e. you don't include any geo location information in the sitemap); an example entry is shown below.
Googlebot will use its normal methods for determining whether the page should be locally targeted.
(I think Google has found that the sitemap protocol has been abused and/or misunderstood, so they don't rely on it to tell them so much about the page. Rather, it's just a way to find pages that might take a long time to discover through conventional means.)
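In other words, a location page goes into sitemap.xml as a perfectly ordinary entry with no geo-specific markup; something like this (the URL and date are made up):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/locations/springfield.html</loc>
        <lastmod>2013-05-01</lastmod>
      </url>
    </urlset>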

Remove incoming links from duplicate website

There is a duplicate development website that exists for legacy reasons and is pending complete removal. It always had a rule in its robots.txt file to deny all search engines, but at one point the robots.txt got deleted by accident. For a period there were two cross-domain duplicates, Google indexed the entire duplicate website, and thousands of incoming links to the production website showed up in Google Webmaster Tools (Your site on the web > Links to your site).
The robots.txt got restored, and the entire development site is protected by a password, but the incoming links from the duplicate site remain in the production website's Webmaster Tools, even though the development site's robots.txt was downloaded by Google 19 hours ago.
I have spent hours reading about this and see a lot of contradictory advice on the web, so I would like to get an updated consensus from Stack Overflow on how to perform a complete site removal and remove the links that point from the development site to the production site from Google.
Nobody will be able to tell you exactly how much time it will take Google to remove the "bad" links from its index, but it's likely going to take a few days, not hours. Another thing to keep in mind is that only "good" crawlers will actually honor your robots.txt file, so if you don't want these links to show up elsewhere, just using a disallow rule in your robots.txt file certainly won't be enough.
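For reference, the blanket rule on the development site is just the standard:

    User-agent: *
    Disallow: /

But, as noted above, that only keeps well-behaved crawlers out; the password protection on the development site is what actually blocks everyone else.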
