I cannot seem to find an answer anywhere as to whether I should disallow configuration files like /php.ini or hidden files like /.htaccess. And what about things like /includes or /includes/connect_to_database.php?
I have read on Pro Webmasters and elsewhere on Stack Overflow that we should not disallow /*.js$ or /*.css$, but almost all of those answers are close to a decade old. Additionally, the purpose of robots.txt is to determine what is indexed, not what is crawled, is it not? I mean, we would not want crawlers trying to index our CSS and JS files.
Even Google's own documentation on robots.txt does not seem to cover this. Is anybody aware of informative resources on the web about it?
Thanks
The files necessary to render the page, both for web browsers and for search spiders (e.g. Googlebot), should stay crawlable in robots.txt: allow CSS, JS, images (jpg, jpeg, png), and fonts.
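A minimal robots.txt sketch along those lines:

# Nothing is disallowed, so CSS, JS, images and fonts stay crawlable
# and Googlebot can render pages the same way a browser does.
User-agent: *
Disallow: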
Files loaded by PHP via include() and require() (e.g. connect_to_database.php) and configuration files (e.g. php.ini) should be inaccessible to the public; block them with .htaccess rather than with robots.txt.
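For example, a sketch using Apache 2.4 syntax (the file names are the ones from the question):

# .htaccess — refuse direct web requests for the config file and the PHP include
<FilesMatch "^(php\.ini|connect_to_database\.php)$">
    Require all denied
</FilesMatch>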
I inherited two websites on the same host server, along with several thousand files and directories that are not used by those websites. I want to remove the files that are not used. I have tried using Chrome's developer tools and watching the Network tab to see what is served to the browser when navigating those websites, but some files are never sent to the client. Does anyone know an efficient way to do this?
Keep a list of all the files that you have.
Now write a crawler that crawls both websites and, for every file it reaches, removes that file from the initial list.
The files still on the list are the ones not used by either website.
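A rough sketch of that idea in PHP (assumes a CLI script with allow_url_fopen enabled; the site URL and file paths are hypothetical):

<?php
// Start with every file under the document root marked as unused,
// then cross off each path the crawler actually reaches.
$base   = 'https://example.com';
$unused = array_fill_keys(['/index.html', '/about.html', '/old/report.pdf'], true);
$queue  = ['/'];
$seen   = [];

while ($queue) {
    $path = array_shift($queue);
    if (isset($seen[$path])) {
        continue;
    }
    $seen[$path] = true;
    unset($unused[$path]);               // reachable, so it counts as "used"

    $html = @file_get_contents($base . $path);
    if ($html === false) {
        continue;
    }

    // Queue site-relative links found in href/src attributes.
    if (preg_match_all('/(?:href|src)=["\'](\/[^"\']*)["\']/i', $html, $m)) {
        foreach ($m[1] as $link) {
            $queue[] = $link;
        }
    }
}

print_r(array_keys($unused));            // files never reached by the crawl

Anything still in $unused afterwards is a candidate for deletion, though it is worth double-checking files that are only referenced from CSS, JavaScript, or server-side code, which a simple regex like this will not see.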
I'm using a program called ShareX, which uploads the screenshots I take to my web directory via FTP, for example: https://website.com/screenshots/
I need a way to block search engines and everyone else from browsing the screenshots directory and having it show up in Google Images etc., but still have direct links work fine when I upload a screenshot to share with someone (https://website.com/screenshots/screenshot01.jpg).
I don't upload anything super sensitive, but I would like the peace of mind that it's off limits to everyone who doesn't know the direct path to an actual image.
Thanks for any help with this.
Disable directory indexes (assuming you're running Apache)
# .htaccess file in your screenshots/ directory
Options -Indexes
Use a robots.txt file. Every reputable search engine will obey it (a minimal example follows after this list).
Use a CAPTCHA (a little extreme in my opinion).
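For the robots.txt option, a minimal sketch using the /screenshots/ path from the question:

User-agent: *
Disallow: /screenshots/

Direct links to individual images keep working, because robots.txt only asks crawlers to stay away; it does not affect normal visitors.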
I am only used to developing websites that have about 4 to 6 pages (index, about, gallery, and so on).
I'm currently working on a random project, basically building a large website. It will use multiple subdomains and maybe up to two different CMSs.
So before I start building, I want to check something I heard: that it is good practice to have only one HTML file (index) per subdirectory. Is it a good practice?
My current directory structure:
/main directory
    /css
    /img
    /js
So if I were to create an about page, should I add a new folder, pages, to the main directory, do the same for the other folders (css, img, js), and keep all the relevant files there?
Example:
/pages
    /about
Also, if I start using a subdomain, should I create those same folders (as shown above) for that specific subdomain?
There are other related questions on here, but they do not fully answer my questions, so I am posting a new one.
There's no specific reason to keep each HTML file in its own directory. It just depends on how you want the URLs to appear. You can just as easily link to
http://myapp.example.com/listing.html
as to
http://myapp.example.com/listing/
but the former refers to a page explicitly, whereas the latter is an implicit request for index.html in the listing directory. Either should work, so it's up to you to decide what you want.
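As a sketch with hypothetical file names, the two layouts look like this on disk:

/site/listing.html            ->  served at /listing.html
/site/listing/index.html      ->  served at /listing/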
You aren't likely to see much difference in efficiency between the two approaches until you are up in the thousands of pages.
For subdomains it is simplest to keep each domain separate, since there is no specific reason a subdomain even has to run on the same server (it is effectively a different website). If both domains do run on the same server, you can play tricks with symbolic links to share the same content across multiple subdomains, but that is already getting a bit too tricksy to scale well for simple static content.
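For example, one sketch of that symlink approach (the paths and subdomains are hypothetical):

ln -s /var/www/shared/img /var/www/app.example.com/img
ln -s /var/www/shared/img /var/www/blog.example.com/img

Both document roots then serve the same /img content without duplicating the files.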
I would like to make sure my website ranks as high as possible whenever my Google Places location ranks high.
I have seen references to creating a locations.kml file and putting it in the root directory of my site, then adding lines to the sitemap.xml file that point to this .kml file.
I get this from this statement on the geolocations page:
Google no longer supports the Geo extension to the Sitemap protocol. We recommend that you tell Google about geographically-based URLs by including them in a regular Web Sitemap.
There is a link to the Web Sitemap page
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=183668
I'm looking for examples of how to include Geo location information in the sitemap.xml file.
Would someone please point me to an example so that I can know how to code the reference?
I think the point is that you don't use any specific formatting in the sitemap. You just make sure you include all your locally relevant pages in the sitemap as normal (i.e. you don't include any geo-location data in the sitemap).
Googlebot will use its normal methods for determining whether a page should be locally targeted.
(I think Google has found that the sitemap protocol has been abused, and/or misunderstood, so they don't rely on it to tell them that much about the page. Rather, it's just a way to find pages that might otherwise take a long time to discover through conventional means.)
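So a location page is just listed like any other URL; a minimal sitemap.xml sketch with hypothetical URLs:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Location pages go in as ordinary entries; no geo extension -->
  <url>
    <loc>https://www.example.com/locations/chicago/</loc>
  </url>
  <url>
    <loc>https://www.example.com/locations/boston/</loc>
  </url>
</urlset>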
Is it possible to get .htaccess to only allow certain file types to be uploaded to a directory?
I have found several posts discussing how to get .htaccess to only serve images, disable PHP, or output scripts in plain text, but I cannot find whether it is possible to get .htaccess to prevent files of a certain type from even entering the directory in the first place.
Could anyone possibly help me out on this query?
No, this is not possible. Apache is a web server; it generally cannot control what a PHP or other script writes into a directory.
You will need to manage this in the script that you use to upload files.
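A minimal sketch of that check in PHP 7+ (the field name "upload", the extension whitelist, and the target directory are hypothetical):

<?php
// Whitelist extensions before the file is ever moved into the public directory.
$allowed = ['jpg', 'jpeg', 'png', 'gif', 'pdf'];

$name = $_FILES['upload']['name'] ?? '';
$ext  = strtolower(pathinfo($name, PATHINFO_EXTENSION));

if (!in_array($ext, $allowed, true)) {
    http_response_code(400);
    exit('File type not allowed.');
}

// Only now move the uploaded file into the target directory.
move_uploaded_file($_FILES['upload']['tmp_name'], __DIR__ . '/uploads/' . basename($name));

Checking the extension alone is only a first pass; a more careful handler would also verify the MIME type (for example with finfo_file()) and generate its own file names.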