Does Options All -Indexes in .htaccess prevent search engines from indexing the files in a folder?

At the risk of getting hit for a duplicate question: I typed my question in the title and read all of the questions/answers returned, but I am still a bit confused.
I want to run Options All -Indexes in .htaccess, but I want to make sure that search engines can still index the images in the image gallery folders.
This question appeared to answer it...
Keep Options All -Indexes but allow access to a specific folder
But then one of the answers stated:
But if they go to that directory they won't see a file listing
Does this mean the files will not be able to be indexed because there is no index file, or does it mean the files can be indexed but the search engine simply will not see a listing of them?
Thanks in advance,
Pete

If you block directory listings, a search engine crawler cannot see the image files to index them, unless they are linked from some accessible page that the crawler can reach.
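As a sketch of the technique from the linked question (the folder name, and the assumption that your host allows Options overrides in .htaccess, are mine): disable listings globally, then re-enable them just for the gallery folder, which gives crawlers a plain file listing to follow.

# .htaccess in the document root: no auto-generated listings anywhere
Options All -Indexes

# .htaccess inside the gallery folder (e.g. images/): listings back on
Options +Indexes

Either way, any image that is linked from a normal crawlable page stays indexable; the per-folder override only matters when nothing else links to the files.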

I can't find .htaccess in TYPO3

I am using TYPO3 4.5 and am running into a problem which appears to be solved here: TYPO3 breaks urls without WWW ... (website redirects to index without WWW)
The answer recommended there involves editing the .htaccess file.
My problem is that I cannot find this file anywhere. I am not experienced with TYPO3; how can I safely edit this file?
I have gone into my filelist and found an htaccess file at [fileadmin/]: temp/ but I cannot edit it. Clicking on it gives me a 403 error in a new window, in German no less.
Even though the question relates to a very old TYPO3 version, I want to answer here for the current version 9.5, where several .htaccess files are shipped in the folder
typo3/sysext/install/Resources/Private/FolderStructureTemplateFiles/
These are the files in that folder:
fileadmin-temp-htaccess
fileadmin-temp-index.html
fileadmin-user_upload-temp-importexport-htaccess
root-htaccess
root-web-config
typo3temp-var-htaccess
All files whose names end in htaccess should, as their filenames indicate, be copied into the corresponding directory, and each copy should then be renamed to .htaccess.
The other files should likewise be placed according to their names and content.
Bear in mind that .htaccess files are not always interchangeable between TYPO3 versions; I have never had problems with this during many updates, but it should always be checked.
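As a minimal sketch of that copy-and-rename step (assuming a Unix shell and a TYPO3 9.5 project root; the target directories are read off the filenames, so verify them against your installation):

# template files shipped with TYPO3 9.5
SRC=typo3/sysext/install/Resources/Private/FolderStructureTemplateFiles
# copy each template into the directory its name encodes, renaming it to .htaccess
cp "$SRC/root-htaccess" .htaccess
cp "$SRC/typo3temp-var-htaccess" typo3temp/var/.htaccess
# ...and likewise for the fileadmin-* templates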
The .htaccess file is a hidden file located in the document root of your website. It can't be edited from inside TYPO3; you need direct access to the web server (SSH, SFTP, ...).
If you installed TYPO3 from the "dummy package", it should have a "_.htaccess" file in the root that must be renamed. As the previous answer by @M Klein told you, you must rename or edit it by direct access to your server.
Another possibility is that the file has been accidentally removed; in this case you could download the "dummy package" (select your TYPO3 version) and pick a new one from there.

robots.txt to allow only certain files and folders and disallow everything else

I want robots.txt to allow only index.php and the images folder, and to disallow all other folders. Is this possible?
This is my code:
User-agent: *
Allow: /index.php
Allow: /images
Disallow: /
Secondly, is it possible to do the same job with .htaccess?
First, be aware that the "Allow" option is actually a non-standard extension and is not supported by all crawlers. See the wiki page (in the "Nonstandard extensions" section) and the robotstxt.org page.
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
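The example that follows that sentence on robotstxt.org (cut off in the quote above) goes along these lines:

User-agent: *
Disallow: /~joe/stuff/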
Some major crawlers do support it, but frustratingly they handle it in different ways. For example, Google prioritises rules by the length of the matched path, whereas Bing prefers you to simply put the Allow statements first. The example you've given above will work in both cases, though.
Bear in mind that crawlers which do not support the extension will simply ignore it and will therefore only see your "Disallow" rule, effectively stopping them from indexing your entire site! You have to decide whether the extra work of moving files around (or writing a long list of Disallow rules for all your subdirectories) is really worth the bonus of getting indexed by the lesser crawlers. Probably not.
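If you would rather stay within the original standard, the Disallow-only route means naming every directory except images explicitly. A sketch with hypothetical directory names (note that it cannot express "only index.php": any other file in the root remains crawlable):

User-agent: *
Disallow: /css/
Disallow: /js/
Disallow: /private/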
Regarding .htaccess: you can't really do anything useful with it here. You'd have to match the user agent against a large list of known bots and you'd just end up missing some - or worse, blocking real users.
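Purely to illustrate why that gets ugly, a user-agent-matching sketch in .htaccess might look like this (the bot names are invented, and a real list would never be complete):

RewriteEngine On
# forbid listed bots everywhere except index.php and the images folder
RewriteCond %{HTTP_USER_AGENT} (ExampleBot|AnotherBot) [NC]
RewriteRule !^(index\.php$|images/) - [F]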
Yes, that code will work. Under the original standard the robots.txt file is read from top to bottom and the first matching rule wins, so as long as the Disallow is at the bottom you won't run into problems; if the Disallow were at the top, such a parser would never reach the Allow statements.
Edit/Sidenote:
This is only for "good" (Googlebot, Bingbot etc..) robots which follow the standard. Plenty of other robots either misinterpret the robots.txt file or just completely ignore it.

How to remove the number at the end of any URL in a specific folder with .htaccess

I want to remove the number at the end of URLs in a specific folder, using .htaccess (both the number and the hyphen before it), for all URLs in that folder.
For example
http://www.example.com/music/new-track-released-52
or
http://www.example.com/music/helo-there-4
Need to look like
http://www.example.com/music/new-track-released
http://www.example.com/music/helo-there
For all links in the music folder.
(I've already removed the .php extension with .htaccess.)
How to do that?
Probably something like this:
RewriteEngine on
RewriteRule ^/music/(.+)-[0-9]+$ /music/$1
Note that this is the version for the host configuration. For .htaccess style files it has to be slightly modified, since in per-directory context the pattern is matched without the leading slash. Whenever possible you should prefer the real host configuration over .htaccess files: those files are notoriously error-prone, hard to debug, and they really slow the server down.
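A minimal .htaccess version of the rule above, assuming the file lives in the document root: the leading slash disappears in per-directory context, and an R=301 flag is added here so visitors are redirected to the clean URL (drop it if you only want an internal rewrite):

RewriteEngine on
# strip a trailing "-<digits>" from anything under music/
RewriteRule ^music/(.+)-[0-9]+$ /music/$1 [R=301,L]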

.htaccess doesn't work if a comment row (# ...) exists

The code below doesn't work in my .htaccess file. I mean, after this code is applied, I can still browse the directory listing of folders.
# BEGIN disable folder index
Options All -Indexes
# END disable folder index
However, the code below does work: after it is applied, the server gives a 403 if I try to list a folder which I know exists.
Options All -Indexes
I'm on shared hosting and have no control over the server config. The .htaccess file was created with Notepad++ using UTF-8 without BOM, its permissions are set to 0644, and there is no other code in it.
What does this situation mean? What am I doing wrong?
OK, it looks like my original comment above pushed you in the right direction:
Most likely this is a problem with the line breaks, so that for the interpreting part of the HTTP server the "Options" line is not on a separate line and is thus also commented out. Check your line-ending characters using a hex editor. That's the only reliable tool to do so.
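For illustration (an assumed failure mode, since the thread doesn't show the raw bytes): if the file was saved with bare CR line endings, Apache reads all three lines as a single line, and the leading # comments the whole thing out, Options directive included:

# BEGIN disable folder index Options All -Indexes # END disable folder index

Re-saving the file with LF (or CRLF) line endings puts the Options directive back on a line of its own.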

How can I find files that aren't needed on my site so I can delete them?

I'm developing a website, and after testing different ways to do things, I know that I have many files on my site that are not being used, including HTML/PHP files, images, stylesheets, and external scripts. Is there some program I can use or something so I can find all of the files that I don't need so I can delete them?
I need to find all the files that are safe to delete: files that no longer have anything to do with the site, and whose deletion won't have any effect on how the site works.
I've tried finding orphaned files in Dreamweaver, but it lists a lot of files that I do actually need.
Here's one idea: crawl the site and create a list of every file the crawler can find, then check anything on the server that's not on that list. Wikipedia has a list of crawlers, including some open source ones.
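A rough sketch of that idea with wget (the domain and the docroot /var/www/html are placeholders, and query strings or directory indexes would need extra handling):

# crawl the site, collect every URL reached, and map URLs to disk paths
wget --spider -r -l inf https://www.example.com/ 2>&1 \
  | grep -o 'https://www\.example\.com[^ ]*' \
  | sed 's|https://www.example.com|/var/www/html|' | sort -u > reached.txt
# list every file actually on disk
find /var/www/html -type f | sort > on-disk.txt
# files on disk that the crawl never touched are orphan candidates
comm -23 on-disk.txt reached.txt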
Xenu's Link Sleuth is the easiest way I've found.
http://home.snafu.de/tilman/xenulink.html
After you do the scan you have the option to put in your FTP info. If you do so, it will also generate a list of files that are not accessible (orphans).
How would you define "unnecessary"? That's something you need to be sure of before beginning. I guess one way to garbage-collect your site is to delete files that are not referenced by any other files.
The crawler idea from @Brendan for finding all the files that are actually used is very nice.
Then you can start deleting files from your website, and afterwards use a program such as Xenu or LinkTiger (or whichever one you prefer) to find any broken links.
You can connect with an FTP application and delete files manually. This is the safest way, because scripts and programs don't know what is needed and what isn't...
This did not exist at the time this question was asked, but there is a Python script called weborphans designed for this purpose.
Here's a blog entry by the author with some more info: Finding orphaned files on websites
