I've got a small Flask site for my old WoW guild and I have been unsuccessful in getting Google to read my sitemap.xml file. I was able to successfully verify my site using Google's Search Console and it seems to crawl it just fine, but when I go to submit my sitemap, it lists the status as "Couldn't fetch". When I click on that for more info, all it says is "Sitemap could not be read" (not helpful).
I originally used a sitemap generator website (forgot which one) to create the file and then added it to my route file like this:
from flask import request, send_from_directory

@main.route('/sitemap.xml')
def static_from_root():
    # Serve /sitemap.xml straight out of the static folder
    return send_from_directory(app.static_folder, request.path[1:])
If I navigated to www.mysite.us/sitemap.xml it would display the expected results, but Google was unable to fetch it.
I then changed things around and started using flask-sitemap to generate it like this:
@ext.register_generator
def index():
    yield 'main.index', {}
This also works fine when I navigate directly to the file, but Google again does not like it.
I'm at a loss. There doesn't seem to be any way to get help from Google on this, and so far my interweb searches aren't turning up anything helpful.
For reference, here is the current sitemap link: www.renewedhope.us/sitemap.xml
I finally got it figured out. This seems to go against what Google was advising, but I submitted the sitemap as http://renewedhope.us/sitemap.xml and that finally worked.
From their documentation:
Use consistent, fully-qualified URLs. Google will crawl your URLs exactly as listed. For instance, if your site is at https://www.example.com/, don't specify a URL as https://example.com/ (missing www) or ./mypage.html (a relative URL).
I think that only applies to the sitemap document itself.
When submitting the sitemap to Google, I tried...
http://www.renewedhope.us/sitemap.xml
https://www.renewedhope.us/sitemap.xml
https://renewedhope.us/sitemap.xml
The only format that they were able to fetch the sitemap from was:
http://renewedhope.us/sitemap.xml
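For reference, the sitemap document itself would look something like this, with each <loc> entry written in exactly the same form as the URL you submit (a minimal sketch using the working URL above; real sitemaps list every page):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://renewedhope.us/</loc>
  </url>
</urlset>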
Hope this information might help someone else facing the same issue :)
Put this directive in your robots.txt file, pointing at the full sitemap URL: Sitemap: https://domainname.com/sitemap.xml. Hope this will be helpful.
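For illustration, robots.txt lives at the site root, and per the sitemaps.org spec the directive takes a fully-qualified URL (domainname.com here is just a placeholder):

User-agent: *
Allow: /

Sitemap: https://domainname.com/sitemap.xml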
Hello, I was wondering how I can find a hidden endpoint of a website that stores product URLs.
I tried getting the sitemap of the website, but the website either does not have one or it is hidden and I can't find it. I also searched for
I hope someone could help me or point me in the right direction.
A helpful resource to request from the server you are crawling is /robots.txt. Most hosts will serve this file, and it tells your crawler where it is allowed to go.
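For example, Python's standard library can fetch a robots.txt, report any Sitemap entries it declares, and tell you whether a path may be crawled (the URL below is a placeholder; site_maps() needs Python 3.8+):

# Read a site's robots.txt to discover declared sitemaps and
# check what a crawler is allowed to fetch.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

print(rp.site_maps())  # list of Sitemap: URLs, or None if absent
print(rp.can_fetch("*", "https://example.com/products/"))  # crawl allowed?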
This page:
http://netfs.dev.itcs.co.uk/downloads
is hidden behind a login wall.
When I use:
site:http://netfs.dev.itcs.co.uk/downloads *.pdf
as my Google search string, Google seems to be able to return an extensive list of PDF files that live within that URL directory.
My first thought was that these files were being linked to from another site, yet searching for a specific pdf:
netfs.dev.itcs.co.uk/downloads/BTFP%20FAQs.pdf
only returns 1 result (from netfs.dev.itcs.co.uk). I'm totally perplexed - how is Google seemingly able to circumvent the login wall?
I tried spoofing my user-agent to that of the Googlebot and that also didn't work. Obviously they could figure out it's a spoof, but given that they don't even have HTTPS on their login page, I find it hard to believe they're doing anything complicated...
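(Roughly what I tried, sketched with the requests library and Googlebot's published user-agent string:)

# Request the gated page while claiming to be Googlebot.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                  "+http://www.google.com/bot.html)"
}
resp = requests.get("http://netfs.dev.itcs.co.uk/downloads/", headers=headers)
print(resp.status_code)  # still bounced to the login page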
Thanks in advance...
My question pertains specifically to the two pages below, but also relates more generally to methods for using clean URLs without an .htaccess file.
http://www.decitectural.com/
and
http://www.decitectural.com/about/
The pages above are hosted on Amazon's S3, which does not allow the use of .htaccess files. As a result, I have found no easy way to create a clean URL rewrite scheme that sends all requests to an index file which, in turn, interprets the URL using JavaScript and loads the correct page (with AJAX, or, as is the case with decitectural, with simple div visibility toggling).
To work around this, I usually edit the Amazon S3 bucket properties and set both the index page and the error page to the index.html file. That way, index.html is served even when an invalid path (such as /about/) is requested. This has, for the most part, been a functioning solution... that is, until I realized that the index.html page was being returned with a 404 status in those cases, which would stop Google from indexing it.
This has led me to seek out an alternative solution to this problem. Currently, as a temporary fix, I am actually creating the /about/ directory on the server with a duplicate of the index.html file in it. This works, but obviously is not a real solution to the problem.
I would appreciate any advice on how to set up a clean URL routing scheme on S3 or in any instance where an .htaccess file can't be used.
Here are a few solutions: Pretty URLs without mod_rewrite, without .htaccess
Also, I guess you can run a script to create the files dynamically from an array or database so it generates all your URLs:
/index.html
/about/index.html
/contact/index.html
...
And hook the script to run on every edit, via cron, or manually. Not the best in terms of performance, but hey, it should work.
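Something along these lines (a rough sketch; the routes, page bodies, and output directory are all made up):

# Regenerate a /<route>/index.html file for every clean URL so
# S3 can serve them without any rewrite rules.
import os

routes = {
    "": "<h1>Home</h1>",
    "about": "<h1>About</h1>",
    "contact": "<h1>Contact</h1>",
}

for route, body in routes.items():
    directory = os.path.join("site", route)
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "index.html"), "w") as f:
        f.write(f"<!doctype html><html><body>{body}</body></html>")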
I think you are going about it the wrong way. S3 gives you complete control of the page structure of your site. If you want your link to be "/about", just upload a file called "about", and you're done. (Set the headers so that the browser knows it's HTML.)
Yes, it will break if someone links to "/about/" or "/about.html". But pretty much any site will break if you mess with their links in odd ways. You will have to be vigilant when linking to your own site, because you won't have any rewrite rules to clean up for you. But you should have automation doing that.
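For example, with boto3 the upload might look like this (the bucket name and source file are placeholders):

# Upload an extensionless "about" object with an HTML Content-Type,
# so the bucket website serves /about as a rendered page.
import boto3

s3 = boto3.client("s3")
with open("about.html", "rb") as f:
    s3.put_object(
        Bucket="my-bucket",
        Key="about",              # served at /about
        Body=f,
        ContentType="text/html",  # tell the browser it's HTML
    )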
I'm fairly new to the Magento platform, but I have a decent amount of experience in web development on Apache servers.
A few days ago I was asked to look into an issue that first showed up as failing filters.
I had a look at the Google Analytics data and it seems the SEO-friendly URLs have all stopped being used. The navigation links still use friendly words, however on page load the URL is redirected to a basic catalog URL.
http://www.camera-camera.com/cameras-and-accessories.html
instead now it goes to
https://www.camera-camera.com/index.php/catalog/category/view/id/9
I checked the admin config. The Web > SEO URL Rewrites setting is set to Yes.
I toggled it to No, saved, toggled it back to Yes, and saved again. Tried clearing the catalog URL rewrite cache.
Checked the .htaccess file and it hasn't been touched for months.
Emptied the core rewrite table and reindexed it.
So I’m outta ideas now, was hoping some of you more experienced users can have some input as to what else I can check.
I also found it strange that the URL is now ignoring postback parameters. If you look at their filters, they are simply a link to the same page with a post parameter. This now gets stripped and ignored - might that be related?
A file restore was done on the day it happened. Any files I should check it against?
Thanks for any help you can provide !
I just discovered that it was related to HTTPS. I hadn't noticed, but the site keeps redirecting to HTTPS even though the filter links etc. point to HTTP, and in the redirect the parameters are dropped. Now to figure out why it's going to HTTPS...
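(For anyone hitting the same thing: the usual fix is to make the HTTPS redirect keep the query string. A sketch of an Apache rewrite that does this - mod_rewrite preserves the query string by default as long as the substitution has no "?" in it, and the QSA flag makes that explicit:)

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L,QSA]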
Google Webmaster Tools is reporting 403 errors for some folders on the websites server for example:
http://www.philaletheians.co.uk/Study%20notes/
The folder isn't forbidden, so I don't understand why Google's crawler would get 403 errors.
How come the Google crawler is trying to browse the actual folders and not just going straight to the files in those folders? Is this something to do with robots.txt?
Make sure there is an actual page or document to serve when someone requests that URL. I've browsed through your site and could not find a link that points to http://www.philaletheians.co.uk/Study%20notes/
Also, it seems all the study notes are inside this "Study%20notes" directory, so that bare directory link will not work anyway. Check the "linked from" information in Google Webmaster Tools to find where this broken link originates, and fix it.
Have you set the default document correctly in your web server? In Apache, this is the DirectoryIndex setting (which defaults to index.html). Also, in general it might be better to strip spaces etc. from your traversable directory names (the %20 you are seeing between "Study" and "notes" is a URL-encoded space character), so as to keep your URLs clean for your visitors and for search engine bots.
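For illustration, in httpd.conf or .htaccess (index.html and index.php here are just the common choices):

# If none of the DirectoryIndex files exist and directory listings
# are disabled, Apache answers a bare directory URL with 403.
Options -Indexes
DirectoryIndex index.html index.php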