Webscraping Endpoint - web

Hello, I was wondering how I can find a hidden endpoint of a website that stores product URLs.
I tried getting the sitemap of the website, but the website either does not have one or it is hidden and I can't find it. Also searched for
I hope someone can help me or point me in the right direction.

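A helpful resource to request from the server you are crawling is /robots.txt. Most hosts serve this file, and it tells your crawler where it is allowed to go. It often also lists the site's sitemap locations in Sitemap: lines, which is useful when the sitemap is not at the usual /sitemap.xml path.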
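As a minimal sketch of how a crawler could read sitemap locations out of robots.txt, the Python below parses the Sitemap: lines from the file body. The function names extract_sitemaps and fetch_sitemaps are made up for illustration, and the site you are crawling may not list any sitemap at all.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def extract_sitemaps(robots_txt: str) -> list[str]:
    """Return the URLs listed in Sitemap: lines of a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # The directive name is case-insensitive; the value is the sitemap URL.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


def fetch_sitemaps(base_url: str) -> list[str]:
    """Download <base_url>/robots.txt and return the sitemap URLs it lists."""
    with urlopen(urljoin(base_url, "/robots.txt"), timeout=10) as resp:
        return extract_sitemaps(resp.read().decode("utf-8", errors="replace"))
```

Calling fetch_sitemaps("https://example.com") would return whatever that host lists; an empty list means robots.txt gives no hint and the product URLs have to be found another way.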

Related

Google couldn't fetch my sitemap.xml file

I've got a small Flask site for my old WoW guild and I have been unsuccessful in getting Google to read my sitemap.xml file. I was able to successfully verify my site using Google's Search Console and it seems to crawl it just fine, but when I go to submit my sitemap, it lists the status as "Couldn't fetch". When I click on that for more info, all it says is "Sitemap could not be read" (not helpful).
I originally used a sitemap generator website (forgot which one) to create the file and then added it to my route file like this:
@main.route('/sitemap.xml')
def static_from_root():
    # Serve /sitemap.xml from the app's static folder.
    return send_from_directory(app.static_folder, request.path[1:])
If I navigated to www.mysite.us/sitemap.xml it would display the expected results, but Google was unable to fetch it.
I then changed things around and started using flask-sitemap to generate it like this:
@ext.register_generator
def index():
    yield 'main.index', {}
This also works fine when I navigate directly to the file, but Google again does not like it.
I'm at a loss. There doesn't seem to be any way to get help from Google on this, and so far my web searches aren't turning up anything helpful.
For reference, here is the current sitemap link: www.renewedhope.us/sitemap.xml
I finally got it figured out. This seems to go against what Google advises, but I submitted the sitemap as http://renewedhope.us/sitemap.xml and that finally worked.
From their documentation:
Use consistent, fully-qualified URLs. Google will crawl your URLs
exactly as listed. For instance, if your site is at
https://www.example.com/, don't specify a URL as https://example.com/
(missing www) or ./mypage.html (a relative URL).
I think that only applies to the sitemap document itself.
When submitting the sitemap to google, I tried...
http://www.renewedhope.us/sitemap.xml
https://www.renewedhope.us/sitemap.xml
https://renewedhope.us/sitemap.xml
The only format that they were able to fetch the sitemap from was:
http://renewedhope.us/sitemap.xml
Hope this information might help someone else facing the same issue :)
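If you run into the same issue, it may help to list every scheme and www/non-www combination up front and try each one in Search Console. A small sketch, assuming Python 3.9+; the helper name sitemap_url_variants is made up for illustration:

```python
from itertools import product


def sitemap_url_variants(domain: str, path: str = "/sitemap.xml") -> list[str]:
    """List the http/https and www/non-www variants of a sitemap URL."""
    bare = domain.removeprefix("www.")
    hosts = [bare, "www." + bare]
    # product() pairs every scheme with every host form.
    return [f"{scheme}://{host}{path}"
            for scheme, host in product(["http", "https"], hosts)]


for url in sitemap_url_variants("www.renewedhope.us"):
    print(url)
```

Which variant a crawler can fetch depends on how the server redirects between them, so the one that works for one site is not necessarily the one that works for another.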
Add a Sitemap line to your robots.txt file, using the full URL including the scheme: Sitemap: http://domainname.com/sitemap.xml. Hope this will be helpful.

How to get rid of an index for a website in cpanel

I am currently trying to host a website on cPanel. When I try to host it, a site index is shown as the front page. How can I get rid of this index and stop the site from routing me to it? I placed all my documents in public_html. Thanks for any help.
This is a link to my website so you can inspect the problem:
www.brantleybrennansfriends.org
To solve this problem, name the main file you want visitors to see index (for example, index.html). The server then serves it as the first page instead of the directory index, which will resolve your problem.

Website A 'redirect' to subdomain of website B, with content of website A

I was recently asked to do the following:
We have a website with Drupal running in IIS.
On that site is a URL redirect to a website hosted externally, obviously with a name completely unrelated to the name of our company.
The request is now the following:
They want to change the URL to a subdomain of our website. Example: from "www.external-site.com" to "www.sub.internal.com" (while still showing the content of the external website).
They want the current page of that website to be reflected in the URL bar. So it wouldn't say "www.sub.internal.com", but it would say "www.sub.internal.com/solutions/page1.html" (instead of "www.external-site.com/solutions/page1.html").
It's possible that I forgot another 'condition' that was mentioned to me before this.
So, if someone visits through our URL Redirect to External-website, it needs to show our subdomain instead of their domain in the URL, AND it needs to show the current page when people start browsing while still using our subdomain in the URL.
Now, I checked the external website, and it seems that most of its links are relative links (in case this is useful information).
Currently, the external website is hosted externally, and will remain so for the next few years. (I believe we bought the company.)
I have been asking around and looking things up, and the best option seems to be domain forwarding, but even that doesn't seem to meet everything they asked of me.
I am but a 'simple' .NET programmer, held responsible for supporting anything involving the websites, and I can't say I have extensive knowledge of infrastructure. (But I can ask people to do this for me.)
Is there anything that could solve this?
Thanks so much!
The combination of IIS's URL Rewrite module and Application Request Routing (ARR) can help you achieve what you want. Here are a few links that may guide you in configuring ARR. Please note that these links don't show the exact solution to your problem, but you can take cues from them and build your solution accordingly.
http://www.iis.net/learn/extensions/url-rewrite-module/reverse-proxy-with-url-rewrite-v2-and-application-request-routing
http://www.iis.net/learn/extensions/url-rewrite-module/reverse-proxy-rule-template
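To make the idea concrete, here is a small Python sketch of the URL mapping a reverse proxy performs for this setup. It is not IIS or ARR configuration; it only illustrates that the visitor keeps seeing the public host while the same path and query are fetched from the external host. The host names come from the question, and the function name to_upstream is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Host names taken from the question; the mapping itself is an illustrative assumption.
PUBLIC_HOST = "www.sub.internal.com"
UPSTREAM_HOST = "www.external-site.com"


def to_upstream(public_url: str) -> str:
    """Rewrite a URL requested on the public host to the upstream host.

    Path, query and fragment are kept, so /solutions/page1.html on the
    subdomain maps to /solutions/page1.html on the external site.
    """
    parts = urlsplit(public_url)
    if parts.netloc != PUBLIC_HOST:
        raise ValueError(f"unexpected host: {parts.netloc}")
    return urlunsplit(
        (parts.scheme, UPSTREAM_HOST, parts.path, parts.query, parts.fragment)
    )


print(to_upstream("http://www.sub.internal.com/solutions/page1.html"))
```

Because most links on the external site are relative, the browser resolves them against the subdomain, so they should keep working through such a proxy without rewriting the response bodies.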
It sounds like you'll want to use a full-page iframe: do not redirect but show a page with an "inner page" instead: that inner page is the external web site. That way, users do not see the external site in their URL bar.
http://webdesign.about.com/od/iframes/a/aaiframe.htm
You need to configure the equivalent of Apache Virtual Host with Reverse Proxy on IIS.
See these answers:
https://serverfault.com/a/271030
and
https://stackoverflow.com/a/10003306/2131693

Webmaster Tools Crawler 403 errors

Google Webmaster Tools is reporting 403 errors for some folders on the website's server, for example:
http://www.philaletheians.co.uk/Study%20notes/
The folder isn't forbidden, so I don't understand why Google's crawler would get 403 errors.
Why is the Google crawler trying to browse the actual folders instead of going straight to the files in that folder? Is this something to do with robots.txt?
Make sure there is an actual page or document to serve when someone requests that URL. I've browsed through your site and could not find a link that points to http://www.philaletheians.co.uk/Study%20notes/
Also, it seems all the study notes are inside this "Study%20notes" directory, so the bare directory link will not work anyway. Check the "linked from" information in Google Webmaster Tools to find where this broken link is located and fix it.
Have you set the default document correctly in your web server? In Apache, this is the DirectoryIndex setting (which defaults to index.html). Also, in general it is better to remove spaces and similar characters from your traversable directory names (the %20 you are seeing between Study and notes is a URL-encoded space character), so as to keep your URLs clean for your visitors and search engine bots.
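A short Python sketch of the encoding point above: it shows how a space in a directory name turns into %20 in the URL, and one possible way to derive a cleaner name. The helper slugify_dirname is hypothetical, and renaming the directory would also require updating any existing links or adding redirects.

```python
import re
from urllib.parse import quote, unquote


def slugify_dirname(name: str) -> str:
    """Lower-case a directory name and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


encoded = quote("Study notes")       # the space is percent-encoded as %20
decoded = unquote("Study%20notes")   # back to the literal directory name
clean = slugify_dirname("Study notes")

print(encoded, decoded, clean)
```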

Custom URL in Codeigniter using htaccess

I am developing a site on CodeIgniter 2.0.2. It's a site where companies/users can sign up and create their own profile page, have their own custom URL (like http://facebook.com/demouser), have their own feedback system, and display their services.
That said, I have been successful in displaying the profile page in the following format:
http://mainwebsite.com/company/profile/samplecompany
This displays the home page for the company samplecompany , where company is the controller and profile is the method.
Now I have a few questions:
I guess it is possible to get http://mainwebsite.com/samplecompany using htaccess and a default controller. If anybody can help with the htaccess rule, that would be awesome. I am already using htaccess to remove index.php from CI but could not get this working.
There will be a few other pages for the given user/company, such as feedback, contact us, services, etc. So the links that come to my mind are of the form:
http://mainwebsite.com/company/profile/samplecompany/feedback or
http://mainwebsite.com/samplecompany/feedback
http://mainwebsite.com/company/profile/samplecompany/services or
http://mainwebsite.com/samplecompany/services
http://mainwebsite.com/company/profile/samplecompany/contactus or
http://mainwebsite.com/samplecompany/contactus
where samplecompany is the dynamic part.
Is it possible to create site links in this format?
I understand that using an A record for a given domain, I can point a domain, say http://www.samplecompany.com, to http://mainwebsite.com/company/profile/samplecompany, so that a user typing http://www.samplecompany.com is taken to http://mainwebsite.com/company/profile/samplecompany. If this is successfully implemented, will
http://www.samplecompany.com/feedback
http://www.samplecompany.com/services
http://www.samplecompany.com/contactus
work correctly?
You can accomplish this using routes. For example, in your /config/routes.php file, put this:
$route['samplecompany'] = "company/profile/samplecompany";
$route['samplecompany/(:any)'] = "company/profile/samplecompany/$1";
The first rule tells CodeIgniter that when someone accesses http://mainwebsite.com/samplecompany that it should process it as if the URL were "company/profile/samplecompany". The second rule captures anything that comes in the URI string after "samplecompany" and appends it onto the end.
However, if you have multiple companies (not just samplecompany), you're probably going to want to extend CI's router to support this, unless you want to manually edit the config file each time a new company is added.
OK, you're definitely going to want to handle dynamic company names(as per your comment). This is a little trickier. I can't give you the full code, but I can point you in the right direction.
You'll want to extend CI's router and on an incoming request query the DB for the list of company names. If the URI segment matches a company name, you'll want to serve it up using your company/profile method. If it does not, you will ignore it and let CI handle it normally. Check out this post on this topic for more information: forum link.
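The lookup logic described above can be sketched as follows. This is Python rather than CodeIgniter code, shown only to illustrate the decision the extended router has to make. The RESERVED set and the function name resolve_route are hypothetical, and in a real application company_names would come from the database query.

```python
# Hypothetical controller names that must never be treated as company names.
RESERVED = {"company", "login", "signup"}


def resolve_route(uri: str, company_names: set[str]) -> str:
    """Map a vanity URI to the internal company/profile route.

    URIs whose first segment is not a known company are returned
    unchanged so the framework can route them normally.
    """
    segments = [s for s in uri.strip("/").split("/") if s]
    if not segments:
        return ""
    first = segments[0]
    if first not in RESERVED and first in company_names:
        # Prefix the controller and method, keep any trailing segments.
        return "/".join(["company", "profile", *segments])
    return "/".join(segments)


companies = {"samplecompany"}  # assumption: loaded from the DB per request
print(resolve_route("/samplecompany/feedback", companies))
```

Checking reserved names first matters: without it, a company that registers a name such as "login" could shadow one of the site's own controllers.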
Here's a great guide on how to achieve what you need: Codeigniter Vanity URL's.

Resources