Finding all pages on a domain with Node.js

I'm trying to find all the pages on a domain with Node.
I was searching on Stack Overflow, but all I found is this thread for Ruby: Find all the web pages in a domain and its subdomains. I have the same question, but for Node.
I've also googled the question, but all I find are scrapers that do not find the links to scrape themselves. I also searched for things like "sitemap generator", "webpage robot", "automatic scraper", and "getting all pages on domain with Node", but that didn't bring up any results.
I have a scraper that needs an array of links to process. For example, I have a page www.example.com/products/ where I want to find all existing sub-pages, e.g. www.example.com/products/product1.html, www.example.com/products/product2.html, etc.
Could you give me a hint on how I can implement this in Node?

Have a look at Crawler (https://www.npmjs.org/package/crawler). You can use it to crawl the website and save the links.
Crawler is a web spider written with Node.js. It gives you the full power of jQuery on the server to parse a big number of pages as they are downloaded, asynchronously. Scraping should be simple and fun!
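For example, here is a rough sketch of how it could be used to collect every internal link starting from the products page in the question (the start URL is the placeholder from the question; option and field names follow recent versions of the crawler package and may differ slightly in other releases):

    // npm install crawler
    const Crawler = require('crawler');
    const { URL } = require('url');

    const start = 'http://www.example.com/products/';
    const seen = new Set([start]);

    const c = new Crawler({
      maxConnections: 10,
      callback: (error, res, done) => {
        if (error) {
          console.error(error);
          return done();
        }
        const $ = res.$; // server-side jQuery (cheerio) injected by crawler
        $('a[href]').each((_, a) => {
          // Resolve relative hrefs against the page that was just fetched
          const link = new URL($(a).attr('href'), res.options.uri).href;
          // Only follow links that stay on the same domain and are new
          if (link.startsWith('http://www.example.com/') && !seen.has(link)) {
            seen.add(link);
            c.queue(link);
          }
        });
        done();
      },
    });

    c.queue(start);

    // 'drain' fires once the queue is empty, i.e. every reachable page was visited
    c.on('drain', () => console.log([...seen]));

The set of visited URLs collected here is exactly the array of links the question's scraper needs as input.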

Related

What pages do I need to stop crawling with robots.txt?

I developed a website and put it online, but when I look at the log files I see
Message : 'No route matched.' Stack Trace :
Core\Router->dispatch('robots.txt')
After a quick search on Google I found that the robots.txt file matters mostly to search engines (Google, Bing, etc.)
and that it helps keep some pages from being crawled, but my question is what I need to block with it (which pages) and how to block some specific routes.
For example, my administration routes always start with /ad-dash,
e.g. /ad-dash/administration/index.
When I searched, I saw that some tutorials block the about, privacy, and terms pages.
My question is: which pages do I need to stop crawling with robots.txt?
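For the administration routes described above, a minimal robots.txt sketch might look like this (assuming every admin page really does live under /ad-dash/; robots.txt only asks well-behaved crawlers to stay away and is not access control):

    User-agent: *
    Disallow: /ad-dash/

Anything not matched by a Disallow rule remains crawlable by default.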

Google Crawl Error > apple-app-site-association > No App, URL or attempt to link > Why?

I have:
a simple static website;
hosted on a shared server;
with SSL;
which I have recently redesigned.
Google tells me there were two URL crawl errors for my website:
apple-app-site-association;
.well-known/apple-app-site-association
For reference, here is the error report for the first (the second is the same):
Not found
URL:
https://mywebsite.com/apple-app-site-association
Error details
Last crawled: 5/5/16
First detected: 5/5/16
Googlebot couldn't crawl this URL because it points to a non-existent page. Generally, 404s don't harm your site's performance in search, but you can use them to help improve the user experience. Learn more
From looking around here, these appear to be related to associating an Apple app with a related website.
I have never tried to implement any sort of "apple app / site association" - at least not intentionally.
I can't for the life of me figure out where these links are coming from.
I will be removing these URLs but am concerned the error may arise again.
I have looked at several related questions here, but they seem to be for errors from people trying to make that verification - which I haven't - or from people querying why their server logs show requests to these urls.
Can anyone shed any light on why this is happening?
So the answer is that Googlebot now looks for these URLs when it crawls your site, as part of its effort to map associations between sites and their related apps. See: https://firebase.google.com/docs/app-indexing/ios/app
It seems that Googlebot hasn't (at this time) been told not to report a crawl error when the URL/folder isn't there.
Here is a link to an answer to a very similar (but slightly different) question that gives more detail if you are so inclined: https://stackoverflow.com/a/35812024/4806086
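For reference, an apple-app-site-association file is only something you would serve if you actually had an iOS app to associate with the site; it is a small JSON document roughly along these lines (the appID below is a placeholder):

    {
      "applinks": {
        "apps": [],
        "details": [
          { "appID": "TEAMID.com.example.myapp", "paths": ["*"] }
        ]
      }
    }

If you have no app, there is nothing to serve, and per the error report above the resulting 404 is harmless.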

How to discover the number of pages of an external website

I have to make an offer for a new website. It should be based on the number of pages in the existing site. There is no sitemap present.
Question: how can I get the total number of pages of an external website that does not belong to me?
Have you tried to access a potential sitemap.xml file (http://www.yourwebsite.com/sitemap.xml)?
You can test page discovery with an online sitemap generator: https://www.xml-sitemaps.com/
You can also try a Google search like this: site:www.yourwebsite.com. You'll see all indexed pages.
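If a sitemap does exist, a quick way to count the pages it lists is to fetch it and count the <loc> entries, e.g. with a small Node sketch like this (assumes Node 18+ for the global fetch API; the domain is a placeholder, and sitemap index files that point to further sitemaps are not handled):

    const url = 'http://www.yourwebsite.com/sitemap.xml';

    async function countSitemapUrls() {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`No sitemap found (HTTP ${res.status})`);
      const xml = await res.text();
      // Every <loc> element in a urlset corresponds to one page
      const locs = xml.match(/<loc>/g) || [];
      console.log(`The sitemap lists ${locs.length} URLs`);
    }

    countSitemapUrls().catch(console.error);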

site: Google search does not show all results

If I go to this url
http://sppp.rajasthan.gov.in/robots.txt
I get
User-Agent: *
Disallow:
Allow: /
That means crawlers are allowed to fully access the website and index everything, so why does site:sppp.rajasthan.gov.in on Google search show me only a few pages, when the site contains lots of documents, including PDF files?
There could be a lot of reasons for that.
You don't need a robots.txt for blanket allowing crawling. Everything is allowed by default.
http://www.robotstxt.org/robotstxt.html doesn't allow blank Disallow lines:
"Also, you may not have blank lines in a record, as they are used to delimit multiple records."
Check Google Webmaster Tools to see if some pages have been disallowed for crawling.
Submit a sitemap to Google.
Use "Fetch as Google" to see if Google can even see the site properly.
Try manually submitting a link through the Fetch as Google interface.
Looking closer at it.
Google doesn't know how to navigate some of the links on the site. Specifically, on http://sppp.rajasthan.gov.in/bidlist.php the bottom navigation uses onclick JavaScript that is loaded dynamically and doesn't change the URL, so Google couldn't link to page 2 even if it wanted to.
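In other words, the pager links look something like the first snippet below (illustrative markup, not copied from the site), whereas Google needs something like the second, where every page of results has its own URL it can request:

    <!-- Not crawlable: clicking runs JavaScript but the URL never changes -->
    <a href="#" onclick="loadPage(2)">2</a>

    <!-- Crawlable: each page of results has a distinct URL -->
    <a href="bidlist.php?page=2">2</a>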
On the bid list you can click through to a page detailing each tender. These don't have public URLs, so Google has no way of linking to them.
The PDFs I looked at were image scans in Sanskrit put into PDF documents. While Google does OCR PDF documents (http://googlewebmastercentral.blogspot.sg/2011/09/pdfs-in-google-search-results.html), it's possible they can't do it with Sanskrit. You'd be more likely to find them if they contained proper text as opposed to images.
My original points remain though. Google should be able to find http://sppp.rajasthan.gov.in/sppp/upload/documents/5_GFAR.pdf which is on the http://sppp.rajasthan.gov.in/actrulesprocedures.php page. If you have a question about why a specific page might be missing, I'll try to answer it.
But basically the website does some bizarre, non-standard things, and this is exactly what you need a sitemap for. Contrary to popular belief, sitemaps are not for SEO; they are for when Google can't locate your pages.
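A minimal sitemap.xml sketch, listing just two of the URLs mentioned above (a real one would enumerate every page you want indexed and be submitted through Google Webmaster Tools):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://sppp.rajasthan.gov.in/bidlist.php</loc>
      </url>
      <url>
        <loc>http://sppp.rajasthan.gov.in/sppp/upload/documents/5_GFAR.pdf</loc>
      </url>
    </urlset>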

If a page is not linked to the main website, can search engines find it?

I want to put a secret page in my website (www.mywebsite.com). The page URL is "www.mywebsite.com/mysecretpage".
If there is no clickable link to this secret page in the home page (www.mywebsite.com), can search engines still find it?
If you want to hide from a web crawler: http://www.robotstxt.org/robotstxt.html
A web crawler collects links and looks them up, so if you're not linking to the page, and no one else is, it won't be found by any search engine.
But you can't be sure that someone looking for your page won't find it. If you want the data to stay secret, you should use some kind of script to grant access only to those who should have it.
Here is a more useful link: http://www.seomoz.org/blog/12-ways-to-keep-your-content-hidden-from-the-search-engines
No. A web spider crawls based on links from previous pages. If no page links to it, search engines won't be able to find it.
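If you also want to be safe against someone else linking to the page, a common complement to the advice above is a noindex directive, which asks search engines not to index the page even if they discover it (a sketch; note this does not make the page private, it only keeps it out of search results):

    <!-- Place inside the <head> of mysecretpage -->
    <meta name="robots" content="noindex, nofollow">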
