I have to make an offer for a new website. It should be based on the number of pages in the existing site. There is no sitemap present.
Question: how can I get the total number of pages of an external website that does not belong to me?
Have you tried to reach a potential sitemap.xml file (http://www.yourwebsite.com/sitemap.xml)?
You can test page discovery with an online sitemap generator: https://www.xml-sitemaps.com/
You can also try a Google search like this: site:www.yourwebsite.com. You'll see all the indexed pages.
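If a sitemap.xml does turn up, counting its <loc> entries gives you a quick page estimate. A minimal Node sketch of that idea (it assumes Node 18+ for the built-in fetch, and www.yourwebsite.com is just a placeholder for the real domain):

    // count-sitemap.js - rough page count from a sitemap
    // Assumes Node 18+ (built-in fetch); replace the placeholder domain with the real one
    const SITEMAP_URL = 'http://www.yourwebsite.com/sitemap.xml';

    async function countSitemapUrls(url) {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`No sitemap at ${url} (HTTP ${res.status})`);
      const xml = await res.text();
      // Every page entry in a sitemap is wrapped in a <loc> element
      // (a sitemap index file lists nested sitemaps the same way and would need another pass)
      const locs = xml.match(/<loc>.*?<\/loc>/g) || [];
      return locs.length;
    }

    countSitemapUrls(SITEMAP_URL)
      .then(count => console.log(`~${count} URLs listed in the sitemap`))
      .catch(err => console.error(err.message));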
I am using Nutch to collect all the data from a single domain. How can I ensure that Nutch has crawled every page under a given domain?
This is not technically possible, since there is no limit on the number of different pages that can exist under the same domain. This is especially true for dynamically generated websites. What you could do is look for a sitemap.xml and ensure that all of its URLs are crawled/indexed by Nutch. Since the sitemap is what declares which URLs the site considers important, you can use it as a guide for what needs to be crawled.
Nutch has a sitemap processor that will inject all the URLs from the sitemap into the current crawldb (i.e. it will "schedule" those URLs to be crawled).
As a hint, even Google enforces a maximum number of URLs to be indexed from the same domain when doing a deep crawl. This is usually referred to as a crawl budget.
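If you go the sitemap route, you can also diff the sitemap's URLs against whatever Nutch reports as crawled (for example, a crawldb dump exported to a text file). A rough Node sketch, assuming both lists exist as plain-text files with one URL per line (the file names are placeholders):

    // coverage-check.js - which sitemap URLs has the crawl not covered yet?
    // sitemap-urls.txt and crawled-urls.txt are placeholder names, one URL per line
    const fs = require('fs');

    const readUrls = file =>
      new Set(fs.readFileSync(file, 'utf8').split('\n').map(u => u.trim()).filter(Boolean));

    const sitemapUrls = readUrls('sitemap-urls.txt');
    const crawledUrls = readUrls('crawled-urls.txt');

    const missing = [...sitemapUrls].filter(u => !crawledUrls.has(u));
    console.log(`${missing.length} of ${sitemapUrls.size} sitemap URLs not crawled yet`);
    missing.forEach(u => console.log(u));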
If I go to this URL
http://sppp.rajasthan.gov.in/robots.txt
I get
User-Agent: *
Disallow:
Allow: /
That means crawlers are allowed to fully access the website and index everything. So why does site:sppp.rajasthan.gov.in on Google search show me only a few pages, even though the site contains lots of documents, including PDF files?
There could be a lot of reasons for that.
You don't need a robots.txt file to blanket-allow crawling; everything is allowed by default.
Note that http://www.robotstxt.org/robotstxt.html doesn't allow blank lines inside a record: "Also, you may not have blank lines in a record, as they are used to delimit multiple records."
Check Google Webmaster Tools to see if some pages have been disallowed for crawling.
Submit a sitemap to Google.
Use "Fetch as Google" to see if Google can even see the site properly.
Try manually submitting a link through the Fetch as Google interface.
Looking closer at it:
Google doesn't know how to navigate some of the links on the site. Specifically, on http://sppp.rajasthan.gov.in/bidlist.php the bottom pagination uses onclick JavaScript that loads content dynamically without changing the URL, so Google couldn't link to page 2 even if it wanted to.
From the bid list you can click through to a detail page describing each tender. These detail pages don't have public URLs, so Google has no way of linking to them.
The PDFs I looked at were image scans in Sanskrit put into PDF documents. While Google does OCR PDF documents (http://googlewebmastercentral.blogspot.sg/2011/09/pdfs-in-google-search-results.html), it's possible they can't do it with Sanskrit. You'd be more likely to find them if they contained proper text as opposed to images.
My original points remain though. Google should be able to find http://sppp.rajasthan.gov.in/sppp/upload/documents/5_GFAR.pdf which is on the http://sppp.rajasthan.gov.in/actrulesprocedures.php page. If you have a question about why a specific page might be missing, I'll try to answer it.
But basically the website does some bizarre, non-standard things, and this is exactly what you need a sitemap for. Contrary to popular belief, sitemaps are not for SEO; they're for when Google can't locate your pages on its own.
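If you can enumerate the pages Google can't reach on its own (the bid detail pages, for instance), building that sitemap by hand is not much work. A minimal Node sketch that writes one; the URL list here is only illustrative and you'd fill it with the real pages:

    // build-sitemap.js - write a basic sitemap.xml from a list of URLs
    // The list below is illustrative; add every page Google cannot reach through plain links
    const fs = require('fs');

    const urls = [
      'http://sppp.rajasthan.gov.in/bidlist.php',
      'http://sppp.rajasthan.gov.in/actrulesprocedures.php',
    ];

    const xml =
      '<?xml version="1.0" encoding="UTF-8"?>\n' +
      '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
      urls.map(u => `  <url><loc>${u}</loc></url>`).join('\n') +
      '\n</urlset>\n';

    fs.writeFileSync('sitemap.xml', xml);
    console.log(`Wrote sitemap.xml with ${urls.length} URLs`);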
I'm trying to find all the pages on a domain with Node.
I was searching on Stack Overflow, but all I found is this thread for Ruby: Find all the web pages in a domain and its subdomains. I have the same question, but for Node.
I've also googled the question, but all I find are scrapers that do not find the links to scrape by themselves. I also searched for things like "sitemap generator", "webpage robot", "automatic scraper", "getting all pages on domain with Node", but it didn't bring any results.
I have a scraper that needs an array of links to process. For example, I have a page www.example.com/products/ where I want to find all existing sub-pages, e.g. www.example.com/products/product1.html, www.example.com/products/product2.html, etc.
Could you give me a hint how I can implement this in Node?
Have a look at Crawler (https://www.npmjs.org/package/crawler). You can use it to crawl the website and save the links.
Crawler is a web spider written with Nodejs. It gives you the full power of jQuery on the server to parse a big number of pages as they are downloaded, asynchronously. Scraping should be simple and fun!
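A rough sketch of that idea: queue a start page, collect every same-domain link the callback finds, and keep queueing until the crawler drains. The start URL is the placeholder from the question, and the option/property names used here (callback, res.$, res.options.uri, the drain event) are as documented for the crawler versions I've seen, so double-check them against the version you install:

    // crawl-links.js - collect same-domain links with the crawler package (npm install crawler)
    const Crawler = require('crawler');
    const { URL } = require('url');

    const START = 'http://www.example.com/products/'; // placeholder start URL
    const seen = new Set([START]);

    const c = new Crawler({
      maxConnections: 5,
      callback: (error, res, done) => {
        if (error) { console.error(error.message); return done(); }
        const $ = res.$; // server-side jQuery (cheerio) provided by crawler for HTML pages
        if ($) {
          $('a[href]').each((i, el) => {
            let link;
            try {
              // resolve relative hrefs against the page we just fetched
              link = new URL($(el).attr('href'), res.options.uri).href;
            } catch (e) { return; }
            // stay on the same host and don't queue a page twice
            if (new URL(link).host === new URL(START).host && !seen.has(link)) {
              seen.add(link);
              c.queue(link);
            }
          });
        }
        done();
      }
    });

    c.on('drain', () => {
      console.log(`Found ${seen.size} pages:`);
      seen.forEach(u => console.log(u));
    });

    c.queue(START);

The seen set doubles as both the visited list and the final result, which is the array of links your scraper can then process.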
I want to set up my website. It has many user profiles, which are generated dynamically,
e.g. http://test.com?profile=2, http://test.com?profile=3.
What steps do I need to take so that search engines show all the profiles?
1) I have a Google Webmaster Tools account.
2) I added a sitemap and robots.txt for the site.
After a month or so (indexing is done, as I can see in the Webmaster Tools account), if I search for a profile (say, by name) I don't see the user profile in the search results.
I have added the URL parameters as well (e.g. profile here).
Am I missing anything?
Can you get to a profile from the home page by basic links alone?
Search engines like to be able to find your pages on their own.
Do a more specific search first, e.g. add site:test.com to your search so only your site is competing.
Check you have not blocked the pages in the robots.txt file or via the robots meta tag on the page.
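One quick way to check the first point is to fetch the home page and see whether any profile URLs appear as plain links in the HTML (rather than only being reachable through JavaScript). A rough Node sketch, with http://test.com standing in for the real domain and Node 18+ assumed for the built-in fetch:

    // check-links.js - are profile URLs present as plain <a href> links on the home page?
    // http://test.com is the placeholder domain from the question; assumes Node 18+ for fetch
    const HOME = 'http://test.com/';

    fetch(HOME)
      .then(res => res.text())
      .then(html => {
        // crude check: look for href attributes containing the profile parameter
        const profileLinks = html.match(/href="[^"]*\?profile=\d+[^"]*"/g) || [];
        console.log(`${profileLinks.length} profile links found in the raw HTML`);
        if (profileLinks.length === 0) {
          console.log('Search engines probably cannot discover the profiles from the home page alone.');
        }
      })
      .catch(err => console.error(err.message));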
I've been asked by a family friend to completely overhaul the website for their business. I've designed my own website, so I know some of the basics of web design and development.
To work on their website from my own home, I know I'll need to FTP into their server, and therefore I'll need their FTP credentials, as well as their CMS credentials. I'm meeting with them in a couple of days and I don't want to look like a moron! Is there anything else I need to ask them for during our first meeting (aside from what they want in their new site, etc.) before I start digging into it?
Thanks!
From an SEO point of view, you should be concerned with 301 redirects, as (I suppose) some or all URL addresses will change (take a different name, be removed, etc.).
So, after you've created a new version of the site, and before you put it online, you should list all "old site" URLs and decide, preferably for each one, its new status (unchanged, or redirected, and if so, to what URL).
Mind that even if some content will not reappear on the new site, you still have to redirect its old URL (say, to the home page) to keep the link juice and SERP rankings.
Also, for larger sites (especially dynamic sites), look for URL patterns you can redirect in bulk. For example, if you see that Google indexes 1,000 index.php?search=[some-key-word] pages, you don't need to redirect each one individually; these are probably just search result pages that can be grouped with a regex and redirected to the main search results page.
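How you actually implement a grouped rule depends entirely on the server the new site runs on (Apache .htaccess, nginx, PHP, etc.). Purely as an illustration of the pattern idea, here is what one regex-based 301 could look like if the new site happened to be served by plain Node; /search/ and the old index.php path are made-up examples:

    // bulk-redirect.js - illustration only: one 301 rule covering a whole family of old URLs
    // Assumes a plain Node server; the old and new paths below are made-up examples
    const http = require('http');

    http.createServer((req, res) => {
      // any old index.php?search=... URL gets folded into the new main search page
      if (/^\/index\.php\?search=/.test(req.url)) {
        res.writeHead(301, { Location: '/search/' });
        return res.end();
      }
      res.writeHead(404);
      res.end('not found');
    }).listen(8080);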
To compile the list of "old site" URLs you should:
a. Run site:domainname.com in Google (then set the SERP to 100 results and scrape the URLs manually or with XPath).
b. Use Xenu or another site crawler (some like Screaming Frog) to get a list of all URLs.
c. Combine the lists and remove all duplicates (in Excel, or with the short script sketched below).
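A minimal Node version of step (c), with placeholder file names and one URL per line:

    // merge-urls.js - combine the Google export and the crawler export, removing duplicates
    // google-urls.txt and crawler-urls.txt are placeholder file names
    const fs = require('fs');

    const load = file =>
      fs.readFileSync(file, 'utf8').split('\n').map(u => u.trim()).filter(Boolean);

    const merged = new Set([...load('google-urls.txt'), ...load('crawler-urls.txt')]);
    fs.writeFileSync('old-site-urls.txt', [...merged].join('\n'));
    console.log(`${merged.size} unique old URLs written to old-site-urls.txt`);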
If you need help with 301 redirects you can start with this link:
http://www.webconfs.com/how-to-redirect-a-webpage.php/
If the website is static, knowing HTML, CSS, and JavaScript, along with the FTP credentials, is enough for you to get started. However, if the site is dynamic, interactive, and database-driven, you may need to ask whether they want to use PHP; in that case you might end up building the site in WordPress.
If you are going to design the website from scratch, also keep this point in mind: your friend's website is probably hosted somewhere (i.e. with a hosting provider). You should get the hosting control panel details as well, which will help you manage the website (including the database, email, FTP, etc.).