What is another way to crawl the web besides following links?

What is another way to crawl the web besides following hyperlinks?

Most major sites publish Sitemaps. A sitemap gives your crawler a fast way of discovering URLs and can be used alongside, or instead of, following outlinks.
The crawler-commons project provides a Sitemap parser in Java.
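If you are not on the JVM, the same idea is easy to sketch; the following TypeScript snippet (an illustration only, not crawler-commons) assumes Node 18+ so a global fetch is available, and pulls the <loc> URLs out of a sitemap with a simple regular expression, whereas a real parser also handles sitemap indexes and gzipped files:

    // Sketch only: seed a crawl frontier from a sitemap instead of outlinks.
    // Assumes Node 18+ (global fetch); the regex is a stand-in for a real
    // sitemap parser such as crawler-commons on the JVM.
    async function urlsFromSitemap(sitemapUrl: string): Promise<string[]> {
      const xml = await (await fetch(sitemapUrl)).text();
      // Collect every <loc>...</loc> entry from a plain urlset sitemap.
      return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);
    }

    // Example usage (the URL is a placeholder):
    urlsFromSitemap("https://example.com/sitemap.xml")
      .then(urls => console.log(`${urls.length} URLs discovered`))
      .catch(console.error);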

Related

How to optimize SEO for SPA using React-Helmet?

My project is a single-page application built with React. I have heard that Google can crawl JavaScript pages, including React single-page applications, without the need for server-side rendering (even though server-side rendering is generally better for SEO).
However, when I used the Webmaster Tools "Fetch and Render as Google" feature, both what Google's bots see and what visitors to my page see are blank.
Even though I can submit specific URLs for Google indexing, Google only uses the title and description tags that I put in my static index.html file; it doesn't pick up the nested React Helmet component's title and description. Does anyone have experience with this? Much appreciated!
To answer your question: ensure that you have polyfilled the necessary ES6 features. The Google crawler's JavaScript support can be quite limited; it does not have Array.find, for example. You can read more about that here: https://github.com/facebookincubator/create-react-app/issues/746#issue-179072109
As for improving SEO, here are some tips:
You can prerender your pages to static HTML at build time using react-snapshot (https://www.npmjs.com/package/react-snapshot). This works great if your app does not have much dynamic content; a minimal setup sketch follows below.
You can use a prerendering service like prerender.io, or static hosting with a prerendering feature such as Netlify or Roast.io. As for prerender.io, you can even host it yourself!
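As a rough sketch of the react-snapshot setup (assuming a create-react-app style project; the file name and root element id are just the usual defaults):

    // src/index.tsx -- build-time prerendering with react-snapshot.
    // react-snapshot's render behaves like ReactDOM.render in the browser,
    // and snapshots each route to static HTML during the build.
    import { render } from "react-snapshot";
    import App from "./App";

    render(<App />, document.getElementById("root"));

    // package.json -- run the snapshot step after the normal build:
    //   "build": "react-scripts build && react-snapshot"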

Are NodeJs applications crawlable by search engines?

If I use the Jade template engine with Node.js, will the app be crawlable by search engines and Facebook without using _escaped_fragment_?
If your application outputs HTML, it is no different than if you had written that HTML in a file and simply served the file. The wider Web doesn't generally know or care what you're using to generate your HTML.
(It is possible to infer what tech a page is using by inspecting headers and looking for common idioms that are unique to a particular technology, but these are just clues, not a fundamental difference in what your Web page is.)
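To make that concrete, here is a minimal sketch of a Node/Express route rendering a Pug (formerly Jade) template; the route, port, and markup are illustrative, and the point is simply that the response is ordinary HTML by the time a crawler sees it:

    // Sketch: a server-side template engine just produces HTML.
    // Assumes express and pug are installed; everything else is illustrative.
    import express from "express";
    import pug from "pug";

    const app = express();

    app.get("/", (_req, res) => {
      // pug.render turns Pug (formerly Jade) source into an HTML string;
      // the crawler only ever sees the resulting HTML.
      const html = pug.render("h1= title\np This page was generated server-side.", {
        title: "Hello, crawlers",
      });
      res.send(html);
    });

    app.listen(3000);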

Angular SEO for a multi-language directory app

I am building an AngularJS app with a Node.js/Express server.
I want to build an app that is similar to a business directory.
I have doubts about whether it's possible to make all the items in the directory SEO friendly, findable either by name or by their features (tags), keeping in mind that all pages are created with AngularJS.
If it is possible, how can I do that dynamically?
I implemented an example that uses a Prerender server (https://github.com/prerender/prerender) and the prerender-node library on the app server.
My example's pages, created by AngularJS, do work (they are SEO friendly and appear in Google's search results), but the pages are "static", and the directory will keep adding new businesses that I also want to appear in Google searches.
Besides, I want my app to be multi-language, and I also have doubts about how to make all of that multi-language, and whether it is possible.
I hope you can help me.
If you're hosting your own Prerender server, it will serve the page "on the fly" every time Google accesses it, so it will always have the latest, dynamic content from your pages. If you're using a Prerender plugin to cache your pages, you'll need to make sure you recache them... or use our Prerender.io SaaS and we'll take care of all of the recaching for you.
It sounds like you just want Google to crawl your pages more often because of how dynamic your content might be. To have Google crawl your pages more often, make sure to build quality inbound links from other sites to increase your PageRank.
Here is a lot of advice from Google about multi-language sites: https://support.google.com/webmasters/answer/182192?hl=en
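For reference, wiring prerender-node into an Express app is a single middleware line; the sketch below is only an outline, with the service URL, token, and port as placeholders:

    // Sketch: route crawler requests through a Prerender server so they
    // always receive freshly rendered HTML. Values below are placeholders.
    import express from "express";
    import prerender from "prerender-node";

    const app = express();

    // Self-hosted Prerender server...
    app.use(prerender.set("prerenderServiceUrl", "http://localhost:3000/"));
    // ...or the hosted Prerender.io service instead:
    // app.use(prerender.set("prerenderToken", "YOUR_TOKEN"));

    // ...static files and API routes for the Angular app go here...

    app.listen(8080);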

A crawler that builds the link tree from a single website

I want to know if there are any existing solutions for a crawler that will parse only the links and pages from a given website and will output:
1. The link tree
2. The pages (where necessary)
Thanks!
You don't need any particular framework for this task. What languages do you know? If you know Java, you can use the HttpClient or HttpUnit libraries to help with your crawling tasks.
If you are a Python user, there is a great framework called Scrapy (http://scrapy.org/). You should check it out.
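If it helps to see the shape of the task, here is a small TypeScript sketch (not using the libraries named above) that recursively follows same-site links to a fixed depth and records them as a tree; it assumes Node 18+ for the global fetch and deliberately skips robots.txt, politeness delays, and error handling:

    // Sketch of a single-site crawler that records a link tree.
    // Assumes Node 18+ (global fetch); a regex stands in for an HTML parser.
    type LinkTree = { url: string; children: LinkTree[] };

    async function crawl(url: string, origin: string, seen: Set<string>, depth: number): Promise<LinkTree> {
      const node: LinkTree = { url, children: [] };
      if (depth === 0) return node;
      const html = await (await fetch(url)).text();
      for (const match of html.matchAll(/href="([^"]+)"/g)) {
        const link = new URL(match[1], url);
        link.hash = "";                                // ignore fragments
        const next = link.toString();
        if (next.startsWith(origin) && !seen.has(next)) {
          seen.add(next);                              // visit each page only once
          node.children.push(await crawl(next, origin, seen, depth - 1));
        }
      }
      return node;
    }

    // Example usage (site and depth are placeholders):
    const start = "https://example.com/";
    crawl(start, "https://example.com", new Set([start]), 2)
      .then(tree => console.log(JSON.stringify(tree, null, 2)))
      .catch(console.error);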

Sitemap.xml or Sitemap.html

I am working on a web application. I want to submit the site to web crawlers using a sitemap file.
There are several ways to do so:
Use sitemap.xml
Use sitemap.html
Use urllist.txt
Use compressed sitemap files
All we need to do is add one of these files to the root directory of the application.
My question is: which of these options is best to use?
I'd go with an XML sitemap as defined at http://sitemaps.org/
HTML sitemaps are geared more towards user navigation, and urllist.txt seems to be an old method of providing links to Yahoo.
XML sitemaps in the format defined on the above site were created by Google, Yahoo! and Microsoft and are recognised by all three.
The others won't do any harm, but I believe the biggest benefit will come from an XML sitemap.
As for compression, that's up to you; if you want to conserve bandwidth then gzip it, but keep in mind it must be no larger than 10 MB when uncompressed.
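For what it's worth, a minimal urlset sitemap in the sitemaps.org format looks like this (the URL and values are placeholders; only the loc element is required per entry):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2023-01-01</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>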
