Search engine components - search

I'm a middle school student learning computer programming, and I just have some questions about search engines like Google and Yahoo.
As far as I know, these search engines consist of:
Search algorithm & code
(Example: search.py file that accepts search query from the web interface and returns the search results)
Web interface for querying and showing result
Web crawler
What I am confused about is the Web crawler part.
Do Google's and Yahoo's Web crawlers immediately search through every single webpage existing on WWW? Or do they:
First download all the existing webpages on WWW, save them on their huge server, and then search through these saved pages??
If the latter is the case, then wouldn't the search results appearing on the google search results be outdated, Since I suppose searching through all the webpages on WWW will take tremendous amount of time??
PS. One more question: how exactly does a web crawler retrieve all the web pages existing on the WWW? For example, does it search through all the possible web addresses, like www.a.com, www.b.com, www.c.com, and so on? (Although I know this can't be true.)
Or is there some way to get access to all the existing webpages on the World Wide Web? (Sorry for asking such a silly question.)
Thanks!!

The crawlers search through pages, download them, and save (parts of) them for later processing. So yes, you are right that the results search engines return can easily be outdated, and a couple of years ago they really were quite outdated. Only relatively recently did Google and others start doing more real-time searching, by collaborating with large content providers (such as Twitter) to get data from them directly and frequently, but they took real-time search offline again in July 2011. Otherwise they, for example, take note of how often a web page changes, so they know which pages to crawl more often than others. They also have special systems for this, such as the Caffeine web indexing system. See also their blog post Giving you fresher, more recent search results.
So what happens is:
Crawlers retrieve pages
Backend servers process them:
Parse text, tokenize it, and index it for full-text search
Extract links
Extract metadata such as schema.org markup for rich snippets
Later they do additional computation based on the extracted data, such as:
PageRank computation
In parallel they can be doing lots of other stuff, such as:
Entity extraction for Knowledge Graph information
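The "tokenize and index" step above is the heart of full-text search. As a toy sketch (nothing like Google's real implementation, just the basic idea of an inverted index):

```python
# Toy inverted index: maps each token to the set of document ids containing it.
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """docs: dict of doc_id -> text. Returns token -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query token (AND search)."""
    sets = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "Web crawlers retrieve pages",
    2: "Backend servers process pages",
    3: "Crawlers follow links",
}
index = build_index(docs)
print(search(index, "crawlers pages"))  # -> {1}
```

A real engine stores positions and term frequencies too (for phrase queries and ranking), but the lookup-and-intersect idea is the same.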
Discovering what pages to crawl happens simply by starting with a page, following its links to other pages, following their links, and so on. In addition to that, they have other ways of learning about new web sites: for example, if people use Google's public DNS server, Google learns about the pages they visit, and the same goes for links shared on G+, Twitter, etc.
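The link-following part can be sketched with Python's standard library. Extracting the links from a fetched page is the core step; the fetch-and-queue loop around it is omitted, and the URLs below are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

page = '<a href="/about">About</a> <a href="http://other.example/">Other</a>'
print(extract_links(page, "http://example.com/index.html"))
# -> ['http://example.com/about', 'http://other.example/']
```

A real crawler feeds newly discovered links back into a frontier queue of pages still to visit, skipping URLs it has already seen.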
There is no way of knowing what all the existing web pages are. Some may not be linked from anywhere, and if no one publicly shares a link to them (or uses Google's DNS, etc.), the search engines have no way of knowing those pages exist. Then there's the problem of the Deep Web. Hope this helps.
Crawling is not an easy task (for example, Yahoo now outsources crawling to Microsoft's Bing). You can read more about it in Page's and Brin's own paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
More details about storage, architecture, etc. can be found, for example, on the High Scalability website: http://highscalability.com/google-architecture

Related

How do I track keywords entered in Google Search using Google Tag Manager?

I want to know what keywords bring users to our website. The result should be such that, every time a user clicks a link to the company's website, the page URL, timestamp, and keywords entered in the search are recorded.
I'm not really much of a coder, but I do understand the basics of Google Tag Manager. So I'd appreciate some solutions that can allow me to implement this in GTM's interface itself.
Thanks!
You don't track them. Well, that is unless you can deploy your GTM container on Google's search result pages, which you're extremely unlikely to be able to do.
HTTPS prevents the search query parameters from being passed along in the referrer, which is the core reason for this.
You can still, technically, track Google search keywords for the extremely rare users who manage to reach Google over plain HTTP, but even then there's no need to do anything in GTM: GA will track them automatically with its legacy keyword tracking.
Finally, you can use Google Search Console, where Google reports what keywords were used to get to your site. That information, however, is so heavily sampled that it's just not joinable to any of the GA data. It is possible to link GSC with GA, but that only gives GA a separate report fed from GSC, and that's it. No real data joins.

Why must a web crawler have properties like robustness, politeness, etc.?

Why must a web crawler have robustness, politeness, scalability, quality, freshness, and extensibility?
Robustness: a web crawler must be robust to changes in web site contents. Web search needs to retrieve and index every new web page as soon as possible. If a website has just come online, the crawler needs time to work through all the front nodes in the frontier queue before it gets to this new site. To tackle this, web crawlers are distributed systems whose nodes index different web pages with different specializations.
Politeness: a web crawler must respect every web server's policies about crawling and re-indexing its pages. If a web server asks the crawler not to crawl a page aggressively, the crawler can put that page into a priority queue and re-index it only when it reaches the top of the queue.
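In practice, politeness starts with honoring a site's robots.txt (including any Crawl-delay it specifies). Python's standard library can parse these rules; the robots.txt content below is just a made-up example:

```python
import urllib.robotparser

# A sample robots.txt a web server might publish.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.com/public/page.html"))   # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/page.html"))  # False
print(rp.crawl_delay("MyCrawler"))  # 2: wait at least 2 seconds between requests
```

A polite crawler checks can_fetch before every request and sleeps for the crawl delay between requests to the same host.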
Scalability: new webpages are added to the internet every day, and a web crawler must index every page as soon as possible. For this it needs fault tolerance, distributed systems, extra machines, etc. If a node in the crawler fails, other nodes can divide up its work and index its share of the web pages.
Quality: a web search engine's ability to get useful web pages to every user. If a page contains content far from the user's recent searches or interests, the search engine must use previous user experience to predict what kind of content the user might like.
Freshness: a web crawler's ability to fetch and index fresh copies of each page. For example, news websites are updated every second and need to be re-indexed urgently. For this, web crawlers keep a separate priority queue for such high-priority content, so those pages can be re-indexed within a short period of time.
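The re-crawl scheduling described above can be modeled as a priority queue keyed on each page's next due time, so frequently changing pages bubble up sooner. A hypothetical sketch with made-up URLs and intervals:

```python
import heapq

# Each entry: (next_due_time, url, revisit_interval_in_seconds).
# Pages that change often get a small interval, so they come due sooner.
frontier = []
heapq.heappush(frontier, (10, "http://news.example/", 10))        # every 10 s
heapq.heappush(frontier, (3600, "http://static.example/", 3600))  # every hour

now = 0
crawled = []
while frontier and len(crawled) < 3:
    due, url, interval = heapq.heappop(frontier)
    now = max(now, due)            # pretend time jumps to the due moment
    crawled.append(url)            # a real crawler would fetch and re-index here
    heapq.heappush(frontier, (now + interval, url, interval))  # reschedule

print(crawled)
# -> ['http://news.example/', 'http://news.example/', 'http://news.example/']
```

The fast-changing news page is re-crawled three times before the static page ever comes due, which is exactly the freshness behavior the answer describes.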
Extensibility: over time, new data formats, languages, and protocols are introduced. A web crawler's ability to cope with new and unseen data formats and protocols is called extensibility. This suggests that a crawler's architecture must be modular, so that changes in one module don't affect the others. If a website contains a data format unknown to the crawler, the crawler can still fetch the data, but it requires human intervention to add the format's details to the crawler's indexing module.

Downloading all web hosts on the Internet

I'm working on a project where we work with a distributed crawler to crawl and download hosts found with web content on them. We have a few million hosts at this point, but we're realizing it's not the least expensive thing in the world. Crawling takes time and computing power, etc. etc. So instead of doing this ourselves, we're looking into if we can leverage an outside service to get URLs.
My question is, are there services out there that provide massive lists of web hosts and/or massive lists of constantly updated URLs (which we can then parse to get hosts)? Stuff I've already looked into:
1) Search engine APIs - typically all of these search engine APIs will (understandably) not just let you download their entire index.
2) DMOZ and Alexa top 1 million - These don't have nearly enough sites for what we are looking to do, though they're a good start for seed lists.
Anyone have any leads? How would you solve the problem?
Maybe CommonCrawl helps. http://commoncrawl.org/
Common Crawl is a huge open database of crawled websites.
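Common Crawl also exposes its index over a simple HTTP API at index.commoncrawl.org, returning one JSON object per line. A sketch of building a query URL and parsing a result line (the crawl id and the sample record below are illustrative, not real data):

```python
import json
from urllib.parse import urlencode

def cc_index_query_url(crawl_id, domain):
    """Build a Common Crawl CDX index query for all captures under a domain."""
    params = urlencode({"url": domain + "/*", "output": "json"})
    return "https://index.commoncrawl.org/%s-index?%s" % (crawl_id, params)

def parse_cdx_line(line):
    """Each response line is a JSON object describing one captured URL."""
    record = json.loads(line)
    return record["url"], record.get("status")

url = cc_index_query_url("CC-MAIN-2023-50", "example.com")
print(url)

# A made-up sample line in the API's general format:
sample = '{"url": "http://example.com/", "status": "200", "mime": "text/html"}'
print(parse_cdx_line(sample))  # -> ('http://example.com/', '200')
```

The available crawl ids (like the CC-MAIN-2023-50 placeholder above) are listed on the Common Crawl site; you'd iterate over the response lines and collect the hosts from each record's URL.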

What are the pros & cons of using Google CSE vs implementing dedicated search engine for a site like stackoverflow?

I understand that the same work should not be repeated when Google CSE is already there, so what might be the reasons to consider implementing a dedicated search engine for a public-facing website similar to SO (and why did Stack Overflow probably do that)? The paid version of CSE (Google Site Search) already eliminates several of the drawbacks that forced dedicated implementations. Cost may be one reason not to choose Google CSE, but what are the other reasons?
Another thing I want to ask: my site is of a similar kind to Stack Overflow, so when Google indexes its content every now and then, won't that overload my database servers with lots of queries, perhaps at peak traffic times?
I'm looking forward to using the Google Custom Search API, but I need to clarify whether the 1000 paid queries I get for $5 are valid only for one day, or whether they carry over as extra queries (beyond the free ones) to the next day, and so on. Can anyone clarify this too?
This depends on the content of your site, the frequency of the updates, and the kind of search you want to provide.
For example, with Stack Overflow, there'd probably be no way to search for the questions of an individual user through Google, but it can easily be done with an internal search engine.
Similarly, Google can deprecate its API at any time; in fact, if past experience is any indication, it has already done so with the Google Web Search API, leaving many non-profits whose projects were built on that API stranded with no Google option for continuing their services (paying 100 USD/year for only 20,000 search queries per year may be fine for a posh blog, but it greatly limits what you can actually use the search API for).
On the other hand, you probably already want to have Google index all of your pages, to get the organic search traffic, so Google CSE would probably use rather minimal resources of your server, compared to having a complete in-house search engine.
Now that Google Site Search is gone, the best search tool alternative for all the loyal Google fans is Google Custom Search (CSE).
Some of the features of Google Custom Search that I loved the most were:
It's free (with ads)
Ability to monetize those ads with your AdSense account
Tons of customization options, including removing the Google branding
Ability to link it with your Google Analytics account for a highly comprehensive analytical report
A powerful auto-correct feature that understands the real intention behind typos
Cons: it lacks customer support.
Read More: https://www.techrbun.com/2019/05/google-custom-search-features.html

What are the benefits of having an updated sitemap.xml?

The text below is from sitemaps.org. What are the benefits of doing that versus just letting the crawler do its job?
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Edit 1: I am hoping to find enough benefits to justify the development of that feature. At the moment our system does not generate sitemaps dynamically, so we have to create one with a crawler, which is not a very good process.
Crawlers are "lazy" too, so if you give them a sitemap with all your site URLs in it, they are more likely to index more pages on your site.
They also give you the ability to prioritize your pages, so the crawlers know how frequently they change and which ones are more important to keep updated. That way they don't waste their time crawling pages that haven't changed or indexing pages you don't care much about (while missing the ones you do).
There are also lots of automated tools online that you can use to crawl your entire site and generate a sitemap. If your site isn't too big (less than a few thousand URLs), those will work great.
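If your system already knows its own URLs, generating a basic sitemap yourself is straightforward, for example with Python's standard library (the URLs, dates, and values below are made up):

```python
import xml.etree.ElementTree as ET

def build_sitemap(entries):
    """entries: list of (loc, lastmod, changefreq, priority) tuples."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod, changefreq, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")

sitemap = build_sitemap([
    ("http://example.com/", "2012-01-01", "daily", "1.0"),
    ("http://example.com/faq", "2011-06-15", "monthly", "0.5"),
])
print(sitemap)
```

Regenerating this from your database on each request (or on a schedule) avoids the crawl-your-own-site workaround entirely.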
Well, like that paragraph says, sitemaps also provide metadata about a given URL that a crawler may not be able to extrapolate purely by crawling. The sitemap acts as a table of contents for the crawler so that it can prioritize content and index what matters.
The sitemap helps tell the crawler which pages are more important, and also how often they can be expected to be updated. This is information that really can't be found out by just scanning the pages themselves.
Crawlers have a limit on how many pages of your site they scan, and how many levels deep they follow links. If you have a lot of less relevant pages, many different URLs pointing to the same page, or pages that take many steps to reach, the crawler will stop before it gets to the most interesting pages. The sitemap offers an alternative way to easily find the most interesting pages, without having to follow links and sort out duplicates.
