What information is stored by the Google crawler?

.. and how does the web crawler infer the semantics of the information on a website?
Please list the ranking signals in a separate answer.

From http://www.google.com/corporate/tech.html:
Hypertext-Matching Analysis: Our search engine also analyzes page content. However, instead of simply scanning for page-based text (which can be manipulated by site publishers through meta-tags), our technology analyzes the full content of a page and factors in fonts, subdivisions and the precise location of each word. We also analyze the content of neighboring web pages to ensure the results returned are the most relevant to a user's query.
Beyond that, your guess is as good as mine.

Semantic Closeness as a Ranking Signal
Site Traffic, # visitor, trends
Ranking factors - http://www.vaughns-1-pagers.com/internet/google-ranking-factors.htm

I guess no one really knows; that is a trade secret :)


Measurements to evaluate a web search engine

I'm currently developing a small web search engine, but I'm not sure how I am going to evaluate it. I understand that a search engine can be evaluated by its precision and recall. In a more "localized" information retrieval system, e.g., an e-library, I can calculate both of them because I can know which items are relevant to my query. But in a web-based information retrieval system, e.g., Google, it would be impossible to calculate recall because I do not know how many web pages are relevant. This means that the F-measure and other metrics that require the number of relevant pages cannot be computed.
Is everything I wrote correct? Is web search engine evaluation limited to precision only? Are there any other measurements I could use to evaluate a web search engine (other than P@k)?
You're correct that precision and recall, along with the F score / F measure are commonly-used metrics for evaluating (unranked) retrieval sets in search engine performance.
And you're also correct that determining recall and precision scores is difficult or impossible for a huge corpus such as all the web pages on the entire internet. For any search engine, small or large, I would argue that it's important to consider the role of human interaction in information retrieval: are the users of the search engine more interested in a (ranked) list of relevant results that answers their information need, or would one "top" relevant result be enough to satisfy them? Check out the concept of "satisficing" as it pertains to information seeking for more on how users assess when their information needs are met.
Whether you use precision, recall, mean-average precision, mean reciprocal rank, or any other of the numerous relevance and retrieval metrics it really depends on what you're trying to assess with regard to the quality of your search engine's results. I'd first try to figure out what sort of 'information needs' the users of my small search engine might have: will they be looking for a selection of relevant documents or would it be more helpful for their query needs if they had one 'best' document to satisfy their information needs? If you can better understand how your users will be using your small search engine you can then use that information to help inform which relevance model(s) will give your users results that they deem to be most useful for their information-seeking needs.
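A quick sketch of how the rank-aware metrics mentioned above are computed, in plain Python with a hypothetical ranked result list and a human-judged relevant set:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant (P@k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result; 0 if none retrieved.
    Averaged over queries this gives mean reciprocal rank (MRR)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of P@k taken at each rank k where a relevant doc appears.
    Averaged over queries this gives mean average precision (MAP)."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2"]   # the engine's ranking for one query
relevant = {"d1", "d2"}             # judged relevant set for that query
print(precision_at_k(ranked, relevant, 2))   # 0.5
print(reciprocal_rank(ranked, relevant))     # 0.5
print(average_precision(ranked, relevant))   # 0.5
```

Note that all three still need relevance judgments; for web-scale evaluation these are typically gathered for a sampled pool of results rather than the whole corpus, which is exactly why recall stays out of reach.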
You might be interested in the free, online version of the Manning and Schütze "Introduction to Information Retrieval" text available from Stanford's NLP department which covers relevance and retrieval models, scoring and much more.
Google's Search Quality Evaluator training guide, which lists a few hundred dimensions along which Google's search results are rated, might be of interest to you too as you try to suss out your users' information-seeking goals. It's pretty neat to see all of the various factors that go into assessing a web page's quality!

How can schema.org help in NLP?

I mainly work on NLP, collecting interest-based data from web pages.
I came across http://schema.org/ as being helpful for NLP work.
I went through the documentation, from which I can see it adds additional tag properties to identify HTML tag content.
It may help search engines get specific data matching a user query.
It says: Schema.org provides a collection of shared vocabularies webmasters can use to mark up their pages in ways that can be understood by the major search engines: Google, Microsoft, Yandex and Yahoo!
But I don't understand how it can help me as an NLP person. Generally I parse web page content to process it and extract data from it. Schema.org may help there, but I don't know how to utilize it.
Any example or guidance would be appreciated.
Schema.org uses the microdata format for representation. People use microdata for text analytics and for extracting curated content. There are numerous applications.
Suppose you want to create a news summarization system: you can use the hNews microformat to extract the most relevant content and perform summarization on it.
If you have a review-based search engine where you want to list products with the most positive reviews, you can use the hReview microformat to extract the reviews, then perform sentiment analysis on them to identify whether a product has negative or positive reviews.
If you want to create a skill-based resume classifier, extract content marked up with the hResume microformat, which gives you details such as contact information (using the hCard microformat), experience, achievements, education, skills/qualifications, affiliations and publications. You can then run a classifier on it to classify CVs by particular skill sets.
Though schema.org does not help NLP people directly, it provides a foundation for doing text processing in a better way.
Check out http://en.wikipedia.org/wiki/Microformat#Specific_microformats to see the various microformats; the same page gives more details.
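To make this concrete, here is a minimal sketch of pulling schema.org microdata out of HTML using only Python's standard library. The Review snippet and its property names are just an illustration, and the parser deliberately ignores nesting and the full microdata model, which a dedicated microdata extraction library would handle properly:

```python
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    """Collects (itemprop, text) pairs from schema.org microdata markup."""
    def __init__(self):
        super().__init__()
        self._prop = None   # itemprop name of the tag we are inside, if any
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._prop and data.strip():
            self.items.append((self._prop, data.strip()))
            self._prop = None

page = """
<div itemscope itemtype="http://schema.org/Review">
  <span itemprop="itemReviewed">Acme Phone</span>
  <span itemprop="reviewRating">4</span>
  <p itemprop="reviewBody">Great battery life.</p>
</div>
"""

parser = MicrodataParser()
parser.feed(page)
print(parser.items)
# [('itemReviewed', 'Acme Phone'), ('reviewRating', '4'),
#  ('reviewBody', 'Great battery life.')]
```

The point for NLP work: instead of guessing which span of text is the review body, the markup hands it to you labeled, so the downstream sentiment or summarization model gets clean, typed input.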
Schema.org is something like a vocabulary or ontology for annotating data, here specifically web pages.
Extracting microdata from web pages is a good idea, but is it really used by web developers? I don't think so; I suspect the majority of microdata is consumed by companies such as Google or Yahoo.
In the end you can find data, but not a lot, and mainly on specific types of websites.
What do you want to extract, and for what type of application? You could probably use another source of data such as DBpedia or Freebase instead.
GoodRelations also supports schema.org. You can annotate your content on the fly from the front end based on the various domain contexts defined, so schema.org is very useful for NLP extraction. One can even use it for HATEOAS services, for hypermedia link relations. Metadata (data about data) for any context is good for content and data in general. Alternatives include microformats, RDFa, RDFa Lite, etc. The more context you have the better: it turns your data into smart content and helps crawler bots understand it. It also leads further into the web of data and helps with global queries over resource domains. In the long run such approaches will help toward domain adaptation of agents for transfer learning on the web, pretty much making the web of pages an externalized unit of a massive commonsense knowledge base. They also help advertising agencies understand publisher sites and better contextualize ad retargeting.

Sort domains by number of public web pages?

I'd like a list of the top 100,000 domain names sorted by the number of distinct, public web pages.
The list could look something like this:
Domain Name 100,000,000 pages
Domain Name 99,000,000 pages
Domain Name 98,000,000 pages
...
I don't want to know which domains are the most popular. I want to know which domains have the highest number of distinct, publicly accessible web pages.
I wasn't able to find such a list in Google. I assume Quantcast, Google or Alexa would know, but have they published such a list?
For a given domain, e.g. yahoo.com you can google-search site:yahoo.com; at the top of the results it says "About 141,000,000 results (0.41 seconds)". This includes subdomains like www.yahoo.com, and it.yahoo.com.
Note also that some websites generate pages on the fly, so they might in fact have infinitely many "pages": each page is computed when requested, forgotten as soon as it is sent, and can contain a link to the next one. Since so many websites compose their pages this way, there is no meaningful total count, and you can't find out without requesting them all.
Keep in mind a few things:
Many websites generate pages dynamically, leaving a potentially infinite number of pages.
Pages are often behind security barriers.
Very few companies are interested in announcing how much information they maintain.
Indexes go out of date as they're created.
What I would be inclined to do for specific answers is mirror the sites of interest using wget and count the pages.
wget -m --wait=9 --limit-rate=10K http://domain.test
Keep it slow, so that the company doesn't mistake you for a denial-of-service attack.
Most search engines will allow you to search their index by site, as well, though the information on result pages might be confusing for more than a rough order of magnitude and there's no way to know how much they've indexed.
At a glance I don't see where they keep the database or how to access it, but down the search-engine path you might also be interested in the Seeks and YaCy search engine projects.
The only organization I can think of that might (a) have the information easily available and (b) be friendly and transparent enough to want to share it would be the folks at The Internet Archive. Since they've been archiving the web with their Wayback Machine for a long time and are big on transparency, they might be a reasonable starting point.

Developing a crawler and scraper for a vertical search engine

I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites of a specific category. I guess for this I need a crawler that crawls several sites (a few hundred, in a specific business category) and extracts the content and URLs of products and services; other types of pages may be irrelevant. Most of the sites are tiny or small (a few hundred pages at most). The products have 10 to 30 attributes.
Any ideas on how to write such a crawler and extractor? I have written a few crawlers and content extractors using the usual Ruby libraries, but never a full-fledged search engine. I imagine the crawler wakes up from time to time and downloads the pages from the websites, following the usual polite behavior such as checking robots-exclusion rules, while the content extractor updates the database after it reads the pages. How do I synchronize the crawler and extractor? How tightly should they be integrated?
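On the synchronization question: a common answer is to keep the crawler and extractor loosely coupled through a bounded queue, i.e. the producer/consumer pattern. A minimal sketch (in Python rather than Ruby; fetch, extract and save are hypothetical stand-ins for your real HTTP client, attribute extractor and database layer):

```python
import queue
import threading

page_queue = queue.Queue(maxsize=100)  # bounded: crawler blocks if extractor lags
STOP = object()                        # sentinel to shut the extractor down

def crawler(urls, fetch):
    """Producer: downloads pages and hands them to the extractor."""
    for url in urls:
        page_queue.put((url, fetch(url)))  # blocks when the queue is full
    page_queue.put(STOP)

def extractor(extract, save):
    """Consumer: parses queued pages and updates the database."""
    while True:
        item = page_queue.get()
        if item is STOP:
            break
        url, html = item
        save(url, extract(html))

# Demo with toy stand-ins for the HTTP client, parser and DB layer.
saved = []
worker = threading.Thread(target=extractor,
                          args=(str.upper, lambda u, d: saved.append((u, d))))
worker.start()
crawler(["http://a.test", "http://b.test"], lambda u: "page of " + u)
worker.join()
```

The bounded queue answers the "how tightly integrated" question too: neither side knows the other's internals, the crawler just can't run arbitrarily far ahead of the extractor, and you can later move the queue out of process (e.g. to a message broker) without changing either component.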
Nutch builds on Lucene and already implements a crawler and several document parsers.
You can also hook it to Hadoop for scalability.
In the enterprise-search context that I am used to working in,
crawlers,
content extractors,
search engine indexes (and the loading of your content into these indexes),
being able to query that data efficiently and with a wide range of search operators,
programmatic interfaces to all of these layers,
optionally, user-facing GUIs
are all separate topics.
(For example, while extracting useful information from an HTML page vs. a PDF vs. an MS Word file is conceptually similar, the actual programming for each of these tasks is still very much a work in progress for any general solution.)
You might want to look at the Lucene suite of open-source tools, understand how they fit together, and possibly decide that it would be better to learn how to use those tools (or similar ones) than to reinvent a very big, complicated wheel.
I believe in books, so thanks to your query I have discovered this one and have just ordered it. It looks like a good take on one possible solution to the search-tool conundrum.
http://www.amazon.com/Building-Search-Applications-Lucene-LingPipe/product-reviews/0615204252/ref=cm_cr_pr_hist_5?ie=UTF8&showViewpoints=0&filterBy=addFiveStar
Good luck and let us know what you find out and the approach you decide to take.

Does PageRank mean anything?

Is it a measure of anything that a developer or even a manager can look at and derive meaning from? I know at one time it was all about PageRank 7, 8, 9, and 10. But is it still a valid measure of anything? If so, what can you learn from a PageRank?
Note that I'm assuming that you have other measurements that you can analyze.
PageRank is specific to Google and is a trademarked proprietary algorithm.
There are many variables in the formulas used by Google, but PageRank is primarily affected by the number of links pointing to the page, the number of internal links pointing to the page within the site and the number of pages in the site.
One thing you must consider: it's specific to a web page, not to a web site, so you need to optimize every page.
Google sends Googlebot, its indexing robot, to spider your website; the bot is instructed not to crawl your site too deeply unless the site has a reasonable amount of PR (PageRank).
In my experience, PageRank is an indicator of how many sites have recently linked to your site, but it is not necessarily connected to your position on Google, for example.
There were times where we increased our marketing and other sites linked to us, and the pagerank rose a bit.
I think the factors behind any SERP position change too much to put all your faith in one. PageRank was very important, and still is to some degree, but how much is a question I can't answer.
Every link you send out on a page passes some of that page's PageRank to where the link points. The more links, the less PageRank passed on through each. Use rel="nofollow" on your links to channel PageRank flow in a more controlled manner.
The PageRank algorithm computes the probability distribution representing the likelihood that a person randomly clicking on links will arrive at any particular page. It is a relatively good approximation of a web page's importance.
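That random-surfer model can be sketched as a short power iteration, here with the standard 0.85 damping factor. The link graph is a toy example; `links` maps each page to its outgoing links, and dangling pages (no outlinks) spread their rank evenly:

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank. `links` maps every page to its outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}  # random-jump term
        for p, outs in links.items():
            targets = outs if outs else pages    # dangling page: spread evenly
            share = d * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank

# Three pages: both b and c link to a, so a ends up most "important".
ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

The ranks always sum to 1 (it is a probability distribution), and a page's score grows with the number and the rank of the pages linking to it, which matches the "links pass on PageRank" behavior described in the answers above.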

Resources