Why do web developers still use meta-keywords and meta-description tags?

Google ignores the meta-keywords tag entirely, because keywords were mostly used to spam search engines.
Google also does not use the meta-description tag for ranking. The meta-description is sometimes used for the site snippet in search results when a suitable excerpt cannot be taken from the content, but usually the snippet is generated automatically from the page content, so it matches the beginning of the page anyway.
Google has dropped support for the meta-keywords and meta-description tags in search ranking, and Google handles about 92% of all search queries in the world. So web developers could now stop using the meta-keywords and meta-description tags, because spending time on them is not worth it.
Is there any real benefit for using meta-keywords and meta-description tags?
Links:
Google Webmasters Blog about meta tags support by Google;
Video with Matt Cutts about meta tags support by Google;
StatCounter search engine usage statistics - Google handles about 92% of all search queries in the world;

No, we should carry on using meta tags, because we aren't, and shouldn't be, just supporting Google. The approach should be: make documents as indexable as possible using a search-engine-agnostic approach, and then add special handling for one or two top engines, such as using Google's online tools to improve search ranking.
Google are very dominant in search at present, but there's no guarantee they will always be on top. Maybe it will be Facebook in the future, or perhaps Yahoo/Bing etc. will dramatically improve search quality, and people will switch back.
Side note: for search, I really like DuckDuckGo at the moment. Lots of nice search shortcuts (see bang operators) and a meaningful privacy policy.

We should use them because they are there. Who knows - perhaps they will be useful again in the future?
When the W3C drop them we can stop using them.
Just my opinion ofc...

keywords:
Google is not the only search engine. Google market share depends on many factors (country, age, technical know-how, …). Small percentages of big numbers are also big numbers.
There are special purpose search engines (for niches; only crawling hand-selected sites; etc.) that might still consider it.
Local search engines might use it. (Local) full text search engines anyway.
Some CMS use it for site search.
There are other consuming user-agents than search engines, e.g. parser/extractor.
description:
it can be useful even for Google: when someone searches only for the title or domain of your site, Google often displays snippets like "Login / Register … back to top … please insert CAPTCHA …". If a description is provided, it can be used instead.
(The points mentioned under keywords are relevant for description, too.)

If Google SEO is your only concern, then meta keywords are a complete waste of time, but if you're targeting other search engines they may be worth investigating.
I believe Baidu still uses meta keywords, and that search engine is the dominant player in the Chinese market, so it'd be worth adding meta keywords if you want your site to be popular in China.
Regardless, I wouldn't stuff in excessive numbers of irrelevant keywords, as there is every chance that whatever search engine you're targeting will penalise you. Five to seven words summarising your page content is a good starting point.

Related

Web Crawling with seed URLs from search engine

I need to know whether it is worth building a crawler on top of the results given by a search engine.
That is: for a given query, grab N URLs from a search engine and feed them into a crawler as seeds to find more pages relevant to the search. Is there any scientific paper or experiment showing that doing this gathers more relevant pages than only taking URLs from the search engine?
If I understood correctly, you would be rebuilding the search engine, because its job is already to return the most related/relevant results first. And although you did not mention your search engine directly (I guess it is Google), I would suggest you use the advanced search options before trying anything else. Google provides an API for performing searches, which you can use in your system. If this approach does not fit, it is possible to crawl over Google's results and even perform custom searches (for example, filtering results by site, term, etc.), but Google would not be happy with this and would eventually block your calls. I suggest you first give its open API a try.
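To illustrate the question's idea of seeding a crawler from search-engine results, here is a minimal sketch of one crawl hop from a set of seed URLs. The page contents and the `fetch` function are stand-ins for illustration; a real crawler would perform HTTP requests and honour robots.txt.

```python
# Sketch: expand a set of seed URLs (e.g. the top-N results returned by a
# search engine) by following their outgoing links. fetch() is stubbed here;
# in a real crawler it would do an HTTP GET and respect robots.txt.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href> tags of one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def expand_seeds(seeds, fetch, max_pages=50):
    """Breadth-first crawl outward from the seeds, de-duplicating URLs."""
    seen, frontier, collected = set(seeds), list(seeds), []
    while frontier and len(collected) < max_pages:
        url = frontier.pop(0)
        html = fetch(url)
        if html is None:          # unreachable page: skip it
            continue
        collected.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return collected

# Hypothetical two-page site for demonstration.
PAGES = {
    "http://example.com/a": '<a href="/b">b</a>',
    "http://example.com/b": '<a href="/a">a</a>',
}
crawled = expand_seeds(["http://example.com/a"], PAGES.get)
```

Whether the extra hop actually improves relevance over the raw search results is exactly the empirical question being asked; the sketch only shows the mechanics.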

Search feature on website

I am interested in implementing a search feature on a website. It is a location search, so address/state/zip should all work. It should then show results in that area and allow them to be filtered.
My question is:
What's the best approach for something like this?
There are literally dozens of ways of doing this (if not more). The exact implementation will depend on the technology stack you use, but as a very high-level overview:
you'd need to store the things you are searching for somewhere, and tag them with a lat/long location. Often, this would be in a database of some kind.
using a programming language, you would need to write a search that accepts a postcode, translates that to a lat/long and then searches the things in your database based on the distance between the location of the thing, and the location entered in the search.
if you want to support filtering, your search would need to support that too. This is often called "faceting" the search.
Working out the lat/long locations will need to be done using a geolocation service; some, such as PostCode Anywhere, do this as a paid service, and others are free (within reason), such as the Google Maps APIs.
There are probably some hosted services that will do what you want, you'd have to shop around.
Examples of search software that supports geolocation searching out of the box are things like Solr, Azure Search, Lucene and Elastic.
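The distance-search step described above can be sketched in a few lines, assuming the "things" are already stored with a lat/long and a category for faceting. The place names, fields, and coordinates below are made up for illustration; a real system would run this inside the database or a search engine such as Solr or Elastic.

```python
# Minimal sketch of a radius search with one facet (category).
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

def search_near(places, lat, lon, radius_km, category=None):
    """Return names of places within radius_km, nearest first,
    optionally filtered ("faceted") by category."""
    hits = []
    for p in places:
        d = haversine_km(lat, lon, p["lat"], p["lon"])
        if d <= radius_km and (category is None or p["category"] == category):
            hits.append((d, p["name"]))
    return [name for _, name in sorted(hits)]

# Hypothetical data: two places in central London, one in Manchester.
places = [
    {"name": "Cafe A", "lat": 51.5074, "lon": -0.1278, "category": "cafe"},
    {"name": "Shop B", "lat": 51.5090, "lon": -0.1340, "category": "shop"},
    {"name": "Cafe C", "lat": 53.4808, "lon": -2.2426, "category": "cafe"},
]
# Search within 5 km of central London, cafes only.
results = search_near(places, 51.5072, -0.1276, 5, category="cafe")
```

In practice the postcode-to-lat/long translation would come from the geolocation service mentioned above, and the linear scan would be replaced by a spatial index.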

What is the name of this search term popularity component?

If you look at SharpCrafters, the front page has a cool component that shows how popular certain search terms are, with larger text for more popular terms. I've seen this around the web in different places, especially blogs. What is this called in general, and what specific implementations exist?
That would be the Tag Cloud.
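The usual tag-cloud sizing rule is simple: map each term's frequency linearly onto a font-size range, so the most popular term gets the largest text. A sketch, with made-up counts and an assumed 12–32 px range:

```python
# Map term counts onto font sizes linearly between min_px and max_px.
def tag_cloud_sizes(counts, min_px=12, max_px=32):
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    return {
        term: round(min_px + (count - lo) / span * (max_px - min_px))
        for term, count in counts.items()
    }

sizes = tag_cloud_sizes({"aop": 40, "caching": 10, "logging": 25})
```

Many implementations use a logarithmic scale instead, so that one extremely popular term does not flatten all the others.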

Why doesn't Google offer partial search? Is it because the index would be too large?

Google/Gmail etc. don't offer partial or prefix search (e.g. stuff*), though it could be very useful. Often I can't find a mail in Gmail because I don't remember the exact expression.
I know there is stemming and such, but it's not the same, especially if we talk about languages other than English.
Why doesn't Google add such a feature? Is it because the index would explode? But databases offer partial search, so surely there are good algorithms to tackle this problem.
What is the problem here?
Google doesn't actually store the text that it searches. It stores search terms, links to the page, and where in the page the term exists. That data structure is indexed in the traditional database sense. I'd bet using wildcards would make the index of the index pretty slow and as Developer Art says, not very useful.
Google does search partial words; Gmail does not, though. Since you ask what the problem is, my answer is lack of effort. This problem has a solution that enables searching in time proportional to the pattern length and linear space, though it is not very cache-friendly: suffix trees. Suffix arrays are another option that is more cache-friendly and still time-efficient.
It is possible via the Google Docs - follow this article:
http://www.labnol.org/internet/advanced-gmail-search/21623/
Google Code Search can search based on regular expressions, so they do know how to do it. Of course, the amount of data Code Search has to index is tiny compared to the web search. Using regex or wildcard search in the web search would increase index size and decrease performance to impractical levels.
The secret to finding anything in Google is to enter a combination of search terms (or quoted phrases) that are very likely to be in the content you are looking for, but unlikely to appear together in unrelated content. A wildcard expression does the opposite of this. Just enter the terms you expect the wildcard to match, keeping in mind that Google will do stemming for you. Back in the days when computers ran on steam, Lycos (iirc) had pattern matching, but they turned it off several years ago. I presume it was putting too much load on their servers.
Because you can't sensibly derive what is meant by car*:
Cars?
Carpets?
Carrots?
Google's algorithms compare document texts, as well as external inbound links, to determine what a document is about. With wildcards, all of these algorithms break down.

Developing a crawler and scraper for a vertical search engine

I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites of a specific category. I guess for this I need a crawler that crawls several (a few hundred) sites in a specific business category and extracts the content and URLs of products and services. Other types of pages may be irrelevant. Most of the sites are tiny or small (a few hundred pages at most). The products have 10 to 30 attributes.
Any ideas on how to write such a crawler and extractor? I have written a few crawlers and content extractors using the usual Ruby libraries, but not a full-fledged search engine. I guess the crawler wakes up from time to time and downloads the pages from the websites, following the usual polite behaviour such as checking robots-exclusion rules, while the content extractor updates the database after it reads the pages. How do I synchronize the crawler and the extractor? How tightly should they be integrated?
Nutch builds on Lucene and already implements a crawler and several document parsers.
You can also hook it to Hadoop for scalability.
In the enterprise-search context that I am used to working in,
crawlers,
content extractors,
search engine indexes (and the loading of your content into these indexes),
being able to query that data efficiently and with a wide range of search operators,
programmatic interfaces to all of these layers,
optionally, user-facing GUIs
are all separate topics.
(For example, while extracting useful information from an HTML page vs. a PDF vs. an MS Word file is conceptually similar, the actual programming for these tasks is still very much a work in progress for any general solution.)
You might want to look at the Lucene suite of open-source tools, understand how those fit together, and possibly decide that it would be better to learn how to use those tools (or others like them) than to reinvent this very big, complicated wheel.
I believe in books, so thanks to your query I have discovered this book and have just ordered it. It looks like a good take on one possible solution to the search-tool conundrum.
http://www.amazon.com/Building-Search-Applications-Lucene-LingPipe/product-reviews/0615204252/ref=cm_cr_pr_hist_5?ie=UTF8&showViewpoints=0&filterBy=addFiveStar
Good luck and let us know what you find out and the approach you decide to take.
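On the question of how to synchronize the crawler and the extractor: one common answer is to keep them loosely coupled through a bounded queue, so neither needs to know the other's schedule. A minimal sketch (in Python rather than the question's Ruby, and with `fetch` and `extract` stubbed for illustration):

```python
# Sketch: crawler and extractor as separate threads joined by a queue.
# The crawler enqueues (url, raw_html) pairs; the extractor dequeues,
# parses, and would write to the database. A None sentinel signals "done".
import queue
import threading

def crawler(urls, fetch, page_queue):
    for url in urls:
        page_queue.put((url, fetch(url)))   # blocks if the queue is full
    page_queue.put(None)                    # sentinel: no more pages

def extractor(page_queue, extract, results):
    while True:
        item = page_queue.get()
        if item is None:
            break
        url, html = item
        results[url] = extract(html)        # in practice: update the database

# Hypothetical pages; fetch is just a dict lookup here.
PAGES = {"http://shop.example/p1": "widget; price=10",
         "http://shop.example/p2": "gadget; price=20"}
page_queue = queue.Queue(maxsize=100)  # bounded: crawler waits if extractor lags
results = {}
t1 = threading.Thread(target=crawler,
                      args=(list(PAGES), PAGES.get, page_queue))
t2 = threading.Thread(target=extractor,
                      args=(page_queue, lambda h: h.split(";")[0], results))
t1.start(); t2.start()
t1.join(); t2.join()
```

The bounded queue answers the "how tightly integrated" question too: the two sides share only the queue's item format, so either can be replaced (or scaled out to multiple workers, as Nutch/Hadoop do) without touching the other.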
